HuggingFace Hub Scraper

Pricing

from $3.00 / 1,000 results

Scrape the Hugging Face Hub: search and fetch models, datasets, and Spaces with full metadata, including downloads, likes, license, pipeline tag, library, tags, files, and more. Pure HTTP, no auth required.

Rating

5.0

(6)

Developer

Crawler Bros

Maintained by Community

Actor stats

  • Bookmarked: 6
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 19 hours ago

Scrape the Hugging Face Hub — 1M+ machine-learning models, 200K+ datasets, 400K+ Spaces, and millions of user profiles. Search by query, fetch by repo ID or URL, list trending repos, or pull a user's overview. Pure HTTP via the official public Hub API at huggingface.co/api/*. No auth, no proxy, no cookies.

What this actor does

  • 7 modes: search, byModel, byDataset, bySpace, byUser, trending, byUrl
  • Three entity catalogs: models, datasets, spaces (search & trending pivot on entityType)
  • Filters: pipeline tag, library, license, language, author/org, min downloads, min likes
  • Server-side sort: trending score, downloads, likes, last modified, created at
  • URL auto-detection: paste any huggingface.co/<repo> or /datasets/<id> or /spaces/<id> or /users/<u> URL — the actor figures out the kind
  • Optional ?full=true: include sibling files, cardData, config metadata
  • Empty fields are omitted — every record only contains populated fields
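
The URL auto-detection described above can be sketched in a few lines. This is only an illustration of the idea, not the actor's actual code: the function name `detect_kind` is hypothetical, and it assumes full URLs with a scheme and the four path shapes listed in the bullet.

```python
from typing import Tuple
from urllib.parse import urlparse


def detect_kind(url: str) -> Tuple[str, str]:
    """Return (kind, identifier) for a huggingface.co URL (hypothetical helper)."""
    # Split the path into non-empty segments, e.g. ["datasets", "squad"].
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        raise ValueError(f"no repo or user in URL: {url}")
    prefix = parts[0]
    if prefix in ("datasets", "spaces", "users"):
        if len(parts) < 2:
            raise ValueError(f"missing identifier after /{prefix}/: {url}")
        if prefix == "users":
            return "user", parts[1]
        kind = "dataset" if prefix == "datasets" else "space"
        # Keep at most owner/name; single-segment legacy IDs stay as-is.
        return kind, "/".join(parts[1:3])
    # Anything else is treated as a model repo: owner/name or a legacy single segment.
    return "model", "/".join(parts[:2])
```

Trailing path segments such as `/tree/main` are ignored because only the first one or two identifier segments are kept.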

Output

The actor emits a flat record per repo / user. Fields you might see (omit-empty applies):

Common

  • recordType: model / dataset / space / user
  • repoId: full Hub identifier (e.g. google-bert/bert-base-uncased)
  • owner: author / organization slug
  • sha, createdAt, lastModified
  • downloads, likes, trendingScore
  • license, tags, languages
  • scrapedAt

Model-only

  • modelName, modelType, architectures[]
  • pipelineTag, libraryName
  • trainedOnDatasets[], arxivIds[]
  • maskToken, fileCount, files[]
  • modelUrl

Dataset-only

  • datasetName, description
  • taskCategories[], taskIds[]
  • modalities[], formats[], sizeCategories[]
  • paperswithcodeId, fileCount, files[]
  • datasetUrl

Space-only

  • spaceName, sdk, runtimeStage
  • title, emoji, host, subdomain
  • fileCount, files[]
  • spaceUrl

User-only

  • username, fullName, avatarUrl, isPro
  • numModels, numDatasets, numSpaces
  • numFollowers, numFollowing, numLikes, numUpvotes
  • numPapers, numDiscussions
  • orgs[], profileUrl
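
Because omit-empty applies, downstream code should not assume any optional key is present. The sketch below shows one plausible reading of that rule; `prune_empty` is a hypothetical helper, and treating `0` and `False` as populated values is an assumption, not something the listing states.

```python
def prune_empty(record: dict) -> dict:
    """Drop keys whose value is None, an empty string, or an empty list/dict.

    Assumption: numeric zero and False count as populated and are kept.
    """
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}


def license_of(record: dict) -> str:
    """Read an optional field with a fallback instead of indexing directly."""
    return record.get("license", "unknown")
```

Reading fields with `dict.get` and a default keeps consumers working whether or not a given record carries the field.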

Input

Field | Type | Default | Description
--- | --- | --- | ---
mode | enum | search | One of the 7 modes
entityType | enum | models | models / datasets / spaces (mode=search/trending)
searchQuery | string | bert | Free-text query
repoIds | array | | Repo IDs or URLs (mode=byModel/byDataset/bySpace)
username | string | | Username (mode=byUser)
startUrls | array | | Hub URLs (mode=byUrl); kind auto-detected
pipelineTag | enum | | Filter models by task tag (43 options)
libraryName | enum | | Filter models by library (17 options)
authorFilter | string | | Constrain to org/author slug
license | enum | | License filter (29 options)
language | string | | 2-letter language code
sort | enum | | trendingScore / downloads / likes / lastModified / createdAt
direction | enum | desc | desc or asc
minDownloads | integer | | Drop records below this download count
minLikes | integer | | Drop records below this like count
includeFullDetails | boolean | false | Pass ?full=true for siblings/config
maxItems | integer | 50 | Hard cap (1–10000)
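
As a rough illustration of how the input fields relate to the modes, the following sketch assembles and checks an input object before a run. The helper `build_run_input` and the `REQUIRED_BY_MODE` mapping are hypothetical; the mapping is inferred from the field descriptions above and may not match the actor's own validation.

```python
# Assumption: which field each mode needs, inferred from the input descriptions.
REQUIRED_BY_MODE = {
    "search": [],
    "trending": [],
    "byModel": ["repoIds"],
    "byDataset": ["repoIds"],
    "bySpace": ["repoIds"],
    "byUser": ["username"],
    "byUrl": ["startUrls"],
}


def build_run_input(mode: str = "search", **fields) -> dict:
    """Return an input dict, raising ValueError on obviously invalid combinations."""
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown mode: {mode}")
    missing = [name for name in REQUIRED_BY_MODE[mode] if not fields.get(name)]
    if missing:
        raise ValueError(f"mode={mode} needs: {', '.join(missing)}")
    # maxItems is documented as a hard cap between 1 and 10000 (default 50).
    if not 1 <= fields.get("maxItems", 50) <= 10000:
        raise ValueError("maxItems must be between 1 and 10000")
    # Leave out unset fields so the actor's defaults apply.
    return {"mode": mode, **{k: v for k, v in fields.items() if v is not None}}
```

For example, `build_run_input("byUser", username="julien-c")` yields the same object as the user-profile example in this document.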

Examples

Search top BERT models

{
  "mode": "search",
  "entityType": "models",
  "searchQuery": "bert",
  "sort": "downloads",
  "maxItems": 50
}

Trending text-generation models

{
  "mode": "trending",
  "entityType": "models",
  "pipelineTag": "text-generation",
  "maxItems": 25
}

Lookup a specific dataset

{
  "mode": "byDataset",
  "repoIds": ["rajpurkar/squad_v2"]
}

Lookup by URL (auto-detect)

{
  "mode": "byUrl",
  "startUrls": [
    "https://huggingface.co/google-bert/bert-base-uncased",
    "https://huggingface.co/datasets/squad",
    "https://huggingface.co/spaces/lmarena-ai/chatbot-arena"
  ]
}

User profile

{
  "mode": "byUser",
  "username": "julien-c"
}
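
Any of the inputs above can also be submitted from Python. The sketch below uses the `apify-client` package (`pip install apify-client`); the actor ID is not given in this listing, so `actor_id` is a placeholder you must supply, and `run_actor` and `top_by_downloads` are hypothetical helper names.

```python
from typing import List


def run_actor(token: str, actor_id: str, run_input: dict) -> List[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so the rest of the module works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def top_by_downloads(records: List[dict], n: int = 5) -> List[dict]:
    """Sort records by downloads, treating an omitted field as 0."""
    return sorted(records, key=lambda r: r.get("downloads", 0), reverse=True)[:n]
```

A call would look like `run_actor(token, "<actor-id>", {"mode": "byUser", "username": "julien-c"})`, with the real actor ID taken from the Apify Console.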

Reliability

  • Direct calls to the official huggingface.co/api/* endpoints
  • Exponential backoff retries on 429 and 500–504
  • Page size capped at 100 (the API hard cap); paginated via ?skip=N&limit=N
  • No proxy needed — works from datacenter IPs
  • No cookies / API token required for read access

Limitations

  • Private repos require a user access token; this actor only exposes the public read API.
  • The language and license filters are forwarded to the Hub API and applied server-side. The Hub's filtering is best-effort: some matching repos lack a language:<code> / license:<id> tag (the metadata lives in the model card). If you need strict tag-based filtering, post-filter the dataset on languages[] / license.
  • The ?full=true flag is rate-limited harder by the upstream; expect slower runs when enabled at large maxItems.
  • Single-segment legacy repo IDs (e.g. bert-base-uncased) are auto-resolved by the API to their canonical owner-prefixed form (e.g. google-bert/bert-base-uncased).
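
The strict post-filter suggested above could be written as follows. `strict_filter` is a hypothetical helper; it assumes `languages` is a list and that `license` is either a single string or a list of strings, which this listing does not specify.

```python
from typing import Iterable, List, Optional


def strict_filter(records: Iterable[dict],
                  language: Optional[str] = None,
                  license_id: Optional[str] = None) -> List[dict]:
    """Keep only records whose own fields match; records missing a field are dropped."""
    kept = []
    for record in records:
        if language is not None and language not in record.get("languages", []):
            continue
        if license_id is not None:
            value = record.get("license")
            # Accept either a single string or a list of license IDs.
            licenses = value if isinstance(value, list) else [value]
            if license_id not in licenses:
                continue
        kept.append(record)
    return kept
```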

FAQ

Do I need a Hugging Face account / API token? No. The Hub's read API is public.

How fresh is the data? Real-time — every run hits the live API.

Can I download model weights? No. This actor exposes Hub metadata — repo info, files list, license, tags, etc. To download weights, use the huggingface_hub Python library with the repoId from this actor's output.
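
As a minimal sketch of that hand-off, the function below passes a record's repoId to `snapshot_download` from `huggingface_hub` (`pip install huggingface_hub`). The helper name `download_repo` is hypothetical, and mapping recordType directly to the library's `repo_type` argument is an assumption about the output schema.

```python
from typing import Optional

# Assumption: recordType values line up with huggingface_hub repo types.
REPO_TYPES = {"model", "dataset", "space"}


def download_repo(record: dict, local_dir: Optional[str] = None) -> str:
    """Download every file of the repo described by a record; return the local path."""
    repo_type = record.get("recordType")
    if repo_type not in REPO_TYPES or "repoId" not in record:
        raise ValueError("record is not a model, dataset, or space with a repoId")
    # Imported lazily so validation works even if the library is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=record["repoId"], repo_type=repo_type, local_dir=local_dir)
```

Full snapshots can be large; `hf_hub_download` from the same library fetches a single file when only one artifact is needed.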

Why are some fields missing? Empty / null fields are omitted — only populated fields appear in the output.

Why does my license filter return fewer results than expected? Many repos don't tag their license. Records without a license:* tag are excluded when the license filter is set.