HuggingFace Hub Scraper
Pricing
from $3.00 / 1,000 results
Scrape Hugging Face Hub, search and fetch models, datasets, and spaces with full metadata: downloads, likes, license, pipeline tag, library, tags, files, and more. Pure HTTP, no auth required.
Rating: 5.0 (6)
Developer: Crawler Bros
Maintained by Community
Actor stats: 6 bookmarked, 2 total users, 1 monthly active user, last modified 19 hours ago
Scrape the Hugging Face Hub — 1M+ machine-learning models, 200K+ datasets, 400K+ Spaces, and millions of user profiles. Search by query, fetch by repo ID or URL, list trending repos, or pull a user's overview. Pure HTTP via the official public Hub API at huggingface.co/api/*. No auth, no proxy, no cookies.
What this actor does
- 7 modes: `search`, `byModel`, `byDataset`, `bySpace`, `byUser`, `trending`, `byUrl`
- Three entity catalogs: models, datasets, spaces (search & trending pivot on `entityType`)
- Filters: pipeline tag, library, license, language, author/org, min downloads, min likes
- Server-side sort: trending score, downloads, likes, last modified, created at
- URL auto-detection: paste any `huggingface.co/<repo>` or `/datasets/<id>` or `/spaces/<id>` or `/users/<u>` URL; the actor figures out the kind
- Optional `?full=true`: include sibling files, cardData, config metadata
- Empty fields are omitted; every record only contains populated fields
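The URL auto-detection described above can be pictured with a short sketch. This is not the actor's actual code: `detect_hub_kind` is a hypothetical helper, and it assumes the URL points at a repo root or user page rather than a sub-page such as a file browser.

```python
from urllib.parse import urlparse

# Path prefixes listed in this README; anything else is treated as a model repo.
PREFIX_TO_KIND = {"datasets": "dataset", "spaces": "space", "users": "user"}


def detect_hub_kind(url: str) -> tuple[str, str]:
    """Return (kind, identifier) for a huggingface.co URL (hypothetical helper)."""
    # Accept URLs pasted without a scheme, e.g. "huggingface.co/datasets/squad".
    if "://" not in url:
        url = "https://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError(f"no repo or user in URL: {url!r}")

    kind = PREFIX_TO_KIND.get(segments[0])
    if kind is None:
        # No known prefix: assume "<owner>/<name>" or a bare legacy "<name>".
        return "model", "/".join(segments[:2])

    rest = segments[1:]
    if not rest:
        raise ValueError(f"missing identifier after /{segments[0]}/ in {url!r}")
    if kind == "user":
        return kind, rest[0]
    # Datasets and spaces: "<owner>/<name>" or a bare legacy "<name>".
    return kind, "/".join(rest[:2])
```

A classifier like this only looks at the path, so it cannot tell a single-segment legacy model ID from an organization page; the actor may handle such cases differently.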
Output
The actor emits a flat record per repo / user. Fields you might see (omit-empty applies):
Common
- `recordType`: `model` / `dataset` / `space` / `user`
- `repoId`: full Hub identifier (e.g. `google-bert/bert-base-uncased`)
- `owner`: author / organization slug
- `sha`, `createdAt`, `lastModified`
- `downloads`, `likes`, `trendingScore`
- `license`, `tags`, `languages`
- `scrapedAt`
Model-only
- `modelName`, `modelType`, `architectures[]`
- `pipelineTag`, `libraryName`
- `trainedOnDatasets[]`, `arxivIds[]`
- `maskToken`, `fileCount`, `files[]`
- `modelUrl`
Dataset-only
- `datasetName`, `description`
- `taskCategories[]`, `taskIds[]`
- `modalities[]`, `formats[]`, `sizeCategories[]`
- `paperswithcodeId`, `fileCount`, `files[]`
- `datasetUrl`
Space-only
- `spaceName`, `sdk`, `runtimeStage`
- `title`, `emoji`, `host`, `subdomain`
- `fileCount`, `files[]`
- `spaceUrl`
User-only
- `username`, `fullName`, `avatarUrl`, `isPro`
- `numModels`, `numDatasets`, `numSpaces`
- `numFollowers`, `numFollowing`, `numLikes`, `numUpvotes`
- `numPapers`, `numDiscussions`
- `orgs[]`, `profileUrl`
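Because empty fields are omitted, code that consumes the output should not index keys directly. A minimal sketch of defensive reading, assuming records arrive as plain dicts; `summarize` and the default values are illustrative choices, not part of the actor:

```python
def summarize(record: dict) -> dict:
    """Build a fixed-shape summary from an omit-empty record (hypothetical helper)."""
    return {
        "recordType": record.get("recordType", "unknown"),
        # Repos carry repoId; user records carry username instead.
        "id": record.get("repoId") or record.get("username"),
        # Missing counters are treated as 0 here; that is a consumer-side choice.
        "downloads": record.get("downloads", 0),
        "likes": record.get("likes", 0),
        "license": record.get("license"),
        "tags": record.get("tags", []),
    }
```

Note that a missing `downloads` key means the field was empty in the output, so treating it as `0` may or may not be appropriate for a given analysis.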
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | enum | search | One of the 7 modes |
| `entityType` | enum | models | models / datasets / spaces (mode=search/trending) |
| `searchQuery` | string | bert | Free-text query |
| `repoIds` | array | – | Repo IDs or URLs (mode=byModel/byDataset/bySpace) |
| `username` | string | – | Username (mode=byUser) |
| `startUrls` | array | – | Hub URLs (mode=byUrl); kind auto-detected |
| `pipelineTag` | enum | – | Filter models by task tag (43 options) |
| `libraryName` | enum | – | Filter models by library (17 options) |
| `authorFilter` | string | – | Constrain to org/author slug |
| `license` | enum | – | License filter (29 options) |
| `language` | string | – | 2-letter language code |
| `sort` | enum | – | trendingScore / downloads / likes / lastModified / createdAt |
| `direction` | enum | desc | desc or asc |
| `minDownloads` | integer | – | Drop records below this download count |
| `minLikes` | integer | – | Drop records below this like count |
| `includeFullDetails` | boolean | false | Pass ?full=true for siblings/config |
| `maxItems` | integer | 50 | Hard cap (1–10000) |
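When building inputs programmatically, a small client-side pre-check can catch mistakes before a run starts. The sketch below is based only on the table above; `build_input` is a hypothetical helper, and whether the actor itself rejects these cases is an assumption that is not confirmed here.

```python
MODES = {"search", "byModel", "byDataset", "bySpace", "byUser", "trending", "byUrl"}

# Field each lookup mode relies on, per the Description column of the input table.
REQUIRED_FIELD = {
    "byModel": "repoIds",
    "byDataset": "repoIds",
    "bySpace": "repoIds",
    "byUser": "username",
    "byUrl": "startUrls",
}


def build_input(mode="search", max_items=50, **fields):
    """Assemble an input dict and validate it client-side (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not 1 <= max_items <= 10000:
        raise ValueError("maxItems must be between 1 and 10000")

    run_input = {"mode": mode, "maxItems": max_items}
    # Skip unset options so the input carries only the fields actually chosen.
    run_input.update({k: v for k, v in fields.items() if v is not None})

    needed = REQUIRED_FIELD.get(mode)
    if needed and not run_input.get(needed):
        raise ValueError(f"mode {mode!r} needs a non-empty {needed!r}")
    return run_input
```

For example, `build_input("byUser", username="julien-c")` yields the same shape as the user-profile example in this README, plus the default `maxItems`.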
Examples
Search top BERT models
{"mode": "search", "entityType": "models", "searchQuery": "bert", "sort": "downloads", "maxItems": 50}
Trending text-generation models
{"mode": "trending", "entityType": "models", "pipelineTag": "text-generation", "maxItems": 25}
Lookup a specific dataset
{"mode": "byDataset", "repoIds": ["rajpurkar/squad_v2"]}
Lookup by URL (auto-detect)
{"mode": "byUrl", "startUrls": ["https://huggingface.co/google-bert/bert-base-uncased", "https://huggingface.co/datasets/squad", "https://huggingface.co/spaces/lmarena-ai/chatbot-arena"]}
User profile
{"mode": "byUser", "username": "julien-c"}
Reliability
- Direct calls to the official `huggingface.co/api/*` endpoints
- Exponential backoff retries on `429`, `500`–`504`
- Page size capped at 100 (the API hard cap); paginated via `?skip=N&limit=N`
- No proxy needed; works from datacenter IPs
- No cookies / API token required for read access
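The retry and pagination behavior listed above can be sketched roughly as follows. This is an illustration, not the actor's implementation: the attempt count, base delay, and jitter are assumptions, and `call_with_backoff` and `page_windows` are hypothetical names. The `fetch` callable stands in for a real HTTP request so the sketch runs without network access.

```python
import random
import time

# Status codes the README says are retried: 429 and 500-504.
RETRYABLE = {429, 500, 501, 502, 503, 504}


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() -> (status, body), retrying retryable statuses with backoff."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    # Out of attempts: hand back the last retryable response to the caller.
    return status, body


def page_windows(max_items, page_size=100):
    """Yield (skip, limit) pairs covering max_items, at most page_size each."""
    skip = 0
    while skip < max_items:
        limit = min(page_size, max_items - skip)
        yield skip, limit
        skip += limit
```

With `maxItems` set to 250, `page_windows(250)` would produce three requests: two full pages of 100 and a final page of 50.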
Limitations
- Private repos require a user access token; this actor only exposes the public read API.
- The `language` and `license` filters are forwarded to the Hub API and applied server-side. The Hub's filtering is best-effort: some matching repos lack a `language:<code>` / `license:<id>` tag (the metadata lives in the model card). If you need strict tag-based filtering, post-filter the dataset on `languages[]` / `license`.
- The `?full=true` flag is rate-limited harder by the upstream; expect slower runs when enabled at large `maxItems`.
- Single-segment legacy repo IDs (e.g. `bert-base-uncased`) are auto-resolved by the API to their canonical owner-prefixed form (e.g. `google-bert/bert-base-uncased`).
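The strict post-filtering suggested in the limitations above might look like the sketch below. It assumes each record is a dict where `languages` is a list of language codes and `license` is a single string, as the Output section suggests; `strict_filter` is a hypothetical helper name.

```python
def strict_filter(records, language=None, license_id=None):
    """Keep only records whose own fields match the requested language/license."""
    kept = []
    for record in records:
        # Omitted fields count as "no match": an untagged repo is dropped.
        if language is not None and language not in record.get("languages", []):
            continue
        if license_id is not None and record.get("license") != license_id:
            continue
        kept.append(record)
    return kept
```

Passing `None` for either argument disables that check, so the same helper covers language-only and license-only filtering.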
FAQ
Do I need a Hugging Face account / API token? No. The Hub's read API is public.
How fresh is the data? Real-time — every run hits the live API.
Can I download model weights? No. This actor exposes Hub metadata — repo info, files list, license, tags, etc. To download weights, use the huggingface_hub Python library with the repoId from this actor's output.
Why are some fields missing? Empty / null fields are omitted — only populated fields appear in the output.
Why does my license filter return fewer results than expected? Many repos don't tag their license. Records without a license:* tag are excluded when the license filter is set.