HuggingFace Hub Scraper - Models, Datasets, Spaces & Authors
Pricing
from $2.50 / 1,000 huggingface records
Scrape HuggingFace Hub: models, datasets, spaces. 30+ fields per record, trending filters, author profiles, parsed tags, web enrichment for emails & websites.
Developer
deusex machine
Maintained by Community
HuggingFace Hub Scraper - Models, Datasets, Spaces, Papers & Authors
Scrape the HuggingFace Hub with 30+ fields per record. Use this HuggingFace scraper as a no-auth, no-rate-limit alternative to the official HuggingFace Hub API: search models, datasets, spaces and daily papers, filter by author, task, library, language, license, downloads or likes, parse the flat tag array into structured columns, and export everything to CSV, JSON, Excel or a queryable database.
If you have tried building anything on top of the HuggingFace Hub API you already know the friction: paginated REST endpoints with inconsistent shapes, tags packed into a flat string array, no first-class downloads-stats endpoint, and author profiles that require yet another HTTP call. This actor unifies all of that into one schema, runs entirely against the public HF endpoints (no HF_TOKEN, no API key) and adds an optional web-enrichment step for outreach.
Looking for HuggingFace data, an HF model finder, an HF dataset list, or a way to convert a HuggingFace dataset to CSV? This is the actor. It supports the four main HF resources (models, datasets, spaces, papers) plus bulk lookup by ID.
Why this HuggingFace scraper
- 30+ structured fields per record: `id`, `author`, `pipelineTag`, `library`, `parameters`, `usedStorageBytes`, `inferenceStatus`, `gated`, `widgetData`, `spacesUsing`, `siblings`, `config`, `cardData`, `arxivPapers`, `datasetsUsed`, and more
- Structured tag parsing: the flat `tags` array gets split into `license`, `languages`, `datasetsUsed`, `arxivPapers`, `region`, `hardwareCompatible`, `frameworks`
- Author / organization profile: followers, `isPro`, `numModels`, `numDatasets`, `numSpaces`, `numPapers`, list of organizations
- Web enrichment: for every unique author, find their personal or company website, LinkedIn, Facebook and secondary emails via a SERP fetcher (no API key)
- 5 modes: Models, Datasets, Spaces, Daily Papers, plus bulk Lookup-by-IDs
- Filters that actually work: author / organization, pipeline tag (task), library, language, license, SDK (for spaces), minimum downloads, minimum likes
- Sorting: trending score (fresh hype), downloads, likes, last modified, created at
- Outputs: Apify Dataset, exportable as CSV, JSON, Excel, XML, RSS, HTML
Built for AI researchers, ML platform teams, dev-tools founders building on top of HuggingFace, recruiters sourcing ML talent, VCs mapping the foundation-model landscape, and DevRel teams running outreach to model authors.
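To make the structured tag parsing described above concrete, here is a minimal sketch of how a flat tag array could be split into fields. It is not the actor's actual implementation. It assumes typed tags use a `prefix:value` form (`license:`, `dataset:`, `arxiv:`, `region:`), that languages appear as bare two-letter codes, and that frameworks can be recognized from a small illustrative set. The name `parse_tags` and the framework list are hypothetical, and `hardwareCompatible` is left out because its tag format is not described here.

```python
from typing import Any

# Assumed mapping from tag prefix to output field (illustrative, not the actor's code).
PREFIX_FIELDS = {"dataset": "datasetsUsed", "arxiv": "arxivPapers", "region": "region"}
# Illustrative set of bare tags treated as frameworks.
KNOWN_FRAMEWORKS = {"pytorch", "tf", "tensorflow", "jax", "safetensors", "onnx", "ggml"}


def parse_tags(tags: list[str]) -> dict[str, Any]:
    """Split a flat tag list into structured fields under the assumptions above."""
    out: dict[str, Any] = {
        "license": None,
        "languages": [],
        "datasetsUsed": [],
        "arxivPapers": [],
        "region": [],
        "frameworks": [],
    }
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep and prefix == "license":
            out["license"] = value
        elif sep and prefix in PREFIX_FIELDS:
            out[PREFIX_FIELDS[prefix]].append(value)
        elif not sep and tag in KNOWN_FRAMEWORKS:
            # Checked before the language rule so "tf" is not read as a language code.
            out["frameworks"].append(tag)
        elif not sep and len(tag) == 2 and tag.isalpha() and tag.islower():
            out["languages"].append(tag)
        # Anything else (task names, library names, ...) is ignored in this sketch.
    return out
```

Tags that match none of the rules are simply skipped, so the raw `tags` array remains the source of truth for anything this sketch does not classify.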
What this HuggingFace Hub Scraper extracts
| Field | Description |
|---|---|
| `id` | Full ID (author/name) |
| `name` | Short name without author prefix |
| `author` | Author or organization handle (e.g. meta-llama, mistralai) |
| `type` | model / dataset / space / paper |
| `pipelineTag` | Primary task (text-generation, image-classification, ...) |
| `library` | Library (transformers, diffusers, sentence-transformers, gguf, ...) |
| `tags` | Raw flat tags array (HF format) |
| `tagsStructured` | Parsed object: { license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks } |
| `downloads` | Total downloads (lifetime) |
| `downloadsAllTime` | Same as above; alias for compatibility |
| `likes` | Total likes |
| `trendingScore` | HF's own trending signal |
| `createdAt` | Creation timestamp |
| `lastModified` | Last modification timestamp |
| `private` | Whether the record is private (always false for a public scrape) |
| `gated` | Whether the model is gated (requires acceptance) |
| `disabled` | Whether the record is disabled |
| `inferenceStatus` | Inference API availability (live, loading, error) |
| `parameters` | Number of parameters (when published in safetensors / config.json) |
| `usedStorageBytes` | Model artifact size on disk |
| `widgetData` | Widget config (when present) |
| `spacesUsing` | Count of Spaces referencing this model |
| `siblings` | File list inside the repo (name + size) |
| `config` | Parsed config.json (model architecture, vocab size, hidden size, ...) |
| `cardData` | Parsed YAML front-matter of the README ("model card") |
| `arxivPapers` | Array of arXiv IDs declared in tags or cardData |
| `datasetsUsed` | Array of datasets declared as training sources |
| `frameworks` | Frameworks (pytorch, tensorflow, jax, ggml, ...) |
| `license` | Top-level license from cardData |
| `languages` | ISO codes (en, es, multilingual, ...) |
| `authorProfile` | { followers, isPro, numModels, numDatasets, numSpaces, numPapers, orgs[] } |
| `enrichment` | { website, linkedin, facebook, twitter, emails[] } |
| `url` | Canonical huggingface.co/... URL |
For Spaces, the schema adds sdk (docker / gradio / streamlit / static), runtime, models and datasets declared in README.md. For Papers, the schema adds title, summary, arxivId, upvotes, commentsCount, submittedBy.
Search modes
1. models: HuggingFace model search
Find models with all the standard filters and sort by trending. The trending score is HF's own hype signal (combines fresh downloads + likes + Spaces usage).
```json
{
  "searchType": "models",
  "pipelineTag": "text-generation",
  "minDownloads": 10000,
  "sort": "trendingScore",
  "maxResults": 100,
  "parseTagsStructured": true,
  "includeAuthorProfile": true
}
```
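The same input can be built and submitted from Python. The sketch below assumes the `apify-client` package is installed and that an Apify API token is available in an `APIFY_TOKEN` environment variable; the helper names `build_models_input` and `run_actor` are hypothetical, and the actor ID must be supplied by the caller because it is not stated here.

```python
import os


def build_models_input(pipeline_tag: str, min_downloads: int, max_results: int = 100) -> dict:
    """Build a models-mode input using the parameter names from this README."""
    if not 1 <= max_results <= 5000:
        raise ValueError("maxResults must be between 1 and 5,000")
    return {
        "searchType": "models",
        "pipelineTag": pipeline_tag,
        "minDownloads": min_downloads,
        "sort": "trendingScore",
        "maxResults": max_results,
        "parseTagsStructured": True,
        "includeAuthorProfile": True,
    }


def run_actor(actor_id: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so the input builder works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env variable name
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then look like `run_actor("<username>/<actor-name>", build_models_input("text-generation", 10000))`, with the placeholder replaced by the actor's real ID from the Apify Console.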
Common pipeline tags: text-generation, text-classification, feature-extraction, sentence-similarity, image-classification, image-to-image, text-to-image, automatic-speech-recognition, text-to-speech, translation, summarization, question-answering, token-classification, object-detection, depth-estimation.
Common library filters: transformers, diffusers, sentence-transformers, gguf, llama.cpp, mlx, coreml, tensorrt-llm.
2. datasets: HuggingFace dataset search and export
Use the dataset mode to discover or audit training data, or to build a HuggingFace-to-CSV pipeline. The actor returns the dataset metadata; for the actual rows, point your downstream tooling at the canonical huggingface.co/datasets/... resolver.
```json
{
  "searchType": "datasets",
  "searchQuery": "instruction",
  "language": "en",
  "minDownloads": 1000,
  "sort": "downloads",
  "maxResults": 200
}
```
Common queries: instruction tuning, dpo, code, medical, legal, multilingual, image-text, function-calling, tool-use, safety, red-teaming.
3. spaces: HuggingFace Space discovery
Find Gradio / Streamlit / Docker / static Spaces. Use this mode for competitive intel on AI demos, recruiter sourcing on ML engineers shipping public apps, or for building a "best Spaces of the week" feed.
```json
{
  "searchType": "spaces",
  "sdk": "gradio",
  "minLikes": 100,
  "sort": "trendingScore",
  "maxResults": 100
}
```
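For a "best Spaces of the week" feed, the returned records usually need one more pass so that a single prolific author does not fill the list. A minimal sketch, assuming each record carries the `author` and `likes` fields from the schema above (the function name `top_spaces_by_author` is hypothetical):

```python
def top_spaces_by_author(records: list[dict], limit: int = 10) -> list[dict]:
    """Keep each author's most-liked Space, then return the top `limit` by likes."""
    best: dict[str, dict] = {}
    for rec in records:
        author = rec.get("author")
        if not author:
            continue  # skip records without an author handle
        likes = rec.get("likes") or 0
        if author not in best or likes > (best[author].get("likes") or 0):
            best[author] = rec
    return sorted(best.values(), key=lambda r: r.get("likes") or 0, reverse=True)[:limit]
```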
4. papers: HuggingFace Daily Papers
The Daily Papers section curates community-submitted arXiv papers with AI-written summaries and upvote counts. This actor returns title, summary, arxivId, upvotes, comments and the submitting author.
```json
{
  "searchType": "papers",
  "sort": "trendingScore",
  "maxResults": 50
}
```
5. byIds: Bulk HuggingFace lookup by ID
Hand the actor a list of author/name IDs and it returns the full record for each, across models / datasets / spaces. Perfect for enriching a CSV you already have, or auditing a leaderboard.
```json
{
  "searchType": "byIds",
  "ids": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
    "microsoft/Phi-3.5-mini-instruct"
  ],
  "includeAuthorProfile": true
}
```
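When the IDs come from a CSV you already have, a small helper can deduplicate them and split them into run-sized batches. This is a sketch under two assumptions: the IDs live in a column named `id` (a hypothetical name), and the 5,000-record cap mentioned elsewhere in this README also bounds a single `byIds` run. IDs without an `author/` prefix are skipped, since this mode is documented as taking `author/name` IDs.

```python
import csv
import io
import re

ID_PATTERN = re.compile(r"^[\w.\-]+/[\w.\-]+$")  # author/name
MAX_IDS_PER_RUN = 5000  # assumed to match the maxResults hard cap


def read_ids(csv_text: str, column: str = "id") -> list[str]:
    """Return unique, well-formed author/name IDs in first-seen order."""
    seen: dict[str, None] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if ID_PATTERN.match(value):
            seen.setdefault(value, None)
    return list(seen)


def build_lookup_inputs(ids: list[str], batch_size: int = MAX_IDS_PER_RUN) -> list[dict]:
    """Split the IDs into one byIds input per batch."""
    return [
        {
            "searchType": "byIds",
            "ids": ids[start:start + batch_size],
            "includeAuthorProfile": True,
        }
        for start in range(0, len(ids), batch_size)
    ]
```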
Use cases
This HuggingFace scraper is designed for AI competitive intelligence, ML lead generation, talent sourcing, and dataset engineering.
- AI competitive intelligence: track every new fine-tune of Llama, Mistral, Qwen, Gemma, Phi. Filter by license, parameters, framework, dataset and pull the author profile to know who is shipping
- ML lead generation: find every author who released a popular model in your niche (RAG, voice, vision, robotics) and reach out with the enriched website + LinkedIn
- Recruiter sourcing for ML engineers: verified, public proof-of-work + author profile + secondary emails. Beats LinkedIn Recruiter for AI roles
- VC ecosystem mapping: combine `numModels` + `numDatasets` + `followers` per organization to surface fast-growing AI labs and emerging research groups
- Trending model digest / newsletter: a daily run sorted by `trendingScore` produces a clean "what's hot on HuggingFace today" feed
- Foundation-model leaderboard: pull every `text-generation` model with `parameters` populated and rank by your own criteria (parameters + downloads + license)
- Dataset audit and lineage: for any model, the parsed `cardData` includes the dataset(s) it was trained on. Build a model-to-dataset graph
- Convert HuggingFace dataset to CSV: get the dataset metadata, then download the raw files from the resolver
- Build a Spaces showcase: top Gradio Spaces by likes, deduplicated by author. Great for AI tool directories
- Brand monitoring on HuggingFace: find every model / dataset mentioning your company name, your paper, or your API as an integration
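As an illustration of the dataset-lineage use case, the sketch below inverts model records into a dataset-to-models map. It assumes each model record exposes `datasetsUsed` either at the top level or inside `tagsStructured`, as the field table describes; the function name `dataset_lineage` is hypothetical.

```python
from collections import defaultdict


def dataset_lineage(records: list[dict]) -> dict[str, list[str]]:
    """Map each declared training dataset to the model IDs that list it."""
    graph: defaultdict[str, list[str]] = defaultdict(list)
    for rec in records:
        if rec.get("type", "model") != "model":
            continue  # only models declare training datasets in this sketch
        datasets = (
            rec.get("datasetsUsed")
            or (rec.get("tagsStructured") or {}).get("datasetsUsed")
            or []
        )
        for dataset_id in datasets:
            graph[dataset_id].append(rec["id"])
    return dict(graph)
```

The result only reflects what authors declare in their tags or model card, so a model with an empty `datasetsUsed` list is absent from the graph rather than proven to have no training data.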
Example output
A single record from a byIds: ["meta-llama/Llama-3.1-8B-Instruct"] run (truncated):
```json
{
  "id": "meta-llama/Llama-3.1-8B-Instruct",
  "name": "Llama-3.1-8B-Instruct",
  "author": "meta-llama",
  "type": "model",
  "pipelineTag": "text-generation",
  "library": "transformers",
  "tagsStructured": {
    "license": "llama3.1",
    "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
    "datasetsUsed": [],
    "arxivPapers": ["2407.21783"],
    "region": ["us"],
    "frameworks": ["pytorch", "safetensors"]
  },
  "downloads": 4823910,
  "likes": 3915,
  "trendingScore": 88.4,
  "createdAt": "2024-07-18T08:54:01.000Z",
  "lastModified": "2026-04-22T16:30:11.000Z",
  "gated": true,
  "parameters": 8030261248,
  "usedStorageBytes": 16060522496,
  "spacesUsing": 1342,
  "authorProfile": {
    "followers": 24117,
    "isPro": false,
    "numModels": 39,
    "numDatasets": 4,
    "numSpaces": 1,
    "orgs": ["meta", "facebook"]
  },
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"
}
```
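Records like this one are nested, so any local post-processing into a flat table needs a flattening step. The sketch below shows one possible convention (dotted column names, lists serialized as JSON strings); it is not a description of how Apify's built-in CSV export lays out columns, and the function names `flatten` and `to_csv` are hypothetical.

```python
import csv
import io
import json


def flatten(record: dict, parent: str = "") -> dict:
    """Flatten nested objects into dotted keys; lists become JSON strings."""
    flat: dict = {}
    for key, value in record.items():
        column = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


def to_csv(records: list[dict]) -> str:
    """Write flattened records to CSV text, using the union of all columns."""
    rows = [flatten(r) for r in records]
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)  # missing columns are written as empty cells
    return buffer.getvalue()
```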
Compared to alternatives
| Tool | Maintainer emails | Downloads stats | Bulk lookup | Tag parsing | Web enrichment | Cost |
|---|---|---|---|---|---|---|
| HuggingFace Hub Scraper (this actor) | Yes, via enrichment | Yes: lifetime + trending | Yes: up to 5,000 | Yes: structured | Yes: optional | Pay-per-event |
| `huggingface_hub` Python SDK | No | Partial: per call | Partial: loops only | No | No | Free, slow |
| HuggingFace REST API | No | Partial: per call | Partial: loops only | No | No | Free, rate-limited |
| Papers With Code | No | No | Partial | No | No | Free |
| OpenReview scrapers | No | No | No | No | No | Free |
If you only need 10 records, the official SDK is fine. For thousands of records, structured tags, downloads stats, author profiles and email enrichment in one run, this actor is the fastest path.
Input parameters reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchType` | string enum | models | models / datasets / spaces / papers / byIds |
| `ids` | string[] | - | Used with byIds. author/name per line |
| `searchQuery` | string | - | Free-text across IDs and tags |
| `author` | string | - | Filter by author / organization (meta-llama, google, mistralai) |
| `pipelineTag` | string | - | Task (text-generation, text-classification, ...) |
| `library` | string | - | Library (transformers, diffusers, ...) |
| `language` | string | - | ISO code (en, es, multilingual, ...) |
| `license` | string | - | License filter (apache-2.0, mit, llama3.1, ...) |
| `sdk` | string | - | Spaces only: docker / gradio / streamlit / static |
| `minDownloads` | integer | - | Drop records below this download count |
| `minLikes` | integer | - | Drop records below this likes count |
| `sort` | string enum | trendingScore | trendingScore / downloads / likes / lastModified / createdAt |
| `maxResults` | integer | 100 | Hard cap (1 to 5,000) |
| `parseTagsStructured` | boolean | true | Split flat tags into structured fields |
| `includeAuthorProfile` | boolean | false | Fetch author / org profile |
| `enrichWithGoogle` | boolean | false | Find website + LinkedIn + secondary emails per author |
| `enrichLimit` | integer | 50 | Max unique authors to enrich (1 to 1,000) |
| `proxyConfig` | proxy | residential | Used for enrichment only |
Pricing & cost
Pay-per-event:
- Per record returned: small fee, linear with results
- Per enriched author: only when `enrichWithGoogle: true`, capped by `enrichLimit`
At the listed rate (from $2.50 per 1,000 records), a 1,000-model pull without enrichment costs about $2.50. With author profile + enrichment on 50 unique authors, you stay under a few dollars per run.
The actor only emits a billing event when a record actually lands in the Dataset. Retries, rate-limit backoffs and partial failures are not charged.
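The two billing events can be turned into a rough run estimate. The sketch below uses the listed starting rate of $2.50 per 1,000 records as the default; the per-enriched-author price is not stated in this README, so it is a parameter the caller must fill in from the actor's pricing page. The function name `estimate_cost` is hypothetical.

```python
def estimate_cost(
    records: int,
    enriched_authors: int = 0,
    price_per_1000_records: float = 2.50,  # listed starting rate; actual tier may differ
    price_per_enriched_author: float = 0.0,  # unknown here: supply the real rate
    enrich_limit: int = 50,
) -> float:
    """Estimate run cost in USD; enrichment is billed only up to enrich_limit."""
    billable_authors = min(enriched_authors, enrich_limit)
    record_cost = records / 1000 * price_per_1000_records
    enrichment_cost = billable_authors * price_per_enriched_author
    return round(record_cost + enrichment_cost, 2)
```

Because only records that land in the Dataset are billed, `records` should be the number of results actually returned, not the `maxResults` cap.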
Frequently asked questions
Is this an official HuggingFace product?
No. It calls the same public huggingface.co/api endpoints that the official huggingface_hub Python SDK uses. No HF_TOKEN required.
Do you respect HuggingFace terms of service?
Yes. We only read public endpoints. We add polite delays and exponential backoff on 429 responses.
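For readers building their own client, exponential backoff on 429 responses can be sketched as follows. This is a generic illustration of the technique, not the actor's internal code; the `fetch` callable, the retry count and the delay values are all assumptions.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, bytes]],  # hypothetical: returns (status, body)
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bytes]:
    """Retry on HTTP 429, doubling the wait each time and adding random jitter."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break  # out of retries, do not sleep again
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the requested delays instead of actually waiting.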
Can I get the model weights / dataset files themselves?
No. The actor returns metadata only (which includes the file list via siblings). To download the binary files, use HuggingFace's resolver (https://huggingface.co/<id>/resolve/main/<file>) with the standard huggingface_hub SDK.
How fresh is the data?
Live. Every request hits the HuggingFace Hub in real time.
Can I convert a HuggingFace dataset to CSV?
The actor returns dataset metadata (including the siblings list, which is the file manifest). For the actual rows, download the Parquet / JSON / CSV files declared in siblings and convert as you wish.
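A minimal sketch of that step: pick the data files out of `siblings` and build download URLs. It assumes each sibling entry exposes the file path under a `name` key (falling back to `rfilename`, the key used by the Hub API itself) and that dataset files resolve under `https://huggingface.co/datasets/<id>/resolve/<revision>/<file>`. The function name `data_file_urls` is hypothetical.

```python
from urllib.parse import quote

DATA_EXTENSIONS = (".parquet", ".csv", ".json", ".jsonl")


def data_file_urls(record: dict, revision: str = "main") -> list[str]:
    """Return resolver URLs for the data files listed in a dataset record."""
    repo_id = record["id"]
    urls = []
    for sibling in record.get("siblings") or []:
        # "name" is assumed from the field table; "rfilename" is the Hub API's key.
        path = sibling.get("name") or sibling.get("rfilename")
        if path and path.lower().endswith(DATA_EXTENSIONS):
            urls.append(
                f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(path)}"
            )
    return urls
```

Once downloaded, Parquet files can be converted with any tool you already use (for example `pandas.read_parquet` followed by `DataFrame.to_csv`).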
What is trendingScore?
HuggingFace's own hype signal. It combines recent downloads, likes, and Spaces usage to rank "what is hot right now". Useful for newsletter automation.
Why is parameters sometimes null?
HuggingFace populates parameters from a model's safetensors index or config.json. Some older models or non-transformer models don't publish that, so the field is null.
How do I find the most popular HuggingFace models?
Set searchType: "models", sort: "downloads", optionally pipelineTag to scope by task, and increase maxResults. For "fresh hype" use sort: "trendingScore".
Can I scrape gated models?
You can read their metadata. The gated: true field tells you the weights require user acceptance. The scraper does not bypass any gating.
Does the enrichment really find author emails?
Yes, when the author has published an email anywhere on their website, GitHub, LinkedIn or academic page. The SERP fetcher follows the same approach as Apollo / Hunter, applied to AI researchers and ML engineers.
Can I run this on a schedule?
Yes. Apify Schedules supports cron expressions. A daily run sorted by trendingScore produces a "what's new in AI today" feed.
How does this compare to the huggingface_hub Python SDK?
The SDK is great for single calls in a Python script. For bulk extraction (1K+ records), structured tags, downloads stats, author profiles and email enrichment in one run, this actor is much faster and exports straight to CSV / JSON / Excel.
Can I integrate the actor with Claude, Cursor or other AI agents?
Yes. Call the actor via the Apify API from your agent or use Apify's MCP server wrapper. I also publish dedicated MCP server actors (see below).
Other actors by makework36
Useful companions for the AI / ML stack:
- Reddit MCP Server: Reddit access for Claude, Cursor, ChatGPT, Codex
- Flight Price MCP Server: flight prices for AI agents
- Skyscanner MCP Server: flight search MCP
- Airbnb MCP Server: vacation-rental data for AI agents
- NPM Package Scraper: JavaScript ecosystem data + maintainer emails
- Lovable Sites Scraper: discover `.lovable.app` AI-built apps
- StackOverflow Scraper: questions, answers and tags
- Website Email & Contact Finder: extract emails from any URL
- Reddit Product Research Scraper: reviews and recommendations
- Reddit SaaS Leads Scraper: startup pain points and early adopters
- Substack Scraper: newsletter posts and authors
- Facebook Ad Library Scraper: competitor ad intelligence
Changelog
- v0.1: Initial release. Five search modes (models / datasets / spaces / papers / byIds), structured tag parsing, author profile, optional Google enrichment.
Support
Missing a field, hit a bug, or want a new mode? Open an issue or message me directly from the Apify Console. I respond fast and ship fixes within hours for paying users.