Pricing

from $2.50 / 1,000 huggingface records

HuggingFace Hub Scraper - Models, Datasets, Spaces & Authors

Scrape HuggingFace Hub: models, datasets, spaces. 30+ fields per record, trending filters, author profiles, parsed tags, web enrichment for emails & websites.

Developer

deusex machine

Last modified: a month ago

HuggingFace Hub Scraper — Models, Datasets, Spaces, Papers & Authors

Scrape the HuggingFace Hub with 30+ fields per record. This HuggingFace scraper is a no-auth alternative to calling the official HuggingFace Hub API yourself, with backoff on rate-limit (429) responses handled for you. Search models, datasets, spaces and daily papers; filter by author, task, library, language, license, downloads or likes; parse the flat tag array into structured columns; and export everything to CSV, JSON, Excel or a queryable database.

If you have tried building anything on top of the HuggingFace Hub API you already know the friction: paginated REST endpoints with inconsistent shapes, tags packed into a flat string array, no first-class downloads-stats endpoint, and author profiles that require yet another HTTP call. This actor unifies all of that into one schema, runs entirely against the public HF endpoints (no HF_TOKEN, no API key) and adds an optional web-enrichment step for outreach.

💡 Looking for HuggingFace data, an HF model finder, an HF dataset list, or a way to convert a HuggingFace dataset to CSV? This is the actor. It supports the four main HF resources — models, datasets, spaces, papers — plus bulk lookup by ID.

✨ Why use this scraper

30+ structured fields per record — id, author, pipelineTag, library, parameters, usedStorageBytes, inferenceStatus, gated, widgetData, spacesUsing, siblings, config, cardData, arxivPapers, datasetsUsed, and more
Structured tag parsing — the flat tags array gets split into license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks
Author / organization profile — followers, isPro, numModels, numDatasets, numSpaces, numPapers, list of organizations
Web enrichment — for every unique author, find their personal or company website, LinkedIn, Facebook and secondary emails via a SERP fetcher (no API key)
5 modes — Models, Datasets, Spaces, Daily Papers, plus bulk Lookup-by-IDs
Filters that actually work — author / organization, pipeline tag (task), library, language, license, SDK (for spaces), minimum downloads, minimum likes
Sorting — trending score (fresh hype), downloads, likes, last modified, created at
Outputs — Apify Dataset → CSV, JSON, Excel, XML, RSS, HTML

Built for AI researchers, ML platform teams, dev-tools founders building on top of HuggingFace, recruiters sourcing ML talent, VCs mapping the foundation-model landscape, and DevRel teams running outreach to model authors.
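To make the structured tag parsing above concrete, here is a minimal Python sketch of how a flat HF tag array could be split into the `tagsStructured` shape. It is an illustration only, not the actor's actual implementation: the prefix mapping, the framework list and the language list are assumptions and deliberately non-exhaustive, and `hardwareCompatible` is left out.

```python
# Prefixed tags as they commonly appear in HF tag arrays ("license:mit", "arxiv:...").
# The mapping to output keys is an assumption made for this sketch.
PREFIX_TO_FIELD = {
    "license": "license",
    "dataset": "datasetsUsed",
    "arxiv": "arxivPapers",
    "region": "region",
}
# Small, non-exhaustive lookup sets used only for illustration.
KNOWN_FRAMEWORKS = {"pytorch", "tensorflow", "jax", "safetensors", "ggml"}
KNOWN_LANGUAGES = {"en", "de", "fr", "it", "pt", "hi", "es", "th", "multilingual"}


def parse_tags(tags):
    """Split a flat tag list into a structured dict; unknown tags are ignored."""
    parsed = {
        "license": None,
        "languages": [],
        "datasetsUsed": [],
        "arxivPapers": [],
        "region": [],
        "frameworks": [],
    }
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep and prefix in PREFIX_TO_FIELD:
            field = PREFIX_TO_FIELD[prefix]
            if field == "license":
                parsed["license"] = value  # single value, not a list
            else:
                parsed[field].append(value)
        elif tag in KNOWN_FRAMEWORKS:
            parsed["frameworks"].append(tag)
        elif tag in KNOWN_LANGUAGES:
            parsed["languages"].append(tag)
    return parsed
```

Tags that match neither a known prefix nor a lookup set (for example the pipeline tag or the library name) are simply skipped here, since the record already exposes them as `pipelineTag` and `library`.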

📤 Output fields

Field	Description
`id`	Full ID (`author/name`)
`name`	Short name without author prefix
`author`	Author or organization handle (e.g. `meta-llama`, `mistralai`)
`type`	`model` / `dataset` / `space` / `paper`
`pipelineTag`	Primary task (text-generation, image-classification, …)
`library`	Library (transformers, diffusers, sentence-transformers, gguf, …)
`tags`	Raw flat tags array (HF format)
`tagsStructured`	Parsed object: `{ license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks }`
`downloads`	Total downloads (lifetime)
`downloadsAllTime`	Same as above; alias for compatibility
`likes`	Total likes
`trendingScore`	HF's own trending signal
`createdAt`	Creation timestamp
`lastModified`	Last modification timestamp
`private`	Whether the record is private (always `false` for public scrape)
`gated`	Whether the model is gated (requires acceptance)
`disabled`	Whether the record is disabled
`inferenceStatus`	Inference API availability (`live`, `loading`, `error`)
`parameters`	Number of parameters (when published in `safetensors` / `config.json`)
`usedStorageBytes`	Model artifact size on disk
`widgetData`	Widget config (when present)
`spacesUsing`	Count of Spaces referencing this model
`siblings`	File list inside the repo (name + size)
`config`	Parsed `config.json` (model architecture, vocab size, hidden size, …)
`cardData`	Parsed YAML front-matter of the README ("model card")
`arxivPapers`	Array of arXiv IDs declared in tags or cardData
`datasetsUsed`	Array of datasets declared as training sources
`frameworks`	Frameworks (pytorch, tensorflow, jax, ggml, …)
`license`	Top-level license from cardData
`languages`	ISO codes (en, es, multilingual, …)
`authorProfile`	`{ followers, isPro, numModels, numDatasets, numSpaces, numPapers, orgs[] }`
`enrichment`	`{ website, linkedin, facebook, twitter, emails[] }`
`url`	Canonical `huggingface.co/...` URL

For Spaces, the schema adds sdk (docker / gradio / streamlit / static), runtime, models and datasets declared in README.md. For Papers, the schema adds title, summary, arxivId, upvotes, commentsCount, submittedBy.

🎯 Search modes

1. `models` — HuggingFace model search

Find models with all the standard filters and sort by trending. The trending score is HF's own hype signal (combines fresh downloads + likes + Spaces usage).

{
  "searchType": "models",
  "pipelineTag": "text-generation",
  "minDownloads": 10000,
  "sort": "trendingScore",
  "maxResults": 100,
  "parseTagsStructured": true,
  "includeAuthorProfile": true
}

Common pipeline tags: text-generation, text-classification, feature-extraction, sentence-similarity, image-classification, image-to-image, text-to-image, automatic-speech-recognition, text-to-speech, translation, summarization, question-answering, token-classification, object-detection, depth-estimation.

Common library filters: transformers, diffusers, sentence-transformers, gguf, llama.cpp, mlx, coreml, tensorrt-llm.

2. `datasets` — HuggingFace dataset search and export

Use the dataset mode to discover or audit training data, or to build a HuggingFace-to-CSV pipeline. The actor returns the dataset metadata; for the actual rows, point your downstream tooling at the canonical huggingface.co/datasets/... resolver.

{
  "searchType": "datasets",
  "searchQuery": "instruction",
  "language": "en",
  "minDownloads": 1000,
  "sort": "downloads",
  "maxResults": 200
}

Common queries: instruction tuning, dpo, code, medical, legal, multilingual, image-text, function-calling, tool-use, safety, red-teaming.

3. `spaces` — HuggingFace Space discovery

Find Gradio / Streamlit / Docker / static Spaces. Use this mode for competitive intel on AI demos, recruiter sourcing on ML engineers shipping public apps, or for building a "best Spaces of the week" feed.

{
  "searchType": "spaces",
  "sdk": "gradio",
  "minLikes": 100,
  "sort": "trendingScore",
  "maxResults": 100
}

4. `papers` — HuggingFace Daily Papers

The Daily Papers section curates community-submitted arXiv papers with AI-written summaries and upvote counts. This actor returns title, summary, arxivId, upvotes, comments and the submitting author.

{
  "searchType": "papers",
  "sort": "trendingScore",
  "maxResults": 50
}

5. `byIds` — Bulk HuggingFace lookup by ID

Hand the actor a list of author/name IDs and it returns the full record for each, across models / datasets / spaces. Perfect for enriching a CSV you already have, or auditing a leaderboard.

{
  "searchType": "byIds",
  "ids": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
    "microsoft/Phi-3.5-mini-instruct"
  ],
  "includeAuthorProfile": true
}
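For the "enrich a CSV you already have" case, the `byIds` input can be generated from the CSV itself. The sketch below assumes a column named `id` (hypothetical) and assumes the documented 5,000 cap applies to the number of IDs per run, so longer lists are split into several inputs.

```python
import csv
import io

MAX_IDS_PER_RUN = 5000  # assumption: the documented 5,000 cap applies per run


def read_ids(csv_text, column="id"):
    """Collect unique author/name IDs from one CSV column, preserving order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    ids = []
    for row in reader:
        repo_id = (row.get(column) or "").strip()
        # Keep only values shaped like author/name; skip blanks and duplicates.
        if repo_id.count("/") == 1 and repo_id not in seen:
            seen.add(repo_id)
            ids.append(repo_id)
    return ids


def build_by_ids_inputs(ids, chunk_size=MAX_IDS_PER_RUN, include_author_profile=True):
    """Return one actor input per chunk of IDs."""
    return [
        {
            "searchType": "byIds",
            "ids": ids[start:start + chunk_size],
            "includeAuthorProfile": include_author_profile,
        }
        for start in range(0, len(ids), chunk_size)
    ]
```

Each dictionary in the returned list has the same shape as the JSON input above and can be submitted as a separate run.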

🎯 Use cases

This HuggingFace scraper is designed for AI competitive intelligence, ML lead generation, talent sourcing, and dataset engineering.

AI competitive intelligence — track every new fine-tune of Llama, Mistral, Qwen, Gemma, Phi. Filter by license, parameters, framework, dataset and pull the author profile to know who is shipping
ML lead generation — find every author who released a popular model in your niche (RAG, voice, vision, robotics) and reach out with the enriched website + LinkedIn
Recruiter sourcing for ML engineers — verified, public proof-of-work + author profile + secondary emails. Beats LinkedIn Recruiter for AI roles
VC ecosystem mapping — combine numModels + numDatasets + followers per organization to surface fast-growing AI labs and emerging research groups
Trending model digest / newsletter — daily run sorted by trendingScore produces a clean "what's hot on HuggingFace today" feed
Foundation-model leaderboard — pull every text-generation model with parameters populated and rank by your own criteria (parameters + downloads + license)
Dataset audit and lineage — for any model, the parsed cardData includes the dataset(s) it was trained on. Build a model-to-dataset graph
Convert HuggingFace dataset to CSV — get the dataset metadata, then download the raw files from the resolver
Build a Spaces showcase — top Gradio Spaces by likes, deduplicated by author. Great for AI tool directories
Brand monitoring on HuggingFace — find every model / dataset mentioning your company name, your paper, or your API as an integration
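Two of the use cases above (the model-to-dataset graph and the per-author deduplication for a Spaces showcase) amount to a few lines of post-processing on the exported records. The following Python sketch assumes records shaped like the output schema (`id`, `type`, `author`, `likes`, `datasetsUsed`); the function names are hypothetical.

```python
from collections import defaultdict


def dataset_lineage(records):
    """Map each dataset ID to the model IDs that declare it as a training source."""
    graph = defaultdict(list)
    for rec in records:
        if rec.get("type") != "model":
            continue
        for dataset_id in rec.get("datasetsUsed") or []:
            graph[dataset_id].append(rec["id"])
    return dict(graph)


def top_per_author(records, key="likes"):
    """Keep the single highest-scoring record per author, best first."""
    best = {}
    for rec in records:
        author = rec.get("author")
        if author is None:
            continue
        score = rec.get(key) or 0
        if author not in best or score > (best[author].get(key) or 0):
            best[author] = rec
    return sorted(best.values(), key=lambda r: r.get(key) or 0, reverse=True)
```

`dataset_lineage` gives the edges of a dataset-to-model graph, and `top_per_author` could feed a directory page where each author appears only once.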

💻 Code examples

A single record from a byIds: ["meta-llama/Llama-3.1-8B-Instruct"] run (truncated):

{
  "id": "meta-llama/Llama-3.1-8B-Instruct",
  "name": "Llama-3.1-8B-Instruct",
  "author": "meta-llama",
  "type": "model",
  "pipelineTag": "text-generation",
  "library": "transformers",
  "tagsStructured": {
    "license": "llama3.1",
    "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
    "datasetsUsed": [],
    "arxivPapers": ["2407.21783"],
    "region": ["us"],
    "frameworks": ["pytorch", "safetensors"]
  },
  "downloads": 4823910,
  "likes": 3915,
  "trendingScore": 88.4,
  "createdAt": "2024-07-18T08:54:01.000Z",
  "lastModified": "2026-04-22T16:30:11.000Z",
  "gated": true,
  "parameters": 8030261248,
  "usedStorageBytes": 16060522496,
  "spacesUsing": 1342,
  "authorProfile": {
    "followers": 24117,
    "isPro": false,
    "numModels": 39,
    "numDatasets": 4,
    "numSpaces": 1,
    "orgs": ["meta", "facebook"]
  },
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"
}
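A run that yields records like the one above could be started from Python through the Apify REST API. This is a hedged sketch using only the standard library: the actor ID passed in is a placeholder you would replace with the ID shown in the Apify Console, and the synchronous `run-sync-get-dataset-items` endpoint is assumed to be suitable only for short runs.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that runs an actor synchronously."""
    # The API path uses "username~actor-name" instead of "username/actor-name".
    path_id = actor_id.replace("/", "~")
    url = f"{API_BASE}/acts/{path_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, run_input, timeout=300):
    """Send the request and return the dataset items as a list of dicts."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `run_actor("<username>/<actor-name>", token, {"searchType": "byIds", "ids": ["meta-llama/Llama-3.1-8B-Instruct"]})`. Longer runs would typically use the asynchronous run endpoints or the official Apify client instead.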

📊 Comparison

Tool	Maintainer emails	Downloads stats	Bulk lookup	Tag parsing	Web enrichment	Cost
HuggingFace Hub Scraper (this actor)	✅ via enrichment	✅ Lifetime + trending	✅ Up to 5,000	✅ Structured	✅ Optional	Pay-per-event
`huggingface_hub` Python SDK	❌	⚠️ Per call	⚠️ Loops only	❌	❌	Free, slow
HuggingFace REST API	❌	⚠️ Per call	⚠️ Loops only	❌	❌	Free, rate-limited
Papers With Code	❌	❌	⚠️	❌	❌	Free
OpenReview scrapers	❌	❌	❌	❌	❌	Free

If you only need 10 records, the official SDK is fine. For thousands of records, structured tags, downloads stats, author profiles and email enrichment in one run, this actor is the fastest path.

📥 Input

Parameter	Type	Default	Description
`searchType`	string enum	`models`	`models` / `datasets` / `spaces` / `papers` / `byIds`
`ids`	string[]	—	Used with `byIds`. `author/name` per line
`searchQuery`	string	—	Free-text across IDs and tags
`author`	string	—	Filter by author / organization (`meta-llama`, `google`, `mistralai`)
`pipelineTag`	string	—	Task (`text-generation`, `text-classification`, …)
`library`	string	—	Library (`transformers`, `diffusers`, …)
`language`	string	—	ISO code (`en`, `es`, `multilingual`, …)
`license`	string	—	License filter (`apache-2.0`, `mit`, `llama3.1`, …)
`sdk`	string	—	Spaces only: `docker` / `gradio` / `streamlit` / `static`
`minDownloads`	integer	—	Drop records below this download count
`minLikes`	integer	—	Drop records below this likes count
`sort`	string enum	`trendingScore`	`trendingScore` / `downloads` / `likes` / `lastModified` / `createdAt`
`maxResults`	integer	`100`	Hard cap (1–5,000)
`parseTagsStructured`	boolean	`true`	Split flat tags into structured fields
`includeAuthorProfile`	boolean	`false`	Fetch author / org profile
`enrichWithGoogle`	boolean	`false`	Find website + LinkedIn + secondary emails per author
`enrichLimit`	integer	`50`	Max unique authors to enrich (1–1,000)
`proxyConfig`	proxy	residential	Used for enrichment only
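The constraints in the table above can be checked client-side before a run is started. The sketch below encodes only what the table states (enum values, the 1 to 5,000 and 1 to 1,000 ranges, `sdk` being Spaces-only); the helper names are hypothetical and the actor may apply its own validation differently.

```python
SEARCH_TYPES = {"models", "datasets", "spaces", "papers", "byIds"}
SORT_KEYS = {"trendingScore", "downloads", "likes", "lastModified", "createdAt"}
SPACE_SDKS = {"docker", "gradio", "streamlit", "static"}
RANGES = {"maxResults": (1, 5000), "enrichLimit": (1, 1000)}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_input(run_input):
    """Return a list of problems; an empty list means nothing was flagged."""
    errors = []
    search_type = run_input.get("searchType", "models")
    if search_type not in SEARCH_TYPES:
        errors.append(f"searchType must be one of {sorted(SEARCH_TYPES)}")
    if search_type == "byIds" and not run_input.get("ids"):
        errors.append("byIds needs a non-empty ids list")
    if run_input.get("sort", "trendingScore") not in SORT_KEYS:
        errors.append(f"sort must be one of {sorted(SORT_KEYS)}")
    if "sdk" in run_input:
        if search_type != "spaces":
            errors.append("sdk only applies to searchType 'spaces'")
        elif run_input["sdk"] not in SPACE_SDKS:
            errors.append(f"sdk must be one of {sorted(SPACE_SDKS)}")
    for key, (low, high) in RANGES.items():
        value = run_input.get(key)
        if value is not None and not (_is_int(value) and low <= value <= high):
            errors.append(f"{key} must be an integer in {low}..{high}")
    for key in ("minDownloads", "minLikes"):
        value = run_input.get(key)
        if value is not None and not (_is_int(value) and value >= 0):
            errors.append(f"{key} must be a non-negative integer")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a hand-edited input at once.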

💵 Pricing

Pay-per-event:

Per record returned — small fee, linear with results
Per enriched author — only when enrichWithGoogle: true, capped by enrichLimit

At the listed rate (from $2.50 per 1,000 records), a 1,000-model pull without enrichment costs roughly $2.50. With author profiles and enrichment on 50 unique authors, a run still stays within a few dollars.

The actor only emits a billing event when a record actually lands in the Dataset. Retries, rate-limit backoffs and partial failures are not charged.
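As a rough worked example of the pay-per-event model, the function below estimates a run's cost. The per-record rate uses the listed "from $2.50 / 1,000 records" figure as a lower bound; the per-enriched-author fee is not stated on this page, so it is a parameter the caller must supply, and the $0.01 used in the usage line is purely hypothetical.

```python
LISTED_PER_RECORD_USD = 2.50 / 1000  # from the listed "from $2.50 / 1,000 records"


def estimate_cost(records, enriched_authors=0,
                  per_record_usd=LISTED_PER_RECORD_USD,
                  per_enriched_author_usd=None):
    """Estimate run cost in USD; only records that land in the Dataset count."""
    if enriched_authors and per_enriched_author_usd is None:
        raise ValueError("per_enriched_author_usd is required when enriching authors")
    enrichment = enriched_authors * (per_enriched_author_usd or 0.0)
    return round(records * per_record_usd + enrichment, 4)
```

For example, `estimate_cost(1000)` gives 2.5, and `estimate_cost(1000, enriched_authors=50, per_enriched_author_usd=0.01)` gives 3.0 under that hypothetical enrichment fee.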

❓ FAQ

Is this an official HuggingFace product? No. It calls the same public huggingface.co/api endpoints that the official huggingface_hub Python SDK uses. No HF_TOKEN required.

Do you respect HuggingFace terms of service? Yes. We only read public endpoints. We add polite delays and exponential backoff on 429 responses.
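For readers unfamiliar with the pattern, exponential backoff on 429 responses can be sketched as follows. The actor's real delays and retry counts are not documented here, so every number below is an assumption, and `RateLimited` is a hypothetical exception that the caller's own fetch function would raise on a 429.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller-supplied fetch function on an HTTP 429."""


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Delays in seconds before each retry: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def fetch_with_backoff(fetch, retries=5, base=1.0, factor=2.0, cap=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Call fetch(); on RateLimited, wait with growing delays and try again."""
    for delay in backoff_delays(retries, base, factor, cap):
        try:
            return fetch()
        except RateLimited:
            # Add up to one second of random jitter so clients do not retry in lockstep.
            sleep(delay + jitter())
    return fetch()  # final attempt; a RateLimited error propagates to the caller
```

The `sleep` and `jitter` parameters are injectable so the retry logic can be exercised in tests without real waiting.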

Can I get the model weights / dataset files themselves? No. The actor returns metadata only (which includes the file list via siblings). To download the binary files, use HuggingFace's resolver (https://huggingface.co/<id>/resolve/main/<file> for models, https://huggingface.co/datasets/<id>/resolve/main/<file> for datasets) or the standard huggingface_hub SDK.

How fresh is the data? Live. Every request hits the HuggingFace Hub in real time.

Can I convert a HuggingFace dataset to CSV? The actor returns dataset metadata (including the siblings list, which is the file manifest). For the actual rows, download the Parquet / JSON / CSV files declared in siblings and convert as you wish.
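The step from the siblings manifest to downloadable files can be sketched like this. It assumes each siblings entry carries its path under a `name` key (the output table describes siblings as "name + size") and that the default `main` revision is wanted; the function names are hypothetical.

```python
from urllib.parse import quote

TABULAR_EXTENSIONS = (".parquet", ".json", ".jsonl", ".csv")
# URL prefixes per repo type on huggingface.co; models have no prefix.
REPO_PREFIX = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def resolve_url(repo_id, filename, repo_type="dataset", revision="main"):
    """Build the resolver URL for one file in a repo."""
    prefix = REPO_PREFIX[repo_type]
    # quote() keeps "/" intact, so nested paths such as data/train.parquet survive.
    return f"https://huggingface.co/{prefix}{repo_id}/resolve/{revision}/{quote(filename)}"


def data_file_urls(record):
    """Return resolver URLs for the tabular files listed in a record's siblings."""
    urls = []
    for sibling in record.get("siblings") or []:
        name = sibling.get("name", "")
        if name.lower().endswith(TABULAR_EXTENSIONS):
            urls.append(resolve_url(record["id"], name, record.get("type", "dataset")))
    return urls
```

The resulting URLs can then be passed to whatever tool performs the actual download and CSV conversion.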

What is trendingScore? HuggingFace's own hype signal. It combines recent downloads, likes, and Spaces usage to rank "what is hot right now". Useful for newsletter automation.

Why is parameters sometimes null? HuggingFace populates parameters from a model's safetensors index or config.json. Some older models or non-transformer models don't publish that, so the field is null.

How do I find the most popular HuggingFace models? Set searchType: "models", sort: "downloads", optionally pipelineTag to scope by task, and increase maxResults. For "fresh hype" use sort: "trendingScore".

Can I scrape gated models? You can read their metadata. The gated: true field tells you the weights require user acceptance. The scraper does not bypass any gating.

Does the enrichment really find author emails? Yes, when the author has published an email anywhere on their website, GitHub, LinkedIn or academic page. The SERP fetcher follows the same approach as Apollo / Hunter, applied to AI researchers and ML engineers.

Can I run this on a schedule? Yes. Apify Schedules supports cron expressions. A daily run sorted by trendingScore produces a "what's new in AI today" feed.

How does this compare to the huggingface_hub Python SDK? The SDK is great for single calls in a Python script. For bulk extraction (1K+ records), structured tags, downloads stats, author profiles and email enrichment in one run, this actor is much faster and exports straight to CSV / JSON / Excel.

Can I integrate the actor with Claude, Cursor or other AI agents? Yes — call the actor via the Apify API from your agent or use Apify's MCP server wrapper. I also publish dedicated MCP server actors (see below).

Useful companions for the AI / ML stack:

Reddit MCP Server — Reddit access for Claude, Cursor, ChatGPT, Codex
Flight Price MCP Server — flight prices for AI agents
Skyscanner MCP Server — flight search MCP
Airbnb MCP Server — vacation-rental data for AI agents
NPM Package Scraper — JavaScript ecosystem data + maintainer emails
Lovable Sites Scraper — discover .lovable.app AI-built apps
StackOverflow Scraper — questions, answers and tags
Website Email & Contact Finder — extract emails from any URL
Reddit Product Research Scraper — reviews and recommendations
Reddit SaaS Leads Scraper — startup pain points and early adopters
Substack Scraper — newsletter posts and authors
Facebook Ad Library Scraper — competitor ad intelligence

📝 Changelog

v0.1 — Initial release. Five search modes (models / datasets / spaces / papers / byIds), structured tag parsing, author profile, optional Google enrichment.

📞 Support

Missing a field, hit a bug, or want a new mode? Open an issue or message me directly from the Apify Console. I respond fast and ship fixes within hours for paying users.
