HuggingFace Hub Scraper - Models, Datasets, Spaces & Authors

Pricing

from $2.50 / 1,000 huggingface records

Scrape HuggingFace Hub: models, datasets, spaces. 30+ fields per record, trending filters, author profiles, parsed tags, web enrichment for emails & websites.

Developer

deusex machine

Maintained by Community

HuggingFace Hub Scraper - Models, Datasets, Spaces, Papers & Authors

Scrape the HuggingFace Hub with 30+ fields per record. Use this HuggingFace scraper as a no-auth alternative to the official HuggingFace Hub API, with rate-limit backoff handled for you: search models, datasets, spaces and daily papers, filter by author, task, library, language, license, downloads or likes, parse the flat tag array into structured columns, and export everything to CSV, JSON, Excel or a queryable database.

If you have tried building anything on top of the HuggingFace Hub API you already know the friction: paginated REST endpoints with inconsistent shapes, tags packed into a flat string array, no first-class downloads-stats endpoint, and author profiles that require yet another HTTP call. This actor unifies all of that into one schema, runs entirely against the public HF endpoints (no HF_TOKEN, no API key) and adds an optional web-enrichment step for outreach.

💡 Looking for HuggingFace data, an HF model finder, an HF dataset list, or a way to convert a HuggingFace dataset to CSV? This is the actor. It supports the four main HF resources (models, datasets, spaces, papers) plus bulk lookup by ID.


🚀 Why this HuggingFace scraper

  • 30+ structured fields per record: id, author, pipelineTag, library, parameters, usedStorageBytes, inferenceStatus, gated, widgetData, spacesUsing, siblings, config, cardData, arxivPapers, datasetsUsed, and more
  • Structured tag parsing: the flat tags array gets split into license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks
  • Author / organization profile: followers, isPro, numModels, numDatasets, numSpaces, numPapers, list of organizations
  • Web enrichment: for every unique author, find their personal or company website, LinkedIn, Facebook and secondary emails via a SERP fetcher (no API key)
  • 5 modes: Models, Datasets, Spaces, Daily Papers, plus bulk Lookup-by-IDs
  • Filters that actually work: author / organization, pipeline tag (task), library, language, license, SDK (for spaces), minimum downloads, minimum likes
  • Sorting: trending score (fresh hype), downloads, likes, last modified, created at
  • Outputs: Apify Dataset → CSV, JSON, Excel, XML, RSS, HTML

Built for AI researchers, ML platform teams, dev-tools founders building on top of HuggingFace, recruiters sourcing ML talent, VCs mapping the foundation-model landscape, and DevRel teams running outreach to model authors.


📊 What this HuggingFace Hub Scraper extracts

| Field | Description |
| --- | --- |
| id | Full ID (author/name) |
| name | Short name without author prefix |
| author | Author or organization handle (e.g. meta-llama, mistralai) |
| type | model / dataset / space / paper |
| pipelineTag | Primary task (text-generation, image-classification, …) |
| library | Library (transformers, diffusers, sentence-transformers, gguf, …) |
| tags | Raw flat tags array (HF format) |
| tagsStructured | Parsed object: { license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks } |
| downloads | Total downloads (lifetime) |
| downloadsAllTime | Alias of downloads, kept for compatibility |
| likes | Total likes |
| trendingScore | HF's own trending signal |
| createdAt | Creation timestamp |
| lastModified | Last modification timestamp |
| private | Whether the record is private (always false for public scrape) |
| gated | Whether the model is gated (requires acceptance) |
| disabled | Whether the record is disabled |
| inferenceStatus | Inference API availability (live, loading, error) |
| parameters | Number of parameters (when published in safetensors / config.json) |
| usedStorageBytes | Model artifact size on disk |
| widgetData | Widget config (when present) |
| spacesUsing | Count of Spaces referencing this model |
| siblings | File list inside the repo (name + size) |
| config | Parsed config.json (model architecture, vocab size, hidden size, …) |
| cardData | Parsed YAML front-matter of the README ("model card") |
| arxivPapers | Array of arXiv IDs declared in tags or cardData |
| datasetsUsed | Array of datasets declared as training sources |
| frameworks | Frameworks (pytorch, tensorflow, jax, ggml, …) |
| license | Top-level license from cardData |
| languages | ISO codes (en, es, multilingual, …) |
| authorProfile | { followers, isPro, numModels, numDatasets, numSpaces, numPapers, orgs[] } |
| enrichment | { website, linkedin, facebook, twitter, emails[] } |
| url | Canonical huggingface.co/... URL |

For Spaces, the schema adds sdk (docker / gradio / streamlit / static), runtime, models and datasets declared in README.md. For Papers, the schema adds title, summary, arxivId, upvotes, commentsCount, submittedBy.
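To make the tagsStructured field more concrete, here is a minimal sketch of how a flat Hub tags array could be split into structured fields. It is not the actor's actual parser, which is not published here. The function name parse_tags, the framework allow-list, and the two-letter language heuristic are all assumptions made for illustration, and hardwareCompatible is left out because its tag format is not described above.

```python
from typing import Dict, List

# Prefixed tags ("prefix:value") mapped to the structured field names
# listed in the table above.
PREFIX_FIELDS = {
    "dataset": "datasetsUsed",
    "arxiv": "arxivPapers",
    "region": "region",
}

# Assumption: a small allow-list stands in for real framework detection.
KNOWN_FRAMEWORKS = {"pytorch", "tensorflow", "jax", "ggml", "safetensors", "onnx"}


def parse_tags(tags: List[str]) -> Dict[str, object]:
    """Split a flat tags array into a structured dict (illustrative only)."""
    out: Dict[str, object] = {
        "license": None,
        "languages": [],
        "datasetsUsed": [],
        "arxivPapers": [],
        "region": [],
        "frameworks": [],
    }
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep and prefix == "license":
            out["license"] = value
        elif sep and prefix in PREFIX_FIELDS:
            out[PREFIX_FIELDS[prefix]].append(value)
        elif tag in KNOWN_FRAMEWORKS:
            out["frameworks"].append(tag)
        # Heuristic (assumption): bare two-letter lowercase tags are language codes.
        elif tag == "multilingual" or (len(tag) == 2 and tag.isalpha() and tag.islower()):
            out["languages"].append(tag)
    return out


sample = ["transformers", "pytorch", "safetensors", "text-generation",
          "en", "de", "arxiv:2407.21783", "license:llama3.1", "region:us"]
parsed = parse_tags(sample)
```

Tags that match none of the rules (such as the library name or the pipeline tag) are simply ignored in this sketch; a real parser would likely route them to their own fields.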


🎯 Search modes

1. models: HuggingFace model search

Find models with all the standard filters and sort by trending. The trending score is HF's own hype signal (combines fresh downloads + likes + Spaces usage).

{
  "searchType": "models",
  "pipelineTag": "text-generation",
  "minDownloads": 10000,
  "sort": "trendingScore",
  "maxResults": 100,
  "parseTagsStructured": true,
  "includeAuthorProfile": true
}

Common pipeline tags: text-generation, text-classification, feature-extraction, sentence-similarity, image-classification, image-to-image, text-to-image, automatic-speech-recognition, text-to-speech, translation, summarization, question-answering, token-classification, object-detection, depth-estimation.

Common library filters: transformers, diffusers, sentence-transformers, gguf, llama.cpp, mlx, coreml, tensorrt-llm.

2. datasets: HuggingFace dataset search and export

Use the dataset mode to discover or audit training data, or to build a HuggingFace-to-CSV pipeline. The actor returns the dataset metadata; for the actual rows, point your downstream tooling at the canonical huggingface.co/datasets/... resolver.

{
  "searchType": "datasets",
  "searchQuery": "instruction",
  "language": "en",
  "minDownloads": 1000,
  "sort": "downloads",
  "maxResults": 200
}

Common queries: instruction tuning, dpo, code, medical, legal, multilingual, image-text, function-calling, tool-use, safety, red-teaming.

3. spaces: HuggingFace Space discovery

Find Gradio / Streamlit / Docker / static Spaces. Use this mode for competitive intel on AI demos, recruiter sourcing on ML engineers shipping public apps, or for building a "best Spaces of the week" feed.

{
  "searchType": "spaces",
  "sdk": "gradio",
  "minLikes": 100,
  "sort": "trendingScore",
  "maxResults": 100
}
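For a "best Spaces" feed, one post-processing step on the output of a spaces run is to keep a single Space per author. The sketch below assumes each record carries the author and likes fields from the schema above; the function name top_spaces_by_author and the sample records are hypothetical.

```python
from typing import Dict, List


def top_spaces_by_author(records: List[dict], limit: int = 10) -> List[dict]:
    """Keep each author's most-liked Space, then rank the survivors by likes."""
    best: Dict[str, dict] = {}
    for rec in records:
        author = rec.get("author")
        if author is None:
            continue  # skip records without an author handle
        likes = rec.get("likes") or 0
        if author not in best or likes > (best[author].get("likes") or 0):
            best[author] = rec
    ranked = sorted(best.values(), key=lambda r: r.get("likes") or 0, reverse=True)
    return ranked[:limit]


# Hypothetical records shaped like the actor's output.
spaces = [
    {"id": "a/one", "author": "a", "likes": 150},
    {"id": "a/two", "author": "a", "likes": 300},
    {"id": "b/demo", "author": "b", "likes": 200},
]
showcase = top_spaces_by_author(spaces, limit=2)
```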

4. papers: HuggingFace Daily Papers

The Daily Papers section curates community-submitted arXiv papers with AI-written summaries and upvote counts. This actor returns title, summary, arxivId, upvotes, comments and the submitting author.

{
  "searchType": "papers",
  "sort": "trendingScore",
  "maxResults": 50
}

5. byIds: Bulk HuggingFace lookup by ID

Hand the actor a list of author/name IDs and it returns the full record for each, across models / datasets / spaces. Perfect for enriching a CSV you already have, or auditing a leaderboard.

{
  "searchType": "byIds",
  "ids": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
    "microsoft/Phi-3.5-mini-instruct"
  ],
  "includeAuthorProfile": true
}
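If the IDs already live in a CSV, the byIds input can be generated rather than typed by hand. This is a minimal sketch under a few assumptions: the CSV has a column named id (hypothetical), IDs follow the author/name shape described above, and the list is capped at 5,000 entries to match the documented maximum. The helper name build_by_ids_input is made up for this example.

```python
import csv
import io
import re

# Assumption: IDs look like "author/name"; anything else is skipped.
ID_PATTERN = re.compile(r"^[\w.\-]+/[\w.\-]+$")


def build_by_ids_input(csv_text: str, column: str = "id", max_ids: int = 5000) -> dict:
    """Read IDs from CSV text, drop malformed and duplicate entries, build actor input."""
    reader = csv.DictReader(io.StringIO(csv_text))
    unique = {}  # dict keys preserve first-seen order
    for row in reader:
        raw = (row.get(column) or "").strip()
        if ID_PATTERN.match(raw):
            unique[raw] = None
    return {
        "searchType": "byIds",
        "ids": list(unique)[:max_ids],
        "includeAuthorProfile": True,
    }


csv_text = (
    "id,notes\n"
    "meta-llama/Llama-3.1-8B-Instruct,baseline\n"
    "Qwen/Qwen2.5-7B-Instruct,candidate\n"
    "meta-llama/Llama-3.1-8B-Instruct,duplicate\n"
    "not an id,typo\n"
)
actor_input = build_by_ids_input(csv_text)
```

Note that the pattern drops IDs without an author prefix, so loosen it if your list contains such entries.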

💡 Use cases

This HuggingFace scraper is designed for AI competitive intelligence, ML lead generation, talent sourcing, and dataset engineering.

  • AI competitive intelligence: track every new fine-tune of Llama, Mistral, Qwen, Gemma, Phi. Filter by license, parameters, framework, dataset and pull the author profile to know who is shipping
  • ML lead generation: find every author who released a popular model in your niche (RAG, voice, vision, robotics) and reach out with the enriched website + LinkedIn
  • Recruiter sourcing for ML engineers: verified, public proof-of-work + author profile + secondary emails. Beats LinkedIn Recruiter for AI roles
  • VC ecosystem mapping: combine numModels + numDatasets + followers per organization to surface fast-growing AI labs and emerging research groups
  • Trending model digest / newsletter: a daily run sorted by trendingScore produces a clean "what's hot on HuggingFace today" feed
  • Foundation-model leaderboard: pull every text-generation model with parameters populated and rank by your own criteria (parameters + downloads + license)
  • Dataset audit and lineage: for any model, the parsed cardData includes the dataset(s) it was trained on. Build a model-to-dataset graph
  • Convert HuggingFace dataset to CSV: get the dataset metadata, then download the raw files from the resolver
  • Build a Spaces showcase: top Gradio Spaces by likes, deduplicated by author. Great for AI tool directories
  • Brand monitoring on HuggingFace: find every model / dataset mentioning your company name, your paper, or your API as an integration
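The dataset audit and lineage use case above can be sketched as a small inversion of the scraped records: map each declared dataset to the models that list it. The sketch assumes records expose datasetsUsed either inside tagsStructured or at the top level, as in the schema; the function name dataset_lineage and the sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def dataset_lineage(records: List[dict]) -> Dict[str, List[str]]:
    """Map each declared training dataset to the sorted list of models that use it."""
    graph = defaultdict(set)
    for rec in records:
        if rec.get("type") != "model":
            continue  # only models declare training datasets
        structured = rec.get("tagsStructured") or {}
        datasets = structured.get("datasetsUsed") or rec.get("datasetsUsed") or []
        for dataset_id in datasets:
            graph[dataset_id].add(rec["id"])
    return {dataset_id: sorted(models) for dataset_id, models in graph.items()}


# Hypothetical records shaped like the actor's output.
records = [
    {"id": "org/m1", "type": "model", "tagsStructured": {"datasetsUsed": ["ds/x", "ds/y"]}},
    {"id": "org/m2", "type": "model", "tagsStructured": {"datasetsUsed": ["ds/x"]}},
    {"id": "org/s1", "type": "space"},
]
lineage = dataset_lineage(records)
```

The result only reflects what authors declare in their model cards, so an empty entry does not prove a model was trained without a given dataset.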

🧾 Example output

A single record from a byIds: ["meta-llama/Llama-3.1-8B-Instruct"] run (truncated):

{
  "id": "meta-llama/Llama-3.1-8B-Instruct",
  "name": "Llama-3.1-8B-Instruct",
  "author": "meta-llama",
  "type": "model",
  "pipelineTag": "text-generation",
  "library": "transformers",
  "tagsStructured": {
    "license": "llama3.1",
    "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
    "datasetsUsed": [],
    "arxivPapers": ["2407.21783"],
    "region": ["us"],
    "frameworks": ["pytorch", "safetensors"]
  },
  "downloads": 4823910,
  "likes": 3915,
  "trendingScore": 88.4,
  "createdAt": "2024-07-18T08:54:01.000Z",
  "lastModified": "2026-04-22T16:30:11.000Z",
  "gated": true,
  "parameters": 8030261248,
  "usedStorageBytes": 16060522496,
  "spacesUsing": 1342,
  "authorProfile": {
    "followers": 24117,
    "isPro": false,
    "numModels": 39,
    "numDatasets": 4,
    "numSpaces": 1,
    "orgs": ["meta", "facebook"]
  },
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"
}
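Nested fields such as tagsStructured and authorProfile do not map cleanly onto spreadsheet columns, so a custom export may be useful alongside the built-in CSV download. The following is one possible flattening, assuming records shaped like the example above; the column selection and the helper names flatten and to_csv are choices made for this sketch.

```python
import csv
import io
from typing import List

COLUMNS = ["id", "author", "pipelineTag", "license", "languages",
           "downloads", "likes", "followers"]


def flatten(record: dict) -> dict:
    """Pull selected top-level and nested fields into one flat row."""
    tags = record.get("tagsStructured") or {}
    profile = record.get("authorProfile") or {}
    return {
        "id": record.get("id"),
        "author": record.get("author"),
        "pipelineTag": record.get("pipelineTag"),
        "license": tags.get("license"),
        "languages": ";".join(tags.get("languages") or []),  # list -> one cell
        "downloads": record.get("downloads"),
        "likes": record.get("likes"),
        "followers": profile.get("followers"),
    }


def to_csv(records: List[dict]) -> str:
    """Serialize flattened records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))
    return buffer.getvalue()


example = {
    "id": "meta-llama/Llama-3.1-8B-Instruct",
    "author": "meta-llama",
    "pipelineTag": "text-generation",
    "tagsStructured": {"license": "llama3.1", "languages": ["en", "de"]},
    "downloads": 4823910,
    "likes": 3915,
    "authorProfile": {"followers": 24117},
}
csv_text = to_csv([example])
```

Missing fields come out as empty cells, so rows without an author profile still line up with the header.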

🆚 Compared to alternatives

| Tool | Maintainer emails | Downloads stats | Bulk lookup | Tag parsing | Web enrichment | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| HuggingFace Hub Scraper (this actor) | ✅ via enrichment | ✅ Lifetime + trending | ✅ Up to 5,000 | ✅ Structured | ✅ Optional | Pay-per-event |
| huggingface_hub Python SDK | ❌ | ⚠️ Per call | ⚠️ Loops only | ❌ | ❌ | Free, slow |
| HuggingFace REST API | ❌ | ⚠️ Per call | ⚠️ Loops only | ❌ | ❌ | Free, rate-limited |
| Papers With Code | ❌ | ❌ | ⚠️ | ❌ | ❌ | Free |
| OpenReview scrapers | ❌ | ❌ | ❌ | ❌ | ❌ | Free |

If you only need 10 records, the official SDK is fine. For thousands of records, structured tags, downloads stats, author profiles and email enrichment in one run, this actor is the fastest path.


⚙️ Input parameters reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchType | string enum | models | models / datasets / spaces / papers / byIds |
| ids | string[] | - | Used with byIds. One author/name ID per entry |
| searchQuery | string | - | Free-text across IDs and tags |
| author | string | - | Filter by author / organization (meta-llama, google, mistralai) |
| pipelineTag | string | - | Task (text-generation, text-classification, …) |
| library | string | - | Library (transformers, diffusers, …) |
| language | string | - | ISO code (en, es, multilingual, …) |
| license | string | - | License filter (apache-2.0, mit, llama3.1, …) |
| sdk | string | - | Spaces only: docker / gradio / streamlit / static |
| minDownloads | integer | - | Drop records below this download count |
| minLikes | integer | - | Drop records below this likes count |
| sort | string enum | trendingScore | trendingScore / downloads / likes / lastModified / createdAt |
| maxResults | integer | 100 | Hard cap (1–5,000) |
| parseTagsStructured | boolean | true | Split flat tags into structured fields |
| includeAuthorProfile | boolean | false | Fetch author / org profile |
| enrichWithGoogle | boolean | false | Find website + LinkedIn + secondary emails per author |
| enrichLimit | integer | 50 | Max unique authors to enrich (1–1,000) |
| proxyConfig | proxy | residential | Used for enrichment only |
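When inputs are generated by a script, it can help to check them against the reference above before starting a run. The sketch below encodes only the enums, defaults and ranges listed in the table; the function name validate_input is hypothetical, and the actor's own validation may be stricter or looser than this.

```python
from typing import List

# Values taken from the parameter reference above.
SEARCH_TYPES = {"models", "datasets", "spaces", "papers", "byIds"}
SORT_KEYS = {"trendingScore", "downloads", "likes", "lastModified", "createdAt"}
SDKS = {"docker", "gradio", "streamlit", "static"}


def validate_input(params: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    errors = []
    search_type = params.get("searchType", "models")
    if search_type not in SEARCH_TYPES:
        errors.append(f"searchType must be one of {sorted(SEARCH_TYPES)}")
    if params.get("sort", "trendingScore") not in SORT_KEYS:
        errors.append(f"sort must be one of {sorted(SORT_KEYS)}")
    max_results = params.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 5000:
        errors.append("maxResults must be an integer between 1 and 5000")
    enrich_limit = params.get("enrichLimit", 50)
    if not isinstance(enrich_limit, int) or not 1 <= enrich_limit <= 1000:
        errors.append("enrichLimit must be an integer between 1 and 1000")
    if search_type == "byIds" and not params.get("ids"):
        errors.append("byIds requires a non-empty ids list")
    sdk = params.get("sdk")
    if sdk is not None:
        if search_type != "spaces":
            errors.append("sdk only applies when searchType is spaces")
        if sdk not in SDKS:
            errors.append(f"sdk must be one of {sorted(SDKS)}")
    return errors


ok = validate_input({"searchType": "models", "sort": "trendingScore", "maxResults": 100})
bad = validate_input({"searchType": "byIds"})
```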

💰 Pricing & cost

Pay-per-event:

  • Per record returned: small fee, linear with results
  • Per enriched author: only when enrichWithGoogle: true, capped by enrichLimit

At the listed rate of $2.50 per 1,000 records, a 1,000-model pull without enrichment costs from about $2.50. With author profile + enrichment on 50 unique authors, you stay under a few dollars per run.

The actor only triggers a billing event when a record actually lands in the Dataset. Retries, rate-limit backoffs and partial failures are not charged.
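Because pricing is linear, a run's cost can be estimated up front. The sketch below uses the listed starting rate of $2.50 per 1,000 records as its default. The per-author enrichment price is not stated on this page, so price_per_enriched_author is a placeholder you must fill in yourself; the value used in the example is purely illustrative.

```python
def estimate_cost(
    records: int,
    enriched_authors: int = 0,
    price_per_1000_records: float = 2.50,    # listed starting rate
    price_per_enriched_author: float = 0.0,  # assumption: not published here
) -> float:
    """Estimate the cost of one run in dollars, rounded to cents."""
    record_cost = records / 1000 * price_per_1000_records
    enrichment_cost = enriched_authors * price_per_enriched_author
    return round(record_cost + enrichment_cost, 2)


base = estimate_cost(1000)
# Illustrative only: 0.01 per enriched author is a made-up figure.
with_enrichment = estimate_cost(200, enriched_authors=50, price_per_enriched_author=0.01)
```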


❓ Frequently asked questions

Is this an official HuggingFace product? No. It calls the same public huggingface.co/api endpoints that the official huggingface_hub Python SDK uses. No HF_TOKEN required.

Do you respect HuggingFace terms of service? Yes. We only read public endpoints. We add polite delays and exponential backoff on 429 responses.
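As an illustration of what exponential backoff on 429 responses means in practice, here is a generic delay schedule. It is a sketch of the general technique, not the actor's implementation: the base delay, cap, retry count and optional jitter are all assumed values.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait time in seconds before each retry: base * 2**attempt, capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        # Optional jitter adds up to `jitter` * delay so clients do not retry in lockstep.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


schedule = backoff_delays(5)
capped = backoff_delays(8, cap=60.0)
```

A caller would sleep for each delay in turn after a 429 response and give up once the list is exhausted.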

Can I get the model weights / dataset files themselves? No. The actor returns metadata only (which includes the file list via siblings). To download the binary files, use HuggingFace's resolver (https://huggingface.co/<id>/resolve/main/<file>) with the standard huggingface_hub SDK.
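To turn an entry from siblings into a download link, the resolver URL can be assembled as below. This is a small sketch, assuming the file name is already available as a plain string and that dataset and Space repositories sit under the datasets/ and spaces/ path prefixes on huggingface.co. The function name resolve_url and the dataset ID in the example are hypothetical, and gated or private files still require authentication.

```python
from urllib.parse import quote

# Path prefix per repository type on huggingface.co.
PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def resolve_url(repo_id: str, filename: str, repo_type: str = "model",
                revision: str = "main") -> str:
    """Build the resolver URL for one file inside a Hub repository."""
    prefix = PREFIXES[repo_type]  # raises KeyError for unknown types
    return (
        f"https://huggingface.co/{prefix}{quote(repo_id)}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


model_file = resolve_url("meta-llama/Llama-3.1-8B-Instruct", "config.json")
# "some-org/some-dataset" is a placeholder ID, not a real repository.
dataset_file = resolve_url("some-org/some-dataset", "data/train.parquet", repo_type="dataset")
```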

How fresh is the data? Live. Every request hits the HuggingFace Hub in real time.

Can I convert a HuggingFace dataset to CSV? The actor returns dataset metadata (including the siblings list, which is the file manifest). For the actual rows, download the Parquet / JSON / CSV files declared in siblings and convert as you wish.

What is trendingScore? HuggingFace's own hype signal. It combines recent downloads, likes, and Spaces usage to rank "what is hot right now". Useful for newsletter automation.

Why is parameters sometimes null? HuggingFace populates parameters from a model's safetensors index or config.json. Some older models or non-transformer models don't publish that, so the field is null.

How do I find the most popular HuggingFace models? Set searchType: "models", sort: "downloads", optionally pipelineTag to scope by task, and increase maxResults. For "fresh hype" use sort: "trendingScore".

Can I scrape gated models? You can read their metadata. The gated: true field tells you the weights require user acceptance. The scraper does not bypass any gating.

Does the enrichment really find author emails? Yes, when the author has published an email anywhere on their website, GitHub, LinkedIn or academic page. The SERP fetcher follows the same approach as Apollo / Hunter, applied to AI researchers and ML engineers.

Can I run this on a schedule? Yes. Apify Schedules supports cron expressions. A daily run sorted by trendingScore produces a "what's new in AI today" feed.

How does this compare to the huggingface_hub Python SDK? The SDK is great for single calls in a Python script. For bulk extraction (1K+ records), structured tags, downloads stats, author profiles and email enrichment in one run, this actor is much faster and exports straight to CSV / JSON / Excel.

Can I integrate the actor with Claude, Cursor or other AI agents? Yes. Call the actor via the Apify API from your agent, or use Apify's MCP server wrapper. I also publish dedicated MCP server actors (see below).
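For an agent that calls the actor over HTTP, the request could be prepared as follows. The sketch targets Apify's synchronous run endpoint that returns dataset items and only builds the request, it does not send it. The actor ID shown is a placeholder in the username~actor-name form (the real ID is not given on this page), and the token value is likewise a stand-in.

```python
import json
from urllib.request import Request


def build_run_request(actor_id: str, token: str, actor_input: dict) -> Request:
    """Prepare a POST that runs an actor synchronously and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "username~actor-name",  # placeholder: substitute the actor's real ID
    "APIFY_TOKEN",          # placeholder: your Apify API token
    {"searchType": "models", "sort": "trendingScore", "maxResults": 10},
)
```

Passing the prepared request to urllib.request.urlopen would start the run and return the records as JSON once it finishes; long runs are usually better started asynchronously and polled.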


🔗 Other actors by makework36

Useful companions for the AI / ML stack:


📝 Changelog

  • v0.1: Initial release. Five search modes (models / datasets / spaces / papers / byIds), structured tag parsing, author profile, optional Google enrichment.

🛠️ Support

Missing a field, hit a bug, or want a new mode? Open an issue or message me directly from the Apify Console. I respond fast and ship fixes within hours for paying users.