Hugging Face Scraper — AI Models, Datasets, Spaces & Papers

Pricing

from $2.00 / 1,000 results

Export every AI model, dataset, space and daily paper from the Hugging Face Hub. Filter by task, library (transformers, diffusers, GGUF), language, license, author. Sort by downloads, likes, trending. Sibling files + README. Public HF API, no token. For AI builders, ML research, RAG and VC AI intel.

Developer: Logiover. Maintained by Community.

Hugging Face Scraper — AI Models, Datasets, Spaces & Papers Discovery

Discover and export every AI model, dataset, space, and daily research paper on the Hugging Face Hub — the world's largest open AI repository (~1M+ models, ~200k+ datasets, ~500k+ spaces). Filter by task (text-generation, embeddings, ASR, TTS, vision, etc.), library (transformers, diffusers, sentence-transformers, GGUF, MLX, ONNX), language, base model, license, author / organization. Sort by downloads, likes, recently-updated or trending.

Built on the official open Hugging Face Hub API — no token, no proxy, no scraping. Per item: full metadata, tag taxonomy, sibling files, README content, model card data and direct Hub URLs.

Perfect for AI tool builders, ML researchers, RAG / fine-tuning teams, AI model marketplaces, VC analysts tracking AI talent, and any competitive-intelligence workflow in the 2026 AI ecosystem.


🚀 What does this Hugging Face scraper do?

Five entity types — all in one normalized schema:

| Entity | What you get | Catalog size |
| --- | --- | --- |
| models | Model weights + config + tokenizer + adapters | ~1M+ |
| datasets | Training, evaluation, instruction-tuning, multimodal datasets | ~200k+ |
| spaces | Hosted Gradio / Streamlit / Docker demo apps | ~500k+ |
| papers | Curated daily research papers with upvotes & author lists | ~30 new / day |
| collections | Curated lists by HF users | dynamic |

Every record carries author, downloads, likes, task / pipeline tag, library, license, language tags, base-model lineage, dataset lineage, last-modified timestamp and direct Hub URL — ready for a leaderboard, monitoring dashboard, or RAG pipeline.


💡 Use cases

  • AI tool discovery / model marketplaces — daily refresh of every new model in a niche (e.g. all text-generation GGUF models above 1k downloads)
  • RAG / fine-tuning pipelines — discover every dataset matching task_categories:question-answering + language:en
  • VC / talent intel — pull every model by author=meta-llama or author=mistralai for portfolio monitoring; track which authors are accumulating likes fastest
  • AI release monitoring — alert when a new model in your watchlist is uploaded (sort=createdAt + modifiedFrom)
  • Hub indexing — feed every README into a vector DB for semantic search over the Hub
  • Competitive analysis — sort by downloads + filter by library=diffusers to map the image-generation landscape
  • Daily paper digests — pull every Hugging Face Daily Papers entry with upvotes, authors and abstracts
  • Model marketplace seeding — bulk-export every Apache-2.0 model in a category to bootstrap a model store

⚙️ Input configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| entityType | string | "models" | models / datasets / spaces / papers / collections. |
| search | string | "" | Free-text search over name + description. |
| author | string | "" | Author / organization filter (mistralai, meta-llama, Qwen, stabilityai). |
| pipelineTag | string | "" | Models only: task tag (text-generation, automatic-speech-recognition, image-to-text, etc.). |
| library | string | "" | Library filter (transformers, diffusers, sentence-transformers, gguf, mlx, onnx). |
| language | string | "" | Language tag (en, tr, multilingual). |
| tags | string[] | [] | Extra HF tag filters (license:apache-2.0, base_model:meta-llama/Llama-3-8B). |
| sort | string | "downloads" | downloads / likes / lastModified / createdAt / trendingScore. |
| sortDirection | string | "-1" | -1 (desc) / 1 (asc). |
| maxResults | integer | 500 | Hard cap. 0 = unlimited. |
| fetchDetails | boolean | false | Make one extra API call per item to fetch siblings list + cardData + full license + gated/disabled flags. |
| fetchReadme | boolean | false | Requires fetchDetails. Pulls the raw README.md content (model / dataset / space card). |
| minDownloads | integer | 0 | Client-side filter: drop items below this download count. |
| minLikes | integer | 0 | Client-side filter: drop items below this like count. |
| modifiedFrom | string | null | Drop items last-modified before this date (YYYY-MM-DD). |
| papersStartDate | string | null | Papers only: range start. Defaults to 30 days before today. |
| papersEndDate | string | null | Papers only: range end. Defaults to today. |
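
A quick local pre-flight check can catch conflicting options before a run starts. The sketch below is only an illustration built from the table above: the actor's own validation is not documented here, and `check_input`, `DEFAULTS`, `ENTITY_TYPES` and `SORT_KEYS` are hypothetical names, not part of the actor.

```python
# Hypothetical pre-flight check derived from the input table above.
# It is NOT the actor's own validation logic.
ENTITY_TYPES = {"models", "datasets", "spaces", "papers", "collections"}
SORT_KEYS = {"downloads", "likes", "lastModified", "createdAt", "trendingScore"}

# Documented defaults for the fields this sketch inspects.
DEFAULTS = {
    "entityType": "models",
    "sort": "downloads",
    "sortDirection": "-1",
    "maxResults": 500,
    "fetchDetails": False,
    "fetchReadme": False,
}


def check_input(user_input: dict) -> tuple[dict, list[str]]:
    """Merge documented defaults into user_input and list any conflicts."""
    cfg = {**DEFAULTS, **user_input}
    problems = []
    if cfg["entityType"] not in ENTITY_TYPES:
        problems.append(f"unknown entityType: {cfg['entityType']!r}")
    if cfg["sort"] not in SORT_KEYS:
        problems.append(f"unknown sort: {cfg['sort']!r}")
    if cfg["sortDirection"] not in {"-1", "1"}:
        problems.append("sortDirection must be \"-1\" or \"1\"")
    if cfg["maxResults"] < 0:
        problems.append("maxResults must be 0 (unlimited) or positive")
    if cfg["fetchReadme"] and not cfg["fetchDetails"]:
        problems.append("fetchReadme requires fetchDetails")
    if cfg.get("pipelineTag") and cfg["entityType"] != "models":
        problems.append("pipelineTag applies to models only")
    return cfg, problems
```

Calling `check_input({"fetchReadme": True})` reports the missing `fetchDetails` flag, which is the dependency the table calls out.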

📦 Output fields

Common (all entity types)

| Field | Description | Example |
| --- | --- | --- |
| entityType | model / dataset / space / paper / collection | "model" |
| id | Full HF ID ({author}/{name} or paper ID) | "Qwen/Qwen3-0.6B" |
| internalId | HF's internal _id | "645d.." |
| author | Author / organization | "Qwen" |
| name | Short name (post-/) | "Qwen3-0.6B" |
| description | Description (datasets only typically) | "..." |
| downloads | Total downloads (all time) | 18506640 |
| likes | Community likes | 1248 |
| trendingScore | HF trending algorithm score | 42.3 |
| tags | Full tag list (verbatim from HF) | ["transformers","safetensors","qwen3","..."] |
| languages | Languages parsed from tags | ["en","fr"] |
| datasets | Linked datasets (model only) | ["dataset:teknium/OpenHermes-2.5"] |
| baseModel | Base-model lineage (model only) | ["base_model:meta-llama/Llama-3-8B"] |
| license | License (parsed from tags or cardData) | "apache-2.0" |
| gated | Gating status (when fetchDetails) | "manual" / false |
| private | Private flag | false |
| disabled | Disabled flag | false |
| lastModified | Last-modified timestamp | "2026-05-06T22:..." |
| createdAt | Creation timestamp | "2024-..." |
| sha | Commit SHA | "3866cf9..." |
| url | Direct Hub URL | "https://huggingface.co/Qwen/Qwen3-0.6B" |
| scrapedAt | UTC scrape timestamp | "2026-05-18T07:30:00Z" |
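
To make the mapping concrete, here is a minimal sketch of how a raw Hub list item might be flattened into some of the common fields above. It is an assumption, not the actor's code: `normalize_model` is a hypothetical name, the raw keys (`_id`, `id`, `tags`, `downloads`, `likes`) follow the shape the Hub list endpoint is generally known to return, and the `languages` field is left out because the actor's parsing rule is not documented.

```python
from datetime import datetime, timezone


def normalize_model(raw: dict) -> dict:
    """Flatten one raw Hub model item into a subset of the common fields."""
    full_id = raw["id"]
    # "Qwen/Qwen3-0.6B" -> author "Qwen", name "Qwen3-0.6B";
    # IDs without a slash (for example "gpt2") have no author part.
    author, _, name = full_id.rpartition("/")
    tags = raw.get("tags") or []
    # License tags look like "license:apache-2.0"; keep only the value.
    licenses = [t.split(":", 1)[1] for t in tags if t.startswith("license:")]
    return {
        "entityType": "model",
        "id": full_id,
        "internalId": raw.get("_id"),
        "author": author or None,
        "name": name,
        "downloads": raw.get("downloads") or 0,
        "likes": raw.get("likes") or 0,
        "tags": tags,
        # The examples in the table keep the tag prefix, so it is kept here too.
        "datasets": [t for t in tags if t.startswith("dataset:")],
        "baseModel": [t for t in tags if t.startswith("base_model:")],
        "license": licenses[0] if licenses else None,
        "url": f"https://huggingface.co/{full_id}",
        "scrapedAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Keeping every entity in one flat dict like this is what lets a downstream table or DataFrame treat models, datasets and spaces uniformly.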

Model-specific

| Field | Description |
| --- | --- |
| pipelineTag | Primary task tag |
| libraryName | transformers, diffusers, etc. |
| siblings | File list (fetchDetails) |
| fileCount | Number of files |
| modelIndex | Evaluation results (fetchDetails) |
| cardData | Full card data dict (fetchDetails) |
| readme | Raw README.md (fetchReadme) |

Space-specific

| Field | Description |
| --- | --- |
| sdk | gradio / streamlit / docker / static |
| spaceRuntime | Runtime status, hardware, sleep config |

Paper-specific

| Field | Description |
| --- | --- |
| paperId | Arxiv-style ID |
| paperAuthors | List of author names |
| paperSummary | Abstract |
| paperPublishedAt | Publication date |
| paperUpvotes | HF Daily Papers upvotes |

🧪 Example inputs

1. Top 200 text-generation models, with details and READMEs

{
  "entityType": "models",
  "pipelineTag": "text-generation",
  "sort": "downloads",
  "sortDirection": "-1",
  "maxResults": 200,
  "fetchDetails": true,
  "fetchReadme": true
}

2. Every Llama-3 fine-tune in the last 30 days

{
  "entityType": "models",
  "search": "llama-3",
  "tags": ["base_model:meta-llama/Llama-3-8B"],
  "sort": "lastModified",
  "modifiedFrom": "2026-04-18",
  "maxResults": 1000
}

3. Top 500 datasets for instruction tuning (English)

{
  "entityType": "datasets",
  "search": "instruct",
  "language": "en",
  "tags": ["task_categories:text-generation"],
  "sort": "likes",
  "maxResults": 500
}

4. Recent Daily Papers (last 14 days, top upvoted)

{
  "entityType": "papers",
  "papersStartDate": "2026-05-04",
  "papersEndDate": "2026-05-18",
  "maxResults": 200
}
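
For a date range like the one above, the actor is described as issuing one Daily Papers request per day. The sketch below shows how such a range could be expanded into per-day URLs. Treating both endpoints as inclusive is an assumption, and `daily_paper_urls` is a hypothetical helper, not the actor's code.

```python
from datetime import date, timedelta

DAILY_PAPERS = "https://huggingface.co/api/daily_papers"


def daily_paper_urls(start: str, end: str) -> list[str]:
    """Expand a YYYY-MM-DD range into one Daily Papers URL per day.

    Both endpoints are treated as inclusive (an assumption).
    """
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    if first > last:
        raise ValueError("start date must not be after end date")
    span = (last - first).days
    return [
        f"{DAILY_PAPERS}?date={(first + timedelta(days=offset)).isoformat()}"
        for offset in range(span + 1)
    ]
```

Under the inclusive assumption, the range 2026-05-04 to 2026-05-18 expands to 15 requests, whose results would then be merged into a single dataset.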

5. Top 100 spaces from a specific org, by likes

{
  "entityType": "spaces",
  "author": "huggingface",
  "sort": "likes",
  "maxResults": 100
}

6. Apache-2.0 LLM models over 100k downloads, with full details

{
  "entityType": "models",
  "pipelineTag": "text-generation",
  "tags": ["license:apache-2.0"],
  "sort": "downloads",
  "minDownloads": 100000,
  "maxResults": 500,
  "fetchDetails": true
}

7. Turkish-language models

{
  "entityType": "models",
  "language": "tr",
  "sort": "downloads",
  "maxResults": 200
}
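
Example 6 relies on minDownloads, which the input table describes as a client-side filter, alongside minLikes and modifiedFrom. A minimal sketch of what such post-fetch filtering could look like follows. `passes_floors` is a hypothetical name, and dropping records that lack a lastModified value when a date floor is set is an assumption.

```python
from datetime import date


def passes_floors(record, min_downloads=0, min_likes=0, modified_from=None):
    """Return True if a normalized record clears every client-side floor."""
    if (record.get("downloads") or 0) < min_downloads:
        return False
    if (record.get("likes") or 0) < min_likes:
        return False
    if modified_from is not None:
        last_modified = record.get("lastModified")
        if not last_modified:
            # Assumption: a record without a timestamp cannot satisfy a date floor.
            return False
        # ISO timestamps start with YYYY-MM-DD, so the first 10 characters suffice.
        if date.fromisoformat(last_modified[:10]) < date.fromisoformat(modified_from):
            return False
    return True


records = [
    {"id": "a/big", "downloads": 250_000, "likes": 40, "lastModified": "2026-05-06T22:00:00.000Z"},
    {"id": "b/small", "downloads": 900, "likes": 3, "lastModified": "2026-05-10T09:00:00.000Z"},
    {"id": "c/old", "downloads": 400_000, "likes": 90, "lastModified": "2025-01-02T00:00:00.000Z"},
]
kept = [r["id"] for r in records
        if passes_floors(r, min_downloads=100_000, modified_from="2026-04-18")]
```

Because these floors are applied after the API responds, they reduce what is saved but not how many list calls are made.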

🧠 How it works

  1. List endpoint: GET https://huggingface.co/api/{models|datasets|spaces}?search=&author=&pipeline_tag=&library=&language=&filter=&sort=&direction=&limit=100&cursor=...
  2. Cursor pagination: HF returns Link: <url>; rel="next" headers; the actor extracts the cursor param and walks forward until it hits the result cap or the results run out.
  3. Detail enrichment: GET https://huggingface.co/api/{type}/{id} returns the sibling file list, cardData, modelIndex, spaceRuntime and gated/disabled flags.
  4. README fetch: GET https://huggingface.co/{id}/raw/main/README.md (or /datasets/{id}/... for datasets).
  5. Daily Papers: GET https://huggingface.co/api/daily_papers?date=YYYY-MM-DD for each day in the requested range; the results are merged into a single dataset.
  6. Normalization: every entity is mapped to the same flat record shape so cross-entity analytics are trivial.
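
Steps 1 and 2 above can be sketched as two small helpers: one that assembles a list URL and one that pulls the cursor out of a Link header. This is an illustration under stated assumptions, not the actor's implementation. `build_list_url` and `next_cursor` are hypothetical names, and the query parameter names are copied from the URL template in step 1.

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse

API_ROOT = "https://huggingface.co/api"

# Matches the URL of the entry tagged rel="next" inside a Link header.
_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def build_list_url(entity, params, limit=100, cursor=None):
    """Assemble a list-endpoint URL, skipping empty filters."""
    query = {k: v for k, v in params.items() if v not in ("", None)}
    query["limit"] = limit
    if cursor:
        query["cursor"] = cursor
    return f"{API_ROOT}/{entity}?{urlencode(query)}"


def next_cursor(link_header):
    """Return the cursor value from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    if not match:
        return None
    values = parse_qs(urlparse(match.group(1)).query).get("cursor")
    return values[0] if values else None
```

A paging loop would call `build_list_url`, read the response's Link header, pass it to `next_cursor`, and stop once that returns None or the result cap is reached.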

No authentication. The Hugging Face Hub API is intentionally open.


🛑 Limits & notes

  • Hub size is enormous. Listing all models (~1M) at the default page size takes ~10k API calls. Use filters aggressively.
  • gated:auto / gated:manual items return metadata but file downloads from siblings require an HF token (out of scope here).
  • private items are not returned by the public API.
  • HF rate limits: ~600 req/min/IP for anonymous calls in normal conditions. The actor uses retry with backoff.
  • Daily Papers archive goes back to ~2023-10. Older dates return empty arrays.
  • fetchReadme can be heavy (each README is 5–100KB). For large runs, disable it and pull readmes only for the top items.
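
The call counts behind these notes are simple arithmetic, sketched below as a rough planning aid. `estimate_run` is a hypothetical helper. The page size of 100 and the ~600 req/min figure are taken from this page, and the time estimate is a lower bound that ignores latency, retries and any extra calls made by fetchReadme.

```python
import math


def estimate_run(items, page_size=100, fetch_details=False, req_per_min=600):
    """Estimate API calls and a lower-bound duration for a run."""
    list_calls = math.ceil(items / page_size)
    # fetchDetails is documented as one extra API call per item.
    detail_calls = items if fetch_details else 0
    total_calls = list_calls + detail_calls
    return {
        "list_calls": list_calls,
        "detail_calls": detail_calls,
        "total_calls": total_calls,
        "min_minutes": total_calls / req_per_min,
    }


full_catalog = estimate_run(1_000_000)
with_details = estimate_run(1_000_000, fetch_details=True)
```

Listing ~1M models comes to 10,000 list calls, about 17 minutes at the quoted rate. Turning on fetchDetails adds one call per item, which pushes the same run to roughly 28 hours, so filters and maxResults matter far more once enrichment is enabled.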

💰 Pricing

Monetized via pay-per-event on Apify — pay per item saved. Hugging Face API is free.


❓ FAQ

Does this download the actual model weights? No, only metadata. The url field points to the repo page, and each file listed in siblings can be fetched downstream from https://huggingface.co/{id}/resolve/main/{file}.
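
As a sketch of that downstream step, the helper below turns a record's id and siblings into download URLs using the resolve pattern quoted above. The shape of siblings is an assumption: the raw Hub API lists files as objects with an `rfilename` key, but the actor may emit plain strings, so both are accepted. `resolve_urls` is a hypothetical name.

```python
from urllib.parse import quote


def resolve_urls(repo_id, siblings, revision="main"):
    """Build direct download URLs for a model repo's files."""
    urls = []
    for entry in siblings:
        # Accept {"rfilename": "..."} objects or plain file-name strings.
        filename = entry["rfilename"] if isinstance(entry, dict) else entry
        # quote() keeps "/" so files in subfolders stay addressable.
        urls.append(
            f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"
        )
    return urls
```

Files in gated repos will still need an HF token at download time, which this sketch does not handle.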

Can I get model evaluation benchmarks (MMLU, ARC, etc.)? Yes — enable fetchDetails; modelIndex returns the structured evaluation results when the author has published them.

Does it cover Hugging Face Inference Endpoints / Inference API status? No, those are gated/paid features. This actor returns read-only Hub catalog data.

Can I scrape user / organization profiles? Use author=<name> as a filter — every model/dataset/space from that org is enumerated. For dedicated user-page data, open an issue.

How is this different from the official huggingface_hub Python library? The library is a thin client for the same endpoints. This actor adds: cross-entity normalization, pagination guardrails, README/sibling enrichment, dataset/collection mode, optional download/like floors, and Apify-native output (CSV/Excel/JSONL export, scheduling, webhook integrations).

Will it work for private/enterprise models? No — token-based scraping is out of scope. The actor targets the public catalog only.


  • logiover/github-repository-scraper — combine with HF authors to find their GitHub orgs
  • logiover/substack-newsletter-scraper — track which AI newsletter authors also publish on HF
  • logiover/apple-podcasts-episode-scraper — find AI podcasts and join with HF model author names
  • logiover/sitemap-to-url-crawler — crawl an HF author's portfolio website for full attribution

🆘 Support

Need user profiles, organization members, or HF Inference Endpoint status? Open an issue on the actor's Apify page.


Changelog

  • 2026-05-20 — Maintenance pass: reviewed the input schema and default values for a smooth one-click start, and rebuilt the Actor on the latest base image.

Last reviewed: 2026-05-20.