Hugging Face Scraper — AI Models, Datasets, Spaces & Papers
Pricing
from $2.00 / 1,000 results
Export every AI model, dataset, space and daily paper from the Hugging Face Hub. Filter by task, library (transformers, diffusers, GGUF), language, license, author. Sort by downloads, likes, trending. Sibling files + README. Public HF API, no token. For AI builders, ML research, RAG and VC AI intel.
Developer
Logiover
Hugging Face Scraper — AI Models, Datasets, Spaces & Papers Discovery
Discover and export every AI model, dataset, space, and daily research paper on the Hugging Face Hub — the world's largest open AI repository (~1M+ models, ~200k+ datasets, ~500k+ spaces). Filter by task (text-generation, embeddings, ASR, TTS, vision, etc.), library (transformers, diffusers, sentence-transformers, GGUF, MLX, ONNX), language, base model, license, author / organization. Sort by downloads, likes, recently-updated or trending.
Built on the official open Hugging Face Hub API — no token, no proxy, no scraping. Per item: full metadata, tag taxonomy, sibling files, README content, model card data and direct Hub URLs.
Perfect for AI tool builders, ML researchers, RAG / fine-tuning teams, AI model marketplaces, VC analysts tracking AI talent, and any competitive-intelligence workflow in the 2026 AI ecosystem.
🚀 What does this Hugging Face scraper do?
Five entity types — all in one normalized schema:
| Entity | What you get | Catalog size |
|---|---|---|
| `models` | Model weights + config + tokenizer + adapters | ~1M+ |
| `datasets` | Training, evaluation, instruction-tuning, multimodal datasets | ~200k+ |
| `spaces` | Hosted Gradio / Streamlit / Docker demo apps | ~500k+ |
| `papers` | Curated daily research papers with upvotes & author lists | ~30 new / day |
| `collections` | Curated lists by HF users | dynamic |
Every record carries author, downloads, likes, task / pipeline tag, library, license, language tags, base-model lineage, dataset lineage, last-modified timestamp and direct Hub URL — ready for a leaderboard, monitoring dashboard, or RAG pipeline.
💡 Use cases
- AI tool discovery / model marketplaces: daily refresh of every new model in a niche (e.g. all `text-generation` GGUF models above 1k downloads)
- RAG / fine-tuning pipelines: discover every dataset matching `task_categories:question-answering` + `language:en`
- VC / talent intel: pull every model by `author=meta-llama` or `author=mistralai` for portfolio monitoring; track which authors are accumulating likes fastest
- AI release monitoring: alert when a new model in your watchlist is uploaded (`sort=createdAt` + `modifiedFrom`)
- Hub indexing: feed every README into a vector DB for semantic search over the Hub
- Competitive analysis: sort by downloads + filter by `library=diffusers` to map the image-generation landscape
- Daily paper digests: pull every Hugging Face Daily Papers entry with upvotes, authors and abstracts
- Model marketplace seeding: bulk-export every Apache-2.0 model in a category to bootstrap a model store
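The "which authors are accumulating likes fastest" use case can be approximated by diffing two exported snapshots. The following is a minimal sketch, assuming each snapshot is a list of records carrying the `id`, `author` and `likes` output fields; `likes_gained_by_author` is a hypothetical helper (not part of the actor) and the sample records are made up for illustration.

```python
from collections import defaultdict


def likes_gained_by_author(previous, current):
    """Sum per-author like gains between two snapshots of exported records.

    Items that are missing from the earlier snapshot count from zero likes.
    """
    earlier = {rec["id"]: rec.get("likes", 0) for rec in previous}
    gains = defaultdict(int)
    for rec in current:
        gains[rec["author"]] += rec.get("likes", 0) - earlier.get(rec["id"], 0)
    # Largest gain first; ties are broken alphabetically for stable output.
    return sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative snapshots (invented values, not real Hub data).
monday = [
    {"id": "org-a/model-1", "author": "org-a", "likes": 100},
    {"id": "org-b/model-2", "author": "org-b", "likes": 40},
]
friday = [
    {"id": "org-a/model-1", "author": "org-a", "likes": 110},
    {"id": "org-b/model-2", "author": "org-b", "likes": 75},
    {"id": "org-b/model-3", "author": "org-b", "likes": 5},
]
ranking = likes_gained_by_author(monday, friday)
print(ranking)
```

Scheduling the same run daily and diffing consecutive datasets this way would give a simple like-velocity leaderboard.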
⚙️ Input configuration
| Field | Type | Default | Description |
|---|---|---|---|
| `entityType` | string | `"models"` | `models` / `datasets` / `spaces` / `papers` / `collections`. |
| `search` | string | `""` | Free-text search over name + description. |
| `author` | string | `""` | Author / organization filter (`mistralai`, `meta-llama`, `Qwen`, `stabilityai`). |
| `pipelineTag` | string | `""` | Models only: task tag (`text-generation`, `automatic-speech-recognition`, `image-to-text`, etc.). |
| `library` | string | `""` | Library filter (`transformers`, `diffusers`, `sentence-transformers`, `gguf`, `mlx`, `onnx`). |
| `language` | string | `""` | Language tag (`en`, `tr`, `multilingual`). |
| `tags` | string[] | `[]` | Extra HF tag filters (`license:apache-2.0`, `base_model:meta-llama/Llama-3-8B`). |
| `sort` | string | `"downloads"` | `downloads` / `likes` / `lastModified` / `createdAt` / `trendingScore`. |
| `sortDirection` | string | `"-1"` | `-1` (descending) / `1` (ascending). |
| `maxResults` | integer | `500` | Hard cap. `0` = unlimited. |
| `fetchDetails` | boolean | `false` | Make one extra API call per item to fetch the siblings list, cardData, full license and gated/disabled flags. |
| `fetchReadme` | boolean | `false` | Requires `fetchDetails`. Pulls the raw README.md content (model / dataset / space card). |
| `minDownloads` | integer | `0` | Client-side filter: drop items below this download count. |
| `minLikes` | integer | `0` | Client-side filter: drop items below this like count. |
| `modifiedFrom` | string | `null` | Drop items last modified before this date (YYYY-MM-DD). |
| `papersStartDate` | string | `null` | Papers only: range start. Defaults to 30 days ago. |
| `papersEndDate` | string | `null` | Papers only: range end. Defaults to today. |
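The client-side filters in the table (`minDownloads`, `minLikes`, `modifiedFrom`) can be pictured as a predicate applied to each record after it is fetched. This is a sketch under assumptions, not the actor's actual code: `passes_filters` is a hypothetical name, records are assumed to carry the `downloads`, `likes` and `lastModified` output fields, and dropping records without a `lastModified` value is a choice made here for illustration.

```python
from datetime import date


def passes_filters(record, min_downloads=0, min_likes=0, modified_from=None):
    """Return True if a record survives the three client-side filters."""
    if record.get("downloads", 0) < min_downloads:
        return False
    if record.get("likes", 0) < min_likes:
        return False
    if modified_from:
        cutoff = date.fromisoformat(modified_from)
        last_modified = record.get("lastModified")
        # Assumption: records without a timestamp are dropped when a cutoff is set.
        if not last_modified:
            return False
        # The first 10 characters of an ISO-8601 timestamp are the YYYY-MM-DD date.
        if date.fromisoformat(last_modified[:10]) < cutoff:
            return False
    return True


# Illustrative records (invented values).
records = [
    {"id": "a/x", "downloads": 150000, "likes": 12, "lastModified": "2026-05-06T22:00:00.000Z"},
    {"id": "b/y", "downloads": 90000, "likes": 300, "lastModified": "2026-05-10T08:00:00.000Z"},
    {"id": "c/z", "downloads": 500000, "likes": 50, "lastModified": "2026-03-01T00:00:00.000Z"},
]
kept = [r["id"] for r in records
        if passes_filters(r, min_downloads=100000, modified_from="2026-04-18")]
print(kept)
```

Because these filters run after the API call, they reduce the size of the saved dataset but not the number of list requests made.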
📦 Output fields
Common (all entity types)
| Field | Description | Example |
|---|---|---|
| `entityType` | `model` / `dataset` / `space` / `paper` / `collection` | `"model"` |
| `id` | Full HF ID (`{author}/{name}` or paper ID) | `"Qwen/Qwen3-0.6B"` |
| `internalId` | HF's internal `_id` | `"645d.."` |
| `author` | Author / organization | `"Qwen"` |
| `name` | Short name (the part after `/`) | `"Qwen3-0.6B"` |
| `description` | Description (typically present for datasets only) | `"..."` |
| `downloads` | Total downloads (all time) | `18506640` |
| `likes` | Community likes | `1248` |
| `trendingScore` | HF trending algorithm score | `42.3` |
| `tags` | Full tag list (verbatim from HF) | `["transformers","safetensors","qwen3","..."]` |
| `languages` | Languages parsed from tags | `["en","fr"]` |
| `datasets` | Linked datasets (model only) | `["dataset:teknium/OpenHermes-2.5"]` |
| `baseModel` | Base-model lineage (model only) | `["base_model:meta-llama/Llama-3-8B"]` |
| `license` | License (parsed from tags or cardData) | `"apache-2.0"` |
| `gated` | Gating status (with `fetchDetails`) | `"manual"` / `false` |
| `private` | Private flag | `false` |
| `disabled` | Disabled flag | `false` |
| `lastModified` | Last-modified timestamp | `"2026-05-06T22:..."` |
| `createdAt` | Creation timestamp | `"2024-..."` |
| `sha` | Commit SHA | `"3866cf9..."` |
| `url` | Direct Hub URL | `"https://huggingface.co/Qwen/Qwen3-0.6B"` |
| `scrapedAt` | UTC scrape timestamp | `"2026-05-18T07:30:00Z"` |
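Several fields above (`languages`, `datasets`, `baseModel`, `license`) are described as parsed from the verbatim tag list. The actor's exact parsing rules are not documented here, so the following is only a plausible sketch: `derive_fields` is a hypothetical helper, the prefix conventions are inferred from the examples in the table, and the small `language_codes` set is an illustrative assumption standing in for a full language-code list.

```python
def derive_fields(tags, language_codes=frozenset({"en", "fr", "de", "es", "tr", "zh"})):
    """Split a verbatim HF tag list into the derived fields of the output table."""
    license_id = None
    datasets, base_models, languages = [], [], []
    for tag in tags:
        if tag.startswith("license:"):
            # Keep the first license tag and strip the prefix, as in "apache-2.0".
            if license_id is None:
                license_id = tag.split(":", 1)[1]
        elif tag.startswith("dataset:"):
            datasets.append(tag)      # prefix kept, matching the table example
        elif tag.startswith("base_model:"):
            base_models.append(tag)   # prefix kept, matching the table example
        elif tag in language_codes:
            languages.append(tag)     # bare language codes such as "en"
    return {
        "license": license_id,
        "datasets": datasets,
        "baseModel": base_models,
        "languages": languages,
    }


derived = derive_fields([
    "transformers", "safetensors", "en", "fr",
    "license:apache-2.0",
    "dataset:teknium/OpenHermes-2.5",
    "base_model:meta-llama/Llama-3-8B",
])
print(derived)
```

The same function can be applied to the `tags` field of an exported record when a derived field needs to be recomputed or extended downstream.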
Model-specific
| Field | Description |
|---|---|
| `pipelineTag` | Primary task tag |
| `libraryName` | `transformers`, `diffusers`, etc. |
| `siblings` | File list (`fetchDetails`) |
| `fileCount` | Number of files |
| `modelIndex` | Evaluation results (`fetchDetails`) |
| `cardData` | Full card data dict (`fetchDetails`) |
| `readme` | Raw README.md (`fetchReadme`) |
Space-specific
| Field | Description |
|---|---|
| `sdk` | `gradio` / `streamlit` / `docker` / `static` |
| `spaceRuntime` | Runtime status, hardware, sleep config |
Paper-specific
| Field | Description |
|---|---|
| `paperId` | arXiv-style ID |
| `paperAuthors` | List of author names |
| `paperSummary` | Abstract |
| `paperPublishedAt` | Publication date |
| `paperUpvotes` | HF Daily Papers upvotes |
🧪 Example inputs
1. Top 200 text-generation models, with details and READMEs
{"entityType": "models","pipelineTag": "text-generation","sort": "downloads","sortDirection": "-1","maxResults": 200,"fetchDetails": true,"fetchReadme": true}
2. Every Llama-3 fine-tune in the last 30 days
{"entityType": "models","search": "llama-3","tags": ["base_model:meta-llama/Llama-3-8B"],"sort": "lastModified","modifiedFrom": "2026-04-18","maxResults": 1000}
3. Top 500 datasets for instruction tuning (English)
{"entityType": "datasets","search": "instruct","language": "en","tags": ["task_categories:text-generation"],"sort": "likes","maxResults": 500}
4. Recent Daily Papers (last 14 days, top upvoted)
{"entityType": "papers","papersStartDate": "2026-05-04","papersEndDate": "2026-05-18","maxResults": 200}
5. Top 100 most-liked spaces from a specific org
{"entityType": "spaces","author": "huggingface","sort": "likes","maxResults": 100}
6. Apache-2.0 LLM models over 100k downloads, with full details
{"entityType": "models","pipelineTag": "text-generation","tags": ["license:apache-2.0"],"sort": "downloads","minDownloads": 100000,"maxResults": 500,"fetchDetails": true}
7. Turkish-language models
{"entityType": "models","language": "tr","sort": "downloads","maxResults": 200}
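Before submitting an input like the ones above, a small local check can catch the constraints stated in the input table (allowed `entityType` and `sort` values, `fetchReadme` requiring `fetchDetails`, `pipelineTag` being models-only, date format). This is a hedged sketch: `check_input` is a hypothetical helper, whether the actor itself rejects such inputs is not documented here, and the YYYY-MM-DD check on the two paper date fields is an assumption based on the examples.

```python
import re

ENTITY_TYPES = {"models", "datasets", "spaces", "papers", "collections"}
SORT_KEYS = {"downloads", "likes", "lastModified", "createdAt", "trendingScore"}
DATE_FIELDS = ("modifiedFrom", "papersStartDate", "papersEndDate")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_input(actor_input):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    entity = actor_input.get("entityType", "models")
    if entity not in ENTITY_TYPES:
        problems.append("entityType must be one of " + ", ".join(sorted(ENTITY_TYPES)))
    if actor_input.get("sort", "downloads") not in SORT_KEYS:
        problems.append("sort must be one of " + ", ".join(sorted(SORT_KEYS)))
    if str(actor_input.get("sortDirection", "-1")) not in {"-1", "1"}:
        problems.append("sortDirection must be -1 or 1")
    if actor_input.get("fetchReadme") and not actor_input.get("fetchDetails"):
        problems.append("fetchReadme requires fetchDetails")
    if actor_input.get("pipelineTag") and entity != "models":
        problems.append("pipelineTag applies to models only")
    for field in DATE_FIELDS:
        value = actor_input.get(field)
        if value is not None and not DATE_RE.match(str(value)):
            problems.append(f"{field} must be YYYY-MM-DD")
    return problems


example_1 = {
    "entityType": "models", "pipelineTag": "text-generation",
    "sort": "downloads", "sortDirection": "-1", "maxResults": 200,
    "fetchDetails": True, "fetchReadme": True,
}
bad = {
    "entityType": "datasets", "pipelineTag": "text-generation",
    "fetchReadme": True, "modifiedFrom": "18/04/2026",
}
print(check_input(example_1))
print(check_input(bad))
```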
🧠 How it works
- List endpoint → `GET https://huggingface.co/api/{models|datasets|spaces}?search=&author=&pipeline_tag=&library=&language=&filter=&sort=&direction=&limit=100&cursor=...`
- Cursor pagination: HF returns `Link: <url>; rel="next"` headers; the actor extracts the `cursor` param and walks forward until the result cap or exhaustion.
- Detail enrichment → `GET https://huggingface.co/api/{type}/{id}` returns the sibling file list, cardData, modelIndex, spaceRuntime and gated/disabled flags.
- README fetch → `GET https://huggingface.co/{id}/raw/main/README.md` (or `/datasets/{id}/...` for datasets).
- Daily Papers → `GET https://huggingface.co/api/daily_papers?date=YYYY-MM-DD` per day in the requested range; merged into a single dataset.
- Normalization: every entity is mapped to the same flat record shape so cross-entity analytics are trivial.
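The list-and-paginate steps above can be sketched without committing to a particular HTTP client. The helper names (`list_url`, `next_page`, `walk`) are hypothetical, and `fetch` is an injected callable that is assumed to return `(items, link_header)` for a URL; the actor's real implementation may differ. The cursor value in the demo header is a placeholder.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit

API_ROOT = "https://huggingface.co/api"


def list_url(entity_type, **params):
    """Build a list-endpoint URL, skipping empty filters; page size defaults to 100."""
    query = {k: v for k, v in params.items() if v not in ("", None)}
    query.setdefault("limit", 100)
    return f"{API_ROOT}/{entity_type}?{urlencode(query)}"


def next_page(link_header):
    """Return (next_url, cursor) from a Link header, or (None, None) at the end."""
    match = re.search(r'<([^>]+)>\s*;\s*rel="?next"?', link_header or "")
    if not match:
        return None, None
    url = match.group(1)
    cursor = parse_qs(urlsplit(url).query).get("cursor", [None])[0]
    return url, cursor


def walk(first_url, fetch, max_results):
    """Follow rel="next" links until the cap is reached or no next link is returned."""
    items, url = [], first_url
    while url and len(items) < max_results:
        page, link_header = fetch(url)
        items.extend(page)
        url, _ = next_page(link_header)
    return items[:max_results]


first = list_url("models", pipeline_tag="text-generation", sort="downloads",
                 direction=-1, search="")
header = '<https://huggingface.co/api/models?limit=100&cursor=abc123>; rel="next"'
print(first)
print(next_page(header))
```

In a real run, `fetch` would issue the GET request and return the parsed JSON array together with the response's `Link` header; injecting it keeps the pagination logic testable offline.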
No authentication. The Hugging Face Hub API is intentionally open.
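The detail, README and Daily Papers requests described above reduce to a few URL patterns. The sketch below builds them; the function names are hypothetical, and the `spaces/` prefix for space READMEs is an assumption based on the Hub's URL layout (the text above only states the `/datasets/` case).

```python
from datetime import date, timedelta
from urllib.parse import quote

HUB = "https://huggingface.co"


def detail_url(entity_type, repo_id):
    """URL of the per-item detail endpoint, e.g. /api/models/{id}."""
    return f"{HUB}/api/{entity_type}/{quote(repo_id, safe='/')}"


def readme_url(entity_type, repo_id):
    """URL of the raw README.md on the main branch."""
    # Models live at the root; datasets use /datasets/ (stated above);
    # the spaces/ prefix is an assumption.
    prefix = {"models": "", "datasets": "datasets/", "spaces": "spaces/"}[entity_type]
    return f"{HUB}/{prefix}{quote(repo_id, safe='/')}/raw/main/README.md"


def daily_papers_urls(start, end):
    """One Daily Papers URL per calendar day in the inclusive range."""
    day, last = date.fromisoformat(start), date.fromisoformat(end)
    urls = []
    while day <= last:
        urls.append(f"{HUB}/api/daily_papers?date={day.isoformat()}")
        day += timedelta(days=1)
    return urls


print(detail_url("models", "Qwen/Qwen3-0.6B"))
print(readme_url("datasets", "teknium/OpenHermes-2.5"))
paper_urls = daily_papers_urls("2026-05-04", "2026-05-18")
print(len(paper_urls), paper_urls[0])
```

A 14-day window such as the one in example input 4 spans 15 calendar days when both endpoints are included, so it maps to 15 Daily Papers requests under this scheme.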
🛑 Limits & notes
- Hub size is enormous. Listing all models (~1M) at the default page size takes ~10k API calls. Use filters aggressively.
- `gated:auto` / `gated:manual` items return metadata but file downloads from `siblings` require an HF token (out of scope here).
- `private` items are not returned by the public API.
- HF rate limits: ~600 req/min/IP for anonymous calls in normal conditions. The actor uses retry with backoff.
- Daily Papers archive goes back to ~2023-10. Older dates return empty arrays.
- `fetchReadme` can be heavy (each README is 5–100KB). For large runs, disable it and pull readmes only for the top items.
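The figures above allow a rough request budget for a run. This is back-of-the-envelope arithmetic, not a guarantee: `estimate_calls` and `minutes_at` are hypothetical helpers, the page size of 100 and the ~600 req/min ceiling are the approximate values quoted in this README, the README fetch is assumed to cost one extra request per item, and retries are ignored.

```python
import math


def estimate_calls(items, page_size=100, fetch_details=False, fetch_readme=False):
    """Approximate number of HTTP requests needed for a run of `items` results."""
    list_calls = math.ceil(items / page_size)
    # One detail call per item, plus one README call per item when enabled.
    per_item = int(fetch_details) + int(fetch_readme)
    return list_calls + items * per_item


def minutes_at(calls, requests_per_minute=600):
    """Lower bound on wall-clock minutes at a given request rate."""
    return calls / requests_per_minute


full_listing = estimate_calls(1_000_000)
top_200_enriched = estimate_calls(200, fetch_details=True, fetch_readme=True)
print(full_listing, round(minutes_at(full_listing), 1))
print(top_200_enriched, round(minutes_at(top_200_enriched), 2))
```

Under these assumptions, listing ~1M models costs about 10,000 requests (roughly 17 minutes at the quoted rate), while enabling `fetchDetails` on the same run would add about a million more, which is why enrichment is best reserved for filtered or capped runs.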
💰 Pricing
Billed through Apify's pay-per-event model: you pay per item saved, from $2.00 per 1,000 results. The Hugging Face API itself is free.
❓ FAQ
Does this download the actual model weights?
No — only metadata. The url and siblings fields give you direct download URLs (https://huggingface.co/{id}/resolve/main/{file}) which you can fetch downstream.
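As a sketch of that downstream step, the helper below turns a model record into download URLs using the `resolve/main` pattern quoted above. `download_urls` is a hypothetical name, and the shape of each `siblings` entry is an assumption: the code accepts either Hub-style `{"rfilename": ...}` objects or plain filename strings. The file names in the demo are illustrative.

```python
from urllib.parse import quote


def download_urls(record, revision="main"):
    """Build direct file URLs for a model record from its `siblings` list."""
    base = f"https://huggingface.co/{record['id']}/resolve/{revision}"
    urls = []
    for sibling in record.get("siblings") or []:
        # Assumption: entries are {"rfilename": "..."} dicts or plain strings.
        name = sibling["rfilename"] if isinstance(sibling, dict) else sibling
        # Keep "/" so files in subfolders map to nested paths.
        urls.append(f"{base}/{quote(name, safe='/')}")
    return urls


record = {
    "id": "Qwen/Qwen3-0.6B",
    "siblings": [{"rfilename": "config.json"}, "onnx/model file.onnx"],
}
urls = download_urls(record)
print(urls)
```

For gated items these URLs still require an HF token when fetched, as noted in the limits section.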
Can I get model evaluation benchmarks (MMLU, ARC, etc.)?
Yes — enable fetchDetails; modelIndex returns the structured evaluation results when the author has published them.
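When `modelIndex` is present, it can be flattened into one row per metric for a leaderboard. The sketch assumes the field passes through the Hub's `model-index` model card layout (entries with `results`, each holding `task`, `dataset` and `metrics`); that pass-through is an assumption, `flatten_model_index` is a hypothetical helper, and the sample values are invented.

```python
def flatten_model_index(model_index):
    """Turn a model-index structure into a flat list of metric rows."""
    rows = []
    for entry in model_index or []:
        for result in entry.get("results", []):
            dataset = result.get("dataset", {})
            for metric in result.get("metrics", []):
                rows.append({
                    "model": entry.get("name"),
                    "task": result.get("task", {}).get("type"),
                    "dataset": dataset.get("name") or dataset.get("type"),
                    "metric": metric.get("type"),
                    "value": metric.get("value"),
                })
    return rows


# Invented example following the model-index layout.
sample = [{
    "name": "demo-model",
    "results": [{
        "task": {"type": "text-generation"},
        "dataset": {"name": "MMLU", "type": "mmlu"},
        "metrics": [
            {"type": "accuracy", "value": 0.5},
            {"type": "f1", "value": 0.4},
        ],
    }],
}]
rows = flatten_model_index(sample)
print(rows)
```

Records where the author has not published results simply yield an empty list, so the function can be mapped over a whole export.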
Does it cover Hugging Face Inference Endpoints / Inference API status?
No. Those are gated/paid features. This actor is read-only Hub catalog data.
Can I scrape user / organization profiles?
Use author=<name> as a filter — every model/dataset/space from that org is enumerated. For dedicated user-page data, open an issue.
How is this different from the official huggingface_hub Python library?
The library is a thin client for the same endpoints. This actor adds: cross-entity normalization, pagination guardrails, README/sibling enrichment, dataset/collection mode, optional download/like floors, and Apify-native output (CSV/Excel/JSONL export, scheduling, webhook integrations).
Will it work for private/enterprise models?
No. Token-based scraping is out of scope. The actor targets the public catalog only.
🔗 Related actors
- `logiover/github-repository-scraper`: combine with HF authors to find their GitHub orgs
- `logiover/substack-newsletter-scraper`: track which AI newsletter authors also publish on HF
- `logiover/apple-podcasts-episode-scraper`: find AI podcasts and join with HF model author names
- `logiover/sitemap-to-url-crawler`: crawl an HF author's portfolio website for full attribution
🆘 Support
Need user profiles, organization members, or HF Inference Endpoint status? Open an issue on the actor's Apify page.
Changelog
- 2026-05-20 — Maintenance pass: reviewed the input schema and default values for a smooth one-click start, and rebuilt the Actor on the latest base image.
Last reviewed: 2026-05-20.