HuggingFace Scraper — Models, Datasets & Spaces
Export models, datasets, and Spaces from the HuggingFace Hub API — filter by task, library, or author, with a trending snapshot mode — to JSON or CSV. Richer schema than incumbents: downloads, likes, tags, license, last-modified. No login.
Developer: DevilScrapes (maintained by Community)
We do the dirty work so your dataset stays clean. 😈
$2.05 / 1,000 rows — pay only for results that land. No credit card to try.
Export structured metadata for models, datasets, and Spaces from the HuggingFace Hub. Filter by task tag, library, author, or free-text search. Trending snapshot mode included. One Actor handles all three repo types; Pydantic-validated rows land in a dataset you can download as JSON, CSV, Excel, or XML.
🎯 What this scrapes
Three repo types, one HuggingFace scraper:
- Models — downloads, likes, pipeline tag, library name, tags, safetensors parameter count, GGUF file detection, and size-category bucketing.
- Datasets — task categories, size categories, and language codes parsed from tag prefixes.
- Spaces — SDK name and runtime stage (RUNNING / SLEEPING / STOPPED) in detail mode.
Five filter modes, mutually exclusive — use at most one per run:
- Trending snapshot: leave all filters blank to capture the top-N repos by `downloads`, `likes`, `trending`, `last_modified`, or `created_at`.
- Tag filter: pass `filterTags` to restrict to repos carrying specific tags (e.g. `text-generation`).
- Search query: pass `searchQuery` for free-text search across repo metadata.
- Author: pass `author` for every public repo from one org or user (e.g. `openai`).
- Single-repo deep fetch: pass `repoId` in `owner/name` form to call only the detail endpoint and emit one richly-enriched row.
🔥 Features
- Three repo types in one Actor: `model`, `dataset`, `space`. Pick via the `repoType` selector.
- Five sort fields: `downloads`, `likes`, `trending`, `last_modified`, `created_at` (all descending).
- Five filter modes: tag list, free-text search, author/org, single-repo deep fetch, or no filter (trending snapshot).
- Optional `includeDetails` mode: calls the per-repo detail endpoint to enrich with safetensors parameter counts, GGUF file detection, and Space runtime stage.
- GGUF detection flag and derived `model_size_category` bucket (<1B / 1B-7B / 7B-13B / 13B+) ready for downstream pricing or hardware-fit dashboards.
- Dataset tag prefixes auto-parsed into structured arrays: `size_categories:`, `task_categories:`, `language:`.
- Pydantic v2 input validation: at most one filter may be set; invalid input fails fast with a clear error before any network call.
- Exponential backoff on `429` and `503`; honours `Retry-After`; max 5 attempts per endpoint call.
- Browser fingerprint impersonation via `curl-cffi`: no scraper-detectable headers leave the Actor.
- Companion to `llm-pricing-monitor` as part of the AI Stack Intelligence suite.
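The tag-prefix parsing mentioned in the feature list can be pictured with a small sketch. This is an illustration only, not the Actor's code: `parse_dataset_tags` and `PREFIX_TO_FIELD` are hypothetical names, and the sample tags are made up for the example.

```python
# Prefixes named in the feature list, mapped to the documented output fields.
PREFIX_TO_FIELD = {
    "size_categories:": "dataset_size_categories",
    "task_categories:": "dataset_task_categories",
    "language:": "dataset_languages",
}


def parse_dataset_tags(tags):
    """Strip known prefixes and group the remaining values per output field."""
    parsed = {field: [] for field in PREFIX_TO_FIELD.values()}
    for tag in tags:
        for prefix, field in PREFIX_TO_FIELD.items():
            if tag.startswith(prefix):
                parsed[field].append(tag[len(prefix):])
                break  # a tag matches at most one prefix
    return parsed


# Made-up tag list; tags without a known prefix are simply ignored here.
example = parse_dataset_tags(
    ["task_categories:translation", "language:fr", "language:en", "license:mit"]
)
print(example)
```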
💡 Use cases
- AI researcher trend tracking: pull the trending top-100 models weekly and feed a time-series dashboard tracking which model families dominate the Hub.
- Investor adoption monitoring: measure download velocity for specific model families (`pipeline_tag=text-generation` + `library_name=transformers`) to inform AI infrastructure investment theses.
- Fine-tuner catalog survey: enumerate every model under a `pipeline_tag` like `image-segmentation` to map the open-weights landscape before choosing a base model.
- Dataset discovery: filter datasets by `task_categories` and `language` to find labelled training corpora for a downstream NLP/vision/audio model.
- Hardware-fit analysis: use `safetensors_total_params` and `model_size_category` to filter models that fit a target memory budget before benchmarking.
- GGUF availability tracking: set `includeDetails=true` and filter on `has_gguf=true` to find quantized inference-ready models for llama.cpp or LM Studio.
- Space monitoring: capture which Spaces are `RUNNING` vs `SLEEPING` for a creator or topic, useful for community health dashboards.
- Content creator coverage: feed every model from a popular org like `openai` or `meta-llama` into a content pipeline for blog posts or YouTube videos.
- Track HuggingFace model downloads over time: schedule recurring runs to build a time-series of download counts and detect fast-movers before they hit mainstream coverage.
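For the recurring-run use cases, one way to spot fast-movers is to compare two exported snapshots. The sketch below assumes each snapshot is a list of rows with the documented `repo_id` and `downloads` fields; `download_deltas` and the repo IDs are hypothetical. Because `downloads` is a 30-day rolling count, a delta can be negative.

```python
def download_deltas(previous_rows, current_rows):
    """Return (repo_id, change in downloads) pairs, largest gain first."""
    previous = {
        row["repo_id"]: row["downloads"]
        for row in previous_rows
        if row.get("downloads") is not None
    }
    deltas = []
    for row in current_rows:
        before = previous.get(row["repo_id"])
        now = row.get("downloads")
        if before is None or now is None:
            continue  # new repo, or a Space with no downloads metric
        deltas.append((row["repo_id"], now - before))
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Made-up snapshots from two scheduled runs.
last_week = [
    {"repo_id": "example-org/model-a", "downloads": 1000},
    {"repo_id": "example-org/model-b", "downloads": 5000},
]
this_week = [
    {"repo_id": "example-org/model-a", "downloads": 4000},
    {"repo_id": "example-org/model-b", "downloads": 4800},
    {"repo_id": "example-org/model-c", "downloads": 50},
]
movers = download_deltas(last_week, this_week)
print(movers)
```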
⚙️ How to use it
- Open the Actor input form.
- Pick Repo type: `model`, `dataset`, or `space`.
- Optionally pick a Sort field (default `downloads`).
- Set at most one filter: `filterTags`, `searchQuery`, `author`, or `repoId`. Leave all four blank for a trending snapshot.
- Adjust Max results (default 100, maximum 5000). Ignored in `repoId` mode (always 1 row).
- Toggle Include detail enrichment on if you need safetensors parameter counts, GGUF detection, or Space runtime stage. Detail mode is charged at $0.005/row instead of $0.002/row.
- Click Start and watch the run log. Results stream into the default dataset and can be downloaded as JSON, CSV, Excel, or XML via the Export button.
What we handle for you
We absorb every failure mode you'd otherwise hit running this yourself:
- We rotate browser fingerprints: `curl-cffi` impersonates real browser TLS signatures (Chrome / Firefox) so our requests look like genuine browser traffic.
- We retry with exponential backoff on `408 / 429 / 503` and honour `Retry-After`. Up to 5 attempts per page before surfacing a partial-success status.
- We rotate proxies through Apify Proxy on blocks: fresh session ID, fresh exit IP, back on track.
- We back off when the target rate-limits: partial successes surface with a clear status message; we never silently return an empty dataset.
- We keep your dataset clean: Pydantic-validated rows, ISO-8601 timestamps, stable repo IDs, `null` for absent optional fields rather than missing keys.
- You pay only for results that land: if no data comes back, you're not charged for rows (only the small Actor-start warm-up fee).
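The retry behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Actor's implementation: the 1-second base delay, the 60-second cap, and the `fetch`/`sleep` callables are hypothetical, and only the seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```python
MAX_ATTEMPTS = 5              # "up to 5 attempts" per the list above
RETRYABLE = {408, 429, 503}   # status codes named in the list above


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return min(float(retry_after), cap)    # the server's hint wins
    return min(base * 2 ** (attempt - 1), cap)  # 1s, 2s, 4s, 8s, ...


def fetch_with_retry(fetch, sleep):
    """Call `fetch` until it returns a non-retryable status or attempts run out.

    `fetch` returns (status, retry_after_or_None, body); `sleep` takes seconds.
    """
    status = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, retry_after))
    return status, None  # retries exhausted; caller reports partial success
```

Passing `time.sleep` as `sleep` gives real waiting; injecting a recorder instead makes the schedule easy to inspect in tests.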
Quick examples
Trending top-100 models, list mode:
{"repoType": "model","sort": "downloads","maxResults": 100,"includeDetails": false}
Every transformers text-generation model with safetensors and GGUF detection:
{"repoType": "model","filterTags": ["text-generation", "transformers"],"maxResults": 500,"includeDetails": true}
Single-repo deep fetch:
{"repoType": "model","repoId": "openai/whisper-large-v3"}
HuggingFace dataset export by language:
{"repoType": "dataset","filterTags": ["language:fr"],"maxResults": 200,"includeDetails": false}
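For orientation, here is how inputs like the ones above might translate into Hub list-endpoint URLs. The `filter=`, `search=`, and `author=` parameter names come from the Input table; everything else is an assumption for illustration: `build_list_url` is a hypothetical helper, the page size of 100 is taken from the Limitations section, and sorting and pagination are left out.

```python
from urllib.parse import urlencode

HUB_API = "https://huggingface.co/api"
# Assumed mapping from repoType to the public list routes.
ROUTES = {"model": "models", "dataset": "datasets", "space": "spaces"}


def build_list_url(actor_input, page_size=100):
    """Build the first-page list URL for an Actor input (sketch only)."""
    params = {"limit": min(page_size, actor_input.get("maxResults", 100))}
    if actor_input.get("filterTags"):
        params["filter"] = ",".join(actor_input["filterTags"])  # comma-joined
    if actor_input.get("searchQuery"):
        params["search"] = actor_input["searchQuery"]
    if actor_input.get("author"):
        params["author"] = actor_input["author"]
    route = ROUTES[actor_input["repoType"]]
    return f"{HUB_API}/{route}?{urlencode(params)}"


url = build_list_url(
    {"repoType": "dataset", "filterTags": ["language:fr"], "maxResults": 200}
)
print(url)
```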
📥 Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| repoType | string | yes | — | model, dataset, or space |
| sort | string | no | downloads | One of downloads, likes, last_modified, created_at, trending |
| filterTags | array | one-of | — | Tag filter; joined with comma for HF filter= param |
| searchQuery | string | one-of | — | Free-text search via HF search= |
| author | string | one-of | — | Org or user slug via HF author= |
| repoId | string | one-of | — | Single owner/name deep-fetch; forces detail mode |
| maxResults | integer | no | 100 | Max rows emitted (1 to 5000) |
| includeDetails | boolean | no | false | Per-row detail enrichment |
| proxyConfiguration | object | no | — | Apify Proxy configuration (recommended for high-volume runs) |
At most one of filterTags, searchQuery, author, repoId may be non-null. Setting two or more raises a Pydantic validation error before any network call.
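If you build inputs programmatically, a pre-flight check can catch conflicts before a run starts. The sketch below mirrors the documented rules in plain Python; it is not the Actor's Pydantic model, and `validate_input` is a hypothetical helper.

```python
FILTER_FIELDS = ("filterTags", "searchQuery", "author", "repoId")
REPO_TYPES = {"model", "dataset", "space"}


def validate_input(actor_input):
    """Return the active filter mode, or raise ValueError on invalid input."""
    if actor_input.get("repoType") not in REPO_TYPES:
        raise ValueError("repoType must be model, dataset, or space")
    active = [name for name in FILTER_FIELDS if actor_input.get(name) is not None]
    if len(active) > 1:
        raise ValueError(f"set at most one filter, got: {', '.join(active)}")
    max_results = actor_input.get("maxResults", 100)
    if not 1 <= max_results <= 5000:
        raise ValueError("maxResults must be between 1 and 5000")
    return active[0] if active else "trending snapshot"


print(validate_input({"repoType": "model", "author": "openai"}))
```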
📤 Output
One row per repo. Type-specific fields are null for repo types where they don't apply.
| Field | Type | Description |
|---|---|---|
| repo_type | string | One of model, dataset, space |
| repo_id | string | HuggingFace repo identifier in owner/name form |
| repo_owner | string \| null | Owning org or user slug |
| repo_name | string | Last segment of repo_id |
| repo_url | string | https://huggingface.co/{repo_id} |
| downloads | integer \| null | 30-day rolling download count (always null for Spaces) |
| likes | integer | Like count |
| created_at | string | ISO 8601 creation timestamp |
| last_modified | string \| null | ISO 8601 last-modified timestamp |
| tags | array | Repo tags (may be empty) |
| gated | boolean \| null | manual → true, false → false, absent → null |
| private | boolean | Always false for public catalog entries |
| pipeline_tag | string \| null | Model pipeline tag (models only) |
| library_name | string \| null | Model library name (models only) |
| safetensors_total_params | integer \| null | Total safetensors parameter count (detail mode, models only) |
| model_size_category | string \| null | Bucket label <1B / 1B-7B / 7B-13B / 13B+ |
| has_gguf | boolean \| null | True if any sibling filename ends .gguf (detail mode) |
| dataset_size_categories | array \| null | Stripped from size_categories: tag prefix (datasets only) |
| dataset_task_categories | array \| null | Stripped from task_categories: tag prefix (datasets only) |
| dataset_languages | array \| null | Stripped from language: tag prefix (datasets only) |
| space_sdk | string \| null | Space SDK (gradio, streamlit, docker, static) |
| space_runtime_stage | string \| null | Space runtime stage (detail mode, Spaces only) |
| scraped_at | string | ISO 8601 UTC datetime this row was written |
{"repo_type": "model","repo_id": "openai/whisper-large-v3","repo_owner": "openai","repo_name": "whisper-large-v3","repo_url": "https://huggingface.co/openai/whisper-large-v3","downloads": 4932732,"likes": 5690,"created_at": "2023-11-07T18:41:14.000Z","last_modified": "2024-08-12T10:20:10.000Z","tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"],"gated": false,"private": false,"pipeline_tag": "automatic-speech-recognition","library_name": "transformers","safetensors_total_params": 1543490560,"model_size_category": "1B-7B","has_gguf": false,"dataset_size_categories": null,"dataset_task_categories": null,"dataset_languages": null,"space_sdk": null,"space_runtime_stage": null,"scraped_at": "2026-05-16T12:00:00+00:00"}
Optional fields are emitted as null when the API does not return them. Rows are never dropped for missing optional fields.
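The two derived model fields can be reproduced downstream with a few lines. This sketch assumes the bucket boundaries are lower-inclusive (exactly 1B parameters lands in `1B-7B`), which the table does not state; both function names are illustrative rather than the Actor's own.

```python
def model_size_category(total_params):
    """Map a parameter count to the documented bucket labels."""
    if total_params is None:
        return None  # list mode: no safetensors data
    if total_params < 1_000_000_000:
        return "<1B"
    if total_params < 7_000_000_000:
        return "1B-7B"
    if total_params < 13_000_000_000:
        return "7B-13B"
    return "13B+"


def has_gguf(filenames):
    """True if any repo filename ends in .gguf, as the output table describes."""
    return any(name.endswith(".gguf") for name in filenames)


# The sample row above reports 1,543,490,560 parameters.
print(model_size_category(1543490560))
```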
Export formats
After a run completes, click Export in the Apify Console to download:
- JSON: full fidelity, all fields, newline-delimited
- CSV: flat, one row per repo
- Excel: `.xlsx` via the Apify dataset converter
- XML: structured per-item
All formats are also available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
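A small helper can assemble that endpoint for scripted downloads. The sketch assumes the standard Apify API base `https://api.apify.com/v2`; `dataset_items_url` is a hypothetical name and `DATASET_ID` is a placeholder. Datasets that are not publicly readable will also need your Apify API token, which is not shown here.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"  # assumed base URL of the Apify API


def dataset_items_url(dataset_id, fmt="csv", clean=True):
    """Build the dataset-items export URL for a given format."""
    query = urlencode({"format": fmt, "clean": str(clean).lower()})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# DATASET_ID is a placeholder; copy the real ID from your run.
export_url = dataset_items_url("DATASET_ID")
print(export_url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen(export_url)`.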
💰 Pricing
Pay-Per-Event (PPE) — you pay only for what you use:
| Event | Price (USD) | When |
|---|---|---|
| actor-start | $0.05 | Once per run, at boot |
| result-row | $0.002 | Per repo row written in list mode (includeDetails=false) |
| result-row-detailed | $0.005 | Per repo row written in detail mode or repoId mode |
Example costs
| Rows scraped | Mode | Actor starts | Total cost |
|---|---|---|---|
| 100 | list | 1 | $0.25 |
| 500 | list | 1 | $1.05 |
| 1,000 | list | 1 | $2.05 |
| 1,000 | detail | 1 | $5.05 |
| 5,000 | list | 1 | $10.05 |
| 5,000 | detail | 1 | $25.05 |
Honest pricing, no fine print. Consistent with the llm-pricing-monitor companion Actor so the AI Stack Intelligence suite bills at a uniform rate.
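To estimate a run before launching it, the table reduces to one formula: start fee plus rows times the per-row price. The sketch below uses the event prices listed above; `run_cost` is a hypothetical helper, and `Decimal` avoids floating-point rounding in the totals.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")    # actor-start event
ROW_LIST = Decimal("0.002")      # result-row event
ROW_DETAILED = Decimal("0.005")  # result-row-detailed event


def run_cost(rows, detailed=False, starts=1):
    """Estimated USD cost of a run that writes `rows` rows."""
    per_row = ROW_DETAILED if detailed else ROW_LIST
    return ACTOR_START * starts + per_row * rows


print(run_cost(1000))                 # list mode, 1,000 rows
print(run_cost(1000, detailed=True))  # detail mode, 1,000 rows
```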
🚧 Limitations
- Private and gated repos are not accessible. The unauthenticated public API only returns publicly visible data.
- Rate limit: 500 requests per 5 minutes (verified 2026-05-16 via the `ratelimit-policy` header). At the default page size of 100, this allows up to ~50,000 list rows per 5-minute window (~10,000 per minute). Detail mode adds one request per row, which cuts throughput to roughly 100 enriched rows per minute.
- Spaces never have a `downloads` metric. The field is always `null` for `repo_type=space`, verified on both list and detail endpoints.
- Sparse Spaces list: `repo_owner`, `last_modified`, and `space_runtime_stage` require detail mode for Spaces.
- Safetensors and GGUF fields need detail mode. `safetensors_total_params`, `model_size_category`, and `has_gguf` are only populated when `includeDetails=true`.
- No cross-run deduplication. Re-running the same input returns the same repos with refreshed metadata. Use a downstream dedupe pass if you need uniqueness across runs.
- No model card or dataset card markdown content. Only structured metadata fields; the README body is excluded as too noisy for a structured dataset.
- No HuggingFace Inference API calls or model benchmarking. This Actor only scrapes catalog metadata, not model outputs.
- The Apify FREE tier retains run-scoped storage for 7 days only. For longer retention, export your dataset immediately or upgrade to a paid Apify plan.
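Since runs are not deduplicated against each other, a downstream pass can keep only the freshest row per repo. The sketch below assumes rows from several runs have been merged into one list and relies on the documented `repo_type`, `repo_id`, and `scraped_at` fields; `dedupe_latest` and the sample rows are hypothetical.

```python
from datetime import datetime


def _scraped(row):
    """Parse the ISO 8601 scraped_at timestamp of a row."""
    return datetime.fromisoformat(row["scraped_at"])


def dedupe_latest(rows):
    """Keep one row per (repo_type, repo_id): the most recently scraped."""
    latest = {}
    for row in rows:
        key = (row["repo_type"], row["repo_id"])
        seen = latest.get(key)
        if seen is None or _scraped(row) > _scraped(seen):
            latest[key] = row
    return list(latest.values())


# Made-up rows from two weekly runs.
merged = [
    {"repo_type": "model", "repo_id": "example-org/model-a",
     "downloads": 10, "scraped_at": "2026-05-09T12:00:00+00:00"},
    {"repo_type": "model", "repo_id": "example-org/model-a",
     "downloads": 25, "scraped_at": "2026-05-16T12:00:00+00:00"},
    {"repo_type": "dataset", "repo_id": "example-org/data-b",
     "downloads": 3, "scraped_at": "2026-05-16T12:00:00+00:00"},
]
unique = dedupe_latest(merged)
print(len(unique))
```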
Tips for best results
- Use a trending snapshot weekly to track the rapidly-evolving model leaderboard. Set up an Apify Schedule for a recurring run.
- Cap `maxResults` to what you actually need. The HF Hub has 1M+ models; setting a sensible cap keeps cost and runtime predictable.
- Use detail mode sparingly. It is 2.5x the per-row cost and roughly 4x the per-row latency. Prefer list mode for catalog snapshots; flip to detail mode only when you need safetensors, GGUF, or runtime stage.
- Combine with `llm-pricing-monitor` to correlate open-weights releases on the Hub with hosted-API price moves.
❓ FAQ
Do I need a HuggingFace account or API token?
No account and no API token are required to run this scraper. The HuggingFace Hub exposes read access to public repo metadata via a public REST API. This Actor uses that interface only — it never accesses gated content, never submits content, and never touches private repos.
What is the difference between list mode and detail mode?
List mode (includeDetails=false, $0.002/row) makes one API request per page of 100 rows and returns repo_id, owner, downloads, likes, tags, pipeline_tag, library_name, gated flag, and timestamps. Detail mode (includeDetails=true, $0.005/row) additionally calls the per-repo endpoint to fetch safetensors parameter counts, GGUF file detection, Space runtime stage, and dataset card data. Use detail mode when you need those enriched fields; otherwise list mode gives you 2.5x cheaper rows and faster throughput.
How do I export trending HuggingFace models?
Leave all four filter fields (filterTags, searchQuery, author, repoId) blank, keep the default sort=downloads, and set maxResults to the top-N you want (e.g. 100). To use the HF native "trending" score instead of download count, set sort=trending.
Can I do a HuggingFace dataset export or scrape Spaces too?
Yes. Switch the repoType input to dataset or space. The same filtering, pagination, and detail-mode features apply across all three repo types. Note that Spaces never carry a downloads count and need detail mode for repo_owner / last_modified / runtime_stage.
Why are some fields null on my rows?
HuggingFace's list endpoint returns different field sets for different repo types. Space list rows lack repo_owner, last_modified, and runtime entirely — detail mode populates them. The null values are accurate, not bugs.
What is GGUF and why detect it?
GGUF (often expanded as GPT-Generated Unified Format) is the single-file model format used by llama.cpp and by local-inference tools such as LM Studio and Ollama. GGUF files usually carry quantized weights, which is what lets many models run on consumer hardware, sometimes without a GPU. Set includeDetails=true and filter your dataset on has_gguf=true to list the open-weights models that ship a .gguf file.
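Filtering an exported dataset on these fields takes only a few lines. The sketch below assumes a JSON export has been loaded into a list of rows with the documented `has_gguf` and `safetensors_total_params` fields; `gguf_models_within` and the sample rows are hypothetical, and the parameter budget is an arbitrary example.

```python
import json

# Made-up excerpt of a detail-mode export.
EXPORT = """[
  {"repo_id": "example-org/small-gguf", "has_gguf": true, "safetensors_total_params": 3000000000},
  {"repo_id": "example-org/big-gguf", "has_gguf": true, "safetensors_total_params": 70000000000},
  {"repo_id": "example-org/no-gguf", "has_gguf": false, "safetensors_total_params": 1000000000},
  {"repo_id": "example-org/list-mode-row", "has_gguf": null, "safetensors_total_params": null}
]"""


def gguf_models_within(rows, max_params):
    """Repo IDs that ship a .gguf file and fit the parameter budget."""
    return [
        row["repo_id"]
        for row in rows
        if row.get("has_gguf") is True
        and row.get("safetensors_total_params") is not None
        and row["safetensors_total_params"] <= max_params
    ]


rows = json.loads(EXPORT)
print(gguf_models_within(rows, max_params=8_000_000_000))
```

Rows scraped in list mode carry `null` in both fields, so they drop out of the result rather than being misreported.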
Is this a HuggingFace API wrapper, and how does it differ from huggingface_hub?
The official huggingface_hub Python library is excellent for one-off queries inside your own code. This Actor is designed for the batched, scheduled, cross-author snapshot use case: large paginated exports, recurring runs on Apify Schedules, clean CSV/JSON for BI tools and dashboards — without writing or maintaining any scraping infrastructure yourself.
Is scraping the HuggingFace Hub legal?
This Actor only reads publicly visible repository metadata via the documented public API. It never bypasses authentication, never accesses gated content, and never submits content. Always verify the current Terms of Service at huggingface.co/terms-of-service and your local data-protection rules before using scraped data commercially.
💬 Your feedback
Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.