Pricing

Pay per event

HuggingFace Scraper — Models, Datasets & Spaces

Export models, datasets, and Spaces from the HuggingFace Hub API — filter by task, library, or author, with a trending snapshot mode — to JSON or CSV. Richer schema than incumbents: downloads, likes, tags, license, last-modified. No login.

Developer

DevilScrapes

Last modified: 2 months ago

We do the dirty work so your dataset stays clean. 😈

$2.05 per 1,000 list-mode rows, including the $0.05 actor-start fee ($5.05 per 1,000 rows in detail mode). Pay only for results that land. No credit card to try.

Export structured metadata for models, datasets, and Spaces from the HuggingFace Hub. Filter by task tag, library, author, or free-text search. Trending snapshot mode included. One Actor handles all three repo types; Pydantic-validated rows land in a dataset you can download as JSON, CSV, Excel, or XML.

🎯 What this scrapes

Three repo types, one HuggingFace scraper:

Models — downloads, likes, pipeline tag, library name, tags, safetensors parameter count, GGUF file detection, and size-category bucketing.
Datasets — task categories, size categories, and language codes parsed from tag prefixes.
Spaces — SDK name and runtime stage (RUNNING / SLEEPING / STOPPED) in detail mode.

Five filter modes, mutually exclusive — use at most one per run:

Trending snapshot — leave all filters blank to capture the top-N repos by downloads, likes, trending, last_modified, or created_at.
Tag filter — pass filterTags to restrict to repos carrying specific tags (e.g. text-generation).
Search query — pass searchQuery for free-text search across repo metadata.
Author — pass author for every public repo from one org or user (e.g. openai).
Single-repo deep fetch — pass repoId in owner/name form to call only the detail endpoint and emit one richly-enriched row.

🔥 Features

Three repo types in one Actor: model, dataset, space — pick via the repoType selector.
Five sort fields: downloads, likes, trending, last_modified, created_at (all descending).
Five filter modes: tag list, free-text search, author/org, single-repo deep fetch, or no filter (trending snapshot).
Optional includeDetails mode — calls the per-repo detail endpoint to enrich with safetensors parameter counts, GGUF file detection, and Space runtime stage.
GGUF detection flag and derived model_size_category bucket (<1B / 1B-7B / 7B-13B / 13B+) ready for downstream pricing or hardware-fit dashboards.
Dataset tag prefixes auto-parsed into structured arrays: size_categories:, task_categories:, language:.
Pydantic v2 input validation — at most one filter may be set; invalid input fails fast with a clear error before any network call.
Exponential backoff on 408, 429, and 503; honours Retry-After; max 5 attempts per endpoint call.
Browser fingerprint impersonation via curl-cffi — no scraper-detectable headers leave the Actor.
Companion to llm-pricing-monitor as part of the AI Stack Intelligence suite.
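
The dataset tag-prefix parsing listed above can be illustrated with a minimal sketch. It assumes tags arrive as plain strings such as language:fr; the helper name parse_dataset_tags is hypothetical and is not the Actor's actual code.

```python
from typing import Dict, List

# Prefixes named in the feature list, mapped to the documented output columns.
PREFIX_TO_FIELD = {
    "size_categories:": "dataset_size_categories",
    "task_categories:": "dataset_task_categories",
    "language:": "dataset_languages",
}


def parse_dataset_tags(tags: List[str]) -> Dict[str, List[str]]:
    """Group prefixed tags into one list per output field (hypothetical helper)."""
    parsed: Dict[str, List[str]] = {field: [] for field in PREFIX_TO_FIELD.values()}
    for tag in tags:
        for prefix, field in PREFIX_TO_FIELD.items():
            if tag.startswith(prefix):
                # Strip the prefix and keep only the value part.
                parsed[field].append(tag[len(prefix):])
                break
    return parsed


example = parse_dataset_tags(
    ["task_categories:translation", "language:fr", "language:en", "license:mit"]
)
print(example["dataset_languages"])
```

Tags without one of the three prefixes (license:mit here) are left out of the structured arrays and remain available in the raw tags field.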

💡 Use cases

AI researcher trend tracking — pull the trending top-100 models weekly and feed a time-series dashboard tracking which model families dominate the Hub.
Investor adoption monitoring — measure download velocity for specific model families (pipeline_tag=text-generation + library_name=transformers) to inform AI infrastructure investment theses.
Fine-tuner catalog survey — enumerate every model under a pipeline_tag like image-segmentation to map the open-weights landscape before choosing a base model.
Dataset discovery — filter datasets by task_categories and language to find labelled training corpora for a downstream NLP/vision/audio model.
Hardware-fit analysis — use safetensors_total_params and model_size_category to filter models that fit a target memory budget before benchmarking.
GGUF availability tracking — set includeDetails=true and filter on has_gguf=true to find quantized inference-ready models for llama.cpp or LM Studio.
Space monitoring — capture which Spaces are RUNNING vs SLEEPING for a creator or topic, useful for community health dashboards.
Content creator coverage — feed every model from a popular org like openai or meta-llama into a content pipeline for blog posts or YouTube videos.
Track HuggingFace model downloads over time — schedule recurring runs to build a time-series of download counts and detect fast-movers before they hit mainstream coverage.
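
For the time-series use cases above, a rough sketch of comparing two scheduled snapshots of the same repo might look like the following. The function name and the sample rows are hypothetical; only the field names (repo_id, downloads, scraped_at) come from the output schema. Because downloads is a 30-day rolling count, the result is the change in that rolling figure per day, not the number of new downloads.

```python
from datetime import datetime
from typing import Optional


def downloads_delta_per_day(earlier: dict, later: dict) -> Optional[float]:
    """Change in the rolling `downloads` value per day between two rows of one repo."""
    if earlier["repo_id"] != later["repo_id"]:
        raise ValueError("rows describe different repos")
    if earlier["downloads"] is None or later["downloads"] is None:
        return None  # e.g. Spaces, which never carry a downloads count
    t0 = datetime.fromisoformat(earlier["scraped_at"])
    t1 = datetime.fromisoformat(later["scraped_at"])
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("later row must be scraped after the earlier row")
    return (later["downloads"] - earlier["downloads"]) / days


# Hypothetical rows from two weekly runs.
week1 = {"repo_id": "org/model", "downloads": 1000, "scraped_at": "2026-05-09T12:00:00+00:00"}
week2 = {"repo_id": "org/model", "downloads": 1700, "scraped_at": "2026-05-16T12:00:00+00:00"}
print(downloads_delta_per_day(week1, week2))
```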

⚙️ How to use it

Open the Actor input form.
Pick Repo type — model, dataset, or space.
Optionally pick a Sort field (default downloads).
Set at most one filter: filterTags, searchQuery, author, or repoId. Leave all four blank for a trending snapshot.
Adjust Max results (default 100, maximum 5000). Ignored in repoId mode (always 1 row).
Toggle Include detail enrichment on if you need safetensors parameter counts, GGUF detection, or Space runtime stage. Detail mode is charged at $0.005/row instead of $0.002/row.
Click Start and watch the run log. Results stream into the default dataset and can be downloaded as JSON, CSV, Excel, or XML via the Export button.

What we handle for you

We absorb every failure mode you'd otherwise hit running this yourself:

We rotate browser fingerprints — curl-cffi impersonates real browser TLS signatures (Chrome / Firefox) so our requests look like genuine browser traffic.
We retry with exponential backoff on 408 / 429 / 503 and honour Retry-After. Up to 5 attempts per page before surfacing a partial-success status.
We rotate proxies through Apify Proxy on blocks — fresh session ID, fresh exit IP, back on track.
We back off when the target rate-limits — partial successes surface with a clear status message; we never silently return an empty dataset.
We keep your dataset clean — Pydantic-validated rows, ISO-8601 timestamps, stable repo IDs, null for absent optional fields rather than missing keys.
You pay only for results that land — if no data comes back, you're not charged for rows (only the small Actor-start warm-up fee).
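
The retry behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's implementation: the base delay of 1 second, the 60-second cap, and the send callable (which returns a status code and an optional Retry-After value in seconds) are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429, 503}  # statuses listed above
MAX_ATTEMPTS = 5             # attempt cap stated above


def backoff_delay(attempt: int, retry_after: Optional[float],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before the next try; a server-supplied Retry-After wins."""
    if retry_after is not None:
        return retry_after
    # Assumed schedule: 1s, 2s, 4s, 8s, capped at `cap`.
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retry(send: Callable[[], Tuple[int, Optional[float]]],
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after = send()
        if status not in RETRYABLE:
            return status
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, retry_after))
    return status  # still a retryable status: the caller reports partial success
```

Passing sleep as a parameter keeps the sketch testable without real waiting. A production version would typically add jitter and also handle Retry-After values expressed as HTTP dates.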

Quick examples

Trending top-100 models, list mode:

{
  "repoType": "model",
  "sort": "downloads",
  "maxResults": 100,
  "includeDetails": false
}

Every transformers text-generation model with safetensors and GGUF detection:

{
  "repoType": "model",
  "filterTags": ["text-generation", "transformers"],
  "maxResults": 500,
  "includeDetails": true
}

Single-repo deep fetch:

{
  "repoType": "model",
  "repoId": "openai/whisper-large-v3"
}

HuggingFace dataset export by language:

{
  "repoType": "dataset",
  "filterTags": ["language:fr"],
  "maxResults": 200,
  "includeDetails": false
}

📥 Input

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `repoType` | string | yes | none | `model`, `dataset`, or `space` |
| `sort` | string | no | `downloads` | One of `downloads`, `likes`, `last_modified`, `created_at`, `trending` |
| `filterTags` | array | one-of | none | Tag filter; joined with comma for HF `filter=` param |
| `searchQuery` | string | one-of | none | Free-text search via HF `search=` |
| `author` | string | one-of | none | Org or user slug via HF `author=` |
| `repoId` | string | one-of | none | Single `owner/name` deep-fetch; forces detail mode |
| `maxResults` | integer | no | `100` | Max rows emitted (1 to 5000) |
| `includeDetails` | boolean | no | `false` | Per-row detail enrichment |
| `proxyConfiguration` | object | no | none | Apify Proxy configuration (recommended for high-volume runs) |
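
As a rough illustration of how the filter inputs in the table could map onto a Hub list request, the sketch below builds a query URL. The https://huggingface.co/api/{models,datasets,spaces} endpoints and the filter, search, author, and limit parameters belong to the public Hub API; the function itself is hypothetical, and the translation of this Actor's sort names into the Hub's own sort keys is deliberately left out because it is not documented here.

```python
from typing import List, Optional
from urllib.parse import urlencode

API_ROOT = "https://huggingface.co/api"
PLURAL = {"model": "models", "dataset": "datasets", "space": "spaces"}


def build_list_url(repo_type: str,
                   filter_tags: Optional[List[str]] = None,
                   search_query: Optional[str] = None,
                   author: Optional[str] = None,
                   limit: int = 100) -> str:
    """Build a Hub list-endpoint URL from Actor-style inputs (hypothetical helper)."""
    params = {"limit": limit}
    if filter_tags:
        params["filter"] = ",".join(filter_tags)  # tags joined with a comma, as in the table
    if search_query:
        params["search"] = search_query
    if author:
        params["author"] = author
    return f"{API_ROOT}/{PLURAL[repo_type]}?{urlencode(params)}"


print(build_list_url("model", filter_tags=["text-generation", "transformers"]))
```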

At most one of filterTags, searchQuery, author, repoId may be non-null. Setting two or more raises a Pydantic validation error before any network call.
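
The rule above can be sketched without any dependencies. The Actor itself is described as using Pydantic v2; the dataclass below is only a standard-library stand-in to show the check, and the class name ActorInput is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

FILTER_FIELDS = ("filterTags", "searchQuery", "author", "repoId")


@dataclass
class ActorInput:
    """Stand-in for the Actor's input model (hypothetical, stdlib only)."""
    repoType: str
    filterTags: Optional[List[str]] = None
    searchQuery: Optional[str] = None
    author: Optional[str] = None
    repoId: Optional[str] = None
    maxResults: int = 100

    def __post_init__(self) -> None:
        if self.repoType not in ("model", "dataset", "space"):
            raise ValueError(f"unknown repoType: {self.repoType!r}")
        # Collect every filter that is non-null; more than one is an error.
        set_filters = [name for name in FILTER_FIELDS if getattr(self, name) is not None]
        if len(set_filters) > 1:
            raise ValueError(f"at most one filter may be set, got: {', '.join(set_filters)}")
        if not 1 <= self.maxResults <= 5000:
            raise ValueError("maxResults must be between 1 and 5000")


ok = ActorInput(repoType="model", author="openai")
```

Validation runs in the constructor, so a conflicting input such as author plus searchQuery fails before any request is made.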

📤 Output

One row per repo. Type-specific fields are null for repo types where they don't apply.

| Field | Type | Description |
|---|---|---|
| `repo_type` | string | One of `model`, `dataset`, `space` |
| `repo_id` | string | HuggingFace repo identifier in `owner/name` form |
| `repo_owner` | string \| null | Owning org or user slug |
| `repo_name` | string | Last segment of `repo_id` |
| `repo_url` | string | `https://huggingface.co/{repo_id}` |
| `downloads` | integer \| null | 30-day rolling download count (always null for Spaces) |
| `likes` | integer | Like count |
| `created_at` | string | ISO 8601 creation timestamp |
| `last_modified` | string \| null | ISO 8601 last-modified timestamp |
| `tags` | array | Repo tags (may be empty) |
| `gated` | boolean \| null | `manual` → true, false → false, absent → null |
| `private` | boolean | Always false for public catalog entries |
| `pipeline_tag` | string \| null | Model pipeline tag (models only) |
| `library_name` | string \| null | Model library name (models only) |
| `safetensors_total_params` | integer \| null | Total safetensors parameter count (detail mode, models only) |
| `model_size_category` | string \| null | Bucket label `<1B` / `1B-7B` / `7B-13B` / `13B+` |
| `has_gguf` | boolean \| null | True if any sibling filename ends `.gguf` (detail mode) |
| `dataset_size_categories` | array \| null | Stripped from `size_categories:` tag prefix (datasets only) |
| `dataset_task_categories` | array \| null | Stripped from `task_categories:` tag prefix (datasets only) |
| `dataset_languages` | array \| null | Stripped from `language:` tag prefix (datasets only) |
| `space_sdk` | string \| null | Space SDK (`gradio`, `streamlit`, `docker`, `static`) |
| `space_runtime_stage` | string \| null | Space runtime stage (detail mode, Spaces only) |
| `scraped_at` | string | ISO 8601 UTC datetime this row was written |

{
  "repo_type": "model",
  "repo_id": "openai/whisper-large-v3",
  "repo_owner": "openai",
  "repo_name": "whisper-large-v3",
  "repo_url": "https://huggingface.co/openai/whisper-large-v3",
  "downloads": 4932732,
  "likes": 5690,
  "created_at": "2023-11-07T18:41:14.000Z",
  "last_modified": "2024-08-12T10:20:10.000Z",
  "tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"],
  "gated": false,
  "private": false,
  "pipeline_tag": "automatic-speech-recognition",
  "library_name": "transformers",
  "safetensors_total_params": 1543490560,
  "model_size_category": "1B-7B",
  "has_gguf": false,
  "dataset_size_categories": null,
  "dataset_task_categories": null,
  "dataset_languages": null,
  "space_sdk": null,
  "space_runtime_stage": null,
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Optional fields are emitted as null when the API does not return them. Rows are never dropped for missing optional fields.
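
The two derived model fields in the schema can be approximated with a short sketch. The bucket labels come from the table above, but which bucket an exact boundary value (for example exactly 7B parameters) falls into is an assumption here, as are the function names.

```python
from typing import List, Optional


def model_size_category(total_params: Optional[int]) -> Optional[str]:
    """Map a parameter count to a bucket label; boundaries are assumed lower-inclusive."""
    if total_params is None:
        return None  # no safetensors metadata, e.g. list mode
    billions = total_params / 1e9
    if billions < 1:
        return "<1B"
    if billions < 7:
        return "1B-7B"
    if billions < 13:
        return "7B-13B"
    return "13B+"


def has_gguf(sibling_filenames: List[str]) -> bool:
    """True if any file in the repo listing ends with .gguf."""
    return any(name.endswith(".gguf") for name in sibling_filenames)


# Parameter count taken from the sample row above.
print(model_size_category(1543490560))
```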

Export formats

After a run completes, click Export in the Apify Console to download:

JSON: full fidelity, all fields (choose JSONL if you need newline-delimited output)
CSV: flat, one row per repo
Excel: .xlsx via the Apify dataset converter
XML: structured per-item

All formats are also available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
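
A small sketch of building that request URL follows. The path and query parameters are the ones quoted above; the https://api.apify.com/v2 prefix is taken from Apify's public API rather than from this page, and the helper name is hypothetical. Depending on how the dataset is shared, the request may also need an API token.

```python
from urllib.parse import urlencode

APIFY_API_ROOT = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the dataset-items export URL for a given dataset ID and format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{APIFY_API_ROOT}/datasets/{dataset_id}/items?{query}"


# "abc123" is a placeholder dataset ID, not a real one.
print(dataset_items_url("abc123"))
```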

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

| Event | Price (USD) | When |
|---|---|---|
| `actor-start` | $0.05 | Once per run, at boot |
| `result-row` | $0.002 | Per repo row written in list mode (`includeDetails=false`) |
| `result-row-detailed` | $0.005 | Per repo row written in detail mode or `repoId` mode |

Example costs

| Rows scraped | Mode | Actor starts | Total cost |
|---|---|---|---|
| 100 | list | 1 | $0.25 |
| 500 | list | 1 | $1.05 |
| 1,000 | list | 1 | $2.05 |
| 1,000 | detail | 1 | $5.05 |
| 5,000 | list | 1 | $10.05 |
| 5,000 | detail | 1 | $25.05 |

Honest pricing, no fine print. Consistent with the llm-pricing-monitor companion Actor so the AI Stack Intelligence suite bills at a uniform rate.
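
The example costs follow directly from the three event prices. A minimal estimator, using the published prices and a hypothetical function name, reproduces them:

```python
from decimal import Decimal

# Event prices from the pricing table.
ACTOR_START = Decimal("0.05")
ROW_LIST = Decimal("0.002")
ROW_DETAILED = Decimal("0.005")


def estimate_cost(rows: int, detailed: bool = False, runs: int = 1) -> Decimal:
    """Estimated charge in USD: one start fee per run plus a per-row fee."""
    per_row = ROW_DETAILED if detailed else ROW_LIST
    return ACTOR_START * runs + per_row * rows


print(estimate_cost(1000))                 # list mode
print(estimate_cost(1000, detailed=True))  # detail mode
```

Decimal is used instead of float so that sums of small prices stay exact.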

🚧 Limitations

Private and gated repos are not accessible. The unauthenticated public API only returns publicly visible data.
Rate limit: 500 requests per 5 minutes (verified 2026-05-16 via ratelimit-policy header). At the default page size of 100, this allows up to ~50,000 list rows per 5-minute window. Detail mode adds one request per row, which cuts throughput to roughly 495 enriched rows per window, or about 100 per minute.
Spaces never have a downloads metric. The field is always null for repo_type=space — verified on both list and detail endpoints.
Sparse Spaces list: repo_owner, last_modified, and space_runtime_stage require detail mode for Spaces.
Safetensors and GGUF fields need detail mode. safetensors_total_params, model_size_category, and has_gguf are only populated when includeDetails=true.
No cross-run deduplication. Re-running the same input returns the same repos with refreshed metadata. Use a downstream dedupe pass if you need uniqueness across runs.
No model card or dataset card markdown content. Only structured metadata fields; the README body is excluded as too noisy for a structured dataset.
No HuggingFace Inference API calls or model benchmarking. This Actor only scrapes catalog metadata, not model outputs.
The Apify FREE tier retains run-scoped storage for 7 days only. For longer retention, export your dataset immediately or upgrade to a paid Apify plan.
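
For the downstream dedupe pass mentioned above, one possible approach is to keep only the most recently scraped row per repo. This is a sketch under the assumption that rows from several runs have been merged into one list; the function name is hypothetical and the field names come from the output schema.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def dedupe_latest(rows: Iterable[dict]) -> List[dict]:
    """Keep the newest row (by scraped_at) for each (repo_type, repo_id) pair."""
    latest: Dict[Tuple[str, str], dict] = {}
    for row in rows:
        key = (row["repo_type"], row["repo_id"])
        kept = latest.get(key)
        if kept is None or (
            datetime.fromisoformat(row["scraped_at"])
            > datetime.fromisoformat(kept["scraped_at"])
        ):
            latest[key] = row
    return list(latest.values())


# Hypothetical merged rows from two runs.
merged = [
    {"repo_type": "model", "repo_id": "org/a", "likes": 10, "scraped_at": "2026-05-09T12:00:00+00:00"},
    {"repo_type": "model", "repo_id": "org/a", "likes": 12, "scraped_at": "2026-05-16T12:00:00+00:00"},
    {"repo_type": "model", "repo_id": "org/b", "likes": 3, "scraped_at": "2026-05-16T12:00:00+00:00"},
]
unique = dedupe_latest(merged)
```

The key includes repo_type because a model and a dataset can share the same owner/name identifier.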

Tips for best results

Use a trending snapshot weekly to track the rapidly-evolving model leaderboard. Set up an Apify Schedule for a recurring run.
Cap maxResults to what you actually need. The HF Hub has 1M+ models; setting a sensible cap keeps cost and runtime predictable.
Use detail mode sparingly. It is 2.5x the per-row cost and roughly 4x the per-row latency. Prefer list mode for catalog snapshots; flip to detail mode only when you need safetensors, GGUF, or runtime stage.
Combine with llm-pricing-monitor to correlate open-weights releases on the Hub with hosted-API price moves.
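
To make the cost and runtime trade-off concrete, the request budget for a run can be estimated from the 500 requests per 5 minutes limit listed under Limitations. The sketch assumes one list request per page of 100 rows plus one detail request per row in detail mode; the helper names are hypothetical, and the result is only a lower bound because it ignores latency, retries, and how the limiter counts its windows.

```python
import math

REQUESTS_PER_WINDOW = 500  # stated rate limit
PAGE_SIZE = 100            # default list page size


def requests_needed(rows: int, include_details: bool) -> int:
    """List requests for pagination, plus one detail request per row if enabled."""
    list_requests = math.ceil(rows / PAGE_SIZE)
    return list_requests + (rows if include_details else 0)


def min_windows(rows: int, include_details: bool) -> int:
    """Minimum number of 5-minute rate-limit windows the run must span."""
    return math.ceil(requests_needed(rows, include_details) / REQUESTS_PER_WINDOW)


print(requests_needed(5000, include_details=False), min_windows(5000, include_details=False))
print(requests_needed(5000, include_details=True), min_windows(5000, include_details=True))
```

Under these assumptions, a 5,000-row list run fits well inside a single window, while the same run in detail mode needs more than ten windows.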

❓ FAQ

Do I need a HuggingFace account or API token?

No account and no API token are required to run this scraper. The HuggingFace Hub exposes read access to public repo metadata via a public REST API. This Actor uses that interface only — it never accesses gated content, never submits content, and never touches private repos.

What is the difference between list mode and detail mode?

List mode (includeDetails=false, $0.002/row) makes one API request per page of 100 rows and returns repo_id, owner, downloads, likes, tags, pipeline_tag, library_name, gated flag, and timestamps. Detail mode (includeDetails=true, $0.005/row) additionally calls the per-repo endpoint to fetch safetensors parameter counts, GGUF file detection, Space runtime stage, and dataset card data. Use detail mode when you need those enriched fields; otherwise list mode gives you 2.5x cheaper rows and faster throughput.

How do I export trending HuggingFace models?

Leave all four filter fields (filterTags, searchQuery, author, repoId) blank, keep the default sort=downloads, and set maxResults to the top-N you want (e.g. 100). To use the HF native "trending" score instead of download count, set sort=trending.

Can I do a HuggingFace dataset export or scrape Spaces too?

Yes. Switch the repoType input to dataset or space. The same filtering, pagination, and detail-mode features apply across all three repo types. Note that Spaces never carry a downloads count and need detail mode for repo_owner / last_modified / runtime_stage.

Why are some fields null on my rows?

HuggingFace's list endpoint returns different field sets for different repo types. Space list rows lack repo_owner, last_modified, and runtime entirely — detail mode populates them. The null values are accurate, not bugs.

What is GGUF and why detect it?

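GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp and by tools built around it, such as LM Studio and Ollama. It is the usual way to distribute quantized weights for local inference. A repo that ships a .gguf sibling file can generally be loaded by those tools directly, often on consumer hardware without a GPU, although whether a given model fits still depends on its size and quantization level. Set includeDetails=true and filter your dataset on has_gguf=true to list the models that ship GGUF files.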

Is this a HuggingFace API wrapper? How does it differ from huggingface_hub?

The official huggingface_hub Python library is excellent for one-off queries inside your own code. This Actor is designed for the batched, scheduled, cross-author snapshot use case: large paginated exports, recurring runs on Apify Schedules, clean CSV/JSON for BI tools and dashboards — without writing or maintaining any scraping infrastructure yourself.

Is scraping the HuggingFace Hub legal?

This Actor only reads publicly visible repository metadata via the documented public API. It never bypasses authentication, never accesses gated content, and never submits content. Always verify the current Terms of Service at huggingface.co/terms-of-service and your local data-protection rules before using scraped data commercially.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.
