HuggingFace Scraper — Models, Datasets & Spaces

Export models, datasets, and Spaces from the HuggingFace Hub API — filter by task, library, or author, with a trending snapshot mode — to JSON or CSV. Richer schema than incumbents: downloads, likes, tags, license, last-modified. No login.

Developer

DevilScrapes

Maintained by Community

11 days ago

Last modified

HuggingFace Scraper — Models, Datasets & Spaces

We do the dirty work so your dataset stays clean. 😈

$2.05 / 1,000 rows — pay only for results that land. No credit card to try.

Export structured metadata for models, datasets, and Spaces from the HuggingFace Hub. Filter by task tag, library, author, or free-text search. Trending snapshot mode included. One Actor handles all three repo types; Pydantic-validated rows land in a dataset you can download as JSON, CSV, Excel, or XML.

🎯 What this scrapes

Three repo types, one HuggingFace scraper:

  1. Models — downloads, likes, pipeline tag, library name, tags, safetensors parameter count, GGUF file detection, and size-category bucketing.
  2. Datasets — task categories, size categories, and language codes parsed from tag prefixes.
  3. Spaces — SDK name and runtime stage (RUNNING / SLEEPING / STOPPED) in detail mode.

Five filter modes, mutually exclusive — use at most one per run:

  • Trending snapshot — leave all filters blank to capture the top-N repos by downloads, likes, trending, last_modified, or created_at.
  • Tag filter — pass filterTags to restrict to repos carrying specific tags (e.g. text-generation).
  • Search query — pass searchQuery for free-text search across repo metadata.
  • Author — pass author for every public repo from one org or user (e.g. openai).
  • Single-repo deep fetch — pass repoId in owner/name form to call only the detail endpoint and emit one richly-enriched row.

🔥 Features

  • Three repo types in one Actor: model, dataset, space — pick via the repoType selector.
  • Five sort fields: downloads, likes, trending, last_modified, created_at (all descending).
  • Five filter modes: tag list, free-text search, author/org, single-repo deep fetch, or no filter (trending snapshot).
  • Optional includeDetails mode — calls the per-repo detail endpoint to enrich with safetensors parameter counts, GGUF file detection, and Space runtime stage.
  • GGUF detection flag and derived model_size_category bucket (<1B / 1B-7B / 7B-13B / 13B+) ready for downstream pricing or hardware-fit dashboards.
  • Dataset tag prefixes auto-parsed into structured arrays: size_categories:, task_categories:, language:.
  • Pydantic v2 input validation — at most one filter may be set; invalid input fails fast with a clear error before any network call.
  • Exponential backoff on 429 and 503; honours Retry-After; max 5 attempts per endpoint call.
  • Browser fingerprint impersonation via curl-cffi — no scraper-detectable headers leave the Actor.
  • Companion to llm-pricing-monitor as part of the AI Stack Intelligence suite.
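
The size bucketing and GGUF flag described above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's actual code: the function names are hypothetical, and the treatment of exact boundary values (for example, whether exactly 7B parameters lands in 1B-7B or 7B-13B) is an assumption, since the listing only names the bucket labels.

```python
from typing import Optional, Sequence

# Assumed boundaries: each bucket covers counts strictly below its upper bound.
# The listing gives only the labels, not how boundary values are assigned.
_BUCKETS = [
    (1_000_000_000, "<1B"),
    (7_000_000_000, "1B-7B"),
    (13_000_000_000, "7B-13B"),
]


def model_size_category(total_params: Optional[int]) -> Optional[str]:
    """Map a safetensors parameter count to a bucket label; None stays None."""
    if total_params is None:
        return None
    for upper_bound, label in _BUCKETS:
        if total_params < upper_bound:
            return label
    return "13B+"


def has_gguf(sibling_filenames: Sequence[str]) -> bool:
    """True if any file in the repo's sibling listing ends with .gguf."""
    return any(name.endswith(".gguf") for name in sibling_filenames)


# Parameter count taken from the whisper-large-v3 sample row in the Output section.
print(model_size_category(1_543_490_560))
print(has_gguf(["config.json", "model.safetensors"]))
```

Keeping the buckets in a small table makes it easy to adjust the thresholds if your hardware tiers differ.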

💡 Use cases

  • AI researcher trend tracking — pull the trending top-100 models weekly and feed a time-series dashboard tracking which model families dominate the Hub.
  • Investor adoption monitoring — measure download velocity for specific model families (pipeline_tag=text-generation + library_name=transformers) to inform AI infrastructure investment theses.
  • Fine-tuner catalog survey — enumerate every model under a pipeline_tag like image-segmentation to map the open-weights landscape before choosing a base model.
  • Dataset discovery — filter datasets by task_categories and language to find labelled training corpora for a downstream NLP/vision/audio model.
  • Hardware-fit analysis — use safetensors_total_params and model_size_category to filter models that fit a target memory budget before benchmarking.
  • GGUF availability tracking — set includeDetails=true and filter on has_gguf=true to find quantized inference-ready models for llama.cpp or LM Studio.
  • Space monitoring — capture which Spaces are RUNNING vs SLEEPING for a creator or topic, useful for community health dashboards.
  • Content creator coverage — feed every model from a popular org like openai or meta-llama into a content pipeline for blog posts or YouTube videos.
  • Track HuggingFace model downloads over time — schedule recurring runs to build a time-series of download counts and detect fast-movers before they hit mainstream coverage.
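
For the time-series use cases above, two exported snapshots can be compared with a short script. The sketch below assumes each row is a dict with the repo_id, downloads, and scraped_at fields from the Output section; the helper names are hypothetical. Because downloads is a 30-day rolling count, the delta measures the change in that rolling window, not new downloads since the previous run.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def _scraped_at(row: dict) -> datetime:
    # scraped_at is ISO 8601 with a UTC offset, e.g. 2026-05-16T12:00:00+00:00
    return datetime.fromisoformat(row["scraped_at"])


def latest_by_repo(rows: Iterable[dict]) -> Dict[str, dict]:
    """Deduplicate rows: keep the most recently scraped row per repo_id."""
    latest: Dict[str, dict] = {}
    for row in rows:
        seen = latest.get(row["repo_id"])
        if seen is None or _scraped_at(row) > _scraped_at(seen):
            latest[row["repo_id"]] = row
    return latest


def download_deltas(previous: Iterable[dict], current: Iterable[dict]) -> List[dict]:
    """Change in the rolling download count per repo, largest increase first."""
    before = latest_by_repo(previous)
    deltas = []
    for repo_id, row in latest_by_repo(current).items():
        old = before.get(repo_id, {}).get("downloads")
        new = row.get("downloads")
        if old is None or new is None:
            continue  # Spaces (always null) and repos missing from the older snapshot
        deltas.append({"repo_id": repo_id, "delta": new - old})
    return sorted(deltas, key=lambda d: d["delta"], reverse=True)
```

The same latest_by_repo helper doubles as a downstream dedupe pass when you merge several runs into one table.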

⚙️ How to use it

  1. Open the Actor input form.
  2. Pick Repo type: model, dataset, or space.
  3. Optionally pick a Sort field (default downloads).
  4. Set at most one filter: filterTags, searchQuery, author, or repoId. Leave all four blank for a trending snapshot.
  5. Adjust Max results (default 100, maximum 5000). Ignored in repoId mode (always 1 row).
  6. Toggle Include detail enrichment on if you need safetensors parameter counts, GGUF detection, or Space runtime stage. Detail mode is charged at $0.005/row instead of $0.002/row.
  7. Click Start and watch the run log. Results stream into the default dataset and can be downloaded as JSON, CSV, Excel, or XML via the Export button.

What we handle for you

We absorb every failure mode you'd otherwise hit running this yourself:

  • We rotate browser fingerprints: curl-cffi impersonates real browser TLS signatures (Chrome / Firefox) so our requests look like genuine browser traffic.
  • We retry with exponential backoff on 408 / 429 / 503 and honour Retry-After. Up to 5 attempts per page before surfacing a partial-success status.
  • We rotate proxies through Apify Proxy on blocks — fresh session ID, fresh exit IP, back on track.
  • We back off when the target rate-limits — partial successes surface with a clear status message; we never silently return an empty dataset.
  • We keep your dataset clean — Pydantic-validated rows, ISO-8601 timestamps, stable repo IDs, null for absent optional fields rather than missing keys.
  • You pay only for results that land — if no data comes back, you're not charged for rows (only the small Actor-start warm-up fee).
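
If you are curious what the retry behaviour above looks like in code, here is a minimal sketch under stated assumptions. It is not the Actor's implementation: the function names, the 1-second base delay, and the 60-second cap are hypothetical, only the numeric (seconds) form of Retry-After is handled, and the header lookup assumes the exact key Retry-After. The send callable stands in for whatever HTTP client performs the request.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE_STATUSES = {408, 429, 503}
MAX_ATTEMPTS = 5

# (status code, response headers, body)
Response = Tuple[int, Mapping[str, str], bytes]


def retry_delay(attempt: int, retry_after: Optional[str],
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after a failed attempt (attempt numbering starts at 1)."""
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())  # honour the server's hint as given
    return min(cap, base * 2 ** (attempt - 1))  # 1s, 2s, 4s, 8s, ...


def fetch_with_retry(send: Callable[[], Response],
                     sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() up to MAX_ATTEMPTS times, backing off on retryable statuses."""
    response = send()
    for attempt in range(1, MAX_ATTEMPTS):
        if response[0] not in RETRYABLE_STATUSES:
            break
        sleep(retry_delay(attempt, response[1].get("Retry-After")))
        response = send()
    return response  # the caller decides what a final 429/503 means
```

Passing sleep as a parameter keeps the function easy to test without real waiting. A production version would usually add random jitter to the delay.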

Quick examples

Trending top-100 models, list mode:

{
  "repoType": "model",
  "sort": "downloads",
  "maxResults": 100,
  "includeDetails": false
}

Every transformers text-generation model with safetensors and GGUF detection:

{
  "repoType": "model",
  "filterTags": ["text-generation", "transformers"],
  "maxResults": 500,
  "includeDetails": true
}

Single-repo deep fetch:

{
  "repoType": "model",
  "repoId": "openai/whisper-large-v3"
}

HuggingFace dataset export by language:

{
  "repoType": "dataset",
  "filterTags": ["language:fr"],
  "maxResults": 200,
  "includeDetails": false
}
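
Any of the inputs above can also be submitted programmatically. The sketch below uses only the Python standard library and the Apify API's run-sync-get-dataset-items endpoint, which starts a run and returns its dataset items in one call. The Actor ID is a placeholder (this page does not state it), and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST that runs the Actor synchronously and returns its dataset items."""
    url = f"{API_BASE}/acts/{quote(actor_id, safe='~')}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the JSON array of result rows."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with a placeholder Actor ID in username~actor-name form:
# rows = run_actor("<username>~<actor-name>", "<APIFY_TOKEN>",
#                  {"repoType": "model", "sort": "downloads", "maxResults": 100})
```

The synchronous endpoint suits short runs. For large exports it is usually safer to start the run asynchronously and read the dataset once it finishes.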

📥 Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| repoType | string | yes | | model, dataset, or space |
| sort | string | no | downloads | One of downloads, likes, last_modified, created_at, trending |
| filterTags | array | one-of | | Tag filter; joined with comma for HF filter= param |
| searchQuery | string | one-of | | Free-text search via HF search= |
| author | string | one-of | | Org or user slug via HF author= |
| repoId | string | one-of | | Single owner/name deep-fetch; forces detail mode |
| maxResults | integer | no | 100 | Max rows emitted (1 to 5000) |
| includeDetails | boolean | no | false | Per-row detail enrichment |
| proxyConfiguration | object | no | | Apify Proxy configuration (recommended for high-volume runs) |

At most one of filterTags, searchQuery, author, repoId may be non-null. Setting two or more raises a Pydantic validation error before any network call.
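
To make the one-filter rule concrete, here is a standard-library stand-in for the validation described above. The Actor itself uses Pydantic v2; this dataclass only mirrors the rules stated in the input table, and the class name and error messages are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

REPO_TYPES = {"model", "dataset", "space"}
SORT_FIELDS = {"downloads", "likes", "last_modified", "created_at", "trending"}
FILTER_FIELDS = ("filterTags", "searchQuery", "author", "repoId")


@dataclass
class ActorInput:
    repoType: str
    sort: str = "downloads"
    filterTags: Optional[List[str]] = None
    searchQuery: Optional[str] = None
    author: Optional[str] = None
    repoId: Optional[str] = None
    maxResults: int = 100
    includeDetails: bool = False

    def __post_init__(self) -> None:
        if self.repoType not in REPO_TYPES:
            raise ValueError(f"repoType must be one of {sorted(REPO_TYPES)}")
        if self.sort not in SORT_FIELDS:
            raise ValueError(f"sort must be one of {sorted(SORT_FIELDS)}")
        if not 1 <= self.maxResults <= 5000:
            raise ValueError("maxResults must be between 1 and 5000")
        # The filters are mutually exclusive: count how many are non-null.
        active = [name for name in FILTER_FIELDS if getattr(self, name) is not None]
        if len(active) > 1:
            raise ValueError(f"set at most one filter, got: {', '.join(active)}")

    @property
    def detail_mode(self) -> bool:
        # Per the input table, repoId forces detail mode.
        return self.includeDetails or self.repoId is not None
```

For example, ActorInput(repoType="model", author="openai", searchQuery="whisper") raises a ValueError before any request would be made, while ActorInput(repoType="model") describes a plain trending snapshot.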

📤 Output

One row per repo. Type-specific fields are null for repo types where they don't apply.

| Field | Type | Description |
| --- | --- | --- |
| repo_type | string | One of model, dataset, space |
| repo_id | string | HuggingFace repo identifier in owner/name form |
| repo_owner | string \| null | Owning org or user slug |
| repo_name | string | Last segment of repo_id |
| repo_url | string | https://huggingface.co/{repo_id} |
| downloads | integer \| null | 30-day rolling download count (always null for Spaces) |
| likes | integer | Like count |
| created_at | string | ISO 8601 creation timestamp |
| last_modified | string \| null | ISO 8601 last-modified timestamp |
| tags | array | Repo tags (may be empty) |
| gated | boolean \| null | manual → true, false → false, absent → null |
| private | boolean | Always false for public catalog entries |
| pipeline_tag | string \| null | Model pipeline tag (models only) |
| library_name | string \| null | Model library name (models only) |
| safetensors_total_params | integer \| null | Total safetensors parameter count (detail mode, models only) |
| model_size_category | string \| null | Bucket label <1B / 1B-7B / 7B-13B / 13B+ |
| has_gguf | boolean \| null | True if any sibling filename ends .gguf (detail mode) |
| dataset_size_categories | array \| null | Stripped from size_categories: tag prefix (datasets only) |
| dataset_task_categories | array \| null | Stripped from task_categories: tag prefix (datasets only) |
| dataset_languages | array \| null | Stripped from language: tag prefix (datasets only) |
| space_sdk | string \| null | Space SDK (gradio, streamlit, docker, static) |
| space_runtime_stage | string \| null | Space runtime stage (detail mode, Spaces only) |
| scraped_at | string | ISO 8601 UTC datetime this row was written |

{
  "repo_type": "model",
  "repo_id": "openai/whisper-large-v3",
  "repo_owner": "openai",
  "repo_name": "whisper-large-v3",
  "repo_url": "https://huggingface.co/openai/whisper-large-v3",
  "downloads": 4932732,
  "likes": 5690,
  "created_at": "2023-11-07T18:41:14.000Z",
  "last_modified": "2024-08-12T10:20:10.000Z",
  "tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"],
  "gated": false,
  "private": false,
  "pipeline_tag": "automatic-speech-recognition",
  "library_name": "transformers",
  "safetensors_total_params": 1543490560,
  "model_size_category": "1B-7B",
  "has_gguf": false,
  "dataset_size_categories": null,
  "dataset_task_categories": null,
  "dataset_languages": null,
  "space_sdk": null,
  "space_runtime_stage": null,
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Optional fields are emitted as null when the API does not return them. Rows are never dropped for missing optional fields.
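
The three dataset_* arrays come from tag prefixes, which can be illustrated with a short sketch. The function name is hypothetical, and returning an empty list (rather than null) when no tag matches a prefix is an assumption; the Actor's exact behaviour in that case is not stated here.

```python
from typing import Dict, Iterable, List

# Tag prefix -> output field, as listed in the output table.
PREFIX_TO_FIELD = {
    "size_categories:": "dataset_size_categories",
    "task_categories:": "dataset_task_categories",
    "language:": "dataset_languages",
}


def parse_dataset_tags(tags: Iterable[str]) -> Dict[str, List[str]]:
    """Split prefixed dataset tags into one array per structured field."""
    parsed: Dict[str, List[str]] = {field: [] for field in PREFIX_TO_FIELD.values()}
    for tag in tags:
        for prefix, field in PREFIX_TO_FIELD.items():
            if tag.startswith(prefix):
                parsed[field].append(tag[len(prefix):])  # keep only the value part
                break
    return parsed


print(parse_dataset_tags(["language:fr", "task_categories:translation", "license:mit"]))
```

Tags without one of the three prefixes, such as license:mit, are left out of the structured arrays and remain available in the raw tags field.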

Export formats

After a run completes, click Export in the Apify Console to download:

  • JSON: full fidelity, all fields (choose JSONL if you need newline-delimited output)
  • CSV: flat, one row per repo
  • Excel: .xlsx via the Apify dataset converter
  • XML: structured per-item

All formats are also available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
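
As a rough illustration of that API call, the following standard-library sketch builds the export URL and parses the CSV into dicts. The function names are hypothetical, and the dataset ID and token are values you supply from your own run.

```python
import csv
import io
import urllib.request
from typing import Dict, List
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """URL for GET /datasets/{id}/items with the chosen export format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


def parse_csv(text: str) -> List[Dict[str, str]]:
    """Turn CSV text into one dict per repo row (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def download_rows(dataset_id: str, token: str, timeout: float = 60.0) -> List[Dict[str, str]]:
    """Fetch the dataset as CSV and parse it."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # utf-8-sig drops a leading byte-order mark if the export includes one
        return parse_csv(response.read().decode("utf-8-sig"))
```

CSV values arrive as strings, so cast downloads and likes to integers before doing arithmetic on them.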

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

| Event | Price (USD) | When |
| --- | --- | --- |
| actor-start | $0.05 | Once per run, at boot |
| result-row | $0.002 | Per repo row written in list mode (includeDetails=false) |
| result-row-detailed | $0.005 | Per repo row written in detail mode or repoId mode |

Example costs

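| Rows scraped | Mode | Actor starts | Total cost |
| --- | --- | --- | --- |
| 100 | list | 1 | $0.25 |
| 500 | list | 1 | $1.05 |
| 1,000 | list | 1 | $2.05 |
| 1,000 | detail | 1 | $5.05 |
| 5,000 | list | 1 | $10.05 |
| 5,000 | detail | 1 | $25.05 |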
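
The totals follow directly from the event prices: one actor-start fee plus the per-row price times the number of rows. A small sketch for estimating other run sizes (the function name is hypothetical; the prices are the ones listed in this section):

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")    # once per run
ROW_LIST = Decimal("0.002")      # per row in list mode
ROW_DETAILED = Decimal("0.005")  # per row in detail or repoId mode


def run_cost(rows: int, detailed: bool = False, starts: int = 1) -> Decimal:
    """Estimated cost in USD for a run that writes the given number of rows."""
    per_row = ROW_DETAILED if detailed else ROW_LIST
    return ACTOR_START * starts + per_row * rows


for rows, detailed in [(100, False), (1000, True), (5000, True)]:
    mode = "detail" if detailed else "list"
    print(f"{rows} rows, {mode} mode: ${run_cost(rows, detailed):.2f}")
```

Decimal is used instead of float so that sub-cent prices add up exactly.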

Honest pricing, no fine print. Consistent with the llm-pricing-monitor companion Actor so the AI Stack Intelligence suite bills at a uniform rate.

🚧 Limitations

  • Private and gated repos are not accessible. The unauthenticated public API only returns publicly visible data.
  • Rate limit: 500 requests per 5 minutes (verified 2026-05-16 via ratelimit-policy header). At the default page size of 100, this allows up to ~50,000 list rows per 5-minute window. Detail mode adds one request per row, so it is limited to roughly 495 enriched rows per window, or about 100 per minute.
  • Spaces never have a downloads metric. The field is always null for repo_type=space — verified on both list and detail endpoints.
  • Sparse Spaces list: repo_owner, last_modified, and space_runtime_stage require detail mode for Spaces.
  • Safetensors and GGUF fields need detail mode. safetensors_total_params, model_size_category, and has_gguf are only populated when includeDetails=true.
  • No cross-run deduplication. Re-running the same input returns the same repos with refreshed metadata. Use a downstream dedupe pass if you need uniqueness across runs.
  • No model card or dataset card markdown content. Only structured metadata fields; the README body is excluded as too noisy for a structured dataset.
  • No HuggingFace Inference API calls or model benchmarking. This Actor only scrapes catalog metadata, not model outputs.
  • The Apify FREE tier retains run-scoped storage for 7 days only. For longer retention, export your dataset immediately or upgrade to a paid Apify plan.
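
To plan around the rate limit, it helps to count requests rather than rows. The sketch below assumes one list request per page of 100 rows plus, in detail mode, one extra request per row, as described in this section. The function names are hypothetical, and retries and proxy rotation are ignored.

```python
import math


def requests_needed(rows: int, page_size: int = 100, detailed: bool = False) -> int:
    """API requests required to emit the given number of rows."""
    pages = math.ceil(rows / page_size)
    return pages + (rows if detailed else 0)


def max_rows_per_window(limit: int = 500, page_size: int = 100, detailed: bool = False) -> int:
    """Upper bound on rows that fit inside one rate-limit window."""
    if not detailed:
        return limit * page_size
    # A full page costs 1 list request plus page_size detail requests.
    full_pages, leftover = divmod(limit, page_size + 1)
    # Leftover requests buy one more list call and (leftover - 1) detail calls.
    return full_pages * page_size + max(0, leftover - 1)


print(requests_needed(1000, detailed=True))
print(max_rows_per_window())
print(max_rows_per_window(detailed=True))
```

Dividing requests_needed by the per-window limit gives a rough lower bound on how many 5-minute windows a large detail-mode run will span.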

Tips for best results

  • Use a trending snapshot weekly to track the rapidly-evolving model leaderboard. Set up an Apify Schedule for a recurring run.
  • Cap maxResults to what you actually need. The HF Hub has 1M+ models; setting a sensible cap keeps cost and runtime predictable.
  • Use detail mode sparingly. It is 2.5x the per-row cost and roughly 4x the per-row latency. Prefer list mode for catalog snapshots; flip to detail mode only when you need safetensors, GGUF, or runtime stage.
  • Combine with llm-pricing-monitor to correlate open-weights releases on the Hub with hosted-API price moves.

❓ FAQ

Do I need a HuggingFace account or API token?

No account and no API token are required to run this scraper. The HuggingFace Hub exposes read access to public repo metadata via a public REST API. This Actor uses that interface only — it never accesses gated content, never submits content, and never touches private repos.

What is the difference between list mode and detail mode?

List mode (includeDetails=false, $0.002/row) makes one API request per page of 100 rows and returns repo_id, owner, downloads, likes, tags, pipeline_tag, library_name, gated flag, and timestamps. Detail mode (includeDetails=true, $0.005/row) additionally calls the per-repo endpoint to fetch safetensors parameter counts, GGUF file detection, Space runtime stage, and dataset card data. Use detail mode when you need those enriched fields; otherwise list mode gives you 2.5x cheaper rows and faster throughput.

How do I export trending HuggingFace models?

Leave all four filter fields (filterTags, searchQuery, author, repoId) blank, keep the default sort=downloads, and set maxResults to the top-N you want (e.g. 100). To use the HF native "trending" score instead of download count, set sort=trending.

Can I do a HuggingFace dataset export or scrape Spaces too?

Yes. Switch the repoType input to dataset or space. The same filtering, pagination, and detail-mode features apply across all three repo types. Note that Spaces never carry a downloads count and need detail mode for repo_owner / last_modified / runtime_stage.

Why are some fields null on my rows?

HuggingFace's list endpoint returns different field sets for different repo types. Space list rows lack repo_owner, last_modified, and runtime entirely — detail mode populates them. The null values are accurate, not bugs.

What is GGUF and why detect it?

GGUF (GPT-Generated Unified Format) is the binary model file format used by llama.cpp, LM Studio, and Ollama for local inference, most often with quantized weights. When a model repo includes a .gguf sibling file, it can usually run on consumer hardware, often without a GPU, provided the quantized model fits in memory. Set includeDetails=true and filter your dataset on has_gguf=true to list the open-weights models that ship GGUF files.

Is this a HuggingFace API wrapper? How does it differ from huggingface_hub?

The official huggingface_hub Python library is excellent for one-off queries inside your own code. This Actor is designed for the batched, scheduled, cross-author snapshot use case: large paginated exports, recurring runs on Apify Schedules, clean CSV/JSON for BI tools and dashboards — without writing or maintaining any scraping infrastructure yourself.

Is scraping the HuggingFace Hub legal?

This Actor only reads publicly visible repository metadata via the documented public API. It never bypasses authentication, never accesses gated content, and never submits content. Always verify the current Terms of Service at huggingface.co/terms-of-service and your local data-protection rules before using scraped data commercially.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.