HuggingFace Hub Scraper

Export models, datasets, and Spaces from HuggingFace Hub. Filter by task, library, or author. Trending snapshot mode. No login needed. Richer schema than incumbents.

Pricing

Pay per event

Developer

DevilScrapes

Maintained by Community

We do the dirty work so your dataset stays clean. 😈

$2.05 / 1,000 rows in list mode, $5.05 / 1,000 rows in detail mode (both figures include the $0.05 actor-start fee). Export structured metadata for models, datasets, and Spaces from the HuggingFace Hub via the public REST API. One Actor handles all three repo types via a repoType selector. No login. No API key. No browser automation.

This Actor calls HuggingFace's public REST API at https://huggingface.co/api/, paginates the list endpoint, optionally enriches each row with the per-repo detail endpoint (safetensors parameter count, GGUF file detection, Space runtime stage), and emits a flat Pydantic-validated dataset ready for direct analysis in spreadsheets, BI tools, or SQL.

🎯 What this scrapes

Three repo types, one Actor:

  1. Models — every model on the Hub, with downloads, likes, pipeline tag, library name, tags, and optional safetensors parameter count + GGUF detection.
  2. Datasets — every dataset on the Hub, with task categories, size categories, and language codes parsed out of tag prefixes.
  3. Spaces — every Space on the Hub, with SDK name and optional runtime stage (RUNNING / SLEEPING / STOPPED) from detail mode.

Five filtering modes (the four filters are mutually exclusive, so at most one may be set):

  • Trending snapshot — leave all filters blank to capture the top-N trending repos by downloads, likes, trending, last_modified, or created_at.
  • Tag filter — pass filterTags to restrict to repos carrying specific tags (e.g. text-generation).
  • Search query — pass searchQuery for free-text search across repo metadata.
  • Author — pass author for every public repo from one org or user (e.g. openai).
  • Single-repo deep fetch — pass repoId in owner/name form to call only the detail endpoint and emit one richly-enriched row.

Field                     Type            Description
repo_type                 string          One of model, dataset, space
repo_id                   string          HuggingFace repo identifier in owner/name form
repo_owner                string | null   Owning org or user slug
repo_name                 string          Last segment of repo_id
repo_url                  string          https://huggingface.co/{repo_id}
downloads                 integer | null  30-day rolling download count (always null for Spaces)
likes                     integer         Like count
created_at                string          ISO 8601 creation timestamp
last_modified             string | null   ISO 8601 last-modified timestamp
tags                      array           Repo tags (may be empty)
gated                     boolean | null  manual -> true, false -> false, absent -> null
private                   boolean         Always false for public catalog entries
pipeline_tag              string | null   Model pipeline tag (models only)
library_name              string | null   Model library name (models only)
safetensors_total_params  integer | null  Total safetensors parameter count (detail mode, models only)
model_size_category       string | null   Bucket label <1B / 1B-7B / 7B-13B / 13B+
has_gguf                  boolean | null  True if any sibling filename ends .gguf (detail mode)
dataset_size_categories   array | null    Stripped from size_categories: tag prefix (datasets only)
dataset_task_categories   array | null    Stripped from task_categories: tag prefix (datasets only)
dataset_languages         array | null    Stripped from language: tag prefix (datasets only)
space_sdk                 string | null   Space SDK (gradio, streamlit, docker, static)
space_runtime_stage       string | null   Space runtime stage (detail mode, Spaces only)
scraped_at                string          ISO 8601 UTC datetime this row was written
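
The two derived model fields in the table, model_size_category and has_gguf, can be sketched as below. This is a minimal illustration, not the Actor's actual code: the function names are hypothetical, and the bucket boundaries (lower bound inclusive, upper bound exclusive) are an assumption, since the table only gives the labels.

```python
from typing import Optional, Sequence

ONE_BILLION = 1_000_000_000


def model_size_category(total_params: Optional[int]) -> Optional[str]:
    """Map a safetensors parameter count to one of the bucket labels."""
    if total_params is None:
        return None  # no parameter count available (e.g. list mode)
    # ASSUMPTION: each bucket includes its lower bound and excludes its upper bound.
    if total_params < 1 * ONE_BILLION:
        return "<1B"
    if total_params < 7 * ONE_BILLION:
        return "1B-7B"
    if total_params < 13 * ONE_BILLION:
        return "7B-13B"
    return "13B+"


def has_gguf(sibling_filenames: Optional[Sequence[str]]) -> Optional[bool]:
    """True if any sibling filename ends in .gguf; None when no file list is known."""
    if sibling_filenames is None:
        return None
    return any(name.endswith(".gguf") for name in sibling_filenames)
```

With the parameter count from the sample output row (1543490560), this sketch yields the same "1B-7B" label shown there.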

🔥 Features

  • No HuggingFace account or API token required — uses the public unauthenticated Hub API.
  • Three repo types in one Actor: model, dataset, space — pick via the repoType selector.
  • Five sort fields: downloads, likes, trending, last_modified, created_at (all descending).
  • Five filter modes: tag list, free-text search, author/org, single-repo deep fetch, or no filter at all (trending snapshot).
  • Optional includeDetails mode — calls the per-repo detail endpoint to enrich with safetensors parameter counts, GGUF file detection, and Space runtime stage.
  • GGUF detection flag and derived model_size_category bucket (<1B / 1B-7B / 7B-13B / 13B+) ready for downstream pricing or hardware-fit dashboards.
  • Dataset tag prefixes auto-parsed into structured arrays: size_categories:, task_categories:, language:.
  • Pydantic v2 input validation — at most one filter may be set; invalid input fails fast with a clear error before any network call.
  • Exponential backoff on 429 and 503; honours Retry-After; max 5 attempts; respects HuggingFace's 500-req / 5-min rate limit.
  • Pure HTTP client (curl-cffi with browser fingerprint impersonation) — no browser automation, low compute footprint.
  • Part of the AI Stack Intelligence suite, alongside llm-pricing-monitor and the planned LMSys leaderboard scraper.
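
The retry behaviour listed above (backoff on 429 and 503, Retry-After honoured, at most 5 attempts) could look roughly like the following delay calculation. It is a sketch under assumptions: the 1-second base delay, the 60-second cap, and the function name are not from this README, and only the numeric-seconds form of Retry-After is handled.

```python
from typing import Optional

RETRYABLE_STATUSES = {429, 503}
MAX_ATTEMPTS = 5


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base_seconds: float = 1.0,   # ASSUMPTION: base delay is not documented
    cap_seconds: float = 60.0,   # ASSUMPTION: cap is not documented
) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop retrying.

    `attempt` is the 1-based number of attempts already made.
    """
    if status not in RETRYABLE_STATUSES or attempt >= MAX_ATTEMPTS:
        return None
    if retry_after is not None:
        try:
            # Server-provided delay takes precedence over the computed backoff.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not parsed in this sketch
    # Exponential backoff: base, 2x base, 4x base, ... up to the cap.
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))
```

A caller would sleep for the returned number of seconds and give up when the function returns None.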

💡 Use cases

  • AI researcher trend tracking — pull the trending top-100 models weekly and feed a time-series dashboard tracking which model families dominate the Hub.
  • Investor adoption monitoring — measure download velocity for specific model families (pipeline_tag=text-generation + library_name=transformers) to inform AI infrastructure investment theses.
  • Fine-tuner catalog survey — enumerate every model under a pipeline_tag like image-segmentation to map the open-weights landscape before choosing a base model.
  • Dataset discovery — filter datasets by task_categories and language to find labelled training corpora for a downstream NLP/vision/audio model.
  • Hardware-fit analysis — use safetensors_total_params and model_size_category to filter models that fit a target memory budget before benchmarking.
  • GGUF availability tracking — set includeDetails=true and filter on has_gguf=true to find quantized inference-ready models for llama.cpp or LM Studio.
  • Space monitoring — capture which Spaces are RUNNING vs SLEEPING for a creator or topic, useful for community health dashboards.
  • Content creator coverage — feed every model from a popular org like openai or meta-llama into a content pipeline for blog posts or YouTube videos.
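
As an illustration of the hardware-fit use case, exported rows can be filtered by a rough weights-only memory estimate. The helper names are hypothetical, and the 2 bytes per parameter (fp16/bf16) figure is an assumption; activations, KV cache, and runtime overhead are ignored, so treat the result as a lower bound.

```python
from typing import Dict, List


def weights_gib(total_params: int, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the weights alone, in GiB."""
    return total_params * bytes_per_param / 2**30


def fits_budget(rows: List[Dict], budget_gib: float, bytes_per_param: float = 2.0) -> List[Dict]:
    """Keep rows whose safetensors_total_params fits the memory budget.

    Rows without a parameter count (null in list mode) are skipped,
    because nothing can be estimated for them.
    """
    kept = []
    for row in rows:
        params = row.get("safetensors_total_params")
        if params is not None and weights_gib(params, bytes_per_param) <= budget_gib:
            kept.append(row)
    return kept


rows = [
    {"repo_id": "openai/whisper-large-v3", "safetensors_total_params": 1543490560},
    {"repo_id": "example/seventy-b", "safetensors_total_params": 70_000_000_000},  # hypothetical repo
    {"repo_id": "example/unknown", "safetensors_total_params": None},              # hypothetical repo
]
small_enough = fits_budget(rows, budget_gib=8.0)
```

Passing a smaller bytes_per_param (for example 0.5 for 4-bit quantization) gives a comparable estimate for quantized weights.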

⚙️ How to use it

  1. Open the Actor input form.
  2. Pick Repo type: model, dataset, or space.
  3. Optionally pick a Sort field (default downloads).
  4. Set at most one filter: filterTags, searchQuery, author, or repoId. Leave all four blank for a trending snapshot.
  5. Adjust Max results (default 100, maximum 5000). Ignored in repoId mode (always 1 row).
  6. Toggle Include detail enrichment on if you want safetensors parameter counts, GGUF detection, or Space runtime stage. Detail mode is charged at $0.005/row instead of $0.002/row.
  7. Leave Use Apify Proxy off unless you are behind a restrictive ISP — the HuggingFace API does not block datacenter IPs.
  8. Click Start and watch the run log. Results stream into the default dataset and can be downloaded as JSON, CSV, Excel, or XML via the Export button.

Quick examples

Trending top-100 models, list mode:

{
  "repoType": "model",
  "sort": "downloads",
  "maxResults": 100,
  "includeDetails": false
}

Every transformers text-generation model, with safetensors and GGUF detection:

{
  "repoType": "model",
  "filterTags": ["text-generation", "transformers"],
  "maxResults": 500,
  "includeDetails": true
}

Single-repo deep fetch:

{
  "repoType": "model",
  "repoId": "openai/whisper-large-v3"
}
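
To make the examples more concrete, here is a rough sketch of how such an input might translate into a first list-endpoint URL. The filter=, search=, author= and full=true parameters are the ones this README mentions; the limit parameter, the function name, and the exact construction are assumptions. The sort parameter is deliberately left out, because the README does not say how the Actor's sort names map to API values.

```python
from urllib.parse import urlencode

API_ROOT = "https://huggingface.co/api/"
ENDPOINTS = {"model": "models", "dataset": "datasets", "space": "spaces"}


def build_list_url(actor_input: dict, page_size: int = 100) -> str:
    """Build the first list-endpoint URL for a given Actor input (illustrative only)."""
    if actor_input.get("repoId") is not None:
        # Single-repo mode calls only the detail endpoint, so there is no list URL.
        raise ValueError("repoId mode does not use the list endpoint")
    params = {
        # ASSUMPTION: page size is passed as `limit`, capped by maxResults.
        "limit": min(page_size, actor_input.get("maxResults", 100)),
        "full": "true",
    }
    if actor_input.get("filterTags"):
        params["filter"] = ",".join(actor_input["filterTags"])
    if actor_input.get("searchQuery"):
        params["search"] = actor_input["searchQuery"]
    if actor_input.get("author"):
        params["author"] = actor_input["author"]
    endpoint = ENDPOINTS[actor_input["repoType"]]
    return f"{API_ROOT}{endpoint}?{urlencode(params)}"


url = build_list_url(
    {"repoType": "model", "filterTags": ["text-generation", "transformers"], "maxResults": 500}
)
```

Note that urlencode percent-encodes the comma between tags, which is equivalent to a literal comma once the server decodes the query string.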

📥 Input

Field            Type     Required  Default    Description
repoType         string   yes       -          model, dataset, or space
sort             string   no        downloads  One of downloads, likes, last_modified, created_at, trending
filterTags       array    one-of    -          Tag filter; joined with comma for HF filter= param
searchQuery      string   one-of    -          Free-text search via HF search=
author           string   one-of    -          Org or user slug via HF author=
repoId           string   one-of    -          Single owner/name deep-fetch; forces detail mode
maxResults       integer  no        100        Max rows emitted (1 to 5000)
includeDetails   boolean  no        false      Per-row detail enrichment
useProxy         boolean  no        false      Route via Apify Proxy (BUYPROXIES94952)

At most one of filterTags, searchQuery, author, repoId may be non-null. Setting two or more raises a Pydantic validation error before any network call.
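
The at-most-one-filter rule can be mirrored in a few lines of plain Python. The Actor itself is described as using Pydantic v2 for this; the stdlib-only function below is a stand-in for illustration, with a hypothetical name and error messages.

```python
FILTER_FIELDS = ("filterTags", "searchQuery", "author", "repoId")
REPO_TYPES = {"model", "dataset", "space"}


def validate_input(actor_input: dict) -> dict:
    """Raise ValueError on invalid input; return the input unchanged otherwise."""
    if actor_input.get("repoType") not in REPO_TYPES:
        raise ValueError("repoType must be one of model, dataset, space")

    # A filter counts as set when it is non-null, matching the rule above.
    set_filters = [name for name in FILTER_FIELDS if actor_input.get(name) is not None]
    if len(set_filters) > 1:
        raise ValueError(f"at most one filter may be set, got: {', '.join(set_filters)}")

    max_results = actor_input.get("maxResults", 100)
    if not 1 <= max_results <= 5000:
        raise ValueError("maxResults must be between 1 and 5000")
    return actor_input
```

Running a check like this before any network call is what lets invalid input fail fast and at no cost in API requests.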

📤 Output

One row per repo. All fields documented in the field table above; type-specific fields are null for repo types where they don't apply.

{
  "repo_type": "model",
  "repo_id": "openai/whisper-large-v3",
  "repo_owner": "openai",
  "repo_name": "whisper-large-v3",
  "repo_url": "https://huggingface.co/openai/whisper-large-v3",
  "downloads": 4932732,
  "likes": 5690,
  "created_at": "2023-11-07T18:41:14.000Z",
  "last_modified": "2024-08-12T10:20:10.000Z",
  "tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"],
  "gated": false,
  "private": false,
  "pipeline_tag": "automatic-speech-recognition",
  "library_name": "transformers",
  "safetensors_total_params": 1543490560,
  "model_size_category": "1B-7B",
  "has_gguf": false,
  "dataset_size_categories": null,
  "dataset_task_categories": null,
  "dataset_languages": null,
  "space_sdk": null,
  "space_runtime_stage": null,
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Optional fields (repo_owner, downloads, last_modified, gated, pipeline_tag, library_name, safetensors_total_params, model_size_category, has_gguf, the dataset_* arrays, space_sdk, space_runtime_stage) are emitted as null when the API does not return them. Rows are never dropped for missing optional fields.
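
For dataset rows, the three dataset_* arrays are described as being stripped out of prefixed tags. A minimal sketch of that parsing follows; the function name is hypothetical, and returning an empty list (rather than null) when a dataset has no matching tag is an assumption.

```python
from typing import Dict, List

PREFIX_TO_FIELD = {
    "size_categories:": "dataset_size_categories",
    "task_categories:": "dataset_task_categories",
    "language:": "dataset_languages",
}


def parse_dataset_tags(tags: List[str]) -> Dict[str, List[str]]:
    """Split prefixed dataset tags into the three structured arrays."""
    # ASSUMPTION: a dataset with no matching tags gets empty lists, not null.
    result: Dict[str, List[str]] = {field: [] for field in PREFIX_TO_FIELD.values()}
    for tag in tags:
        for prefix, field in PREFIX_TO_FIELD.items():
            if tag.startswith(prefix):
                result[field].append(tag[len(prefix):])  # keep only the value part
                break
    return result


parsed = parse_dataset_tags(
    ["task_categories:question-answering", "language:en", "size_categories:10K<n<100K", "license:mit"]
)
```

Tags with other prefixes, such as license:, are left untouched in the tags array and simply ignored here.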

Export formats

After a run completes, click Export in the Apify Console to download:

  • JSON — full fidelity, all fields, newline-delimited
  • CSV — flat, one row per repo
  • Excel (.xlsx), via the Apify dataset converter
  • XML — structured per-item

All formats are available via the Apify API: GET /datasets/{id}/items?format=csv&clean=true.
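
A small helper for building that export URL might look like this. The https://api.apify.com/v2 base is Apify's public API root; the function name and the dataset ID in the example are placeholders, and depending on the dataset's access settings a token may also be needed.

```python
from urllib.parse import urlencode

APIFY_API_ROOT = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the dataset-items export URL for a finished run."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{APIFY_API_ROOT}/datasets/{dataset_id}/items?{query}"


# "abc123" is a placeholder dataset ID, not a real one.
csv_url = dataset_items_url("abc123")
```

The same helper covers the other formats by changing fmt, for example fmt="json" or fmt="xlsx".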

💰 Pricing

Pay-Per-Event (PPE) — you pay only for what you use:

Event                Price (USD)  When
actor-start          $0.05        Once per run, at boot
result-row           $0.002       Per repo row written in list mode (includeDetails=false)
result-row-detailed  $0.005       Per repo row written in detail mode or repoId mode

Example costs

Rows scraped  Mode    Actor starts  Total cost
100           list    1             $0.25
500           list    1             $1.05
1,000         list    1             $2.05
1,000         detail  1             $5.05
5,000         list    1             $10.05
5,000         detail  1             $25.05
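
The example costs follow directly from the event prices: one actor-start fee per run plus a per-row charge. A short estimator, using Decimal to avoid floating-point rounding on currency (the function name is illustrative):

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")    # charged once per run
ROW_LIST = Decimal("0.002")      # result-row, list mode
ROW_DETAILED = Decimal("0.005")  # result-row-detailed, detail or repoId mode


def estimate_cost(rows: int, detail: bool = False, starts: int = 1) -> Decimal:
    """Estimated run cost in USD for a given number of rows."""
    per_row = ROW_DETAILED if detail else ROW_LIST
    return ACTOR_START * starts + per_row * rows
```

For instance, estimate_cost(1000) covers 1,000 list-mode rows in a single run, and estimate_cost(1000, detail=True) the same rows in detail mode.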

This rate is consistent with the llm-pricing-monitor companion Actor so the AI Stack Intelligence suite has uniform pricing.

🚧 Limitations

  • Private and gated repos are not accessible. The unauthenticated public API only returns publicly visible data.
  • Rate limit: 500 requests per 5 minutes (verified 2026-05-16 via ratelimit-policy header). At the default page size of 100, this allows ~50,000 list rows per 5-minute window (~10,000 rows/minute). Detail mode adds 1 request per row, which cuts throughput to roughly 495 enriched rows per window (~100 rows/minute).
  • Spaces never have a downloads metric. The field is always null for repo_type=space — verified both list and detail endpoints.
  • Sparse Spaces list: repo_owner, last_modified, and space_runtime_stage require detail mode for Spaces.
  • Sparse model list without detail mode: while full=true is always sent, safetensors_total_params, model_size_category, and has_gguf are only populated in detail mode.
  • No cross-run deduplication. Re-running the same input returns the same repos with refreshed metadata. Use a downstream dedupe pass if you need uniqueness across runs.
  • No model card or dataset card markdown content. Only structured metadata fields; the README body is excluded as too noisy for a structured dataset.
  • No HuggingFace Inference API calls or model benchmarking. This Actor only scrapes catalog metadata, not model outputs.
  • The Apify FREE tier retains run-scoped storage for 7 days only. For longer retention, export your dataset immediately or upgrade to a paid Apify plan.
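
The request budget behind the rate-limit bullet can be worked out explicitly. The sketch below assumes the stated limit of 500 requests per 5-minute window, a page size of 100, one list request per page, and one extra request per row in detail mode; the function names are illustrative.

```python
import math

REQUESTS_PER_WINDOW = 500  # per 5-minute window, as stated above
PAGE_SIZE = 100            # default list page size


def requests_needed(rows: int, detail: bool = False) -> int:
    """Total API requests to fetch `rows` rows."""
    list_requests = math.ceil(rows / PAGE_SIZE)
    return list_requests + (rows if detail else 0)


def max_rows_per_window(detail: bool = False) -> int:
    """Most rows that fit into one rate-limit window."""
    if not detail:
        return REQUESTS_PER_WINDOW * PAGE_SIZE
    # Each full page costs 1 list request plus PAGE_SIZE detail requests.
    full_pages, leftover = divmod(REQUESTS_PER_WINDOW, PAGE_SIZE + 1)
    # Leftover requests pay for one more list request, then one row each.
    return full_pages * PAGE_SIZE + max(0, leftover - 1)
```

Dividing requests_needed(rows, detail) by the window budget gives a quick lower bound on how many 5-minute windows a large run will span.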

Tips for best results

  • Use a trending snapshot weekly to track the rapidly-evolving model leaderboard. Set up an Apify Schedule for a recurring run.
  • Cap maxResults to what you actually need. The HF Hub has 1M+ models; setting a sensible cap keeps cost and runtime predictable.
  • Use detail mode sparingly. It is 2.5x the per-row cost and 4x the per-row latency. Prefer list mode for catalog snapshots; flip to detail mode only when you need safetensors / GGUF / runtime.
  • Combine with llm-pricing-monitor as the AI Stack Intelligence suite to correlate open-weights releases with hosted-API price moves.

Integrations

This Actor works natively with the Apify platform's built-in connectors:

  • Apify API — trigger runs programmatically, poll for status, and fetch dataset items via REST. Full OpenAPI spec at https://docs.apify.com/api/v2.
  • Webhooks — configure a webhook to POST the run result to your endpoint as soon as the Actor finishes.
  • Apify Schedules — run this Actor on a cron schedule to keep a trending leaderboard dataset fresh.
  • Make (formerly Integromat) — use the Apify Make module to trigger runs and route results to Google Sheets, Airtable, Slack, or anywhere Make connects.
  • Zapier — Apify's Zapier integration triggers on run completion and passes dataset items downstream.
  • n8n — use the HTTP Request node with the Apify REST API for fully self-hosted automation pipelines.

❓ FAQ

Do I need a HuggingFace account?

No. The HuggingFace Hub public API at huggingface.co/api/ is unauthenticated by design for read access to public repos. No account, no API token, no rate-limit credentials needed.

What is the difference between list mode and detail mode?

List mode (includeDetails=false, $0.002/row) makes one API request per page of 100 rows. You get repo_id, owner, downloads, likes, tags, pipeline_tag, library_name, gated flag, and timestamps. Detail mode (includeDetails=true, $0.005/row) additionally makes one request per row to fetch safetensors parameter counts, GGUF file detection, Space runtime stage, and dataset cardData. Use detail mode when you need any of those enriched fields; otherwise stick to list mode for 2.5x cheaper rows and faster throughput.

How do I get trending models?

Leave all four filter fields (filterTags, searchQuery, author, repoId) blank, keep the default sort=downloads, and set maxResults to the top-N you want (e.g. 100). The Actor returns the most-downloaded models on the Hub, descending. To use the HF "trending" score instead, set sort=trending.

Can I scrape datasets and Spaces too?

Yes. Switch the repoType input to dataset or space and run again. The same filtering, pagination, and detail-mode features apply across all three repo types. Note that Spaces never have a downloads count and require detail mode for repo_owner / last_modified / runtime_stage.

Why are some fields null on my rows?

HuggingFace's list endpoint returns different field sets for different repo types. For example, model list rows lack repo_owner unless full=true is sent (this Actor always sends it). Space list rows lack repo_owner, last_modified, and runtime entirely — you must use detail mode to populate them. Dataset list rows are the richest by default. The null values are accurate, not bugs.

What is GGUF and why detect it?

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp, LM Studio, and Ollama for local inference, most often with quantized weights. When a model repo includes a .gguf sibling file, the model can typically be loaded by those tools on consumer hardware, in many cases without a GPU. Set includeDetails=true and filter your dataset on has_gguf=true to find inference-ready open-weights models.

Why is useProxy off by default?

The HuggingFace Hub API does not block datacenter IPs, so direct routing is faster and free. Enable proxy only if you are behind a restrictive ISP or firewall.

Is scraping the HuggingFace Hub legal?

The HuggingFace public API at huggingface.co/api/ is unauthenticated and explicitly designed for programmatic access to public repo metadata. The HuggingFace Terms of Service permit accessing public data via the API. This Actor never bypasses authentication, never accesses gated content, and never submits content — it only reads public repository metadata. Always verify the current Terms of Service at huggingface.co/terms-of-service and your local jurisdiction's data-protection rules before using scraped data for commercial purposes.

Related Actors

  • llm-pricing-monitor (planned): companion in the AI Stack Intelligence suite. Tracks per-token pricing for hosted LLM APIs (OpenAI, Anthropic, Google, Mistral) so you can correlate open-weights model releases on the HF Hub with hosted-API price moves.

💬 Your feedback

Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at apify.com/DevilScrapes. We ship updates within days of validated reports.