Pricing

from $30.00 / 1,000 datasets

Hugging Face Datasets Catalog — ML Training Data Intel

Hugging Face dataset registry: downloads, likes, last_modified, task_categories, language, size_categories, license, tags, author. Filter by task/language/size. Sort by downloads/likes/trending/modified. ML researchers, MLOps, AI compliance.

Rating

0.0

(0)

Developer

NexGenData

Last modified

4 days ago

Hugging Face Datasets Catalog — ML Training Data Intel in One Flat Row

Pull a structured, filterable, sortable catalog of the Hugging Face Hub's dataset registry, the single largest open inventory of ML training and evaluation data on the public internet. Each dataset emits one flat row with id, author, total downloads, likes, last_modified, created_at, task_categories (the canonical Hugging Face task taxonomy: text-classification, automatic-speech-recognition, image-segmentation, reinforcement-learning, etc.), language (ISO 639 codes), size_categories bucket (n<1K, 1K<n<10K, ..., 100M<n<1B, ..., n>1T), license (SPDX-style identifier: apache-2.0, mit, cc-by-4.0, cc-by-nc-4.0, cc-by-sa-4.0, mit-0, bsd-3-clause, gpl-3.0, plus the catch-alls other and unknown), full tags[] array, gated / disabled / private flags, a 600-char description excerpt, and the canonical huggingface.co/datasets/{id} URL. Built for ML researchers picking training corpora, MLOps teams auditing dataset lineage for model cards, dataset curators tracking who's downloading what, AI compliance / legal teams reviewing license exposure before fine-tuning, foundation-model builders sweeping the hub for new pretraining data, and academic groups building benchmarks and evaluation suites.

Price: $0.00005 per actor start + $0.03 per dataset row (proposal, awaiting approval). A 10-dataset smoke test costs about $0.30, a 50-dataset weekly trending snapshot about $1.50, a 200-dataset daily ecosystem sweep about $6.00, and a 1,000-dataset full archive about $30.00.

Why this exists

The Hugging Face Hub hosts over 300,000 datasets as of mid-2026: public corpora spanning canonical NLP benchmarks (GLUE, SuperGLUE, SQuAD, MMLU, HellaSwag, TruthfulQA, HumanEval), pretraining-scale archives (The Pile, RedPajama, FineWeb, C4, The Stack, mC4, OSCAR), multilingual coverage (FLORES-200, WMT, XNLI), audio (LibriSpeech, Common Voice, FLEURS), vision (ImageNet, COCO, LAION-5B, DataComp-1B), code (CodeSearchNet, StarCoder, MBPP), instruction-tuning (Alpaca, ShareGPT, OpenHermes, Tulu), preference (HH-RLHF, UltraFeedback, Nectar), and a long tail of community-uploaded niche corpora. The Hub web UI is excellent for browsing one dataset at a time (every page renders the card, the splits, the example viewer, the trending downloads chart), but it is not built for the bulk-discovery question that actually drives an ML researcher's workflow:

"Give me every English text-classification dataset above 100K rows, Apache-2.0-licensed, ranked by downloads — as a CSV I can paste into Notion."

There is no "export the dataset registry as CSV" button. The huggingface_hub Python library has list_datasets() but returns Python objects, not flat row-shaped output that drops into Snowflake, BigQuery, Postgres, Looker, Tableau, or a notebook. The public REST API (https://huggingface.co/api/datasets) is real, documented, requires no auth, and accepts ?sort=, ?filter=, ?search=, &limit=, &offset= params — but using it from a spreadsheet means writing your own client, your own pagination loop, your own normalization of the cardData blob vs the flat tags[] array (the same field — say, language — appears in both with slightly different shapes).

This actor does the join for you and emits one flat row per dataset — drop-in for Snowflake, BigQuery, Postgres, Looker, Tableau, an Excel pivot, a Slack daily digest, or a Claude / GPT prompt that ranks the corpus landscape.
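For comparison, the do-it-yourself route starts with building the list-endpoint URL by hand. The sketch below only assembles the query string from the parameters named above; it makes no request. The helper name `build_list_url` is hypothetical, and sending several facets as repeated `filter=` parameters is an assumption about how the endpoint combines them.

```python
from urllib.parse import urlencode

API_URL = "https://huggingface.co/api/datasets"


def build_list_url(task="", language="", size="", sort="downloads",
                   search="", limit=100):
    """Assemble a list-endpoint URL (hypothetical helper, no network I/O)."""
    params = []
    # Assumption: each facet is sent as its own repeated `filter=` parameter.
    if task:
        params.append(("filter", f"task_categories:{task}"))
    if language:
        params.append(("filter", f"language:{language}"))
    if size:
        params.append(("filter", f"size_categories:{size}"))
    if search:
        params.append(("search", search))
    params.append(("sort", sort))
    params.append(("limit", str(limit)))
    # urlencode percent-encodes ":" and "<" so the bucket names survive intact.
    return f"{API_URL}?{urlencode(params)}"


url = build_list_url(task="text-classification", language="en")
print(url)
```

Pagination, retries, and the cardData / tags[] normalization would still have to be written on top of this.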

No Hugging Face token. No login. No huggingface_hub.login() boilerplate. Public-only data, polite User-Agent (NexGenData scrapers@thenextgennexus.com), rate-limited to stay well under the documented unauthenticated quota (~500 requests per 5 minutes).

What you get — schema

Each dataset row contains:

Field	Type	Source	Notes
`id`	string	HF list	Canonical hub id — either `name` (legacy canonical) or `org/name`. Pass this to `datasets.load_dataset(id)`.
`pretty_name`	string	`cardData.pretty_name`	Human-readable display name from the dataset card front-matter (often title-case with spaces).
`author`	string	derived	The hub org/user that uploaded the dataset (split from `org/name`, or `cardData.author`).
`downloads`	int	HF list	All-time `datasets.load_dataset()` + direct-file download count.
`likes`	int	HF list	Community upvotes — better quality proxy than raw downloads for non-pretraining use.
`last_modified`	ISO8601	HF list	Last commit timestamp on the dataset repo.
`created_at`	ISO8601	HF list	Initial upload timestamp.
`task_categories`	list	`cardData.task_categories` + `tags[]`	Canonical HF task taxonomy slugs (`text-classification`, `summarization`, `automatic-speech-recognition`, `text-to-image`, …).
`language`	list	`cardData.language` + `tags[]`	ISO 639-1 / 639-3 codes (`en`, `fr`, `de`, `zh`, `ja`), plus the non-ISO values `multilingual` and `code`.
`size_categories`	string	`cardData.size_categories` + `tags[]`	HF size bucket: `n<1K`, `1K<n<10K`, `10K<n<100K`, `100K<n<1M`, `1M<n<10M`, `10M<n<100M`, `100M<n<1B`, `1B<n<10B`, `10B<n<100B`, `100B<n<1T`, `n>1T`.
`license`	string	`cardData.license` + `tags[]`	SPDX-style license identifier (`apache-2.0`, `mit`, `cc-by-4.0`, `cc-by-nc-4.0`, `cc-by-sa-4.0`, `cc0-1.0`, `bsd-3-clause`, `gpl-3.0`, `lgpl-3.0`, `mpl-2.0`), plus the catch-all values `unknown` and `other`.
`tags`	list	HF list	Full raw tags array — `task_categories:...`, `language:...`, `license:...`, `size_categories:...`, `modality:...`, `format:...`, `library:...`, `region:...`, `arxiv:...`, custom user tags.
`gated`	bool	HF list	True if the dataset requires accepting a gating agreement before download.
`disabled`	bool	HF list	True if the dataset is currently disabled by the hub (DMCA, TOS, etc.).
`private`	bool	HF list	True if private — should be `false` on the public listing.
`description`	string	HF list	First 600 chars of the dataset card description (markdown stripped to whitespace).
`dataset_url`	string	derived	`https://huggingface.co/datasets/{id}` — direct link for QA / spot-check.
`data_source`	string	—	`huggingface.co/api/datasets`.

Input

Field	Type	Default	What it does
`task`	string	`""`	Single HF task slug filter (`text-classification`, `question-answering`, `image-classification`, `automatic-speech-recognition`, …). Passed to the API as `filter=task_categories:{slug}`.
`language`	string	`""`	Single ISO 639 language code filter (`en`, `fr`, `zh`, `multilingual`, …). Passed as `filter=language:{code}`.
`size`	enum	`""`	Single size bucket filter (`n<1K`, `1K<n<10K`, ..., `100M<n<1B`, `n>1T`). Passed as `filter=size_categories:{bucket}`.
`sort`	enum	`downloads`	`downloads`, `likes`, `trending`, `lastModified`, `createdAt`.
`search`	string	`""`	Free-text search on `id` + `description`.
`maxResults`	int 1–1000	50	Hard cap on rows emitted.
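As an illustration of the input table above, the following sketch builds a run-input object and checks the two constraints the table states (the `sort` enum and the 1–1000 range for `maxResults`). The function name `make_run_input` is hypothetical; the actor's own validation may differ.

```python
ALLOWED_SORTS = {"downloads", "likes", "trending", "lastModified", "createdAt"}


def make_run_input(task="", language="", size="", sort="downloads",
                   search="", max_results=50):
    """Build an input dict using the field names and defaults from the table."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"sort must be one of {sorted(ALLOWED_SORTS)}")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    return {
        "task": task,
        "language": language,
        "size": size,
        "sort": sort,
        "search": search,
        "maxResults": max_results,
    }


run_input = make_run_input(task="text-classification", language="en",
                           size="1M<n<10M")
print(run_input)
```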

Plain-English Hugging Face dataset mechanics

If you're a buyer (compliance lead, journalist, product manager) without an ML background, here are the platform quirks that matter:

task_categories is the headline taxonomy. Every well-maintained dataset declares one or more task_categories in its dataset-card front-matter — the canonical answer to "what is this dataset for?" — text-classification for sentiment / topic / spam, question-answering for SQuAD-style, automatic-speech-recognition for ASR, text-to-image for diffusion, reinforcement-learning for RLHF. When you filter by task=, you're filtering on this canonical column — not on a free-text guess.

size_categories is an order-of-magnitude bucket, not an exact row count. HF defines fixed buckets and the uploader picks one. Size refers to the row count of the largest split (typically train), not file size on disk. Use size_categories to filter "toy benchmark" (n<1K, 1K<n<10K) from "fine-tuning scale" (100K<n<1M, 1M<n<10M) from "pretraining scale" (100M<n<1B, n>1T).
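Because the buckets are ordered, a "this size or larger" check reduces to comparing positions in a list. The sketch below assumes the bucket order shown in the code (including the buckets between 1B and 1T) and uses made-up rows; rows without a recognised bucket are skipped rather than guessed at.

```python
# Assumed bucket order, smallest to largest.
SIZE_BUCKETS = [
    "n<1K", "1K<n<10K", "10K<n<100K", "100K<n<1M", "1M<n<10M",
    "10M<n<100M", "100M<n<1B", "1B<n<10B", "10B<n<100B", "100B<n<1T", "n>1T",
]
RANK = {bucket: i for i, bucket in enumerate(SIZE_BUCKETS)}


def at_least(bucket, floor):
    """True if `bucket` is the same order of magnitude as `floor` or larger."""
    return RANK[bucket] >= RANK[floor]


# Hypothetical rows for illustration only.
rows = [
    {"id": "example/tiny", "size_categories": "n<1K"},
    {"id": "example/medium", "size_categories": "100K<n<1M"},
    {"id": "example/huge", "size_categories": "n>1T"},
    {"id": "example/undeclared", "size_categories": ""},
]

big = [
    r["id"] for r in rows
    if r["size_categories"] in RANK and at_least(r["size_categories"], "100K<n<1M")
]
print(big)
```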

language is multi-valued. A multilingual dataset carries one tag per language it covers (language:en, language:fr, language:de, and so on). The actor's language[] field preserves all of them. When you filter by language=en, you get every dataset that includes English, including multilingual archives such as XNLI, FLORES-200, and mC4, not only English-only datasets. Count the entries in language[] (or the language:* entries in tags[]) to tell monolingual from multilingual results.
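A minimal sketch of that monolingual / multilingual split, working on the `language` list of each output row. Treating `multilingual` and `code` as markers rather than languages is an assumption, and the sample rows are invented.

```python
# Values that can appear in `language` but are not ISO language codes.
NON_LANGUAGE_VALUES = {"multilingual", "code"}


def language_count(row):
    """Number of distinct language codes declared on a row."""
    return len({c for c in row.get("language", []) if c not in NON_LANGUAGE_VALUES})


def is_multilingual(row):
    """True if the row declares several languages or the `multilingual` marker."""
    return language_count(row) > 1 or "multilingual" in row.get("language", [])


# Hypothetical rows for illustration only.
rows = [
    {"id": "example/english-only", "language": ["en"]},
    {"id": "example/three-langs", "language": ["en", "fr", "de"]},
    {"id": "example/marked", "language": ["multilingual"]},
]

english_only = [r["id"] for r in rows
                if r["language"] == ["en"] and not is_multilingual(r)]
print(english_only)
```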

license ranges from clean to "lawyer this". Apache-2.0, MIT, BSD, CC0, and MIT-0 are the cleanest: full commercial use, with attribution requirements that vary by license. CC-BY-4.0 requires attribution but allows commercial use. CC-BY-SA requires share-alike. CC-BY-NC and every other *-nc-* variant prohibit commercial use; fine-tuning a commercial foundation model on an -nc- dataset is a license violation many ML teams discover too late. other and unknown are landmines. The license field passes the declared value through literally so downstream allowlist filters can include / exclude precisely.

Gated datasets. HF supports a gated flag — the dataset requires the user to accept a usage agreement (often a CC-BY-NC-style declaration, or an academic-use-only acknowledgement) before the download URL is unlocked. The hub UI shows a yellow banner; the API returns gated: true. This actor flags gated honestly so your downstream pipeline can route gated datasets to a human approver before any datasets.load_dataset() call.

Trending vs popular. sort=downloads ranks by all-time downloads — wikitext, glue, squad, imdb, c4, the_pile_v2 will permanently dominate. sort=trending is the rolling-7-day engagement leaderboard — newly-released benchmarks, viral instruction-tuning corpora, fresh evaluation suites surface here. sort=likes is the community-quality proxy. Use trending for "what's hot right now"; use downloads for "what are the canonical training corpora"; use likes for "what do other researchers vouch for".

cardData vs tags[]. The dataset card front-matter (cardData) is the authoritative source for task_categories, language, size_categories, license, and pretty_name — but many datasets have incomplete cards. The tags[] array is HF's fallback — it's auto-populated from cardData and from the dataset configuration files. This actor reads cardData first, then falls back to tags[] parsing, so you get the right answer regardless of how the uploader filled out their card.
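The card-first, tags-second lookup described above can be sketched in a few lines. This is an illustration of the idea, not the actor's source: the helper name `facet` is hypothetical, and the sample items are invented but follow the `key:value` tag shape listed in the schema table.

```python
def facet(item, key):
    """Return the values for one facet, preferring cardData over tags[]."""
    card = item.get("cardData") or {}
    value = card.get(key)
    if value:
        # Front-matter may hold a single string or a list; always return a list.
        return [value] if isinstance(value, str) else list(value)
    # Fallback: collect every "key:value" entry from the flat tags array.
    prefix = f"{key}:"
    return [t[len(prefix):] for t in item.get("tags", []) if t.startswith(prefix)]


# Hypothetical API items for illustration only.
complete_card = {
    "id": "example/with-card",
    "cardData": {"language": "en", "license": "apache-2.0"},
    "tags": ["language:en", "license:apache-2.0"],
}
missing_card = {
    "id": "example/no-card",
    "tags": ["language:en", "language:fr", "license:mit", "region:us"],
}

print(facet(complete_card, "language"))
print(facet(missing_card, "language"))
print(facet(missing_card, "size_categories"))
```

Returning a list in every case keeps downstream code from having to branch on whether the uploader wrote a scalar or a sequence in the card.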

Comparison vs alternatives

Capability	This actor	HF web UI	`huggingface_hub` SDK	Papers With Code	Kaggle Datasets	Zenodo
Top-N trending datasets	✅ flat CSV / JSON	⚠️ browse-only	⚠️ Python objects	⚠️ leaderboard-centric	⚠️ Kaggle-only	❌
Filter by task taxonomy	✅ HF canonical slugs	✅ but UI	✅ but Python	⚠️ different taxonomy	⚠️ tag-based	❌
Filter by language	✅ ISO 639 codes	✅ but UI	✅	⚠️ partial	⚠️	❌
Filter by size bucket	✅ HF buckets	✅ but UI	✅	❌	⚠️ file size only	⚠️
License (SPDX)	✅	✅	✅	⚠️	⚠️ Kaggle conventions	✅
Bulk export	✅ Apify dataset (CSV/JSON/Parquet/XLSX)	❌	⚠️ DIY normalization	❌	⚠️ Kaggle API	⚠️ DOI export
Auth required	❌	❌	❌ for public, ✅ for gated	❌	✅ Kaggle account	❌
Trending / momentum sort	✅ trending + likes + downloads	✅	⚠️ DIY	⚠️ paper momentum	⚠️	❌
Programmatic integration (webhook / REST)	✅ Apify	❌	✅ Python only	⚠️	⚠️ Kaggle API	⚠️ OAI-PMH
Pricing	$0.03/dataset, no subscription	Free	Free	Free	Free	Free

The headline distinction: this actor is the only way to get a flat, sortable, filterable CSV of the Hugging Face dataset registry — every task taxonomy slug, every language code, every size bucket, every license — without writing a custom client against the HF REST API and writing your own cardData ↔ tags[] normalization. The HF web UI is great for browsing one dataset at a time. The huggingface_hub SDK is great if you're already inside a Python ML pipeline. Papers With Code is great for tracking benchmark leaderboards. Kaggle Datasets is a separate registry entirely. Zenodo serves academic-DOI archives. This actor is for the ad-hoc ML research decision ("which 50 instruction-tuning corpora are trending this month?"), the compliance audit ("which top-200 image-classification datasets ship under CC-BY-NC?"), and the journalism / DevRel investigation ("what new code-generation datasets landed since GPT-5?").

Use cases

ML researchers picking training corpora. Filter task=text-classification, language=en, size=1M<n<10M, sort=downloads and you have a ranked CSV of every million-scale English text-classification dataset, by adoption. Pick the top-5 for your fine-tuning run, paste their ids into your datasets.load_dataset() calls. Same for task=automatic-speech-recognition, language=multilingual (ASR shootout), task=image-segmentation, sort=trending (vision research wave), task=reinforcement-learning, sort=likes (RLHF preference data).

MLOps teams auditing dataset lineage for model cards. When you ship a model, your model card needs to declare every training dataset, its license, its size, and its modality. Run search={your-model-name} or list the top-100 by downloads in your task and you have a citation-ready table for the model card. Cross-reference each id against your training-data manifest. Catch the case where someone fine-tuned on a cc-by-nc-4.0 dataset and the model card says "Apache-2.0" — the actor flags the dataset license honestly so the compliance audit catches the mismatch before the model ships.

Dataset curators tracking adoption. If you uploaded a dataset, run search={your-org} and watch downloads, likes, and last_modified over time. A weekly scheduled run lets you build your own adoption dashboard. Pair with huggingface-model-catalog (sister actor — link below) to track which fine-tuned models reference your dataset in their model card.

AI compliance / legal teams reviewing license exposure. Pull sort=downloads, maxResults=500 and filter the resulting CSV for license values containing nc (non-commercial) or sa (share-alike), or equal to other or unknown. These are the four categories that need legal review before any commercial fine-tuning. The hub has hundreds of high-traffic datasets under cc-by-nc-4.0 (a non-commercial license) that ML teams routinely fine-tune on by accident. With the flat row, the check becomes a WHERE license LIKE '%nc%' filter in your SQL warehouse.
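The same triage can be done outside SQL. The sketch below flags the four review categories by splitting the license identifier on hyphens instead of using a raw substring match, since a plain "contains nc" test would also hit identifiers such as bigscience-bloom-rail-1.0. The function name is hypothetical, and treating an empty license as needing review is an assumption.

```python
REVIEW_TOKENS = {"nc", "sa"}               # non-commercial, share-alike
REVIEW_VALUES = {"other", "unknown", ""}   # assumption: empty means "ask legal"


def needs_legal_review(license_id):
    """True if a license value falls into one of the four review categories."""
    lic = (license_id or "").lower()
    if lic in REVIEW_VALUES:
        return True
    # Compare whole hyphen-separated tokens, not substrings.
    return any(token in REVIEW_TOKENS for token in lic.split("-"))


for lic in ["apache-2.0", "cc-by-nc-4.0", "cc-by-sa-4.0",
            "bigscience-bloom-rail-1.0", "other"]:
    print(lic, needs_legal_review(lic))
```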

Foundation-model builders sweeping for new pretraining data. Run sort=createdAt, size=100M<n<1B weekly — every multi-hundred-million-row corpus uploaded in the last 7 days, ranked freshest first. Pair with sort=trending, size=n>1T for trillion-row pretraining-scale releases (FineWeb, RedPajama-v2, The Pile-v2). The flat row tells you license, language, size_categories, and the URL in one screen.

Academic groups building benchmarks and eval suites. Run search=eval or search=benchmark filtered by task to find every existing eval suite in your domain. Pair with sort=likes for the community-vouched ones (MMLU, HellaSwag, GSM8K, HumanEval, MATH, IFEval, MT-Bench).

LLM-based research agents. Feed flat rows into a Claude / GPT prompt: "Here are the 100 most-downloaded English NLP datasets with licenses, sizes, and task categories. Recommend a balanced fine-tuning mix under an apache-2.0-only license policy." One schema, no per-dataset tool calls. Pair with our Hugging Face Models Catalog for symmetric model-side queries.

Journalists and AI-ecosystem analysts writing the "State of Open ML 2026": pull 1,000 datasets, group by task_categories and language, plot downloads distributions, identify the long tail vs the head. Drops straight into Pandas / Polars / Excel.

Sister actors in the NexGenData developer-intel fleet

This actor is part of NexGenData's developer, ML, and ecosystem intelligence scraping fleet. Pair it with:

Hugging Face Models Catalog — the model-side twin. Same Hub, same flat-row schema applied to the huggingface.co/api/models endpoint: id, author, downloads, likes, last_modified, pipeline_tag, library_name, tags[], license. Pair the two for symmetric "model + dataset" queries — pretraining-scale corpora ranked alongside the models trained on them, fine-tuning datasets next to the most-downloaded LoRA adapters, eval suites cross-referenced against the leaderboard models.
NPM Package Stats — the JS/TS twin: weekly/monthly/yearly downloads, 30d trend %, bundle size (Bundlephobia), Snyk vulns + severity, deps, license, maintainers. Useful for ML researchers shipping JS-side inference (transformers.js, onnxruntime-web, tfjs) — pair huggingface-model-catalog with npm-package-stats to track which models are being deployed via JS runtimes.
PyPI Package Stats — the Python twin: weekly/monthly downloads, 30d trend %, deps list, Snyk vulns, license. Every ML researcher's requirements.txt audit. Pair with this actor to track which datasets are being loaded via which datasets versions — and which alternative loaders (webdataset, mosaicml-streaming, dataloaders) are gaining momentum.
Crates.io Trending Packages — the Rust twin: trending crates with 90d + 30d downloads, license, deps, categories, keywords. Track the Rust-ML wave — candle-core (Hugging Face's Rust inference runtime), burn, ort, tch-rs, tract — alongside the datasets they're being benchmarked against.
GitHub Trending Repos — daily/weekly/monthly github.com/trending, enriched with license, topics, README excerpt, AI flag. A dataset whose source repo just hit GitHub Trending often shows a 5–20× weekly-downloads spike on the Hub in the next 72 hours. Cross-reference dataset adoption with repo-side momentum.

Pricing details (proposal — awaiting approval)

$0.00005 Actor Start (charged once per run, multiplied by RAM in GB)
$0.03 per dataset row emitted

Typical run cost:

Scenario	Datasets	Cost
Smoke test	10	$0.30
Weekly trending snapshot	50	$1.50
Top-100 task leaderboard	100	$3.00
Daily ecosystem sweep	200	$6.00
Full deep archive	1,000	$30.00

A daily 200-dataset sweep runs at about $180/month — substantially under any commercial ML-platform analytics tool, and you get to export to anywhere instead of being locked into one vendor dashboard.
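The figures in the table follow from the two prices listed above. A small sketch of the arithmetic, assuming a 1 GB run (the memory setting is not stated on this page) and a 30-day month:

```python
START_FEE_PER_GB = 0.00005   # charged once per run, multiplied by RAM in GB
PRICE_PER_ROW = 0.03         # per dataset row emitted


def run_cost(rows, ram_gb=1.0):
    """Cost in USD of one run emitting `rows` dataset rows."""
    return START_FEE_PER_GB * ram_gb + PRICE_PER_ROW * rows


for label, rows in [("Smoke test", 10), ("Weekly trending snapshot", 50),
                    ("Daily ecosystem sweep", 200), ("Full deep archive", 1000)]:
    print(f"{label}: ${run_cost(rows):.2f}")

# A 200-dataset sweep every day for 30 days.
monthly = 30 * run_cost(200)
print(f"Monthly: ${monthly:.2f}")
```

The start fee is small enough that it disappears when rounding to cents, which is why the table shows the per-row cost alone.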

Anti-bot / reliability

huggingface.co/api/datasets is a documented public JSON REST API. No auth for public datasets. The huggingface_hub Python SDK uses the same endpoint.
HF's documented unauthenticated quota is roughly 500 requests / 5 minutes / IP. A maxResults=1000 run fires ≤10 list-endpoint requests — well under the ceiling. The actor self-paces with a 150ms cushion between pages and retries on 429/503 with 2-4s backoff.
All requests carry a polite User-Agent: NexGenData scrapers@thenextgennexus.com.

Anti-bot risk: NONE. The Hugging Face Hub is the open ML-community registry and welcomes well-behaved API consumers.
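The pacing and retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actor's code: `fetch` is a caller-supplied function returning `(status, body)`, the attempt limit of 4 is invented, and the sleep function is injectable so the logic can be exercised without waiting or touching the network.

```python
import random
import time

RETRYABLE = {429, 503}
PAGE_PAUSE_SECONDS = 0.15


def fetch_with_retry(fetch, url, max_attempts=4, sleep=time.sleep):
    """Call fetch(url); on 429/503 wait 2-4 s and try again."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts:
            sleep(random.uniform(2.0, 4.0))
    # Out of attempts: hand the last retryable response back to the caller.
    return status, body


def fetch_pages(fetch, urls, sleep=time.sleep):
    """Fetch pages in order with a short pause between consecutive pages."""
    results = []
    for i, url in enumerate(urls):
        if i:
            sleep(PAGE_PAUSE_SECONDS)
        results.append(fetch_with_retry(fetch, url, sleep=sleep))
    return results
```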

Schema stability

The HF datasets REST API has been stable since 2021. The ?sort=, ?filter=, ?search=, ?limit=, ?offset=, ?full=true parameters are all documented and have shipped unchanged. The cardData blob and the tags[] array have both been stable since 2022. The canonical task_categories taxonomy is community-governed and rarely breaks — new task slugs get added (e.g. text-to-video, mask-generation) without invalidating existing ones. If HF ever introduces a breaking change, we'll add a data_source_version field and stage the migration without breaking existing pipelines.

Support

Issues, schema requests, custom fields: scrapers@thenextgennexus.com.

About NexGenData

NexGenData publishes 280+ buyer-intent Apify actors covering developer ecosystems (npm, PyPI, crates.io, Go modules, GitHub trending, Hacker News, Show HN, Product Hunt), ML data (Hugging Face datasets + models), SEC filings (Form 4 insider buys, Form D, 13F holdings, 8-K material events, Schedule 13D/G activist tracker), YC alumni, Delaware DOC, lead generation, competitive intelligence, stock fundamentals across 30+ global exchanges, property & macro data, and AI-MCP servers exposing all of the above to LLM agents.

All actors are pay-per-result — you only pay for rows you keep. No subscription, no seat licence, no annual contract.

Browse the full catalog and start your free trial: https://apify.com/nexgendata?fpr=2ayu9b

Sign up via that link and the free Apify platform credit covers your first hundred-plus rows on every actor — risk-free evaluation.
