Hugging Face Datasets Catalog — ML Training Data Intel
Pricing
from $30.00 / 1,000 datasets
Hugging Face dataset registry: downloads, likes, last_modified, task_categories, language, size_categories, license, tags, author. Filter by task/language/size. Sort by downloads/likes/trending/modified. ML researchers, MLOps, AI compliance.
Rating
0.0
(0)
Developer
Stephan Corbeil
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
Hugging Face Datasets Catalog — ML Training Data Intel in One Flat Row
Pull a structured, filterable, sortable catalog of the Hugging Face Hub's dataset registry — the single largest open inventory of ML training and evaluation data on the public internet. Each dataset emits one flat row with id, author, total downloads, likes, last_modified, created_at, task_categories (the canonical Hugging Face task taxonomy — text-classification, automatic-speech-recognition, image-segmentation, reinforcement-learning, etc.), language (ISO-639 codes), size_categories bucket (n<1K, 1K<n<10K, ..., 100M<n<1B, n>1T), license (SPDX expression: apache-2.0, mit, cc-by-4.0, cc-by-nc-4.0, cc-by-sa-4.0, mit-0, bsd-3-clause, gpl-3.0, other, custom), full tags[] array, gated / disabled / private flags, a 600-char description excerpt, and the canonical huggingface.co/datasets/{id} URL. Built for ML researchers picking training corpora, MLOps teams auditing dataset lineage for model cards, dataset curators tracking who's downloading what, AI compliance / legal teams reviewing license exposure before fine-tuning, foundation-model builders sweeping the hub for new pretraining data, and academic groups building benchmarks and evaluation suites.
Price: $0.00005 per actor start + $0.03 per dataset row (proposal — awaiting Steve approval). A 10-dataset smoke test costs about $0.30, a 50-dataset weekly trending snapshot $1.50, a 200-dataset daily ecosystem sweep $6.00, a 1,000-dataset full archive $30.00.
Why this exists
The Hugging Face Hub hosts over 300,000 datasets as of mid-2026: public corpora spanning canonical NLP benchmarks (GLUE, SuperGLUE, SQuAD, MMLU, HellaSwag, TruthfulQA, HumanEval), pretraining-scale archives (The Pile, RedPajama, FineWeb, C4, The Stack, mC4, OSCAR), multilingual coverage (FLORES-200, WMT, XNLI), audio (LibriSpeech, Common Voice, FLEURS), vision (ImageNet, COCO, LAION-5B, DataComp-1B), code (CodeSearchNet, StarCoder, MBPP), instruction-tuning (Alpaca, ShareGPT, OpenHermes, Tulu), preference (HH-RLHF, UltraFeedback, Nectar), and a long tail of community-uploaded niche corpora. The Hub web UI is excellent for browsing one dataset at a time (every page renders the card, the splits, the example viewer, the trending downloads chart), but it is not built for the bulk-discovery question that actually drives an ML researcher's workflow:
"Give me every English text-classification dataset above 100K rows, Apache-2.0-licensed, ranked by downloads — as a CSV I can paste into Notion."
There is no "export the dataset registry as CSV" button. The huggingface_hub Python library has list_datasets(), but it returns Python objects, not flat row-shaped output that drops into Snowflake, BigQuery, Postgres, Looker, Tableau, or a notebook. The public REST API (https://huggingface.co/api/datasets) is real, documented, requires no auth, and accepts sort, filter, search, limit, and offset query parameters. Using it from a spreadsheet, however, means writing your own client, your own pagination loop, and your own normalization of the cardData blob against the flat tags[] array (the same field, say language, appears in both with slightly different shapes).
This actor does the join for you and emits one flat row per dataset — drop-in for Snowflake, BigQuery, Postgres, Looker, Tableau, an Excel pivot, a Slack daily digest, or a Claude / GPT prompt that ranks the corpus landscape.
No Hugging Face token. No login. No huggingface_hub.login() boilerplate. Public-only data, polite User-Agent (NexGenData scrapers@thenextgennexus.com), rate-limited to stay well under the documented unauthenticated quota (~500 requests per 5 minutes).
What you get — schema
Each dataset row contains:
| Field | Type | Source | Notes |
|---|---|---|---|
| id | string | HF list | Canonical hub id: either name (legacy canonical) or org/name. Pass this to datasets.load_dataset(id). |
| pretty_name | string | cardData.pretty_name | Human-readable display name from the dataset card front-matter (often title-case with spaces). |
| author | string | derived | The hub org/user that uploaded the dataset (split from org/name, or cardData.author). |
| downloads | int | HF list | All-time datasets.load_dataset() + direct-file download count. |
| likes | int | HF list | Community upvotes; a better quality proxy than raw downloads for non-pretraining use. |
| last_modified | ISO8601 | HF list | Last commit timestamp on the dataset repo. |
| created_at | ISO8601 | HF list | Initial upload timestamp. |
| task_categories | list | cardData.task_categories + tags[] | Canonical HF task taxonomy slugs (text-classification, summarization, automatic-speech-recognition, text-to-image, …). |
| language | list | cardData.language + tags[] | ISO 639-1 / 639-3 codes (en, fr, de, zh, ja), plus the non-ISO values multilingual and code. |
| size_categories | string | cardData.size_categories + tags[] | HF size bucket: n<1K, 1K<n<10K, 10K<n<100K, 100K<n<1M, 1M<n<10M, 10M<n<100M, 100M<n<1B, 1B<n<10B, 10B<n<100B, 100B<n<1T, n>1T. |
| license | string | cardData.license + tags[] | SPDX expression (apache-2.0, mit, cc-by-4.0, cc-by-nc-4.0, cc-by-sa-4.0, cc0-1.0, bsd-3-clause, gpl-3.0, lgpl-3.0, mpl-2.0, unknown, other). |
| tags | list | HF list | Full raw tags array: task_categories:..., language:..., license:..., size_categories:..., modality:..., format:..., library:..., region:..., arxiv:..., custom user tags. |
| gated | bool | HF list | True if the dataset requires accepting a gating agreement before download. |
| disabled | bool | HF list | True if the dataset is currently disabled by the hub (DMCA, TOS, etc.). |
| private | bool | HF list | True if private; should be false on the public listing. |
| description | string | HF list | First 600 chars of the dataset card description (markdown stripped to whitespace). |
| dataset_url | string | derived | https://huggingface.co/datasets/{id}, a direct link for QA / spot-check. |
| data_source | string | n/a | huggingface.co/api/datasets. |
Input
| Field | Type | Default | What it does |
|---|---|---|---|
| task | string | "" | Single HF task slug filter (text-classification, question-answering, image-classification, automatic-speech-recognition, …). Passed to the API as filter=task_categories:{slug}. |
| language | string | "" | Single ISO 639 language code filter (en, fr, zh, multilingual, …). Passed as filter=language:{code}. |
| size | enum | "" | Single size bucket filter (n<1K, 1K<n<10K, ..., 100M<n<1B, n>1T). Passed as filter=size_categories:{bucket}. |
| sort | enum | downloads | downloads, likes, trending, lastModified, createdAt. |
| search | string | "" | Free-text search on id + description. |
| maxResults | int 1–1000 | 50 | Hard cap on rows emitted. |
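As a rough illustration of the mapping in the Input table, the sketch below turns actor-style inputs into a list-endpoint URL. The function name build_list_url and its defaults are hypothetical, and the sketch assumes the endpoint accepts repeated filter parameters and the sort values exactly as listed; it is not the actor's actual code.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_URL = "https://huggingface.co/api/datasets"


def build_list_url(task="", language="", size="", sort="downloads",
                   search="", limit=50):
    """Translate actor-style inputs into a list-endpoint URL (illustrative)."""
    params = []
    # Each non-empty input becomes one filter=<facet>:<value> pair,
    # mirroring the "Passed as filter=..." notes in the Input table.
    if task:
        params.append(("filter", f"task_categories:{task}"))
    if language:
        params.append(("filter", f"language:{language}"))
    if size:
        params.append(("filter", f"size_categories:{size}"))
    if search:
        params.append(("search", search))
    params.append(("sort", sort))
    params.append(("limit", str(limit)))
    # urlencode percent-encodes ':' and '<' so the bucket names survive intact.
    return f"{API_URL}?{urlencode(params)}"


url = build_list_url(task="text-classification", language="en", size="100K<n<1M")
query = parse_qs(urlsplit(url).query)  # decode again to inspect the parameters
print(url)
print(query["filter"])
```

Decoding the query string back with parse_qs is only there to make the round trip visible; a real client would hand the URL straight to its HTTP library.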
Plain-English Hugging Face dataset mechanics
If you're a buyer (compliance lead, journalist, product manager) without an ML background, here are the platform quirks that matter:
task_categories is the headline taxonomy. Every well-maintained dataset declares one or more task_categories in its dataset-card front-matter — the canonical answer to "what is this dataset for?" — text-classification for sentiment / topic / spam, question-answering for SQuAD-style, automatic-speech-recognition for ASR, text-to-image for diffusion, reinforcement-learning for RLHF. When you filter by task=, you're filtering on this canonical column — not on a free-text guess.
size_categories is an order-of-magnitude bucket, not an exact row count. HF defines fixed buckets and the uploader picks one. Size refers to the row count of the largest split (typically train), not file size on disk. Use size_categories to separate "toy benchmark" (n<1K, 1K<n<10K) from "fine-tuning scale" (100K<n<1M, 1M<n<10M) from "pretraining scale" (100M<n<1B and every larger bucket up to n>1T).
language is multi-valued. A multilingual dataset carries one tag per language it covers (language:en, language:fr, language:de, and so on). The actor's language[] field preserves all of them. When you filter by language=en, you get every dataset that includes English, including multilingual archives such as XNLI, FLORES-200, and mC4, not only English-only datasets. To tell monolingual from multilingual results, count the language entries on each row.
license ranges from clean to "lawyer this". Apache-2.0, MIT, BSD, CC0, and MIT-0 are cleanest: full commercial use, attribution varies. CC-BY-4.0 requires attribution but allows commercial use. CC-BY-SA requires share-alike. CC-BY-NC and every other *-nc-* variant prohibit commercial use; fine-tuning a commercial foundation model on a -nc- dataset is a license violation many ML teams discover too late. other and unknown are landmines. The license field passes through the SPDX value literally so downstream allowlist filters can include / exclude precisely.
Gated datasets. HF supports a gated flag — the dataset requires the user to accept a usage agreement (often a CC-BY-NC-style declaration, or an academic-use-only acknowledgement) before the download URL is unlocked. The hub UI shows a yellow banner; the API returns gated: true. This actor flags gated honestly so your downstream pipeline can route gated datasets to a human approver before any datasets.load_dataset() call.
Trending vs popular. sort=downloads ranks by all-time downloads — wikitext, glue, squad, imdb, c4, the_pile_v2 will permanently dominate. sort=trending is the rolling-7-day engagement leaderboard — newly-released benchmarks, viral instruction-tuning corpora, fresh evaluation suites surface here. sort=likes is the community-quality proxy. Use trending for "what's hot right now"; use downloads for "what are the canonical training corpora"; use likes for "what do other researchers vouch for".
cardData vs tags[]. The dataset card front-matter (cardData) is the authoritative source for task_categories, language, size_categories, license, and pretty_name — but many datasets have incomplete cards. The tags[] array is HF's fallback — it's auto-populated from cardData and from the dataset configuration files. This actor reads cardData first, then falls back to tags[] parsing, so you get the right answer regardless of how the uploader filled out their card.
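The cardData-first, tags[]-fallback rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actor's implementation: the helper names tag_values, as_list, and resolve_field are hypothetical, the sample record is invented, and tags are assumed to look like prefix:value.

```python
def tag_values(tags, prefix):
    """Return the values of tags shaped like 'prefix:value', in order."""
    marker = prefix + ":"
    return [t[len(marker):] for t in tags if t.startswith(marker)]


def as_list(value):
    """Card fields may be missing, a single string, or a list; normalize."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def resolve_field(record, field):
    """Prefer cardData[field]; fall back to parsing tags[] when it is empty."""
    card = record.get("cardData") or {}
    values = as_list(card.get(field))
    if not values:
        values = tag_values(record.get("tags") or [], field)
    return values


# Invented record: the card declares only a license, the tags carry the rest.
record = {
    "id": "example-org/example-dataset",  # hypothetical id
    "cardData": {"license": "apache-2.0"},
    "tags": [
        "task_categories:text-classification",
        "language:en",
        "language:fr",
        "license:apache-2.0",
    ],
}

languages = resolve_field(record, "language")
is_multilingual = len(languages) > 1
print(resolve_field(record, "license"), languages, is_multilingual)
```

Counting the resolved language entries, as in the last lines, is one simple way to separate monolingual from multilingual rows.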
Comparison vs alternatives
| Capability | This actor | HF web UI | huggingface_hub SDK | Papers With Code | Kaggle Datasets | Zenodo |
|---|---|---|---|---|---|---|
| Top-N trending datasets | ✅ flat CSV / JSON | ⚠️ browse-only | ⚠️ Python objects | ⚠️ leaderboard-centric | ⚠️ Kaggle-only | ❌ |
| Filter by task taxonomy | ✅ HF canonical slugs | ✅ but UI | ✅ but Python | ⚠️ different taxonomy | ⚠️ tag-based | ❌ |
| Filter by language | ✅ ISO 639 codes | ✅ but UI | ✅ | ⚠️ partial | ⚠️ | ❌ |
| Filter by size bucket | ✅ HF buckets | ✅ but UI | ✅ | ❌ | ⚠️ file size only | ⚠️ |
| License (SPDX) | ✅ | ✅ | ✅ | ⚠️ | ⚠️ Kaggle conventions | ✅ |
| Bulk export | ✅ Apify dataset (CSV/JSON/Parquet/XLSX) | ❌ | ⚠️ DIY normalization | ❌ | ⚠️ Kaggle API | ⚠️ DOI export |
| Auth required | ❌ | ❌ | ❌ for public, ✅ for gated | ❌ | ✅ Kaggle account | ❌ |
| Trending / momentum sort | ✅ trending + likes + downloads | ✅ | ⚠️ DIY | ⚠️ paper momentum | ⚠️ | ❌ |
| Programmatic integration (webhook / REST) | ✅ Apify | ❌ | ✅ Python only | ⚠️ | ⚠️ Kaggle API | ⚠️ OAI-PMH |
| Pricing | $0.03/dataset, no subscription | Free | Free | Free | Free | Free |
The headline distinction: this actor is the only way to get a flat, sortable, filterable CSV of the Hugging Face dataset registry — every task taxonomy slug, every language code, every size bucket, every license — without writing a custom client against the HF REST API and writing your own cardData ↔ tags[] normalization. The HF web UI is great for browsing one dataset at a time. The huggingface_hub SDK is great if you're already inside a Python ML pipeline. Papers With Code is great for tracking benchmark leaderboards. Kaggle Datasets is a separate registry entirely. Zenodo serves academic-DOI archives. This actor is for the ad-hoc ML research decision ("which 50 instruction-tuning corpora are trending this month?"), the compliance audit ("which top-200 image-classification datasets ship under CC-BY-NC?"), and the journalism / DevRel investigation ("what new code-generation datasets landed since GPT-5?").
Use cases
ML researchers picking training corpora. Filter task=text-classification, language=en, size=1M<n<10M, sort=downloads and you have a ranked CSV of every million-scale English text-classification dataset, by adoption. Pick the top-5 for your fine-tuning run, paste their ids into your datasets.load_dataset() calls. Same for task=automatic-speech-recognition, language=multilingual (ASR shootout), task=image-segmentation, sort=trending (vision research wave), task=reinforcement-learning, sort=likes (RLHF preference data).
MLOps teams auditing dataset lineage for model cards. When you ship a model, your model card needs to declare every training dataset, its license, its size, and its modality. Run search={your-model-name} or list the top-100 by downloads in your task and you have a citation-ready table for the model card. Cross-reference each id against your training-data manifest. Catch the case where someone fine-tuned on a cc-by-nc-4.0 dataset and the model card says "Apache-2.0" — the actor flags the dataset license honestly so the compliance audit catches the mismatch before the model ships.
Dataset curators tracking adoption. If you uploaded a dataset, run search={your-org} and watch downloads, likes, and last_modified over time. A weekly scheduled run lets you build your own adoption dashboard. Pair with huggingface-model-catalog (sister actor — link below) to track which fine-tuned models reference your dataset in their model card.
AI compliance / legal teams reviewing license exposure. Pull sort=downloads, maxResults=500 and filter the resulting CSV for license containing nc (non-commercial), sa (share-alike), other, or unknown. These are the four categories that need legal review before any commercial fine-tuning. The hub has hundreds of high-traffic datasets under cc-by-nc-4.0 (a non-commercial license) that ML teams routinely fine-tune on accidentally. The flat row drops the license into a WHERE license LIKE '%nc%' filter in your SQL warehouse.
Foundation-model builders sweeping for new pretraining data. Run sort=createdAt, size=100M<n<1B weekly — every multi-hundred-million-row corpus uploaded in the last 7 days, ranked freshest first. Pair with sort=trending, size=n>1T for trillion-row pretraining-scale releases (FineWeb, RedPajama-v2, The Pile-v2). The flat row tells you license, language, size_categories, and the URL in one screen.
Academic groups building benchmarks and eval suites. Run search=eval or search=benchmark filtered by task to find every existing eval suite in your domain. Pair with sort=likes for the community-vouched ones (MMLU, HellaSwag, GSM8K, HumanEval, MATH, IFEval, MT-Bench).
LLM-based research agents. Feed flat rows into a Claude / GPT prompt: "Here are the 100 most-downloaded English NLP datasets with licenses, sizes, and task categories. Recommend a balanced fine-tuning mix under an apache-2.0-only license policy." One schema, no per-dataset tool calls. Pair with our Hugging Face Models Catalog for symmetric model-side queries.
Journalists and AI-ecosystem analysts writing the "State of Open ML 2026": pull 1,000 datasets, group by task_categories and language, plot downloads distributions, identify the long tail vs the head. Drops straight into Pandas / Polars / Excel.
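For the license-exposure review described in the compliance use case, a small filter over the exported rows might look like the sketch below. It is illustrative only: needs_legal_review is a hypothetical helper, the sample rows are invented, and it matches whole hyphen-separated tokens instead of the substring LIKE '%nc%' test mentioned above, which is a deliberate swap to avoid flagging identifiers that merely contain those letters. Treating a missing license as needing review is also an assumption.

```python
import re

REVIEW_TOKENS = {"nc", "sa"}              # non-commercial, share-alike
REVIEW_VALUES = {"other", "unknown", ""}  # empty string covers a missing license


def needs_legal_review(license_id):
    """Flag licenses in the four review categories named in the text."""
    value = (license_id or "").strip().lower()
    if value in REVIEW_VALUES:
        return True
    # Split 'cc-by-nc-4.0' into ['cc', 'by', 'nc', '4', '0'] and match tokens.
    tokens = re.split(r"[-_.]", value)
    return any(token in REVIEW_TOKENS for token in tokens)


# Invented rows shaped like the actor's flat output (only two fields shown).
rows = [
    {"id": "example-org/corpus-a", "license": "apache-2.0"},
    {"id": "example-org/corpus-b", "license": "cc-by-nc-4.0"},
    {"id": "example-org/corpus-c", "license": "cc-by-sa-4.0"},
    {"id": "example-org/corpus-d", "license": "unknown"},
    {"id": "example-org/corpus-e", "license": "mit"},
]

flagged = [row["id"] for row in rows if needs_legal_review(row["license"])]
print(flagged)
```

The same predicate can be applied to a CSV export row by row before anything is handed to a training job.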
Sister actors in the NexGenData developer-intel fleet
This actor is part of NexGenData's developer, ML, and ecosystem intelligence scraping fleet. Pair it with:
- Hugging Face Models Catalog: the model-side twin. Same Hub, same flat-row schema applied to the huggingface.co/api/models endpoint: id, author, downloads, likes, last_modified, pipeline_tag, library_name, tags[], license. Pair the two for symmetric "model + dataset" queries: pretraining-scale corpora ranked alongside the models trained on them, fine-tuning datasets next to the most-downloaded LoRA adapters, eval suites cross-referenced against the leaderboard models.
- NPM Package Stats: the JS/TS twin. Weekly/monthly/yearly downloads, 30d trend %, bundle size (Bundlephobia), Snyk vulns + severity, deps, license, maintainers. Useful for ML researchers shipping JS-side inference (transformers.js, onnxruntime-web, tfjs); pair huggingface-model-catalog with npm-package-stats to track which models are being deployed via JS runtimes.
- PyPI Package Stats: the Python twin. Weekly/monthly downloads, 30d trend %, deps list, Snyk vulns, license. Every ML researcher's requirements.txt audit. Pair with this actor to track which datasets are being loaded via which datasets versions, and which alternative loaders (webdataset, mosaicml-streaming, dataloaders) are gaining momentum.
- Crates.io Trending Packages: the Rust twin. Trending crates with 90d + 30d downloads, license, deps, categories, keywords. Track the Rust-ML wave (candle-core, Hugging Face's Rust inference runtime; burn; ort; tch-rs; tract) alongside the datasets they're being benchmarked against.
- GitHub Trending Repos: daily/weekly/monthly github.com/trending, enriched with license, topics, README excerpt, AI flag. A dataset whose source repo just hit GitHub Trending often shows a 5–20× weekly-downloads spike on the Hub in the next 72 hours. Cross-reference dataset adoption with repo-side momentum.
Pricing details (proposal — awaiting approval)
- $0.00005 Actor Start (charged once per run, multiplied by RAM in GB)
- $0.03 per dataset row emitted
Typical run cost:
| Scenario | Datasets | Cost |
|---|---|---|
| Smoke test | 10 | $0.30 |
| Weekly trending snapshot | 50 | $1.50 |
| Top-100 task leaderboard | 100 | $3.00 |
| Daily ecosystem sweep | 200 | $6.00 |
| Full deep archive | 1,000 | $30.00 |
A daily 200-dataset sweep runs at about $180/month — substantially under any commercial ML-platform analytics tool, and you get to export to anywhere instead of being locked into one vendor dashboard.
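The figures in the cost table follow directly from the two proposed prices, and the arithmetic can be checked with a short sketch. The function name run_cost is hypothetical, and the 1 GB RAM default for the per-start charge is an assumption made only for illustration.

```python
from decimal import Decimal

PRICE_PER_ROW = Decimal("0.03")       # proposed price per dataset row
PRICE_PER_START = Decimal("0.00005")  # proposed price per actor start, per GB of RAM


def run_cost(rows, ram_gb=1):
    """Cost of one run: start fee scaled by RAM, plus the per-row charge."""
    return PRICE_PER_START * ram_gb + PRICE_PER_ROW * rows


# Reproduce the table rows, then the 30-day cost of a daily 200-dataset sweep.
for rows in (10, 50, 100, 200, 1000):
    print(rows, round(run_cost(rows), 2))

monthly_sweep = round(run_cost(200) * 30, 2)
print(monthly_sweep)
```

Decimal is used instead of float so that cent-level totals stay exact; the start fee is small enough that it disappears once the totals are rounded to cents.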
Anti-bot / reliability
- huggingface.co/api/datasets is a documented public JSON REST API. No auth for public datasets. The huggingface_hub Python SDK uses the same endpoint.
- HF's documented unauthenticated quota is roughly 500 requests / 5 minutes / IP. A maxResults=1000 run fires ≤10 list-endpoint requests, well under the ceiling. The actor self-paces with a 150ms cushion between pages and retries on 429/503 with 2-4s backoff.
- All requests carry a polite User-Agent: NexGenData scrapers@thenextgennexus.com.
Anti-bot risk: NONE. The Hugging Face Hub is the open ML-community registry and welcomes well-behaved API consumers.
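The pacing and retry behaviour described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the actor's code: fetch_with_retry and crawl are hypothetical names, each page fetch is modelled as a callable returning a (status, payload) pair, and the retry cap of four attempts is an invented default.

```python
import random
import time

RETRY_STATUSES = {429, 503}  # rate limited / temporarily unavailable
PAGE_PAUSE_SECONDS = 0.150   # the 150 ms cushion between pages


def fetch_with_retry(fetch_page, max_attempts=4, sleep=time.sleep):
    """Call fetch_page() until it returns a non-retryable status.

    fetch_page is any zero-argument callable returning (status, payload).
    Between retries the function waits a random 2-4 seconds.
    """
    for attempt in range(1, max_attempts + 1):
        status, payload = fetch_page()
        if status not in RETRY_STATUSES:
            return status, payload
        if attempt < max_attempts:
            sleep(random.uniform(2.0, 4.0))
    return status, payload  # still failing after the last attempt


def crawl(pages, sleep=time.sleep):
    """Fetch pages in order, pausing briefly between consecutive pages."""
    results = []
    for index, fetch_page in enumerate(pages):
        if index:
            sleep(PAGE_PAUSE_SECONDS)
        results.append(fetch_with_retry(fetch_page, sleep=sleep))
    return results
```

Passing sleep in as a parameter keeps the sketch easy to exercise without real delays, for example by supplying a list's append method and inspecting the recorded waits.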
Schema stability
The HF datasets REST API has been stable since 2021. The ?sort=, ?filter=, ?search=, ?limit=, ?offset=, ?full=true parameters are all documented and have shipped unchanged. The cardData blob and the tags[] array have both been stable since 2022. The canonical task_categories taxonomy is community-governed and rarely breaks — new task slugs get added (e.g. text-to-video, mask-generation) without invalidating existing ones. If HF ever introduces a breaking change, we'll add a data_source_version field and stage the migration without breaking existing pipelines.
Support
Issues, schema requests, custom fields: scrapers@thenextgennexus.com.
About NexGenData
NexGenData publishes 280+ buyer-intent Apify actors covering developer ecosystems (npm, PyPI, crates.io, Go modules, GitHub trending, Hacker News, Show HN, Product Hunt), ML data (Hugging Face datasets + models), SEC filings (Form 4 insider buys, Form D, 13F holdings, 8-K material events, Schedule 13D/G activist tracker), YC alumni, Delaware DOC, lead generation, competitive intelligence, stock fundamentals across 30+ global exchanges, property & macro data, and AI-MCP servers exposing all of the above to LLM agents.
All actors are pay-per-result — you only pay for rows you keep. No subscription, no seat licence, no annual contract.
Browse the full catalog and start your free trial: https://apify.com/nexgendata?fpr=2ayu9b
Sign up via that link and the free Apify platform credit covers your first hundred-plus rows on every actor — risk-free evaluation.