# Hugging Face Datasets Catalog — ML Training Data Intel (`nexgendata/huggingface-datasets-catalog`) Actor

Hugging Face dataset registry: downloads, likes, last\_modified, task\_categories, language, size\_categories, license, tags, author. Filter by task/language/size. Sort by downloads/likes/trending/modified. ML researchers, MLOps, AI compliance.

- **URL**: https://apify.com/nexgendata/huggingface-datasets-catalog.md
- **Developed by:** [Stephan Corbeil](https://apify.com/nexgendata) (community)
- **Categories:** Developer tools, Business
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $30.00 / 1,000 datasets

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Hugging Face Datasets Catalog — ML Training Data Intel in One Flat Row

Pull a structured, filterable, sortable catalog of the Hugging Face Hub's dataset registry — the single largest open inventory of ML training and evaluation data on the public internet. Each dataset emits one flat row with `id`, `author`, total `downloads`, `likes`, `last_modified`, `created_at`, `task_categories` (the canonical Hugging Face task taxonomy — `text-classification`, `automatic-speech-recognition`, `image-segmentation`, `reinforcement-learning`, etc.), `language` (ISO-639 codes), `size_categories` bucket (`n<1K`, `1K<n<10K`, ..., `100M<n<1B`, `n>1T`), `license` (SPDX expression: `apache-2.0`, `mit`, `cc-by-4.0`, `cc-by-nc-4.0`, `cc-by-sa-4.0`, `mit-0`, `bsd-3-clause`, `gpl-3.0`, `other`, custom), full `tags[]` array, gated / disabled / private flags, a 600-char description excerpt, and the canonical `huggingface.co/datasets/{id}` URL. Built for **ML researchers** picking training corpora, **MLOps teams** auditing dataset lineage for model cards, **dataset curators** tracking who's downloading what, **AI compliance / legal teams** reviewing license exposure before fine-tuning, **foundation-model builders** sweeping the hub for new pretraining data, and **academic groups** building benchmarks and evaluation suites.

**Price:** $0.00005 per Actor start + $0.03 per dataset row. A 10-dataset smoke test costs about **$0.30**, a 50-dataset weekly trending snapshot **$1.50**, a 200-dataset daily ecosystem sweep **$6.00**, and a 1,000-dataset full archive **$30.00**.

---

### Why this exists

The Hugging Face Hub hosts over **300,000 datasets** as of mid-2026: public corpora spanning canonical NLP benchmarks (GLUE, SuperGLUE, SQuAD, MMLU, HellaSwag, TruthfulQA, HumanEval), pretraining-scale archives (The Pile, RedPajama, FineWeb, C4, The Stack, mC4, OSCAR), multilingual coverage (FLORES-200, WMT, XNLI), audio (LibriSpeech, Common Voice, FLEURS), vision (ImageNet, COCO, LAION-5B, DataComp-1B), code (CodeSearchNet, StarCoder, MBPP), instruction-tuning (Alpaca, ShareGPT, OpenHermes, Tulu), preference (HH-RLHF, UltraFeedback, Nectar), and a long tail of community-uploaded niche corpora. The Hub web UI is excellent for browsing one dataset at a time (every page renders the card, the splits, the example viewer, and the trending downloads chart), but it is not built for the bulk-discovery question that *actually* drives an ML researcher's workflow:

> *"Give me every English text-classification dataset above 100K rows, Apache-2.0-licensed, ranked by downloads — as a CSV I can paste into Notion."*

There is no "export the dataset registry as CSV" button. The `huggingface_hub` Python library has `list_datasets()`, but it returns Python objects, not flat row-shaped output that drops into Snowflake, BigQuery, Postgres, Looker, Tableau, or a notebook. The public REST API (`https://huggingface.co/api/datasets`) is real, documented, requires no auth, and accepts `sort`, `filter`, `search`, `limit`, and `offset` query parameters. Using it from a spreadsheet, however, means writing your own client, your own pagination loop, and your own normalization of the `cardData` blob vs the flat `tags[]` array (the same field, say `language`, appears in both with slightly different shapes).

This actor does the join for you and emits one flat row per dataset — drop-in for Snowflake, BigQuery, Postgres, Looker, Tableau, an Excel pivot, a Slack daily digest, or a Claude / GPT prompt that ranks the corpus landscape.

No Hugging Face token. No login. No `huggingface_hub.login()` boilerplate. Public-only data, polite User-Agent (`NexGenData scrapers@thenextgennexus.com`), rate-limited to stay well under the documented unauthenticated quota (~500 requests per 5 minutes).

---

### What you get — schema

Each dataset row contains:

| Field | Type | Source | Notes |
|---|---|---|---|
| `id` | string | HF list | Canonical hub id — either `name` (legacy canonical) or `org/name`. Pass this to `datasets.load_dataset(id)`. |
| `pretty_name` | string | `cardData.pretty_name` | Human-readable display name from the dataset card front-matter (often title-case with spaces). |
| `author` | string | derived | The hub org/user that uploaded the dataset (split from `org/name`, or `cardData.author`). |
| `downloads` | int | HF list | All-time `datasets.load_dataset()` + direct-file download count. |
| `likes` | int | HF list | Community upvotes — better quality proxy than raw downloads for non-pretraining use. |
| `last_modified` | ISO8601 | HF list | Last commit timestamp on the dataset repo. |
| `created_at` | ISO8601 | HF list | Initial upload timestamp. |
| `task_categories` | list<string> | `cardData.task_categories` + `tags[]` | Canonical HF task taxonomy slugs (`text-classification`, `summarization`, `automatic-speech-recognition`, `text-to-image`, …). |
| `language` | list<string> | `cardData.language` + `tags[]` | ISO 639-1 / 639-3 codes (`en`, `fr`, `de`, `zh`, `ja`), plus the non-ISO values `multilingual` and `code`. |
| `size_categories` | string | `cardData.size_categories` + `tags[]` | HF size bucket: `n<1K`, `1K<n<10K`, `10K<n<100K`, `100K<n<1M`, `1M<n<10M`, `10M<n<100M`, `100M<n<1B`, `n>1T`. |
| `license` | string | `cardData.license` + `tags[]` | SPDX expression (`apache-2.0`, `mit`, `cc-by-4.0`, `cc-by-nc-4.0`, `cc-by-sa-4.0`, `cc0-1.0`, `bsd-3-clause`, `gpl-3.0`, `lgpl-3.0`, `mpl-2.0`, `unknown`, `other`). |
| `tags` | list<string> | HF list | Full raw tags array — `task_categories:...`, `language:...`, `license:...`, `size_categories:...`, `modality:...`, `format:...`, `library:...`, `region:...`, `arxiv:...`, custom user tags. |
| `gated` | bool | HF list | True if the dataset requires accepting a gating agreement before download. |
| `disabled` | bool | HF list | True if the dataset is currently disabled by the hub (DMCA, TOS, etc.). |
| `private` | bool | HF list | True if private — should be `false` on the public listing. |
| `description` | string | HF list | First 600 chars of the dataset card description (markdown stripped to whitespace). |
| `dataset_url` | string | derived | `https://huggingface.co/datasets/{id}` — direct link for QA / spot-check. |
| `data_source` | string | — | `huggingface.co/api/datasets`. |

---

### Input

| Field | Type | Default | What it does |
|---|---|---|---|
| `task` | string | `""` | Single HF task slug filter (`text-classification`, `question-answering`, `image-classification`, `automatic-speech-recognition`, …). Passed to the API as `filter=task_categories:{slug}`. |
| `language` | string | `""` | Single ISO 639 language code filter (`en`, `fr`, `zh`, `multilingual`, …). Passed as `filter=language:{code}`. |
| `size` | enum | `""` | Single size bucket filter (`n<1K`, `1K<n<10K`, ..., `100M<n<1B`, `n>1T`). Passed as `filter=size_categories:{bucket}`. |
| `sort` | enum | `downloads` | `downloads`, `likes`, `trending`, `lastModified`, `createdAt`. |
| `search` | string | `""` | Free-text search on `id` + `description`. |
| `maxResults` | int 1–1000 | 50 | Hard cap on rows emitted. |

---

### Plain-English Hugging Face dataset mechanics

If you're a buyer (compliance lead, journalist, product manager) without an ML background, here are the platform quirks that matter:

**`task_categories` is the headline taxonomy.** Every well-maintained dataset declares one or more `task_categories` in its dataset-card front-matter — the canonical answer to "what is this dataset *for*?" — `text-classification` for sentiment / topic / spam, `question-answering` for SQuAD-style, `automatic-speech-recognition` for ASR, `text-to-image` for diffusion, `reinforcement-learning` for RLHF. When you filter by `task=`, you're filtering on this canonical column — not on a free-text guess.

**`size_categories` is an order-of-magnitude bucket, not an exact row count.** HF defines fixed buckets and the uploader picks one. Size refers to the row count of the largest split (typically `train`), not file size on disk. Use `size_categories` to filter "toy benchmark" (`n<1K`, `1K<n<10K`) from "fine-tuning scale" (`100K<n<1M`, `1M<n<10M`) from "pretraining scale" (`100M<n<1B`, `n>1T`).
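
The tiers named above can be attached to each row with a small lookup. This is a minimal sketch: the tier labels and the `size_tier` helper are illustrative names, not part of the Actor's output, and buckets the paragraph does not mention are left unmapped on purpose.

```python
from typing import Optional

# Tier groupings taken from the paragraph above; buckets it does not
# mention are deliberately left out and fall through to "unclassified".
SIZE_TIERS = {
    "n<1K": "toy-benchmark",
    "1K<n<10K": "toy-benchmark",
    "100K<n<1M": "fine-tuning",
    "1M<n<10M": "fine-tuning",
    "100M<n<1B": "pretraining",
    "n>1T": "pretraining",
}


def size_tier(size_category: Optional[str]) -> str:
    """Map a row's size_categories value to a coarse tier label."""
    return SIZE_TIERS.get(size_category or "", "unclassified")
```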

**`language` is multi-valued.** A multilingual dataset will carry one `language:en`, one `language:fr`, one `language:de`, etc. tag per language it covers. The actor's `language[]` field preserves all of them. When you filter by `language=en`, you'll get *all* datasets that include English — including XNLI, FLORES-200, mC4 multilingual archives — not only English-only datasets. Use the `tags[]` field to count how many languages each result covers if you want to disambiguate monolingual vs multilingual.
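
One way to separate monolingual from multilingual results is to count the distinct language codes on each row. The sketch below assumes rows are plain dicts with the `language` and `tags` fields from the schema table; `language_codes` and `is_monolingual` are hypothetical helper names.

```python
def language_codes(row: dict) -> set:
    """Collect codes from the row's `language` list and its `language:` tags."""
    codes = set(row.get("language") or [])
    for tag in row.get("tags") or []:
        if tag.startswith("language:"):
            codes.add(tag.split(":", 1)[1])
    return codes


def is_monolingual(row: dict) -> bool:
    """True when exactly one code is present and it is not `multilingual`."""
    codes = language_codes(row)
    return len(codes) == 1 and "multilingual" not in codes
```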

**`license` ranges from clean to "lawyer this".** Apache-2.0, MIT, BSD, CC0, and MIT-0 are the cleanest: full commercial use, though attribution requirements vary. CC-BY-4.0 requires attribution but allows commercial use. CC-BY-SA requires share-alike. **CC-BY-NC and any `*-nc-*` variant** prohibit commercial use; fine-tuning a commercial foundation model on a `-nc-` dataset is a license violation many ML teams discover too late. `other` and `unknown` are landmines. The `license` field passes through the SPDX value literally so downstream allowlist filters can include / exclude precisely.

**Gated datasets.** HF supports a `gated` flag — the dataset requires the user to accept a usage agreement (often a CC-BY-NC-style declaration, or an academic-use-only acknowledgement) before the download URL is unlocked. The hub UI shows a yellow banner; the API returns `gated: true`. This actor flags `gated` honestly so your downstream pipeline can route gated datasets to a human approver before any `datasets.load_dataset()` call.
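
The routing step described above could look roughly like this. It is a sketch only: `route_rows` is a hypothetical helper, and it assumes each row is a dict carrying the `gated` flag from the schema table.

```python
def route_rows(rows):
    """Split rows into (auto_ok, needs_approval) based on the `gated` flag."""
    auto_ok, needs_approval = [], []
    for row in rows:
        # Gated datasets go to a human approver before any load_dataset() call.
        if row.get("gated"):
            needs_approval.append(row)
        else:
            auto_ok.append(row)
    return auto_ok, needs_approval
```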

**Trending vs popular.** `sort=downloads` ranks by *all-time* downloads — `wikitext`, `glue`, `squad`, `imdb`, `c4`, `the_pile_v2` will permanently dominate. `sort=trending` is the rolling-7-day engagement leaderboard — newly-released benchmarks, viral instruction-tuning corpora, fresh evaluation suites surface here. `sort=likes` is the community-quality proxy. Use `trending` for "what's hot right now"; use `downloads` for "what are the canonical training corpora"; use `likes` for "what do other researchers vouch for".

**`cardData` vs `tags[]`.** The dataset card front-matter (`cardData`) is the authoritative source for `task_categories`, `language`, `size_categories`, `license`, and `pretty_name` — but many datasets have incomplete cards. The `tags[]` array is HF's fallback — it's auto-populated from `cardData` *and* from the dataset configuration files. This actor reads `cardData` first, then falls back to `tags[]` parsing, so you get the right answer regardless of how the uploader filled out their card.
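
The "card first, tags second" order can be sketched as follows. This illustrates the described precedence and is not the Actor's actual code; `resolve_field` is a hypothetical name, and the entry shape (a dict with `cardData` and `tags`) is assumed from the description above.

```python
def resolve_field(entry: dict, key: str) -> list:
    """Return values for `key`, preferring cardData and falling back to tags."""
    card = entry.get("cardData") or {}
    value = card.get(key)
    if value:
        # cardData may hold a single string or a list; normalise to a list.
        return [value] if isinstance(value, str) else list(value)
    prefix = key + ":"
    tags = entry.get("tags") or []
    # Tags use the `key:value` form, e.g. `language:en` or `license:mit`.
    return [tag[len(prefix):] for tag in tags if tag.startswith(prefix)]
```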

---

### Comparison vs alternatives

| Capability | This actor | HF web UI | `huggingface_hub` SDK | Papers With Code | Kaggle Datasets | Zenodo |
|---|---|---|---|---|---|---|
| Top-N trending datasets | ✅ flat CSV / JSON | ⚠️ browse-only | ⚠️ Python objects | ⚠️ leaderboard-centric | ⚠️ Kaggle-only | ❌ |
| Filter by task taxonomy | ✅ HF canonical slugs | ✅ but UI | ✅ but Python | ⚠️ different taxonomy | ⚠️ tag-based | ❌ |
| Filter by language | ✅ ISO 639 codes | ✅ but UI | ✅ | ⚠️ partial | ⚠️ | ❌ |
| Filter by size bucket | ✅ HF buckets | ✅ but UI | ✅ | ❌ | ⚠️ file size only | ⚠️ |
| License (SPDX) | ✅ | ✅ | ✅ | ⚠️ | ⚠️ Kaggle conventions | ✅ |
| Bulk export | ✅ Apify dataset (CSV/JSON/Parquet/XLSX) | ❌ | ⚠️ DIY normalization | ❌ | ⚠️ Kaggle API | ⚠️ DOI export |
| Auth required | ❌ | ❌ | ❌ for public, ✅ for gated | ❌ | ✅ Kaggle account | ❌ |
| Trending / momentum sort | ✅ trending + likes + downloads | ✅ | ⚠️ DIY | ⚠️ paper momentum | ⚠️ | ❌ |
| Programmatic integration (webhook / REST) | ✅ Apify | ❌ | ✅ Python only | ⚠️ | ⚠️ Kaggle API | ⚠️ OAI-PMH |
| Pricing | $0.03/dataset, no subscription | Free | Free | Free | Free | Free |

**The headline distinction:** this actor is the only way to get a flat, sortable, filterable CSV of the Hugging Face dataset registry — every task taxonomy slug, every language code, every size bucket, every license — without writing a custom client against the HF REST API and writing your own `cardData` ↔ `tags[]` normalization. The HF web UI is great for browsing one dataset at a time. The `huggingface_hub` SDK is great if you're already inside a Python ML pipeline. Papers With Code is great for tracking benchmark leaderboards. Kaggle Datasets is a separate registry entirely. Zenodo serves academic-DOI archives. This actor is for the ad-hoc ML research decision ("which 50 instruction-tuning corpora are trending this month?"), the compliance audit ("which top-200 image-classification datasets ship under CC-BY-NC?"), and the journalism / DevRel investigation ("what new code-generation datasets landed since GPT-5?").

---

### Use cases

**ML researchers picking training corpora.** Filter `task=text-classification, language=en, size=1M<n<10M, sort=downloads` and you have a ranked CSV of every million-scale English text-classification dataset, by adoption. Pick the top-5 for your fine-tuning run, paste their `id`s into your `datasets.load_dataset()` calls. Same for `task=automatic-speech-recognition, language=multilingual` (ASR shootout), `task=image-segmentation, sort=trending` (vision research wave), `task=reinforcement-learning, sort=likes` (RLHF preference data).

**MLOps teams auditing dataset lineage for model cards.** When you ship a model, your model card needs to declare every training dataset, its license, its size, and its modality. Run `search={your-model-name}` or list the top-100 by downloads in your task and you have a citation-ready table for the model card. Cross-reference each `id` against your training-data manifest. Catch the case where someone fine-tuned on a `cc-by-nc-4.0` dataset and the model card says "Apache-2.0" — the actor flags the dataset license honestly so the compliance audit catches the mismatch before the model ships.

**Dataset curators tracking adoption.** If you uploaded a dataset, run `search={your-org}` and watch `downloads`, `likes`, and `last_modified` over time. A weekly scheduled run lets you build your own adoption dashboard. Pair with `huggingface-model-catalog` (sister actor — link below) to track which fine-tuned models reference your dataset in their model card.
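
For an adoption dashboard, two scheduled snapshots can be compared by `id`. The helper below is a hypothetical sketch that assumes each snapshot is a list of row dicts with `id` and `downloads`; a dataset missing from the earlier snapshot is treated as starting from zero.

```python
def download_deltas(previous, current):
    """Return {id: downloads gained} between two snapshots of rows."""
    before = {row["id"]: row.get("downloads") or 0 for row in previous}
    return {
        row["id"]: (row.get("downloads") or 0) - before.get(row["id"], 0)
        for row in current
    }
```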

**AI compliance / legal teams reviewing license exposure.** Pull `sort=downloads, maxResults=500` and filter the resulting CSV for `license` containing `nc` (non-commercial), `sa` (share-alike), `other`, or `unknown`; these are the four categories that need legal review before any commercial fine-tuning. The hub has hundreds of high-traffic datasets under `cc-by-nc-4.0` (a non-commercial license) that ML teams routinely fine-tune on accidentally. The flat row drops the license into a `WHERE license LIKE '%nc%'` filter in your SQL warehouse.
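
Outside a SQL warehouse, the same review filter can be written as a small predicate. This sketch matches whole hyphen-separated tokens rather than substrings, so that an id such as `ncsa` is not flagged just because it contains the letters `nc`. Treating an empty license as review-worthy is an assumption of this example, and `needs_legal_review` is a hypothetical name.

```python
REVIEW_TOKENS = {"nc", "sa"}            # non-commercial, share-alike
REVIEW_VALUES = {"", "other", "unknown"}


def needs_legal_review(license_id) -> bool:
    """Flag licenses that fall into the four review categories."""
    value = (license_id or "").strip().lower()
    if value in REVIEW_VALUES:
        return True
    # Compare whole tokens: "cc-by-nc-4.0" -> {"cc", "by", "nc", "4.0"}.
    return bool(REVIEW_TOKENS & set(value.split("-")))
```

A substring filter like `LIKE '%nc%'` is quicker to write but may over-match, so the token comparison trades a few lines of code for fewer false positives.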

**Foundation-model builders sweeping for new pretraining data.** Run `sort=createdAt, size=100M<n<1B` weekly — every multi-hundred-million-row corpus uploaded in the last 7 days, ranked freshest first. Pair with `sort=trending, size=n>1T` for trillion-row pretraining-scale releases (FineWeb, RedPajama-v2, The Pile-v2). The flat row tells you `license`, `language`, `size_categories`, and the URL in one screen.

**Academic groups building benchmarks and eval suites.** Run `search=eval` or `search=benchmark` filtered by `task` to find every existing eval suite in your domain. Pair with `sort=likes` for the community-vouched ones (MMLU, HellaSwag, GSM8K, HumanEval, MATH, IFEval, MT-Bench).

**LLM-based research agents.** Feed flat rows into a Claude / GPT prompt: "Here are the 100 most-downloaded English NLP datasets with licenses, sizes, and task categories. Recommend a balanced fine-tuning mix under an `apache-2.0`-only license policy." One schema, no per-dataset tool calls. Pair with our **Hugging Face Models Catalog** for symmetric model-side queries.

**Journalists and AI-ecosystem analysts** writing the "State of Open ML 2026": pull 1,000 datasets, group by `task_categories` and `language`, plot `downloads` distributions, identify the long tail vs the head. Drops straight into Pandas / Polars / Excel.
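
Before reaching for Pandas or Polars, the group-by step can be prototyped with the standard library. The sketch assumes rows are dicts with the `task_categories` and `downloads` fields from the schema table; `downloads_by_task` is a hypothetical helper, and a dataset with several tasks is counted once under each.

```python
from collections import defaultdict


def downloads_by_task(rows):
    """Sum downloads per task category, largest total first."""
    totals = defaultdict(int)
    for row in rows:
        # A dataset can declare several tasks; credit its downloads to each.
        for task in row.get("task_categories") or ["(none)"]:
            totals[task] += row.get("downloads") or 0
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```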

---

### Sister actors in the NexGenData developer-intel fleet

This actor is part of NexGenData's **developer, ML, and ecosystem intelligence** scraping fleet. Pair it with:

- **[Hugging Face Models Catalog](https://apify.com/nexgendata/huggingface-model-catalog?fpr=2ayu9b)** — the model-side twin. Same Hub, same flat-row schema applied to the `huggingface.co/api/models` endpoint: `id`, `author`, `downloads`, `likes`, `last_modified`, `pipeline_tag`, `library_name`, `tags[]`, `license`. Pair the two for symmetric "model + dataset" queries — pretraining-scale corpora ranked alongside the models trained on them, fine-tuning datasets next to the most-downloaded LoRA adapters, eval suites cross-referenced against the leaderboard models.
- **[NPM Package Stats](https://apify.com/nexgendata/npm-package-stats?fpr=2ayu9b)** — the JS/TS twin: weekly/monthly/yearly downloads, 30d trend %, bundle size (Bundlephobia), Snyk vulns + severity, deps, license, maintainers. Useful for ML researchers shipping JS-side inference (`transformers.js`, `onnxruntime-web`, `tfjs`) — pair `huggingface-model-catalog` with `npm-package-stats` to track which models are being deployed via JS runtimes.
- **[PyPI Package Stats](https://apify.com/nexgendata/pypi-package-stats?fpr=2ayu9b)** — the Python twin: weekly/monthly downloads, 30d trend %, deps list, Snyk vulns, license. Every ML researcher's `requirements.txt` audit. Pair with this actor to track which datasets are being loaded via which `datasets` versions — and which alternative loaders (`webdataset`, `mosaicml-streaming`, `dataloaders`) are gaining momentum.
- **[Crates.io Trending Packages](https://apify.com/nexgendata/crates-io-trending-packages?fpr=2ayu9b)** — the Rust twin: trending crates with 90d + 30d downloads, license, deps, categories, keywords. Track the Rust-ML wave — `candle-core` (Hugging Face's Rust inference runtime), `burn`, `ort`, `tch-rs`, `tract` — alongside the datasets they're being benchmarked against.
- **[GitHub Trending Repos](https://apify.com/nexgendata/github-trending-repos?fpr=2ayu9b)** — daily/weekly/monthly `github.com/trending`, enriched with license, topics, README excerpt, AI flag. A dataset whose source repo just hit GitHub Trending often shows a 5–20× weekly-downloads spike on the Hub in the next 72 hours. Cross-reference dataset adoption with repo-side momentum.

---

### Pricing details

- **$0.00005** Actor Start (charged once per run, multiplied by RAM in GB)
- **$0.03** per dataset row emitted

Typical run cost:

| Scenario | Datasets | Cost |
|---|---|---|
| Smoke test | 10 | $0.30 |
| Weekly trending snapshot | 50 | $1.50 |
| Top-100 task leaderboard | 100 | $3.00 |
| Daily ecosystem sweep | 200 | $6.00 |
| Full deep archive | 1,000 | $30.00 |

A daily 200-dataset sweep runs at about **$180/month** — substantially under any commercial ML-platform analytics tool, and you get to export to anywhere instead of being locked into one vendor dashboard.
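
The table and the monthly figure follow directly from the two event prices. The sketch below reproduces that arithmetic; the 1 GB default for `ram_gb` is an assumption, since the Actor's memory setting is not stated here, and the function names are illustrative.

```python
START_FEE_PER_GB = 0.00005  # USD, charged once per run per GB of RAM
ROW_FEE = 0.03              # USD per dataset row emitted


def run_cost(rows: int, ram_gb: float = 1.0) -> float:
    """Cost of one run: start fee scaled by RAM plus the per-row fee."""
    return START_FEE_PER_GB * ram_gb + ROW_FEE * rows


def monthly_cost(rows_per_day: int, days: int = 30, ram_gb: float = 1.0) -> float:
    """Cost of one run per day over `days` days."""
    return days * run_cost(rows_per_day, ram_gb)
```

The start fee is small enough that the per-row fee dominates in every scenario in the table.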

---

### Anti-bot / reliability

- `huggingface.co/api/datasets` is a **documented public JSON REST API**. No auth for public datasets. The `huggingface_hub` Python SDK uses the same endpoint.
- HF's documented unauthenticated quota is roughly **500 requests / 5 minutes / IP**. A `maxResults=1000` run fires ≤10 list-endpoint requests — well under the ceiling. The actor self-paces with a 150ms cushion between pages and retries on 429/503 with 2-4s backoff.
- All requests carry a polite User-Agent: `NexGenData scrapers@thenextgennexus.com`.
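
The pacing and retry behaviour described in the list above can be sketched independently of any HTTP library. This is not the Actor's implementation: `fetch` stands for a hypothetical zero-argument callable returning `(status, body)`, and the `max_attempts` default is an assumption.

```python
import random
import time

RETRY_STATUSES = {429, 503}


def fetch_with_retry(fetch, max_attempts=4, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status not in RETRY_STATUSES:
            return status, body
        if attempt < max_attempts:
            sleep(random.uniform(2.0, 4.0))  # 2-4 s backoff on 429/503
    return status, body


def fetch_pages(page_fetchers, sleep=time.sleep):
    """Fetch pages in order, pausing 150 ms between consecutive pages."""
    results = []
    for index, fetch in enumerate(page_fetchers):
        if index:
            sleep(0.15)
        results.append(fetch_with_retry(fetch, sleep=sleep))
    return results
```

Passing `sleep` as a parameter keeps the pacing logic testable without real delays.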

Anti-bot risk: **NONE**. The Hugging Face Hub is the open ML-community registry and welcomes well-behaved API consumers.

---

### Schema stability

The HF datasets REST API has been stable since 2021. The `?sort=`, `?filter=`, `?search=`, `?limit=`, `?offset=`, `?full=true` parameters are all documented and have shipped unchanged. The `cardData` blob and the `tags[]` array have both been stable since 2022. The canonical `task_categories` taxonomy is community-governed and rarely breaks — new task slugs get added (e.g. `text-to-video`, `mask-generation`) without invalidating existing ones. If HF ever introduces a breaking change, we'll add a `data_source_version` field and stage the migration without breaking existing pipelines.

---

### Support

Issues, schema requests, custom fields: scrapers@thenextgennexus.com.

---

### About NexGenData

NexGenData publishes 280+ buyer-intent Apify actors covering developer ecosystems (npm, PyPI, crates.io, Go modules, GitHub trending, Hacker News, Show HN, Product Hunt), ML data (Hugging Face datasets + models), SEC filings (Form 4 insider buys, Form D, 13F holdings, 8-K material events, Schedule 13D/G activist tracker), YC alumni, Delaware DOC, lead generation, competitive intelligence, stock fundamentals across 30+ global exchanges, property & macro data, and AI-MCP servers exposing all of the above to LLM agents.

All actors are pay-per-result — you only pay for rows you keep. No subscription, no seat licence, no annual contract.

**Browse the full catalog and start your free trial:** https://apify.com/nexgendata?fpr=2ayu9b

Sign up via that link and the free Apify platform credit covers your first hundred-plus rows on every actor — risk-free evaluation.

# Actor input Schema

## `task` (type: `string`):

Restrict to a single Hugging Face task category — the canonical taxonomy that labels what each dataset is built to train. Common slugs: `text-classification`, `token-classification`, `question-answering`, `translation`, `summarization`, `text-generation`, `text2text-generation`, `fill-mask`, `sentence-similarity`, `feature-extraction`, `automatic-speech-recognition`, `audio-classification`, `text-to-speech`, `image-classification`, `image-segmentation`, `object-detection`, `image-to-text`, `text-to-image`, `visual-question-answering`, `document-question-answering`, `video-classification`, `reinforcement-learning`, `tabular-classification`, `tabular-regression`, `time-series-forecasting`. The API filter is `task_categories:{slug}` — passed verbatim. Browse the full taxonomy at https://huggingface.co/datasets?task_categories=task_categories. Empty = no task filter (return all datasets).
## `language` (type: `string`):

Restrict to datasets tagged with a specific ISO 639-1 language code — e.g. `en`, `fr`, `de`, `es`, `zh`, `ja`, `ko`, `ar`, `ru`, `pt`, `hi`, `bn`, `vi`, `id`, `th`, `tr`, `pl`, `it`, `nl`, `sv`. Hugging Face uses the `language:{code}` tag — passed verbatim to the API as `filter=language:{code}`. Multi-language datasets carry one `language:` tag per language they cover, so a filter of `en` will also return multilingual datasets that include English. Use `multilingual` for datasets explicitly tagged as cross-lingual. Empty = no language filter.
## `size` (type: `string`):

Restrict to datasets in a single size bucket — Hugging Face's standard `size_categories` tag. Valid slugs: `n<1K`, `1K<n<10K`, `10K<n<100K`, `100K<n<1M`, `1M<n<10M`, `10M<n<100M`, `100M<n<1B`, `n>1T`. The size refers to the row count of the largest split. Useful for filtering toy/eval datasets (`n<1K`, `1K<n<10K`) vs. pretraining-scale corpora (`100M<n<1B`, `n>1T`). The API filter is `size_categories:{bucket}` — passed verbatim. Empty = no size filter.
## `sort` (type: `string`):

How Hugging Face should rank the listing. `downloads` is the canonical popularity signal — total all-time dataset loads via `datasets.load_dataset()` and direct file downloads, the de-facto ranking on hf.co/datasets. `likes` ranks by community upvotes — better proxy for quality/curation than raw downloads. `trending` is the rolling-7-day engagement leaderboard (currently-hot datasets — newly-released benchmarks, viral instruction-tuning corpora, fresh evaluation suites). `lastModified` surfaces actively-maintained datasets (commit-recency proxy for staleness). `createdAt` ranks by upload date — the freshest datasets first (great for catching new releases before they accumulate downloads). The actor maps each value to the correct HF API param: `downloads`, `likes`, `lastModified`, `createdAt`, and `trending` (HF's documented trending sort).
## `search` (type: `string`):

Free-text search across dataset id + description (`?search=` on the HF datasets API). Use this for keyword-based discovery — e.g. `search=instruct`, `search=code`, `search=medical`, `search=multilingual`, `search=arxiv`. Pairs well with `task`/`language`/`size` filters: combine `task=question-answering` + `search=medical` to find medical QA datasets, or `language=fr` + `search=legal` to find French legal corpora. Empty = no search (the listing returns the full HF dataset registry sorted by the chosen `sort` order, filtered by the other params).
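
To make the mapping from input fields to query parameters concrete, here is a minimal sketch that builds a list URL from an input object. It is an illustration under assumptions: sending each filter as its own repeated `filter=` pair and passing `sort` through unchanged may not match what the Actor does internally, and `build_list_url` is a hypothetical name.

```python
from urllib.parse import urlencode

API_URL = "https://huggingface.co/api/datasets"

# Input field -> tag prefix used in the `filter` parameter.
FILTER_PREFIXES = (
    ("task", "task_categories"),
    ("language", "language"),
    ("size", "size_categories"),
)


def build_list_url(actor_input: dict, limit: int = 100) -> str:
    """Translate an Actor-style input dict into a datasets list URL."""
    params = []
    for key, prefix in FILTER_PREFIXES:
        if actor_input.get(key):
            # Assumption: each active filter is sent as a separate pair.
            params.append(("filter", f"{prefix}:{actor_input[key]}"))
    if actor_input.get("search"):
        params.append(("search", actor_input["search"]))
    params.append(("sort", actor_input.get("sort") or "downloads"))
    params.append(("limit", str(limit)))
    return f"{API_URL}?{urlencode(params)}"
```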
## `maxResults` (type: `integer`):

Hard ceiling on the number of dataset rows emitted per run (1–1000). The actor fires one paginated list request per ~100 results — a run of 50 datasets is a single API call, 500 is five. Hugging Face's public API tolerates ~500 requests per 5 minutes per IP unauthenticated, so even a maxResults=1000 run (≤10 API calls) stays well below the rate ceiling. Use 10 for smoke tests, 50 for a typical trend snapshot (default), 200 for a daily-sweep of top training corpora, 1000 for a full weekly archive.
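
The request counts quoted above are a ceiling division by the page size. A small sketch, assuming the page size is exactly 100 (the text says "~100") and using a hypothetical helper name:

```python
import math

PAGE_SIZE = 100  # assumed exact; the description above says roughly 100


def list_requests_needed(max_results: int) -> int:
    """Number of paginated list requests for a given maxResults value."""
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    return math.ceil(max_results / PAGE_SIZE)
```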

## Actor input object example

```json
{
  "task": "",
  "language": "",
  "size": "",
  "sort": "downloads",
  "search": "",
  "maxResults": 50
}
````

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "task": "",
    "language": "",
    "size": "",
    "sort": "downloads",
    "search": "",
    "maxResults": 50
};

// Run the Actor and wait for it to finish
const run = await client.actor("nexgendata/huggingface-datasets-catalog").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "task": "",
    "language": "",
    "size": "",
    "sort": "downloads",
    "search": "",
    "maxResults": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("nexgendata/huggingface-datasets-catalog").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "task": "",
  "language": "",
  "size": "",
  "sort": "downloads",
  "search": "",
  "maxResults": 50
}' |
apify call nexgendata/huggingface-datasets-catalog --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=nexgendata/huggingface-datasets-catalog",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Hugging Face Datasets Catalog — ML Training Data Intel",
        "description": "Hugging Face dataset registry: downloads, likes, last_modified, task_categories, language, size_categories, license, tags, author. Filter by task/language/size. Sort by downloads/likes/trending/modified. ML researchers, MLOps, AI compliance.",
        "version": "0.0",
        "x-build-id": "BtyoeLlVYatQZgz5j"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/nexgendata~huggingface-datasets-catalog/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-nexgendata-huggingface-datasets-catalog",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/nexgendata~huggingface-datasets-catalog/runs": {
            "post": {
                "operationId": "runs-sync-nexgendata-huggingface-datasets-catalog",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/nexgendata~huggingface-datasets-catalog/run-sync": {
            "post": {
                "operationId": "run-sync-nexgendata-huggingface-datasets-catalog",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "task": {
                        "title": "Task category filter (optional)",
                        "type": "string",
                        "description": "Restrict to a single Hugging Face task category — the canonical taxonomy that labels what each dataset is built to train. Common slugs: `text-classification`, `token-classification`, `question-answering`, `translation`, `summarization`, `text-generation`, `text2text-generation`, `fill-mask`, `sentence-similarity`, `feature-extraction`, `automatic-speech-recognition`, `audio-classification`, `text-to-speech`, `image-classification`, `image-segmentation`, `object-detection`, `image-to-text`, `text-to-image`, `visual-question-answering`, `document-question-answering`, `video-classification`, `reinforcement-learning`, `tabular-classification`, `tabular-regression`, `time-series-forecasting`. The API filter is `task_categories:{slug}` — passed verbatim. Browse the full taxonomy at https://huggingface.co/datasets?task_categories=task_categories. Empty = no task filter (return all datasets).",
                        "default": ""
                    },
                    "language": {
                        "title": "Language filter (optional)",
                        "type": "string",
                        "description": "Restrict to datasets tagged with a specific ISO 639-1 language code — e.g. `en`, `fr`, `de`, `es`, `zh`, `ja`, `ko`, `ar`, `ru`, `pt`, `hi`, `bn`, `vi`, `id`, `th`, `tr`, `pl`, `it`, `nl`, `sv`. Hugging Face uses the `language:{code}` tag — passed verbatim to the API as `filter=language:{code}`. Multi-language datasets carry one `language:` tag per language they cover, so a filter of `en` will also return multilingual datasets that include English. Use `multilingual` for datasets explicitly tagged as cross-lingual. Empty = no language filter.",
                        "default": ""
                    },
                    "size": {
                        "title": "Size bucket filter (optional)",
                        "enum": [
                            "",
                            "n<1K",
                            "1K<n<10K",
                            "10K<n<100K",
                            "100K<n<1M",
                            "1M<n<10M",
                            "10M<n<100M",
                            "100M<n<1B",
                            "n>1T"
                        ],
                        "type": "string",
                        "description": "Restrict to datasets in a single size bucket — Hugging Face's standard `size_categories` tag. Valid slugs: `n<1K`, `1K<n<10K`, `10K<n<100K`, `100K<n<1M`, `1M<n<10M`, `10M<n<100M`, `100M<n<1B`, `n>1T`. The size refers to the row count of the largest split. Useful for filtering toy/eval datasets (`n<1K`, `1K<n<10K`) vs. pretraining-scale corpora (`100M<n<1B`, `n>1T`). The API filter is `size_categories:{bucket}` — passed verbatim. Empty = no size filter.",
                        "default": ""
                    },
                    "sort": {
                        "title": "Sort order",
                        "enum": [
                            "downloads",
                            "likes",
                            "trending",
                            "lastModified",
                            "createdAt"
                        ],
                        "type": "string",
                        "description": "How Hugging Face should rank the listing. `downloads` is the canonical popularity signal: total all-time dataset loads via `datasets.load_dataset()` and direct file downloads, the de facto ranking on hf.co/datasets. `likes` ranks by community upvotes, a better proxy for quality/curation than raw downloads. `trending` is the rolling 7-day engagement leaderboard (currently hot datasets: newly released benchmarks, viral instruction-tuning corpora, fresh evaluation suites). `lastModified` surfaces actively maintained datasets (commit recency as a proxy for staleness). `createdAt` ranks by upload date, freshest datasets first (useful for catching new releases before they accumulate downloads). The Actor maps each value to the corresponding HF API param: `downloads`, `likes`, `lastModified`, `createdAt`, and `trending` (HF's documented trending sort).",
                        "default": "downloads"
                    },
                    "search": {
                        "title": "Search query (optional)",
                        "type": "string",
                        "description": "Free-text search across dataset id + description (`?search=` on the HF datasets API). Use this for keyword-based discovery — e.g. `search=instruct`, `search=code`, `search=medical`, `search=multilingual`, `search=arxiv`. Pairs well with `task`/`language`/`size` filters: combine `task=question-answering` + `search=medical` to find medical QA datasets, or `language=fr` + `search=legal` to find French legal corpora. Empty = no search (the listing returns the full HF dataset registry sorted by the chosen `sort` order, filtered by the other params).",
                        "default": ""
                    },
                    "maxResults": {
                        "title": "Max datasets",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Hard ceiling on the number of dataset rows emitted per run (1–1000). The Actor fires one paginated list request per ~100 results, so a run of 50 datasets is a single API call and a run of 500 is five. Hugging Face's public API tolerates ~500 requests per 5 minutes per IP unauthenticated, so even a maxResults=1000 run (≤10 API calls) stays well below the rate ceiling. Use 10 for smoke tests, 50 for a typical trend snapshot (default), 200 for a daily sweep of top training corpora, 1000 for a full weekly archive.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
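
The `run-sync-get-dataset-items` path above can be called without any client library. The sketch below uses only the Python standard library and follows the specification as written: a `POST` with the input schema as a JSON body and the Apify token in the `token` query parameter. The function names `build_request` and `fetch_items` are hypothetical, the 300-second timeout is an arbitrary assumption, and parsing the response as JSON assumes the endpoint's default output format, since the specification only declares a `200 OK` without a response schema.

```python
import json
import urllib.parse
import urllib.request

# Server URL and path taken from the OpenAPI specification above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/nexgendata~huggingface-datasets-catalog/run-sync-get-dataset-items"


def build_request(token, actor_input):
    """Hypothetical helper: build the POST request without sending it."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_items(token, actor_input, timeout=300):
    """Run the Actor synchronously and return its dataset items.

    Assumes the response body is JSON (the endpoint's default format).
    The timeout value is an assumption, not something the spec prescribes.
    """
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Replace the placeholder with a real token before running.
    items = fetch_items("<YOUR_API_TOKEN>", {"sort": "likes", "maxResults": 10})
    for item in items:
        print(item)
```

Keeping request construction separate from the network call makes the URL and body easy to inspect or unit-test. For long runs, the asynchronous `/runs` path may be a better fit than holding a synchronous connection open.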