Deprecated

Pricing

Pay per usage

Hugging Face Datasets Scraper

Scrape the Hugging Face datasets catalog. Filter by task, language, license, or author. Sort by downloads, likes, or trending. Extracts metadata for 200k+ ML datasets.

Rating: 0.0 (0)

Developer: Stas Persiianenko

Last modified: 2 days ago

What does it do?

This actor calls the Hugging Face Datasets API and returns structured metadata for every matching dataset. You can browse the entire catalog or narrow results by search keyword, ML task, dataset language, license type, or author/organization.

For each dataset, the actor extracts:

Field	Description
`id`	Unique dataset ID (e.g. `huggingface/llm-perf-dataset`)
`author`	Author or organization name
`description`	Dataset card description (up to 500 chars)
`downloads`	Number of downloads in the last 30 days
`likes`	Number of Hugging Face likes
`license`	License type (e.g. `mit`, `apache-2.0`, `cc-by-4.0`)
`taskCategories`	ML task categories (e.g. `text-classification`)
`languages`	Language codes (e.g. `en`, `fr`, `zh`)
`formats`	Data formats (e.g. `parquet`, `csv`, `json`)
`modalities`	Data modalities (e.g. `text`, `image`, `audio`)
`libraries`	Compatible ML libraries (e.g. `datasets`, `transformers`)
`sizeCategory`	Dataset size bucket (e.g. `1K<n<10K`, `n>1T`)
`tags`	All raw tags associated with the dataset
`lastModified`	ISO 8601 timestamp of the last update
`createdAt`	ISO 8601 timestamp of creation
`isGated`	Whether access approval is required
`isDisabled`	Whether the dataset has been disabled
`url`	Direct link to the dataset page
`scrapedAt`	ISO 8601 timestamp of when data was collected

Who is it for?

🧑‍🔬 ML Researchers discovering training and evaluation data — search for datasets matching your task (e.g. question-answering) and language (e.g. en) in one API call.

🏢 AI teams building fine-tuning pipelines — programmatically monitor which datasets are available for a given domain (NLP, computer vision, speech) and track their download trends over time.

📊 Data scientists cataloging benchmarks — pull dataset metadata into your own database to compare size categories, licenses, and task coverage across thousands of datasets.

🔬 Academic researchers aggregating dataset availability — find all datasets from a specific organization (e.g. google, facebook, EleutherAI) and track their usage.

🛠️ Developer tool builders — add HuggingFace dataset search to your app without building your own API integration.

Why use this actor?

Hugging Face has 200,000+ public datasets. Browsing them manually is slow and doesn't scale. This actor gives you programmatic access with filtering and sorting that would otherwise require multiple API calls with custom pagination logic.

✅ No browser automation — pure HTTP API calls, extremely fast and reliable
✅ Structured output — tags are parsed into typed arrays (taskCategories, languages, formats, etc.) so you don't have to post-process raw tag strings
✅ Pagination handled — fetches all matching datasets automatically, not just the first page
✅ Pay only for what you get — PPE pricing means you pay per dataset extracted, not per minute

How to use it

Step 1: Set your filters

Choose what to scrape:

Search query — keyword to search across dataset names and descriptions (e.g. text classification, medical imaging)
Filter by task — ML task category from the HuggingFace tasks list (e.g. text-generation, image-segmentation)
Filter by language — ISO 639-1 language code (e.g. en, zh, de)
Filter by license — SPDX license ID (e.g. mit, apache-2.0, cc-by-4.0, cc0-1.0)
Filter by author — organization or user name (e.g. huggingface, google, allenai)

Step 2: Choose sort order

downloads — most downloaded in the last 30 days (default, best for popularity)
likes — most liked by the community
lastModified — recently updated datasets
trending — trending datasets by Hugging Face's trending score

Step 3: Set max results

Use 20 for a quick look
Use 100 for a fuller sample of a specific niche
Use 0 for unlimited (all matching datasets)

Step 4: Run and export

Results appear in the dataset tab. Export as JSON, CSV, or XLSX. Use the API to integrate with your pipelines.
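If you fetch the items through the API instead of using the export buttons, a small helper can flatten them into CSV. The sketch below is only an illustration: `items_to_csv` is a hypothetical helper name, and joining list fields with `;` is an assumption, not the format the Apify export produces.

```python
import csv
import io


def items_to_csv(items, columns):
    """Serialize dataset items to CSV text.

    List-valued fields (e.g. languages, taskCategories) are joined
    with ';' so each dataset stays on a single row.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for item in items:
        row = []
        for col in columns:
            value = item.get(col, "")
            if isinstance(value, list):
                value = ";".join(str(v) for v in value)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


# Sample item shaped like the actor's documented output fields.
sample = [{"id": "rajpurkar/squad", "downloads": 284512, "languages": ["en"]}]
print(items_to_csv(sample, ["id", "downloads", "languages"]))
```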

Input parameters

Parameter	Type	Default	Description
`searchQuery`	string	`""`	Keyword to search across names and descriptions
`filterByTask`	string	`""`	ML task category (e.g. `text-classification`)
`filterByLanguage`	string	`""`	ISO language code (e.g. `en`, `zh`)
`filterByLicense`	string	`""`	License identifier (e.g. `mit`, `apache-2.0`)
`filterByAuthor`	string	`""`	Author or organization name
`sortBy`	enum	`downloads`	Sort order: `downloads`, `likes`, `lastModified`, `trending`
`maxResults`	integer	`100`	Max datasets to return (0 = unlimited)
`maxRequestRetries`	integer	`3`	Retry attempts for failed API calls
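The parameters in the table above can be assembled and checked before a run is started. The following is a minimal sketch, assuming the parameter names and defaults listed in the table; `build_run_input` is a hypothetical helper, not part of the actor or the Apify client.

```python
ALLOWED_SORT = {"downloads", "likes", "lastModified", "trending"}


def build_run_input(search_query="", task="", language="", license_id="",
                    author="", sort_by="downloads", max_results=100,
                    max_request_retries=3):
    """Build the actor input dict, rejecting values the table does not allow."""
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")
    if max_results < 0:
        raise ValueError("maxResults must be 0 (unlimited) or a positive integer")
    return {
        "searchQuery": search_query,
        "filterByTask": task,
        "filterByLanguage": language,
        "filterByLicense": license_id,
        "filterByAuthor": author,
        "sortBy": sort_by,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


# MIT-licensed English text classification datasets, most liked first.
run_input = build_run_input(task="text-classification", language="en",
                            license_id="mit", sort_by="likes", max_results=20)
print(run_input)
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API usage section.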

Output example

{
  "id": "rajpurkar/squad",
  "author": "rajpurkar",
  "description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles...",
  "downloads": 284512,
  "likes": 1423,
  "license": "cc-by-sa-4.0",
  "taskCategories": ["question-answering"],
  "languages": ["en"],
  "formats": ["parquet"],
  "modalities": ["text"],
  "libraries": ["datasets", "mlcroissant"],
  "sizeCategory": "100K<n<1M",
  "tags": [
    "task_categories:question-answering",
    "language:en",
    "license:cc-by-sa-4.0",
    "size_categories:100K<n<1M",
    "format:parquet",
    "modality:text",
    "library:datasets",
    "library:mlcroissant"
  ],
  "lastModified": "2024-03-15T10:22:30.000Z",
  "createdAt": "2022-03-02T23:29:22.000Z",
  "isGated": false,
  "isDisabled": false,
  "url": "https://huggingface.co/datasets/rajpurkar/squad",
  "scrapedAt": "2026-04-28T08:00:00.000Z"
}
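The relationship between the raw `tags` array and the typed fields in the example above can be sketched as follows. This is an illustration inferred from the sample output only; the prefix-to-field mapping and the `parse_tags` name are assumptions, and the actor's real parsing logic may differ.

```python
# Assumed mapping from raw tag prefixes to the list-valued output fields.
PREFIX_TO_FIELD = {
    "task_categories": "taskCategories",
    "language": "languages",
    "format": "formats",
    "modality": "modalities",
    "library": "libraries",
}


def parse_tags(tags):
    """Split 'prefix:value' tags into typed fields; ignore tags without a prefix."""
    parsed = {field: [] for field in PREFIX_TO_FIELD.values()}
    parsed["license"] = None
    parsed["sizeCategory"] = None
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if not sep:
            continue  # plain tag with no 'prefix:' part
        if prefix in PREFIX_TO_FIELD:
            parsed[PREFIX_TO_FIELD[prefix]].append(value)
        elif prefix == "license":
            parsed["license"] = value
        elif prefix == "size_categories":
            parsed["sizeCategory"] = value
    return parsed


tags = [
    "task_categories:question-answering",
    "language:en",
    "license:cc-by-sa-4.0",
    "size_categories:100K<n<1M",
    "format:parquet",
    "modality:text",
    "library:datasets",
    "library:mlcroissant",
]
print(parse_tags(tags))
```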

Supported filter values

Task categories (common values for `filterByTask`)

Task	Filter value
Text classification	`text-classification`
Question answering	`question-answering`
Text generation	`text-generation`
Token classification	`token-classification`
Translation	`translation`
Summarization	`summarization`
Image classification	`image-classification`
Object detection	`object-detection`
Image segmentation	`image-segmentation`
Automatic speech recognition	`automatic-speech-recognition`

See the full list at huggingface.co/tasks.

Common licenses for `filterByLicense`

mit, apache-2.0, cc-by-4.0, cc-by-sa-4.0, cc0-1.0, openrail, gpl-3.0, llama2, llama3

Tips for best results

💡 Combine filters — Task + language + license filters work together. Run filterByTask=text-classification, filterByLanguage=en, filterByLicense=mit to find only MIT-licensed English text classification datasets.

💡 Use sortBy=trending: useful for discovering datasets that are gaining attention quickly, before they show up at the top of the download rankings.

💡 Monitor a specific org — Set filterByAuthor=google and sortBy=lastModified to track Google's latest dataset releases.

💡 No search + no filters — Returns the most popular datasets overall. Great for building a "top datasets" leaderboard.

💡 maxResults=0 — Returns everything matching your filters. For broad queries, this can be thousands of datasets — use specific filters to narrow down first.

How much does it cost to scrape Hugging Face datasets?

This actor uses pay-per-event (PPE) pricing — you only pay for datasets successfully extracted, not for time or failed runs.

Plan	Price per dataset
FREE	$0.00115
BRONZE	$0.001
SILVER	$0.00078
GOLD	$0.0006
PLATINUM	$0.0004
DIAMOND	$0.00028

Example costs:

100 datasets → ~$0.115 on FREE plan
1,000 datasets → ~$0.78–$1.00 on BRONZE/SILVER
10,000 datasets → ~$6 on GOLD (~$4 on PLATINUM)

Since this actor uses direct API calls (no proxies, no browser), compute costs are negligible. You get roughly 1,000 datasets per dollar on the BRONZE plan.

The free Apify plan includes a $5 monthly credit, enough for roughly 4,300 datasets per month at the FREE rate of $0.00115 per dataset.
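The per-dataset prices in the table above make cost estimates simple arithmetic. A minimal sketch, using only the listed prices and ignoring compute costs (which the text describes as negligible); `estimate_cost` is a hypothetical helper name.

```python
# Price per extracted dataset in USD, copied from the pricing table.
PRICE_PER_DATASET = {
    "FREE": 0.00115,
    "BRONZE": 0.001,
    "SILVER": 0.00078,
    "GOLD": 0.0006,
    "PLATINUM": 0.0004,
    "DIAMOND": 0.00028,
}


def estimate_cost(num_datasets, plan):
    """Return the estimated pay-per-event cost in USD for a run."""
    return num_datasets * PRICE_PER_DATASET[plan]


for plan, count in [("FREE", 100), ("BRONZE", 1000), ("GOLD", 10000)]:
    print(f"{count} datasets on {plan}: ${estimate_cost(count, plan):.3f}")
```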

Integrations

🔗 Connect with other actors

Build an ML dataset pipeline:

Run Hugging Face Datasets Scraper to find relevant datasets (filter by task, language, license)
Feed dataset IDs to a download/processing workflow
Store in your vector database for training

Monitor dataset availability:

Schedule weekly runs to track new datasets in your niche
Compare download counts over time to spot rising datasets
Alert your team when a gated dataset becomes public

Research cataloging:

Scrape all datasets from specific organizations
Export to Google Sheets or Notion
Filter and annotate for your literature review
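The monitoring workflow above (new datasets, download trends, gated datasets becoming public) comes down to comparing two runs. The sketch below assumes you have stored the items from a previous run and a current run as lists of dicts with the documented `id`, `downloads`, and `isGated` fields; `compare_snapshots` is a hypothetical helper name.

```python
def compare_snapshots(previous, current):
    """Diff two scrape results keyed by dataset id."""
    prev_by_id = {d["id"]: d for d in previous}
    report = {"new": [], "ungated": [], "download_change": {}}
    for dataset in current:
        old = prev_by_id.get(dataset["id"])
        if old is None:
            report["new"].append(dataset["id"])
            continue
        # Was gated last time, is open now.
        if old.get("isGated") and not dataset.get("isGated"):
            report["ungated"].append(dataset["id"])
        report["download_change"][dataset["id"]] = (
            dataset.get("downloads", 0) - old.get("downloads", 0)
        )
    return report


last_week = [
    {"id": "org/a", "downloads": 100, "isGated": True},
    {"id": "org/b", "downloads": 50, "isGated": False},
]
this_week = [
    {"id": "org/a", "downloads": 180, "isGated": False},
    {"id": "org/b", "downloads": 40, "isGated": False},
    {"id": "org/c", "downloads": 5, "isGated": False},
]
print(compare_snapshots(last_week, this_week))
```

Keep in mind that `downloads` is a rolling 30-day count, so a negative change means the recent download rate dropped, not that downloads were lost.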

🤖 Use with other Hugging Face scrapers

Actor	What it does
Hugging Face Scraper	Scrape models, spaces, and papers from HuggingFace Hub
Hugging Face Papers Scraper	Get ML papers with authors, abstracts, and citation counts

API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/huggingface-datasets-scraper').call({
    searchQuery: 'instruction tuning',
    filterByLanguage: 'en',
    sortBy: 'downloads',
    maxResults: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} datasets`);
items.forEach(d => console.log(d.id, d.downloads, d.license));

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/huggingface-datasets-scraper").call(run_input={
    "searchQuery": "instruction tuning",
    "filterByLanguage": "en",
    "sortBy": "downloads",
    "maxResults": 100,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Fetched {len(items)} datasets")
for d in items:
    print(d["id"], d["downloads"], d["license"])

cURL

# Start the actor
curl -X POST "https://api.apify.com/v2/acts/automation-lab~huggingface-datasets-scraper/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchQuery": "instruction tuning",
    "filterByLanguage": "en",
    "sortBy": "downloads",
    "maxResults": 100
  }'

# Get results once the run has finished (replace DATASET_ID with the defaultDatasetId from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_APIFY_TOKEN"

Use with Claude and other AI assistants (MCP)

You can use this actor directly with Claude via the Model Context Protocol (MCP), making it callable from Claude Code, Claude Desktop, Cursor, and VS Code.

Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/huggingface-datasets-scraper"

Claude Desktop / Cursor / VS Code

Add to your MCP config file (~/.claude/claude_desktop_config.json or equivalent):

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/huggingface-datasets-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Example prompts

Once connected, try these prompts with Claude:

"Find the top 20 most downloaded English text-classification datasets on Hugging Face."
"Get me all datasets from the 'allenai' organization sorted by likes."
"List the 50 most popular MIT-licensed datasets on Hugging Face."
"Find trending image segmentation datasets from this week."

Legality

Hugging Face's Terms of Service allow accessing public dataset metadata. This actor calls the official public HuggingFace API (/api/datasets) — the same endpoint used by the HuggingFace website and all official client libraries.

✅ Only public dataset metadata is collected
✅ Gated datasets are listed (metadata only) but their contents are NOT downloaded
✅ Private datasets are excluded by default
✅ Respects API rate limits with automatic retry and backoff
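For readers unfamiliar with the retry-and-backoff pattern mentioned in the last point, here is a generic sketch. It is not the actor's implementation, which is not documented here; the doubling delay, the `call_with_retries` name, and the injectable `sleep` argument are all assumptions chosen for illustration. The `max_retries` default mirrors the documented `maxRequestRetries` default of 3.

```python
import time


def call_with_retries(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_retries retries are used up.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1


# Demo with a stand-in request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return {"status": "ok"}


waits = []
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delays easy to inspect; in real use the default `time.sleep` applies.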

FAQ

Q: Can I scrape private or gated datasets? A: The actor lists gated datasets (those requiring access approval) in metadata, but cannot access their content. Fully private datasets are not accessible via the public API.

Q: How many datasets are on Hugging Face? A: There are 200,000+ public datasets as of 2026, growing rapidly. Use maxResults: 0 with specific filters to see all matches.

Q: The actor returned fewer results than my maxResults — why? A: Your filters matched fewer datasets than requested. Try broadening your filters (e.g., remove the license filter) to get more results.

Q: What does sortBy=trending actually measure? A: Hugging Face's trending score is a proprietary algorithm combining recent downloads, likes, and activity. It changes daily and highlights fast-rising datasets.

Q: Can I filter by multiple tasks at once? A: The current version supports one task filter per run. To get results for multiple tasks, run the actor separately for each task and combine the outputs.

Q: The actor ran but I got 0 results. What happened? A: Check that your filter values are valid. Task categories use hyphens, not spaces (e.g., text-classification, not text classification). Language codes are lowercase ISO 639-1 (e.g., en, fr, zh). License IDs must match Hugging Face's exact format (e.g., apache-2.0, cc-by-4.0).
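For the multiple-tasks question above, the outputs of separate runs can be combined locally. A minimal sketch, assuming each run's items are available as a list of dicts with the documented `id` field; `merge_unique` is a hypothetical helper name.

```python
def merge_unique(*result_sets):
    """Combine items from several runs, keeping the first copy of each dataset id."""
    merged = {}
    for items in result_sets:
        for item in items:
            merged.setdefault(item["id"], item)
    return list(merged.values())


# A dataset tagged with both tasks appears in both runs but only once in the result.
classification = [{"id": "org/a"}, {"id": "org/b"}]
summarization = [{"id": "org/b"}, {"id": "org/c"}]
combined = merge_unique(classification, summarization)
print([d["id"] for d in combined])
```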

Related actors

Hugging Face Scraper: scrape Hugging Face models, spaces, and papers
Hugging Face Papers Scraper: get ML research papers with authors and abstracts

