Hugging Face Datasets Scraper

Pricing

Pay per event

Scrape the Hugging Face datasets catalog. Filter by task, language, license, or author. Sort by downloads, likes, or trending. Extracts metadata for 200k+ ML datasets.

Developer

Stas Persiianenko

Maintained by Community

Scrape the Hugging Face datasets catalog — search by keyword, filter by task category, language, license, or author, and sort by downloads, likes, or trending. Extract full metadata for 200,000+ public ML datasets including download counts, tags, descriptions, and direct links.

What does it do?

This actor calls the Hugging Face Datasets API and returns structured metadata for every matching dataset. You can browse the entire catalog or narrow results by ML task, language, license type, or author/organization.

For each dataset, the actor extracts:

| Field | Description |
| --- | --- |
| id | Unique dataset ID (e.g. huggingface/llm-perf-dataset) |
| author | Author or organization name |
| description | Dataset card description (up to 500 chars) |
| downloads | Number of downloads in the last 30 days |
| likes | Number of Hugging Face likes |
| license | License type (e.g. mit, apache-2.0, cc-by-4.0) |
| taskCategories | ML task categories (e.g. text-classification) |
| languages | Language codes (e.g. en, fr, zh) |
| formats | Data formats (e.g. parquet, csv, json) |
| modalities | Data modalities (e.g. text, image, audio) |
| libraries | Compatible ML libraries (e.g. datasets, transformers) |
| sizeCategory | Dataset size bucket (e.g. 1K<n<10K, n>1T) |
| tags | All raw tags associated with the dataset |
| lastModified | ISO 8601 timestamp of the last update |
| createdAt | ISO 8601 timestamp of creation |
| isGated | Whether access approval is required |
| isDisabled | Whether the dataset has been disabled |
| url | Direct link to the dataset page |
| scrapedAt | ISO 8601 timestamp of when data was collected |

Who is it for?

🧑‍🔬 ML Researchers discovering training and evaluation data — search for datasets matching your task (e.g. question-answering) and language (e.g. en) in one API call.

🏢 AI teams building fine-tuning pipelines — programmatically monitor which datasets are available for a given domain (NLP, computer vision, speech) and track their download trends over time.

📊 Data scientists cataloging benchmarks — pull dataset metadata into your own database to compare size categories, licenses, and task coverage across thousands of datasets.

🔬 Academic researchers aggregating dataset availability — find all datasets from a specific organization (e.g. google, facebook, EleutherAI) and track their usage.

🛠️ Developer tool builders — add HuggingFace dataset search to your app without building your own API integration.

Why use this actor?

Hugging Face has 200,000+ public datasets. Browsing them manually is slow and doesn't scale. This actor gives you programmatic access with filtering and sorting that would otherwise require multiple API calls with custom pagination logic.

  • No browser automation — pure HTTP API calls, extremely fast and reliable
  • Structured output — tags are parsed into typed arrays (taskCategories, languages, formats, etc.) so you don't have to post-process raw tag strings
  • Pagination handled — fetches all matching datasets automatically, not just the first page
  • Pay only for what you get — PPE pricing means you pay per dataset extracted, not per minute
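To make the structured-output point concrete, here is a minimal sketch of how prefixed tags could be grouped into typed arrays. It assumes the `prefix:value` tag convention visible in the output example later in this README. The function name `group_tags` and the prefix-to-field mapping are illustrative guesses, not the actor's actual code.

```python
from collections import defaultdict

# Assumed mapping from raw tag prefixes to output field names,
# inferred from the output example in this README (illustrative only).
PREFIX_TO_FIELD = {
    "task_categories": "taskCategories",
    "language": "languages",
    "format": "formats",
    "modality": "modalities",
    "library": "libraries",
}


def group_tags(tags):
    """Group 'prefix:value' tags into lists keyed by output field name."""
    grouped = defaultdict(list)
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        field = PREFIX_TO_FIELD.get(prefix)
        # Skip tags without a known prefix (e.g. free-form tags).
        if sep and field:
            grouped[field].append(value)
    return dict(grouped)


raw_tags = [
    "task_categories:question-answering",
    "language:en",
    "format:parquet",
    "library:datasets",
    "library:mlcroissant",
]
print(group_tags(raw_tags))
```

Tags with an unrecognized prefix are simply ignored in this sketch; single-valued fields such as license and sizeCategory would need their own handling.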

How to use it

Step 1: Set your filters

Choose what to scrape:

  • Search query — keyword to search across dataset names and descriptions (e.g. text classification, medical imaging)
  • Filter by task — ML task category from the HuggingFace tasks list (e.g. text-generation, image-segmentation)
  • Filter by language — ISO 639-1 language code (e.g. en, zh, de)
  • Filter by license — SPDX license ID (e.g. mit, apache-2.0, cc-by-4.0, cc0-1.0)
  • Filter by author — organization or user name (e.g. huggingface, google, allenai)

Step 2: Choose sort order

  • downloads — most downloaded in the last 30 days (default, best for popularity)
  • likes — most liked by the community
  • lastModified — recently updated datasets
  • trending — trending datasets by Hugging Face's trending score

Step 3: Set max results

  • Use 20 for a quick look
  • Use 100 for a thorough sample of a specific niche
  • Use 0 for unlimited (all matching datasets)

Step 4: Run and export

Results appear in the dataset tab. Export as JSON, CSV, or XLSX. Use the API to integrate with your pipelines.

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchQuery | string | "" | Keyword to search across names and descriptions |
| filterByTask | string | "" | ML task category (e.g. text-classification) |
| filterByLanguage | string | "" | ISO language code (e.g. en, zh) |
| filterByLicense | string | "" | License identifier (e.g. mit, apache-2.0) |
| filterByAuthor | string | "" | Author or organization name |
| sortBy | enum | downloads | Sort order: downloads, likes, lastModified, trending |
| maxResults | integer | 100 | Max datasets to return (0 = unlimited) |
| maxRequestRetries | integer | 3 | Retry attempts for failed API calls |
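As a rough illustration, the parameters above can be assembled and sanity-checked locally before starting a run. The helper `build_run_input` and its keyword names are hypothetical. Only the resulting dictionary keys, defaults, and allowed sortBy values come from the table.

```python
# Allowed sortBy values, as listed in the parameter table.
ALLOWED_SORT = {"downloads", "likes", "lastModified", "trending"}


def build_run_input(search_query="", task="", language="", license_id="",
                    author="", sort_by="downloads", max_results=100,
                    max_request_retries=3):
    """Return an actor input dict using the documented parameter names."""
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")
    if max_results < 0:
        raise ValueError("maxResults must be >= 0 (0 means unlimited)")
    return {
        "searchQuery": search_query,
        "filterByTask": task,
        "filterByLanguage": language,
        "filterByLicense": license_id,
        "filterByAuthor": author,
        "sortBy": sort_by,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


run_input = build_run_input(task="text-classification", language="en",
                            license_id="mit", max_results=20)
print(run_input)
```

The resulting dictionary can be passed as the run input in the API examples further down.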

Output example

{
  "id": "rajpurkar/squad",
  "author": "rajpurkar",
  "description": "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles...",
  "downloads": 284512,
  "likes": 1423,
  "license": "cc-by-sa-4.0",
  "taskCategories": ["question-answering"],
  "languages": ["en"],
  "formats": ["parquet"],
  "modalities": ["text"],
  "libraries": ["datasets", "mlcroissant"],
  "sizeCategory": "100K<n<1M",
  "tags": [
    "task_categories:question-answering",
    "language:en",
    "license:cc-by-sa-4.0",
    "size_categories:100K<n<1M",
    "format:parquet",
    "modality:text",
    "library:datasets",
    "library:mlcroissant"
  ],
  "lastModified": "2024-03-15T10:22:30.000Z",
  "createdAt": "2022-03-02T23:29:22.000Z",
  "isGated": false,
  "isDisabled": false,
  "url": "https://huggingface.co/datasets/rajpurkar/squad",
  "scrapedAt": "2026-04-28T08:00:00.000Z"
}
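The timestamp fields end in `Z`, which `datetime.fromisoformat()` only accepts from Python 3.11 onward. The sketch below shows one portable way to parse them and compute how stale a dataset was at scrape time. The helper name `parse_iso_utc` is an illustrative choice, and the two values are taken from the example record above.

```python
from datetime import datetime


def parse_iso_utc(ts):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    # Replace the 'Z' suffix with an explicit UTC offset so the call
    # also works on Python versions older than 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


record = {
    "lastModified": "2024-03-15T10:22:30.000Z",
    "scrapedAt": "2026-04-28T08:00:00.000Z",
}

# Time elapsed between the dataset's last update and the scrape.
age = parse_iso_utc(record["scrapedAt"]) - parse_iso_utc(record["lastModified"])
print(f"Last modified {age.days} days before scraping")
```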

Supported filter values

Task categories (common values for filterByTask)

| Task | Filter value |
| --- | --- |
| Text classification | text-classification |
| Question answering | question-answering |
| Text generation | text-generation |
| Token classification | token-classification |
| Translation | translation |
| Summarization | summarization |
| Image classification | image-classification |
| Object detection | object-detection |
| Image segmentation | image-segmentation |
| Automatic speech recognition | automatic-speech-recognition |

See the full list at huggingface.co/tasks.

Common licenses for filterByLicense

mit, apache-2.0, cc-by-4.0, cc-by-sa-4.0, cc0-1.0, openrail, gpl-3.0, llama2, llama3

Tips for best results

💡 Combine filters — Task + language + license filters work together. Run filterByTask=text-classification, filterByLanguage=en, filterByLicense=mit to find only MIT-licensed English text classification datasets.

💡 Use sortBy=trending: useful for discovering datasets that are gaining popularity before they become widely used.

💡 Monitor a specific org — Set filterByAuthor=google and sortBy=lastModified to track Google's latest dataset releases.

💡 No search + no filters — Returns the most popular datasets overall. Great for building a "top datasets" leaderboard.

💡 maxResults=0 — Returns everything matching your filters. For broad queries, this can be thousands of datasets — use specific filters to narrow down first.

How much does it cost to scrape Hugging Face datasets?

This actor uses pay-per-event (PPE) pricing — you only pay for datasets successfully extracted, not for time or failed runs.

| Plan | Price per dataset |
| --- | --- |
| FREE | $0.00115 |
| BRONZE | $0.001 |
| SILVER | $0.00078 |
| GOLD | $0.0006 |
| PLATINUM | $0.0004 |
| DIAMOND | $0.00028 |

Example costs:

  • 100 datasets → ~$0.115 on FREE plan
  • 1,000 datasets → ~$1.00 on BRONZE, ~$0.78 on SILVER
  • 10,000 datasets → ~$6 on GOLD, ~$4 on PLATINUM

Since this actor uses direct API calls (no proxies, no browser), compute costs are negligible. You get roughly 1,000 datasets per dollar on the BRONZE plan.

The free Apify plan includes $5 monthly credit, enough for roughly 4,300 datasets per month at the FREE rate of $0.00115 per dataset.
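The per-dataset arithmetic can be sketched as follows. The prices are copied from the plan table above, while the function names are illustrative. The sketch assumes the per-dataset event price is the only charge and ignores any platform compute costs.

```python
from decimal import Decimal

# Price per extracted dataset in USD, copied from the plan table.
PRICE_PER_DATASET = {
    "FREE": Decimal("0.00115"),
    "BRONZE": Decimal("0.001"),
    "SILVER": Decimal("0.00078"),
    "GOLD": Decimal("0.0006"),
    "PLATINUM": Decimal("0.0004"),
    "DIAMOND": Decimal("0.00028"),
}


def estimate_cost(num_datasets, plan):
    """Estimated cost in USD for extracting num_datasets on a plan."""
    return PRICE_PER_DATASET[plan] * num_datasets


def datasets_for_budget(budget_usd, plan):
    """Whole number of datasets a budget covers on a plan."""
    return int(Decimal(str(budget_usd)) / PRICE_PER_DATASET[plan])


for plan in PRICE_PER_DATASET:
    print(plan, estimate_cost(10_000, plan))
```

Decimal is used instead of float so that small per-event prices multiply and divide without rounding surprises.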

Integrations

🔗 Connect with other actors

Build an ML dataset pipeline:

  1. Run Hugging Face Datasets Scraper to find relevant datasets (filter by task, language, license)
  2. Feed dataset IDs to a download/processing workflow
  3. Store in your vector database for training

Monitor dataset availability:

  • Schedule weekly runs to track new datasets in your niche
  • Compare download counts over time to spot rising datasets
  • Alert your team when a gated dataset becomes public
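The monitoring ideas above amount to diffing two scheduled runs. Below is a minimal sketch, assuming each run's items have been loaded as lists of records with the `id`, `downloads`, and `isGated` fields from the output schema. The function name `compare_snapshots` and the sample records are hypothetical.

```python
def compare_snapshots(previous, current):
    """Compare two lists of dataset records keyed by 'id'."""
    prev_by_id = {d["id"]: d for d in previous}
    new_ids = []
    download_delta = {}
    newly_ungated = []
    for d in current:
        old = prev_by_id.get(d["id"])
        if old is None:
            # Dataset was not present in the earlier run.
            new_ids.append(d["id"])
            continue
        # 'downloads' is a 30-day count, so this is a change in that window.
        download_delta[d["id"]] = d["downloads"] - old["downloads"]
        if old["isGated"] and not d["isGated"]:
            newly_ungated.append(d["id"])
    return {
        "new": new_ids,
        "downloadDelta": download_delta,
        "newlyUngated": newly_ungated,
    }


# Hypothetical sample data for illustration only.
last_week = [
    {"id": "org/a", "downloads": 100, "isGated": True},
    {"id": "org/b", "downloads": 50, "isGated": False},
]
this_week = [
    {"id": "org/a", "downloads": 180, "isGated": False},
    {"id": "org/b", "downloads": 40, "isGated": False},
    {"id": "org/c", "downloads": 5, "isGated": False},
]
print(compare_snapshots(last_week, this_week))
```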

Research cataloging:

  1. Scrape all datasets from specific organizations
  2. Export to Google Sheets or Notion
  3. Filter and annotate for your literature review

🤖 Use with other Hugging Face scrapers

| Actor | What it does |
| --- | --- |
| Hugging Face Scraper | Scrape models, spaces, and papers from HuggingFace Hub |
| Hugging Face Papers Scraper | Get ML papers with authors, abstracts, and citation counts |

API usage

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('automation-lab/huggingface-datasets-scraper').call({
    searchQuery: 'instruction tuning',
    filterByLanguage: 'en',
    sortBy: 'downloads',
    maxResults: 100,
});

// Read the scraped records from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} datasets`);
items.forEach((d) => console.log(d.id, d.downloads, d.license));

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("automation-lab/huggingface-datasets-scraper").call(run_input={
    "searchQuery": "instruction tuning",
    "filterByLanguage": "en",
    "sortBy": "downloads",
    "maxResults": 100,
})

# Read the scraped records from the run's default dataset
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Fetched {len(items)} datasets")
for d in items:
    print(d["id"], d["downloads"], d["license"])

cURL

# Start the actor
curl -X POST "https://api.apify.com/v2/acts/automation-lab~huggingface-datasets-scraper/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchQuery": "instruction tuning",
    "filterByLanguage": "en",
    "sortBy": "downloads",
    "maxResults": 100
  }'

# Get results (replace DATASET_ID with the defaultDatasetId from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_APIFY_TOKEN"

Use with Claude and other AI assistants (MCP)

You can use this actor directly with Claude via the Model Context Protocol (MCP), making it callable from Claude Code, Claude Desktop, Cursor, and VS Code.

Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/huggingface-datasets-scraper"

Claude Desktop / Cursor / VS Code

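Add to your MCP config file (for Claude Desktop on macOS this is typically ~/Library/Application Support/Claude/claude_desktop_config.json; other clients use their own equivalent file):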

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/huggingface-datasets-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Example prompts

Once connected, try these prompts with Claude:

  • "Find the top 20 most downloaded English text-classification datasets on Hugging Face."
  • "Get me all datasets from the 'allenai' organization sorted by likes."
  • "List the 50 most popular MIT-licensed datasets on Hugging Face."
  • "Find trending image segmentation datasets from this week."

Legality

Hugging Face's Terms of Service allow accessing public dataset metadata. This actor calls the official public HuggingFace API (/api/datasets) — the same endpoint used by the HuggingFace website and all official client libraries.

  • ✅ Only public dataset metadata is collected
  • ✅ Gated datasets are listed (metadata only) but their contents are NOT downloaded
  • ✅ Private datasets are excluded by default
  • ✅ Respects API rate limits with automatic retry and backoff
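The retry-and-backoff behaviour in the last bullet can be pictured roughly as below. This is a generic sketch of exponential backoff, not the actor's actual implementation. The name `call_with_retries`, the base delay, and the doubling schedule are all assumptions; the default of 3 retries mirrors the maxRequestRetries default.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure retry up to max_retries times with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated request that fails twice before succeeding.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


# Record the delays instead of actually sleeping.
delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting.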

FAQ

Q: Can I scrape private or gated datasets? A: The actor lists gated datasets (those requiring access approval) in metadata, but cannot access their content. Fully private datasets are not accessible via the public API.

Q: How many datasets are on Hugging Face? A: There are 200,000+ public datasets as of 2026, growing rapidly. Use maxResults: 0 with specific filters to see all matches.

Q: The actor returned fewer results than my maxResults — why? A: Your filters matched fewer datasets than requested. Try broadening your filters (e.g., remove the license filter) to get more results.

Q: What does sortBy=trending actually measure? A: Hugging Face's trending score is a proprietary algorithm combining recent downloads, likes, and activity. It changes daily and highlights fast-rising datasets.

Q: Can I filter by multiple tasks at once? A: The current version supports one task filter per run. To get results for multiple tasks, run the actor separately for each task and combine the outputs.
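Combining the outputs of several runs can be as simple as de-duplicating by `id`, since a dataset tagged with more than one task may appear in several result sets. The sketch below assumes each run's items are already loaded as lists of records. `merge_results` and the sample records are hypothetical.

```python
def merge_results(*result_lists):
    """Merge item lists from several runs, keeping the first record per id."""
    merged = {}
    for items in result_lists:
        for d in items:
            merged.setdefault(d["id"], d)
    return list(merged.values())


# Hypothetical outputs of two runs with different filterByTask values.
qa_items = [{"id": "org/a"}, {"id": "org/b"}]
summarization_items = [{"id": "org/b"}, {"id": "org/c"}]

combined = merge_results(qa_items, summarization_items)
print([d["id"] for d in combined])
```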

Q: The actor ran but I got 0 results — what happened? A: Check that your filter values are valid. Task categories use hyphens, not spaces or underscores (e.g., text-classification, not text classification). Language codes are ISO 639-1 lowercase (e.g., en, fr, zh). License IDs must match Hugging Face's exact format (e.g., apache-2.0, cc-by-4.0).
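A small normalization step before a run can catch the most common formatting slips, such as typing a task with spaces instead of the hyphenated form shown in the examples. The helper below is a heuristic sketch with a hypothetical name. It only fixes case, whitespace, and separators and does not check that the value actually exists on Hugging Face.

```python
import re


def normalize_filter_value(value):
    """Lowercase a filter value and turn spaces or underscores into hyphens."""
    return re.sub(r"[\s_]+", "-", value.strip().lower())


for raw in ["Text Classification", "question_answering", " Apache-2.0 ", "en"]:
    print(repr(raw), "->", normalize_filter_value(raw))
```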