Hugging Face Datasets Scraper

Pricing

$5.00/month + usage

Scrape dataset metadata from Hugging Face Hub. Extract names, authors, download counts, likes, trending scores, task categories, size categories, languages, licenses, tags and descriptions. Filter by search query, task type, language, or license. Sort by trending, downloads, likes, or last modified.

Rating: 0.0 (0)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 2 days ago

🚀 Hugging Face Datasets Scraper

Extract structured metadata from thousands of Hugging Face datasets in minutes. Whether you are an AI researcher tracking trending datasets, a data engineer building training pipelines, or a business analyst benchmarking the ML ecosystem, this tool gives you the dataset intelligence you need without any manual browsing.

The Hugging Face Datasets Scraper connects directly to Hugging Face's public catalog and returns rich metadata for every dataset matching your search. Filter by task type, language, license, and more to get precisely the data you need. Results are instantly downloadable as CSV, Excel, or JSON, ready for further analysis or integration into your workflows.

No sign-up on Hugging Face is required. No coding knowledge is needed. Simply set your filters, run the tool, and download your results.

✨ What Does It Do

The tool collects detailed metadata for each Hugging Face dataset, including:

  • Dataset ID and name - The unique identifier and human-readable name for each dataset, so you can reference and cite datasets accurately
  • Direct URL - A link to the dataset's page on Hugging Face, giving you instant access to documentation and download options
  • Author - The organization or individual who published the dataset, enabling you to track activity from specific research groups or companies
  • Download count - Total number of times the dataset has been downloaded, so you can gauge adoption and community trust at a glance
  • Likes - Community engagement score showing how many users have liked the dataset, helping you identify quality data sources quickly
  • Trending score - A real-time popularity metric showing which datasets are gaining traction, so you can stay ahead of emerging trends in AI research
  • Task categories - The ML tasks the dataset is designed for (e.g., text-classification, image-segmentation), enabling precise filtering for your project requirements
  • Size categories - Dataset size brackets by row count (e.g., 1K<n<10K, 10K<n<100K), so you can select datasets that match your compute and storage budget
  • Languages - Language codes for multilingual or language-specific datasets, helping global teams find the right training data for their target markets
  • License - The open-source or commercial license governing dataset use, so compliance and legal teams can quickly clear datasets for production use
  • Tags - Additional descriptive tags attached to the dataset, giving you richer search and filtering options beyond standard categories
  • Description - The dataset's overview text as written by its author, giving you context to evaluate fit without leaving your workflow
  • Last modified and created dates - Timestamps for tracking freshness and age, so you can prioritize recently updated datasets in fast-moving research areas
  • Scraped at timestamp - When this record was collected, for audit and refresh tracking

🔧 Input

Configure the tool using these settings. All fields are optional. Run it with no filters to get the currently trending datasets.

| Field | Description |
| --- | --- |
| Max Items | How many datasets to return. Supports up to 1,000,000 records per run. |
| Search Query | A keyword or phrase to search for (e.g., "image segmentation", "clinical trials", "LLM pretraining"). |
| Task Category | Filter to a specific ML task type (e.g., text-classification, text-generation, image-classification). |
| Language | Filter by language code (e.g., en for English, zh for Chinese, fr for French). |
| License | Filter by license type (e.g., apache-2.0, mit, cc-by-4.0). |
| Sort By | Sort results by trendingScore, downloads, likes, or lastModified. Default is trending score. Results are always sorted in descending order (highest first). |

Example input:

{
  "maxItems": 50,
  "query": "text classification",
  "taskCategory": "text-classification",
  "language": "en",
  "license": "apache-2.0",
  "sort": "downloads"
}
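If you prepare inputs programmatically, a small helper can catch typos before a run starts. The following is a minimal sketch, not part of the actor: the function name `build_input`, the lower bound of 1 for `maxItems`, and the choice to omit unset filters are all assumptions. The key names, sort values, and the 1,000,000 ceiling come from the table and example above.

```python
from typing import Any, Dict, Optional

# Sort values and the per-run ceiling as listed in the Input table.
ALLOWED_SORTS = {"trendingScore", "downloads", "likes", "lastModified"}
MAX_ITEMS_LIMIT = 1_000_000


def build_input(
    max_items: int = 50,
    query: Optional[str] = None,
    task_category: Optional[str] = None,
    language: Optional[str] = None,
    license_id: Optional[str] = None,
    sort: str = "trendingScore",
) -> Dict[str, Any]:
    """Build an input object (hypothetical helper), omitting unset filters."""
    # The lower bound of 1 is an assumption; the table only states the maximum.
    if not 1 <= max_items <= MAX_ITEMS_LIMIT:
        raise ValueError(f"maxItems must be between 1 and {MAX_ITEMS_LIMIT}")
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"sort must be one of {sorted(ALLOWED_SORTS)}")

    optional_filters = {
        "query": query,
        "taskCategory": task_category,
        "language": language,
        "license": license_id,
    }
    payload: Dict[str, Any] = {"maxItems": max_items, "sort": sort}
    # Keep only the filters that were actually provided.
    payload.update({key: value for key, value in optional_filters.items() if value})
    return payload


if __name__ == "__main__":
    import json

    example = build_input(
        max_items=50,
        query="text classification",
        task_category="text-classification",
        language="en",
        license_id="apache-2.0",
        sort="downloads",
    )
    print(json.dumps(example, indent=2))
```

Calling the helper with the arguments shown reproduces the example input above; the resulting JSON can be pasted into the actor's input editor.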

📊 Output

Each dataset record is returned as a structured JSON object. Here is a realistic example:

{
  "imageUrl": "https://cdn-avatars.huggingface.co/v1/production/uploads/...",
  "id": "stanfordnlp/sst2",
  "name": "SST-2 (Stanford Sentiment Treebank)",
  "url": "https://huggingface.co/datasets/stanfordnlp/sst2",
  "author": "stanfordnlp",
  "downloads": 1842039,
  "likes": 312,
  "trendingScore": 8.74,
  "taskCategories": ["text-classification"],
  "sizeCategories": ["10K<n<100K"],
  "language": ["en"],
  "license": "apache-2.0",
  "tags": ["text", "sentiment", "nlp", "classification"],
  "private": false,
  "gated": false,
  "disabled": false,
  "description": "The Stanford Sentiment Treebank consists of sentences from movie reviews...",
  "lastModified": "2024-03-10T11:22:00.000Z",
  "createdAt": "2022-06-15T08:00:00.000Z",
  "scrapedAt": "2026-02-19T09:00:00.000Z"
}

Field glossary:

  • imageUrl - Image URL returned with the record (in the example above, an avatar hosted on Hugging Face's CDN)
  • id - Unique dataset identifier in author/dataset-name format
  • name - Human-friendly display name
  • url - Full link to the dataset page
  • author - Publisher of the dataset
  • downloads - Cumulative download count (useful for gauging adoption)
  • likes - Number of users who have liked the dataset
  • trendingScore - Trending score assigned by Hugging Face, as of the time of the run
  • taskCategories - List of ML tasks this dataset supports
  • sizeCategories - Approximate row count brackets
  • language - List of language codes
  • license - License identifier
  • tags - Descriptive topic tags
  • private, gated, and disabled - Status flags indicating whether the dataset is private, gated, or disabled
  • description - Dataset summary text
  • lastModified and createdAt - ISO 8601 timestamps
  • scrapedAt - Timestamp of when this record was collected

Download your results in CSV, Excel, or JSON directly from the Apify platform.
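Once you have a JSON export, it can help to load the records into typed objects before analysis. The sketch below is illustrative only: the `DatasetRecord` class, the subset of fields it keeps, and the fallback values for missing fields are assumptions, while the field names and timestamp format follow the example record above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterable, List


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-03-10T11:22:00.000Z."""
    # Older Python versions reject a trailing "Z", so normalise it to an offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class DatasetRecord:
    """Hypothetical typed view over a subset of the output fields."""

    id: str
    author: str
    downloads: int
    likes: int
    license: str
    task_categories: List[str]
    last_modified: datetime

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "DatasetRecord":
        return cls(
            id=raw["id"],
            author=raw.get("author") or "",
            downloads=int(raw.get("downloads") or 0),
            likes=int(raw.get("likes") or 0),
            # Assumption: records without a license are labelled "unknown".
            license=raw.get("license") or "unknown",
            task_categories=list(raw.get("taskCategories") or []),
            last_modified=parse_timestamp(raw["lastModified"]),
        )


def top_by_downloads(records: Iterable[DatasetRecord], limit: int = 10) -> List[DatasetRecord]:
    """Return the most-downloaded records, highest first."""
    return sorted(records, key=lambda r: r.downloads, reverse=True)[:limit]


if __name__ == "__main__":
    sample = {
        "id": "stanfordnlp/sst2",
        "author": "stanfordnlp",
        "downloads": 1842039,
        "likes": 312,
        "license": "apache-2.0",
        "taskCategories": ["text-classification"],
        "lastModified": "2024-03-10T11:22:00.000Z",
    }
    record = DatasetRecord.from_dict(sample)
    print(record.id, record.downloads, record.last_modified.isoformat())
```

To use this with a real export, read the file with `json.load` and pass each item to `DatasetRecord.from_dict`.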

💎 Why Choose the Hugging Face Datasets Scraper?

Comprehensive metadata in one place. Manually browsing Hugging Face to gather dataset statistics is slow and impractical at scale. This tool collects every relevant metadata field in a single automated run.

Powerful filtering options. Most manual searches give you generic results. With this tool, you can combine keyword search, task category, language, license, and sort order to find exactly the datasets you need.

No authentication required. The tool works entirely against Hugging Face's public catalog. You do not need an account, access token, or special permissions.

Reliable, paginated collection. The tool uses cursor-based pagination with automatic retry logic to ensure no datasets are missed even when collecting tens of thousands of records.

Instant, structured output. Results come back as clean, normalized JSON - no raw HTML to parse or inconsistent field names to reconcile.

Built for large-scale research. Collect up to 1,000,000 dataset records in a single run, making this ideal for ecosystem-wide benchmarking and trend analysis.
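For readers curious about what cursor-based pagination with retries looks like in general, here is a minimal sketch of the pattern. It is not the actor's actual implementation: the `fetch_page` callable, the `(items, next_cursor)` page shape, the retry count, and the backoff schedule are all assumptions, and the demo uses an in-memory fake instead of calling Hugging Face.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]  # (items, cursor for the next page or None)
FetchPage = Callable[[Optional[str]], Page]


def fetch_with_retry(
    fetch_page: FetchPage,
    cursor: Optional[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Page:
    """Call fetch_page, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(cursor)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def iterate_items(
    fetch_page: FetchPage,
    max_items: int,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Iterator[Dict]:
    """Yield up to max_items items, following cursors until none remains."""
    cursor: Optional[str] = None
    yielded = 0
    while yielded < max_items:
        items, cursor = fetch_with_retry(fetch_page, cursor, max_attempts, base_delay)
        for item in items:
            if yielded >= max_items:
                return
            yield item
            yielded += 1
        if cursor is None:  # last page reached
            return


def make_fake_fetch(total: int = 7, page_size: int = 3) -> FetchPage:
    """In-memory stand-in for a paginated listing; fails once on the second page."""
    data = [{"id": f"demo/dataset-{i}"} for i in range(total)]
    state = {"failed_once": False}

    def fetch_page(cursor: Optional[str]) -> Page:
        start = int(cursor) if cursor else 0
        if start == page_size and not state["failed_once"]:
            state["failed_once"] = True
            raise ConnectionError("simulated transient failure")
        end = start + page_size
        return data[start:end], (str(end) if end < len(data) else None)

    return fetch_page


if __name__ == "__main__":
    for item in iterate_items(make_fake_fetch(), max_items=5, base_delay=0.01):
        print(item["id"])
```

Because each page is requested using the cursor returned by the previous one, a retried request resumes from the same position instead of skipping or repeating items.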

📋 How to Use

No technical skills required. Follow these steps:

  1. Sign Up: Create a free Apify account, which comes with $5 in credit.
  2. Find the Actor: Search for "Hugging Face Datasets Scraper" in the Apify Store, or navigate directly to the actor page.
  3. Set Your Filters: Enter your search query, task category, language, license, and how many results you want.
  4. Run the Tool: Click the "Start" button. The tool will begin collecting data immediately.
  5. Download Your Results: Once the run completes, download your dataset as CSV, Excel, or JSON from the Storage tab.

The tool handles pagination automatically so you can collect exactly as many datasets as you need.

🎯 Business Use Cases

AI Researchers and Data Scientists

  • Discover trending datasets in your target domain before starting a new project
  • Track download velocity to identify which datasets the community is actively using
  • Compare dataset sizes and licenses to find the best fit for your training pipeline
  • Monitor newly published datasets in a specific language or task area
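Download velocity is not a field in the output, but it can be estimated by comparing two runs. The sketch below assumes that `downloads` is cumulative, as the field glossary states, and that you have kept the JSON results of an earlier and a later run; the function name `downloads_per_day` is hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing "Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def downloads_per_day(earlier: List[Dict], later: List[Dict]) -> Dict[str, float]:
    """Estimate average downloads per day for datasets present in both runs."""
    earlier_by_id = {record["id"]: record for record in earlier}
    velocity: Dict[str, float] = {}
    for record in later:
        previous = earlier_by_id.get(record["id"])
        if previous is None:
            continue  # dataset was not in the earlier run
        elapsed = parse_timestamp(record["scrapedAt"]) - parse_timestamp(previous["scrapedAt"])
        elapsed_days = elapsed.total_seconds() / 86400
        if elapsed_days <= 0:
            continue  # runs are out of order or identical in time
        gained = record["downloads"] - previous["downloads"]
        velocity[record["id"]] = gained / elapsed_days
    return velocity


if __name__ == "__main__":
    run_1 = [{"id": "demo/a", "downloads": 1000, "scrapedAt": "2026-02-12T09:00:00.000Z"}]
    run_2 = [{"id": "demo/a", "downloads": 1700, "scrapedAt": "2026-02-19T09:00:00.000Z"}]
    print(downloads_per_day(run_1, run_2))
```

The dataset IDs and numbers in the demo are made up for illustration.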

ML Platform and Tooling Teams

  • Audit the Hugging Face ecosystem for coverage gaps across tasks or languages
  • Build internal dataset catalogs by pulling structured metadata in bulk
  • Track competitive activity by monitoring which organizations publish the most datasets
  • Identify underserved niches where new datasets could gain rapid adoption

Market Researchers and Analysts

  • Measure the growth of the open-source AI data ecosystem over time
  • Benchmark dataset publishing activity by organization or research group
  • Analyze license distribution trends across the ML community
  • Generate reports on the most-downloaded datasets in specific categories
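Analyses such as license distribution or publisher activity need only a few lines once the records are exported. The following sketch uses hypothetical helper names and treats a missing license as "unknown", which is an assumption; the `license` and `author` fields are the ones described in the Output section.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def license_distribution(records: Iterable[Dict]) -> Dict[str, float]:
    """Return each license's share of the records, most common first."""
    counts = Counter(record.get("license") or "unknown" for record in records)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}


def most_active_authors(records: Iterable[Dict], limit: int = 5) -> List[Tuple[str, int]]:
    """Return the authors with the most datasets in the collected records."""
    counts = Counter(record["author"] for record in records if record.get("author"))
    return counts.most_common(limit)


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"id": "org-a/one", "author": "org-a", "license": "apache-2.0"},
        {"id": "org-a/two", "author": "org-a", "license": "apache-2.0"},
        {"id": "org-b/one", "author": "org-b", "license": "mit"},
        {"id": "org-c/one", "author": "org-c", "license": None},
    ]
    print(license_distribution(records))
    print(most_active_authors(records, limit=2))
```

Note that the shares describe only the records you collected, so filters applied in the input will shape the distribution.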

Procurement and Compliance Teams

  • Filter datasets by license to identify those cleared for commercial use
  • Collect metadata for dataset due diligence before integrating into production systems
  • Track gated or restricted datasets that may require additional agreements

Educators and Course Creators

  • Find highly-liked, well-documented datasets suitable for teaching data science
  • Identify multilingual datasets for international curriculum development
  • Build curated lists of beginner-friendly datasets by task category

❓ FAQ

How does the tool work? The tool queries Hugging Face's public dataset catalog using their openly available listing interface. It automatically pages through results, applies your filters, and collects metadata for each matching dataset. No coding or technical setup is needed on your part.

How accurate is the data? The tool collects data directly from Hugging Face's live catalog, so download counts, likes, and trending scores reflect the state of the platform at the time of your run. For the freshest data, simply re-run the tool.

Can I collect data on a schedule? Yes. Apify supports scheduled runs. Set your actor to run daily, weekly, or at any custom interval to keep your dataset inventory up to date automatically.

Is there a limit on how many datasets I can collect? The tool supports up to 1,000,000 records per run. Set the Max Items field to control how many results you receive.

Do I need a Hugging Face account? No. The tool accesses Hugging Face's publicly available dataset catalog and does not require any login, token, or special permission.

What if I need help or have a custom requirement? Visit the Apify support page, or use the Contact section below to reach out about custom data projects.

🔗 Integrate Hugging Face Datasets Scraper with any app

Connect your dataset metadata to the tools your team already uses:

  • Make - Automate workflows without writing code
  • Zapier - Connect to 5,000+ apps instantly
  • GitHub - Trigger runs from your repositories
  • Slack - Get notifications when runs complete
  • Airbyte - Feed data directly into your data pipelines
  • Google Drive - Export results to spreadsheets automatically

You can also use webhooks to trigger downstream actions automatically whenever a new dataset collection run finishes.

Browse our complete collection of data extraction tools for more.

🆘 Need Help?

Check the FAQ section above for answers to common questions. For additional support, visit the Apify Help Center. If you need a custom data extraction solution or have a specific use case in mind, use the Contact section below to reach out directly.

📞 Contact

To request a new scraper, propose a custom data project, or report a technical issue with this actor, contact us at https://tally.so/r/BzdKgA

⚠️ Disclaimer

This Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Hugging Face or any of its subsidiaries. All trademarks mentioned are the property of their respective owners.