Hugging Face Datasets Scraper
Pricing
$5.00/month + usage
Scrape dataset metadata from Hugging Face Hub. Extract names, authors, download counts, likes, trending scores, task categories, size categories, languages, licenses, tags and descriptions. Filter by search query, task type, language, or license. Sort by trending, downloads, likes, or last modified.
Developer

ParseForge
🚀 Hugging Face Datasets Scraper
Extract structured metadata from thousands of Hugging Face datasets in minutes. Whether you are an AI researcher tracking trending datasets, a data engineer building training pipelines, or a business analyst benchmarking the ML ecosystem, this tool gives you the dataset intelligence you need without any manual browsing.
The Hugging Face Datasets Scraper connects directly to Hugging Face's public catalog and returns rich metadata for every dataset matching your search. Filter by task type, language, license, and more to get precisely the data you need. Results are instantly downloadable as CSV, Excel, or JSON, ready for further analysis or integration into your workflows.
No sign-up on Hugging Face is required. No coding knowledge is needed. Simply set your filters, run the tool, and download your results.
✨ What Does It Do
The tool collects detailed metadata for each Hugging Face dataset, including:
- Dataset ID and name - The unique identifier and human-readable name for each dataset, so you can reference and cite datasets accurately
- Direct URL - A link to the dataset's page on Hugging Face, giving you instant access to documentation and download options
- Author - The organization or individual who published the dataset, enabling you to track activity from specific research groups or companies
- Download count - Total number of times the dataset has been downloaded, so you can gauge adoption and community trust at a glance
- Likes - Community engagement score showing how many users have liked the dataset, helping you identify quality data sources quickly
- Trending score - A real-time popularity metric showing which datasets are gaining traction, so you can stay ahead of emerging trends in AI research
- Task categories - The ML tasks the dataset is designed for (e.g., text-classification, image-segmentation), enabling precise filtering for your project requirements
- Size categories - Dataset size ranges (e.g., 1K to 10K or 10K to 100K rows, reported as 1K<n<10K and 10K<n<100K), so you can select datasets that match your compute and storage budget
- Languages - Language codes for multilingual or language-specific datasets, helping global teams find the right training data for their target markets
- License - The open-source or commercial license governing dataset use, so compliance and legal teams can quickly clear datasets for production use
- Tags - Additional descriptive tags attached to the dataset, giving you richer search and filtering options beyond standard categories
- Description - The dataset's overview text as written by its author, giving you context to evaluate fit without leaving your workflow
- Last modified and created dates - Timestamps for tracking freshness and age, so you can prioritize recently updated datasets in fast-moving research areas
- Scraped at timestamp - When this record was collected, for audit and refresh tracking
🔧 Input
Configure the tool using these simple settings. All fields are optional - run it with no filters to get the most trending datasets.
| Field | Description |
|---|---|
| Max Items | How many datasets to return. Supports up to 1,000,000 records per run. |
| Search Query | A keyword or phrase to search for (e.g., "image segmentation", "clinical trials", "LLM pretraining"). |
| Task Category | Filter to a specific ML task type (e.g., text-classification, text-generation, image-classification). |
| Language | Filter by language code (e.g., en for English, zh for Chinese, fr for French). |
| License | Filter by license type (e.g., apache-2.0, mit, cc-by-4.0). |
| Sort By | Sort results by: trendingScore, downloads, likes, or lastModified. Default is trending score. Results are always sorted in descending order (highest first). |
Example input:
{
  "maxItems": 50,
  "query": "text classification",
  "taskCategory": "text-classification",
  "language": "en",
  "license": "apache-2.0",
  "sort": "downloads"
}
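The example input can also be submitted programmatically. The sketch below is an illustration, not part of the actor's documentation: it builds a request for Apify's synchronous run endpoint without sending it. The actor ID and token are placeholders (the real actor ID appears on the actor's page in the Apify Store), and `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Placeholders: substitute the real actor ID and your own Apify API token.
ACTOR_ID = "your-username~your-actor-name"
API_TOKEN = "YOUR_APIFY_TOKEN"

API_URL = "https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts a run with run_input."""
    return urllib.request.Request(
        API_URL.format(actor_id=actor_id),
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "maxItems": 50,
    "query": "text classification",
    "taskCategory": "text-classification",
    "language": "en",
    "license": "apache-2.0",
    "sort": "downloads",
}

request = build_run_request(ACTOR_ID, API_TOKEN, run_input)

# To actually start the run and read the resulting records:
# with urllib.request.urlopen(request) as response:
#     records = json.load(response)
```

The synchronous endpoint waits for the run to finish, so it suits small `maxItems` values; long runs may exceed its time limit and are better started asynchronously.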
📊 Output
Each dataset record is returned as a structured JSON object. Here is a realistic example:
{
  "imageUrl": "https://cdn-avatars.huggingface.co/v1/production/uploads/...",
  "id": "stanfordnlp/sst2",
  "name": "SST-2 (Stanford Sentiment Treebank)",
  "url": "https://huggingface.co/datasets/stanfordnlp/sst2",
  "author": "stanfordnlp",
  "downloads": 1842039,
  "likes": 312,
  "trendingScore": 8.74,
  "taskCategories": ["text-classification"],
  "sizeCategories": ["10K<n<100K"],
  "language": ["en"],
  "license": "apache-2.0",
  "tags": ["text", "sentiment", "nlp", "classification"],
  "private": false,
  "gated": false,
  "disabled": false,
  "description": "The Stanford Sentiment Treebank consists of sentences from movie reviews...",
  "lastModified": "2024-03-10T11:22:00.000Z",
  "createdAt": "2022-06-15T08:00:00.000Z",
  "scrapedAt": "2026-02-19T09:00:00.000Z"
}
Field glossary:
- id - Unique dataset identifier in author/dataset-name format
- name - Human-friendly display name
- url - Full link to the dataset page
- author - Publisher of the dataset
- downloads - Cumulative download count (great for gauging adoption)
- likes - Community upvote count
- trendingScore - Real-time trending rank assigned by Hugging Face
- taskCategories - List of ML tasks this dataset supports
- sizeCategories - Approximate row count brackets
- language - List of language codes
- license - License identifier
- tags - Descriptive topic tags
- description - Dataset summary text
- lastModified and createdAt - ISO 8601 timestamps
- scrapedAt - Timestamp of when this record was collected
Download your results in CSV, Excel, or JSON directly from the Apify platform.
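Once exported as JSON, the records can be filtered and ranked locally. The following is a minimal sketch, assuming the field names from the example record. The second record and the `top_datasets` helper are made up for illustration.

```python
import json

# A tiny stand-in for a JSON export. The first record reuses values from the
# example output; the second is a made-up record for illustration only.
exported = json.loads("""
[
  {"id": "stanfordnlp/sst2", "downloads": 1842039, "license": "apache-2.0",
   "taskCategories": ["text-classification"]},
  {"id": "example-org/example-dataset", "downloads": 1200, "license": "mit",
   "taskCategories": ["text-classification"]}
]
""")


def top_datasets(records, license_id=None, task=None, limit=10):
    """Return up to `limit` records matching the filters, most-downloaded first."""
    matches = [
        r for r in records
        if (license_id is None or r.get("license") == license_id)
        and (task is None or task in (r.get("taskCategories") or []))
    ]
    return sorted(matches, key=lambda r: r.get("downloads") or 0, reverse=True)[:limit]


apache_only = top_datasets(exported, license_id="apache-2.0")
by_task = top_datasets(exported, task="text-classification")
```

Missing or null fields are treated as non-matching (or as zero downloads) rather than raising, since not every dataset on the Hub declares a license or task.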
💎 Why Choose the Hugging Face Datasets Scraper?
Comprehensive metadata in one place. Manually browsing Hugging Face to gather dataset statistics is slow and impractical at scale. This tool collects every relevant metadata field in a single automated run.
Powerful filtering options. Most manual searches give you generic results. With this tool, you can combine keyword search, task category, language, license, and sort order to find exactly the datasets you need.
No authentication required. The tool works entirely against Hugging Face's public catalog. You do not need an account, access token, or special permissions.
Reliable, paginated collection. The tool uses cursor-based pagination with automatic retry logic to ensure no datasets are missed even when collecting tens of thousands of records.
Instant, structured output. Results come back as clean, normalized JSON - no raw HTML to parse or inconsistent field names to reconcile.
Built for large-scale research. Collect up to 1,000,000 dataset records in a single run, making this ideal for ecosystem-wide benchmarking and trend analysis.
📋 How to Use
No technical skills required. Follow these steps:
- Sign Up: Create a free Apify account, which includes $5 in credit.
- Find the Actor: Search for "Hugging Face Datasets Scraper" in the Apify Store, or navigate directly to the actor page.
- Set Your Filters: Enter your search query, task category, language, license, and how many results you want.
- Run the Tool: Click the "Start" button. The tool will begin collecting data immediately.
- Download Your Results: Once the run completes, download your dataset as CSV, Excel, or JSON from the Storage tab.
The tool handles pagination automatically so you can collect exactly as many datasets as you need.
🎯 Business Use Cases
AI Researchers and Data Scientists
- Discover trending datasets in your target domain before starting a new project
- Track download velocity to identify which datasets the community is actively using
- Compare dataset sizes and licenses to find the best fit for your training pipeline
- Monitor newly published datasets in a specific language or task area
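Download velocity can be estimated by comparing two runs of the scraper. The sketch below assumes `downloads` is cumulative, as the field glossary states, and that both snapshots carry the `id`, `downloads`, and `scrapedAt` fields. The second snapshot's values and the helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def downloads_per_day(earlier: list, later: list) -> dict:
    """Map dataset id -> average downloads gained per day between two snapshots."""
    previous = {r["id"]: r for r in earlier}
    velocity = {}
    for record in later:
        old = previous.get(record["id"])
        if old is None:
            continue  # dataset absent from the earlier snapshot
        elapsed = parse_ts(record["scrapedAt"]) - parse_ts(old["scrapedAt"])
        days = elapsed.total_seconds() / 86400
        if days > 0:
            velocity[record["id"]] = (record["downloads"] - old["downloads"]) / days
    return velocity


earlier = [{"id": "stanfordnlp/sst2", "downloads": 1842039,
            "scrapedAt": "2026-02-19T09:00:00.000Z"}]
# Hypothetical follow-up snapshot taken one week later.
later = [{"id": "stanfordnlp/sst2", "downloads": 1842739,
          "scrapedAt": "2026-02-26T09:00:00.000Z"}]

velocity = downloads_per_day(earlier, later)
print(velocity)  # {'stanfordnlp/sst2': 100.0}
```

Pairing this with a scheduled run gives a regular series of snapshots to compare.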
ML Platform and Tooling Teams
- Audit the Hugging Face ecosystem for coverage gaps across tasks or languages
- Build internal dataset catalogs by pulling structured metadata in bulk
- Track competitive activity by monitoring which organizations publish the most datasets
- Identify underserved niches where new datasets could gain rapid adoption
Market Researchers and Analysts
- Measure the growth of the open-source AI data ecosystem over time
- Benchmark dataset publishing activity by organization or research group
- Analyze license distribution trends across the ML community
- Generate reports on the most-downloaded datasets in specific categories
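License distribution, for example, reduces to a frequency count over the `license` field. This is a minimal sketch: the records are made up for illustration, and grouping datasets without a declared license under "unknown" is an assumption, not something the actor does.

```python
from collections import Counter


def license_share(records: list) -> dict:
    """Return each license's share of the records, most common first."""
    counts = Counter(r.get("license") or "unknown" for r in records)
    total = sum(counts.values())
    return {lic: n / total for lic, n in counts.most_common()}


# Made-up records for illustration only.
sample = [
    {"id": "a/one", "license": "apache-2.0"},
    {"id": "b/two", "license": "apache-2.0"},
    {"id": "c/three", "license": "mit"},
    {"id": "d/four", "license": None},
]

shares = license_share(sample)
```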
Procurement and Compliance Teams
- Filter datasets by license to identify those cleared for commercial use
- Collect metadata for dataset due diligence before integrating into production systems
- Track gated or restricted datasets that may require additional agreements
Educators and Course Creators
- Find highly liked, well-documented datasets suitable for teaching data science
- Identify multilingual datasets for international curriculum development
- Build curated lists of beginner-friendly datasets by task category
❓ FAQ
How does the tool work? The tool queries Hugging Face's public dataset catalog using their openly available listing interface. It automatically pages through results, applies your filters, and collects metadata for each matching dataset. No coding or technical setup is needed on your part.
How accurate is the data? The tool collects data directly from Hugging Face's live catalog, so download counts, likes, and trending scores reflect the state of the platform at the time of your run. For the freshest data, simply re-run the tool.
Can I collect data on a schedule? Yes. Apify supports scheduled runs. Set your actor to run daily, weekly, or at any custom interval to keep your dataset inventory up to date automatically.
Is there a limit on how many datasets I can collect? The tool supports up to 1,000,000 records per run. Set the Max Items field to control how many results you receive.
Do I need a Hugging Face account? No. The tool accesses Hugging Face's publicly available dataset catalog and does not require any login, token, or special permission.
What if I need help or have a custom requirement? Visit the Apify support page, or use the Contact section below to reach out about custom data projects.
🔗 Integrate Hugging Face Datasets Scraper with any app
Connect your dataset metadata to the tools your team already uses:
- Make - Automate workflows without writing code
- Zapier - Connect to 5000+ apps instantly
- GitHub - Trigger runs from your repositories
- Slack - Get notifications when runs complete
- Airbyte - Feed data directly into your data pipelines
- Google Drive - Export results to spreadsheets automatically
You can also use webhooks to trigger downstream actions automatically whenever a new dataset collection run finishes.
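A webhook receiver mostly needs to find out which dataset the finished run wrote to. The sketch below assumes the webhook uses Apify's default payload template, in which `eventType` names the event and `resource` holds the run object. The payload is a trimmed, made-up example, and `dataset_id_from_webhook` is a hypothetical helper.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: bytes) -> Optional[str]:
    """Return the run's default dataset ID for a succeeded run, else None."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None  # ignore failed, aborted, or timed-out runs
    return (payload.get("resource") or {}).get("defaultDatasetId")


# Trimmed, made-up request bodies for illustration.
succeeded = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"id": "run-123", "defaultDatasetId": "dataset-456"},
}).encode("utf-8")
failed = json.dumps({
    "eventType": "ACTOR.RUN.FAILED",
    "resource": {"id": "run-789", "defaultDatasetId": "dataset-000"},
}).encode("utf-8")

dataset_id = dataset_id_from_webhook(succeeded)
ignored = dataset_id_from_webhook(failed)
```

The returned ID can then be used to fetch the run's records from Apify's dataset storage.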
🔗 Recommended Actors
- Hugging Face Model Scraper - Collect metadata on AI models published to Hugging Face
- AWS Marketplace Scraper - Extract listings, vendor details, and product data from AWS Marketplace
- Hubspot Marketplace Scraper - Collect app and integration listings from the HubSpot ecosystem
- Stripe App Marketplace Scraper - Gather app details and publisher data from the Stripe App Marketplace
- HTML to JSON Smart Parser - Convert any web page into structured JSON using AI-powered extraction
Browse our complete collection of data extraction tools for more.
🆘 Need Help?
Check the FAQ section above for answers to common questions. For additional support, visit the Apify Help Center. If you need a custom data extraction solution or have a specific use case in mind, use the Contact section below to reach out directly.
📞 Contact
Contact us to request a new scraper, propose a custom data project, or report a technical issue with this actor at https://tally.so/r/BzdKgA
⚠️ Disclaimer
This Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Hugging Face or any of its subsidiaries. All trademarks mentioned are the property of their respective owners.