Hugging Face Datasets Scraper - AI Dataset Metadata
Pricing
from $2.00 / 1,000 results
Scrape Hugging Face dataset search results: dataset IDs, authors, downloads, likes, tags and update timestamps.
Developer
ben
Maintained by Community
Scrape structured data from Hugging Face Datasets in one Apify run. This actor turns a public search page or public JSON endpoint into clean rows that are ready for export, enrichment, dashboards, monitoring, and API workflows.
What does this actor do?
The Hugging Face Datasets Scraper - AI Dataset Metadata fetches live records from Hugging Face Datasets, normalizes the response, and pushes one dataset item per result. Instead of handling raw nested API responses, rate-limit retries, field cleanup, pagination details, and export formatting yourself, you get a maintained Apify actor with predictable input and output fields.
The actor is designed for practical data work. It keeps the default run small so Apify's daily checks finish quickly, but you can raise maxResults for production. It does not use a browser, residential proxy, login session, or paid unblocker. That keeps the cost low and the reliability high.
Why use this actor?
Teams often need public metadata in repeatable workflows: package monitoring, research discovery, AI dataset collection, market mapping, compliance review, enrichment, trend tracking, or competitive analysis. The hard part is not running one request; it is keeping the connector consistent, scheduled, documented, exportable, and easy for non-developers to reuse.
This actor gives you that connector. You can run it manually, schedule it, call it from the Apify API, or connect it to Make, Zapier, or n8n. You can also download the dataset as JSON, CSV, Excel, XML, RSS, or an HTML table.
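As a rough illustration of calling the actor from the Apify API, the sketch below builds (but does not send) a request to Apify's synchronous run endpoint using only the Python standard library. The actor ID `username~actor-name` and the token are placeholders, not values taken from this page, and `build_run_request` is a hypothetical helper name. The input fields are the ones described in the Input section.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, query: str,
                      max_results: int = 25) -> urllib.request.Request:
    """Build a POST request that runs the actor and returns its dataset items."""
    url = f"{APIFY_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    body = json.dumps({"query": query, "maxResults": max_results}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder actor ID and token: replace both before sending.
request = build_run_request("username~actor-name", "YOUR_APIFY_TOKEN", "sentiment")

# To actually run the actor (needs a real token and network access):
# with urllib.request.urlopen(request, timeout=300) as response:
#     rows = json.loads(response.read())
```

Building the request separately from sending it makes the payload easy to inspect or log before any paid run starts.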
Input
{"query": "sentiment", "maxResults": 25}
query is the keyword, package name, topic, entity, organization, model, or dataset term to search for. maxResults controls how many rows are pushed to the dataset. Start with a small value while testing and increase it once your workflow is stable.
Output
Every result is pushed to the default Apify dataset as a flat JSON object. Field names vary by source, but typical rows include identifiers, names or titles, descriptions, URLs, timestamps, counts, scores, owners, tags, references, and the original search term.
{"name": "example result", "description": "Normalized metadata from Hugging Face Datasets", "url": "https://example.com/result", "source": "Hugging Face Datasets", "search": "sentiment"}
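If you consume the rows as JSON and want your own CSV rather than Apify's built-in export, a minimal sketch could look like the following. Because field names can vary between rows, it uses the union of all keys as the header. The helper name `rows_to_csv` and the sample rows are illustrative assumptions, not actual actor output.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize flat rows to CSV, using the union of all keys as the header."""
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()


# Illustrative rows only; real field names depend on the actor's output.
sample_rows = [
    {"name": "example result", "url": "https://example.com/result", "search": "sentiment"},
    {"name": "another result", "url": "https://example.com/other", "source": "Hugging Face Datasets"},
]
csv_text = rows_to_csv(sample_rows)
```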
Common use cases
Use this actor for source monitoring, public data enrichment, internal search indexes, lead and account research, AI/RAG dataset preparation, technical due diligence, package ecosystem reports, research discovery, SEO content research, dashboards, and recurring CSV exports.
For developer and package sources, it helps track projects, packages, maintainers, download signals, repository links, and descriptions. For research and data sources, it helps collect papers, datasets, organizations, entities, taxonomy records, and metadata that can be joined with your own systems.
Data quality
The actor reads live public data at runtime. HTML snippets are cleaned, nested fields are flattened where useful, and each row includes a source and search field so scheduled runs can be merged safely. Lists are capped to practical sizes in the output to avoid creating oversized records.
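One way to merge scheduled runs using the source and search fields is sketched below. It assumes each row also carries a `url` that is stable enough to act as part of the key; that key choice and the helper name `merge_runs` are assumptions for illustration, so substitute whichever identifier your rows actually contain.

```python
def merge_runs(*runs: list[dict]) -> list[dict]:
    """Merge rows from several runs, keeping the latest row per (source, search, url)."""
    merged: dict[tuple, dict] = {}
    for rows in runs:
        for row in rows:
            key = (row.get("source"), row.get("search"), row.get("url"))
            merged[key] = row  # rows from later runs overwrite earlier ones
    return list(merged.values())


# Illustrative data only.
first_run = [
    {"source": "Hugging Face Datasets", "search": "sentiment",
     "url": "https://example.com/a", "description": "old"},
]
second_run = [
    {"source": "Hugging Face Datasets", "search": "sentiment",
     "url": "https://example.com/a", "description": "new"},
    {"source": "Hugging Face Datasets", "search": "sentiment",
     "url": "https://example.com/b", "description": "added later"},
]
combined = merge_runs(first_run, second_run)
```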
Because this actor uses public endpoints, data availability depends on the upstream service. If a result is removed, renamed, or updated upstream, the next run reflects that change. This is useful for monitoring workflows where freshness matters more than static snapshots.
Reliability
This is a direct HTTP actor. It avoids browser automation, CAPTCHA workflows, cookie state, and proxy dependencies. Requests use timeouts, follow redirects, retry on failure, and send a normal browser-like user agent. That makes the actor suitable for Apify scheduled runs and daily store reliability tests.
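The actor's internal retry logic is not documented here. For client code that calls similar public endpoints, a generic retry wrapper with exponential backoff might look like this sketch. The backoff policy, the defaults, and the name `with_retries` are assumptions rather than a description of the actor's implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, waiting base_delay * 2**n after failure n."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # covers urllib.error.URLError and socket timeouts
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Example with the standard library (needs network access; placeholder URL):
# import urllib.request
# body = with_retries(
#     lambda: urllib.request.urlopen("https://example.com/api", timeout=10).read()
# )
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting.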
If the upstream API changes, the actor can be patched while keeping the same Apify input and output workflow for users. Downstream systems should rely on stable identifiers and URLs where available.
Pricing
The actor uses pay-per-event pricing. A small run-start fee covers orchestration, and the result event is charged per dataset item. This keeps small monitoring jobs affordable while allowing larger exports when needed.
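To budget a run, the pay-per-event model can be approximated with simple arithmetic. The sketch below uses the listed starting price of $2.00 per 1,000 results; the run-start fee is not stated on this page, so `start_fee` is a parameter you would need to fill in yourself, and `estimate_run_cost` is a hypothetical helper.

```python
def estimate_run_cost(results: int, price_per_1000: float = 2.00,
                      start_fee: float = 0.0) -> float:
    """Estimate the cost in USD of one run: start fee plus per-result charges."""
    if results < 0:
        raise ValueError("results must be non-negative")
    return round(start_fee + results * price_per_1000 / 1000, 4)


# Default test run of 25 results, with the unknown start fee left at 0.
small_run = estimate_run_cost(25)
# Larger export with a made-up start fee, purely for illustration.
large_run = estimate_run_cost(5000, start_fee=0.01)
```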
FAQ
Does this require an API key?
No. The default workflow uses public endpoints and does not require user credentials.
Can I run it on a schedule?
Yes. Create a saved task with your query and schedule it hourly, daily, weekly, or monthly.
Can I export the data?
Yes. Apify datasets export to JSON, CSV, Excel, XML, RSS, and HTML table. You can also consume results through the Apify API.
Is this a browser scraper?
No. It uses direct HTTP requests for speed, low cost, and reliability.
Can I use it for enrichment?
Yes. Keep the identifier, URL, source, and search fields in your warehouse so you can join results with internal records.
Related actors
You might also like: StepStone Scraper, Open VSX Extensions Scraper, NuGet Package Scraper, CISA KEV Scraper, NVD CVE Scraper, Crossref Papers Scraper, Hugging Face Models Scraper, GitLab Projects Scraper, and Wikipedia Scraper.
Keywords
Hugging Face Datasets scraper, Hugging Face Datasets API, public data scraper, metadata scraper, Apify Hugging Face Datasets, JSON export, CSV export, research data, developer tools data, dataset scraper, monitoring actor, no-code data extraction, automation workflow, business data enrichment
Production workflow tips
For recurring monitoring, create separate saved tasks for your most important topics instead of one giant run. Smaller scheduled runs are easier to compare over time, easier to retry, and cheaper to debug. If you use the output for alerts, compare stable IDs first, then compare descriptions, counts, timestamps, URLs, or status fields.
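The comparison order described above (stable IDs first, then other fields) can be sketched as follows. It assumes `url` serves as the stable identifier and `description` is the field being watched; both defaults, and the name `diff_runs`, are illustrative choices and should be adapted to your rows.

```python
def diff_runs(previous: list[dict], current: list[dict], id_field: str = "url",
              watch_fields: tuple[str, ...] = ("description",)) -> dict[str, list]:
    """Compare two runs by stable ID, then by watched fields for shared IDs."""
    old = {row[id_field]: row for row in previous if id_field in row}
    new = {row[id_field]: row for row in current if id_field in row}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        key for key in new.keys() & old.keys()
        if any(old[key].get(field) != new[key].get(field) for field in watch_fields)
    )
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative data only.
yesterday = [
    {"url": "https://example.com/a", "description": "v1"},
    {"url": "https://example.com/b", "description": "same"},
]
today = [
    {"url": "https://example.com/a", "description": "v2"},
    {"url": "https://example.com/c", "description": "new"},
]
changes = diff_runs(yesterday, today)
```

An alert can then fire only when one of the three lists is non-empty.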
For data warehouse workflows, store the Apify run ID and the search field with each record. That makes it easy to trace where a row came from, rebuild a historical snapshot, or merge multiple actor outputs into one table without losing source context.
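A minimal sketch of that tagging step is shown below. The column name `apify_run_id`, the helper `annotate_rows`, and the sample run ID are hypothetical; the search value is only filled in when a row does not already carry one.

```python
def annotate_rows(rows: list[dict], run_id: str, search: str) -> list[dict]:
    """Return copies of rows tagged with the run ID and the search term."""
    return [
        {**row, "apify_run_id": run_id, "search": row.get("search", search)}
        for row in rows
    ]


# Illustrative data only; "run-123" is a made-up run ID.
raw_rows = [
    {"name": "example result", "url": "https://example.com/result"},
    {"name": "other result", "url": "https://example.com/other", "search": "emotion"},
]
tagged = annotate_rows(raw_rows, run_id="run-123", search="sentiment")
```

Returning copies leaves the original rows untouched, which helps if the same rows feed more than one destination.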
Maintenance approach
This actor intentionally uses official or public endpoints with simple response formats. There is no login to expire and no browser fingerprint to maintain. That makes it a practical part of a larger portfolio of reliable data connectors rather than a fragile one-off script.