Hugging Face Scraper — (Models, Datasets, Spaces, Papers etc.)
Pricing
from $6.00 / 1,000 results
All-in-one Hugging Face Hub scraper. Paste any URL or text query — auto-detects model, dataset, space, paper, user, org, or collection. Deep model card, lineage, evaluation results, dataset configs. MCP-ready. $0.006 per result.
Rating: 0.0 (0)
Developer: Khadin Akbar
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 15 hours ago
Hugging Face Scraper — Models, Datasets, Spaces, Papers, Users (All-in-One)
Paste any Hugging Face URL or text query — this actor auto-detects whether it is a model, dataset, space, paper, user, organization, or collection, and returns deep structured data per entity. The single most-queried source by AI coding agents and ML researchers, now in one MCP-ready actor.
What you get
| Target you paste | Returns |
|---|---|
| https://huggingface.co/meta-llama/Llama-3.1-8B | Full model card, license, downloads/likes, eval results (model-index), siblings (files), base model, adapter children, quantized children, datasets cited |
| https://huggingface.co/datasets/squad | Dataset metadata, license, configs + splits + row counts (via datasets-server), task categories, language, size buckets |
| https://huggingface.co/spaces/HuggingFaceM4/idefics_playground | Space SDK (Gradio/Streamlit/Docker/Static), hardware tier, runtime stage, siblings, README |
| https://huggingface.co/papers/2310.06825 | Paper title, abstract, authors with HF user links, upvotes, discussion count, ArXiv ID, publication date |
| https://huggingface.co/karpathy | User profile (followers, likes, PRO flag, orgs) + portfolio of models/datasets/spaces sorted by your sortBy |
| https://huggingface.co/meta-llama | Organization profile + portfolio |
| https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-65118eb... | Collection metadata + every item inside (model/dataset/space/paper references) |
| Free-text query like qwen 3 instruct | Top results from the entityType you pick (models / datasets / spaces / papers / all) |
Price: $0.006 per result returned + $0.00005 per actor start (per GB memory). Pay-per-event and Pay-per-usage both enabled — pick whichever fits your workload.
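To make the pricing arithmetic concrete, here is a rough cost estimator. It is only a sketch: the function name and parameters are hypothetical, and it assumes the start fee scales linearly with the memory (in GB) allocated to the run, which is one reading of "per GB memory".

```python
# Prices quoted in this README (USD).
PRICE_PER_RESULT_USD = 0.006
PRICE_PER_START_PER_GB_USD = 0.00005


def estimate_cost(results: int, memory_gb: float = 1.0, starts: int = 1) -> float:
    """Estimate the pay-per-event cost of one or more runs.

    Assumption: the start fee is charged once per run and multiplied
    by the memory allocated to that run, in GB.
    """
    if results < 0 or starts < 0 or memory_gb <= 0:
        raise ValueError("results and starts must be >= 0, memory_gb must be > 0")
    result_fee = results * PRICE_PER_RESULT_USD
    start_fee = starts * memory_gb * PRICE_PER_START_PER_GB_USD
    return result_fee + start_fee


# 1,000 results from a single 1 GB run: about $6.00 plus a negligible start fee.
print(f"${estimate_cost(1000):.5f}")
```

At these rates the per-result fee dominates; the start fee only matters if you launch many tiny runs.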
Why this actor
Everyone else's Hugging Face scraper hands you 8 fields and stops. This one ships:
- URL auto-detection across 7 entity types. No mode toggling.
- Deep model lineage: base models, finetune/adapter children, quantized children (GGUF/AWQ/GPTQ).
- Evaluation results: parses the full `model-index` block from the model card so you can rank by benchmark scores, not just download counts.
- Dataset configs + splits + row counts via the official `datasets-server.huggingface.co` API. Most actors skip this entirely.
- Spaces hardware & runtime stage: `t4-small`, `a10g-large`, `RUNNING`, `BUILDING`, `PAUSED`.
- Collections: every curated reading list, leaderboard, or pinned set on the Hub.
- MCP-first design: `responseFormat: "concise"` returns ~200 tokens per item so Claude/GPT can sample many results without blowing the context window.
- Stable, low-cost runtime: pure HTTP against the public HF API. No browser. No proxy churn. 99%+ success rate.
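The actor's own detection logic is not published in this README. As a rough illustration of how URL auto-detection can work, the sketch below classifies a target purely from its URL shape, using the path prefixes visible in the table above. The function name and the labels `user_or_org`, `not_actionable`, and `unsupported` are hypothetical; telling a user from an organization would need an API call, which this sketch does not make.

```python
from urllib.parse import urlparse

# Path prefixes that mark a non-model entity (taken from the URL shapes in the table above).
PREFIXES = {
    "datasets": "dataset",
    "spaces": "space",
    "papers": "paper",
    "collections": "collection",
}


def detect_target(target: str) -> str:
    """Return a coarse entity label for a Hub URL, or 'query' for free text."""
    text = target.strip()
    if not text.lower().startswith(("http://", "https://")):
        return "query"
    parsed = urlparse(text)
    # urlparse lowercases the hostname, so uppercase hosts are handled here.
    if parsed.hostname != "huggingface.co":
        return "unsupported"
    # Dropping empty segments also takes care of trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if not parts:
        return "not_actionable"
    if parts[0] in PREFIXES:
        return PREFIXES[parts[0]] if len(parts) > 1 else "not_actionable"
    if len(parts) == 1:
        return "user_or_org"
    # owner/name, optionally followed by /tree, /blob, etc.
    return "model"


for t in ["https://huggingface.co/datasets/squad", "https://huggingface.co/karpathy", "qwen 3 instruct"]:
    print(t, "->", detect_target(t))
```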
Input
    {
      "targets": [
        "https://huggingface.co/meta-llama/Llama-3.1-8B",
        "https://huggingface.co/datasets/squad",
        "https://huggingface.co/papers/2310.06825",
        "https://huggingface.co/karpathy",
        "qwen 3 instruct"
      ],
      "entityType": "models",
      "resultsPerTarget": 50,
      "sortBy": "downloads",
      "filterTask": "text-generation",
      "filterLibrary": "transformers",
      "filterLanguage": "en",
      "includeReadme": true,
      "responseFormat": "detailed"
    }
- `targets` (required): array of URLs and/or text queries. Mix freely.
- `entityType`: when a target is a text query, search this entity type (`models` / `datasets` / `spaces` / `papers` / `all`). Ignored for URL targets.
- `resultsPerTarget`: cap per target. URL targets always return 1 record (+ child items for collections/profiles). Default 50, max 500.
- `sortBy`: `downloads`, `likes`, `modified`, `trending`. Drives search results and profile portfolios.
- `filterTask` / `filterLibrary` / `filterLanguage`: model/dataset filters (e.g. `text-generation`, `transformers`, `gguf`, `en`).
- `includeReadme`: when true, fetches the full Markdown README. Turn this off and bulk searches stay cheap.
- `responseFormat`: `detailed` returns every parsed field. `concise` returns ~200 tokens/item for AI agents.
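If you build the input programmatically, it can help to check it on the client side before starting a run. The helper below is a hypothetical sketch, not part of the actor: it uses only the field names and limits documented above, drops blank targets, and clamps `resultsPerTarget` to the documented maximum of 500. How the actor itself treats out-of-range values is not stated here.

```python
ALLOWED_ENTITY_TYPES = {"models", "datasets", "spaces", "papers", "all"}
ALLOWED_SORT = {"downloads", "likes", "modified", "trending"}
ALLOWED_FORMATS = {"detailed", "concise"}
MAX_RESULTS_PER_TARGET = 500  # documented maximum


def build_run_input(
    targets,
    entity_type="models",
    results_per_target=50,
    sort_by="downloads",
    include_readme=False,
    response_format="detailed",
):
    """Build an input dict for the actor, rejecting values the README does not list."""
    cleaned = [t.strip() for t in targets if t and t.strip()]
    if not cleaned:
        raise ValueError("targets must contain at least one non-empty URL or query")
    if entity_type not in ALLOWED_ENTITY_TYPES:
        raise ValueError(f"unknown entityType: {entity_type!r}")
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"unknown sortBy: {sort_by!r}")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown responseFormat: {response_format!r}")
    return {
        "targets": cleaned,
        "entityType": entity_type,
        # Clamp into the documented 1..500 range.
        "resultsPerTarget": max(1, min(int(results_per_target), MAX_RESULTS_PER_TARGET)),
        "sortBy": sort_by,
        "includeReadme": bool(include_readme),
        "responseFormat": response_format,
    }


print(build_run_input(["qwen 3 instruct"], results_per_target=25, response_format="concise"))
```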
Optional: higher rate limit
Hugging Face allows ~1,000 requests/hour without auth. Set the HF_TOKEN environment variable on the actor (Console → Settings → Environment variables) with a token from https://huggingface.co/settings/tokens to raise the cap to ~5,000/h.
Output
Mixed dataset — each record has an itemType discriminator (model, dataset, space, paper, user, org, collection, collection_item, search_result). Pre-built dataset views in the Output tab let you slice by entity.
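Because one run can return several entity types, client code usually needs to split the items by `itemType` first. A minimal sketch, assuming `items` is the list of record dicts read from the run's dataset (the helper name is hypothetical):

```python
from collections import defaultdict


def group_by_item_type(items):
    """Bucket records by their itemType discriminator."""
    groups = defaultdict(list)
    for item in items:
        # Records without the discriminator are kept under "unknown".
        groups[item.get("itemType", "unknown")].append(item)
    return dict(groups)


sample = [
    {"itemType": "model", "id": "meta-llama/Llama-3.1-8B"},
    {"itemType": "dataset", "id": "squad"},
    {"itemType": "model", "id": "someone/llama-3.1-8b-lora-medical"},
]
for item_type, records in group_by_item_type(sample).items():
    print(item_type, len(records))
```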
Model record (excerpt)
    {
      "itemType": "model",
      "id": "meta-llama/Llama-3.1-8B",
      "url": "https://huggingface.co/meta-llama/Llama-3.1-8B",
      "author": "meta-llama",
      "downloads": 12345678,
      "likes": 4321,
      "pipelineTag": "text-generation",
      "libraryName": "transformers",
      "license": "llama3.1",
      "tags": ["llama-3", "text-generation", "facebook"],
      "language": ["en"],
      "lastModified": "2026-04-22T18:51:00.000Z",
      "siblings": [{ "rfilename": "config.json", "size": 1234, "lfs": null }],
      "modelIndex": [{ "name": "...", "results": [] }],
      "baseModels": [],
      "adapterChildren": ["someone/llama-3.1-8b-lora-medical"],
      "quantizedChildren": ["bartowski/Meta-Llama-3.1-8B-GGUF"],
      "datasetsUsed": ["allenai/c4", "EleutherAI/pile"],
      "readme": "# Model Card for Llama 3.1 8B..."
    }
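Ranking models by a benchmark score means walking the `modelIndex` field. The excerpt above shows an empty `results` list, so the sketch below assumes each result follows the Hugging Face model card `model-index` layout, where a result carries a `metrics` list of `{type, value}` entries. Whether the actor preserves exactly that shape is an assumption, and both helper names are hypothetical.

```python
def best_metric(record, metric_type):
    """Return the highest numeric value reported for metric_type, or None."""
    values = []
    for entry in record.get("modelIndex") or []:
        for result in entry.get("results") or []:
            # Assumed shape: result["metrics"] = [{"type": ..., "value": ...}, ...]
            for metric in result.get("metrics") or []:
                value = metric.get("value")
                is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
                if metric.get("type") == metric_type and is_number:
                    values.append(float(value))
    return max(values) if values else None


def rank_by_metric(records, metric_type):
    """Return model ids sorted by best score, skipping models without that metric."""
    scored = [(best_metric(r, metric_type), r["id"]) for r in records]
    scored = [pair for pair in scored if pair[0] is not None]
    return [model_id for _, model_id in sorted(scored, key=lambda p: p[0], reverse=True)]


# Made-up records for illustration only.
records = [
    {"id": "a/model", "modelIndex": [{"name": "a", "results": [{"metrics": [{"type": "accuracy", "value": 0.71}]}]}]},
    {"id": "b/model", "modelIndex": [{"name": "b", "results": [{"metrics": [{"type": "accuracy", "value": 0.83}]}]}]},
    {"id": "c/model", "modelIndex": [{"name": "c", "results": []}]},
]
print(rank_by_metric(records, "accuracy"))
```

Models with no matching metric are left out of the ranking rather than scored as zero, so a missing evaluation is not mistaken for a poor one.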
Dataset record (excerpt)
    {
      "itemType": "dataset",
      "id": "squad",
      "url": "https://huggingface.co/datasets/squad",
      "downloads": 987654,
      "taskCategories": ["question-answering"],
      "language": ["en"],
      "sizeCategories": ["10K<n<100K"],
      "configs": [
        {
          "config": "plain_text",
          "splits": [
            { "name": "train", "numExamples": 87599 },
            { "name": "validation", "numExamples": 10570 }
          ]
        }
      ]
    }
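The nested `configs` and `splits` structure makes it easy to compute dataset sizes. A small sketch using the field names from the excerpt above (the helper name is hypothetical, and missing counts are treated as zero):

```python
def total_rows(record):
    """Sum numExamples over every split of every config in a dataset record."""
    return sum(
        split.get("numExamples") or 0
        for config in record.get("configs") or []
        for split in config.get("splits") or []
    )


squad = {
    "itemType": "dataset",
    "id": "squad",
    "configs": [
        {
            "config": "plain_text",
            "splits": [
                {"name": "train", "numExamples": 87599},
                {"name": "validation", "numExamples": 10570},
            ],
        }
    ],
}
print(total_rows(squad))  # 87599 + 10570 = 98169
```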
Use it from code
JavaScript / TypeScript
    import { ApifyClient } from 'apify-client';

    // Authenticate with the Apify API token from the environment.
    const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

    // Start the actor and wait for the run to finish.
    const run = await client.actor('khadinakbar/huggingface-all-in-one-scraper').call({
        targets: ['https://huggingface.co/meta-llama/Llama-3.1-8B', 'qwen 3 instruct'],
        entityType: 'models',
        resultsPerTarget: 25,
        responseFormat: 'concise',
    });

    // Read the records the run pushed to its default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items);
Python
    from apify_client import ApifyClient

    # Authenticate with your Apify API token.
    client = ApifyClient(token="apify_api_...")

    # Start the actor and wait for the run to finish.
    run = client.actor("khadinakbar/huggingface-all-in-one-scraper").call(run_input={
        "targets": ["https://huggingface.co/datasets/squad", "image classification"],
        "entityType": "all",
        "resultsPerTarget": 50,
    })

    # Collect every record from the run's default dataset.
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
From Claude / GPT via MCP
This actor is exposed as apify--huggingface-all-in-one-scraper on the Apify MCP server. Any MCP-capable agent (Claude Desktop, ChatGPT custom GPTs with MCP, Cursor, Windsurf) can call it directly — set responseFormat: "concise" so item tokens stay small.
FAQ
Does it require a Hugging Face account?
No. The actor uses public API endpoints. Add an HF_TOKEN env var only if you want higher rate limits.
Can it scrape gated or private repos?
No. The scraper only sees what the public API exposes. Gated repos return their public metadata (license, tags) but no file contents. Private repos return 401.
How fresh is the data?
Real-time: every record is a fresh API hit, not a cached crawl.
Does it pull discussions / community tabs?
Discussions are not part of v1. Papers include their HF discussion count (commentsCount).
What's the difference between this and apify/rag-web-browser?
rag-web-browser fetches arbitrary URLs and returns Markdown. This actor returns structured records with typed fields (downloads as int, lastModified as ISO 8601, evaluation results parsed into model-index). Use this when you need data, not prose.
Why is the price higher than some other HF scrapers?
You get deeper extraction (model lineage, eval results, dataset configs, full README), URL auto-detection across 7 entity types, and MCP-first design. The cheapest scrapers return ~8 fields with no lineage and no eval.
Legal & TOS
Hugging Face's public API and content are designed for programmatic access — the Hub publishes llms.txt and OpenAPI specs explicitly for AI consumption. This actor only hits public, unauthenticated endpoints and respects rate limits. You are responsible for complying with the licenses of any models, datasets, or papers you retrieve. Gated content is not bypassed.
Changelog
1.2 (2026-05-29)
- Fix: user and organization records now report accurate `modelsCount`, `datasetsCount`, `spacesCount` (HF API does not expose `x-total-count`; uses `limit=1000` array length).
1.1 (2026-05-29)
- Reliability hardening — 620-record brutal test battery, 100% success rate, 0 schema validation failures.
- Fix: `gated: false` boolean from HF API normalized through `normalizeGated()` (was failing detailed-mode pushes for non-gated repos).
- Fix: dataset counts no longer increment on failed pushes (moved counter after `pushData` success).
- Fix: per-record try/catch, so one bad portfolio entry no longer fails the whole user/org target.
- Fix: empty/whitespace targets surface as explicit warnings instead of silent drops.
- Fix: URL paths with `/tree`, `/blob`, `/commits`, `/discussions`, `/settings` suffix strip correctly.
- Fix: trailing slash + uppercase hostname normalize.
- Fix: bare `https://huggingface.co/` returns clear "not actionable" warning.
- Add: `sanitizeRecord()` strips undefined fields before push.
- Add: `dataset_schema` `gated` accepts boolean defensively.
- Add: concise mode summary field truncates at 600 chars to keep token budget.
1.0 (2026-05-28)
- Initial release.
- URL auto-detection across models, datasets, spaces, papers, users, organizations, collections.
- Deep model card extraction with base/adapter/quantized lineage, evaluation results, dataset citations.
- Dataset configs + splits + row counts via `datasets-server`.
- Space SDK + hardware + runtime stage.
- `concise` / `detailed` response modes for AI-agent vs human consumers.
- Optional `HF_TOKEN` for 5K req/h rate limit.