Hugging Face Scraper — (Models, Datasets, Spaces, Papers etc.)

Pricing

from $6.00 / 1,000 results

All-in-one Hugging Face Hub scraper. Paste any URL or text query — auto-detects model, dataset, space, paper, user, org, or collection. Deep model card, lineage, evaluation results, dataset configs. MCP-ready. $0.006 per result.

Rating

0.0

(0)

Developer

Khadin Akbar

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

15 hours ago

Last modified

Hugging Face Scraper — Models, Datasets, Spaces, Papers, Users (All-in-One)

Paste any Hugging Face URL or text query — this actor auto-detects whether it is a model, dataset, space, paper, user, organization, or collection, and returns deep structured data per entity. The single most-queried source by AI coding agents and ML researchers, now in one MCP-ready actor.

What you get

| Target you paste | Returns |
| --- | --- |
| https://huggingface.co/meta-llama/Llama-3.1-8B | Full model card, license, downloads/likes, eval results (model-index), siblings (files), base model, adapter children, quantized children, datasets cited |
| https://huggingface.co/datasets/squad | Dataset metadata, license, configs + splits + row counts (via datasets-server), task categories, language, size buckets |
| https://huggingface.co/spaces/HuggingFaceM4/idefics_playground | Space SDK (Gradio/Streamlit/Docker/Static), hardware tier, runtime stage, siblings, README |
| https://huggingface.co/papers/2310.06825 | Paper title, abstract, authors with HF user links, upvotes, discussion count, ArXiv ID, publication date |
| https://huggingface.co/karpathy | User profile (followers, likes, PRO flag, orgs) + portfolio of models/datasets/spaces sorted by your sortBy |
| https://huggingface.co/meta-llama | Organization profile + portfolio |
| https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-65118eb... | Collection metadata + every item inside (model/dataset/space/paper references) |
| Free-text query like qwen 3 instruct | Top results from the entityType you pick (models / datasets / spaces / papers / all) |

Price: $0.006 per result returned + $0.00005 per actor start (per GB memory). Pay-per-event and Pay-per-usage both enabled — pick whichever fits your workload.
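
As a rough worked example of the pricing line above, the sketch below estimates the pay-per-event cost of a run. The function name `estimate_cost_usd` is hypothetical, and it assumes the start fee scales linearly with the memory (in GB) allocated to the run, which is one reading of "per GB memory" rather than something taken from Apify's billing code.

```python
RESULT_PRICE_USD = 0.006    # per result returned (from the pricing line)
START_PRICE_USD = 0.00005   # per actor start, per GB of memory


def estimate_cost_usd(results: int, memory_gb: float = 1.0, starts: int = 1) -> float:
    """Estimate pay-per-event cost.

    Assumption: the start fee is multiplied by the memory allocated in GB.
    """
    if results < 0 or starts < 0 or memory_gb <= 0:
        raise ValueError("results and starts must be >= 0 and memory_gb must be > 0")
    return results * RESULT_PRICE_USD + starts * memory_gb * START_PRICE_USD


if __name__ == "__main__":
    # 1,000 results on a 1 GB run: about six dollars plus a negligible start fee.
    print(f"${estimate_cost_usd(1000, memory_gb=1.0):.5f}")
```

Under this reading the per-result charge dominates, so capping resultsPerTarget is the main cost lever.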

Why this actor

Everyone else's Hugging Face scraper hands you 8 fields and stops. This one ships:

  • URL auto-detection across 7 entity types. No mode toggling.
  • Deep model lineage — base models, finetune/adapter children, quantized children (GGUF/AWQ/GPTQ).
  • Evaluation results — parses the full model-index block from the model card so you can rank by benchmark scores, not just download counts.
  • Dataset configs + splits + row counts via the official datasets-server.huggingface.co API — most actors skip this entirely.
  • Spaces hardware & runtime stage: t4-small, a10g-large, RUNNING, BUILDING, PAUSED.
  • Collections — every curated reading list, leaderboard, or pinned set on the Hub.
  • MCP-first design. responseFormat: "concise" returns ~200 tokens per item so Claude/GPT can sample many results without blowing the context window.
  • Stable, low-cost runtime — pure HTTP against the public HF API. No browser. No proxy churn. 99%+ success rate.
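
To make the URL auto-detection idea concrete, here is a minimal sketch of how a target could be classified from its path alone. It is not the actor's actual code: `detect_entity`, the prefix map, and the `user_or_org` / `not_actionable` labels are all hypothetical. It also assumes that users and organizations cannot be told apart from the URL and need a follow-up API lookup.

```python
from urllib.parse import urlparse

# Path prefixes that name an entity type directly (illustrative mapping).
PREFIXES = {
    "datasets": "dataset",
    "spaces": "space",
    "papers": "paper",
    "collections": "collection",
}
# Repo sub-pages to cut off before building the identifier.
SUFFIXES = {"tree", "blob", "commits", "discussions", "settings"}


def detect_entity(target: str) -> tuple[str, str]:
    """Return (entity_type, identifier) for a URL, or ("query", text) otherwise."""
    text = target.strip()
    parsed = urlparse(text)
    # urlparse lowercases scheme and hostname, so HTTPS://HuggingFace.co/ also matches.
    if parsed.scheme not in ("http", "https") or parsed.hostname != "huggingface.co":
        return ("query", text)
    parts = [p for p in parsed.path.split("/") if p]  # also drops trailing slashes
    kind = PREFIXES.get(parts[0]) if parts else None
    if kind:
        parts = parts[1:]
    if not parts:
        return ("not_actionable", "")
    # Cut at the first sub-page segment, e.g. /tree/main or /blob/main/config.json.
    for i, part in enumerate(parts):
        if i >= 1 and part in SUFFIXES:
            parts = parts[:i]
            break
    if kind is None:
        # One segment is a profile page; two segments is owner/model-name.
        kind = "user_or_org" if len(parts) == 1 else "model"
    return (kind, "/".join(parts))


if __name__ == "__main__":
    print(detect_entity("https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main"))
    print(detect_entity("qwen 3 instruct"))
```

A path-only classifier like this will misread a repository whose name happens to equal one of the suffixes, so a real implementation would likely confirm the guess against the API.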

Input

{
  "targets": [
    "https://huggingface.co/meta-llama/Llama-3.1-8B",
    "https://huggingface.co/datasets/squad",
    "https://huggingface.co/papers/2310.06825",
    "https://huggingface.co/karpathy",
    "qwen 3 instruct"
  ],
  "entityType": "models",
  "resultsPerTarget": 50,
  "sortBy": "downloads",
  "filterTask": "text-generation",
  "filterLibrary": "transformers",
  "filterLanguage": "en",
  "includeReadme": true,
  "responseFormat": "detailed"
}
  • targets (required) — array of URLs and/or text queries. Mix freely.
  • entityType — when a target is a text query, search this entity type (models / datasets / spaces / papers / all). Ignored for URL targets.
  • resultsPerTarget — cap per target. URL targets always return 1 record (+ child items for collections/profiles). Default 50, max 500.
  • sortBy: one of downloads, likes, modified, trending. Drives search results and profile portfolios.
  • filterTask / filterLibrary / filterLanguage — model/dataset filters (e.g. text-generation, transformers, gguf, en).
  • includeReadme: when true, fetches the full Markdown README. Turn this off and bulk searches stay cheap.
  • responseFormat: detailed returns every parsed field; concise returns ~200 tokens/item for AI agents.
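
The parameter rules above can be checked client-side before a run is started. The sketch below is a hypothetical helper (`build_run_input` is not part of the actor or of apify-client). It uses only the allowed values and the 50/500 default and maximum listed above; the lower bound of 1 for resultsPerTarget is an assumption.

```python
ALLOWED_VALUES = {
    "entityType": {"models", "datasets", "spaces", "papers", "all"},
    "sortBy": {"downloads", "likes", "modified", "trending"},
    "responseFormat": {"detailed", "concise"},
}
DEFAULT_RESULTS_PER_TARGET = 50   # documented default
MAX_RESULTS_PER_TARGET = 500      # documented maximum


def build_run_input(targets, **options):
    """Validate options locally and return a run-input dict (illustrative only)."""
    # Drop empty or whitespace-only targets instead of sending them to the actor.
    cleaned = [t.strip() for t in targets if isinstance(t, str) and t.strip()]
    if not cleaned:
        raise ValueError("targets needs at least one non-empty URL or text query")
    for key, allowed in ALLOWED_VALUES.items():
        if key in options and options[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    limit = options.get("resultsPerTarget", DEFAULT_RESULTS_PER_TARGET)
    # Assumption: the minimum is 1; only the default and maximum are documented.
    if not 1 <= limit <= MAX_RESULTS_PER_TARGET:
        raise ValueError("resultsPerTarget must be between 1 and 500")
    return {**options, "targets": cleaned, "resultsPerTarget": limit}


if __name__ == "__main__":
    print(build_run_input(["qwen 3 instruct"], entityType="models", responseFormat="concise"))
```

The returned dict can be passed as `run_input` to the Python client call shown under "Use it from code".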

Optional: higher rate limit

Hugging Face allows ~1,000 requests/hour without auth. Set the HF_TOKEN environment variable on the actor (Console → Settings → Environment variables) with a token from https://huggingface.co/settings/tokens to raise the cap to ~5,000/h.
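
For planning large runs, the two approximate caps above translate into a minimum spacing between requests and a lower bound on run time. The helper names below are hypothetical, and the arithmetic assumes requests are spread evenly over the hour.

```python
ANONYMOUS_LIMIT_PER_HOUR = 1_000   # approximate cap without a token
TOKEN_LIMIT_PER_HOUR = 5_000       # approximate cap with HF_TOKEN set


def min_request_interval(has_token: bool) -> float:
    """Seconds between requests needed to stay under the hourly cap."""
    limit = TOKEN_LIMIT_PER_HOUR if has_token else ANONYMOUS_LIMIT_PER_HOUR
    return 3600 / limit


def hours_needed(total_requests: int, has_token: bool) -> float:
    """Lower bound on wall-clock hours for a given number of API requests."""
    limit = TOKEN_LIMIT_PER_HOUR if has_token else ANONYMOUS_LIMIT_PER_HOUR
    return total_requests / limit


if __name__ == "__main__":
    for has_token in (False, True):
        print(has_token, min_request_interval(has_token), hours_needed(10_000, has_token))
```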

Output

Mixed dataset — each record has an itemType discriminator (model, dataset, space, paper, user, org, collection, collection_item, search_result). Pre-built dataset views in the Output tab let you slice by entity.
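
Because every record carries an itemType discriminator, a mixed result set can be split by entity on the client. The sketch below assumes `items` is the list of dicts returned by the dataset client; `group_by_item_type` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_item_type(items):
    """Bucket records by their itemType; records without one go under "unknown"."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("itemType", "unknown")].append(item)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"itemType": "model", "id": "meta-llama/Llama-3.1-8B"},
        {"itemType": "dataset", "id": "squad"},
        {"itemType": "model", "id": "someone/llama-3.1-8b-lora-medical"},
    ]
    for item_type, records in group_by_item_type(sample).items():
        print(item_type, [r["id"] for r in records])
```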

Model record (excerpt)

{
  "itemType": "model",
  "id": "meta-llama/Llama-3.1-8B",
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B",
  "author": "meta-llama",
  "downloads": 12345678,
  "likes": 4321,
  "pipelineTag": "text-generation",
  "libraryName": "transformers",
  "license": "llama3.1",
  "tags": ["llama-3", "text-generation", "facebook"],
  "language": ["en"],
  "lastModified": "2026-04-22T18:51:00.000Z",
  "siblings": [{ "rfilename": "config.json", "size": 1234, "lfs": null }],
  "modelIndex": [{ "name": "...", "results": [] }],
  "baseModels": [],
  "adapterChildren": ["someone/llama-3.1-8b-lora-medical"],
  "quantizedChildren": ["bartowski/Meta-Llama-3.1-8B-GGUF"],
  "datasetsUsed": ["allenai/c4", "EleutherAI/pile"],
  "readme": "# Model Card for Llama 3.1 8B..."
}
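
Ranking models by a benchmark score, as opposed to download counts, could look like the sketch below. The excerpt only shows an empty results array, so the nested shape used here (each result holding a metrics list of {type, value} objects, as in the Hub's model-index card metadata) is an assumption about the actor's output. `metric_values` and `rank_by_metric` are hypothetical helpers.

```python
def metric_values(record, metric_type):
    """Collect numeric values of one metric type from a record's modelIndex.

    Assumes results follow the model-index layout:
    results[].metrics[] = {"type": ..., "value": ...}.
    """
    values = []
    for entry in record.get("modelIndex") or []:
        for result in entry.get("results") or []:
            for metric in result.get("metrics") or []:
                value = metric.get("value")
                if metric.get("type") == metric_type and isinstance(value, (int, float)):
                    values.append(float(value))
    return values


def rank_by_metric(records, metric_type):
    """Return model ids sorted by their best score for metric_type, highest first."""
    scored = []
    for record in records:
        values = metric_values(record, metric_type)
        if values:  # skip models that do not report this metric
            scored.append((max(values), record["id"]))
    return [model_id for _, model_id in sorted(scored, reverse=True)]


if __name__ == "__main__":
    records = [
        {"id": "org/a", "modelIndex": [{"results": [{"metrics": [{"type": "accuracy", "value": 0.71}]}]}]},
        {"id": "org/b", "modelIndex": [{"results": [{"metrics": [{"type": "accuracy", "value": 0.83}]}]}]},
        {"id": "org/c", "modelIndex": [{"name": "...", "results": []}]},
    ]
    print(rank_by_metric(records, "accuracy"))
```

Scores reported on different datasets are not comparable, so a real ranking would also filter on the dataset named in each result.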

Dataset record (excerpt)

{
  "itemType": "dataset",
  "id": "squad",
  "url": "https://huggingface.co/datasets/squad",
  "downloads": 987654,
  "taskCategories": ["question-answering"],
  "language": ["en"],
  "sizeCategories": ["10K<n<100K"],
  "configs": [
    {
      "config": "plain_text",
      "splits": [
        { "name": "train", "numExamples": 87599 },
        { "name": "validation", "numExamples": 10570 }
      ]
    }
  ]
}
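
The configs array makes it straightforward to total row counts per split. The sketch below uses the field names from the excerpt above (configs, splits, name, numExamples); `rows_per_split` is a hypothetical helper, and splits with a missing count are treated as zero.

```python
def rows_per_split(record):
    """Sum numExamples per split name across all configs of a dataset record."""
    totals = {}
    for config in record.get("configs") or []:
        for split in config.get("splits") or []:
            name = split.get("name")
            totals[name] = totals.get(name, 0) + (split.get("numExamples") or 0)
    return totals


if __name__ == "__main__":
    squad = {
        "itemType": "dataset",
        "id": "squad",
        "configs": [
            {
                "config": "plain_text",
                "splits": [
                    {"name": "train", "numExamples": 87599},
                    {"name": "validation", "numExamples": 10570},
                ],
            }
        ],
    }
    totals = rows_per_split(squad)
    print(totals, sum(totals.values()))
```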

Use it from code

JavaScript / TypeScript

import { ApifyClient } from 'apify-client';

// Authenticate with the Apify API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('khadinakbar/huggingface-all-in-one-scraper').call({
    targets: ['https://huggingface.co/meta-llama/Llama-3.1-8B', 'qwen 3 instruct'],
    entityType: 'models',
    resultsPerTarget: 25,
    responseFormat: 'concise',
});

// Read the records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient(token="apify_api_...")

# Start the actor and wait for the run to finish.
run = client.actor("khadinakbar/huggingface-all-in-one-scraper").call(
    run_input={
        "targets": ["https://huggingface.co/datasets/squad", "image classification"],
        "entityType": "all",
        "resultsPerTarget": 50,
    }
)

# Read the records from the run's default dataset.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

From Claude / GPT via MCP

This actor is exposed as apify--huggingface-all-in-one-scraper on the Apify MCP server. Any MCP-capable agent (Claude Desktop, ChatGPT custom GPTs with MCP, Cursor, Windsurf) can call it directly — set responseFormat: "concise" so item tokens stay small.

FAQ

Does it require a Hugging Face account? No. The actor uses public API endpoints. Add an HF_TOKEN env var only if you want higher rate limits.

Can it scrape gated or private repos? No. The scraper only sees what the public API exposes. Gated repos return their public metadata (license, tags) but no file contents. Private repos return 401.

How fresh is the data? Real-time — every record is a fresh API hit, not a cached crawl.

Does it pull discussions / community tabs? Discussions are not part of v1. Papers include their HF discussion count (commentsCount).

What's the difference between this and apify/rag-web-browser? rag-web-browser fetches arbitrary URLs and returns Markdown. This actor returns structured records with typed fields (downloads as int, lastModified as ISO 8601, evaluation results parsed from the model-index block). Use this when you need data, not prose.

Why is the price higher than some other HF scrapers? You get deeper extraction (model lineage, eval results, dataset configs, full README), URL auto-detection across 7 entity types, and MCP-first design. The cheapest scrapers return ~8 fields with no lineage and no eval.

Hugging Face's public API and content are designed for programmatic access — the Hub publishes llms.txt and OpenAPI specs explicitly for AI consumption. This actor only hits public, unauthenticated endpoints and respects rate limits. You are responsible for complying with the licenses of any models, datasets, or papers you retrieve. Gated content is not bypassed.

Changelog

1.2 (2026-05-29)

  • Fix: user and organization records now report accurate modelsCount, datasetsCount, spacesCount (HF API does not expose x-total-count; uses limit=1000 array length).

1.1 (2026-05-29)

  • Reliability hardening — 620-record brutal test battery, 100% success rate, 0 schema validation failures.
  • Fix: gated: false boolean from HF API normalized through normalizeGated() (was failing detailed-mode pushes for non-gated repos).
  • Fix: dataset counts no longer increment on failed pushes (moved counter after pushData success).
  • Fix: per-record try/catch — one bad portfolio entry no longer fails the whole user/org target.
  • Fix: empty/whitespace targets surface as explicit warnings instead of silent drops.
  • Fix: URL paths with a /tree, /blob, /commits, /discussions, or /settings suffix are stripped correctly.
  • Fix: trailing slashes and uppercase hostnames are normalized.
  • Fix: bare https://huggingface.co/ returns a clear "not actionable" warning.
  • Add: sanitizeRecord() strips undefined fields before push.
  • Add: dataset_schema gated accepts boolean defensively.
  • Add: concise mode summary field truncates at 600 chars to keep token budget.

1.0 (2026-05-28)

  • Initial release.
  • URL auto-detection across models, datasets, spaces, papers, users, organizations, collections.
  • Deep model card extraction with base/adapter/quantized lineage, evaluation results, dataset citations.
  • Dataset configs + splits + row counts via datasets-server.
  • Space SDK + hardware + runtime stage.
  • concise / detailed response modes for AI-agent vs human consumers.
  • Optional HF_TOKEN for 5K req/h rate limit.