Pricing

from $6.00 / 1,000 results

Hugging Face Scraper — Models, Datasets & Spaces

All-in-one Hugging Face Hub scraper. Paste any URL or text query — auto-detects model, dataset, space, paper, user, org, or collection. Deep model card, lineage, evaluation results, dataset configs. MCP-ready. $0.006 per result.

Developer

Khadin Akbar

Last modified: 2 days ago

What you get

Target you paste	Returns
`https://huggingface.co/meta-llama/Llama-3.1-8B`	Full model card, license, downloads/likes, eval results (`model-index`), siblings (files), base model, adapter children, quantized children, datasets cited
`https://huggingface.co/datasets/squad`	Dataset metadata, license, configs + splits + row counts (via `datasets-server`), task categories, language, size buckets
`https://huggingface.co/spaces/HuggingFaceM4/idefics_playground`	Space SDK (Gradio/Streamlit/Docker/Static), hardware tier, runtime stage, siblings, README
`https://huggingface.co/papers/2310.06825`	Paper title, abstract, authors with HF user links, upvotes, discussion count, ArXiv ID, publication date
`https://huggingface.co/karpathy`	User profile (followers, likes, PRO flag, orgs) + portfolio of models/datasets/spaces sorted by your `sortBy`
`https://huggingface.co/meta-llama`	Organization profile + portfolio
`https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-65118eb...`	Collection metadata + every item inside (model/dataset/space/paper references)
Free-text query like `qwen 3 instruct`	Top results from the `entityType` you pick (models / datasets / spaces / papers / all)

Price: $0.006 per result returned + $0.00005 per actor start (per GB memory). Pay-per-event and Pay-per-usage both enabled — pick whichever fits your workload.
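As a rough budgeting aid, the pay-per-event arithmetic above can be sketched in a few lines. The helper name `estimate_cost` and its parameters are hypothetical, and the sketch assumes the start fee scales linearly with the memory (in GB) allocated to the run, which is one reading of "per GB memory". Your actual invoice may differ.

```python
from decimal import Decimal

# Prices quoted in the pricing line above.
PRICE_PER_RESULT = Decimal("0.006")
PRICE_PER_START_PER_GB = Decimal("0.00005")


def estimate_cost(results: int, memory_gb: int = 1, starts: int = 1) -> Decimal:
    """Estimate the pay-per-event cost in USD (hypothetical helper).

    Assumes the start fee is charged once per run and multiplied by the
    memory allocated to that run in GB.
    """
    if results < 0 or starts < 0 or memory_gb <= 0:
        raise ValueError("results and starts must be >= 0, memory_gb must be > 0")
    result_fee = PRICE_PER_RESULT * results
    start_fee = PRICE_PER_START_PER_GB * memory_gb * starts
    return result_fee + start_fee


# 1,000 results from a single 1 GB run.
print(estimate_cost(1000))  # prints 6.00005
```

`Decimal` is used instead of `float` so that small per-event prices add up without binary rounding noise.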

Why this actor

Everyone else's Hugging Face scraper hands you 8 fields and stops. This one ships:

URL auto-detection across 7 entity types. No mode toggling.
Deep model lineage — base models, finetune/adapter children, quantized children (GGUF/AWQ/GPTQ).
Evaluation results — parses the full model-index block from the model card so you can rank by benchmark scores, not just download counts.
Dataset configs + splits + row counts via the official datasets-server.huggingface.co API — most actors skip this entirely.
Spaces hardware & runtime stage — t4-small, a10g-large, RUNNING, BUILDING, PAUSED.
Collections — every curated reading list, leaderboard, or pinned set on the Hub.
MCP-first design — responseFormat: "concise" returns ~200 tokens per item so Claude/GPT can sample many results without blowing the context window.
Stable, low-cost runtime — pure HTTP against the public HF API. No browser. No proxy churn. 99%+ success rate.
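To make the URL auto-detection idea concrete, here is a minimal sketch of how a target string could be classified from its path alone. It is not the actor's implementation: the function name `classify_target`, the label strings, and the rules are assumptions for illustration. The sketch returns `profile` for both users and organizations, because the two cannot be told apart from the URL path without an API lookup.

```python
from urllib.parse import urlsplit

# Path prefixes that identify a non-model entity on the Hub.
_PREFIXES = {
    "datasets": "dataset",
    "spaces": "space",
    "papers": "paper",
    "collections": "collection",
}
# Repo sub-pages that should be ignored when classifying.
_SUFFIXES = {"tree", "blob", "commits", "discussions", "settings"}


def classify_target(target: str) -> str:
    """Classify a target as an entity kind, 'query', or 'not_actionable'."""
    text = target.strip()
    if not text:
        return "not_actionable"
    parts = urlsplit(text)
    # urlsplit lowercases the scheme and hostname for us.
    if parts.scheme not in ("http", "https") or parts.hostname != "huggingface.co":
        return "query"
    segments = [s for s in parts.path.split("/") if s]
    # Drop a sub-page suffix such as /tree/main and everything after it.
    for i, segment in enumerate(segments):
        if i > 0 and segment in _SUFFIXES:
            segments = segments[:i]
            break
    if not segments:
        return "not_actionable"
    if segments[0] in _PREFIXES:
        return _PREFIXES[segments[0]] if len(segments) > 1 else "not_actionable"
    # A single segment is a user or organization; two segments are owner/model.
    return "profile" if len(segments) == 1 else "model"


print(classify_target("https://huggingface.co/datasets/squad"))  # prints dataset
```

The sketch does not handle reserved site paths (for example the account settings pages), alternative hostnames, or malformed URLs; a production classifier would need to cover those.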

Input

{
  "targets": [
    "https://huggingface.co/meta-llama/Llama-3.1-8B",
    "https://huggingface.co/datasets/squad",
    "https://huggingface.co/papers/2310.06825",
    "https://huggingface.co/karpathy",
    "qwen 3 instruct"
  ],
  "entityType": "models",
  "resultsPerTarget": 50,
  "sortBy": "downloads",
  "filterTask": "text-generation",
  "filterLibrary": "transformers",
  "filterLanguage": "en",
  "includeReadme": true,
  "responseFormat": "detailed"
}

targets (required) — array of URLs and/or text queries. Mix freely.
entityType — when a target is a text query, search this entity type (models / datasets / spaces / papers / all). Ignored for URL targets.
resultsPerTarget — cap per target. URL targets always return 1 record (+ child items for collections/profiles). Default 50, max 500.
sortBy — downloads, likes, modified, trending. Drives search results and profile portfolios.
filterTask / filterLibrary / filterLanguage — model/dataset filters (e.g. text-generation, transformers, gguf, en).
includeReadme: when true, fetches the full Markdown README. Turn it off to keep bulk searches cheap.
responseFormat — detailed returns every parsed field. concise returns ~200 tokens/item for AI agents.
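If you assemble the input programmatically, a small client-side check can catch mistakes before a paid run starts. The sketch below is a hypothetical helper (`build_run_input` is not part of the actor or the Apify client). The allowed values and the 500 cap come from the parameter list above, and the default of 50 results per target is documented there; the other defaults are this sketch's own choices.

```python
ENTITY_TYPES = {"models", "datasets", "spaces", "papers", "all"}
SORT_KEYS = {"downloads", "likes", "modified", "trending"}
RESPONSE_FORMATS = {"detailed", "concise"}
MAX_RESULTS_PER_TARGET = 500


def build_run_input(
    targets: list[str],
    entity_type: str = "models",
    results_per_target: int = 50,
    sort_by: str = "downloads",
    include_readme: bool = False,
    response_format: str = "concise",
) -> dict:
    """Validate options locally and return an actor input dict."""
    # Drop empty or whitespace-only targets instead of sending them.
    cleaned = [t.strip() for t in targets if t and t.strip()]
    if not cleaned:
        raise ValueError("targets must contain at least one URL or text query")
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"entityType must be one of {sorted(ENTITY_TYPES)}")
    if sort_by not in SORT_KEYS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_KEYS)}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"responseFormat must be one of {sorted(RESPONSE_FORMATS)}")
    if not 1 <= results_per_target <= MAX_RESULTS_PER_TARGET:
        raise ValueError(f"resultsPerTarget must be 1..{MAX_RESULTS_PER_TARGET}")
    return {
        "targets": cleaned,
        "entityType": entity_type,
        "resultsPerTarget": results_per_target,
        "sortBy": sort_by,
        "includeReadme": include_readme,
        "responseFormat": response_format,
    }
```

The returned dict can be passed as `run_input` to the Python client call shown under "Use it from code".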

Optional: higher rate limit

Hugging Face allows ~1,000 requests/hour without auth. Set the HF_TOKEN environment variable on the actor (Console → Settings → Environment variables) with a token from https://huggingface.co/settings/tokens to raise the cap to ~5,000/h.

Output

Mixed dataset — each record has an itemType discriminator (model, dataset, space, paper, user, org, collection, collection_item, search_result). Pre-built dataset views in the Output tab let you slice by entity.
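Because every record carries the `itemType` discriminator, a mixed result set can be split client-side with a simple grouping pass. The following is a minimal sketch; the helper name `group_by_item_type` is hypothetical, and the sample records are trimmed placeholders rather than real output.

```python
from collections import defaultdict


def group_by_item_type(items: list[dict]) -> dict[str, list[dict]]:
    """Bucket records by their itemType field."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Records without a discriminator land in an 'unknown' bucket.
        groups[item.get("itemType", "unknown")].append(item)
    return dict(groups)


sample = [
    {"itemType": "model", "id": "meta-llama/Llama-3.1-8B"},
    {"itemType": "dataset", "id": "squad"},
    {"itemType": "model", "id": "example/placeholder-model"},
]
groups = group_by_item_type(sample)
print(sorted(groups))  # prints ['dataset', 'model']
```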

Model record (excerpt)

{
  "itemType": "model",
  "id": "meta-llama/Llama-3.1-8B",
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B",
  "author": "meta-llama",
  "downloads": 12345678,
  "likes": 4321,
  "pipelineTag": "text-generation",
  "libraryName": "transformers",
  "license": "llama3.1",
  "tags": ["llama-3", "text-generation", "facebook"],
  "language": ["en"],
  "lastModified": "2026-04-22T18:51:00.000Z",
  "siblings": [{ "rfilename": "config.json", "size": 1234, "lfs": null }],
  "modelIndex": [{ "name": "...", "results": [] }],
  "baseModels": [],
  "adapterChildren": ["someone/llama-3.1-8b-lora-medical"],
  "quantizedChildren": ["bartowski/Meta-Llama-3.1-8B-GGUF"],
  "datasetsUsed": ["allenai/c4", "EleutherAI/pile"],
  "readme": "# Model Card for Llama 3.1 8B..."
}
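One use for the `modelIndex` field is ranking models by a benchmark metric instead of download counts. The sketch below assumes the field keeps the nesting of Hugging Face's `model-index` card metadata (entries with `results`, each holding `metrics` with a `type` and `value`); the excerpt above elides that structure, so treat the shape as an assumption. It also assumes that higher metric values are better. The helper names and the sample records are made up for illustration.

```python
def metric_values(record: dict, metric_type: str) -> list[float]:
    """Collect every numeric value reported for one metric type."""
    values = []
    for entry in record.get("modelIndex") or []:
        for result in entry.get("results") or []:
            for metric in result.get("metrics") or []:
                value = metric.get("value")
                if metric.get("type") == metric_type and isinstance(value, (int, float)):
                    values.append(float(value))
    return values


def rank_by_metric(records: list[dict], metric_type: str) -> list[str]:
    """Return model ids sorted by their best score, highest first."""
    scored = [
        (max(values), record["id"])
        for record in records
        if (values := metric_values(record, metric_type))
    ]
    return [model_id for _, model_id in sorted(scored, reverse=True)]


# Made-up records, not real benchmark numbers.
records = [
    {"id": "example/a", "modelIndex": [{"name": "a", "results": [
        {"metrics": [{"type": "accuracy", "value": 0.7}]}]}]},
    {"id": "example/b", "modelIndex": [{"name": "b", "results": [
        {"metrics": [{"type": "accuracy", "value": 0.9}]}]}]},
    {"id": "example/c", "modelIndex": [{"name": "c", "results": []}]},
]
print(rank_by_metric(records, "accuracy"))  # prints ['example/b', 'example/a']
```

Models that report no value for the chosen metric are left out of the ranking rather than being scored as zero.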

Dataset record (excerpt)

{
  "itemType": "dataset",
  "id": "squad",
  "url": "https://huggingface.co/datasets/squad",
  "downloads": 987654,
  "taskCategories": ["question-answering"],
  "language": ["en"],
  "sizeCategories": ["10K<n<100K"],
  "configs": [
    { "config": "plain_text", "splits": [{ "name": "train", "numExamples": 87599 }, { "name": "validation", "numExamples": 10570 }] }
  ]
}
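The `configs` array makes it straightforward to total rows per configuration. Here is a minimal sketch using the field names from the excerpt above; the helper name `rows_per_config` is hypothetical.

```python
def rows_per_config(record: dict) -> dict[str, int]:
    """Sum numExamples across the splits of each dataset config."""
    return {
        cfg["config"]: sum(split["numExamples"] for split in cfg.get("splits", []))
        for cfg in record.get("configs", [])
    }


# The squad excerpt shown above, reduced to the fields used here.
squad = {
    "itemType": "dataset",
    "id": "squad",
    "configs": [
        {"config": "plain_text", "splits": [
            {"name": "train", "numExamples": 87599},
            {"name": "validation", "numExamples": 10570},
        ]},
    ],
}
print(rows_per_config(squad))  # prints {'plain_text': 98169}
```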

Use it from code

JavaScript / TypeScript

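import { ApifyClient } from 'apify-client';

// Read the Apify API token from the environment instead of hard-coding it.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('khadinakbar/huggingface-all-in-one-scraper').call({
  targets: ['https://huggingface.co/meta-llama/Llama-3.1-8B', 'qwen 3 instruct'],
  entityType: 'models', // applies only to the text query; ignored for the URL
  resultsPerTarget: 25,
  responseFormat: 'concise',
});

// Fetch the records the run pushed to its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);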

Python

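from apify_client import ApifyClient

# Replace the placeholder with your Apify API token.
client = ApifyClient(token="apify_api_...")

# Start the actor and block until the run finishes.
run = client.actor("khadinakbar/huggingface-all-in-one-scraper").call(
    run_input={
        "targets": ["https://huggingface.co/datasets/squad", "image classification"],
        "entityType": "all",  # applies only to the text query
        "resultsPerTarget": 50,
    }
)

# Collect every record from the run's default dataset.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())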

From Claude / GPT via MCP

This actor is exposed as apify--huggingface-all-in-one-scraper on the Apify MCP server. Any MCP-capable agent (Claude Desktop, ChatGPT custom GPTs with MCP, Cursor, Windsurf) can call it directly — set responseFormat: "concise" so item tokens stay small.

FAQ

Does it require a Hugging Face account? No. The actor uses public API endpoints. Add an HF_TOKEN env var only if you want higher rate limits.

Can it scrape gated or private repos? No. The scraper only sees what the public API exposes. Gated repos return their public metadata (license, tags) but no file contents. Private repos return 401.

How fresh is the data? Real-time — every record is a fresh API hit, not a cached crawl.

Does it pull discussions / community tabs? Discussions are not part of v1. Papers include their HF discussion count (commentsCount).

What's the difference between this and apify/rag-web-browser? rag-web-browser fetches arbitrary URLs and returns Markdown. This actor returns structured records with typed fields (downloads as int, lastModified as ISO 8601, evaluation results parsed into model-index). Use this when you need data, not prose.

Why is the price higher than some other HF scrapers? You get deeper extraction (model lineage, eval results, dataset configs, full README), URL auto-detection across 7 entity types, and MCP-first design. The cheapest scrapers return ~8 fields with no lineage and no eval.

Legal & TOS

Hugging Face's public API and content are designed for programmatic access — the Hub publishes llms.txt and OpenAPI specs explicitly for AI consumption. This actor only hits public, unauthenticated endpoints and respects rate limits. You are responsible for complying with the licenses of any models, datasets, or papers you retrieve. Gated content is not bypassed.

Changelog

1.2 (2026-05-29)

Fix: user and organization records now report accurate modelsCount, datasetsCount, spacesCount (HF API does not expose x-total-count; uses limit=1000 array length).

1.1 (2026-05-29)

Reliability hardening — 620-record brutal test battery, 100% success rate, 0 schema validation failures.
Fix: gated: false boolean from HF API normalized through normalizeGated() (was failing detailed-mode pushes for non-gated repos).
Fix: dataset counts no longer increment on failed pushes (moved counter after pushData success).
Fix: per-record try/catch — one bad portfolio entry no longer fails the whole user/org target.
Fix: empty/whitespace targets surface as explicit warnings instead of silent drops.
Fix: URL paths with /tree, /blob, /commits, /discussions, /settings suffix strip correctly.
Fix: trailing slash + uppercase hostname normalize.
Fix: bare https://huggingface.co/ returns clear "not actionable" warning.
Add: sanitizeRecord() strips undefined fields before push.
Add: dataset_schema gated accepts boolean defensively.
Add: concise mode summary field truncates at 600 chars to keep token budget.

1.0 (2026-05-28)

Initial release.
URL auto-detection across models, datasets, spaces, papers, users, organizations, collections.
Deep model card extraction with base/adapter/quantized lineage, evaluation results, dataset citations.
Dataset configs + splits + row counts via datasets-server.
Space SDK + hardware + runtime stage.
concise / detailed response modes for AI-agent vs human consumers.
Optional HF_TOKEN for 5K req/h rate limit.

GitHub Scraper — repos, issues, PRs & code search to pair with HF model lineage for full open-source context.
Y Combinator Scraper — YC companies, founders & jobs for AI-startup intelligence next to HF org coverage.
Google Patents Scraper — patents & citations to anchor model claims against patent prior art.
Google Scholar Scraper — papers, citation export & author profiles for the academic side of HF Papers.
ChatGPT GPT Store Scraper — GPT Store catalog & chat counts for downstream LLM-app intelligence.
AI Search Brand Monitor — monitor brand/model mentions across ChatGPT, Perplexity and Gemini.
