Hugging Face Scraper - Models Datasets Spaces

Pricing

$5.00 / 1,000 models scraped

Scrape Hugging Face models, datasets, and Spaces. Extracts metadata, downloads, likes, tags, and usage stats. Ideal for AI model discovery, competitive analysis, and tracking trending ML resources.

Rating

0.0 (0)

Developer

OpenClaw Mara

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 18 days ago

🤗 Hugging Face Scraper - AI Models, Datasets & Spaces

Structured data from the world's largest open-source AI hub. $0.005 per item.

Scrape Hugging Face for models, datasets, and Spaces. Search by task, library, author, or keyword. Extract model cards, download counts, likes, tags, pipeline tags, library info, and full metadata. No authentication required: the Actor is powered by Hugging Face's public API.

Perfect for AI market research, competitive intelligence on open-source AI, RAG pipelines over model cards, and monitoring the ML ecosystem in real time.

🚀 What does this Actor do?

Hugging Face has become the registry for open-source AI. This Actor turns it into a structured data source you can automate in four modes:

  • models: Browse the full model registry. Filter by task (text-generation, image-classification, automatic-speech-recognition, and 16 more), author, search query, and sort order (trending, downloads, likes, lastModified, created).
  • datasets: Discover ML datasets with metadata: size, downloads, tags, likes.
  • spaces: List deployed ML demos and apps on HF Spaces.
  • model_details: Deep-dive into specific models by ID. Returns full model cards, pipeline tag, library info, tensor types, and download statistics.

Everything comes back as clean JSON, ready to drop into a vector DB, a dashboard, or a fine-tuning pipeline.

💡 Use Cases

1. AI market research & trend tracking

Track which open-source models are gaining traction week-over-week. Run weekly against sort: "trending" and compare deltas.

{
  "mode": "models",
  "task": "text-generation",
  "sort": "trending",
  "limit": 100
}
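
One way to "compare deltas" between two weekly runs is sketched below in Python. It assumes each run's items have been loaded as lists of dicts carrying the `id`, `downloads`, and `likes` fields from the output example, and that list order reflects trending rank. The function name `trending_deltas` and the sample IDs are hypothetical and not part of the Actor.

```python
def trending_deltas(previous, current):
    """Compare two runs (lists of item dicts, ordered by rank).

    Returns new entries, dropped entries, and per-model rank/count changes.
    """
    prev_rank = {item["id"]: i for i, item in enumerate(previous, start=1)}
    prev_by_id = {item["id"]: item for item in previous}
    curr_ids = {item["id"] for item in current}

    new_entries = [item["id"] for item in current if item["id"] not in prev_rank]
    dropped = [item["id"] for item in previous if item["id"] not in curr_ids]

    changes = []
    for rank, item in enumerate(current, start=1):
        old = prev_by_id.get(item["id"])
        if old is None:
            continue  # new entry, already reported above
        changes.append({
            "id": item["id"],
            "rank_change": prev_rank[item["id"]] - rank,  # positive = moved up
            "downloads_delta": item.get("downloads", 0) - old.get("downloads", 0),
            "likes_delta": item.get("likes", 0) - old.get("likes", 0),
        })
    return {"new": new_entries, "dropped": dropped, "changes": changes}


# Made-up sample data standing in for two exported runs.
last_week = [
    {"id": "org-a/model-1", "downloads": 1000, "likes": 10},
    {"id": "org-b/model-2", "downloads": 500, "likes": 5},
]
this_week = [
    {"id": "org-b/model-2", "downloads": 900, "likes": 9},
    {"id": "org-c/model-3", "downloads": 50, "likes": 1},
]
report = trending_deltas(last_week, this_week)
print(report)
```

Keeping the comparison keyed on `id` means renamed or removed repositories show up as a drop plus a new entry, which may or may not be what you want for your reporting.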

2. Competitive monitoring of AI labs

Watch specific organizations (Meta, Google, Mistral, Stability AI, Alibaba, DeepSeek) for new releases.

{
  "mode": "models",
  "author": "meta-llama",
  "sort": "lastModified",
  "limit": 50
}
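
To turn such a run into a "new releases" alert, one option is to filter items by their `created` timestamp. The sketch below assumes the ISO 8601 format shown in the output example. The helper names `parse_ts` and `new_releases` are hypothetical, and the second sample ID is made up for illustration.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2024-06-18T00:00:00.000Z'."""
    # Swap the trailing 'Z' for an explicit offset so that
    # datetime.fromisoformat also accepts it before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def new_releases(items, since):
    """Return items created after `since`, newest first."""
    fresh = [
        item for item in items
        if item.get("created") and parse_ts(item["created"]) > since
    ]
    return sorted(fresh, key=lambda item: parse_ts(item["created"]), reverse=True)


items = [
    {"id": "meta-llama/Llama-3.1-8B", "created": "2024-06-18T00:00:00.000Z"},
    {"id": "example-org/older-model", "created": "2023-01-01T00:00:00.000Z"},  # made-up ID
]
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
for item in new_releases(items, cutoff):
    print(item["id"], item["created"])
```

In a scheduled setup, `cutoff` would typically be the time of the previous run.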

3. RAG / fine-tuning corpus from model cards

Pull full model cards for a curated list of models and feed them into a vector store to back an "AI knowledge assistant."

{
  "mode": "model_details",
  "modelIds": [
    "meta-llama/Llama-3.1-8B",
    "mistralai/Mistral-7B-v0.3",
    "google/gemma-2-9b"
  ]
}
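
Before embedding, model cards usually need to be split into chunks. The following is a minimal sketch of fixed-size character chunking with overlap, assuming items carry the `id` and `modelCard` fields from the output example. The function `chunk_model_card` and its default sizes are illustrative assumptions, not something the Actor provides.

```python
def chunk_model_card(item, max_chars=1000, overlap=100):
    """Split an item's modelCard into overlapping text chunks with metadata."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    text = (item.get("modelCard") or "").strip()
    if not text:
        return []  # models without a README yield no chunks

    chunks = []
    start = 0
    step = max_chars - overlap
    while True:
        chunks.append({
            "model_id": item["id"],
            "chunk_index": len(chunks),
            "text": text[start:start + max_chars],
        })
        if start + max_chars >= len(text):
            break  # this chunk reached the end of the card
        start += step
    return chunks


item = {
    "id": "meta-llama/Llama-3.1-8B",
    "modelCard": "Llama 3.1 is a family of large language models...",
}
for chunk in chunk_model_card(item, max_chars=20, overlap=5):
    print(chunk["chunk_index"], repr(chunk["text"]))
```

Character-based splitting is the simplest choice; splitting on Markdown headings or tokens may give better retrieval quality for long cards.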

4. ML dataset discovery for training pipelines

Find datasets by task and download volume. Great for auto-selecting candidates for fine-tuning or evaluation.

{
  "mode": "datasets",
  "search": "instruction",
  "sort": "downloads",
  "limit": 50
}

📊 Output Example

{
  "id": "meta-llama/Llama-3.1-8B",
  "author": "meta-llama",
  "pipeline_tag": "text-generation",
  "downloads": 4523891,
  "likes": 1253,
  "tags": ["pytorch", "safetensors", "llama", "text-generation", "en"],
  "created": "2024-06-18T00:00:00.000Z",
  "lastModified": "2025-01-15T12:30:00.000Z",
  "library_name": "transformers",
  "modelCard": "Llama 3.1 is a family of large language models...",
  "task": "text-generation"
}

⚙️ Input Parameters

| Parameter | Type | Description |
|---|---|---|
| mode | enum | models, datasets, spaces, or model_details (required) |
| search | string | Keyword search, e.g. "llama", "sentiment", "bert" |
| author | string | Filter by org/user, e.g. "meta-llama", "google", "mistralai", "openai-community" |
| task | enum | 19 ML tasks: text-generation, image-classification, translation, summarization, fill-mask, text-to-image, automatic-speech-recognition, and more |
| sort | enum | trending, downloads, likes, lastModified, created |
| limit | int | 1–1000 (default 50) |
| modelIds | array | For model_details mode: ["meta-llama/Llama-3-8B", "google/gemma-7b"] |
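
When inputs are generated programmatically, it can help to check them against the parameters above before starting a run. The sketch below is a hypothetical helper (`build_input` is not part of the Actor). It assumes that `model_details` mode needs a non-empty `modelIds` list and that `limit` is only sent for the list modes, mirroring the examples in the use cases.

```python
import json

MODES = {"models", "datasets", "spaces", "model_details"}
SORTS = {"trending", "downloads", "likes", "lastModified", "created"}


def build_input(mode, *, search=None, author=None, task=None, sort=None,
                limit=50, model_ids=None):
    """Build an Actor input dict, checking the documented constraints."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    if sort is not None and sort not in SORTS:
        raise ValueError(f"sort must be one of {sorted(SORTS)}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    if mode == "model_details" and not model_ids:
        # Assumption: the Actor needs explicit IDs in this mode.
        raise ValueError("model_details mode needs at least one model ID")

    payload = {"mode": mode}
    if mode != "model_details":
        payload["limit"] = limit
    optional = {"search": search, "author": author, "task": task,
                "sort": sort, "modelIds": model_ids}
    # Drop parameters that were not set so the Actor applies its defaults.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


run_input = build_input("models", task="text-generation", sort="trending", limit=100)
print(json.dumps(run_input, indent=2))
```

The resulting dict can be pasted into the Actor's JSON input or passed to whichever client you use to start runs.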

📤 Output Fields

| Field | Description |
|---|---|
| id | Full model/dataset/space ID (author/name) |
| author | Organization or user that published it |
| pipeline_tag | Primary ML task |
| downloads | Total download count |
| likes | Community likes |
| tags | Array of framework, license, language, and architecture tags |
| library_name | Primary library (transformers, diffusers, sentence-transformers, etc.) |
| created / lastModified | ISO timestamps for monitoring freshness |
| modelCard | Full README content (in model_details mode) |

💰 Pricing & Performance

  • Pay-per-event: $0.005 per item scraped (model, dataset, space, or model detail).
  • Typical monthly cost: $2–$5 for weekly tracking of 100–250 top models (about four runs a month at $0.005 per item).
  • Speed: ~100 items/minute in list modes, ~30 items/minute in model_details (each call fetches the full model card).
  • No HF account / token required: the Actor uses the public API.
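
The per-item price and the approximate speeds above make it straightforward to estimate a schedule's cost and run time. A small sketch follows; the function name `estimate` and the four-runs-per-month figure are illustrative assumptions.

```python
PRICE_PER_ITEM = 0.005  # USD per item, from the pricing above
ITEMS_PER_MINUTE = {"list": 100, "model_details": 30}  # approximate speeds quoted above


def estimate(items_per_run, runs_per_month, mode="list"):
    """Return (monthly cost in USD, minutes per run) for a schedule."""
    monthly_cost = items_per_run * runs_per_month * PRICE_PER_ITEM
    minutes_per_run = items_per_run / ITEMS_PER_MINUTE[mode]
    return round(monthly_cost, 2), round(minutes_per_run, 1)


print(estimate(100, 4))                        # (2.0, 1.0)
print(estimate(250, 4))                        # (5.0, 2.5)
print(estimate(250, 4, mode="model_details"))  # (5.0, 8.3)
```

At four runs a month, 100 items per run comes to $2.00 and 250 items per run to $5.00. Daily schedules scale the cost linearly.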

🔌 Integrations

  • Zapier / Make / n8n: schedule weekly trend scans and push deltas to Slack, Notion, or Airtable.
  • LangChain / LlamaIndex: feed model_details output straight into a RAG pipeline to build an "AI model advisor."
  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector): embed modelCard content for semantic search over the open-source AI landscape.
  • Apify SDK / webhooks: run on a schedule and POST new trending entries to your own endpoint.
  • Google Sheets / BigQuery: export to CSV via Apify's dataset export and build dashboards on top.

❓ FAQ

Do I need a Hugging Face account or token? No. The Actor uses the public HF API, so there is no auth to set up and no token scopes to manage.

How fresh is the data? Real-time. Every run hits the HF API live. Trending rankings, download counts, and new releases appear as soon as HF publishes them.

Can I get the full model card text? Yes. Use mode: "model_details" with modelIds. The Actor fetches each model's full README/model card.

What's the difference between downloads and trending? downloads = all-time cumulative. trending = HF's internal momentum signal (recent downloads + likes velocity). Use trending to catch rising stars before they hit top-downloads lists.

Can I filter by license (Apache, MIT, Llama-license)? Not directly in input, but license shows up in the tags array of each result β€” you can filter client-side.
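
A client-side license filter could look like the sketch below. It assumes license tags follow the `license:<id>` convention used on the Hugging Face Hub (for example `license:apache-2.0`); check the `tags` array in your own results before relying on this. The function name `filter_by_license` and the sample items are hypothetical.

```python
def filter_by_license(items, allowed):
    """Keep items whose tags include one of the allowed license tags."""
    # Assumption: licenses appear as tags of the form "license:<id>".
    allowed_tags = {f"license:{name.lower()}" for name in allowed}
    return [
        item for item in items
        if allowed_tags & {tag.lower() for tag in item.get("tags", [])}
    ]


# Made-up sample results for illustration.
results = [
    {"id": "org-a/model-1", "tags": ["pytorch", "license:apache-2.0"]},
    {"id": "org-b/model-2", "tags": ["safetensors", "license:mit"]},
    {"id": "org-c/model-3", "tags": ["pytorch"]},
]
permissive = filter_by_license(results, ["apache-2.0", "mit"])
print([item["id"] for item in permissive])
```

Items without any license tag are dropped by this filter, which is the conservative choice when license compliance matters.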

Why are some model cards empty? A small fraction of models on HF don't ship a README. Those come back with modelCard: "". Everything else is populated.

🔑 Keywords

Hugging Face scraper, AI model database, ML model tracker, open source AI data, LLM directory, Hugging Face API alternative, model cards extraction, AI trending models, Hugging Face datasets scraper, Hugging Face Spaces scraper, transformer models data, AI ecosystem monitoring, ML model comparison, fine-tuning dataset discovery, AI competitive intelligence, RAG over model cards.

📝 Changelog

  • v1.0: Initial release. 4 modes (models, datasets, spaces, model_details), 19 task filters, 5 sort options, up to 1000 results per run.