# HuggingFace Hub Scraper - Models, Datasets, Spaces & Authors (`makework36/huggingface-hub-scraper`) Actor

Scrape HuggingFace Hub: models, datasets, spaces. 30+ fields per record, trending filters, author profiles, parsed tags, web enrichment for emails & websites.

- **URL**: https://apify.com/makework36/huggingface-hub-scraper.md
- **Developed by:** [deusex machine](https://apify.com/makework36) (community)
- **Categories:** AI, Business, Lead generation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $2.50 / 1,000 HuggingFace records

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## HuggingFace Hub Scraper — Models, Datasets, Spaces, Papers & Authors

Scrape the **HuggingFace Hub** with 30+ fields per record. Use this HuggingFace scraper as a no-auth alternative to calling the official HuggingFace Hub API yourself, with backoff on rate-limit (429) responses handled for you: search models, datasets, spaces and daily papers, filter by author, task, library, language, license, downloads or likes, parse the flat tag array into structured columns, and export everything to CSV, JSON, Excel or a queryable database.

If you have tried building anything on top of the HuggingFace Hub API you already know the friction: paginated REST endpoints with inconsistent shapes, tags packed into a flat string array, no first-class downloads-stats endpoint, and author profiles that require yet another HTTP call. This actor unifies all of that into one schema, runs entirely against the public HF endpoints (no `HF_TOKEN`, no API key) and adds an optional web-enrichment step for outreach.

> 💡 **Looking for HuggingFace data, an HF model finder, an HF dataset list, or a way to convert a HuggingFace dataset to CSV?** This is the actor. It supports the four main HF resources — models, datasets, spaces, papers — plus bulk lookup by ID.

---

### 🚀 Why this HuggingFace scraper

- **30+ structured fields per record** — `id`, `author`, `pipelineTag`, `library`, `parameters`, `usedStorageBytes`, `inferenceStatus`, `gated`, `widgetData`, `spacesUsing`, `siblings`, `config`, `cardData`, `arxivPapers`, `datasetsUsed`, and more
- **Structured tag parsing** — the flat `tags` array gets split into `license`, `languages`, `datasetsUsed`, `arxivPapers`, `region`, `hardwareCompatible`, `frameworks`
- **Author / organization profile** — followers, `isPro`, `numModels`, `numDatasets`, `numSpaces`, `numPapers`, list of organizations
- **Web enrichment** — for every unique author, find their personal or company website, LinkedIn, Facebook and secondary emails via a SERP fetcher (no API key)
- **5 modes** — Models, Datasets, Spaces, Daily Papers, plus bulk Lookup-by-IDs
- **Filters that actually work** — author / organization, pipeline tag (task), library, language, license, SDK (for spaces), minimum downloads, minimum likes
- **Sorting** — trending score (fresh hype), downloads, likes, last modified, created at
- **Outputs** — Apify Dataset → CSV, JSON, Excel, XML, RSS, HTML

Built for AI researchers, ML platform teams, dev-tools founders building on top of HuggingFace, recruiters sourcing ML talent, VCs mapping the foundation-model landscape, and DevRel teams running outreach to model authors.

---

### 📊 What this HuggingFace Hub Scraper extracts

| Field | Description |
| --- | --- |
| `id` | Full ID (`author/name`) |
| `name` | Short name without author prefix |
| `author` | Author or organization handle (e.g. `meta-llama`, `mistralai`) |
| `type` | `model` / `dataset` / `space` / `paper` |
| `pipelineTag` | Primary task (text-generation, image-classification, …) |
| `library` | Library (transformers, diffusers, sentence-transformers, gguf, …) |
| `tags` | Raw flat tags array (HF format) |
| `tagsStructured` | Parsed object: `{ license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks }` |
| `downloads` | Total downloads (lifetime) |
| `downloadsAllTime` | Same as above; alias for compatibility |
| `likes` | Total likes |
| `trendingScore` | HF's own trending signal |
| `createdAt` | Creation timestamp |
| `lastModified` | Last modification timestamp |
| `private` | Whether the record is private (always `false` for public scrape) |
| `gated` | Whether the model is gated (requires acceptance) |
| `disabled` | Whether the record is disabled |
| `inferenceStatus` | Inference API availability (`live`, `loading`, `error`) |
| `parameters` | Number of parameters (when published in `safetensors` / `config.json`) |
| `usedStorageBytes` | Model artifact size on disk |
| `widgetData` | Widget config (when present) |
| `spacesUsing` | Count of Spaces referencing this model |
| `siblings` | File list inside the repo (name + size) |
| `config` | Parsed `config.json` (model architecture, vocab size, hidden size, …) |
| `cardData` | Parsed YAML front-matter of the README ("model card") |
| `arxivPapers` | Array of arXiv IDs declared in tags or cardData |
| `datasetsUsed` | Array of datasets declared as training sources |
| `frameworks` | Frameworks (pytorch, tensorflow, jax, ggml, …) |
| `license` | Top-level license from cardData |
| `languages` | ISO codes (en, es, multilingual, …) |
| `authorProfile` | `{ followers, isPro, numModels, numDatasets, numSpaces, numPapers, orgs[] }` |
| `enrichment` | `{ website, linkedin, facebook, twitter, emails[] }` |
| `url` | Canonical `huggingface.co/...` URL |

For Spaces, the schema adds `sdk` (docker / gradio / streamlit / static), `runtime`, `models` and `datasets` declared in `README.md`. For Papers, the schema adds `title`, `summary`, `arxivId`, `upvotes`, `commentsCount`, `submittedBy`.
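
To make the `tagsStructured` field more concrete, here is a minimal sketch of how a flat tag array could be split client-side. It is not the Actor's implementation: the prefix conventions (`license:`, `dataset:`, `arxiv:`, `region:`), the two-letter language heuristic, and the function name `parse_tags` are all assumptions made for illustration, and `hardwareCompatible` is left out because its tag format is not described here.

```python
from typing import Dict, List, Optional, Union

# Assumed list-valued prefixes; the Actor's real parsing rules are not documented here.
PREFIX_TO_FIELD = {
    "dataset:": "datasetsUsed",
    "arxiv:": "arxivPapers",
    "region:": "region",
}
# Framework names taken from the field table and example output in this README.
KNOWN_FRAMEWORKS = {"pytorch", "tensorflow", "jax", "ggml", "safetensors"}

Structured = Dict[str, Union[Optional[str], List[str]]]


def parse_tags(tags: List[str]) -> Structured:
    """Split a flat Hub tag list into a structured dict (illustrative heuristic)."""
    out: Structured = {
        "license": None,
        "languages": [],
        "datasetsUsed": [],
        "arxivPapers": [],
        "region": [],
        "frameworks": [],
    }
    for tag in tags:
        if tag.startswith("license:"):
            out["license"] = tag[len("license:"):]
            continue
        for prefix, field in PREFIX_TO_FIELD.items():
            if tag.startswith(prefix):
                out[field].append(tag[len(prefix):])
                break
        else:
            if tag in KNOWN_FRAMEWORKS:
                out["frameworks"].append(tag)
            # Heuristic: bare two-letter lowercase tags are treated as language codes.
            elif tag == "multilingual" or (len(tag) == 2 and tag.isalpha() and tag.islower()):
                out["languages"].append(tag)
    return out
```

Tags that match none of the rules (for example a task or library name) are simply ignored by this sketch.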

---

### 🎯 Search modes

#### 1. `models` — HuggingFace model search

Find models with all the standard filters and sort by trending. The trending score is HF's own hype signal (combines fresh downloads + likes + Spaces usage).

```json
{
  "searchType": "models",
  "pipelineTag": "text-generation",
  "minDownloads": 10000,
  "sort": "trendingScore",
  "maxResults": 100,
  "parseTagsStructured": true,
  "includeAuthorProfile": true
}
````

Common pipeline tags: `text-generation`, `text-classification`, `feature-extraction`, `sentence-similarity`, `image-classification`, `image-to-image`, `text-to-image`, `automatic-speech-recognition`, `text-to-speech`, `translation`, `summarization`, `question-answering`, `token-classification`, `object-detection`, `depth-estimation`.

Common library filters: `transformers`, `diffusers`, `sentence-transformers`, `gguf`, `llama.cpp`, `mlx`, `coreml`, `tensorrt-llm`.

#### 2. `datasets` — HuggingFace dataset search and export

Use the dataset mode to discover or audit training data, or to build a HuggingFace-to-CSV pipeline. The actor returns the dataset metadata; for the actual rows, point your downstream tooling at the canonical `huggingface.co/datasets/...` resolver.

```json
{
  "searchType": "datasets",
  "searchQuery": "instruction",
  "language": "en",
  "minDownloads": 1000,
  "sort": "downloads",
  "maxResults": 200
}
```

Common queries: `instruction tuning`, `dpo`, `code`, `medical`, `legal`, `multilingual`, `image-text`, `function-calling`, `tool-use`, `safety`, `red-teaming`.

#### 3. `spaces` — HuggingFace Space discovery

Find Gradio / Streamlit / Docker / static Spaces. Use this mode for competitive intel on AI demos, recruiter sourcing on ML engineers shipping public apps, or for building a "best Spaces of the week" feed.

```json
{
  "searchType": "spaces",
  "sdk": "gradio",
  "minLikes": 100,
  "sort": "trendingScore",
  "maxResults": 100
}
```
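
For a "best Spaces of the week" feed you will usually want one entry per author. The sketch below shows one way to post-process the returned records; it relies only on the `author`, `likes` and `id` fields from the field table, and the function name `top_spaces_by_author` is hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def top_spaces_by_author(records: List[Record], limit: int = 10) -> List[Record]:
    """Keep each author's most-liked Space, then rank the survivors by likes."""
    best: Dict[str, Record] = {}
    for rec in records:
        author = rec.get("author")
        if author is None:
            continue  # skip records without an author handle
        likes = rec.get("likes") or 0
        if author not in best or likes > (best[author].get("likes") or 0):
            best[author] = rec
    ranked = sorted(best.values(), key=lambda r: r.get("likes") or 0, reverse=True)
    return ranked[:limit]
```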

#### 4. `papers` — HuggingFace Daily Papers

The Daily Papers section curates community-submitted arXiv papers with AI-written summaries and upvote counts. This actor returns title, summary, arxivId, upvotes, comments and the submitting author.

```json
{
  "searchType": "papers",
  "sort": "trendingScore",
  "maxResults": 50
}
```

#### 5. `byIds` — Bulk HuggingFace lookup by ID

Hand the actor a list of `author/name` IDs and it returns the full record for each, across models / datasets / spaces. Perfect for enriching a CSV you already have, or auditing a leaderboard.

```json
{
  "searchType": "byIds",
  "ids": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
    "microsoft/Phi-3.5-mini-instruct"
  ],
  "includeAuthorProfile": true
}
```
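
When the IDs come from an existing CSV they are often messy (full URLs, duplicates, stray whitespace). The following sketch cleans such a list and splits it into batches before it is passed as `ids`. The helper names are hypothetical, and the batch size of 5,000 is an assumption based on the `maxResults` cap documented below, not a confirmed limit on `ids`.

```python
import re
from typing import Iterable, Iterator, List

HUB_PREFIX = "https://huggingface.co/"
# 'author/name' shape as described for the ids input; anything else is dropped.
ID_PATTERN = re.compile(r"^[A-Za-z0-9][\w.\-]*/[\w.\-]+$")


def normalize_ids(raw_ids: Iterable[str]) -> List[str]:
    """Strip URL prefixes, drop malformed entries, and de-duplicate in order."""
    seen = set()
    result: List[str] = []
    for raw in raw_ids:
        candidate = raw.strip()
        if candidate.startswith(HUB_PREFIX):
            candidate = candidate[len(HUB_PREFIX):]
        for kind in ("datasets/", "spaces/"):
            if candidate.startswith(kind):
                candidate = candidate[len(kind):]
        candidate = candidate.strip("/")
        if ID_PATTERN.match(candidate) and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


def chunked(ids: List[str], size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Each batch can then be used as the `ids` value of a separate `byIds` run.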

***

### 💡 Use cases

This HuggingFace scraper is designed for **AI competitive intelligence, ML lead generation, talent sourcing, and dataset engineering**.

- **AI competitive intelligence** — track every new fine-tune of Llama, Mistral, Qwen, Gemma, Phi. Filter by license, parameters, framework, dataset and pull the author profile to know who is shipping
- **ML lead generation** — find every author who released a popular model in your niche (RAG, voice, vision, robotics) and reach out with the enriched website + LinkedIn
- **Recruiter sourcing for ML engineers** — verified, public proof-of-work + author profile + secondary emails. Beats LinkedIn Recruiter for AI roles
- **VC ecosystem mapping** — combine `numModels` + `numDatasets` + `followers` per organization to surface fast-growing AI labs and emerging research groups
- **Trending model digest / newsletter** — daily run sorted by `trendingScore` produces a clean "what's hot on HuggingFace today" feed
- **Foundation-model leaderboard** — pull every `text-generation` model with `parameters` populated and rank by your own criteria (parameters + downloads + license)
- **Dataset audit and lineage** — for any model, the parsed `cardData` includes the dataset(s) it was trained on. Build a model-to-dataset graph
- **Convert HuggingFace dataset to CSV** — get the dataset metadata, then download the raw files from the resolver
- **Build a Spaces showcase** — top Gradio Spaces by likes, deduplicated by author. Great for AI tool directories
- **Brand monitoring on HuggingFace** — find every model / dataset mentioning your company name, your paper, or your API as an integration
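
As an illustration of the dataset audit use case, the sketch below turns a list of model records into a dataset-to-models mapping. It assumes the training datasets are exposed under `datasetsUsed` (top level or inside `tagsStructured`), as the field table suggests; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def dataset_to_models(records: List[Record]) -> Dict[str, List[str]]:
    """Map each declared training dataset to the model IDs that reference it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        # Prefer the top-level field, fall back to the structured tags.
        datasets = (
            rec.get("datasetsUsed")
            or (rec.get("tagsStructured") or {}).get("datasetsUsed")
            or []
        )
        for dataset_id in datasets:
            graph[dataset_id].append(rec["id"])
    return dict(graph)
```

Models that declare no datasets contribute no edges, so an empty result does not prove a model was trained without public data.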

***

### 🧾 Example output

A single record from a `byIds: ["meta-llama/Llama-3.1-8B-Instruct"]` run (truncated):

```json
{
  "id": "meta-llama/Llama-3.1-8B-Instruct",
  "name": "Llama-3.1-8B-Instruct",
  "author": "meta-llama",
  "type": "model",
  "pipelineTag": "text-generation",
  "library": "transformers",
  "tagsStructured": {
    "license": "llama3.1",
    "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
    "datasetsUsed": [],
    "arxivPapers": ["2407.21783"],
    "region": ["us"],
    "frameworks": ["pytorch", "safetensors"]
  },
  "downloads": 4823910,
  "likes": 3915,
  "trendingScore": 88.4,
  "createdAt": "2024-07-18T08:54:01.000Z",
  "lastModified": "2026-04-22T16:30:11.000Z",
  "gated": true,
  "parameters": 8030261248,
  "usedStorageBytes": 16060522496,
  "spacesUsing": 1342,
  "authorProfile": {
    "followers": 24117,
    "isPro": false,
    "numModels": 39,
    "numDatasets": 4,
    "numSpaces": 1,
    "orgs": ["meta", "facebook"]
  },
  "url": "https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"
}
```
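
Nested objects such as `tagsStructured` and `authorProfile` do not map directly onto spreadsheet columns. If you post-process the JSON yourself instead of using the Apify CSV export, a flattening step along these lines may help. This is a sketch with hypothetical helper names: nested keys are joined with dots and lists are serialized as JSON strings.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into dotted column names; lists become JSON strings."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


def records_to_csv(records: List[Dict[str, Any]]) -> str:
    """Render records as CSV text, using the union of all flattened columns."""
    rows = [flatten(r) for r in records]
    columns: List[str] = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing columns are written as empty cells
    return buffer.getvalue()
```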

***

### 🆚 Compared to alternatives

| Tool | Maintainer emails | Downloads stats | Bulk lookup | Tag parsing | Web enrichment | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| **HuggingFace Hub Scraper** (this actor) | ✅ via enrichment | ✅ Lifetime + trending | ✅ Up to 5,000 | ✅ Structured | ✅ Optional | Pay-per-event |
| `huggingface_hub` Python SDK | ❌ | ⚠️ Per call | ⚠️ Loops only | ❌ | ❌ | Free, slow |
| HuggingFace REST API | ❌ | ⚠️ Per call | ⚠️ Loops only | ❌ | ❌ | Free, rate-limited |
| Papers With Code | ❌ | ❌ | ⚠️ | ❌ | ❌ | Free |
| OpenReview scrapers | ❌ | ❌ | ❌ | ❌ | ❌ | Free |

If you only need 10 records, the official SDK is fine. For thousands of records, structured tags, downloads stats, author profiles and email enrichment in one run, this actor is the fastest path.

***

### ⚙️ Input parameters reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `searchType` | string enum | `models` | `models` / `datasets` / `spaces` / `papers` / `byIds` |
| `ids` | string\[] | — | Used with `byIds`. `author/name` per line |
| `searchQuery` | string | — | Free-text across IDs and tags |
| `author` | string | — | Filter by author / organization (`meta-llama`, `google`, `mistralai`) |
| `pipelineTag` | string | — | Task (`text-generation`, `text-classification`, …) |
| `library` | string | — | Library (`transformers`, `diffusers`, …) |
| `language` | string | — | ISO code (`en`, `es`, `multilingual`, …) |
| `license` | string | — | License filter (`apache-2.0`, `mit`, `llama3.1`, …) |
| `sdk` | string | — | Spaces only: `docker` / `gradio` / `streamlit` / `static` |
| `minDownloads` | integer | — | Drop records below this download count |
| `minLikes` | integer | — | Drop records below this likes count |
| `sort` | string enum | `trendingScore` | `trendingScore` / `downloads` / `likes` / `lastModified` / `createdAt` |
| `maxResults` | integer | `100` | Hard cap (1–5,000) |
| `parseTagsStructured` | boolean | `true` | Split flat tags into structured fields |
| `includeAuthorProfile` | boolean | `false` | Fetch author / org profile |
| `enrichWithGoogle` | boolean | `false` | Find website + LinkedIn + secondary emails per author |
| `enrichLimit` | integer | `50` | Max unique authors to enrich (1–1,000) |
| `proxyConfig` | proxy | residential | Used for enrichment only |

***

### 💰 Pricing & cost

Pay-per-event:

- **Per record returned** — small fee, linear with results
- **Per enriched author** — only when `enrichWithGoogle: true`, capped by `enrichLimit`

At the listed starting price of $2.50 per 1,000 records, a 1,000-model pull without enrichment costs about $2.50. With author profile + enrichment on 50 unique authors, you stay under a few dollars per run.

The Actor emits a billing event only when a record actually lands in the Dataset. Retries, rate-limit backoffs and partial failures are not charged.
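
A rough budget can be worked out from the two event types. In the sketch below, the record price defaults to the "from $2.50 / 1,000 records" figure in the Pricing section; the per-author enrichment price is not stated in this README, so it defaults to zero and any value you pass is your own assumption. Check the Actor's pricing tab for the actual rates.

```python
def estimate_cost(
    records: int,
    price_per_1000_records: float = 2.50,
    enriched_authors: int = 0,
    price_per_enriched_author: float = 0.0,  # not stated in this README; supply your own
) -> float:
    """Estimate the pay-per-event cost of one run in USD."""
    record_cost = records / 1000 * price_per_1000_records
    enrichment_cost = enriched_authors * price_per_enriched_author
    return record_cost + enrichment_cost
```

Because only delivered records are billed, `records` should be the number of items you expect in the Dataset, not the number of requests made.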

***

### ❓ Frequently asked questions

**Is this an official HuggingFace product?**
No. It calls the same **public** `huggingface.co/api` endpoints that the official `huggingface_hub` Python SDK uses. No `HF_TOKEN` required.

**Do you respect HuggingFace terms of service?**
Yes. We only read public endpoints. We add polite delays and exponential backoff on 429 responses.
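
For readers unfamiliar with the pattern, exponential backoff on 429 responses generally looks like the sketch below. It illustrates the general technique only; the delays, retry count and function names are assumptions and do not describe the Actor's internal settings.

```python
import time
from typing import Callable, List, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def backoff_delays(retries: int = 5, base: float = 1.0,
                   factor: float = 2.0, cap: float = 30.0) -> List[float]:
    """Delays that grow geometrically from `base` and never exceed `cap` seconds."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def fetch_with_backoff(fetch: Callable[[], Response], retries: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `fetch` until it stops returning 429 or the retries are exhausted."""
    delays = backoff_delays(retries)
    status, body = fetch()
    for delay in delays:
        if status != 429:
            break
        sleep(delay)  # wait longer after each consecutive 429
        status, body = fetch()
    return status, body
```

The `sleep` parameter is injectable so the logic can be exercised without actually waiting.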

**Can I get the model weights / dataset files themselves?**
No. The Actor returns metadata only (which includes the file list via `siblings`). To download the binary files, use HuggingFace's resolver (`https://huggingface.co/<id>/resolve/main/<file>`; dataset repos use `datasets/<id>` in the path) or the standard `huggingface_hub` SDK.
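
Given a record with a `siblings` list, the download URLs can be assembled as in this sketch. It assumes each sibling entry carries the file path under a `name` key (the field table says "name + size"; the exact key is an assumption) and that `type` is one of `model`, `dataset` or `space`. The function name is hypothetical.

```python
from typing import Any, Dict, List
from urllib.parse import quote

# Path prefix per repo type on huggingface.co.
TYPE_PREFIX = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def resolve_urls(record: Dict[str, Any], revision: str = "main") -> List[str]:
    """Build resolver URLs for every file listed in a record's siblings."""
    prefix = TYPE_PREFIX[record.get("type", "model")]
    base = f"https://huggingface.co/{prefix}{record['id']}/resolve/{revision}/"
    # quote() keeps '/' so nested paths such as data/train.parquet stay intact.
    return [base + quote(sibling["name"]) for sibling in record.get("siblings", [])]
```

Gated or private repositories will still require authentication when these URLs are fetched.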

**How fresh is the data?**
Live. Every request hits the HuggingFace Hub in real time.

**Can I convert a HuggingFace dataset to CSV?**
The actor returns dataset metadata (including the `siblings` list, which is the file manifest). For the actual rows, download the Parquet / JSON / CSV files declared in `siblings` and convert as you wish.

**What is `trendingScore`?**
HuggingFace's own hype signal. It combines recent downloads, likes, and Spaces usage to rank "what is hot right now". Useful for newsletter automation.

**Why is `parameters` sometimes `null`?**
HuggingFace populates `parameters` from a model's `safetensors` index or `config.json`. Some older models or non-transformer models don't publish that, so the field is `null`.

**How do I find the most popular HuggingFace models?**
Set `searchType: "models"`, `sort: "downloads"`, optionally `pipelineTag` to scope by task, and increase `maxResults`. For "fresh hype" use `sort: "trendingScore"`.

**Can I scrape gated models?**
You can read their metadata. The `gated: true` field tells you the weights require user acceptance. The scraper does not bypass any gating.

**Does the enrichment really find author emails?**
Yes, when the author has published an email anywhere on their website, GitHub, LinkedIn or academic page. The SERP fetcher follows the same approach as Apollo / Hunter, applied to AI researchers and ML engineers.

**Can I run this on a schedule?**
Yes. Apify Schedules supports cron expressions. A daily run sorted by `trendingScore` produces a "what's new in AI today" feed.

**How does this compare to the `huggingface_hub` Python SDK?**
The SDK is great for single calls in a Python script. For bulk extraction (1K+ records), structured tags, downloads stats, author profiles and email enrichment in one run, this actor is much faster and exports straight to CSV / JSON / Excel.

**Can I integrate the actor with Claude, Cursor or other AI agents?**
Yes — call the actor via the Apify API from your agent or use Apify's MCP server wrapper. I also publish dedicated MCP server actors (see below).

***

### 🔗 Other actors by makework36

Useful companions for the AI / ML stack:

- [Reddit MCP Server](https://apify.com/makework36/reddit-mcp-server) — Reddit access for Claude, Cursor, ChatGPT, Codex
- [Flight Price MCP Server](https://apify.com/makework36/flight-mcp-server) — flight prices for AI agents
- [Skyscanner MCP Server](https://apify.com/makework36/skyscanner-mcp-server) — flight search MCP
- [Airbnb MCP Server](https://apify.com/makework36/airbnb-mcp-server) — vacation-rental data for AI agents
- [NPM Package Scraper](https://apify.com/makework36/npm-package-scraper) — JavaScript ecosystem data + maintainer emails
- [Lovable Sites Scraper](https://apify.com/makework36/lovable-sites-scraper) — discover `.lovable.app` AI-built apps
- [StackOverflow Scraper](https://apify.com/makework36/stackoverflow-scraper) — questions, answers and tags
- [Website Email & Contact Finder](https://apify.com/makework36/email-finder-scraper) — extract emails from any URL
- [Reddit Product Research Scraper](https://apify.com/makework36/reddit-product-research) — reviews and recommendations
- [Reddit SaaS Leads Scraper](https://apify.com/makework36/reddit-leads-saas) — startup pain points and early adopters
- [Substack Scraper](https://apify.com/makework36/substack-scraper) — newsletter posts and authors
- [Facebook Ad Library Scraper](https://apify.com/makework36/facebook-adlib-scraper) — competitor ad intelligence

***

### 📝 Changelog

- **v0.1** — Initial release. Five search modes (models / datasets / spaces / papers / byIds), structured tag parsing, author profile, optional Google enrichment.

***

### 🛠️ Support

Missing a field, hit a bug, or want a new mode? Open an issue or message me directly from the Apify Console. I respond fast and ship fixes within hours for paying users.

# Actor input Schema

## `searchType` (type: `string`):

Which HuggingFace resource to search.

## `ids` (type: `array`):

Used only with searchType = byIds. Format: 'author/name' for models/datasets/spaces.

## `searchQuery` (type: `string`):

Free-text search across IDs and tags.

## `author` (type: `string`):

Filter by author (e.g. 'meta-llama', 'google', 'mistralai').

## `pipelineTag` (type: `string`):

Filter by task: text-generation, text-classification, automatic-speech-recognition, image-classification, sentence-similarity, embeddings, etc.

## `library` (type: `string`):

Filter by library: transformers, diffusers, sentence-transformers, gguf, llama.cpp, etc.

## `language` (type: `string`):

ISO code (en, es, multilingual, ja, ...).

## `license` (type: `string`):

Filter by license: apache-2.0, mit, llama3.1, openrail, cc-by-nc-4.0, etc.

## `sdk` (type: `string`):

Filter spaces by SDK: docker, gradio, streamlit, static.

## `minDownloads` (type: `integer`):

Skip records below this download count.

## `minLikes` (type: `integer`):

Skip records below this likes count.

## `sort` (type: `string`):

Sort criterion for the result list.

## `maxResults` (type: `integer`):

Hard cap on records to return.

## `parseTagsStructured` (type: `boolean`):

If true, splits the flat tags array into: license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks.

## `includeAuthorProfile` (type: `boolean`):

For each record, fetch the author/org overview (followers, isPro, numModels, orgs).

## `enrichWithGoogle` (type: `boolean`):

For each unique author, find their personal/company website, LinkedIn, Facebook, and secondary emails using a web search (no API key required).

## `enrichLimit` (type: `integer`):

Max number of UNIQUE authors to enrich (not records).

## `proxyConfig` (type: `object`):

Apify proxy. Used for enrichment only.

## Actor input object example

```json
{
  "searchType": "models",
  "ids": [],
  "sort": "trendingScore",
  "maxResults": 100,
  "parseTagsStructured": true,
  "includeAuthorProfile": false,
  "enrichWithGoogle": false,
  "enrichLimit": 50,
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}
```

# Actor output Schema

## `items` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "ids": []
};

// Run the Actor and wait for it to finish
const run = await client.actor("makework36/huggingface-hub-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "ids": [] }

# Run the Actor and wait for it to finish
run = client.actor("makework36/huggingface-hub-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "ids": []
}' |
apify call makework36/huggingface-hub-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=makework36/huggingface-hub-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "HuggingFace Hub Scraper - Models, Datasets, Spaces & Authors",
        "description": "Scrape HuggingFace Hub: models, datasets, spaces. 30+ fields per record, trending filters, author profiles, parsed tags, web enrichment for emails & websites.",
        "version": "1.0",
        "x-build-id": "CFVDfFZcVJP2gK96Y"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/makework36~huggingface-hub-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-makework36-huggingface-hub-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/makework36~huggingface-hub-scraper/runs": {
            "post": {
                "operationId": "runs-sync-makework36-huggingface-hub-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/makework36~huggingface-hub-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-makework36-huggingface-hub-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "searchType": {
                        "title": "What to scrape",
                        "enum": [
                            "models",
                            "datasets",
                            "spaces",
                            "papers",
                            "byIds"
                        ],
                        "type": "string",
                        "description": "Which HuggingFace resource to search.",
                        "default": "models"
                    },
                    "ids": {
                        "title": "Item IDs (one per line)",
                        "type": "array",
                        "description": "Used only with searchType = byIds. Format: 'author/name' for models/datasets/spaces.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "searchQuery": {
                        "title": "Search query (full-text)",
                        "type": "string",
                        "description": "Free-text search across IDs and tags."
                    },
                    "author": {
                        "title": "Author / Organization filter",
                        "type": "string",
                        "description": "Filter by author (e.g. 'meta-llama', 'google', 'mistralai')."
                    },
                    "pipelineTag": {
                        "title": "Pipeline tag (task)",
                        "type": "string",
                        "description": "Filter by task: text-generation, text-classification, automatic-speech-recognition, image-classification, sentence-similarity, embeddings, etc."
                    },
                    "library": {
                        "title": "Library",
                        "type": "string",
                        "description": "Filter by library: transformers, diffusers, sentence-transformers, gguf, llama.cpp, etc."
                    },
                    "language": {
                        "title": "Language",
                        "type": "string",
                        "description": "ISO code (en, es, multilingual, ja, ...)."
                    },
                    "license": {
                        "title": "License",
                        "type": "string",
                        "description": "Filter by license: apache-2.0, mit, llama3.1, openrail, cc-by-nc-4.0, etc."
                    },
                    "sdk": {
                        "title": "Space SDK (for spaces only)",
                        "type": "string",
                        "description": "Filter spaces by SDK: docker, gradio, streamlit, static."
                    },
                    "minDownloads": {
                        "title": "Min downloads",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip records below this download count."
                    },
                    "minLikes": {
                        "title": "Min likes",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip records below this likes count."
                    },
                    "sort": {
                        "title": "Sort by",
                        "enum": [
                            "trendingScore",
                            "downloads",
                            "likes",
                            "lastModified",
                            "createdAt"
                        ],
                        "type": "string",
                        "description": "Sort criterion for the result list.",
                        "default": "trendingScore"
                    },
                    "maxResults": {
                        "title": "Max results",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Hard cap on records to return.",
                        "default": 100
                    },
                    "parseTagsStructured": {
                        "title": "Parse tags into structured fields",
                        "type": "boolean",
                        "description": "If true, splits the flat tags array into: license, languages, datasetsUsed, arxivPapers, region, hardwareCompatible, frameworks.",
                        "default": true
                    },
                    "includeAuthorProfile": {
                        "title": "Include author profile",
                        "type": "boolean",
                        "description": "For each record, fetch the author/org overview (followers, isPro, numModels, orgs).",
                        "default": false
                    },
                    "enrichWithGoogle": {
                        "title": "Enrich author with web search",
                        "type": "boolean",
                        "description": "For each unique author, find their personal/company website, LinkedIn, Facebook, and secondary emails using a web search (no API key required).",
                        "default": false
                    },
                    "enrichLimit": {
                        "title": "Enrich limit",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Max number of UNIQUE authors to enrich (not records).",
                        "default": 50
                    },
                    "proxyConfig": {
                        "title": "Proxy",
                        "type": "object",
                        "description": "Apify proxy. Used for enrichment only.",
                        "default": {
                            "useApifyProxy": true,
                            "apifyProxyGroups": [
                                "RESIDENTIAL"
                            ]
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
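
The input schema above puts a few hard constraints on a run: `searchType` and `sort` must be one of the listed enum values, and the integer fields have bounds (`maxResults` 1 to 5000, `enrichLimit` 1 to 1000, `minDownloads` and `minLikes` at least 0). A minimal sketch of a local pre-flight check follows. The `check_input` helper and its messages are hypothetical and not part of the Actor or the Apify client. The rule that `byIds` requires a non-empty `ids` list is an assumption drawn from the field description, not something the schema enforces.

```python
from typing import Any

# Constraints copied from the input schema above.
SEARCH_TYPES = {"models", "datasets", "spaces", "papers", "byIds"}
SORT_KEYS = {"trendingScore", "downloads", "likes", "lastModified", "createdAt"}
INT_BOUNDS = {
    "minDownloads": (0, None),
    "minLikes": (0, None),
    "maxResults": (1, 5000),
    "enrichLimit": (1, 1000),
}


def check_input(run_input: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # Enum fields fall back to the schema defaults when omitted.
    search_type = run_input.get("searchType", "models")
    if search_type not in SEARCH_TYPES:
        problems.append(f"searchType: unknown value {search_type!r}")
    sort_key = run_input.get("sort", "trendingScore")
    if sort_key not in SORT_KEYS:
        problems.append(f"sort: unknown value {sort_key!r}")

    # Integer fields: check type first, then the schema bounds.
    for key, (low, high) in INT_BOUNDS.items():
        if key not in run_input:
            continue
        value = run_input[key]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key}: expected an integer")
        elif value < low or (high is not None and value > high):
            problems.append(f"{key}: {value} is out of range")

    # Assumption (not enforced by the schema): byIds is only useful with ids.
    if search_type == "byIds" and not run_input.get("ids"):
        problems.append("ids: expected a non-empty list when searchType is 'byIds'")

    return problems


if __name__ == "__main__":
    example = {"searchType": "models", "author": "meta-llama", "maxResults": 200}
    print(check_input(example))
```

Running a check like this before starting the Actor catches out-of-range values locally instead of relying on the platform to reject the input.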

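The `runsResponseSchema` above wraps every run field in a top-level `data` object. The fields most callers need are `status`, `defaultDatasetId` (where the scraped records land), and `usageTotalUsd`. The sketch below pulls those out of an already-decoded response. The `summarize_run` helper and the sample IDs are hypothetical and used only for illustration.

```python
from typing import Any


def summarize_run(response: dict[str, Any]) -> dict[str, Any]:
    """Extract the commonly used fields from a decoded runs response."""
    # Everything of interest sits under the top-level "data" key.
    data = response.get("data") or {}
    return {
        "run_id": data.get("id"),
        "status": data.get("status"),
        "dataset_id": data.get("defaultDatasetId"),
        "cost_usd": data.get("usageTotalUsd"),
    }


if __name__ == "__main__":
    # Hypothetical payload shaped like the schema above; the IDs are made up.
    sample = {
        "data": {
            "id": "run-123",
            "status": "READY",
            "defaultDatasetId": "dataset-456",
            "usageTotalUsd": 0.00005,
        }
    }
    print(summarize_run(sample))
```

Missing keys come back as `None` rather than raising, which keeps the helper usable on partial responses.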