# HuggingFace Hub Scraper (`devilscrapes/huggingface-hub-scraper`) Actor

Export models, datasets, and Spaces from HuggingFace Hub. Filter by task, library, or author. Trending snapshot mode. No login needed. Richer schema than incumbents.

- **URL**: https://apify.com/devilscrapes/huggingface-hub-scraper.md
- **Developed by:** [DevilScrapes](https://apify.com/devilscrapes) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 0 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage, only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases. In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours, and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store. In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server. Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects. You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready. The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash

# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).

# README

## HuggingFace Hub Scraper

_We do the dirty work so your dataset stays clean._ 😈

**$2.05 / 1,000 rows in list mode, $5.05 / 1,000 rows in detail mode.** Export structured metadata for models, datasets, and Spaces from the HuggingFace Hub via the public REST API. One Actor handles all three repo types via a `repoType` selector. No login. No API key. No browser automation.

This Actor calls HuggingFace's public REST API at `https://huggingface.co/api/`, paginates the list endpoint, optionally enriches each row with the per-repo detail endpoint (safetensors parameter count, GGUF file detection, Space runtime stage), and emits a flat Pydantic-validated dataset ready for direct analysis in spreadsheets, BI tools, or SQL.

### 🎯 What this scrapes

Three repo types, one Actor:

1. **Models**: every model on the Hub, with downloads, likes, pipeline tag, library name, tags, and optional safetensors parameter count + GGUF detection.
2. **Datasets**: every dataset on the Hub, with task categories, size categories, and language codes parsed out of tag prefixes.
3. **Spaces**: every Space on the Hub, with SDK name and optional runtime stage (RUNNING / SLEEPING / STOPPED) from detail mode.

Five filtering modes (mutually exclusive: at most one filter input may be set):

- **Trending snapshot**: leave all filters blank to capture the top-N repos sorted by `downloads`, `likes`, `trending`, `last_modified`, or `created_at`.
- **Tag filter**: pass `filterTags` to restrict to repos carrying specific tags (e.g. `text-generation`).
- **Search query**: pass `searchQuery` for free-text search across repo metadata.
- **Author**: pass `author` for every public repo from one org or user (e.g. `openai`).
- **Single-repo deep fetch**: pass `repoId` in `owner/name` form to call only the detail endpoint and emit one richly-enriched row.

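The at-most-one-filter rule can be checked on the client side before a run is started. The sketch below is a plain-Python illustration, not the Actor's own Pydantic validator: the helper names `active_filters` and `filter_mode` are hypothetical, and treating empty strings and empty lists as "not set" is an assumption. Only the input field names come from this README.

```python
# Filter inputs named in this README; at most one may be set per run.
FILTER_FIELDS = ("filterTags", "searchQuery", "author", "repoId")


def active_filters(actor_input: dict) -> list:
    """Return the filter fields that carry a non-empty value (assumption: empty == unset)."""
    return [name for name in FILTER_FIELDS if actor_input.get(name)]


def filter_mode(actor_input: dict) -> str:
    """Name the filtering mode implied by the input, or raise if it is ambiguous."""
    active = active_filters(actor_input)
    if len(active) > 1:
        raise ValueError(f"Set at most one filter, got: {', '.join(active)}")
    # No filter at all corresponds to the trending snapshot mode.
    return active[0] if active else "trending-snapshot"


print(filter_mode({"repoType": "model", "author": "openai"}))  # author
print(filter_mode({"repoType": "model", "sort": "likes"}))     # trending-snapshot
```

Running a check like this locally avoids paying the `actor-start` event for a run that would fail input validation anyway.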
| Field | Type | Description |
|---|---|---|
| `repo_type` | string | One of `model`, `dataset`, `space` |
| `repo_id` | string | HuggingFace repo identifier in `owner/name` form |
| `repo_owner` | string \| null | Owning org or user slug |
| `repo_name` | string | Last segment of `repo_id` |
| `repo_url` | string | `https://huggingface.co/{repo_id}` |
| `downloads` | integer \| null | 30-day rolling download count (always null for Spaces) |
| `likes` | integer | Like count |
| `created_at` | string | ISO 8601 creation timestamp |
| `last_modified` | string \| null | ISO 8601 last-modified timestamp |
| `tags` | array | Repo tags (may be empty) |
| `gated` | boolean \| null | `manual` -> true, false -> false, absent -> null |
| `private` | boolean | Always false for public catalog entries |
| `pipeline_tag` | string \| null | Model pipeline tag (models only) |
| `library_name` | string \| null | Model library name (models only) |
| `safetensors_total_params` | integer \| null | Total safetensors parameter count (detail mode, models only) |
| `model_size_category` | string \| null | Bucket label `<1B` / `1B-7B` / `7B-13B` / `13B+` |
| `has_gguf` | boolean \| null | True if any sibling filename ends `.gguf` (detail mode) |
| `dataset_size_categories` | array \| null | Stripped from `size_categories:` tag prefix (datasets only) |
| `dataset_task_categories` | array \| null | Stripped from `task_categories:` tag prefix (datasets only) |
| `dataset_languages` | array \| null | Stripped from `language:` tag prefix (datasets only) |
| `space_sdk` | string \| null | Space SDK (`gradio`, `streamlit`, `docker`, `static`) |
| `space_runtime_stage` | string \| null | Space runtime stage (detail mode, Spaces only) |
| `scraped_at` | string | ISO 8601 UTC datetime this row was written |

### 🔥 Features

- No HuggingFace account or API token required: uses the public unauthenticated Hub API.
- Three repo types in one Actor: `model`, `dataset`, `space` — pick via the `repoType` selector. - Five sort fields: `downloads`, `likes`, `trending`, `last_modified`, `created_at` (all descending). - Five filter modes: tag list, free-text search, author/org, single-repo deep fetch, or no filter at all (trending snapshot). - Optional `includeDetails` mode — calls the per-repo detail endpoint to enrich with safetensors parameter counts, GGUF file detection, and Space runtime stage. - GGUF detection flag and derived `model_size_category` bucket (<1B / 1B-7B / 7B-13B / 13B+) ready for downstream pricing or hardware-fit dashboards. - Dataset tag prefixes auto-parsed into structured arrays: `size_categories:`, `task_categories:`, `language:`. - Pydantic v2 input validation — at most one filter may be set; invalid input fails fast with a clear error before any network call. - Exponential backoff on `429` and `503`; honours `Retry-After`; max 5 attempts; respects HuggingFace's 500-req / 5-min rate limit. - Pure HTTP client (`curl-cffi` with browser fingerprint impersonation) — no browser automation, low compute footprint. - Companion to `llm-pricing-monitor` and the planned LMSys leaderboard scraper as the AI Stack Intelligence suite. ### 💡 Use cases - **AI researcher trend tracking** — pull the trending top-100 models weekly and feed a time-series dashboard tracking which model families dominate the Hub. - **Investor adoption monitoring** — measure download velocity for specific model families (`pipeline_tag=text-generation` + `library_name=transformers`) to inform AI infrastructure investment theses. - **Fine-tuner catalog survey** — enumerate every model under a `pipeline_tag` like `image-segmentation` to map the open-weights landscape before choosing a base model. - **Dataset discovery** — filter datasets by `task_categories` and `language` to find labelled training corpora for a downstream NLP/vision/audio model. 
- **Hardware-fit analysis** — use `safetensors_total_params` and `model_size_category` to filter models that fit a target memory budget before benchmarking. - **GGUF availability tracking** — set `includeDetails=true` and filter on `has_gguf=true` to find quantized inference-ready models for llama.cpp or LM Studio. - **Space monitoring** — capture which Spaces are `RUNNING` vs `SLEEPING` for a creator or topic, useful for community health dashboards. - **Content creator coverage** — feed every model from a popular org like `openai` or `meta-llama` into a content pipeline for blog posts or YouTube videos. ### ⚙️ How to use it 1. Open the Actor input form. 2. Pick **Repo type** — `model`, `dataset`, or `space`. 3. Optionally pick a **Sort field** (default `downloads`). 4. Set **at most one** filter: `filterTags`, `searchQuery`, `author`, or `repoId`. Leave all four blank for a trending snapshot. 5. Adjust **Max results** (default 100, maximum 5000). Ignored in `repoId` mode (always 1 row). 6. Toggle **Include detail enrichment** on if you want safetensors parameter counts, GGUF detection, or Space runtime stage. Detail mode is charged at $0.005/row instead of $0.002/row. 7. Leave **Use Apify Proxy** off unless you are behind a restrictive ISP — the HuggingFace API does not block datacenter IPs. 8. Click **Start** and watch the run log. Results stream into the default dataset and can be downloaded as JSON, CSV, Excel, or XML via the **Export** button. 
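The per-row charges from step 6 can be turned into a quick pre-run estimate. This is a minimal sketch assuming the event prices listed in this README ($0.05 per run start, $0.002 per list row, $0.005 per detailed row) and assuming the run emits exactly the requested number of rows; `estimate_cost` is a hypothetical helper, not part of the Actor.

```python
from decimal import Decimal
from typing import Optional

# Event prices as documented in this README (USD).
ACTOR_START = Decimal("0.05")
ROW_LIST = Decimal("0.002")
ROW_DETAILED = Decimal("0.005")


def estimate_cost(rows: int, include_details: bool = False,
                  repo_id: Optional[str] = None) -> Decimal:
    """Estimate the cost of one run in USD."""
    if repo_id is not None:
        # repoId mode always emits one row and forces detail pricing.
        rows, include_details = 1, True
    per_row = ROW_DETAILED if include_details else ROW_LIST
    return ACTOR_START + per_row * rows


print(estimate_cost(1000))                        # 2.050
print(estimate_cost(1000, include_details=True))  # 5.050
```

`Decimal` is used instead of floats so that small per-row prices add up without rounding noise.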
#### Quick examples

**Trending top-100 models, list mode:**

```json
{
  "repoType": "model",
  "sort": "downloads",
  "maxResults": 100,
  "includeDetails": false
}
````

**Every transformers text-generation model, with safetensors and GGUF detection:**

```json
{
  "repoType": "model",
  "filterTags": ["text-generation", "transformers"],
  "maxResults": 500,
  "includeDetails": true
}
```

**Single-repo deep fetch:**

```json
{
  "repoType": "model",
  "repoId": "openai/whisper-large-v3"
}
```

### 📥 Input

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `repoType` | string | yes | - | `model`, `dataset`, or `space` |
| `sort` | string | no | `downloads` | One of `downloads`, `likes`, `last_modified`, `created_at`, `trending` |
| `filterTags` | array | one-of | - | Tag filter; joined with comma for HF `filter=` param |
| `searchQuery` | string | one-of | - | Free-text search via HF `search=` |
| `author` | string | one-of | - | Org or user slug via HF `author=` |
| `repoId` | string | one-of | - | Single `owner/name` deep-fetch; forces detail mode |
| `maxResults` | integer | no | `100` | Max rows emitted (1 to 5000) |
| `includeDetails` | boolean | no | `false` | Per-row detail enrichment |
| `useProxy` | boolean | no | `false` | Route via Apify Proxy (BUYPROXIES94952) |

At most one of `filterTags`, `searchQuery`, `author`, `repoId` may be non-null. Setting two or more raises a Pydantic validation error before any network call.

### 📤 Output

One row per repo. All fields are documented in the field table above; type-specific fields are `null` for repo types where they don't apply.

```json { "repo_type": "model", "repo_id": "openai/whisper-large-v3", "repo_owner": "openai", "repo_name": "whisper-large-v3", "repo_url": "https://huggingface.co/openai/whisper-large-v3", "downloads": 4932732, "likes": 5690, "created_at": "2023-11-07T18:41:14.000Z", "last_modified": "2024-08-12T10:20:10.000Z", "tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"], "gated": false, "private": false, "pipeline_tag": "automatic-speech-recognition", "library_name": "transformers", "safetensors_total_params": 1543490560, "model_size_category": "1B-7B", "has_gguf": false, "dataset_size_categories": null, "dataset_task_categories": null, "dataset_languages": null, "space_sdk": null, "space_runtime_stage": null, "scraped_at": "2026-05-16T12:00:00+00:00" } ``` Optional fields (`repo_owner`, `downloads`, `last_modified`, `gated`, `pipeline_tag`, `library_name`, `safetensors_total_params`, `model_size_category`, `has_gguf`, the dataset\_\* arrays, `space_sdk`, `space_runtime_stage`) are emitted as `null` when the API does not return them. Rows are never dropped for missing optional fields. #### Export formats After a run completes, click **Export** in the Apify Console to download: - **JSON** — full fidelity, all fields, newline-delimited - **CSV** — flat, one row per repo - **Excel** — `.xlsx` via the Apify dataset converter - **XML** — structured per-item All formats are available via the Apify API: `GET /datasets/{id}/items?format=csv&clean=true`. 
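The export endpoint above can also be called from a script. The sketch below builds the download URL with the standard library only. The base URL `https://api.apify.com/v2` is taken from the OpenAPI section of this document; `dataset_items_url`, `download_items`, and the placeholder dataset ID are hypothetical, and passing the token as a `token` query parameter is an assumption based on how the run endpoints in that spec authenticate.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the items URL for a dataset in the requested export format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


def download_items(dataset_id: str, token: str, fmt: str = "csv") -> bytes:
    """Fetch the exported items; performs a real HTTP request when called."""
    url = dataset_items_url(dataset_id, fmt) + "&" + urlencode({"token": token})
    with urlopen(url) as response:
        return response.read()


print(dataset_items_url("YOUR_DATASET_ID"))
# https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv&clean=true
```

The dataset ID for a finished run is available as `defaultDatasetId` on the run object returned by the API clients.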
### 💰 Pricing

Pay-Per-Event (PPE): you pay only for what you use.

| Event | Price (USD) | When |
|---|---|---|
| `actor-start` | $0.05 | Once per run, at boot |
| `result-row` | $0.002 | Per repo row written in list mode (`includeDetails=false`) |
| `result-row-detailed` | $0.005 | Per repo row written in detail mode or `repoId` mode |

#### Example costs

| Rows scraped | Mode | Actor starts | Total cost |
|---|---|---|---|
| 100 | list | 1 | $0.25 |
| 500 | list | 1 | $1.05 |
| 1,000 | list | 1 | $2.05 |
| 1,000 | detail | 1 | $5.05 |
| 5,000 | list | 1 | $10.05 |
| 5,000 | detail | 1 | $25.05 |

This rate is consistent with the `llm-pricing-monitor` companion Actor so the AI Stack Intelligence suite has uniform pricing.

### 🚧 Limitations

- **Private and gated repos are not accessible.** The unauthenticated public API only returns publicly visible data.
- **Rate limit: 500 requests per 5 minutes** (verified 2026-05-16 via `ratelimit-policy` header). At the default page size of 100, this allows up to ~50,000 list rows per 5-minute window (~10,000 rows/minute). Detail mode adds one request per row, which caps throughput at roughly 100 enriched rows/minute.
- **Spaces never have a `downloads` metric.** The field is always `null` for `repo_type=space`, verified on both the list and detail endpoints.
- **Sparse Spaces list:** `repo_owner`, `last_modified`, and `space_runtime_stage` require detail mode for Spaces.
- **Sparse model list without detail mode:** while `full=true` is always sent, `safetensors_total_params`, `model_size_category`, and `has_gguf` are only populated in detail mode.
- **No cross-run deduplication.** Re-running the same input returns the same repos with refreshed metadata. Use a downstream dedupe pass if you need uniqueness across runs.
- **No model card or dataset card markdown content.** Only structured metadata fields; the README body is excluded as too noisy for a structured dataset.
- **No HuggingFace Inference API calls or model benchmarking.** This Actor only scrapes catalog metadata, not model outputs. - **The Apify FREE tier retains run-scoped storage for 7 days only.** For longer retention, export your dataset immediately or upgrade to a paid Apify plan. ### Tips for best results - **Use a trending snapshot weekly** to track the rapidly-evolving model leaderboard. Set up an Apify Schedule for a recurring run. - **Cap `maxResults` to what you actually need.** The HF Hub has 1M+ models; setting a sensible cap keeps cost and runtime predictable. - **Use detail mode sparingly.** It is 2.5x the per-row cost and 4x the per-row latency. Prefer list mode for catalog snapshots; flip to detail mode only when you need safetensors / GGUF / runtime. - **Combine with `llm-pricing-monitor`** as the AI Stack Intelligence suite to correlate open-weights releases with hosted-API price moves. ### Integrations This Actor works natively with the Apify platform's built-in connectors: - **Apify API** — trigger runs programmatically, poll for status, and fetch dataset items via REST. Full OpenAPI spec at `https://docs.apify.com/api/v2`. - **Webhooks** — configure a webhook to POST the run result to your endpoint as soon as the Actor finishes. - **Apify Schedules** — run this Actor on a cron schedule to keep a trending leaderboard dataset fresh. - **Make (formerly Integromat)** — use the Apify Make module to trigger runs and route results to Google Sheets, Airtable, Slack, or anywhere Make connects. - **Zapier** — Apify's Zapier integration triggers on run completion and passes dataset items downstream. - **n8n** — use the HTTP Request node with the Apify REST API for fully self-hosted automation pipelines. ### ❓ FAQ **Do I need a HuggingFace account?** No. The HuggingFace Hub public API at `huggingface.co/api/` is unauthenticated by design for read access to public repos. No account, no API token, no rate-limit credentials needed. 
**What is the difference between list mode and detail mode?** List mode (`includeDetails=false`, $0.002/row) makes one API request per page of 100 rows. You get repo\_id, owner, downloads, likes, tags, pipeline\_tag, library\_name, gated flag, and timestamps. Detail mode (`includeDetails=true`, $0.005/row) additionally makes one request per row to fetch safetensors parameter counts, GGUF file detection, Space runtime stage, and dataset cardData. Use detail mode when you need any of those enriched fields; otherwise stick to list mode for 2.5x cheaper rows and faster throughput. **How do I get trending models?** Leave all four filter fields (`filterTags`, `searchQuery`, `author`, `repoId`) blank, keep the default `sort=downloads`, and set `maxResults` to the top-N you want (e.g. 100). The Actor returns the most-downloaded models on the Hub, descending. To use the HF "trending" score instead, set `sort=trending`. **Can I scrape datasets and Spaces too?** Yes. Switch the `repoType` input to `dataset` or `space` and run again. The same filtering, pagination, and detail-mode features apply across all three repo types. Note that Spaces never have a `downloads` count and require detail mode for `repo_owner` / `last_modified` / `runtime_stage`. **Why are some fields null on my rows?** HuggingFace's list endpoint returns different field sets for different repo types. For example, model list rows lack `repo_owner` unless `full=true` is sent (this Actor always sends it). Space list rows lack `repo_owner`, `last_modified`, and `runtime` entirely — you must use detail mode to populate them. Dataset list rows are the richest by default. The `null` values are accurate, not bugs. **What is GGUF and why detect it?** GGUF (GPT-Generated Unified Format) is the quantized model file format used by llama.cpp, LM Studio, and Ollama for local CPU inference. When a model repo includes a `.gguf` sibling file, it can run on consumer hardware without a GPU. 
Set `includeDetails=true` and filter your dataset on `has_gguf=true` to find inference-ready open-weights models. **Why is `useProxy` off by default?** The HuggingFace Hub API does not block datacenter IPs, so direct routing is faster and free. Enable proxy only if you are behind a restrictive ISP or firewall. **Is scraping the HuggingFace Hub legal?** The HuggingFace public API at `huggingface.co/api/` is unauthenticated and explicitly designed for programmatic access to public repo metadata. The HuggingFace Terms of Service permit accessing public data via the API. This Actor never bypasses authentication, never accesses gated content, and never submits content — it only reads public repository metadata. Always verify the current Terms of Service at `huggingface.co/terms-of-service` and your local jurisdiction's data-protection rules before using scraped data for commercial purposes. ### Related Actors - **`llm-pricing-monitor`** (planned) — companion in the AI Stack Intelligence suite. Tracks per-token pricing for hosted LLM APIs (OpenAI, Anthropic, Google, Mistral) so you can correlate open-weights model releases on the HF Hub with hosted-API price moves. ### 💬 Your feedback Found a bug, hit a rate limit, or need a new field on the output row? Open an issue on the Actor's Apify Store page or contact the Devil Scrapes team at [apify.com/DevilScrapes](https://apify.com/DevilScrapes). We ship updates within days of validated reports. # Actor input Schema ## `repoType` (type: `string`): Which HuggingFace Hub repo type to scrape: model, dataset, or space. ## `sort` (type: `string`): Sort the list endpoint by this field. trending maps to the HF trendingScore param; last\_modified → lastModified; created\_at → createdAt. ## `filterTags` (type: `array`): Optional tag filter — joined with comma for the HF filter= query param. Example: \["text-generation", "transformers"]. Mutually exclusive with the other three filter inputs. 
## `searchQuery` (type: `string`): Optional free-text search via the HF search= query param. Mutually exclusive with the other three filter inputs. ## `author` (type: `string`): Optional org or user slug via the HF author= query param (e.g. openai). Mutually exclusive with the other three filter inputs. ## `repoId` (type: `string`): Optional single-repo deep-fetch in owner/name format. Skips the list endpoint and calls the per-repo detail endpoint directly. Forces detail mode. Mutually exclusive with the other three filter inputs. ## `maxResults` (type: `integer`): Maximum number of rows emitted. Ignored in repoId mode (always 1 row). ## `includeDetails` (type: `boolean`): If enabled, calls the per-repo detail endpoint for each row to enrich safetensors / siblings / runtime fields. Charged as result-row-detailed ($0.005/row). ## `useProxy` (type: `boolean`): Route requests through Apify Proxy (BUYPROXIES94952). The HuggingFace Hub API does not block datacenter IPs — leave disabled unless you are behind a restrictive ISP. ## Actor input object example ```json { "repoType": "model", "sort": "downloads", "maxResults": 100, "includeDetails": false, "useProxy": false } ``` # Actor output Schema ## `datasetItems` (type: `string`): All dataset items as JSON. ## `datasetItemsCsv` (type: `string`): Same data exported to CSV. ## `datasetView` (type: `string`): Open the run dataset in the Console. # API You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup. 
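
For stacks without a client library, the synchronous run endpoint listed in the OpenAPI specification of this document can be called with plain HTTP. The sketch below uses only the Python standard library; `build_run_request` and `run_actor` are hypothetical helper names, and the assumption that the response body is a JSON array of dataset items follows the endpoint's summary rather than a documented response schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Server URL and path as listed in the OpenAPI specification of this document.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/devilscrapes~huggingface-hub-scraper/run-sync-get-dataset-items"


def build_run_request(actor_input: dict, token: str) -> Request:
    """Prepare a POST request carrying the Actor input as a JSON body."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def run_actor(actor_input: dict, token: str) -> list:
    """Run the Actor and return its dataset items; performs a real, billable run."""
    with urlopen(build_run_request(actor_input, token)) as response:
        return json.load(response)


# Building the request does not contact the network.
request = build_run_request({"repoType": "model", "maxResults": 10}, token="YOUR_APIFY_TOKEN")
print(request.get_method(), request.full_url)
```

Because the call blocks until the run finishes, this pattern suits small `maxResults` values; longer runs are better started via the `/runs` endpoint and polled.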
## JavaScript example ```javascript import { ApifyClient } from 'apify-client'; // Initialize the ApifyClient with your Apify API token // Replace the '' with your token const client = new ApifyClient({ token: '', }); // Prepare Actor input const input = { "repoType": "model" }; // Run the Actor and wait for it to finish const run = await client.actor("devilscrapes/huggingface-hub-scraper").call(input); // Fetch and print Actor results from the run's dataset (if any) console.log('Results from dataset'); console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`); const { items } = await client.dataset(run.defaultDatasetId).listItems(); items.forEach((item) => { console.dir(item); }); // 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs ``` ## Python example ```python from apify_client import ApifyClient # Initialize the ApifyClient with your Apify API token # Replace '' with your token. client = ApifyClient("") # Prepare the Actor input run_input = { "repoType": "model" } # Run the Actor and wait for it to finish run = client.actor("devilscrapes/huggingface-hub-scraper").call(run_input=run_input) # Fetch and print Actor results from the run's dataset (if there are any) print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"]) for item in client.dataset(run["defaultDatasetId"]).iterate_items(): print(item) # 📚 Want to learn more 📖? 
Go to → https://docs.apify.com/api/client/python/docs/quick-start ``` ## CLI example ```bash echo '{ "repoType": "model" }' | apify call devilscrapes/huggingface-hub-scraper --silent --output-dataset ``` ## MCP server setup ```json { "mcpServers": { "apify": { "command": "npx", "args": [ "mcp-remote", "https://mcp.apify.com/?tools=devilscrapes/huggingface-hub-scraper", "--header", "Authorization: Bearer " ] } } } ``` ## OpenAPI specification ```json { "openapi": "3.0.1", "info": { "title": "HuggingFace Hub Scraper", "description": "Export models, datasets, and Spaces from HuggingFace Hub. Filter by task, library, or author. Trending snapshot mode. No login needed. Richer schema than incumbents.", "version": "0.2", "x-build-id": "wovGWW13KRPB8ommC" }, "servers": [ { "url": "https://api.apify.com/v2" } ], "paths": { "/acts/devilscrapes~huggingface-hub-scraper/run-sync-get-dataset-items": { "post": { "operationId": "run-sync-get-dataset-items-devilscrapes-huggingface-hub-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } }, "/acts/devilscrapes~huggingface-hub-scraper/runs": { "post": { "operationId": "runs-sync-devilscrapes-huggingface-hub-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor and returns information about the initiated run in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": 
"string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/runsResponseSchema" } } } } } } }, "/acts/devilscrapes~huggingface-hub-scraper/run-sync": { "post": { "operationId": "run-sync-devilscrapes-huggingface-hub-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } } }, "components": { "schemas": { "inputSchema": { "type": "object", "required": [ "repoType" ], "properties": { "repoType": { "title": "Repo type", "enum": [ "model", "dataset", "space" ], "type": "string", "description": "Which HuggingFace Hub repo type to scrape: model, dataset, or space.", "default": "model" }, "sort": { "title": "Sort field", "enum": [ "downloads", "likes", "last_modified", "created_at", "trending" ], "type": "string", "description": "Sort the list endpoint by this field. trending maps to the HF trendingScore param; last_modified → lastModified; created_at → createdAt.", "default": "downloads" }, "filterTags": { "title": "Filter tags", "type": "array", "description": "Optional tag filter — joined with comma for the HF filter= query param. Example: [\"text-generation\", \"transformers\"]. Mutually exclusive with the other three filter inputs.", "items": { "type": "string" } }, "searchQuery": { "title": "Search query", "type": "string", "description": "Optional free-text search via the HF search= query param. Mutually exclusive with the other three filter inputs." 
}, "author": { "title": "Author", "type": "string", "description": "Optional org or user slug via the HF author= query param (e.g. openai). Mutually exclusive with the other three filter inputs." }, "repoId": { "title": "Repo ID", "type": "string", "description": "Optional single-repo deep-fetch in owner/name format. Skips the list endpoint and calls the per-repo detail endpoint directly. Forces detail mode. Mutually exclusive with the other three filter inputs." }, "maxResults": { "title": "Max results", "minimum": 1, "maximum": 5000, "type": "integer", "description": "Maximum number of rows emitted. Ignored in repoId mode (always 1 row).", "default": 100 }, "includeDetails": { "title": "Include detail enrichment", "type": "boolean", "description": "If enabled, calls the per-repo detail endpoint for each row to enrich safetensors / siblings / runtime fields. Charged as result-row-detailed ($0.005/row).", "default": false }, "useProxy": { "title": "Use Apify Proxy", "type": "boolean", "description": "Route requests through Apify Proxy (BUYPROXIES94952). 
The HuggingFace Hub API does not block datacenter IPs — leave disabled unless you are behind a restrictive ISP.", "default": false } } }, "runsResponseSchema": { "type": "object", "properties": { "data": { "type": "object", "properties": { "id": { "type": "string" }, "actId": { "type": "string" }, "userId": { "type": "string" }, "startedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "finishedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "status": { "type": "string", "example": "READY" }, "meta": { "type": "object", "properties": { "origin": { "type": "string", "example": "API" }, "userAgent": { "type": "string" } } }, "stats": { "type": "object", "properties": { "inputBodyLen": { "type": "integer", "example": 2000 }, "rebootCount": { "type": "integer", "example": 0 }, "restartCount": { "type": "integer", "example": 0 }, "resurrectCount": { "type": "integer", "example": 0 }, "computeUnits": { "type": "integer", "example": 0 } } }, "options": { "type": "object", "properties": { "build": { "type": "string", "example": "latest" }, "timeoutSecs": { "type": "integer", "example": 300 }, "memoryMbytes": { "type": "integer", "example": 1024 }, "diskMbytes": { "type": "integer", "example": 2048 } } }, "buildId": { "type": "string" }, "defaultKeyValueStoreId": { "type": "string" }, "defaultDatasetId": { "type": "string" }, "defaultRequestQueueId": { "type": "string" }, "buildNumber": { "type": "string", "example": "1.0.0" }, "containerUrl": { "type": "string" }, "usage": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "integer", "example": 1 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": 
"integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } }, "usageTotalUsd": { "type": "number", "example": 0.00005 }, "usageUsd": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "number", "example": 0.00005 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } } } } } } } } } ```