Huggingface Model Page Scraper
Pricing
from $0.00005 / actor start
Huggingface Model Page Scraper
Scrapes metadata and model card text from Hugging Face model pages.
Pricing
from $0.00005 / actor start
Rating
0.0
(0)
Developer

Akash Kumar Naik
Actor stats
0
Bookmarked
2
Total users
1
Monthly active users
a day ago
Last modified
Categories
Share
Hugging Face Model Page Scraper
Scrape Hugging Face model pages and return structured model metadata for analytics, monitoring, and downstream pipelines.
What This Actor Does
This Actor collects metadata from Hugging Face model pages such as:
- model ID, title, and author
- likes and download counters
- pipeline tag, library, license, languages, tags
- creation/last-modified timestamps
- model card text (optional)
It uses CheerioCrawler for fast HTTP scraping and reads model metadata from the page's ModelHeader payload.
Who It Is For
- ML platform teams tracking model popularity and changes
- data engineers building model catalogs
- research teams collecting model metadata at scale
Key Features
- Accepts full model URLs,
owner/modelIDs, and listing URLs likehttps://huggingface.co/models - Deduplicates model inputs automatically
- Optional model card text extraction
- Emits structured dataset output for each model
- Stores failures as output rows with an
errorfield
Typical Use Cases
- Daily monitoring of selected Hugging Face models
- Building internal dashboards for likes/download trends
- Enriching model registry records with tags and license data
Input Parameters
Provide at least one URL in startUrls.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
startUrls | array | Yes | [{ "url": "https://huggingface.co/models" }] | Model page URLs and/or listing URLs. |
maxItems | integer | No | 30 | Maximum model pages to scrape. |
includeModelCardText | boolean | No | false | Include extracted model card text in output. |
Example input:
{"startUrls": [{ "url": "https://huggingface.co/models" },{ "url": "https://huggingface.co/google-bert/bert-base-uncased" }],"maxItems": 30,"includeModelCardText": false}
Output Format
Each dataset item includes fields such as:
{"url": "https://huggingface.co/google-bert/bert-base-uncased","canonicalUrl": "https://huggingface.co/google-bert/bert-base-uncased","modelId": "google-bert/bert-base-uncased","title": "google-bert/bert-base-uncased · Hugging Face","author": "google-bert","likes": 2577,"downloads": 60873623,"downloadsAllTime": 2734678037,"pipelineTag": "fill-mask","libraryName": "transformers","license": "apache-2.0","languages": ["en"],"tags": ["transformers", "pytorch"],"createdAt": "2022-03-02T23:29:04.000Z","lastModified": "2024-02-19T11:06:12.000Z","private": false,"gated": false,"cardExists": true,"inference": "warm","statusCode": 200,"modelCardText": null,"scrapedAt": "2026-02-28T00:00:00.000Z","error": null}
Edge cases:
- Invalid inputs are skipped with warnings in logs.
- Request failures are still pushed to dataset with
errorpopulated. modelCardTextcan benullwhen disabled or unavailable.
Quick Start
Run Locally
- Install dependencies:
$npm install
- Create input file:
storage/key_value_stores/default/INPUT.json
- Run the Actor:
$apify run
Deploy To Apify
$apify push
Pricing Expectations
This Actor supports Pay Per Event charging via Actor.charge(...) per successful model scrape when:
- Actor monetization is set to PAY_PER_EVENT on Apify.
- Event name
resultexists in Actor monetization settings.
If the actor is not on PAY_PER_EVENT pricing, it continues scraping without charging.
Project PPE reference file: .actor/pay-per-event.json.
To reduce cost:
- lower
maxItems - lower
maxConcurrency - increase
requestDelayMswhen large batches are not urgent
FAQ
Why did I get no results?
Make sure at least one valid model input exists in modelUrls or modelIds, for example google-bert/bert-base-uncased.
Do I need a browser crawler?
No. This Actor uses HTTP + Cheerio for speed because target metadata is available in server-rendered HTML payloads.
Why do I see rows with error?
Failed requests are intentionally emitted so batch runs keep a complete audit trail of success/failure per model.
Can I scrape private or gated models?
Public page metadata is scraped. Access-restricted content may be limited depending on model gating and authentication state.
Legal And Compliance
- Respect Hugging Face Terms of Service and robots policy.
- Only scrape data you are allowed to collect and store.
- Review licensing fields (
license) before downstream usage of model artifacts.
Support
For issues or enhancements, open an issue in your project repository or update this Actor code directly in this project.
Related Links
- Apify Actors docs: https://docs.apify.com/platform/actors/development
- Apify SDK for JavaScript: https://docs.apify.com/sdk/js
- Crawlee docs: https://crawlee.dev