Huggingface Model Page Scraper avatar

Huggingface Model Page Scraper

Pricing

from $0.00005 / actor start

Go to Apify Store
Huggingface Model Page Scraper

Huggingface Model Page Scraper

Scrapes metadata and model card text from Hugging Face model pages.

Pricing

from $0.00005 / actor start

Rating

0.0

(0)

Developer

Akash Kumar Naik

Akash Kumar Naik

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

a day ago

Last modified

Share

Hugging Face Model Page Scraper

Scrape Hugging Face model pages and return structured model metadata for analytics, monitoring, and downstream pipelines.

What This Actor Does

This Actor collects metadata from Hugging Face model pages such as:

  • model ID, title, and author
  • likes and download counters
  • pipeline tag, library, license, languages, tags
  • creation/last-modified timestamps
  • model card text (optional)

It uses CheerioCrawler for fast HTTP scraping and reads model metadata from the page's ModelHeader payload.

Who It Is For

  • ML platform teams tracking model popularity and changes
  • data engineers building model catalogs
  • research teams collecting model metadata at scale

Key Features

  • Accepts full model URLs, owner/model IDs, and listing URLs like https://huggingface.co/models
  • Deduplicates model inputs automatically
  • Optional model card text extraction
  • Emits structured dataset output for each model
  • Stores failures as output rows with an error field

Typical Use Cases

  • Daily monitoring of selected Hugging Face models
  • Building internal dashboards for likes/download trends
  • Enriching model registry records with tags and license data

Input Parameters

Provide at least one URL in startUrls.

ParameterTypeRequiredDefaultDescription
startUrlsarrayYes[{ "url": "https://huggingface.co/models" }]Model page URLs and/or listing URLs.
maxItemsintegerNo30Maximum model pages to scrape.
includeModelCardTextbooleanNofalseInclude extracted model card text in output.

Example input:

{
"startUrls": [
{ "url": "https://huggingface.co/models" },
{ "url": "https://huggingface.co/google-bert/bert-base-uncased" }
],
"maxItems": 30,
"includeModelCardText": false
}

Output Format

Each dataset item includes fields such as:

{
"url": "https://huggingface.co/google-bert/bert-base-uncased",
"canonicalUrl": "https://huggingface.co/google-bert/bert-base-uncased",
"modelId": "google-bert/bert-base-uncased",
"title": "google-bert/bert-base-uncased · Hugging Face",
"author": "google-bert",
"likes": 2577,
"downloads": 60873623,
"downloadsAllTime": 2734678037,
"pipelineTag": "fill-mask",
"libraryName": "transformers",
"license": "apache-2.0",
"languages": ["en"],
"tags": ["transformers", "pytorch"],
"createdAt": "2022-03-02T23:29:04.000Z",
"lastModified": "2024-02-19T11:06:12.000Z",
"private": false,
"gated": false,
"cardExists": true,
"inference": "warm",
"statusCode": 200,
"modelCardText": null,
"scrapedAt": "2026-02-28T00:00:00.000Z",
"error": null
}

Edge cases:

  • Invalid inputs are skipped with warnings in logs.
  • Request failures are still pushed to dataset with error populated.
  • modelCardText can be null when disabled or unavailable.

Quick Start

Run Locally

  1. Install dependencies:
$npm install
  1. Create input file:

storage/key_value_stores/default/INPUT.json

  1. Run the Actor:
$apify run

Deploy To Apify

$apify push

Pricing Expectations

This Actor supports Pay Per Event charging via Actor.charge(...) per successful model scrape when:

  • Actor monetization is set to PAY_PER_EVENT on Apify.
  • Event name result exists in Actor monetization settings.

If the actor is not on PAY_PER_EVENT pricing, it continues scraping without charging.

Project PPE reference file: .actor/pay-per-event.json.

To reduce cost:

  • lower maxItems
  • lower maxConcurrency
  • increase requestDelayMs when large batches are not urgent

FAQ

Why did I get no results?

Make sure at least one valid model input exists in modelUrls or modelIds, for example google-bert/bert-base-uncased.

Do I need a browser crawler?

No. This Actor uses HTTP + Cheerio for speed because target metadata is available in server-rendered HTML payloads.

Why do I see rows with error?

Failed requests are intentionally emitted so batch runs keep a complete audit trail of success/failure per model.

Can I scrape private or gated models?

Public page metadata is scraped. Access-restricted content may be limited depending on model gating and authentication state.

  • Respect Hugging Face Terms of Service and robots policy.
  • Only scrape data you are allowed to collect and store.
  • Review licensing fields (license) before downstream usage of model artifacts.

Support

For issues or enhancements, open an issue in your project repository or update this Actor code directly in this project.