# AI Model Benchmark Scraper
Pricing: Pay per usage
Developer: Donny Nguyen (maintained by Community)
Scrape AI benchmark leaderboards to extract model performance scores and rankings. Supports multiple benchmark sources including Chatbot Arena (LMSYS), MMLU, HumanEval, and MT-Bench leaderboards.
## Features
- Multi-benchmark support covering the most popular LLM evaluation frameworks
- Chatbot Arena scraping with Elo ratings from the LMSYS leaderboard
- Model metadata extraction including provider, parameter count, and release date
- Score normalization for cross-benchmark comparison when possible
- Puppeteer-based rendering to handle JavaScript-heavy leaderboard pages
- Configurable benchmark selection to target specific evaluation metrics
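Cross-benchmark score normalization can be sketched as a min-max rescaling of each benchmark's scores to a common 0-1 range. The helper below is illustrative only; the function name and sample data are hypothetical, not part of the actor's API:

```javascript
// Min-max normalization: rescale each benchmark's scores to [0, 1]
// so models can be loosely compared across leaderboards.
function normalizeScores(entries) {
  const byBenchmark = new Map();
  for (const e of entries) {
    if (!byBenchmark.has(e.benchmark)) byBenchmark.set(e.benchmark, []);
    byBenchmark.get(e.benchmark).push(e.score);
  }
  return entries.map((e) => {
    const scores = byBenchmark.get(e.benchmark);
    const min = Math.min(...scores);
    const max = Math.max(...scores);
    // Degenerate case: all scores equal, normalize to 1.
    const normalizedScore = max === min ? 1 : (e.score - min) / (max - min);
    return { ...e, normalizedScore };
  });
}

// Hypothetical sample entries in the actor's output shape.
const sample = [
  { benchmark: 'chatbot-arena', modelName: 'model-a', score: 1250 },
  { benchmark: 'chatbot-arena', modelName: 'model-b', score: 1100 },
  { benchmark: 'mmlu', modelName: 'model-a', score: 86.4 },
  { benchmark: 'mmlu', modelName: 'model-b', score: 70.0 },
];
console.log(normalizeScores(sample));
```

Note that min-max rescaling only makes loose comparisons possible: an Elo rating and an accuracy percentage measure different things, so normalized values should be read as within-leaderboard positions, not equivalent quality scores.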
## Use Cases
- Compare LLM performance across multiple benchmarks before selecting a model
- Track model performance improvements over time with scheduled runs
- Build automated reports on the AI model competitive landscape
- Feed benchmark data into model selection pipelines and evaluation frameworks
- Monitor when new models appear on leaderboards for competitive intelligence
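Monitoring for newly listed models can be done by diffing the dataset items of two scheduled runs. A minimal sketch, assuming items in the output shape described below (the function name and sample data are hypothetical):

```javascript
// Return items that appear in the latest run but not the previous one,
// keyed by benchmark + model name.
function findNewModels(previousItems, latestItems) {
  const known = new Set(previousItems.map((i) => `${i.benchmark}:${i.modelName}`));
  return latestItems.filter((i) => !known.has(`${i.benchmark}:${i.modelName}`));
}

// Hypothetical items from two consecutive runs.
const previous = [{ benchmark: 'chatbot-arena', modelName: 'model-a' }];
const latest = [
  { benchmark: 'chatbot-arena', modelName: 'model-a' },
  { benchmark: 'chatbot-arena', modelName: 'model-b' },
];
console.log(findNewModels(previous, latest)); // only model-b is new
```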
## Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `benchmarks` | array | `["chatbot-arena"]` | Benchmarks to scrape |
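An example input might look like the following. Only `benchmarks` is documented above; the `"mmlu"` value is an assumption based on the benchmarks this actor lists as supported:

```json
{
  "benchmarks": ["chatbot-arena", "mmlu"]
}
```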
## Output Format
Each model entry produces a dataset item with:
- `benchmark` - Name of the benchmark source
- `modelName` - Full model name or identifier
- `score` - Benchmark score or Elo rating
- `rank` - Position on the leaderboard
- `provider` - Organization or company behind the model
- `parameters` - Parameter count when available
- `scrapedAt` - ISO timestamp of extraction
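Putting those fields together, a single dataset item might look like this (all values illustrative, not real scrape output):

```json
{
  "benchmark": "chatbot-arena",
  "modelName": "model-a",
  "score": 1250,
  "rank": 1,
  "provider": "Example AI",
  "parameters": "70B",
  "scrapedAt": "2025-01-01T00:00:00.000Z"
}
```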
## Supported Benchmarks
This actor supports scraping from LMSYS Chatbot Arena, HuggingFace Open LLM Leaderboard, and various benchmark result pages. Additional benchmark sources can be requested.
## Limitations
- Some leaderboards use complex React/Gradio rendering that may require multiple attempts
- Benchmark scores and rankings change frequently; schedule regular runs for latest data
- Parameter counts and release dates may not be available for all models
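The "multiple attempts" caveat for React/Gradio pages can be handled on the caller's side with a generic retry wrapper. This is a sketch, not part of the actor; `scrapeLeaderboard` in the usage comment is a hypothetical function standing in for whatever triggers a run:

```javascript
// Retry an async operation a fixed number of times, pausing between
// attempts; useful for pages that occasionally fail to render.
async function withRetries(fn, attempts = 3, delayMs = 1000) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastError; // all attempts failed
}

// Hypothetical usage:
// const items = await withRetries(() => scrapeLeaderboard(url));
```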