Agent Eval Harness Finder
Pricing
Pay per usage
Catalog open-source agent eval harnesses & benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena). Combines GitHub search + curated seed list, scores by quality signals (stars, recency, license), parses README scope and sample model scores.
Developer
Yanlong Mu
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 10 days ago
What does Agent Eval Harness Finder do?
Agent Eval Harness Finder catalogs open-source agent evaluation harnesses and benchmarks — SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena, lm-evaluation-harness, HELM, RewardBench, and dozens more — and returns a ranked, deduplicated, production-ready inventory of where to find each one, how active it is, and what model scores have been published.
Instead of digging through papers and arXiv to figure out which benchmark to use for your agent framework, you run this Actor and get one structured dataset with `repo`, `stars`, `benchmarkType`, `lastUpdated`, `license`, `scope`, `lastPublishedScore`, and a 0-100 `qualityScore`. Sort, filter, and export as CSV, JSON, or Excel.
This Actor combines a curated seed list of 26+ canonical harness repos with live GitHub search so that newly published benchmarks surface automatically without manual updates.
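As a rough illustration of how a seed list and live search results could be combined and deduplicated, here is a minimal sketch. The function name `merge_candidates`, the rule that seed entries win on conflict, and the second sample repo are assumptions made for illustration, not the Actor's actual code.

```python
from typing import Iterable


def merge_candidates(seed: Iterable[dict], searched: Iterable[dict]) -> list[dict]:
    """Merge seed and search results, keeping one entry per repo.

    Assumption: seed entries win on conflict. Repo identifiers are compared
    case-insensitively, since GitHub owner/name lookups ignore case.
    """
    merged: dict[str, dict] = {}
    for entry in list(seed) + list(searched):
        key = entry["repo"].lower()
        merged.setdefault(key, entry)  # the first occurrence (seed) is kept
    return list(merged.values())


seed = [{"repo": "princeton-nlp/SWE-bench", "source": "curated_seed"}]
searched = [
    # Same repo as the seed entry, different casing: should be dropped.
    {"repo": "Princeton-NLP/swe-bench", "source": "search:agent benchmark"},
    # Hypothetical newly published repo: should surface.
    {"repo": "example-org/new-agent-bench", "source": "search:agent benchmark"},
]
catalog = merge_candidates(seed, searched)
```

Keying on the lowercased `owner/name` keeps the hand-curated entry when a search query returns the same repository.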
Part of Ian Mu's 100-Apify-Actor portfolio for the AI / Claude tooling ecosystem (github.com/ianymu). See also: claude-verify-before-stop — a Claude Code hook that enforces real verification before tasks are marked complete.
Why use Agent Eval Harness Finder?
- AI researchers — Find the right benchmark for your agent paper without spending an afternoon on Google Scholar.
- Agent framework builders — Decide which harnesses to wire into your CI / regression suite (SWE-Bench for coding agents? ToolBench for tool-use? WebArena for browser agents?).
- Journalists & analysts — Get a defensible, dated snapshot of "the most active agent benchmarks today" with quality scores you can cite.
- Procurement / due diligence — When evaluating an AI agent vendor's claims ("we score X% on Y benchmark"), this Actor tells you whether Y benchmark is still maintained, who maintains it, and what other models score.
- Trend detection — Schedule a daily run to spot newly-published benchmarks the moment they cross the star-threshold.
How to use Agent Eval Harness Finder
- Open the Actor in Apify Console and click Try Actor.
- (Optional) Filter by benchmark type (`code-fixing`, `web-agent`, `tool-use`, `multi-agent`, etc.). Leave empty for everything.
- (Optional) Adjust `minStars` (default 100) and `maxResults` (default 30).
- Click Start. Runs in roughly a minute.
- Open the Output tab and download as CSV, JSON, or Excel, or hit the Dataset API endpoint to integrate downstream.
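For downstream integration, the Dataset API endpoint mentioned in the last step can be addressed with a plain URL. The sketch below only builds that URL. It assumes Apify's standard dataset items endpoint with its `format` and `clean` query parameters; `YOUR_DATASET_ID` is a placeholder, and non-public datasets additionally need an API token.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json", clean: bool = True) -> str:
    """Build the URL that returns a run's dataset items in the given format."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Placeholder ID: copy the real one from the run's Storage tab.
url = dataset_items_url("YOUR_DATASET_ID", fmt="csv")
```

The resulting URL can be fetched with any HTTP client, for example from a scheduled job or a notebook.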
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `topicFilter` | array of strings | `[]` | Only include harnesses whose inferred type contains one of these strings (case-insensitive). Empty = no filter. |
| `minStars` | integer | `100` | Skip repos below this star count. |
| `maxResults` | integer | `30` | Stop after enriching this many harnesses. |
Example input:
    {
      "topicFilter": ["code-fixing", "tool-use"],
      "minStars": 200,
      "maxResults": 20
    }
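To make the semantics of the three input fields concrete, here is a minimal sketch of how they could be applied to a list of candidate repos. The function `apply_input`, the order of the checks, and the sample repos with their star counts are hypothetical. Only the field meanings and defaults come from the table above.

```python
def apply_input(candidates, topic_filter=(), min_stars=100, max_results=30):
    """Apply topicFilter, minStars and maxResults as described in the input table."""
    needles = [t.lower() for t in topic_filter]
    kept = []
    for c in candidates:
        if c["stars"] < min_stars:
            continue  # minStars: skip repos below the threshold
        btype = c["benchmarkType"].lower()
        # topicFilter: case-insensitive substring match; empty list = no filter
        if needles and not any(n in btype for n in needles):
            continue
        kept.append(c)
        if len(kept) >= max_results:
            break  # maxResults: stop once enough harnesses are collected
    return kept


# Hypothetical candidates, for illustration only.
candidates = [
    {"repo": "example/a", "stars": 500, "benchmarkType": "code-fixing"},
    {"repo": "example/b", "stars": 150, "benchmarkType": "code-fixing"},
    {"repo": "example/c", "stars": 900, "benchmarkType": "web-agent"},
    {"repo": "example/d", "stars": 300, "benchmarkType": "Tool-Use"},
]
selected = apply_input(candidates, ["code-fixing", "TOOL-USE"], min_stars=200, max_results=20)
```

With the example input, `example/b` falls below `minStars` and `example/c` does not match either filter string.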
Output
Each row is a single harness. You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.
Example output:
    {
      "name": "SWE-bench",
      "repo": "princeton-nlp/SWE-bench",
      "url": "https://github.com/princeton-nlp/SWE-bench",
      "stars": 8200,
      "language": "Python",
      "benchmarkType": "code-fixing",
      "scope": "Real-world GitHub issues drawn from popular Python repositories. Multi-file fixes required to pass tests.",
      "lastPublishedScore": "Claude 3.5 Sonnet: 49%; GPT-4o: 38%",
      "license": "MIT",
      "lastUpdated": "2026-04-12T18:33:00Z",
      "qualityScore": 88,
      "source": "curated_seed"
    }
A human-readable Markdown leaderboard is also written to the key-value store as eval-harness-catalog.md.
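If you want to regenerate a similar leaderboard from the dataset rows yourself, a sketch along these lines may help. The column layout and the ordering by `qualityScore` are assumptions. The actual format of `eval-harness-catalog.md` is not documented here.

```python
def render_leaderboard(rows: list[dict]) -> str:
    """Render dataset rows as a Markdown table, highest qualityScore first."""
    ranked = sorted(rows, key=lambda r: r["qualityScore"], reverse=True)
    lines = [
        "| Rank | Harness | Type | Stars | Quality |",
        "|---|---|---|---|---|",
    ]
    for rank, r in enumerate(ranked, start=1):
        lines.append(
            f"| {rank} | [{r['name']}]({r['url']}) | {r['benchmarkType']} "
            f"| {r['stars']} | {r['qualityScore']} |"
        )
    return "\n".join(lines) + "\n"


rows = [
    # Hypothetical row, for illustration only.
    {"name": "example-bench", "url": "https://github.com/example/example-bench",
     "benchmarkType": "tool-use", "stars": 300, "qualityScore": 61},
    # Row taken from the example output above.
    {"name": "SWE-bench", "url": "https://github.com/princeton-nlp/SWE-bench",
     "benchmarkType": "code-fixing", "stars": 8200, "qualityScore": 88},
]
markdown = render_leaderboard(rows)
```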
Data table
| Field | Type | Description |
|---|---|---|
| `name` | string | Repo name (`SWE-bench`) |
| `repo` | string | `owner/name` GitHub identifier |
| `url` | URL | GitHub repository page |
| `description` | string | GitHub description |
| `stars` | integer | Star count |
| `forks` | integer | Fork count |
| `openIssues` | integer | Open-issue count |
| `language` | string | Primary language (usually Python) |
| `license` | string | SPDX license ID (MIT / Apache-2.0 / etc.) |
| `benchmarkType` | string | Inferred type: `code-fixing`, `code-generation`, `web-agent`, `tool-use`, `multi-agent`, `text-to-sql`, `reasoning`, `reward-model`, `general-agent`, `lm-general`, etc. |
| `scope` | string or null | Parsed from README: what the benchmark actually tests |
| `lastPublishedScore` | string or null | Sample model scores pulled from the README |
| `lastUpdated` | ISO date | Latest commit timestamp |
| `qualityScore` | integer | 0-100 composite (stars / recency / license / docs / activity) |
| `source` | string | `curated_seed` or `search:<query>` |
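The table names the ingredients of `qualityScore` (stars, recency, license, docs, activity) but not how they are weighted. The sketch below shows one way such a 0-100 composite could be built. Every weight and threshold in it is an assumption chosen for illustration, so its results will not match the Actor's real scores.

```python
import math
from datetime import datetime, timedelta, timezone


def quality_score(stars, last_updated, license_id, has_readme, open_issues, now):
    """Hypothetical 0-100 composite; all weights below are illustrative guesses."""
    # Stars: log scale, saturating at 10,000 stars (up to 40 points).
    star_pts = 40 * min(1.0, math.log10(max(stars, 1)) / 4)
    # Recency: 30 points if updated today, falling linearly to 0 at 365 days.
    age_days = (now - last_updated).days
    recency_pts = 30 * min(1.0, max(0.0, 1 - age_days / 365))
    license_pts = 10 if license_id else 0       # any detected license
    docs_pts = 10 if has_readme else 0          # README present
    activity_pts = 10 if open_issues > 0 else 0  # crude activity proxy
    return round(star_pts + recency_pts + license_pts + docs_pts + activity_pts)


now = datetime(2026, 4, 13, tzinfo=timezone.utc)
score = quality_score(8200, now - timedelta(days=1), "MIT", True, 12, now)
```

A log scale for stars keeps very large repositories from dominating the composite, which is one common choice for popularity signals.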
Pricing / Cost estimation
This Actor is cheap to run: each run makes about 8 GitHub Search API calls plus roughly `maxResults` repository and README fetches. With a `GITHUB_TOKEN` environment variable set (5,000 requests per hour), full runs comfortably stay under one minute. Without a token, GitHub limits unauthenticated requests to 60 per hour, which is about enough for one full run.
How much does it cost to run agent benchmark discovery? Effectively a fraction of a cent.
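The request budget can be checked with a little arithmetic. The sketch below assumes one repository fetch and one README fetch per result, and that the search calls count against GitHub's separate search rate limit rather than the hourly quota. Both are assumptions about how the Actor issues requests.

```python
UNAUTH_LIMIT = 60    # requests per hour without a token
AUTH_LIMIT = 5000    # requests per hour with GITHUB_TOKEN


def estimate_requests(max_results: int, fetches_per_result: int = 2) -> int:
    """Estimated repo + README requests for one run (assumed 2 per result)."""
    return max_results * fetches_per_result


def runs_per_hour(max_results: int, authenticated: bool) -> int:
    """How many full runs fit into one hourly quota window."""
    limit = AUTH_LIMIT if authenticated else UNAUTH_LIMIT
    return limit // max(estimate_requests(max_results), 1)


default_run = estimate_requests(30)               # default maxResults
without_token = runs_per_hour(30, authenticated=False)
with_token = runs_per_hour(30, authenticated=True)
```

Under these assumptions a default run uses the whole unauthenticated quota, which is why setting a token matters for scheduled or repeated runs.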
Tips or advanced options
- Set `GITHUB_TOKEN` as an Actor secret to unlock 5,000 req/hr (vs 60 unauthenticated).
- Daily trend tracking: Schedule the Actor daily, and diff today's catalog against yesterday's to spot newly published benchmarks.
- Narrow by type: Use `topicFilter` to slice the catalog to just web-agent benchmarks, just code-fixing, etc.
- Tune `minStars`: Drop to 50 to surface emerging benchmarks; raise to 1000 to get only the canonical ones.
- Graceful failure: If GitHub search rate-limits, the curated seed list (26+ canonical repos) still produces a usable catalog.
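The daily diff from the trend-tracking tip can be as small as a set lookup on the `repo` field. The helper name `new_repos` and the second sample repo are hypothetical. The rows are assumed to be two dataset exports loaded as lists of dicts.

```python
def new_repos(yesterday: list[dict], today: list[dict]) -> list[dict]:
    """Return rows from today's catalog whose repo was absent yesterday."""
    seen = {row["repo"].lower() for row in yesterday}
    return [row for row in today if row["repo"].lower() not in seen]


yesterday = [{"repo": "princeton-nlp/SWE-bench", "stars": 8200}]
today = [
    {"repo": "princeton-nlp/SWE-bench", "stars": 8210},
    {"repo": "example-org/new-agent-bench", "stars": 120},  # hypothetical newcomer
]
fresh = new_repos(yesterday, today)
```

Comparing on `repo` rather than the whole row means changes in star counts do not show up as new benchmarks.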
FAQ, disclaimers, and support
Is this legal? Yes — GitHub's REST API is a public, documented interface and this Actor only requests publicly-listed repository metadata. No login required.
Why is benchmark X missing? Likely the repo has fewer than 100 stars or wasn't tagged with an agent-eval-related topic and wasn't in the curated seed list. Open an issue at github.com/ianymu and we'll add it to the seed list.
How accurate is benchmarkType? It's inferred from repo name + description + first 2 KB of README. Heuristic, not authoritative — but consistent enough to filter and group on. For canonical seed repos the type is hand-curated.
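A keyword heuristic of the kind described could look like the sketch below. The keyword table, its ordering, the fallback label, and the sample descriptions are all assumptions. The sketch also truncates the README to 2,048 characters rather than bytes, which is a simplification.

```python
# Hypothetical keyword table: first matching label wins.
KEYWORDS = [
    ("text-to-sql", ("text-to-sql", "text2sql", "sql")),
    ("web-agent", ("browser", "web agent", "web navigation")),
    ("code-fixing", ("github issue", "bug fix", "patch")),
    ("tool-use", ("tool use", "tool-use", "api call")),
    ("multi-agent", ("multi-agent", "multi agent")),
]


def infer_type(name, description, readme, default="general-agent"):
    """Guess a benchmark type from name + description + start of the README."""
    # Only the first ~2 KB of the README is considered (2,048 characters here).
    text = " ".join([name, description or "", (readme or "")[:2048]]).lower()
    for label, needles in KEYWORDS:
        if any(needle in text for needle in needles):
            return label
    return default


# Sample descriptions written for this example, not copied from GitHub.
coding = infer_type("SWE-bench", "Resolve real GitHub issues with patches", "")
web = infer_type("webarena", "A realistic web environment for browser agents", "")
```

Because the match is a plain substring test, short keywords such as `sql` can misfire on unrelated text, which is one reason the result should be treated as heuristic.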
How accurate is lastPublishedScore? Best-effort regex extraction from the README. Many harnesses host leaderboards on external sites (e.g. swebench.com); for those, this field will often be null and you should follow the url for live scores.
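To give a sense of what best-effort regex extraction can and cannot catch, here is a minimal sketch. The pattern and the two-score limit are assumptions, not the Actor's actual expression. It only recognizes `Model name: NN%` style text (or the same with a table pipe), so scores embedded in images or external leaderboards are missed.

```python
import re

# Hypothetical pattern: a capitalized model name, a ":" or "|" separator,
# then a percentage such as "49%" or "38.5 %".
SCORE_RE = re.compile(
    r"(?P<model>[A-Z][\w.\- ]{1,40}?)\s*[:|]\s*(?P<score>\d{1,3}(?:\.\d+)?)\s*%"
)


def extract_scores(readme: str, limit: int = 2):
    """Return up to `limit` 'Model: score%' pairs joined by '; ', or None."""
    found = [
        f"{m['model'].strip()}: {m['score']}%"
        for m in SCORE_RE.finditer(readme)
    ]
    return "; ".join(found[:limit]) or None


readme = "Results\nClaude 3.5 Sonnet: 49%\nGPT-4o: 38%\n"
summary = extract_scores(readme)
missing = extract_scores("See the external leaderboard for live scores.")
```

Returning `None` when nothing matches mirrors the nullable `lastPublishedScore` field.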
Custom version? Need this wired into your research / engineering workflow with Slack alerts, per-benchmark deep dives, or paper-citation enrichment? Open an issue and we can build a custom Actor on top.
Built by Ian Mu as part of his 100-Apify-Actor AI tooling portfolio. See the companion repo claude-verify-before-stop.