Agent Eval Harness Finder

Pricing

Pay per usage

Catalog open-source agent eval harnesses & benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena). Combines GitHub search + curated seed list, scores by quality signals (stars, recency, license), parses README scope and sample model scores.

Developer

Yanlong Mu

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 10 days ago

What does Agent Eval Harness Finder do?

Agent Eval Harness Finder catalogs open-source agent evaluation harnesses and benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena, lm-evaluation-harness, HELM, RewardBench, and dozens more) and returns a ranked, deduplicated inventory showing where each one lives, how active it is, and which model scores have been published.

Instead of digging through papers and arXiv to figure out which benchmark suits your agent framework, you run this Actor and get one structured dataset with repo, stars, benchmarkType, lastUpdated, license, scope, lastPublishedScore, and a 0-100 qualityScore. Sort and filter it, then export as CSV, JSON, or Excel.

This Actor combines a curated seed list of 26+ canonical harness repos with live GitHub search so that newly published benchmarks surface automatically without manual updates.

Part of Ian Mu's 100-Apify-Actor portfolio for the AI / Claude tooling ecosystem (github.com/ianymu). See also: claude-verify-before-stop — a Claude Code hook that enforces real verification before tasks are marked complete.

Why use Agent Eval Harness Finder?

  • AI researchers — Find the right benchmark for your agent paper without spending an afternoon on Google Scholar.
  • Agent framework builders — Decide which harnesses to wire into your CI / regression suite (SWE-Bench for coding agents? ToolBench for tool-use? WebArena for browser agents?).
  • Journalists & analysts — Get a defensible, dated snapshot of "the most active agent benchmarks today" with quality scores you can cite.
  • Procurement / due diligence — When evaluating an AI agent vendor's claims ("we score X% on Y benchmark"), this Actor tells you whether Y benchmark is still maintained, who maintains it, and what other models score.
  • Trend detection — Schedule a daily run to spot newly-published benchmarks the moment they cross the star-threshold.

How to use Agent Eval Harness Finder

  1. Open the Actor in Apify Console and click Try Actor.
  2. (Optional) Filter by benchmark type (code-fixing, web-agent, tool-use, multi-agent, etc.). Leave empty for everything.
  3. (Optional) Adjust minStars (default 100) and maxResults (default 30).
  4. Click Start. A run takes roughly a minute.
  5. Open the Output tab and download as CSV, JSON, or Excel, or hit the Dataset API endpoint to integrate downstream.
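
The same run can also be started from code rather than the Console. The following is a minimal sketch, assuming the official `apify-client` Python package is installed. `ACTOR_ID` is a placeholder (the real identifier is not given in this README and has to be copied from the Store page), and the `APIFY_TOKEN` variable name is likewise an assumption.

```python
import os
from typing import Any

# Placeholder, NOT the real identifier: copy it from the Actor's Store page.
ACTOR_ID = "<username>/<actor-name>"


def run_finder(client: Any, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset rows."""
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


def main() -> None:
    # Imported lazily so run_finder() can be exercised without the package.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed variable name
    rows = run_finder(client, {"minStars": 100, "maxResults": 30})
    print(f"{len(rows)} harnesses returned")
```

Calling `main()` with a valid token set would run the Actor with the documented default values and print the number of rows in the resulting dataset.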

Input

| Field | Type | Default | Description |
|---|---|---|---|
| topicFilter | array of strings | [] | Only include harnesses whose inferred type contains one of these strings (case-insensitive). Empty = no filter. |
| minStars | integer | 100 | Skip repos below this star count. |
| maxResults | integer | 30 | Stop after enriching this many harnesses. |

Example input:

{
  "topicFilter": ["code-fixing", "tool-use"],
  "minStars": 200,
  "maxResults": 20
}
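
To make the semantics of the three input fields concrete, here is a rough sketch of how they could be applied to candidate rows. It follows the field descriptions above (case-insensitive substring match, star floor, result cap). The function names and the order in which the filters run are assumptions, not the Actor's actual code.

```python
def passes_input_filters(row: dict, topic_filter: list[str], min_stars: int) -> bool:
    """Apply minStars and topicFilter to a single candidate row."""
    if row["stars"] < min_stars:
        return False  # "Skip repos below this star count."
    if not topic_filter:
        return True  # empty filter means no filtering by type
    benchmark_type = row["benchmarkType"].lower()
    # Case-insensitive "contains" match against the inferred type.
    return any(topic.lower() in benchmark_type for topic in topic_filter)


def select(rows, topic_filter=(), min_stars=100, max_results=30):
    """Filter rows, then cap the result at max_results (assumed ordering)."""
    kept = [r for r in rows if passes_input_filters(r, list(topic_filter), min_stars)]
    return kept[:max_results]
```

With the example input above, a row typed `code-fixing` with 8,200 stars would be kept, while a `tool-use` row with 50 stars would be dropped by the star floor.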

Output

Each row is a single harness. You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.

Example output:

{
  "name": "SWE-bench",
  "repo": "princeton-nlp/SWE-bench",
  "url": "https://github.com/princeton-nlp/SWE-bench",
  "stars": 8200,
  "language": "Python",
  "benchmarkType": "code-fixing",
  "scope": "Real-world GitHub issues drawn from popular Python repositories. Multi-file fixes required to pass tests.",
  "lastPublishedScore": "Claude 3.5 Sonnet: 49%; GPT-4o: 38%",
  "license": "MIT",
  "lastUpdated": "2026-04-12T18:33:00Z",
  "qualityScore": 88,
  "source": "curated_seed"
}

A human-readable Markdown leaderboard is also written to the key-value store as eval-harness-catalog.md.
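
Once the dataset is downloaded as JSON, ranking and re-exporting it takes only the standard library. This is a small sketch, assuming the export is a JSON array of rows shaped like the example output. The helper names and the chosen CSV columns are illustrative.

```python
import csv
import io
import json


def top_harnesses(dataset_json: str, min_quality: int = 0) -> list[dict]:
    """Parse a JSON export and return rows sorted by qualityScore, best first."""
    rows = json.loads(dataset_json)
    kept = [r for r in rows if r.get("qualityScore", 0) >= min_quality]
    return sorted(kept, key=lambda r: r.get("qualityScore", 0), reverse=True)


def to_csv(rows, columns=("repo", "benchmarkType", "stars", "qualityScore")) -> str:
    """Serialize selected columns to CSV text, ignoring any other fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_csv(top_harnesses(text, min_quality=70))` would yield a short CSV of only the higher-scoring harnesses.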

Data table

| Field | Type | Description |
|---|---|---|
| name | string | Repo name (e.g. SWE-bench) |
| repo | string | owner/name GitHub identifier |
| url | URL | GitHub repository page |
| description | string | GitHub description |
| stars | integer | Star count |
| forks | integer | Fork count |
| openIssues | integer | Open-issue count |
| language | string | Primary language (usually Python) |
| license | string | SPDX license ID (MIT / Apache-2.0 / etc.) |
| benchmarkType | string | Inferred type: code-fixing, code-generation, web-agent, tool-use, multi-agent, text-to-sql, reasoning, reward-model, general-agent, lm-general, etc. |
| scope | string or null | Parsed from README: what the benchmark actually tests |
| lastPublishedScore | string or null | Sample model scores pulled from the README |
| lastUpdated | ISO date | Latest commit timestamp |
| qualityScore | integer | 0-100 composite (stars / recency / license / docs / activity) |
| source | string | curated_seed or search:<query> |

Pricing / Cost estimation

This Actor is cheap to run: each run makes about 8 GitHub Search API calls plus repo and README fetches for roughly maxResults repositories. With a GITHUB_TOKEN environment variable set (5,000 requests/hour quota), full runs comfortably stay under one minute. Without a token, GitHub limits unauthenticated requests to 60 per hour, which is enough for about one full run per hour.
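
The quota arithmetic can be checked with a few lines. This sketch assumes one metadata fetch and one README fetch per repository (an assumption, since the exact number is not stated here). It counts only those fetches against the hourly quota, because GitHub meters search requests under a separate limit.

```python
FETCHES_PER_REPO = 2           # assumption: one repo-metadata call + one README call
CORE_QUOTA_NO_TOKEN = 60       # requests/hour without GITHUB_TOKEN
CORE_QUOTA_WITH_TOKEN = 5000   # requests/hour with GITHUB_TOKEN


def core_requests(max_results: int) -> int:
    """Estimated non-search GitHub requests for one run."""
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    return FETCHES_PER_REPO * max_results


def runs_per_hour(max_results: int, has_token: bool) -> int:
    """How many full runs fit into one hour of quota under these assumptions."""
    quota = CORE_QUOTA_WITH_TOKEN if has_token else CORE_QUOTA_NO_TOKEN
    return quota // core_requests(max_results)
```

Under these assumptions the default `maxResults` of 30 uses the whole unauthenticated quota in a single run, so raising `maxResults` without a token would likely hit the limit.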

How much does it cost to run agent benchmark discovery? Effectively a fraction of a cent.

Tips or advanced options

  • Set GITHUB_TOKEN as an Actor secret to unlock 5,000 req/hr (vs 60 unauthenticated).
  • Daily trend tracking — Schedule the Actor daily, and diff today's catalog against yesterday's to spot newly published benchmarks.
  • Narrow by type — Use topicFilter to slice the catalog to just web-agent benchmarks, just code-fixing, etc.
  • Tune minStars — Drop to 50 to surface emerging benchmarks; raise to 1000 to get only the canonical ones.
  • Graceful failure — If GitHub search rate-limits, the curated seed list (26+ canonical repos) still produces a usable catalog.
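
The daily-diff tip can be sketched as follows, assuming each day's catalog has been saved as a list of rows with the `repo` and `stars` fields from the output schema. The function names are illustrative.

```python
def new_repos(yesterday: list[dict], today: list[dict]) -> list[dict]:
    """Rows present in today's catalog but absent from yesterday's."""
    seen = {row["repo"].lower() for row in yesterday}
    return [row for row in today if row["repo"].lower() not in seen]


def star_changes(yesterday: list[dict], today: list[dict]) -> dict[str, int]:
    """Star delta for every repo that appears in both catalogs."""
    before = {row["repo"].lower(): row["stars"] for row in yesterday}
    return {
        row["repo"]: row["stars"] - before[row["repo"].lower()]
        for row in today
        if row["repo"].lower() in before
    }
```

Repos are compared case-insensitively because GitHub treats `owner/name` that way. Any repo returned by `new_repos` either crossed the `minStars` threshold or was newly discovered since the previous run.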

FAQ, disclaimers, and support

Is this legal? Yes — GitHub's REST API is a public, documented interface and this Actor only requests publicly-listed repository metadata. No login required.

Why is benchmark X missing? Most likely the repo has fewer stars than minStars (default 100), or it is neither tagged with an agent-eval-related topic nor included in the curated seed list. Open an issue at github.com/ianymu and we'll add it to the seed list.

How accurate is benchmarkType? It's inferred from repo name + description + first 2 KB of README. Heuristic, not authoritative — but consistent enough to filter and group on. For canonical seed repos the type is hand-curated.
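
To illustrate what a heuristic of this kind looks like, here is a rough keyword-matching sketch over the repo name, description, and first 2 KB of README. The keyword table and the fallback type are hypothetical and are not the Actor's actual rules.

```python
# Hypothetical keyword table, for illustration only.
KEYWORDS = {
    "code-fixing": ("swe-bench", "github issue", "bug fix"),
    "web-agent": ("webarena", "browser", "web agent"),
    "tool-use": ("toolbench", "tool use", "tool-use", "api call"),
    "text-to-sql": ("text-to-sql", "text2sql", "sql"),
}


def infer_benchmark_type(name: str, description: str, readme: str,
                         default: str = "general-agent") -> str:
    """Return the first type whose keywords appear in the combined text."""
    # Keep only the first 2 KB of the README, measured in UTF-8 bytes.
    readme_head = readme.encode("utf-8")[:2048].decode("utf-8", errors="ignore")
    text = " ".join([name, description or "", readme_head]).lower()
    for benchmark_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return benchmark_type
    return default
```

A first-match rule like this is order-sensitive and easy to fool, so hand-curated types are a reasonable choice for the seed repos.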

How accurate is lastPublishedScore? Best-effort regex extraction from the README. Many harnesses host leaderboards on external sites (e.g. swebench.com); for those, this field will often be null and you should follow the url for live scores.
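
Downstream code usually wants numbers rather than a display string. This is a minimal sketch for parsing the field, assuming it follows the `Model: NN%; Model: NN%` shape shown in the example output. Anything that does not fit is skipped rather than guessed at.

```python
from typing import Optional


def parse_scores(last_published_score: Optional[str]) -> dict[str, float]:
    """Turn 'Model A: 49%; Model B: 38%' into {'Model A': 49.0, 'Model B': 38.0}."""
    if not last_published_score:
        return {}  # the field is null for many harnesses
    scores: dict[str, float] = {}
    for entry in last_published_score.split(";"):
        # Split on the LAST colon so model names containing ':' survive.
        model, sep, value = entry.rpartition(":")
        if not sep or not model.strip():
            continue
        try:
            scores[model.strip()] = float(value.strip().rstrip("%"))
        except ValueError:
            continue  # not a number: leave it out
    return scores
```

Because the field is a best-effort extraction, the parsed values are better treated as a hint than as authoritative leaderboard data.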

Custom version? Need this wired into your research / engineering workflow with Slack alerts, per-benchmark deep dives, or paper-citation enrichment? Open an issue and we can build a custom Actor on top.

Built by Ian Mu as part of his 100-Apify-Actor AI tooling portfolio. See the companion repo claude-verify-before-stop.