Pricing

Pay per usage

Agent Eval Harness Finder

Catalog open-source agent eval harnesses & benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena). Combines GitHub search + curated seed list, scores by quality signals (stars, recency, license), parses README scope and sample model scores.

Developer

Yanlong Mu

Last modified: 10 days ago

What does Agent Eval Harness Finder do?

Agent Eval Harness Finder catalogs open-source agent evaluation harnesses and benchmarks — SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena, lm-evaluation-harness, HELM, RewardBench, and dozens more — and returns a ranked, deduplicated, production-ready inventory of where to find each one, how active it is, and what model scores have been published.

Instead of digging through papers and arXiv to figure out which benchmark to use for your agent framework, you run this Actor and get one structured dataset with repo, stars, benchmarkType, lastUpdated, license, scope, lastPublishedScore, and a 0-100 qualityScore. You can sort and filter it, then export it as CSV, JSON, or Excel.

This Actor combines a curated seed list of 26+ canonical harness repos with live GitHub search so that newly published benchmarks surface automatically without manual updates.

Part of Ian Mu's 100-Apify-Actor portfolio for the AI / Claude tooling ecosystem (github.com/ianymu). See also: claude-verify-before-stop — a Claude Code hook that enforces real verification before tasks are marked complete.

Why use Agent Eval Harness Finder?

AI researchers — Find the right benchmark for your agent paper without spending an afternoon on Google Scholar.
Agent framework builders — Decide which harnesses to wire into your CI / regression suite (SWE-Bench for coding agents? ToolBench for tool-use? WebArena for browser agents?).
Journalists & analysts — Get a defensible, dated snapshot of "the most active agent benchmarks today" with quality scores you can cite.
Procurement / due diligence — When evaluating an AI agent vendor's claims ("we score X% on Y benchmark"), this Actor tells you whether Y benchmark is still maintained, who maintains it, and what other models score.
Trend detection — Schedule a daily run to spot newly-published benchmarks the moment they cross the star-threshold.

How to use Agent Eval Harness Finder

1. Open the Actor in Apify Console and click Try Actor.
2. (Optional) Filter by benchmark type (code-fixing, web-agent, tool-use, multi-agent, etc.). Leave the filter empty to include everything.
3. (Optional) Adjust minStars (default 100) and maxResults (default 30).
4. Click Start. A run takes roughly a minute.
5. Open the Output tab and download the results as CSV, JSON, or Excel, or call the Dataset API endpoint to integrate downstream.
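For downstream integration, the download step can be scripted. The sketch below only builds the URL for Apify's dataset items endpoint. The dataset ID is a placeholder you would copy from your own run, and the helper name `dataset_items_url` is hypothetical.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json", clean: bool = True) -> str:
    """Build the download URL for a dataset's items in the requested format."""
    params = {"format": fmt}  # e.g. "json", "csv", "xlsx"
    if clean:
        # Ask the API to omit empty items and hidden fields.
        params["clean"] = "true"
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


# "YOUR_DATASET_ID" is a placeholder, not a real dataset.
print(dataset_items_url("YOUR_DATASET_ID", fmt="csv"))
```

Private datasets additionally need an API token, which is deliberately left out of this sketch.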

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `topicFilter` | array of strings | `[]` | Only include harnesses whose inferred type contains one of these strings (case-insensitive). Empty = no filter. |
| `minStars` | integer | `100` | Skip repos below this star count. |
| `maxResults` | integer | `30` | Stop after enriching this many harnesses. |

Example input:

{
    "topicFilter": ["code-fixing", "tool-use"],
    "minStars": 200,
    "maxResults": 20
}
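To make the semantics of the three input fields concrete, here is a minimal sketch of how they could be applied to candidate rows. It is an illustration based on the field descriptions above, not the Actor's actual code. The function name and the sample rows are hypothetical.

```python
def apply_input_filters(harnesses, topic_filter=(), min_stars=100, max_results=30):
    """Apply topicFilter, minStars and maxResults to a list of harness dicts."""
    wanted = [t.lower() for t in topic_filter]
    kept = []
    for h in harnesses:
        if h["stars"] < min_stars:
            continue  # minStars: skip repos below the star threshold
        btype = (h.get("benchmarkType") or "").lower()
        # topicFilter: case-insensitive substring match; empty list keeps everything
        if wanted and not any(t in btype for t in wanted):
            continue
        kept.append(h)
        if len(kept) >= max_results:
            break  # maxResults: stop once enough harnesses are collected
    return kept


# Hypothetical rows, for illustration only.
sample = [
    {"repo": "example/coder-bench", "stars": 950, "benchmarkType": "code-fixing"},
    {"repo": "example/browser-bench", "stars": 400, "benchmarkType": "web-agent"},
    {"repo": "example/tiny-tools", "stars": 150, "benchmarkType": "tool-use"},
    {"repo": "example/api-tools", "stars": 700, "benchmarkType": "Tool-Use"},
]
selected = apply_input_filters(sample, ["code-fixing", "tool-use"], min_stars=200, max_results=20)
print([h["repo"] for h in selected])
```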

Output

Each row is a single harness. You can download the dataset in various formats such as JSON, HTML, CSV, or Excel.

Example output:

{
    "name": "SWE-bench",
    "repo": "princeton-nlp/SWE-bench",
    "url": "https://github.com/princeton-nlp/SWE-bench",
    "stars": 8200,
    "language": "Python",
    "benchmarkType": "code-fixing",
    "scope": "Real-world GitHub issues drawn from popular Python repositories. Multi-file fixes required to pass tests.",
    "lastPublishedScore": "Claude 3.5 Sonnet: 49%; GPT-4o: 38%",
    "license": "MIT",
    "lastUpdated": "2026-04-12T18:33:00Z",
    "qualityScore": 88,
    "source": "curated_seed"
}

A human-readable Markdown leaderboard is also written to the key-value store as eval-harness-catalog.md.

Data table

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Repo name (`SWE-bench`) |
| `repo` | string | `owner/name` GitHub identifier |
| `url` | URL | GitHub repository page |
| `description` | string | GitHub description |
| `stars` | integer | Star count |
| `forks` | integer | Fork count |
| `openIssues` | integer | Open-issue count |
| `language` | string | Primary language (usually Python) |
| `license` | string | SPDX license ID (MIT / Apache-2.0 / etc.) |
| `benchmarkType` | string | Inferred type: `code-fixing`, `code-generation`, `web-agent`, `tool-use`, `multi-agent`, `text-to-sql`, `reasoning`, `reward-model`, `general-agent`, `lm-general`, etc. |
| `scope` | string \| null | Parsed from README: what the benchmark actually tests |
| `lastPublishedScore` | string \| null | Sample model scores pulled from the README |
| `lastUpdated` | ISO date | Latest commit timestamp |
| `qualityScore` | integer | 0-100 composite (stars / recency / license / docs / activity) |
| `source` | string | `curated_seed` or `search:<query>` |
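Once the rows are exported, a few lines of Python are enough to shortlist and re-export them. The sketch below assumes rows shaped like the table above. The helper names, the set of licenses, and the sample rows are illustrative choices, not part of the Actor.

```python
import csv
import io

# Illustrative choice of licenses; adjust to your own policy.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def shortlist(rows, licenses=PERMISSIVE, min_quality=0):
    """Keep rows with an accepted license, sorted by qualityScore (highest first)."""
    picked = [
        r for r in rows
        if r.get("license") in licenses and r.get("qualityScore", 0) >= min_quality
    ]
    return sorted(picked, key=lambda r: r.get("qualityScore", 0), reverse=True)


def to_csv(rows, fields=("repo", "benchmarkType", "license", "qualityScore")):
    """Serialize selected columns to CSV text, ignoring any other fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows, for illustration only.
rows = [
    {"repo": "example/a", "benchmarkType": "tool-use", "license": "MIT", "qualityScore": 71, "stars": 300},
    {"repo": "example/b", "benchmarkType": "web-agent", "license": "GPL-3.0", "qualityScore": 90, "stars": 900},
    {"repo": "example/c", "benchmarkType": "code-fixing", "license": "Apache-2.0", "qualityScore": 88, "stars": 800},
]
print(to_csv(shortlist(rows, min_quality=50)))
```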

Pricing / Cost estimation

This Actor is cheap to run: each run makes about 8 GitHub Search API calls plus roughly one repository fetch and one README fetch per result (up to maxResults). With a GITHUB_TOKEN environment variable set (5,000 requests/hr quota), full runs comfortably stay under one minute. Without a token, GitHub limits unauthenticated requests to 60/hr, which is enough for about one full run at the default settings.
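The request budget can be checked with simple arithmetic. The sketch below assumes one repository request plus one README request per harness, and it counts only those against the hourly quota because GitHub meters search requests under a separate limit. Both assumptions, and the function names, are illustrative rather than taken from the Actor's source.

```python
SEARCH_CALLS = 8             # approximate search calls per run
LIMIT_WITHOUT_TOKEN = 60     # unauthenticated requests per hour
LIMIT_WITH_TOKEN = 5000      # authenticated requests per hour


def requests_per_run(max_results: int) -> dict:
    """Estimate requests per run: assumed 1 repo + 1 README fetch per harness."""
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    fetches = 2 * max_results
    return {"search": SEARCH_CALLS, "fetches": fetches, "total": SEARCH_CALLS + fetches}


def full_runs_per_hour(max_results: int, hourly_limit: int) -> int:
    """How many complete runs fit in the hourly quota (search calls not counted)."""
    return hourly_limit // requests_per_run(max_results)["fetches"]


print(requests_per_run(30))
print(full_runs_per_hour(30, LIMIT_WITHOUT_TOKEN), full_runs_per_hour(30, LIMIT_WITH_TOKEN))
```

Under these assumptions, raising maxResults above the default without a token would exceed the unauthenticated quota within a single run.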

How much does it cost to run agent benchmark discovery? Effectively a fraction of a cent.

Tips or advanced options

Set GITHUB_TOKEN as an Actor secret to unlock 5,000 req/hr (vs 60 unauthenticated).
Daily trend tracking — Schedule the Actor daily, and diff today's catalog against yesterday's to spot newly published benchmarks.
Narrow by type — Use topicFilter to slice the catalog to just web-agent benchmarks, just code-fixing, etc.
Tune minStars — Drop to 50 to surface emerging benchmarks; raise to 1000 to get only the canonical ones.
Graceful failure — If GitHub search rate-limits, the curated seed list (26+ canonical repos) still produces a usable catalog.
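The daily trend tracking tip boils down to a set difference on the `repo` field. A minimal sketch, assuming you have already loaded yesterday's and today's datasets as lists of dicts (the function name and sample rows are hypothetical):

```python
def new_harnesses(yesterday, today):
    """Return today's rows whose repo was not present in yesterday's catalog."""
    # GitHub owner/name identifiers are case-insensitive, so compare in lowercase.
    seen = {row["repo"].lower() for row in yesterday}
    return [row for row in today if row["repo"].lower() not in seen]


# Hypothetical snapshots, for illustration only.
yesterday = [{"repo": "example/a", "stars": 300}, {"repo": "example/b", "stars": 900}]
today = [
    {"repo": "Example/A", "stars": 310},
    {"repo": "example/b", "stars": 905},
    {"repo": "example/new-bench", "stars": 120},
]
print([row["repo"] for row in new_harnesses(yesterday, today)])
```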

FAQ, disclaimers, and support

Is this legal? Yes — GitHub's REST API is a public, documented interface and this Actor only requests publicly-listed repository metadata. No login required.

Why is benchmark X missing? Most likely the repo has fewer stars than the minStars threshold (default 100), or it is neither tagged with an agent-eval-related topic nor included in the curated seed list. Open an issue at github.com/ianymu and we'll add it to the seed list.

How accurate is benchmarkType? It's inferred from repo name + description + first 2 KB of README. Heuristic, not authoritative — but consistent enough to filter and group on. For canonical seed repos the type is hand-curated.
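A heuristic of this kind can be sketched as keyword matching over the repo name, description, and the start of the README, with a hand-curated override for seed repos. The keyword table below is invented for illustration and is not the Actor's real rule set. Only the SWE-bench entry comes from the example output above.

```python
# Hand-curated types for seed repos (single entry taken from the example output).
CURATED = {"princeton-nlp/swe-bench": "code-fixing"}

# Hypothetical keyword rules; the first matching rule wins.
KEYWORDS = [
    ("text-to-sql", ("text-to-sql", "text2sql")),
    ("web-agent", ("browser", "web agent", "web navigation")),
    ("tool-use", ("tool use", "tool-use", "function calling")),
    ("multi-agent", ("multi-agent", "multi agent")),
    ("code-fixing", ("github issue", "bug fix")),
    ("code-generation", ("code generation", "program synthesis")),
    ("reward-model", ("reward model",)),
]


def infer_benchmark_type(repo, description="", readme=""):
    """Guess a benchmark type from repo name, description and README head."""
    if repo.lower() in CURATED:
        return CURATED[repo.lower()]
    name = repo.split("/")[-1]
    # The first 2048 characters approximate "first 2 KB of README".
    text = " ".join([name, description or "", (readme or "")[:2048]]).lower()
    for label, needles in KEYWORDS:
        if any(needle in text for needle in needles):
            return label
    return "general-agent"  # fallback when nothing matches


print(infer_benchmark_type("example/sql-bench", "Large-scale text-to-SQL evaluation"))
```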

How accurate is lastPublishedScore? Best-effort regex extraction from the README. Many harnesses host leaderboards on external sites (e.g. swebench.com); for those, this field will often be null and you should follow the url for live scores.
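To give a feel for what best-effort regex extraction means here, the sketch below pulls "Model: NN%" pairs out of README text and joins them in the same style as the example output. The pattern is an assumption made for illustration. The Actor's real expressions are not documented on this page.

```python
import re

# Hypothetical pattern: a model-like name, a colon, then a percentage.
SCORE_RE = re.compile(r"([A-Za-z][\w.\- ]*?)\s*:\s*(\d+(?:\.\d+)?)\s*%")


def extract_scores(readme, limit=3):
    """Return up to `limit` 'Model: NN%' pairs joined by '; ', or None if none found."""
    pairs = SCORE_RE.findall(readme or "")[:limit]
    if not pairs:
        return None
    return "; ".join(f"{model.strip()}: {score}%" for model, score in pairs)


readme_text = "Results\nClaude 3.5 Sonnet: 49%\nGPT-4o: 38%\n"
print(extract_scores(readme_text))
print(extract_scores("See our website for the live leaderboard."))
```

A pattern like this also matches unrelated lines such as "coverage: 90%", which is one reason the field is described as best-effort.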

Custom version? Need this wired into your research / engineering workflow with Slack alerts, per-benchmark deep dives, or paper-citation enrichment? Open an issue and we can build a custom Actor on top.

Built by Ian Mu as part of his 100-Apify-Actor AI tooling portfolio. See the companion repo claude-verify-before-stop.
