# Agent Eval Harness Finder (`ianymu/agent-eval-harness-finder`) Actor

Catalog open-source agent eval harnesses & benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena). Combines GitHub search + curated seed list, scores by quality signals (stars, recency, license), parses README scope and sample model scores.

- **URL**: https://apify.com/ianymu/agent-eval-harness-finder.md
- **Developed by:** [Yanlong Mu](https://apify.com/ianymu) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### What does Agent Eval Harness Finder do?

**Agent Eval Harness Finder** catalogs **open-source agent evaluation harnesses and benchmarks** — SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena, lm-evaluation-harness, HELM, RewardBench, and dozens more — and returns a ranked, deduplicated, **production-ready inventory** of where to find each one, how active it is, and what model scores have been published.

Instead of digging through papers and arXiv to figure out which benchmark to use for your agent framework, you run this Actor and get **one structured dataset** with `repo`, `stars`, `benchmarkType`, `lastUpdated`, `license`, `scope`, `lastPublishedScore`, and a 0-100 `qualityScore`. Sort, filter, and export it as CSV, JSON, or Excel.

This Actor combines a **curated seed list of 26+ canonical harness repos** with live **GitHub search** so that newly published benchmarks surface automatically without manual updates.

Part of Ian Mu's 100-Apify-Actor portfolio for the AI / Claude tooling ecosystem ([github.com/ianymu](https://github.com/ianymu)). See also: [`claude-verify-before-stop`](https://github.com/ianymu/claude-verify-before-stop) — a Claude Code hook that enforces real verification before tasks are marked complete.

### Why use Agent Eval Harness Finder?

- **AI researchers** — Find the right benchmark for your agent paper without spending an afternoon on Google Scholar.
- **Agent framework builders** — Decide which harnesses to wire into your CI / regression suite (SWE-Bench for coding agents? ToolBench for tool-use? WebArena for browser agents?).
- **Journalists & analysts** — Get a defensible, dated snapshot of "the most active agent benchmarks today" with quality scores you can cite.
- **Procurement / due diligence** — When evaluating an AI agent vendor's claims ("we score X% on Y benchmark"), this Actor tells you whether Y benchmark is still maintained, who maintains it, and what other models score.
- **Trend detection** — Schedule a daily run to spot newly-published benchmarks the moment they cross the star-threshold.

### How to use Agent Eval Harness Finder

1. Open the Actor in Apify Console and click **Try Actor**.
2. (Optional) Filter by benchmark type (`code-fixing`, `web-agent`, `tool-use`, `multi-agent`, etc.). Leave empty for everything.
3. (Optional) Adjust `minStars` (default 100) and `maxResults` (default 30).
4. Click **Start**. A run takes roughly a minute.
5. Open the **Output** tab and **download as CSV, JSON, or Excel**, or hit the Dataset API endpoint to integrate downstream.

### Input

| Field | Type | Default | Description |
|---|---|---|---|
| `topicFilter` | array of strings | `[]` | Only include harnesses whose inferred type contains one of these strings (case-insensitive). Empty = no filter. |
| `minStars` | integer | `100` | Skip repos below this star count. |
| `maxResults` | integer | `30` | Stop after enriching this many harnesses. |

Example input:

```json
{
    "topicFilter": ["code-fixing", "tool-use"],
    "minStars": 200,
    "maxResults": 20
}
````
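The input can also be assembled and checked in code before starting a run. The sketch below is one way to do that, assuming the numeric bounds listed in the OpenAPI input schema later in this document (`minStars` 0 to 100000, `maxResults` 5 to 200). The helper name `build_input` is hypothetical and not part of any Apify library.

```python
from typing import Iterable, Optional

# Bounds copied from the OpenAPI input schema in this document.
MIN_STARS_RANGE = (0, 100_000)
MAX_RESULTS_RANGE = (5, 200)


def build_input(
    topic_filter: Optional[Iterable[str]] = None,
    min_stars: int = 100,
    max_results: int = 30,
) -> dict:
    """Return an Actor input dict; raise ValueError for out-of-range values."""
    # Drop empty or whitespace-only filter entries.
    topics = [t.strip() for t in (topic_filter or []) if t.strip()]
    lo, hi = MIN_STARS_RANGE
    if not lo <= min_stars <= hi:
        raise ValueError(f"minStars must be between {lo} and {hi}, got {min_stars}")
    lo, hi = MAX_RESULTS_RANGE
    if not lo <= max_results <= hi:
        raise ValueError(f"maxResults must be between {lo} and {hi}, got {max_results}")
    return {"topicFilter": topics, "minStars": min_stars, "maxResults": max_results}


# Reproduces the example input shown above.
example = build_input(["code-fixing", "tool-use"], min_stars=200, max_results=20)
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API section.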

### Output

Each row is a single harness. You can **download the dataset in various formats such as JSON, HTML, CSV, or Excel.**

Example output:

```json
{
    "name": "SWE-bench",
    "repo": "princeton-nlp/SWE-bench",
    "url": "https://github.com/princeton-nlp/SWE-bench",
    "stars": 8200,
    "language": "Python",
    "benchmarkType": "code-fixing",
    "scope": "Real-world GitHub issues drawn from popular Python repositories. Multi-file fixes required to pass tests.",
    "lastPublishedScore": "Claude 3.5 Sonnet: 49%; GPT-4o: 38%",
    "license": "MIT",
    "lastUpdated": "2026-04-12T18:33:00Z",
    "qualityScore": 88,
    "source": "curated_seed"
}
```

A human-readable Markdown leaderboard is also written to the key-value store as `eval-harness-catalog.md`.
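To read that Markdown file programmatically, one option is the key-value store client from the Python `apify-client` package. This is a minimal sketch that assumes the file is written to the run's default key-value store (the text above does not say which store is used). The function name `fetch_catalog_markdown` is hypothetical.

```python
from typing import Any, Optional


def fetch_catalog_markdown(
    client: Any, run: dict, key: str = "eval-harness-catalog.md"
) -> Optional[str]:
    """Return the Markdown leaderboard as text, or None if the record is missing.

    `client` is expected to be an apify_client.ApifyClient and `run` the dict
    returned by client.actor(...).call(...).
    """
    # Assumption: the record lives in the run's default key-value store.
    store_id = run["defaultKeyValueStoreId"]
    record = client.key_value_store(store_id).get_record(key)
    if record is None:
        return None
    value = record["value"]
    # The record value may arrive as bytes or text depending on its content type.
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8")
    return str(value)
```

Call it with the `client` and `run` objects from the Python example in the API section, for instance `fetch_catalog_markdown(client, run)`.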

### Data table

| Field | Type | Description |
|---|---|---|
| `name` | string | Repo name (`SWE-bench`) |
| `repo` | string | `owner/name` GitHub identifier |
| `url` | URL | GitHub repository page |
| `description` | string | GitHub description |
| `stars` | integer | Star count |
| `forks` | integer | Fork count |
| `openIssues` | integer | Open-issue count |
| `language` | string | Primary language (usually Python) |
| `license` | string | SPDX license ID (MIT / Apache-2.0 / etc.) |
| `benchmarkType` | string | Inferred type: `code-fixing`, `code-generation`, `web-agent`, `tool-use`, `multi-agent`, `text-to-sql`, `reasoning`, `reward-model`, `general-agent`, `lm-general`, etc. |
| `scope` | string or null | Parsed from the README: what the benchmark actually tests |
| `lastPublishedScore` | string or null | Sample model scores pulled from the README |
| `lastUpdated` | ISO date | Latest commit timestamp |
| `qualityScore` | integer | 0-100 composite (stars / recency / license / docs / activity) |
| `source` | string | `curated_seed` or `search:<query>` |
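Once the dataset items are loaded, the fields above are enough to rank and export a shortlist locally. The following sketch sorts by `qualityScore` (ties broken by `stars`) and writes a CSV with the standard library. The helper name `to_csv` and the second sample row are hypothetical; the first row reuses values from the example output above.

```python
import csv
import io
from typing import Iterable, Sequence

DEFAULT_COLUMNS = ("repo", "benchmarkType", "stars", "license", "qualityScore", "lastUpdated")


def to_csv(
    items: Iterable[dict],
    columns: Sequence[str] = DEFAULT_COLUMNS,
    min_quality: int = 0,
) -> str:
    """Return a CSV string of items at or above min_quality, best first."""
    # Treat a missing or null qualityScore as 0 so such rows sort last.
    rows = [i for i in items if (i.get("qualityScore") or 0) >= min_quality]
    rows.sort(
        key=lambda i: (i.get("qualityScore") or 0, i.get("stars") or 0),
        reverse=True,
    )
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keep only the requested columns; absent fields become empty cells.
        writer.writerow({c: row.get(c, "") for c in columns})
    return buffer.getvalue()


sample_items = [
    # Hypothetical row, for illustration only.
    {"repo": "example-org/example-bench", "benchmarkType": "tool-use", "stars": 150,
     "license": None, "qualityScore": 61, "lastUpdated": "2026-01-01T00:00:00Z"},
    # Values taken from the example output in this document.
    {"repo": "princeton-nlp/SWE-bench", "benchmarkType": "code-fixing", "stars": 8200,
     "license": "MIT", "qualityScore": 88, "lastUpdated": "2026-04-12T18:33:00Z"},
]
csv_text = to_csv(sample_items, min_quality=50)
```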

### Pricing / Cost estimation

This Actor is cheap to run: about 8 GitHub Search API calls plus roughly `maxResults` repo and README fetches per run. With a `GITHUB_TOKEN` environment variable set (quota of 5,000 requests per hour), full runs comfortably stay under one minute. Without a token, GitHub limits unauthenticated requests to 60 per hour, which is enough for one full run.

How much does it cost to run agent benchmark discovery? Effectively a fraction of a cent.

### Tips or advanced options

- **Set `GITHUB_TOKEN`** as an Actor secret to unlock 5,000 req/hr (vs 60 unauthenticated).
- **Daily trend tracking** — Schedule the Actor daily, and diff today's catalog against yesterday's to spot newly published benchmarks.
- **Narrow by type** — Use `topicFilter` to slice the catalog to just web-agent benchmarks, just code-fixing, etc.
- **Tune `minStars`** — Drop to 50 to surface emerging benchmarks; raise to 1000 to get only the canonical ones.
- **Graceful failure** — If GitHub search rate-limits, the curated seed list (26+ canonical repos) still produces a usable catalog.
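The daily trend tracking tip can be sketched as a simple set difference on the `repo` field. This assumes you have saved the dataset items from two runs as lists of dicts; the function name `new_harnesses` and the sample data are hypothetical.

```python
from typing import Iterable, List


def new_harnesses(yesterday: Iterable[dict], today: Iterable[dict]) -> List[dict]:
    """Return items in today's catalog whose repo was absent yesterday."""
    # GitHub owner/name identifiers are case-insensitive, so compare lowercased.
    seen = {item["repo"].lower() for item in yesterday}
    return [item for item in today if item["repo"].lower() not in seen]


# Hypothetical snapshots from two consecutive runs.
snapshot_yesterday = [{"repo": "princeton-nlp/SWE-bench", "stars": 8200}]
snapshot_today = [
    {"repo": "princeton-nlp/SWE-bench", "stars": 8215},
    {"repo": "example-org/new-agent-bench", "stars": 120},
]
added = new_harnesses(snapshot_yesterday, snapshot_today)
```

Keying on `repo` rather than `name` avoids collisions between forks or unrelated projects that share a repository name.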

### FAQ, disclaimers, and support

**Is this legal?** Yes — GitHub's REST API is a public, documented interface and this Actor only requests publicly-listed repository metadata. No login required.

**Why is benchmark X missing?** Most likely it was not in the curated seed list and the repo either falls below the `minStars` threshold (100 by default) or is not tagged with an agent-eval-related topic. Open an issue at [github.com/ianymu](https://github.com/ianymu) and we'll add it to the seed list.

**How accurate is `benchmarkType`?** It's inferred from repo name + description + first 2 KB of README. Heuristic, not authoritative — but consistent enough to filter and group on. For canonical seed repos the type is hand-curated.

**How accurate is `lastPublishedScore`?** Best-effort regex extraction from the README. Many harnesses host leaderboards on external sites (e.g. swebench.com); for those, this field will often be null and you should follow the `url` for live scores.
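When `lastPublishedScore` is present, it can be split into per-model numbers on the consumer side. The sketch below assumes the `Model: NN%; Model: NN%` layout seen in the example output; other layouts may occur, and entries that do not match are skipped. The function name `parse_scores` is hypothetical, and this is not the Actor's own extraction logic.

```python
import re
from typing import Dict, Optional

# One "Model name: 12.3%" entry; the model name is everything before the last
# colon that is followed by a percentage.
_SCORE_ENTRY = re.compile(r"^\s*(?P<model>.+?)\s*:\s*(?P<value>\d+(?:\.\d+)?)\s*%\s*$")


def parse_scores(raw: Optional[str]) -> Dict[str, float]:
    """Parse a lastPublishedScore string into {model: percent}."""
    if not raw:
        return {}
    scores: Dict[str, float] = {}
    for part in raw.split(";"):
        match = _SCORE_ENTRY.match(part)
        if match:
            scores[match.group("model")] = float(match.group("value"))
    return scores


parsed = parse_scores("Claude 3.5 Sonnet: 49%; GPT-4o: 38%")
```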

**Custom version?** Need this wired into your research / engineering workflow with Slack alerts, per-benchmark deep dives, or paper-citation enrichment? Open an issue and we can build a custom Actor on top.

Built by **Ian Mu** as part of his 100-Apify-Actor AI tooling portfolio. See the companion repo [`claude-verify-before-stop`](https://github.com/ianymu/claude-verify-before-stop).

# Actor input Schema

## `topicFilter` (type: `array`):

Only include harnesses whose inferred type contains one of these strings (e.g. 'code-fixing', 'web-agent', 'tool-use'). Leave empty for all types.
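The matching rule described here (case-insensitive substring match, empty list means no filtering) can be mirrored on the client side, for example to re-filter an unfiltered dataset. This is a sketch of the documented behavior, not the Actor's source code; the function name `matches_topic_filter` is hypothetical.

```python
from typing import Optional, Sequence


def matches_topic_filter(benchmark_type: Optional[str], topic_filter: Sequence[str]) -> bool:
    """Return True if benchmark_type contains any filter string, ignoring case."""
    # An empty filter keeps every harness.
    if not topic_filter:
        return True
    haystack = (benchmark_type or "").lower()
    return any(term.lower() in haystack for term in topic_filter if term)


keep = matches_topic_filter("code-fixing", ["CODE", "tool-use"])
```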

## `minStars` (type: `integer`):

Skip repos with fewer than this many GitHub stars.

## `maxResults` (type: `integer`):

Stop after enriching this many harnesses.

## Actor input object example

```json
{
  "topicFilter": [],
  "minStars": 100,
  "maxResults": 30
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("ianymu/agent-eval-harness-finder").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("ianymu/agent-eval-harness-finder").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call ianymu/agent-eval-harness-finder --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ianymu/agent-eval-harness-finder",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Agent Eval Harness Finder",
        "description": "Catalog open-source agent eval harnesses & benchmarks (SWE-Bench, AgentBench, ToolBench, BIRD, GAIA, MAST, WebArena). Combines GitHub search + curated seed list, scores by quality signals (stars, recency, license), parses README scope and sample model scores.",
        "version": "0.0",
        "x-build-id": "GaRGe2q7wLuf7FKag"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ianymu~agent-eval-harness-finder/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ianymu-agent-eval-harness-finder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ianymu~agent-eval-harness-finder/runs": {
            "post": {
                "operationId": "runs-sync-ianymu-agent-eval-harness-finder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ianymu~agent-eval-harness-finder/run-sync": {
            "post": {
                "operationId": "run-sync-ianymu-agent-eval-harness-finder",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "topicFilter": {
                        "title": "Filter by benchmark type",
                        "type": "array",
                        "description": "Only include harnesses whose inferred type contains one of these strings (e.g. 'code-fixing', 'web-agent', 'tool-use'). Leave empty for all types.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "minStars": {
                        "title": "Minimum stars",
                        "minimum": 0,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Skip repos with fewer than this many GitHub stars.",
                        "default": 100
                    },
                    "maxResults": {
                        "title": "Max results",
                        "minimum": 5,
                        "maximum": 200,
                        "type": "integer",
                        "description": "Stop after enriching this many harnesses.",
                        "default": 30
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
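
For projects that cannot use a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below uses only the Python standard library and passes the token as a query parameter, as the specification describes. The function names are hypothetical, and the assumption that the response body is a JSON array of dataset items is not stated in the specification itself.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/ianymu~agent-eval-harness-finder/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300.0):
    """Send the request and decode the response body as JSON.

    Assumption: the endpoint responds with a JSON array of dataset items.
    """
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `run_actor("<YOUR_API_TOKEN>", {"minStars": 200, "maxResults": 20})` blocks until the run finishes, so choose a timeout that comfortably covers the expected run time.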