Super AI Bench MCP Server

Tired of guessing which AI model works best for your task? Stop jumping between GPT, Claude, Gemini, and dozens of others wondering which one is actually faster, cheaper, or smarter for what you need to build.

Super AI Bench solves this. Compare real performance data across any AI model, instantly find the best fit for your use case, and run predictions without leaving your MCP client.

Use it for:

  • Regression Testing: Assess new model versions against your current production prompts.
  • Vendor Comparison: Run "shootouts" between GPT, Claude, and open-source alternatives.
  • Ensemble/Consensus: Automate parallel requests to multiple models to reduce hallucinations.
  • Cost Optimization: Find the cheapest model that meets your specific quality threshold.

Typical flow: Preview fields → filter with jq → find candidates on Replicate → run the prompt.

Install

Add the server once, then use it from your MCP client. Claude Desktop example:

Option 1: Using URL Query Parameter

Use this if your MCP client doesn't support custom headers. The Replicate API key is passed in the URL:

{
  "mcpServers": {
    "super-ai-bench": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://flamboyant-leaf--super-ai-bench-mcp.apify.actor/mcp?replicateApiKey=<REPLICATE_API_KEY>",
        "--header",
        "Authorization: Bearer <APIFY_API_TOKEN>"
      ]
    }
  }
}

Option 2: Using Headers

Use this if your MCP client supports custom headers:

{
  "mcpServers": {
    "super-ai-bench": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://flamboyant-leaf--super-ai-bench-mcp.apify.actor/mcp",
        "--header",
        "Authorization: Bearer <APIFY_API_TOKEN>",
        "--header",
        "X-Replicate-API-Key: <REPLICATE_API_KEY>"
      ]
    }
  }
}

Note: The Authorization header is still required for Apify authentication. Only the Replicate API key can be passed via query parameter. Supported query parameter names: replicateApiKey or replicate_api_key.

Note: You only need a Replicate API key if you want to run predictions. For benchmarks only, you don't need it.

MCP Client Timeout

Some MCP clients have short default timeouts, which can be too tight when running predictions across multiple models in parallel. If you see timeouts during benchmark or Replicate calls, increase your MCP client's timeout (e.g., to 120 seconds).
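
If you drive the server from the TypeScript MCP SDK instead of a desktop client, the timeout can also be raised per request. A minimal sketch, assuming a client like the one constructed under "How It Works" below and an SDK version whose callTool accepts a RequestOptions argument (the exact option name may vary by version):

// Raise the per-request timeout to 120 s for long-running bulk predictions.
// The timeout option is an assumption; check your SDK version's RequestOptions.
const result = await client.callTool(
  { name: "run_replicate_predictions_bulk_prompt", arguments: { /* ... */ } },
  undefined, // use the default result schema
  { timeout: 120_000 },
);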

How It Works

Important: Start every chat prompt with "READ THE DOC" so the model reads the protocol and uses the MCP tools correctly.

This server exposes benchmark and Replicate tools. A typical flow is:

  1. Preview fields with benchmark_get_*_preview.
  2. Filter with jq using the benchmark tool.
  3. Find models on Replicate with search_replicate_models_bulk.
  4. Run the prompt with run_replicate_predictions_bulk_prompt.

Full tutorial: https://dev.to/theairabbit/the-end-of-ai-monogamy-let-ai-find-the-best-model-for-your-task-23p7
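
For programmatic use, here is a minimal sketch of the same flow with the TypeScript MCP SDK. The endpoint URL and tool names come from this README; the argument names (jq, queries, models, prompt) and the benchmark field names inside the jq filter are assumptions, so discover the real ones with the preview tools and tool schemas.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Connect to the hosted Actor endpoint (same URL and headers as in the install examples).
const transport = new StreamableHTTPClientTransport(
  new URL("https://flamboyant-leaf--super-ai-bench-mcp.apify.actor/mcp"),
  {
    requestInit: {
      headers: {
        Authorization: `Bearer ${process.env.APIFY_API_TOKEN}`,
        "X-Replicate-API-Key": process.env.REPLICATE_API_KEY ?? "",
      },
    },
  },
);
const client = new Client({ name: "super-ai-bench-example", version: "0.1.0" });
await client.connect(transport);

// 1. Preview the available fields.
const preview = await client.callTool({
  name: "benchmark_get_llm_models_preview",
  arguments: {},
});

// 2. Filter with jq (field names are assumptions; take the real ones from the preview).
const topCoders = await client.callTool({
  name: "benchmark_get_llm_models",
  arguments: { jq: "[.[] | select(.coding_score > 80)] | sort_by(-.coding_score) | .[0:3]" },
});

// 3. Find candidate models on Replicate.
const candidates = await client.callTool({
  name: "search_replicate_models_bulk",
  arguments: { queries: ["llama 3 70b instruct", "qwen coder"] },
});

// 4. Run the same prompt across the chosen models.
const results = await client.callTool({
  name: "run_replicate_predictions_bulk_prompt",
  arguments: {
    models: ["meta/meta-llama-3-70b-instruct"],
    prompt: "Fix the bug in this React component: ...",
  },
});
console.log(results);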

Supported Benchmarks & Data

We currently support:

  • LLM benchmarks: intelligence, coding, math, speed, and pricing metrics.
  • Text‑to‑image benchmarks: ELO ratings and ranks.

Note: Benchmark tools require a jq expression. If the jq expression is invalid, too large, or returns empty/null fields, the tool returns an error plus a preview (sample + fields) to help you build a correct query.
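
Reusing the client from the sketch above, the same preview-then-filter pattern applies to the text-to-image benchmark. The elo field name below is an assumption, so check the preview output before writing the filter:

// Discover the real field names first.
const t2iPreview = await client.callTool({
  name: "benchmark_get_text_to_image_models_preview",
  arguments: {},
});

// Top two models by ELO (assumed field name); an invalid or empty jq result
// returns an error plus a preview to help correct the query.
const topImageModels = await client.callTool({
  name: "benchmark_get_text_to_image_models",
  arguments: { jq: "sort_by(-.elo) | .[0:2]" },
});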

Real Use Cases

Example 1: Ensemble Verification (Quality Assurance)

Goal: Reduce hallucinations by getting multiple opinions.

User: "Analyze this legal contract clause for loopholes. Get opinions from 4 different models. If they disagree, highlight the contradictions." What happens:

  • Benchmark: Filter models with high reasoning scores (e.g., GPQA/HLE).
  • Replicate: Run the prompt across multiple models.
  • Result: Compare and aggregate the answers.

Example 2: Regression Testing (New vs. Old Version)

Goal: Verify if a new model version breaks existing functionality.

User: "Run my standard 'JSON extraction' prompt on Llama-2-70b and the new Llama-3-70b. Display the outputs side-by-side." What happens: Use Replicate tools to run the same prompt on both models and compare outputs side‑by‑side.

Example 3: Optimization & Parallelization

Goal: Maximize speed and throughput.

User: "Run this classification task on the 5 fastest models simultaneously. I need a valid result in under 200ms." What happens: Filter by throughput (tokens/sec), then run the prompt across the fastest models and compare results.

Example 4: Image Comparison (Text-to-Image)

Goal: Compare image quality side‑by‑side from top models.

User: "Generate a cinematic product shot for a matte‑black water bottle. Compare the top 2 text‑to‑image models." What happens: Use the text‑to‑image benchmark to pick top models, run the same prompt on Replicate, and compare the outputs.

Example 5: Multi-Vendor "Shootout" (Coding)

Goal: Compare performance across different providers for a specific task.

User: "I need to fix a complex React bug. Fix it using the top 3 best coding models available right now." What happens:

  • Benchmark: Filter top coding models using benchmark metrics.
  • Replicate: Search and run the prompt across those models.
  • Result: Compare outputs and choose the best.

MCP Tools

  • Benchmarks: benchmark_get_llm_models, benchmark_get_llm_models_preview, benchmark_get_text_to_image_models, benchmark_get_text_to_image_models_preview
  • Replicate: search_replicate_models_bulk, run_replicate_predictions_bulk_prompt
  • Meta: actor_get_version, get_protocol_documentation, get_readme_documentation
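
A quick way to confirm connectivity and see these tools is to list them from the client in the sketch above (listTools is part of the MCP TypeScript SDK):

// Should include the benchmark_*, *_replicate_*, and meta tools listed above.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));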

Disclaimer: The tools and workflows presented here provide a preliminary glimpse into the performance of various AI models, and the results should not be taken at face value. Automated comparisons are illustrative and may not reflect performance across all scenarios. To fully understand the specific strengths and weaknesses of candidate models, independently verify the results against your own data and requirements.