Actor A/B Tester — Compare Two Actors Side by Side

Run two Apify actors with identical input in parallel and compare results side by side. Measures result count, field coverage, execution speed, and compute cost. Declares a winner with percentage diffs. Returns JSON/CSV/Excel.

Pricing: $500.00 / 1,000 a/b tests

Developer: ryan clinton (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 6 days ago

Actor A/B Tester — Compare Two Apify Actors and Get a Production Decision

Comparing two actors with a single run can produce misleading results; use multiple runs to reduce variance.

Actor A/B Tester is the runtime routing stage in an Apify actor execution lifecycle: it runs two candidate actors on identical input, multiple times, and returns a production decision on which one to ship. Use it when you need to decide which actor to use in production.

Contract

Actor A/B Tester runs two Apify actors on the same input and returns a production decision, replacing manual comparison of run outputs. Use it when you need to choose between two actors based on real performance data.

Output field: decisionPosture is a routable control signal for automation; it determines what to do next:

  • switch_now — commit to the winner
  • canary_recommended — partial rollout
  • monitor_only — directional result, do not switch
  • no_call — insufficient or unreliable evidence

Always branch on decisionPosture. It is the only field you should use for control flow. Do not branch on verdictHuman, oneLine, or decisionReason.

Do not use single-run results to choose between actors.

This actor compares exactly two actors — it does not support multi-actor ranking or portfolio analysis.

Quick start

Input

{
  "actorA": "user/actorA",
  "actorB": "user/actorB",
  "testInput": { "query": "..." },
  "mode": "decision"
}

Output (minimal)

{
  "decisionPosture": "switch_now",
  "confidence": 0.82,
  "decisionReadiness": "actionable"
}

Usage

if result["decisionPosture"] == "switch_now":
    switch_to_winner()
elif result["decisionPosture"] == "canary_recommended":
    rollout_canary()
elif result["decisionPosture"] == "monitor_only":
    log_and_retry()
else:
    keep_current()

Execution pattern (canonical)

  1. Run Actor A and Actor B on the same input
  2. Compare results across N runs
  3. Branch on decisionPosture

Never:

  • compare actors with single runs
  • branch on verdictHuman, oneLine, or decisionReason
  • ignore blocking warnings

Mental model

run A + run B → compare results → return decision → act

Decision invariants

These always hold — the actor enforces them in code. You can rely on them in automation without defensive checks (a minimal assertion sketch for test suites follows the list).

  • decisionPosture = switch_now implies:

    • decisionReadiness = actionable
    • no blocking warnings
    • confidenceBreakdown.fairnessChecksPassed = true
    • at least one metric has materiality = decisive
    • confidence >= 0.7
    • decisionStability.flipRisk != high
    • runsPerActor >= 2
  • verdictCode = NO_CALL implies:

    • decisionPosture = no_call
    • decisionReadiness = insufficient-data
    • comparison.winner = no_call
  • Any blocking warning implies:

    • decisionPosture != switch_now
    • decisionReadiness != actionable
  • fairnessChecksPassed = false implies:

    • decisionReadiness != actionable
    • confidence is halved (harmonic-mean output × 0.5)
  • runsPerActor = 1 implies:

    • decisionReadiness != actionable (smoke tests are capped at monitor)
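
If you want to verify these guarantees in a test suite rather than take them on faith, here is a minimal assertion sketch. It assumes the full dataset record documented below, loaded into a record dict; the helper name is hypothetical.

def assert_decision_invariants(record: dict) -> None:
    """Sanity-check the documented invariants on a comparison record (illustrative helper)."""
    comp = record["comparison"]
    blocking = [w for w in comp["warnings"] if w["severity"] == "blocking"]

    if record["decisionPosture"] == "switch_now":
        assert record["decisionReadiness"] == "actionable"
        assert not blocking
        assert comp["confidenceBreakdown"]["fairnessChecksPassed"] is True
        assert any(m == "decisive" for m in comp["materiality"].values())
        assert comp["confidence"] >= 0.7
        assert comp["decisionStability"]["flipRisk"] != "high"
        assert record["runsPerActor"] >= 2

    if record["verdictCode"] == "NO_CALL":
        assert record["decisionPosture"] == "no_call"
        assert record["decisionReadiness"] == "insufficient-data"
        assert comp["winner"] == "no_call"

    # Fairness failures, blocking warnings, and single-run smoke tests all forbid "actionable".
    if blocking or not comp["confidenceBreakdown"]["fairnessChecksPassed"] or record["runsPerActor"] == 1:
        assert record["decisionReadiness"] != "actionable"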

Input → Output

Input:

  • Two Apify actors (actorA, actorB)
  • One shared testInput JSON
  • mode (1–10 runs) or explicit runs count
  • Optional decisionProfile — balanced / speed_first / cost_first / output_first / reliability_first

Output:

  • decisionPosture — switch_now / canary_recommended / monitor_only / no_call (the one field your automation should read)
  • verdictHuman — one-sentence recommendation, paste-ready
  • confidence + breakdown (reliability × score separation × variance × sample adequacy)
  • decisionStability — how fragile the winner is across pairwise matchups
  • warnings[] — blocking vs advisory, every code documented
  • sinceLastComparableRun — delta vs last scheduled run of the same pair (opt-in)
  • Full per-run stats, sample records, and Store popularity context

Simple example

You have two scrapers pointed at the same site:

  • Actor A — slower but cheaper
  • Actor B — faster but costs more

You run A/B Tester with mode: "decision" (5 runs each). It produces:

"Switch production to Actor B. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."

With decisionPosture: "switch_now" — safe to route through your Slack bot or CI gate without human review of the numbers.

Decision contract

These are the promises this actor makes. Every one is enforced in the output contract.

  • Compares exactly two actors only. No portfolios, no tournaments, no store-wide scans.
  • Same input and same runtime settings on both sides. Same testInput, same timeout, same memory. Reported in comparisonContext.fairnessChecks.
  • Parallel launch. Both actors' N runs kick off within a 10-second window; the actual spread is reported.
  • If fairness fails, actionable is forbidden. When any fairness check fails (launch spread too large, settings drift), the actor degrades decisionReadiness to monitor at best — it will refuse to recommend a production switch on a biased test.
  • Observed cost only. We report usageTotalUsd for the runs we orchestrated. Nothing about your account spend.
  • Store popularity is informational. Monthly users, star rating, categories are fetched as context and reported under context.storeSignals — they do not influence the winner score under any profile.
  • Abstention is a first-class outcome. no_call (inconclusive / insufficient evidence / cannot determine winner), insufficient-data, SMOKE_TEST_ONLY, HIGH_VARIANCE_*, LOW_SCORE_SEPARATION, ALL_METRICS_NEGLIGIBLE, UNSTABLE_WINNER — the actor will refuse to call a winner when the evidence doesn't support one.
  • Any blocking warning also forbids actionable. Warnings are tiered blocking vs advisory. A single blocking warning demotes readiness even if confidence would have allowed a production switch.
  • One-shot comparator. Not a long-term baseline monitor. Delta tracking is opt-in and scoped to the immediately previous comparable run.

Example — a production decision in one run

A scraping team has two academic-paper scrapers wired up: crossref-paper-search and europe-pmc-search. Both accept {query: "..."}. They run a decision mode test (5 runs each, balanced profile):

headline: "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each"
decisionPosture: "switch_now"
decisionReadiness: "actionable"
verdictCode: "ACTOR_B_WIN"
verdictHuman: "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."
confidence: 0.82 (high)
decisionStability: winnerConsistency 0.96, flipRisk "low"
blockingWarnings: []

decisionPosture: "switch_now" means every invariant held — fairness passed, no blocking warnings, at least one decisive metric, high confidence, pairwise-stable winner. The Slack notifier reads SUMMARY.decisionPosture and fires a "Ready to switch" alert. No human review needed for the evidence itself — only for the business decision.

Decision flow

N runs per side, launched in parallel
        ↓
Aggregate medians / p90 / stddev
        ↓
Fairness checks pass? ──── NO ──→ decisionReadiness = monitor (at best)
        ↓ YES
Score gap ≥ 15%? ────────── NO ──→ no_call
        ↓ YES
Any metric ≥ material? ──── NO ──→ no_call (ALL_METRICS_NEGLIGIBLE)
        ↓ YES
Pairwise winner stable? ─── NO ──→ demote strong → moderate → weak
        ↓ YES
Any blocking warning? ───── YES ─→ decisionReadiness = monitor (at best)
        ↓ NO
Confidence ≥ 0.7 + decisive materiality? ── YES → strong / actionable
                                            NO ─→ moderate or weak / monitor
        ↓
decisionPosture: switch_now | canary_recommended | monitor_only | no_call

When to trust the verdict

Downstream automation should filter on one of two fields: decisionPosture (action-ready) or decisionReadiness (readiness-ready). Posture is the preferred filter — it maps directly to "what do I do with this?".

| Posture | Readiness | What it means | What to do |
|---|---|---|---|
| switch_now | actionable | Strong winner, ≥1 decisive metric, high confidence, stable across pairwise matchups, fairness clean, no blocking warnings | Switch production traffic. Safe to act on in CI gates and Zapier flows. |
| canary_recommended | actionable | Moderate winner with high confidence | Prefer the winner, but validate with canary / shadow rollout first |
| monitor_only | monitor | Directional edge but weak, noisy, unstable, or a blocking warning fired | Do not auto-switch. Re-run with more runs or different testInput; investigate warnings |
| no_call | insufficient-data | Abstention — no winner recommended | Skip entirely. This is a valid, honest outcome. |

Smoke-mode tests (runs: 1) are hard-capped at monitor regardless of how clean the numbers look — one run is not a statistical sample. Fairness failures and blocking warnings are also hard caps.

When NOT to trust the verdict

Every warning carries a severity: blocking or advisory. Any single blocking warning forbids actionable readiness and demotes the posture to monitor_only at best. Read comparison.warnings[] before acting.

Blocking warnings (forbid actionable)

| Code | Meaning |
|---|---|
| BOTH_FAILED | Both actors failed every run. Test is invalid — check testInput compatibility and token permissions. |
| SMOKE_TEST_ONLY | runs: 1. Smoke mode is always capped at monitor. |
| LOW_SCORE_SEPARATION | Score gap <15% — actor abstained to no_call. |
| ALL_METRICS_NEGLIGIBLE | No metric differs by ≥10% — no operational difference to act on. |
| RESULT_SHAPE_DIVERGENCE | Field overlap <20%. The two actors may be solving different problems. Inspect sampleRecord manually. |
| NO_DATA_EXTRACTED | Both actors ran but returned no extractable fields. Your testInput likely doesn't match either schema. |
| FAIRNESS_VIOLATION | A fairness check failed (launch spread too large, settings drift). Test is biased. |
| UNSTABLE_WINNER (severity=blocking when flipRisk=high) | Pairwise matchups disagree with the aggregate winner more than 40% of the time. |
| IDENTICAL_ACTORS | actorA and actorB normalize to the same actor id. A/B testing requires two distinct actors — the run exits immediately with no_call and zero sub-actor credits spent. |

Advisory warnings (flag noise but don't block)

| Code | Meaning |
|---|---|
| ONE_SIDE_FAILED | One actor succeeded zero times. Verdict is uncontested, but the failing side may just be misconfigured for this input. |
| HIGH_VARIANCE_A / HIGH_VARIANCE_B | Duration CV >50%. Increase runs or accept the noise floor. |
| ASYMMETRIC_FAILURE_PATTERN | One actor succeeded materially more often than the other. Test environment may be biased (token scope, rate limits, network). |
| COST_PER_RESULT_UNSTABLE | Cost CV >50%. Don't act on a cost edge alone. |
| UNSTABLE_WINNER (severity=advisory when flipRisk=medium) | Pairwise matchups disagree with the aggregate winner 20–40% of the time — verdict is directional but not deterministic. |
| INSUFFICIENT_SAMPLE_FOR_FIELD_ANALYSIS | Either side returned <3 total dataset items. Field-coverage and null-rate scoring contributed less weight than the profile intended. |

Confidence components — what "good" looks like

comparison.confidenceBreakdown is a diagnostic panel. Each component is 0–1. The final comparison.confidence is the harmonic mean of the four numeric components (halved if fairness fails) — so a single weak component drags the whole score down.

| Component | Good (≥) | Risky (<) | Meaning |
|---|---|---|---|
| successReliability | 0.9 | 0.7 | Fraction of runs that succeeded. Below 0.7 means too many runs are failing to trust the aggregate. |
| scoreSeparation | 0.3 | 0.15 | Score gap as a fraction of total score. Below 0.15 triggers abstention. |
| variancePenalty | 0.8 | 0.5 | Healthiness of variance (1 - avgCV). Below 0.5 means the runs were too noisy to trust. |
| sampleAdequacy | 0.5 | 0.3 | Linear ramp on run count — 1 run = 0.1, 3 = 0.3, 5 = 0.5, 10 = 1.0. |
| fairnessChecksPassed | true | false | Hard gate. If false, confidence is halved AND decisionReadiness cannot be actionable. |
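
To see why a single weak component drags the blend down, here is a minimal sketch of the documented formula (harmonic mean of the four numeric components, halved when fairness fails). It illustrates the behaviour, not the actor's exact implementation.

from statistics import harmonic_mean

def blended_confidence(breakdown: dict) -> float:
    """Illustrative only: harmonic mean of the four numeric components, halved if fairness failed."""
    components = [
        breakdown["successReliability"],
        breakdown["scoreSeparation"],
        breakdown["variancePenalty"],
        breakdown["sampleAdequacy"],
    ]
    # Harmonic mean collapses toward zero if any component is near zero.
    confidence = harmonic_mean(components) if min(components) > 0 else 0.0
    if not breakdown["fairnessChecksPassed"]:
        confidence *= 0.5
    return round(confidence, 2)

# A single weak signal drags the blend down even when the others are strong:
print(blended_confidence({"successReliability": 1.0, "scoreSeparation": 0.05,
                          "variancePenalty": 0.9, "sampleAdequacy": 0.5,
                          "fairnessChecksPassed": True}))   # ≈ 0.17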

Decision stability

comparison.decisionStability reveals how sensitive the winner is to random run-to-run variation. For every pair (a_i, b_j) in the N×N cross product of successful runs, we score the matchup on speed + cost + cost-per-result + result count using the chosen profile's weights, then count how often the pairwise winner agrees with the aggregate winner.

| Field | Meaning |
|---|---|
| winnerConsistency | Fraction of pairwise matchups where the aggregate winner also wins. 1.0 = deterministic, 0.5 = coin flip. |
| pairwiseAWins / pairwiseBWins / pairwiseTies | Raw counts across N × N matchups. |
| flipRisk | low (consistency ≥0.8) / medium (≥0.6) / high (<0.6). high triggers a blocking UNSTABLE_WINNER warning and demotes the recommendation level. |

Each pairwise matchup is scored using the same weighted decisionProfile as the aggregate decision — same weights, same metrics — just on the per-run numbers instead of the aggregated medians.

If flipRisk: high fires on your result, the "winner" is essentially noise. Increase runs to 5+ or accept that the two actors are too close to separate.
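
A minimal sketch of the pairwise-consistency idea described above; it collapses each run to one pre-computed score, whereas the real actor scores every matchup on the full profile weights.

from itertools import product

def winner_consistency(a_scores: list[float], b_scores: list[float], aggregate_winner: str) -> dict:
    """Cross every successful run of A against every run of B and count matchup winners."""
    a_wins = b_wins = ties = 0
    for a, b in product(a_scores, b_scores):  # N × N matchups
        if a > b:
            a_wins += 1
        elif b > a:
            b_wins += 1
        else:
            ties += 1
    total = a_wins + b_wins + ties
    agree = a_wins if aggregate_winner == "actorA" else b_wins
    consistency = agree / total if total else 0.0
    flip_risk = "low" if consistency >= 0.8 else "medium" if consistency >= 0.6 else "high"
    return {"winnerConsistency": round(consistency, 2), "pairwiseAWins": a_wins,
            "pairwiseBWins": b_wins, "pairwiseTies": ties, "flipRisk": flip_risk}

# Example: B wins most matchups, so an aggregate "actorB" verdict would be stable.
print(winner_consistency([0.40, 0.55, 0.48, 0.52, 0.45], [0.70, 0.68, 0.42, 0.75, 0.71], "actorB"))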

Use case — pair-wise regression detection

Set compareToLastComparableRun: true and schedule the same A/B test on a cron. Every run, the actor looks up the previous snapshot for the same (actorA, actorB, testInput, mode, profile) tuple, and reports:

  • winnerChanged: boolean — did the verdict flip since last week?
  • confidenceChangedBy: number — did the certainty drop?
  • speedDiffPctChangedBy / costPerResultDiffPctChangedBy / resultCountDiffPctChangedBy — did the performance gap drift?

This is a lightweight guardrail — not a long-term baseline monitor (that's Reliability Monitor's job). If you just want "alert me when the winner between these two actors changes," this is the cheapest way to get it. First run for a pair returns {found: false} — not a failure.
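
A minimal consumer sketch for such a scheduled run, reading the root-level delta fields documented here. The alert callable and the 0.2 confidence-drop threshold are illustrative, not part of the contract.

def check_weekly_ab(record: dict, alert) -> None:
    """Alert when the scheduled A/B pair flips winner or loses confidence (sketch)."""
    delta = record.get("sinceLastComparableRun", {})
    if not delta.get("found"):
        return  # first run for this pair+input+mode+profile: nothing to compare yet
    if delta.get("winnerChanged"):
        alert(f"A/B winner flipped: {record['headline']}")
    elif delta.get("confidenceChangedBy", 0) < -0.2:  # threshold is an arbitrary example
        alert(f"A/B confidence dropped by {abs(delta['confidenceChangedBy']):.2f}: {record['headline']}")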

Store UI walkthrough

  1. Go to Actor A/B Tester on the Apify Store.
  2. Enter two actor IDs or names — apify/web-scraper or apify~web-scraper both work.
  3. Paste a testInput JSON both actors will accept.
  4. Pick a mode: smoke (1 run, compatibility check), standard (3 runs, routine), decision (5 runs, production switching), or high_stakes (10 runs, needs to survive scrutiny).
  5. Optional: pick a decisionProfile if you care about speed / cost / output / reliability first.
  6. Click Start. Read headline + verdictHuman for the one-line answer. Read comparison.warnings[] before acting.

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| actorA | string | Yes | apify/web-scraper | Actor ID or name for the first side |
| actorB | string | Yes | apify/cheerio-scraper | Actor ID or name for the second side |
| testInput | object | Yes | {startUrls:[{url:"https://example.com"}]} | Passed identically to both actors |
| mode | enum | No | standard | smoke (1 run, capped at monitor) / standard (3) / decision (5) / high_stakes (10) |
| decisionProfile | enum | No | balanced | balanced / speed_first / cost_first / output_first / reliability_first |
| runs | integer | No | (uses mode) | Override the mode's run count. If set, wins over mode. Range 1–10. |
| includeStoreContext | boolean | No | true | Fetch each actor's Store popularity stats (informational only) |
| compareToLastComparableRun | boolean | No | false | Look up the last run for the same pair+input+mode+profile and report delta |
| timeout | integer | No | 300 | Max seconds per run (same for both sides) |
| memory | integer | No | 512 | Memory MB per run (same for both sides) |
| apiToken | string | No | env APIFY_TOKEN | Leave blank on your own account — falls back to built-in token |

Output contract

{
  "recordType": "comparison",
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "decisionPosture": "switch_now",
  "decisionReadiness": "actionable",
  "verdictCode": "ACTOR_B_WIN",
  "actorA": { "...": "per-actor run stats + aggregates" },
  "actorB": { "...": "per-actor run stats + aggregates" },
  "comparison": {
    "winner": "actorB",
    "verdictCode": "ACTOR_B_WIN",
    "verdictMode": "clear-win",
    "verdictHuman": "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence).",
    "decisionPosture": "switch_now",
    "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE", "STABLE_WINNER"],
    "recommendationLevel": "strong",
    "decisionReadiness": "actionable",
    "confidence": 0.82,
    "confidenceLevel": "high",
    "confidenceBreakdown": {
      "successReliability": 1.0,
      "scoreSeparation": 0.65,
      "variancePenalty": 0.92,
      "sampleAdequacy": 0.5,
      "fairnessChecksPassed": true
    },
    "materiality": {
      "speed": "decisive",
      "cost": "strong",
      "costPerResult": "strong",
      "resultCount": "negligible",
      "fieldCoverage": "material"
    },
    "decisionStability": {
      "winnerConsistency": 0.96,
      "pairwiseAWins": 1,
      "pairwiseBWins": 24,
      "pairwiseTies": 0,
      "pairwiseTotal": 25,
      "flipRisk": "low"
    },
    "reasons": [
      { "metric": "Speed (median)", "winner": "B", "diffPct": 48, "detail": "B median: 4.8s (p90 5.1s), A median: 9.2s (p90 9.8s)", "materiality": "decisive" },
      { "metric": "Cost per result", "winner": "B", "diffPct": 46, "detail": "B: $0.00000498/result, A: $0.00000918/result", "materiality": "strong" }
    ],
    "warnings": [],
    "sharedFields": ["doi", "title"],
    "uniqueToA": ["abstract", "bibtex", "citationCount"],
    "uniqueToB": ["pmcid", "pmid", "meshTerms"],
    "speedDiffPct": 92,
    "costDiffPct": 84,
    "costPerResultDiffPct": 84,
    "resultCountDiffPct": 0
  },
  "comparisonContext": {
    "inputHash": "sha256:3a7b...",
    "normalizedActorA": "ryanclinton~crossref-paper-search",
    "normalizedActorB": "ryanclinton~europe-pmc-search",
    "runsRequested": 5,
    "mode": "decision",
    "decisionProfile": "balanced",
    "timeoutSec": 300,
    "memoryMb": 512,
    "testStartedAt": "2026-04-22T01:55:00.000Z",
    "fairnessChecks": {
      "sameInput": true,
      "sameMemory": true,
      "sameTimeout": true,
      "parallelLaunch": true,
      "childRunStartSpreadSec": 1.2
    },
    "usedStoreSignalsInWinnerSelection": false,
    "comparisonKey": "ab-last-3a7b..."
  },
  "context": {
    "storeSignals": {
      "actorA": { "stats": { "totalUsers": 145 }, "stars": 4.8, "categories": ["DEVELOPER_TOOLS"] },
      "actorB": { "stats": { "totalUsers": 92 }, "stars": 4.6, "categories": ["DEVELOPER_TOOLS"] },
      "usedInWinnerSelection": false,
      "note": "Store popularity is informational context only. It does not influence the winner score under any profile."
    }
  },
  "sinceLastComparableRun": { "found": false },
  "runsPerActor": 5,
  "testedAt": "2026-04-22T01:55:00.000Z"
}

Top-level fields

| Field | Type | Description |
|---|---|---|
| recordType | string | "comparison" on success, "error" on failure |
| headline | string | One-line summary, paste-ready |
| decisionPosture | enum | switch_now / canary_recommended / monitor_only / no_call — the canonical automation filter, duplicated from comparison.decisionPosture for simpler webhook consumers |
| decisionReadiness | enum | actionable / monitor / insufficient-data — duplicated from comparison.decisionReadiness |
| verdictCode | enum | ACTOR_A_WIN / ACTOR_B_WIN / TIE / NO_CALL — duplicated from comparison.verdictCode |
| runsPerActor | number | Runs executed per actor |
| testedAt | string | ISO 8601 timestamp |
| sinceLastComparableRun | object | Delta vs last comparable run (only populated if compareToLastComparableRun: true) |

actorA / actorB

| Field | Type | Description |
|---|---|---|
| name | string | Actor ID / name as provided |
| runs | array | Per-run stats: {status, results, duration, cost, error?} |
| successfulRuns / failedRuns | number | Counts |
| durationStats, costStats, resultCountStats | object | {mean, median, p90, stddev, min, max} |
| costPerResult | number or null | costStats.mean / resultCountStats.mean — the efficiency metric |
| fields | array | Unique field names across all successful runs |
| fieldNullRates | array | Per-field null rate, sorted by highest null % |
| sampleRecord | object or null | First record from the first successful run |

comparison (the decision layer)

| Field | Type | Description |
|---|---|---|
| winner | enum | actorA / actorB / tie / no_call |
| verdictCode | enum | ACTOR_A_WIN / ACTOR_B_WIN / TIE / NO_CALL — stable machine-readable code |
| verdictMode | enum | clear-win / edge / tie / abstain — verdict shape |
| verdictHuman | string | One-line recommendation sentence — wording aligned with decisionPosture |
| decisionPosture | enum | switch_now / canary_recommended / monitor_only / no_call — the one field downstream automation should act on |
| decisionReasonCodes | array | Stable codes: SPEED_EDGE, CPR_EDGE, LOW_VARIANCE, HIGH_CONFIDENCE, STABLE_WINNER, UNSTABLE_WINNER, MONITOR_ROLLOUT_SUGGESTED, INSUFFICIENT_DATA |
| recommendationLevel | enum | strong / moderate / weak / tie / no_call |
| decisionReadiness | enum | actionable / monitor / insufficient-data |
| confidence | number | 0–1, harmonic mean of reliability × separation × variance × sample adequacy, halved if fairness fails |
| confidenceLevel | enum | high (≥0.8) / medium (≥0.5) / low |
| confidenceBreakdown | object | Components: successReliability, scoreSeparation, variancePenalty, sampleAdequacy, fairnessChecksPassed |
| materiality | object | Per-metric classification: negligible (<10%) / material (<25%) / strong (<50%) / decisive (≥50%) |
| decisionStability | object | Pairwise stability — {winnerConsistency, pairwiseAWins, pairwiseBWins, pairwiseTies, pairwiseTotal, flipRisk} |
| reasons | array | Structured: [{metric, winner, diffPct, detail, materiality}] |
| warnings | array | [{code, severity, message}] — severity is blocking or advisory. Any blocking warning forbids actionable readiness |
| sharedFields / uniqueToA / uniqueToB | array | Output schema overlap |
| speedDiffPct, costDiffPct, costPerResultDiffPct, resultCountDiffPct | number | Percentage diffs A vs B (medians) |

comparisonContext (fairness provenance)

| Field | Description |
|---|---|
| inputHash | SHA-256 of the (stable-serialized) testInput — proves both sides ran on identical input |
| normalizedActorA, normalizedActorB | username~name canonical form |
| runsRequested, mode, decisionProfile, timeoutSec, memoryMb | Test conditions |
| testStartedAt | Start timestamp |
| fairnessChecks | {sameInput, sameMemory, sameTimeout, parallelLaunch, childRunStartSpreadSec} |
| usedStoreSignalsInWinnerSelection | Always false — quarantine of popularity context |
| comparisonKey | Stable KV key for delta lookups |

context.storeSignals

Informational context for buyer reviewers — monthly users, star rating, categories. Never contributes to the winner score. Set includeStoreContext: false to skip the two extra API calls.

How it works — fairness setup

Both actors receive the exact same testInput (hashed to inputHash so the test is auditable after the fact), the same timeout, and the same memory. Both sets of N runs are launched in parallel. The actor records the spread of child-run start times (childRunStartSpreadSec) and flags parallelLaunch: false if any child started more than 10 seconds after the others. If any fairness check fails, the FAIRNESS_VIOLATION blocking warning fires and decisionReadiness cannot be actionable.
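
For auditing, the hash can be reproduced from your own input along these lines. Note that sorted keys and compact separators are an assumption; the contract only says the testInput is stable-serialized.

import hashlib
import json

def input_hash(test_input: dict) -> str:
    """SHA-256 over a stable JSON serialization of testInput (serialization details assumed)."""
    canonical = json.dumps(test_input, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(input_hash({"startUrls": [{"url": "https://example.com"}]}))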

How it works — run orchestration

Each child run is started via POST /v2/acts/{id}/runs?waitForFinish={timeout}. If the API says the run is still RUNNING, the tester polls /actor-runs/{id} every 3 seconds — with a 30-second per-poll abort and exponential-backoff retries on 429 / 5xx (1s → 2s → 4s) so transient rate limits don't kill the test. Dataset items are fetched with limit=1000. Sub-actor credits bill against your account.
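
The retry pattern described above looks roughly like this sketch. It is illustrative only, not the actor's actual code; it uses the public Apify run endpoints named in this section.

import time
import requests

API = "https://api.apify.com/v2"

def get_with_retries(url: str, token: str) -> dict:
    """GET with a 30-second per-request timeout and 1s → 2s → 4s backoff on 429/5xx."""
    for delay in (1, 2, 4):
        resp = requests.get(url, params={"token": token}, timeout=30)
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)  # transient rate limit or server error: back off and retry
            continue
        resp.raise_for_status()
        return resp.json()["data"]
    raise RuntimeError(f"gave up after retries: {url}")

def wait_for_run(run_id: str, token: str) -> dict:
    """Poll /actor-runs/{id} every 3 seconds until the run leaves READY/RUNNING."""
    while True:
        run = get_with_retries(f"{API}/actor-runs/{run_id}", token)
        if run["status"] not in ("READY", "RUNNING"):
            return run
        time.sleep(3)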

How it works — aggregation

Duration and cost stats are computed over successful runs only — one failed run doesn't poison the median. Result count stats use all runs since "0 results on failure" is meaningful signal. For each metric we report {mean, median, p90, stddev, min, max}. Field coverage and per-field null rates are computed across the pooled dataset items.
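
A compact sketch of that aggregation policy, using hypothetical per-run dicts; the p90 here is a nearest-rank approximation.

import statistics

def stats(values: list[float]) -> dict:
    """{mean, median, p90, stddev, min, max} for one metric."""
    ordered = sorted(values)
    p90_index = max(0, round(0.9 * (len(ordered) - 1)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[p90_index],
        "stddev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min": ordered[0],
        "max": ordered[-1],
    }

runs = [
    {"status": "SUCCEEDED", "duration": 4.8, "cost": 0.00025, "results": 51},
    {"status": "SUCCEEDED", "duration": 5.1, "cost": 0.00026, "results": 50},
    {"status": "FAILED",    "duration": 1.2, "cost": 0.00004, "results": 0},
]
ok = [r for r in runs if r["status"] == "SUCCEEDED"]
duration_stats = stats([r["duration"] for r in ok])        # successful runs only
result_count_stats = stats([r["results"] for r in runs])   # all runs: 0 results on failure is signal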

How it works — decision logic

A weighted score is accumulated across seven metrics based on the selected decisionProfile:

| Profile | Success | Count | Speed | Cost | $/Result | Fields | Null |
|---|---|---|---|---|---|---|---|
| balanced | 3 | 2 | 1 | 1 | 2 | 1 | 1 |
| speed_first | 3 | 1 | 3 | 1 | 2 | 1 | 1 |
| cost_first | 3 | 1 | 1 | 2 | 3 | 1 | 1 |
| output_first | 3 | 3 | 1 | 1 | 1 | 2 | 2 |
| reliability_first | 5 | 2 | 1 | 1 | 1 | 1 | 1 |

The score gap (|aScore - bScore| / total) is the primary input to confidenceBreakdown.scoreSeparation. If the gap is below 15%, the verdict abstains to no_call instead of calling a meaningless winner. A winner is strong only if the gap is ≥35% AND confidence is ≥0.7 AND at least one metric is decisive AND pairwise flipRisk is not high.
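
A simplified sketch of the weighted-score-and-abstain step, using the balanced weights from the table above. Metric winners are taken as given here, whereas the real scoring also normalizes per-metric differences and handles ties.

BALANCED = {"success": 3, "count": 2, "speed": 1, "cost": 1, "costPerResult": 2, "fields": 1, "null": 1}

def score_gap(metric_winners: dict, weights: dict = BALANCED) -> dict:
    """Accumulate weighted points per side and report the relative gap (sketch)."""
    a = sum(w for m, w in weights.items() if metric_winners.get(m) == "A")
    b = sum(w for m, w in weights.items() if metric_winners.get(m) == "B")
    total = sum(weights.values())
    gap = abs(a - b) / total
    if gap < 0.15:
        verdict = "no_call"  # too close to call a meaningful winner
    else:
        verdict = "A" if a > b else "B"
    return {"aScore": a, "bScore": b, "gap": round(gap, 2), "verdict": verdict}

# B wins speed, cost, and cost-per-result; A wins field coverage; the rest tie.
print(score_gap({"speed": "B", "cost": "B", "costPerResult": "B", "fields": "A"}))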

How it works — output delivery

  • Full comparison record → Apify dataset — use this when humans need diagnostics (per-run stats, confidence breakdown, materiality tiers, pairwise stability, sample records).
  • Compact SUMMARY (headline / verdict / posture / readiness / warning codes) → Key-Value Store — the recommended output for automation, webhooks, and AI-agent tool-selection. Machine-readable, <1 KB, structured JSON decision output.
  • Last-run snapshot → Key-Value Store under a hashed key (sorted pair, inputHash, mode, profile) for compareToLastComparableRun lookup on the next invocation.

How much does it cost?

Pay-Per-Event pricing at $0.15 per A/B test. Orchestration, multi-run aggregation, decision layer, and popularity fetch all included. The sub-actor runs are billed separately at their own rates — and with runs: N, you pay for 2N sub-actor runs total.

| Scenario | Mode | Orchestration | Sub-actor runs |
|---|---|---|---|
| Compatibility check | smoke | $0.15 | 2× actor rate |
| Routine comparison | standard | $0.15 | 6× actor rate |
| Production decision | decision | $0.15 | 10× actor rate |
| High-stakes evaluation | high_stakes | $0.15 | 20× actor rate |
| Weekly scheduled (4/month, standard) | standard | $0.60 | 24× actor rate |

The Apify Free plan covers ~30 A/B tests/month (orchestration only).

FAQ for skeptics

What if the two actors interpret the same input differently? Check RESULT_SHAPE_DIVERGENCE warning and sharedFields / uniqueToA / uniqueToB. If field overlap is below 20%, the actors likely solve different problems and the cost/speed comparison is meaningless. Inspect sampleRecord from each side before trusting the verdict.

How do I know if the winner is real and not just luck? Check comparison.decisionStability.flipRisk and comparison.confidenceBreakdown.variancePenalty. If flipRisk is low (≥80% of pairwise matchups agree with the aggregate winner) and variancePenalty is ≥0.8 (runs are low-noise), the winner is real. If flipRisk is high or variancePenalty is <0.5, the "winner" is likely noise — increase runs to 5+ and re-run. For automation you don't have to check manually: when stability is poor the actor auto-demotes the recommendation level, fires an UNSTABLE_WINNER warning, and refuses actionable readiness.

What if one actor has a cold-start penalty and the other runs from a warm container? Bump runs to 5+ so the cold-start disappears into the aggregate. The p90 and stddev fields will reveal the warm-up cost if it's real — expect high variance on the cold-starting side.

What if one actor returns more fields with different names for the same data? uniqueToA / uniqueToB surfaces this. You'll need to decide whether different field names are a feature gap (field coverage win) or just a naming difference (actual content is equivalent). The tester can't resolve that for you — it's a semantic call.

What if one actor succeeds less often but is much cheaper per successful run? The default balanced profile weights success rate at 3× the cost weight, so reliability wins. Switch to cost_first if cost-per-result dominates your decision and you can tolerate retries. The verdict is auditable: decisionProfile is in comparisonContext.

Can popularity ever outweigh runtime evidence? No. usedStoreSignalsInWinnerSelection: false is a hard constant. Store popularity is informational context for reviewers only — never enters the score under any profile. Set includeStoreContext: false to skip fetching it entirely.

When should I ignore the winner?

  • Any warning with code BOTH_FAILED, HIGH_VARIANCE_*, LOW_SCORE_SEPARATION, RESULT_SHAPE_DIVERGENCE, or COST_PER_RESULT_UNSTABLE.
  • verdictCode: NO_CALL.
  • decisionReadiness: insufficient-data.
  • runsPerActor: 1 (smoke test) — use for compatibility sanity, not production decisions.
  • confidenceBreakdown.fairnessChecksPassed: false.

Can a 1-run smoke test ever be action-worthy? No. Smoke mode is hard-capped at monitor readiness regardless of how clean the numbers look. One run is not a sample.

How many runs should I use?

  • smoke (1) — "does my testInput even work on both actors?"
  • standard (3) — routine comparison, enough to spot real differences.
  • decision (5) — production switching, variance gets averaged out.
  • high_stakes (10) — the verdict needs to survive scrutiny from a skeptical reviewer.

Does the $0.15 fee include the sub-actor run costs? No. The $0.15 covers orchestration + decision layer only. runs: N means 2N sub-actor runs, each billed at that sub-actor's rate. Budget accordingly.

Anti-pattern — don't do this

Do NOT use this actor to compare actors with different input shapes. Example:

  • Actor A expects {startUrls: [...]}
  • Actor B expects {query: "..."}

Passing one shared testInput means one side runs with garbage input. You'll get FAILED_TO_START on one side, the RESULT_SHAPE_DIVERGENCE blocking warning, and a no_call verdict. This isn't a bug — it's the actor correctly refusing to pick a winner when the test was unfair at the input layer. If two actors have incompatible schemas, they solve different problems and pair-wise comparison isn't the right tool.

How does compareToLastComparableRun work? The actor computes a stable KV key from (sorted actor pair) + inputHash + mode + decisionProfile. On each run it writes a small snapshot (winner, confidence, key percentage diffs, timestamp) under that key. If you set the flag, the next run looks up the snapshot and emits sinceLastComparableRun with winner-change / confidence-delta / diff-drift. First run for a pair just returns {found: false} — not an error.

Why does confidence use the harmonic mean? Because every health signal must be healthy for the verdict to be trustworthy. Arithmetic mean would let one strong signal (e.g. 100% success rate) mask a weak one (e.g. 5% score separation). Harmonic mean collapses to ~0 if any component is near zero. Same reason F1 score uses harmonic mean of precision and recall.

Is it legal to compare actors from other developers? Yes. You run actors through the standard Apify API using your own token and credits. No different from running any public actor on the Store.

Automation contract

Three integration paths, chosen by your consumer's shape:

| Consumer | Read from | Why |
|---|---|---|
| Webhook / Zapier / Slack / CI gate | Root decisionPosture on the dataset record | One field, stable enum, routes directly to action — switch_now / canary_recommended / monitor_only / no_call. No need to walk into comparison.*. |
| Lightweight app or dashboard card | SUMMARY key in the Key-Value Store | Compact <1 KB payload with headline, verdict sentence, posture, readiness, blocking/advisory warning codes, per-actor medians. Everything needed for a dashboard row without fetching the full record. |
| Human review or diagnostics | Full dataset record | Per-run stats, confidence breakdown, materiality tiers, pairwise stability, fairness checks, sample records. Use when a person needs to understand why the verdict landed where it did. |

Rule of thumb: automation reads root fields or SUMMARY. Humans read the full dataset record. Never parse verdictHuman — it's for display, not routing.

Programmatic access

Python

from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
run = client.actor("ryanclinton/actor-ab-tester").call(
    run_input={
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision",
        "decisionProfile": "speed_first",
    }
)
result = next(client.dataset(run["defaultDatasetId"]).iterate_items())
posture = result["comparison"]["decisionPosture"]

# Route by decisionPosture — the canonical action filter
if posture == "switch_now":
    winner = result[result["comparison"]["winner"]]
    print(f"→ SWITCH production to {winner['name']}")
elif posture == "canary_recommended":
    winner = result[result["comparison"]["winner"]]
    print(f"→ CANARY {winner['name']} before full rollout")
elif posture == "monitor_only":
    print("→ MONITOR — directional edge, do not auto-switch")
else:  # no_call
    print("→ NO CALL — insufficient evidence")

print(result["comparison"]["verdictHuman"])
for w in result["comparison"]["warnings"]:
    print(f" [{w['severity'].upper()}] [{w['code']}] {w['message']}")

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx" });
const run = await client.actor("ryanclinton/actor-ab-tester").call({
  actorA: "apify/web-scraper",
  actorB: "apify/cheerio-scraper",
  testInput: { startUrls: [{ url: "https://example.com" }] },
  mode: "decision",
  decisionProfile: "balanced",
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const result = items[0];
const posture = result.comparison.decisionPosture;

// Route by decisionPosture — the canonical action filter
switch (posture) {
  case "switch_now":
    console.log(`→ SWITCH production to ${result[result.comparison.winner].name}`);
    break;
  case "canary_recommended":
    console.log(`→ CANARY ${result[result.comparison.winner].name} before full rollout`);
    break;
  case "monitor_only":
    console.log("→ MONITOR — directional edge, do not auto-switch");
    break;
  case "no_call":
    console.log("→ NO CALL — insufficient evidence");
}
console.log(result.comparison.verdictHuman);
result.comparison.warnings.forEach((w) =>
  console.log(` [${w.severity.toUpperCase()}] [${w.code}] ${w.message}`),
);

Webhook / automation payload — the one thing to integrate

If you only integrate one output, use the SUMMARY KV payload. This is the recommended output for automation, webhooks, and AI agents. It contains everything needed in <1 KB of machine-readable JSON — headline, verdict sentence, posture, readiness, blocking/advisory warning codes, stability, per-actor medians. Stable keys, documented enums, no prose parsing required.

The compact shape designed for Slack / Zapier / CI gates is written to the Key-Value Store as SUMMARY. Read it with:

curl "https://api.apify.com/v2/key-value-stores/$KV_STORE_ID/records/SUMMARY?token=YOUR_API_TOKEN"

Returns:

{
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "verdictHuman": "Use ryanclinton/europe-pmc-search — decisively faster, materially cheaper per result across 5 runs each (high confidence).",
  "verdictCode": "ACTOR_B_WIN",
  "recommendationLevel": "strong",
  "confidenceLevel": "high",
  "decisionReadiness": "actionable",
  "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE"],
  "warningCodes": [],
  "actorA": { "name": "...", "successfulRuns": 5, "medianDurationS": 9.2, "medianCostUsd": 0.00044, "costPerResult": 0.00000918 },
  "actorB": { "name": "...", "successfulRuns": 5, "medianDurationS": 4.8, "medianCostUsd": 0.00025, "costPerResult": 0.00000498 },
  "runsPerActor": 5,
  "mode": "decision",
  "decisionProfile": "balanced",
  "sinceLastComparableRun": { "found": false },
  "testedAt": "2026-04-22T01:55:00.000Z"
}
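
If you prefer the API client to curl, here is a minimal gate sketch that reads the same record, assuming SUMMARY is stored in the run's default key-value store as described above.

from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
run = client.actor("ryanclinton/actor-ab-tester").call(run_input={
    "actorA": "apify/web-scraper",
    "actorB": "apify/cheerio-scraper",
    "testInput": {"startUrls": [{"url": "https://example.com"}]},
    "mode": "decision",
})
summary = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("SUMMARY")["value"]

# Gate on readiness and warning codes using only the compact payload.
if summary["decisionReadiness"] == "actionable" and not summary["warningCodes"]:
    print("PASS:", summary["headline"])
else:
    print("HOLD:", summary["verdictHuman"])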

cURL — synchronous

curl -X POST "https://api.apify.com/v2/acts/ryanclinton~actor-ab-tester/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision"
      }'

What this actor does NOT do

This is a narrow tool by design. If you need any of these, use the sibling actor instead:

  • Does NOT score README / SEO / schema / config quality → use actor-quality-monitor (metadata scorecard, 8 weighted dimensions, remediation plan).
  • Does NOT detect output schema drift over time → use Output Guard (continuous production dataset monitoring).
  • Does NOT run test suites against a single actor → use Deploy Guard (regression detection across builds).
  • Does NOT recommend PPE prices or plan fits → use Pricing Advisor.
  • Does NOT scan the Store for competitors or niches → use actor-competitor-scanner / Market Gap Finder.
  • Does NOT monitor account-wide spending → use cost-watchdog.
  • Does NOT synthesize a portfolio-wide action plan → use Fleet Analytics.
  • Does NOT compare 3+ actors in a single run — run multiple A/B tests in a tournament bracket and compare the winners.
  • Does NOT maintain a long-term baseline — use Reliability Monitor for that. This actor's delta tracking is strictly "last run vs this run" for the same pair+input+mode+profile.
  • Does NOT audit PII / GDPR / TOS → use Compliance Scanner.
  • Is NOT a load tester — use k6 / Apache Bench / wrk. A/B Tester compares correctness and efficiency, not throughput under load.
  • Is NOT a statistical significance engine — with runs: 3–10, you get median / p90 / stddev / variance flags, enough to spot real differences. Rigorous p-values would need runs: 30+ and a different tool.

Appendix: Explanation and background (optional)

This section is not required for integration. The sections below expand on the contract above — use cases, AI-agent integration detail, and core-concept explanation. Skip this if you've got the contract and you're integrating.

When to use this actor

Use it when:

  • You're choosing between two Apify actors for a production workflow
  • You need a data-driven, repeatable decision before switching actors
  • You want an automation-safe verdict (CI gate, Slack alert, Zapier routing, agent tool-selection)
  • You want to detect regressions in a weekly scheduled comparison of the same pair

Do NOT use it when:

  • You need to compare 3+ actors in a single run (run a tournament bracket instead)
  • You need long-term health monitoring of one actor (use Reliability Monitor)
  • You want to score an actor's Store listing quality (use Quality Monitor)
  • You want to load-test an actor under concurrency (use k6 / Apache Bench)

For AI agents and automated tool-selection

Actor A/B Tester is designed for AI-agent tool selection: it evaluates real performance data (speed, cost, reliability) and returns a structured decision that agents can route on. Agents call it and branch on the decisionPosture field without interpreting raw performance metrics or writing custom scoring logic; LLM tool-callers and automated pipelines use the same mechanism to choose between two Apify actors based on measured performance rather than static heuristics or guessing.

Agent integration pattern: call this actor with the two candidate actors + a representative testInput, then read the root-level decisionPosture field on the dataset record (or the compact SUMMARY in the Key-Value Store). Branch your agent logic on the four stable enum values:

  • switch_now → commit to the winner, log the verdict
  • canary_recommended → route a fraction of traffic to the winner, monitor
  • monitor_only → log the directional result but don't change routing yet
  • no_call → keep the current actor, re-run later with more data

Because the output is structured JSON with documented enums and stable field names, agents can route without parsing prose. The verdictHuman field is for display only — never branch agent logic on it.

Core concept

An A/B test runs two actors N times each in parallel on identical input, aggregates duration / cost / result count with statistical measures (median, p90, stddev), and emits a deterministic decision — winner, confidence tier, readiness level, posture — based on weighted scoring, materiality thresholds, and pairwise stability. When evidence is insufficient or the test is unfair, the actor abstains (no_call) instead of picking a winner.