Actor A/B Tester — Compare Two Actors Side by Side

Run two Apify actors with identical input in parallel and compare results side by side. Measures result count, field coverage, execution speed, and compute cost. Declares a winner with percentage diffs. Returns JSON/CSV/Excel.

Pricing: $500.00 / 1,000 a/b tests

Developer: ryan clinton (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 6 days ago

Actor A/B Tester — Compare Two Apify Actors and Get a Production Decision

Comparing two actors with a single run can produce misleading results; use multiple runs to reduce variance.

Actor A/B Tester is the runtime routing stage in an Apify actor execution lifecycle: it runs two candidate actors on identical input, multiple times, and returns a production decision on which one to ship. Use it when you need to decide which actor to use in production.

Contract

Actor A/B Tester runs two Apify actors on the same input and returns a production decision, replacing manual comparison of run outputs. Use it when you need to choose between two actors based on real performance data.

Output field: decisionPosture is a routable control signal for automation; it determines what to do next:

  • switch_now — commit to the winner
  • canary_recommended — partial rollout
  • monitor_only — directional result, do not switch
  • no_call — insufficient or unreliable evidence

Always branch on decisionPosture. It is the only field you should use for control flow. Do not branch on verdictHuman, oneLine, or decisionReason.

Do not use single-run results to choose between actors.

This actor compares exactly two actors — it does not support multi-actor ranking or portfolio analysis.

Quick start

Input

{
  "actorA": "user/actorA",
  "actorB": "user/actorB",
  "testInput": { "query": "..." },
  "mode": "decision"
}

Output (minimal)

{
  "decisionPosture": "switch_now",
  "confidence": 0.82,
  "decisionReadiness": "actionable"
}

Usage

if result["decisionPosture"] == "switch_now":
    switch_to_winner()
elif result["decisionPosture"] == "canary_recommended":
    rollout_canary()
elif result["decisionPosture"] == "monitor_only":
    log_and_retry()
else:
    keep_current()

Execution pattern (canonical)

  1. Run Actor A and Actor B on the same input
  2. Compare results across N runs
  3. Branch on decisionPosture

Never:

  • compare actors with single runs
  • branch on verdictHuman, oneLine, or decisionReason
  • ignore blocking warnings

Mental model

run A + run B → compare results → return decision → act

Decision invariants

These always hold — the actor enforces them in code. You can rely on them in automation without defensive checks (a minimal assertion sketch for test suites follows the list).

  • decisionPosture = switch_now implies:

    • decisionReadiness = actionable
    • no blocking warnings
    • confidenceBreakdown.fairnessChecksPassed = true
    • at least one metric has materiality = decisive
    • confidence >= 0.7
    • decisionStability.flipRisk != high
    • runsPerActor >= 2
  • verdictCode = NO_CALL implies:

    • decisionPosture = no_call
    • decisionReadiness = insufficient-data
    • comparison.winner = no_call
  • Any blocking warning implies:

    • decisionPosture != switch_now
    • decisionReadiness != actionable
  • fairnessChecksPassed = false implies:

    • decisionReadiness != actionable
    • confidence is halved (harmonic-mean output × 0.5)
  • runsPerActor = 1 implies:

    • decisionReadiness != actionable (smoke tests are capped at monitor)
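
If you want to verify these guarantees in a test suite rather than take them on faith, here is a minimal assertion sketch. It assumes the full dataset record documented below, loaded into a record dict; the helper name is hypothetical.

def assert_decision_invariants(record: dict) -> None:
    """Sanity-check the documented invariants on a comparison record (illustrative helper)."""
    comp = record["comparison"]
    blocking = [w for w in comp["warnings"] if w["severity"] == "blocking"]

    if record["decisionPosture"] == "switch_now":
        assert record["decisionReadiness"] == "actionable"
        assert not blocking
        assert comp["confidenceBreakdown"]["fairnessChecksPassed"] is True
        assert any(m == "decisive" for m in comp["materiality"].values())
        assert comp["confidence"] >= 0.7
        assert comp["decisionStability"]["flipRisk"] != "high"
        assert record["runsPerActor"] >= 2

    if record["verdictCode"] == "NO_CALL":
        assert record["decisionPosture"] == "no_call"
        assert record["decisionReadiness"] == "insufficient-data"
        assert comp["winner"] == "no_call"

    # Fairness failures, blocking warnings, and single-run smoke tests all forbid "actionable".
    if blocking or not comp["confidenceBreakdown"]["fairnessChecksPassed"] or record["runsPerActor"] == 1:
        assert record["decisionReadiness"] != "actionable"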

Input → Output

Input:

  • Two Apify actors (actorA, actorB)
  • One shared testInput JSON
  • mode (1–10 runs) or explicit runs count
  • Optional decisionProfile — balanced / speed_first / cost_first / output_first / reliability_first

Output:

  • decisionPosture — switch_now / canary_recommended / monitor_only / no_call (the one field your automation should read)
  • verdictHuman — one-sentence recommendation, paste-ready
  • confidence + breakdown (reliability × score separation × variance × sample adequacy)
  • decisionStability — how fragile the winner is across pairwise matchups
  • warnings[] — blocking vs advisory, every code documented
  • sinceLastComparableRun — delta vs last scheduled run of the same pair (opt-in)
  • Full per-run stats, sample records, and Store popularity context

Simple example

You have two scrapers pointed at the same site:

  • Actor A — slower but cheaper
  • Actor B — faster but costs more

You run A/B Tester with mode: "decision" (5 runs each). It produces:

"Switch production to Actor B. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."

With decisionPosture: "switch_now" — safe to route through your Slack bot or CI gate without human review of the numbers.

Decision contract

These are the promises this actor makes. Every one is enforced in the output contract.

  • Compares exactly two actors only. No portfolios, no tournaments, no store-wide scans.
  • Same input and same runtime settings on both sides. Same testInput, same timeout, same memory. Reported in comparisonContext.fairnessChecks.
  • Parallel launch. Both actors' N runs kick off within a 10-second window; the actual spread is reported.
  • If fairness fails, actionable is forbidden. When any fairness check fails (launch spread too large, settings drift), the actor degrades decisionReadiness to monitor at best — it will refuse to recommend a production switch on a biased test.
  • Observed cost only. We report usageTotalUsd for the runs we orchestrated. Nothing about your account spend.
  • Store popularity is informational. Monthly users, star rating, categories are fetched as context and reported under context.storeSignals — they do not influence the winner score under any profile.
  • Abstention is a first-class outcome. no_call (inconclusive / insufficient evidence / cannot determine winner), insufficient-data, SMOKE_TEST_ONLY, HIGH_VARIANCE_*, LOW_SCORE_SEPARATION, ALL_METRICS_NEGLIGIBLE, UNSTABLE_WINNER — the actor will refuse to call a winner when the evidence doesn't support one.
  • Any blocking warning also forbids actionable. Warnings are tiered blocking vs advisory. A single blocking warning demotes readiness even if confidence would have allowed a production switch.
  • One-shot comparator. Not a long-term baseline monitor. Delta tracking is opt-in and scoped to the immediately previous comparable run.

Example — a production decision in one run

A scraping team has two academic-paper scrapers wired up: crossref-paper-search and europe-pmc-search. Both accept {query: "..."}. They run a decision mode test (5 runs each, balanced profile):

headline: "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each"
decisionPosture: "switch_now"
decisionReadiness: "actionable"
verdictCode: "ACTOR_B_WIN"
verdictHuman: "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence)."
confidence: 0.82 (high)
decisionStability: winnerConsistency 0.96, flipRisk "low"
blockingWarnings: []

decisionPosture: "switch_now" means every invariant held — fairness passed, no blocking warnings, at least one decisive metric, high confidence, pairwise-stable winner. The Slack notifier reads SUMMARY.decisionPosture and fires a "Ready to switch" alert. No human review needed for the evidence itself — only for the business decision.

Decision flow

N runs per side, launched in parallel
        ↓
Aggregate medians / p90 / stddev
        ↓
Fairness checks pass? ──── NO ──→ decisionReadiness = monitor (at best)
        ↓ YES
Score gap ≥ 15%? ────────── NO ──→ no_call
        ↓ YES
Any metric ≥ material? ──── NO ──→ no_call (ALL_METRICS_NEGLIGIBLE)
        ↓ YES
Pairwise winner stable? ─── NO ──→ demote strong → moderate → weak
        ↓ YES
Any blocking warning? ───── YES ─→ decisionReadiness = monitor (at best)
        ↓ NO
Confidence ≥ 0.7 + decisive materiality? ── YES → strong / actionable
                                            NO ─→ moderate or weak / monitor
        ↓
decisionPosture: switch_now | canary_recommended | monitor_only | no_call

When to trust the verdict

Downstream automation should filter on one of two fields: decisionPosture (action-ready) or decisionReadiness (readiness-ready). Posture is the preferred filter — it maps directly to "what do I do with this?".

| Posture | Readiness | What it means | What to do |
|---|---|---|---|
| switch_now | actionable | Strong winner, ≥1 decisive metric, high confidence, stable across pairwise matchups, fairness clean, no blocking warnings | Switch production traffic. Safe to act on in CI gates and Zapier flows. |
| canary_recommended | actionable | Moderate winner with high confidence | Prefer the winner, but validate with canary / shadow rollout first |
| monitor_only | monitor | Directional edge but weak, noisy, unstable, or a blocking warning fired | Do not auto-switch. Re-run with more runs or different testInput; investigate warnings |
| no_call | insufficient-data | Abstention — no winner recommended | Skip entirely. This is a valid, honest outcome. |

Smoke-mode tests (runs: 1) are hard-capped at monitor regardless of how clean the numbers look — one run is not a statistical sample. Fairness failures and blocking warnings are also hard caps.

When NOT to trust the verdict

Every warning carries a severity: blocking or advisory. Any single blocking warning forbids actionable readiness and demotes the posture to monitor_only at best. Read comparison.warnings[] before acting.

Blocking warnings (forbid actionable)

| Code | Meaning |
|---|---|
| BOTH_FAILED | Both actors failed every run. Test is invalid — check testInput compatibility and token permissions. |
| SMOKE_TEST_ONLY | runs: 1. Smoke mode is always capped at monitor. |
| LOW_SCORE_SEPARATION | Score gap <15% — actor abstained to no_call. |
| ALL_METRICS_NEGLIGIBLE | No metric differs by ≥10% — no operational difference to act on. |
| RESULT_SHAPE_DIVERGENCE | Field overlap <20%. The two actors may be solving different problems. Inspect sampleRecord manually. |
| NO_DATA_EXTRACTED | Both actors ran but returned no extractable fields. Your testInput likely doesn't match either schema. |
| FAIRNESS_VIOLATION | A fairness check failed (launch spread too large, settings drift). Test is biased. |
| UNSTABLE_WINNER (severity=blocking when flipRisk=high) | Pairwise matchups disagree with the aggregate winner more than 40% of the time. |
| IDENTICAL_ACTORS | actorA and actorB normalize to the same actor id. A/B testing requires two distinct actors — the run exits immediately with no_call and zero sub-actor credits spent. |

Advisory warnings (flag noise but don't block)

| Code | Meaning |
|---|---|
| ONE_SIDE_FAILED | One actor succeeded zero times. Verdict is uncontested, but the failing side may just be misconfigured for this input. |
| HIGH_VARIANCE_A / HIGH_VARIANCE_B | Duration CV >50%. Increase runs or accept the noise floor. |
| ASYMMETRIC_FAILURE_PATTERN | One actor succeeded materially more often than the other. Test environment may be biased (token scope, rate limits, network). |
| COST_PER_RESULT_UNSTABLE | Cost CV >50%. Don't act on a cost edge alone. |
| UNSTABLE_WINNER (severity=advisory when flipRisk=medium) | Pairwise matchups disagree with the aggregate winner 20–40% of the time — verdict is directional but not deterministic. |
| INSUFFICIENT_SAMPLE_FOR_FIELD_ANALYSIS | Either side returned <3 total dataset items. Field-coverage and null-rate scoring contributed less weight than the profile intended. |

Confidence components — what "good" looks like

comparison.confidenceBreakdown is a diagnostic panel. Each component is 0–1. The final comparison.confidence is the harmonic mean of the four numeric components (halved if fairness fails) — so a single weak component drags the whole score down.

| Component | Good (≥) | Risky (<) | Meaning |
|---|---|---|---|
| successReliability | 0.9 | 0.7 | Fraction of runs that succeeded. Below 0.7 means too many runs are failing to trust the aggregate. |
| scoreSeparation | 0.3 | 0.15 | Score gap as a fraction of total score. Below 0.15 triggers abstention. |
| variancePenalty | 0.8 | 0.5 | Healthiness of variance (1 - avgCV). Below 0.5 means the runs were too noisy to trust. |
| sampleAdequacy | 0.5 | 0.3 | Linear ramp on run count — 1 run = 0.1, 3 = 0.3, 5 = 0.5, 10 = 1.0. |
| fairnessChecksPassed | true | false | Hard gate. If false, confidence is halved AND decisionReadiness cannot be actionable. |
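
To see why a single weak component drags the blend down, here is a minimal sketch of the documented formula (harmonic mean of the four numeric components, halved when fairness fails). It illustrates the behaviour, not the actor's exact implementation.

from statistics import harmonic_mean

def blended_confidence(breakdown: dict) -> float:
    """Illustrative only: harmonic mean of the four numeric components, halved if fairness failed."""
    components = [
        breakdown["successReliability"],
        breakdown["scoreSeparation"],
        breakdown["variancePenalty"],
        breakdown["sampleAdequacy"],
    ]
    # Harmonic mean collapses toward zero if any component is near zero.
    confidence = harmonic_mean(components) if min(components) > 0 else 0.0
    if not breakdown["fairnessChecksPassed"]:
        confidence *= 0.5
    return round(confidence, 2)

# A single weak signal drags the blend down even when the others are strong:
print(blended_confidence({"successReliability": 1.0, "scoreSeparation": 0.05,
                          "variancePenalty": 0.9, "sampleAdequacy": 0.5,
                          "fairnessChecksPassed": True}))   # ≈ 0.17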

Decision stability

comparison.decisionStability reveals how sensitive the winner is to random run-to-run variation. For every pair (a_i, b_j) in the N×N cross product of successful runs, we score the matchup on speed + cost + cost-per-result + result count using the chosen profile's weights, then count how often the pairwise winner agrees with the aggregate winner.

| Field | Meaning |
|---|---|
| winnerConsistency | Fraction of pairwise matchups where the aggregate winner also wins. 1.0 = deterministic, 0.5 = coin flip. |
| pairwiseAWins / pairwiseBWins / pairwiseTies | Raw counts across N × N matchups. |
| flipRisk | low (consistency ≥0.8) / medium (≥0.6) / high (<0.6). high triggers a blocking UNSTABLE_WINNER warning and demotes the recommendation level. |

Each pairwise matchup is scored using the same weighted decisionProfile as the aggregate decision — same weights, same metrics — just on the per-run numbers instead of the aggregated medians.

If flipRisk: high fires on your result, the "winner" is essentially noise. Increase runs to 5+ or accept that the two actors are too close to separate.
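
A minimal sketch of the pairwise-consistency idea described above; it collapses each run to one pre-computed score, whereas the real actor scores every matchup on the full profile weights.

from itertools import product

def winner_consistency(a_scores: list[float], b_scores: list[float], aggregate_winner: str) -> dict:
    """Cross every successful run of A against every run of B and count matchup winners."""
    a_wins = b_wins = ties = 0
    for a, b in product(a_scores, b_scores):  # N × N matchups
        if a > b:
            a_wins += 1
        elif b > a:
            b_wins += 1
        else:
            ties += 1
    total = a_wins + b_wins + ties
    agree = a_wins if aggregate_winner == "actorA" else b_wins
    consistency = agree / total if total else 0.0
    flip_risk = "low" if consistency >= 0.8 else "medium" if consistency >= 0.6 else "high"
    return {"winnerConsistency": round(consistency, 2), "pairwiseAWins": a_wins,
            "pairwiseBWins": b_wins, "pairwiseTies": ties, "flipRisk": flip_risk}

# Example: B wins most matchups, so an aggregate "actorB" verdict would be stable.
print(winner_consistency([0.40, 0.55, 0.48, 0.52, 0.45], [0.70, 0.68, 0.42, 0.75, 0.71], "actorB"))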

Use case — pair-wise regression detection

Set compareToLastComparableRun: true and schedule the same A/B test on a cron. Every run, the actor looks up the previous snapshot for the same (actorA, actorB, testInput, mode, profile) tuple, and reports:

  • winnerChanged: boolean — did the verdict flip since last week?
  • confidenceChangedBy: number — did the certainty drop?
  • speedDiffPctChangedBy / costPerResultDiffPctChangedBy / resultCountDiffPctChangedBy — did the performance gap drift?

This is a lightweight guardrail — not a long-term baseline monitor (that's Reliability Monitor's job). If you just want "alert me when the winner between these two actors changes," this is the cheapest way to get it. First run for a pair returns {found: false} — not a failure.
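
A minimal consumer sketch for such a scheduled run, reading the root-level delta fields documented here. The alert callable and the 0.2 confidence-drop threshold are illustrative, not part of the contract.

def check_weekly_ab(record: dict, alert) -> None:
    """Alert when the scheduled A/B pair flips winner or loses confidence (sketch)."""
    delta = record.get("sinceLastComparableRun", {})
    if not delta.get("found"):
        return  # first run for this pair+input+mode+profile: nothing to compare yet
    if delta.get("winnerChanged"):
        alert(f"A/B winner flipped: {record['headline']}")
    elif delta.get("confidenceChangedBy", 0) < -0.2:  # threshold is an arbitrary example
        alert(f"A/B confidence dropped by {abs(delta['confidenceChangedBy']):.2f}: {record['headline']}")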

Store UI walkthrough

  1. Go to Actor A/B Tester on the Apify Store.
  2. Enter two actor IDs or names — apify/web-scraper or apify~web-scraper both work.
  3. Paste a testInput JSON both actors will accept.
  4. Pick a mode: smoke (1 run, compatibility check), standard (3 runs, routine), decision (5 runs, production switching), or high_stakes (10 runs, needs to survive scrutiny).
  5. Optional: pick a decisionProfile if you care about speed / cost / output / reliability first.
  6. Click Start. Read headline + verdictHuman for the one-line answer. Read comparison.warnings[] before acting.

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| actorA | string | Yes | apify/web-scraper | Actor ID or name for the first side |
| actorB | string | Yes | apify/cheerio-scraper | Actor ID or name for the second side |
| testInput | object | Yes | {startUrls:[{url:"https://example.com"}]} | Passed identically to both actors |
| mode | enum | No | standard | smoke (1 run, capped at monitor) / standard (3) / decision (5) / high_stakes (10) |
| decisionProfile | enum | No | balanced | balanced / speed_first / cost_first / output_first / reliability_first |
| runs | integer | No | (uses mode) | Override the mode's run count. If set, wins over mode. Range 1–10. |
| includeStoreContext | boolean | No | true | Fetch each actor's Store popularity stats (informational only) |
| compareToLastComparableRun | boolean | No | false | Look up the last run for the same pair+input+mode+profile and report delta |
| timeout | integer | No | 300 | Max seconds per run (same for both sides) |
| memory | integer | No | 512 | Memory MB per run (same for both sides) |
| apiToken | string | No | env APIFY_TOKEN | Leave blank on your own account — falls back to built-in token |

Output contract

{
  "recordType": "comparison",
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "decisionPosture": "switch_now",
  "decisionReadiness": "actionable",
  "verdictCode": "ACTOR_B_WIN",
  "actorA": { "...": "per-actor run stats + aggregates" },
  "actorB": { "...": "per-actor run stats + aggregates" },
  "comparison": {
    "winner": "actorB",
    "verdictCode": "ACTOR_B_WIN",
    "verdictMode": "clear-win",
    "verdictHuman": "Switch production to ryanclinton/europe-pmc-search. Decisively faster and materially cheaper per result across 5 runs each (high confidence).",
    "decisionPosture": "switch_now",
    "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE", "STABLE_WINNER"],
    "recommendationLevel": "strong",
    "decisionReadiness": "actionable",
    "confidence": 0.82,
    "confidenceLevel": "high",
    "confidenceBreakdown": {
      "successReliability": 1.0,
      "scoreSeparation": 0.65,
      "variancePenalty": 0.92,
      "sampleAdequacy": 0.5,
      "fairnessChecksPassed": true
    },
    "materiality": {
      "speed": "decisive",
      "cost": "strong",
      "costPerResult": "strong",
      "resultCount": "negligible",
      "fieldCoverage": "material"
    },
    "decisionStability": {
      "winnerConsistency": 0.96,
      "pairwiseAWins": 1,
      "pairwiseBWins": 24,
      "pairwiseTies": 0,
      "pairwiseTotal": 25,
      "flipRisk": "low"
    },
    "reasons": [
      { "metric": "Speed (median)", "winner": "B", "diffPct": 48, "detail": "B median: 4.8s (p90 5.1s), A median: 9.2s (p90 9.8s)", "materiality": "decisive" },
      { "metric": "Cost per result", "winner": "B", "diffPct": 46, "detail": "B: $0.00000498/result, A: $0.00000918/result", "materiality": "strong" }
    ],
    "warnings": [],
    "sharedFields": ["doi", "title"],
    "uniqueToA": ["abstract", "bibtex", "citationCount"],
    "uniqueToB": ["pmcid", "pmid", "meshTerms"],
    "speedDiffPct": 92,
    "costDiffPct": 84,
    "costPerResultDiffPct": 84,
    "resultCountDiffPct": 0
  },
  "comparisonContext": {
    "inputHash": "sha256:3a7b...",
    "normalizedActorA": "ryanclinton~crossref-paper-search",
    "normalizedActorB": "ryanclinton~europe-pmc-search",
    "runsRequested": 5,
    "mode": "decision",
    "decisionProfile": "balanced",
    "timeoutSec": 300,
    "memoryMb": 512,
    "testStartedAt": "2026-04-22T01:55:00.000Z",
    "fairnessChecks": {
      "sameInput": true,
      "sameMemory": true,
      "sameTimeout": true,
      "parallelLaunch": true,
      "childRunStartSpreadSec": 1.2
    },
    "usedStoreSignalsInWinnerSelection": false,
    "comparisonKey": "ab-last-3a7b..."
  },
  "context": {
    "storeSignals": {
      "actorA": { "stats": { "totalUsers": 145 }, "stars": 4.8, "categories": ["DEVELOPER_TOOLS"] },
      "actorB": { "stats": { "totalUsers": 92 }, "stars": 4.6, "categories": ["DEVELOPER_TOOLS"] },
      "usedInWinnerSelection": false,
      "note": "Store popularity is informational context only. It does not influence the winner score under any profile."
    }
  },
  "sinceLastComparableRun": { "found": false },
  "runsPerActor": 5,
  "testedAt": "2026-04-22T01:55:00.000Z"
}

Top-level fields

| Field | Type | Description |
|---|---|---|
| recordType | string | "comparison" on success, "error" on failure |
| headline | string | One-line summary, paste-ready |
| decisionPosture | enum | switch_now / canary_recommended / monitor_only / no_call — the canonical automation filter, duplicated from comparison.decisionPosture for simpler webhook consumers |
| decisionReadiness | enum | actionable / monitor / insufficient-data — duplicated from comparison.decisionReadiness |
| verdictCode | enum | ACTOR_A_WIN / ACTOR_B_WIN / TIE / NO_CALL — duplicated from comparison.verdictCode |
| runsPerActor | number | Runs executed per actor |
| testedAt | string | ISO 8601 timestamp |
| sinceLastComparableRun | object | Delta vs last comparable run (only populated if compareToLastComparableRun: true) |

actorA / actorB

| Field | Type | Description |
|---|---|---|
| name | string | Actor ID / name as provided |
| runs | array | Per-run stats: {status, results, duration, cost, error?} |
| successfulRuns / failedRuns | number | Counts |
| durationStats, costStats, resultCountStats | object | {mean, median, p90, stddev, min, max} |
| costPerResult | number or null | costStats.mean / resultCountStats.mean — the efficiency metric |
| fields | array | Unique field names across all successful runs |
| fieldNullRates | array | Per-field null rate, sorted by highest null % |
| sampleRecord | object or null | First record from the first successful run |

comparison (the decision layer)

| Field | Type | Description |
|---|---|---|
| winner | enum | actorA / actorB / tie / no_call |
| verdictCode | enum | ACTOR_A_WIN / ACTOR_B_WIN / TIE / NO_CALL — stable machine-readable code |
| verdictMode | enum | clear-win / edge / tie / abstain — verdict shape |
| verdictHuman | string | One-line recommendation sentence — wording aligned with decisionPosture |
| decisionPosture | enum | switch_now / canary_recommended / monitor_only / no_call — the one field downstream automation should act on |
| decisionReasonCodes | array | Stable codes: SPEED_EDGE, CPR_EDGE, LOW_VARIANCE, HIGH_CONFIDENCE, STABLE_WINNER, UNSTABLE_WINNER, MONITOR_ROLLOUT_SUGGESTED, INSUFFICIENT_DATA |
| recommendationLevel | enum | strong / moderate / weak / tie / no_call |
| decisionReadiness | enum | actionable / monitor / insufficient-data |
| confidence | number | 0–1, harmonic mean of reliability × separation × variance × sample adequacy, halved if fairness fails |
| confidenceLevel | enum | high (≥0.8) / medium (≥0.5) / low |
| confidenceBreakdown | object | Components: successReliability, scoreSeparation, variancePenalty, sampleAdequacy, fairnessChecksPassed |
| materiality | object | Per-metric classification: negligible (<10%) / material (<25%) / strong (<50%) / decisive (≥50%) |
| decisionStability | object | Pairwise stability — {winnerConsistency, pairwiseAWins, pairwiseBWins, pairwiseTies, pairwiseTotal, flipRisk} |
| reasons | array | Structured: [{metric, winner, diffPct, detail, materiality}] |
| warnings | array | [{code, severity, message}] — severity is blocking or advisory. Any blocking warning forbids actionable readiness |
| sharedFields / uniqueToA / uniqueToB | array | Output schema overlap |
| speedDiffPct, costDiffPct, costPerResultDiffPct, resultCountDiffPct | number | Percentage diffs A vs B (medians) |

comparisonContext (fairness provenance)

| Field | Description |
|---|---|
| inputHash | SHA-256 of the (stable-serialized) testInput — proves both sides ran on identical input |
| normalizedActorA, normalizedActorB | username~name canonical form |
| runsRequested, mode, decisionProfile, timeoutSec, memoryMb | Test conditions |
| testStartedAt | Start timestamp |
| fairnessChecks | {sameInput, sameMemory, sameTimeout, parallelLaunch, childRunStartSpreadSec} |
| usedStoreSignalsInWinnerSelection | Always false — quarantine of popularity context |
| comparisonKey | Stable KV key for delta lookups |

context.storeSignals

Informational context for buyer reviewers — monthly users, star rating, categories. Never contributes to the winner score. Set includeStoreContext: false to skip the two extra API calls.

How it works — fairness setup

Both actors receive the exact same testInput (hashed to inputHash so the test is auditable after the fact), the same timeout, and the same memory. Both sets of N runs are launched in parallel. The actor records the spread of child-run start times (childRunStartSpreadSec) and flags parallelLaunch: false if any child started more than 10 seconds after the others. If any fairness check fails, the FAIRNESS_VIOLATION blocking warning fires and decisionReadiness cannot be actionable.
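
For auditing, the hash can be reproduced from your own input along these lines. Note that sorted keys and compact separators are an assumption; the contract only says the testInput is stable-serialized.

import hashlib
import json

def input_hash(test_input: dict) -> str:
    """SHA-256 over a stable JSON serialization of testInput (serialization details assumed)."""
    canonical = json.dumps(test_input, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(input_hash({"startUrls": [{"url": "https://example.com"}]}))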

How it works — run orchestration

Each child run is started via POST /v2/acts/{id}/runs?waitForFinish={timeout}. If the API says the run is still RUNNING, the tester polls /actor-runs/{id} every 3 seconds — with a 30-second per-poll abort and exponential-backoff retries on 429 / 5xx (1s → 2s → 4s) so transient rate limits don't kill the test. Dataset items are fetched with limit=1000. Sub-actor credits bill against your account.
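
The retry pattern described above looks roughly like this sketch. It is illustrative only, not the actor's actual code; it uses the public Apify run endpoints named in this section.

import time
import requests

API = "https://api.apify.com/v2"

def get_with_retries(url: str, token: str) -> dict:
    """GET with a 30-second per-request timeout and 1s → 2s → 4s backoff on 429/5xx."""
    for delay in (1, 2, 4):
        resp = requests.get(url, params={"token": token}, timeout=30)
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(delay)  # transient rate limit or server error: back off and retry
            continue
        resp.raise_for_status()
        return resp.json()["data"]
    raise RuntimeError(f"gave up after retries: {url}")

def wait_for_run(run_id: str, token: str) -> dict:
    """Poll /actor-runs/{id} every 3 seconds until the run leaves READY/RUNNING."""
    while True:
        run = get_with_retries(f"{API}/actor-runs/{run_id}", token)
        if run["status"] not in ("READY", "RUNNING"):
            return run
        time.sleep(3)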

How it works — aggregation

Duration and cost stats are computed over successful runs only — one failed run doesn't poison the median. Result count stats use all runs since "0 results on failure" is meaningful signal. For each metric we report {mean, median, p90, stddev, min, max}. Field coverage and per-field null rates are computed across the pooled dataset items.
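
A compact sketch of that aggregation policy, using hypothetical per-run dicts; the p90 here is a nearest-rank approximation.

import statistics

def stats(values: list[float]) -> dict:
    """{mean, median, p90, stddev, min, max} for one metric."""
    ordered = sorted(values)
    p90_index = max(0, round(0.9 * (len(ordered) - 1)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[p90_index],
        "stddev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min": ordered[0],
        "max": ordered[-1],
    }

runs = [
    {"status": "SUCCEEDED", "duration": 4.8, "cost": 0.00025, "results": 51},
    {"status": "SUCCEEDED", "duration": 5.1, "cost": 0.00026, "results": 50},
    {"status": "FAILED",    "duration": 1.2, "cost": 0.00004, "results": 0},
]
ok = [r for r in runs if r["status"] == "SUCCEEDED"]
duration_stats = stats([r["duration"] for r in ok])        # successful runs only
result_count_stats = stats([r["results"] for r in runs])   # all runs: 0 results on failure is signal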

How it works — decision logic

A weighted score is accumulated across seven metrics based on the selected decisionProfile:

| Profile | Success | Count | Speed | Cost | $/Result | Fields | Null |
|---|---|---|---|---|---|---|---|
| balanced | 3 | 2 | 1 | 1 | 2 | 1 | 1 |
| speed_first | 3 | 1 | 3 | 1 | 2 | 1 | 1 |
| cost_first | 3 | 1 | 1 | 2 | 3 | 1 | 1 |
| output_first | 3 | 3 | 1 | 1 | 1 | 2 | 2 |
| reliability_first | 5 | 2 | 1 | 1 | 1 | 1 | 1 |

The score gap (|aScore - bScore| / total) is the primary input to confidenceBreakdown.scoreSeparation. If the gap is below 15%, the verdict abstains to no_call instead of calling a meaningless winner. A winner is strong only if the gap is ≥35% AND confidence is ≥0.7 AND at least one metric is decisive AND pairwise flipRisk is not high.
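
A simplified sketch of the weighted-score-and-abstain step, using the balanced weights from the table above. Metric winners are taken as given here, whereas the real scoring also normalizes per-metric differences and handles ties.

BALANCED = {"success": 3, "count": 2, "speed": 1, "cost": 1, "costPerResult": 2, "fields": 1, "null": 1}

def score_gap(metric_winners: dict, weights: dict = BALANCED) -> dict:
    """Accumulate weighted points per side and report the relative gap (sketch)."""
    a = sum(w for m, w in weights.items() if metric_winners.get(m) == "A")
    b = sum(w for m, w in weights.items() if metric_winners.get(m) == "B")
    total = sum(weights.values())
    gap = abs(a - b) / total
    if gap < 0.15:
        verdict = "no_call"  # too close to call a meaningful winner
    else:
        verdict = "A" if a > b else "B"
    return {"aScore": a, "bScore": b, "gap": round(gap, 2), "verdict": verdict}

# B wins speed, cost, and cost-per-result; A wins field coverage; the rest tie.
print(score_gap({"speed": "B", "cost": "B", "costPerResult": "B", "fields": "A"}))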

How it works — output delivery

  • Full comparison record → Apify dataset — use this when humans need diagnostics (per-run stats, confidence breakdown, materiality tiers, pairwise stability, sample records).
  • Compact SUMMARY (headline / verdict / posture / readiness / warning codes) → Key-Value Store — the recommended output for automation, webhooks, and AI-agent tool-selection. Machine-readable, <1 KB, structured JSON decision output.
  • Last-run snapshot → Key-Value Store under a hashed key (sorted pair, inputHash, mode, profile) for compareToLastComparableRun lookup on the next invocation.

How much does it cost?

Pay-Per-Event pricing at $0.15 per A/B test. Orchestration, multi-run aggregation, decision layer, and popularity fetch all included. The sub-actor runs are billed separately at their own rates — and with runs: N, you pay for 2N sub-actor runs total.

| Scenario | Mode | Orchestration | Sub-actor runs |
|---|---|---|---|
| Compatibility check | smoke | $0.15 | 2× actor rate |
| Routine comparison | standard | $0.15 | 6× actor rate |
| Production decision | decision | $0.15 | 10× actor rate |
| High-stakes evaluation | high_stakes | $0.15 | 20× actor rate |
| Weekly scheduled (4/month, standard) | standard | $0.60 | 24× actor rate |

The Apify Free plan covers ~30 A/B tests/month (orchestration only).

FAQ for skeptics

What if the two actors interpret the same input differently? Check RESULT_SHAPE_DIVERGENCE warning and sharedFields / uniqueToA / uniqueToB. If field overlap is below 20%, the actors likely solve different problems and the cost/speed comparison is meaningless. Inspect sampleRecord from each side before trusting the verdict.

How do I know if the winner is real and not just luck? Check comparison.decisionStability.flipRisk and comparison.confidenceBreakdown.variancePenalty. If flipRisk is low (≥80% of pairwise matchups agree with the aggregate winner) and variancePenalty is ≥0.8 (runs are low-noise), the winner is real. If flipRisk is high or variancePenalty is <0.5, the "winner" is likely noise — increase runs to 5+ and re-run. For automation you don't have to check manually: when stability is poor the actor auto-demotes the recommendation level, fires an UNSTABLE_WINNER warning, and refuses actionable readiness.

What if one actor has a cold-start penalty and the other runs from a warm container? Bump runs to 5+ so the cold-start disappears into the aggregate. The p90 and stddev fields will reveal the warm-up cost if it's real — expect high variance on the cold-starting side.

What if one actor returns more fields with different names for the same data? uniqueToA / uniqueToB surfaces this. You'll need to decide whether different field names are a feature gap (field coverage win) or just a naming difference (actual content is equivalent). The tester can't resolve that for you — it's a semantic call.

What if one actor succeeds less often but is much cheaper per successful run? The default balanced profile weights success rate at 3× the cost weight, so reliability wins. Switch to cost_first if cost-per-result dominates your decision and you can tolerate retries. The verdict is auditable: decisionProfile is in comparisonContext.

Can popularity ever outweigh runtime evidence? No. usedStoreSignalsInWinnerSelection: false is a hard constant. Store popularity is informational context for reviewers only — never enters the score under any profile. Set includeStoreContext: false to skip fetching it entirely.

When should I ignore the winner?

  • Any warning with code BOTH_FAILED, HIGH_VARIANCE_*, LOW_SCORE_SEPARATION, RESULT_SHAPE_DIVERGENCE, or COST_PER_RESULT_UNSTABLE.
  • verdictCode: NO_CALL.
  • decisionReadiness: insufficient-data.
  • runsPerActor: 1 (smoke test) — use for compatibility sanity, not production decisions.
  • confidenceBreakdown.fairnessChecksPassed: false.

Can a 1-run smoke test ever be action-worthy? No. Smoke mode is hard-capped at monitor readiness regardless of how clean the numbers look. One run is not a sample.

How many runs should I use?

  • smoke (1) — "does my testInput even work on both actors?"
  • standard (3) — routine comparison, enough to spot real differences.
  • decision (5) — production switching, variance gets averaged out.
  • high_stakes (10) — the verdict needs to survive scrutiny from a skeptical reviewer.

Does the $0.15 fee include the sub-actor run costs? No. The $0.15 covers orchestration + decision layer only. runs: N means 2N sub-actor runs, each billed at that sub-actor's rate. Budget accordingly.

Anti-pattern — don't do this

Do NOT use this actor to compare actors with different input shapes. Example:

  • Actor A expects {startUrls: [...]}
  • Actor B expects {query: "..."}

Passing one shared testInput means one side runs with garbage input. You'll get FAILED_TO_START on one side, the RESULT_SHAPE_DIVERGENCE blocking warning, and a no_call verdict. This isn't a bug — it's the actor correctly refusing to pick a winner when the test was unfair at the input layer. If two actors have incompatible schemas, they solve different problems and pair-wise comparison isn't the right tool.

How does compareToLastComparableRun work? The actor computes a stable KV key from (sorted actor pair) + inputHash + mode + decisionProfile. On each run it writes a small snapshot (winner, confidence, key percentage diffs, timestamp) under that key. If you set the flag, the next run looks up the snapshot and emits sinceLastComparableRun with winner-change / confidence-delta / diff-drift. First run for a pair just returns {found: false} — not an error.

Why does confidence use the harmonic mean? Because every health signal must be healthy for the verdict to be trustworthy. Arithmetic mean would let one strong signal (e.g. 100% success rate) mask a weak one (e.g. 5% score separation). Harmonic mean collapses to ~0 if any component is near zero. Same reason F1 score uses harmonic mean of precision and recall.

Is it legal to compare actors from other developers? Yes. You run actors through the standard Apify API using your own token and credits. No different from running any public actor on the Store.

Automation contract

Three integration paths, chosen by your consumer's shape:

| Consumer | Read from | Why |
|---|---|---|
| Webhook / Zapier / Slack / CI gate | Root decisionPosture on the dataset record | One field, stable enum, routes directly to action — switch_now / canary_recommended / monitor_only / no_call. No need to walk into comparison.*. |
| Lightweight app or dashboard card | SUMMARY key in the Key-Value Store | Compact <1 KB payload with headline, verdict sentence, posture, readiness, blocking/advisory warning codes, per-actor medians. Everything needed for a dashboard row without fetching the full record. |
| Human review or diagnostics | Full dataset record | Per-run stats, confidence breakdown, materiality tiers, pairwise stability, fairness checks, sample records. Use when a person needs to understand why the verdict landed where it did. |

Rule of thumb: automation reads root fields or SUMMARY. Humans read the full dataset record. Never parse verdictHuman — it's for display, not routing.

Programmatic access

Python

from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
run = client.actor("ryanclinton/actor-ab-tester").call(
    run_input={
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision",
        "decisionProfile": "speed_first",
    }
)
result = next(client.dataset(run["defaultDatasetId"]).iterate_items())
posture = result["comparison"]["decisionPosture"]

# Route by decisionPosture — the canonical action filter
if posture == "switch_now":
    winner = result[result["comparison"]["winner"]]
    print(f"→ SWITCH production to {winner['name']}")
elif posture == "canary_recommended":
    winner = result[result["comparison"]["winner"]]
    print(f"→ CANARY {winner['name']} before full rollout")
elif posture == "monitor_only":
    print("→ MONITOR — directional edge, do not auto-switch")
else:  # no_call
    print("→ NO CALL — insufficient evidence")

print(result["comparison"]["verdictHuman"])
for w in result["comparison"]["warnings"]:
    print(f" [{w['severity'].upper()}] [{w['code']}] {w['message']}")

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx" });
const run = await client.actor("ryanclinton/actor-ab-tester").call({
  actorA: "apify/web-scraper",
  actorB: "apify/cheerio-scraper",
  testInput: { startUrls: [{ url: "https://example.com" }] },
  mode: "decision",
  decisionProfile: "balanced",
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const result = items[0];
const posture = result.comparison.decisionPosture;

// Route by decisionPosture — the canonical action filter
switch (posture) {
  case "switch_now":
    console.log(`→ SWITCH production to ${result[result.comparison.winner].name}`);
    break;
  case "canary_recommended":
    console.log(`→ CANARY ${result[result.comparison.winner].name} before full rollout`);
    break;
  case "monitor_only":
    console.log("→ MONITOR — directional edge, do not auto-switch");
    break;
  case "no_call":
    console.log("→ NO CALL — insufficient evidence");
}
console.log(result.comparison.verdictHuman);
result.comparison.warnings.forEach((w) =>
  console.log(` [${w.severity.toUpperCase()}] [${w.code}] ${w.message}`),
);

Webhook / automation payload — the one thing to integrate

If you only integrate one output, use the SUMMARY KV payload. This is the recommended output for automation, webhooks, and AI agents. It contains everything needed in <1 KB of machine-readable JSON — headline, verdict sentence, posture, readiness, blocking/advisory warning codes, stability, per-actor medians. Stable keys, documented enums, no prose parsing required.

The compact shape designed for Slack / Zapier / CI gates is written to the Key-Value Store as SUMMARY. Read it with:

curl "https://api.apify.com/v2/key-value-stores/$KV_STORE_ID/records/SUMMARY?token=YOUR_API_TOKEN"

Returns:

{
  "headline": "Winner: ryanclinton/europe-pmc-search (vs ryanclinton/crossref-paper-search) over 5 runs each",
  "verdictHuman": "Use ryanclinton/europe-pmc-search — decisively faster, materially cheaper per result across 5 runs each (high confidence).",
  "verdictCode": "ACTOR_B_WIN",
  "recommendationLevel": "strong",
  "confidenceLevel": "high",
  "decisionReadiness": "actionable",
  "decisionReasonCodes": ["SPEED_EDGE", "CPR_EDGE", "LOW_VARIANCE", "HIGH_CONFIDENCE"],
  "warningCodes": [],
  "actorA": { "name": "...", "successfulRuns": 5, "medianDurationS": 9.2, "medianCostUsd": 0.00044, "costPerResult": 0.00000918 },
  "actorB": { "name": "...", "successfulRuns": 5, "medianDurationS": 4.8, "medianCostUsd": 0.00025, "costPerResult": 0.00000498 },
  "runsPerActor": 5,
  "mode": "decision",
  "decisionProfile": "balanced",
  "sinceLastComparableRun": { "found": false },
  "testedAt": "2026-04-22T01:55:00.000Z"
}
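
If you prefer the API client to curl, here is a minimal gate sketch that reads the same record, assuming SUMMARY is stored in the run's default key-value store as described above.

from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
run = client.actor("ryanclinton/actor-ab-tester").call(run_input={
    "actorA": "apify/web-scraper",
    "actorB": "apify/cheerio-scraper",
    "testInput": {"startUrls": [{"url": "https://example.com"}]},
    "mode": "decision",
})
summary = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("SUMMARY")["value"]

# Gate on readiness and warning codes using only the compact payload.
if summary["decisionReadiness"] == "actionable" and not summary["warningCodes"]:
    print("PASS:", summary["headline"])
else:
    print("HOLD:", summary["verdictHuman"])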

cURL — synchronous

curl -X POST "https://api.apify.com/v2/acts/ryanclinton~actor-ab-tester/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "actorA": "apify/web-scraper",
        "actorB": "apify/cheerio-scraper",
        "testInput": {"startUrls": [{"url": "https://example.com"}]},
        "mode": "decision"
      }'

What this actor does NOT do

This is a narrow tool by design. If you need any of these, use the sibling actor instead:

  • Does NOT score README / SEO / schema / config quality → use actor-quality-monitor (metadata scorecard, 8 weighted dimensions, remediation plan).
  • Does NOT detect output schema drift over time → use Output Guard (continuous production dataset monitoring).
  • Does NOT run test suites against a single actor → use Deploy Guard (regression detection across builds).
  • Does NOT recommend PPE prices or plan fits → use Pricing Advisor.
  • Does NOT scan the Store for competitors or niches → use actor-competitor-scanner / Market Gap Finder.
  • Does NOT monitor account-wide spending → use cost-watchdog.
  • Does NOT synthesize a portfolio-wide action plan → use Fleet Analytics.
  • Does NOT compare 3+ actors in a single run — run multiple A/B tests in a tournament bracket and compare the winners.
  • Does NOT maintain a long-term baseline — use Reliability Monitor for that. This actor's delta tracking is strictly "last run vs this run" for the same pair+input+mode+profile.
  • Does NOT audit PII / GDPR / TOS → use Compliance Scanner.
  • Is NOT a load tester — use k6 / Apache Bench / wrk. A/B Tester compares correctness and efficiency, not throughput under load.
  • Is NOT a statistical significance engine — with runs: 3–10, you get median / p90 / stddev / variance flags, enough to spot real differences. Rigorous p-values would need runs: 30+ and a different tool.

Appendix: Explanation and background (optional)

This section is not required for integration. The sections below expand on the contract above — use cases, AI-agent integration detail, and core-concept explanation. Skip this if you've got the contract and you're integrating.

When to use this actor

Use it when:

  • You're choosing between two Apify actors for a production workflow
  • You need a data-driven, repeatable decision before switching actors
  • You want an automation-safe verdict (CI gate, Slack alert, Zapier routing, agent tool-selection)
  • You want to detect regressions in a weekly scheduled comparison of the same pair

Do NOT use it when:

  • You need to compare 3+ actors in a single run (run a tournament bracket instead)
  • You need long-term health monitoring of one actor (use Reliability Monitor)
  • You want to score an actor's Store listing quality (use Quality Monitor)
  • You want to load-test an actor under concurrency (use k6 / Apache Bench)

For AI agents and automated tool-selection

Actor A/B Tester is designed for AI-agent tool selection: it evaluates real performance data (speed, cost, reliability) and returns a structured decision that agents can route on. Agents call it and branch on the decisionPosture field without interpreting raw performance metrics or writing custom scoring logic; LLM tool-callers and automated pipelines use the same mechanism to choose between two Apify actors based on measured performance rather than static heuristics or guessing.

Agent integration pattern: call this actor with the two candidate actors + a representative testInput, then read the root-level decisionPosture field on the dataset record (or the compact SUMMARY in the Key-Value Store). Branch your agent logic on the four stable enum values:

  • switch_now → commit to the winner, log the verdict
  • canary_recommended → route a fraction of traffic to the winner, monitor
  • monitor_only → log the directional result but don't change routing yet
  • no_call → keep the current actor, re-run later with more data

Because the output is structured JSON with documented enums and stable field names, agents can route without parsing prose. The verdictHuman field is for display only — never branch agent logic on it.

Core concept

An A/B test runs two actors N times each in parallel on identical input, aggregates duration / cost / result count with statistical measures (median, p90, stddev), and emits a deterministic decision — winner, confidence tier, readiness level, posture — based on weighted scoring, materiality thresholds, and pairwise stability. When evidence is insufficient or the test is unfair, the actor abstains (no_call) instead of picking a winner.