Deploy Guard — Release Intelligence & Regression Detection

Deploy Guard is the pre-deploy release gate in an Apify actor execution lifecycle — it runs automated test suites against a candidate build and returns a release decision (act_now / monitor / ignore) that downstream automation can branch on, without parsing prose. It's the pre-push CI gate that converts "I think the new build works" into a routable, confidence-scored verdict.

Who it's for: Apify developers who push builds on a schedule, CI/CD operators who gate deploys on a verdict, and LLM agent tool calls that need a machine-readable answer — not a dashboard.

Branch on decision (string enum). Never parse the summary or explanation prose — the string format is not stable.

What it does — in plain language

Automatically run tests and block deployment if your scraper or Apify actor breaks.

Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release.

  • Automatically block a deployment if your Apify actor breaks. Deploy Guard returns a machine-readable decision field (act_now / monitor / ignore) you can branch on directly in your CI pipeline to fail the build.
  • Fail your CI pipeline automatically if your actor output is invalid. Any failing assertion (min results, required fields, regex, duration, uniqueness) surfaces in verdictReasonCodes and flips the verdict to block.
  • Use this to stop bad scraper builds from being deployed. Pre-push, scheduled, or on-demand — Deploy Guard catches regressions before they reach production users.
  • Check if your Apify actor is safe to deploy. Deploy Guard answers one question — "Is this Apify actor build safe to deploy?" — and returns a deterministic yes/no decision for CI/CD gating.
  • Automatically detect regressions in your scraper. Deploy Guard compares each run to a stored baseline and flags breaking changes, missing fields, null-rate shifts, drift, and flaky tests.
  • A testing tool for Apify actors that runs automated test suites. Presets cover common shapes (canary, scraper-smoke, api-actor, contact-scraper, ecommerce-quality, store-readiness); custom test cases handle the rest.
  • Designed for AI agents to decide whether to deploy or block a build. The output is a structured, additive-only enum — an LLM agent reads decision + decisionDrivers[] and acts without parsing prose.
  • Check if your code or Apify actor is safe to deploy. The same pattern works whether you're gating an actor build, a data-pipeline change, or any job that ends in a dataset assertion.
  • Automatically test Apify actors by running test suites and validating their output. Presets, custom test cases, and parameterized templates all run through the same Actor.call() → checkAssertions() → decision pipeline.
  • Use this in CI/CD pipelines to automatically deploy or block releases. Parse decision from the synchronous run result; exit non-zero unless it's act_now + status: pass.
  • Run automated tests and decide whether to deploy or block a build. Deploy Guard is the decision layer — it turns raw test output into a single enum your pipeline can branch on.
  • Stop broken builds from being deployed. The release gate short-circuits on critical test failures, breaking drift, or cold-start; decision becomes monitor or block automatically.
  • Turn test results into a deploy-or-block decision in CI/CD. Raw assertion output gets compressed into one decision enum + one confidence score — your pipeline never has to reason about individual test failures.
  • Catch broken builds before they reach production. Pre-push, baseline-aware, drift-detecting — Deploy Guard surfaces the regression before a customer-facing run does.

How it works in 4 steps

  1. Run test cases against the target Apify actor via Actor.call()
  2. Validate output against assertions (min/max results, required fields, field types, regex, duration, uniqueness)
  3. Compare against baseline for drift, flakiness, and trust-trend signals (when enableBaseline: true)
  4. Return a deploy/block decision (act_now / monitor / ignore) plus a confidence score, stable verdict codes, and a priority-ranked remediation list

Tool Contract (LLM-first)

{
  "tool": "Deploy Guard",
  "description": "Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release.",
  "short_description": "Automatically run tests and block deployment if your scraper or Apify actor breaks.",
  "primary_use": "Pre-release CI/CD gate for a single Apify actor build",
  "call_via": "POST https://api.apify.com/v2/acts/ryanclinton~actor-test-runner/run-sync-get-dataset-items",
  "input": {
    "targetActorId": "string (required) — 'username/actor-name' or raw actor ID",
    "preset": "string (optional) — one of: canary, scraper-smoke, api-actor, contact-scraper, ecommerce-quality, store-readiness",
    "testCases": "array (optional) — custom { name, input, assertions }[]",
    "enableBaseline": "boolean (optional) — unlocks drift + flakiness + trust trend"
  },
  "output": {
    "decision": "'act_now' | 'monitor' | 'ignore'",
    "decisionReason": "string (illustrative, do not parse)",
    "decisionDrivers": "string[] (top 3 stable codes, ranked by score impact)",
    "confidenceLevel": "'high' | 'medium' | 'low'",
    "score": "integer 0-100",
    "verdictReasonCodes": "string[] (stable enum)",
    "confidenceFactorCodes": "string[] (stable enum)"
  },
  "guarantees": [
    "decision is always present on every record (including error / ignore paths)",
    "decision = act_now is never produced without a trusted baseline",
    "score capped at 70 during cold-start",
    "stable enums are additive-only within a major version",
    "prose fields (decisionReason, statusHeadline, oneLine, summary) are not stable"
  ],
  "routing": "Branch on `decision`. If `act_now` + `status: pass` → deploy. If `act_now` + `status: block` → halt. Otherwise → human review."
}
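
A minimal Node sketch of that call, assuming an APIFY_TOKEN environment variable and Node 18+ for the global fetch; the endpoint and field names come from the contract above, and everything else (error handling, retries) is illustrative rather than prescribed.

// Minimal sketch: call the synchronous endpoint above and branch on the stable decision enum.
// Assumes process.env.APIFY_TOKEN is set; error handling and retries are omitted.
const endpoint =
  'https://api.apify.com/v2/acts/ryanclinton~actor-test-runner/run-sync-get-dataset-items' +
  `?token=${process.env.APIFY_TOKEN}`;

async function gateRelease(targetActorId: string): Promise<'deploy' | 'halt' | 'human_review'> {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ targetActorId, preset: 'canary', enableBaseline: true }),
  });
  const [report] = (await res.json()) as Array<{ decision: string; status: string }>;

  // Routing rule from the contract: act_now + pass deploys, act_now + block halts,
  // everything else goes to a human.
  if (report.decision === 'act_now' && report.status === 'pass') return 'deploy';
  if (report.decision === 'act_now' && report.status === 'block') return 'halt';
  return 'human_review';
}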

When to use this tool

Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release. Reach for it when:

  • "Is this actor build safe to deploy?" → run with preset: canary or a custom testCases[] array, check decision
  • "Gate my CI/CD on a deterministic release verdict" → call from GitHub Actions / GitLab CI / Jenkins, exit non-zero unless decision === 'act_now' && status === 'pass'
  • "Detect regressions before publishing a new build" → run with enableBaseline: true on a schedule, read driftSeverity.breaking[] and trendSignals
  • "Surface release health in a Slack channel" → post statusHeadline or oneLine, colour by decision
  • "Let an LLM agent decide whether to promote a build" → the agent reads decision + decisionDrivers[] + decisionReason (one-line summary) and acts

Do NOT use this to: score Store-readiness / README quality / agent-readiness (that's Quality Monitor), compare two actor versions (use A/B Tester), monitor production datasets (use Output Guard).


5-second read — decision field

| decision | What it means | What automation should do |
| --- | --- | --- |
| act_now | Verdict is trusted (pass or block) with medium+ confidence AND a trusted baseline | Deploy (on pass) or halt the pipeline (on block). Safe to fire Slack/PagerDuty/webhook. |
| monitor | Cold-start, low confidence, or a warn verdict | Do NOT auto-deploy. Notify a human reviewer. |
| ignore | No tests were executed | Misconfiguration — no preset and no custom test cases. Investigate input. |

Cold-start guarantee: without a trusted baseline (first run, or baseline disabled), decision is never act_now. The confidence score is capped at 70 and confidenceFactorCodes carries cold_start_cap.


Stable machine contract vs illustrative copy

Deploy Guard separates what is guaranteed stable for automation from what's human-facing prose.

Stable (additive-only within a major version — safe to branch on):

  • decision enum: act_now / monitor / ignore
  • confidenceLevel enum: high / medium / low
  • status enum: pass / warn / block
  • verdictReasonCodes[] — additive enum (documented below)
  • confidenceFactorCodes[] — additive enum (documented below; includes low_suite_coverage when suiteCoverage.score < 60)
  • decisionDrivers[] — ranked subset of the above (top 3, impact-ordered)
  • scoreBreakdown.deductions[].code — additive enum (CRITICAL_TEST_FAILURE, WARNING_TEST_FAILURE, BASELINE_DRIFT_BREAKING, BASELINE_DRIFT_NONBREAKING, LOW_SAMPLE_SIZE, SMALL_HISTORY, LOW_SUITE_COVERAGE, FLAKY_TEST)
  • suiteLint.status enum: 'pass' | 'warn' | 'fail'
  • suiteLint.issues[].severity enum: 'error' | 'warning' | 'info'
  • suiteLint.issues[].code — additive enum (NO_TESTS_SUPPLIED, SINGLE_INPUT_VARIANT, NO_DURATION_GUARD, NO_CRITICAL_CHECKS, SINGLE_TEST_BUT_CI_GATING_HINT)
  • trendSignals[] — additive-only enum (known entries: confidence_regression_fast / _moderate / _slow, confidence_improving_fast / _moderate / _slow, flaky_tests_present, flakiness_clean, breaking_drift_detected, schema_expanding_noncritical, execution_fast_all_tests)
  • driftSeverity tiers: breaking / nonBreaking / informational / expected
  • fleetSignals[].code — additive enum (documented in dataset schema)
  • confidenceBreakdown sub-bands: same high / medium / low enum
  • context.progress enum: cold-start / emerging / developing / mature
  • remediation[].type enum: schema_drift / assertion_failure / flaky_test / low_coverage / missing_baseline / suite_design
  • Dataset field names + types (declared in dataset_schema.json)

Illustrative only — format may evolve, do NOT parse:

  • decisionReason, statusHeadline, oneLine, summary, explanation
  • releaseDecision.recommendation, releaseDecision.reason
  • Status messages (setStatusMessage)
  • Log lines
  • recommendations[] strings

If you need to react to something the prose contains, look for a machine code instead.
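
If you consume the report from TypeScript, one option is to type only the stable subset so nothing downstream can accidentally depend on prose. This is a hand-written sketch of the enums and fields listed above, not a generated or official type definition.

// Hand-written sketch of the stable subset; prose fields are deliberately omitted.
type Decision = 'act_now' | 'monitor' | 'ignore';
type ConfidenceLevel = 'high' | 'medium' | 'low';
type Status = 'pass' | 'warn' | 'block';
type Progress = 'cold-start' | 'emerging' | 'developing' | 'mature';

interface StableReport {
  decision: Decision;
  status: Status;
  confidenceLevel: ConfidenceLevel;
  score: number;                   // integer 0-100, capped at 70 during cold-start
  decisionDrivers: string[];       // top 3 stable codes, impact-ordered
  verdictReasonCodes: string[];    // additive-only enum
  confidenceFactorCodes: string[]; // additive-only enum
  context: {
    progress: Progress;
    hasTrustedBaseline: boolean;
    runCount: number;
  };
}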


Why this beats Apify's daily default-input test

Apify's built-in default-input test runs your actor with {} once a day and flips it to UNDER_MAINTENANCE after 3 consecutive failures. That's a single binary signal — no assertion detail, no drift, no confidence score, no per-field forensics, no CI hook. Deploy Guard runs a full assertion suite against arbitrary inputs, compares against a stored baseline, emits a routable decision tag, produces GitHub/HTML/JSON reports, and calibrates confidence over time. Default-input test is the floor. Deploy Guard is the gate.


How it works

  1. You call Deploy Guard with a target actor ID and either a preset (e.g. canary) or an array of custom test cases
  2. For each test case, Deploy Guard runs the target actor via Actor.call() with the test's input, memory, and timeout
  3. Dataset items from the child run are validated against assertions (min/max results, required fields, field types, regex patterns, duration limits, uniqueness, ranges)
  4. With enableBaseline: true, Deploy Guard compares the run's field schema against a stored baseline — flagging new/missing fields, type changes, null-rate shifts, and test flakiness
  5. The release decision is derived from: critical failures, warning failures, drift significance, trust trend, confidence factors
  6. The decision tag is computed from verdict + confidence level + baseline trust, then emitted alongside stable machine codes

One dataset item per run (the TestSuiteReport), plus three records in the default key-value store: SUMMARY (JSON, flattened decision layer), GITHUB_SUMMARY (Markdown, text/markdown), and HTML_REPORT (HTML, text/html).
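
A sketch of reading all three artefacts from Node with the apify-client package; the actor ID, record keys, and field names are the ones documented here, while the surrounding plumbing is illustrative only.

import { ApifyClient } from 'apify-client';

// Illustrative: run a suite, then read the dataset report and the SUMMARY record.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('ryanclinton/actor-test-runner').call({
  targetActorId: 'username/actor-name',
  preset: 'canary',
  enableBaseline: true,
});

// One TestSuiteReport per run in the default dataset...
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const report = items[0];

// ...plus SUMMARY / GITHUB_SUMMARY / HTML_REPORT in the default key-value store.
const summary = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('SUMMARY');
console.log(report?.decision, summary?.value);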


Presets

Pick a preset or write custom test cases — both can run together in the same suite.

| Preset | Best for | What it runs |
| --- | --- | --- |
| canary | Pre-push confidence check | Single fast test with default input, under 10 seconds |
| scraper-smoke | Basic crawler health | Default input, checks results exist, 120s timeout |
| api-actor | API wrapper validation | Default input, response structure + timing checks |
| contact-scraper | Email extractors | Email format regex, domain validation, richness checks |
| ecommerce-quality | Product scrapers | Price is number ≥0, URL is https, title non-empty, unique URLs |
| store-readiness | Pre-publish audit | Default input produces output, performance guardrail (120s) |

When both a preset and testCases are supplied, Deploy Guard runs both. Total child runs = preset test count + custom test count.
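
For instance, a combined suite might look like the sketch below (illustrative input only; the canary preset contributes its single test and the custom case adds a second child run).

// Illustrative: preset + custom test cases in one suite → 1 preset test + 1 custom test = 2 child runs.
const input = {
  targetActorId: 'username/actor-name',
  preset: 'canary',
  testCases: [
    {
      name: 'Pagination: page 2 returns items',   // hypothetical test; adjust to your actor
      input: { page: 2 },                          // hypothetical target-actor input
      assertions: { minResults: 1, maxDuration: 120 },
    },
  ],
};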


Input schema — the 5 inputs that matter

{
  "targetActorId": "username/actor-name",
  "preset": "canary",
  "testCases": [
    {
      "name": "Smoke — default input",
      "input": {},
      "assertions": { "minResults": 1, "maxDuration": 120 }
    }
  ],
  "enableBaseline": true,
  "timeout": 300
}
  • targetActorId (required) — username/actor-name or the raw actor ID
  • preset — one of the 6 presets above, or omit for custom-only runs
  • testCases — array of { name, input, assertions, expectedToFail?, schemaContract? }
  • enableBaseline — opt into baseline + drift + flakiness + trust trend; activates the cold-start → emerging → developing → mature maturity progression
  • timeout — seconds per test (default 300, max 3600); each child run is wrapped in a wall-clock guard at timeout + 60s

Also supported: parameterizedTestCases for {{placeholder}} templating across parameter sets, memory (MB per child run, default 512), maxSampleItems (default 1000, max 10000 for full-scan mode), fieldImportanceProfile for per-field severity overrides (drives criticalityImpact and driftSeverity tiering).

Assertion reference: minResults, maxResults, maxDuration, requiredFields[], fieldTypes{field: 'string'|'number'|'boolean'|'array'|'object'}, noEmptyFields[], fieldPatterns{field: regex}, fieldRanges{field: {min, max}}, uniqueFields[], severity: 'critical'|'warning'.
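
A sketch of one test case that exercises several of those assertion keys; the asserted field names (title, price, url) are placeholders for whatever your actor actually outputs.

// Illustrative custom test case using the assertion keys listed above.
// title / price / url are hypothetical output fields; adjust to your actor's schema.
const productQualityTest = {
  name: 'Quality: product shape',
  input: { maxItems: 20 },                       // hypothetical target-actor input
  assertions: {
    minResults: 10,
    maxDuration: 180,
    requiredFields: ['title', 'price', 'url'],
    fieldTypes: { price: 'number', title: 'string' },
    noEmptyFields: ['title'],
    fieldPatterns: { url: '^https://' },
    fieldRanges: { price: { min: 0, max: 100000 } },
    uniqueFields: ['url'],
    severity: 'critical',
  },
};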

Waivers + expected instability. Real CI pipelines need controlled exceptions without silently hiding regressions (a sketch follows this list):

  • Per test case: expectedFlaky: true (test is known non-deterministic; don't weight toward flakiness penalty), allowedDriftFields: ["badgeText"] (tolerate drift on listed fields), temporaryWaiverUntil: "2026-05-15T00:00:00.000Z" (scoped waiver with expiry), waiverReason: "site rollout in progress" (audit trail).
  • Global: top-level waivers: [{ testName, allowedDriftFields, temporaryWaiverUntil, reason }] mirrors the same shape, applied by test name. Expired waivers are ignored automatically.
  • Effect: fields matched by an active waiver land in driftSeverity.expected[] instead of breaking / nonBreaking, so the decision engine doesn't punish intentional change.
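
A sketch of both waiver shapes using the field names above; dates, test names, and field names are placeholders.

// Per-test-case waiver fields (illustrative values only):
const waivedTest = {
  name: 'Listing page badges',
  input: {},
  assertions: { minResults: 1 },
  expectedFlaky: true,                               // known non-deterministic
  allowedDriftFields: ['badgeText'],                 // tolerate drift on this field
  temporaryWaiverUntil: '2026-05-15T00:00:00.000Z',  // waiver expires automatically
  waiverReason: 'site rollout in progress',
};

// Global equivalent, applied by test name; expired waivers are ignored automatically.
const waivers = [
  {
    testName: 'Listing page badges',
    allowedDriftFields: ['badgeText'],
    temporaryWaiverUntil: '2026-05-15T00:00:00.000Z',
    reason: 'site rollout in progress',
  },
];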

Output — decision layer first

Every run emits a single TestSuiteReport to the default dataset. Read the decision-layer fields first.

Decision layer (machine-routable — branch on these)

| Field | Type | Description |
| --- | --- | --- |
| decision | 'act_now' \| 'monitor' \| 'ignore' | Routable decision tag. Never parse prose — branch on this. |
| decisionReason | string | One-line plain-language justification. Usable in logs, alerts, audit trails. |
| decisionDrivers | string[] | Top 2–3 stable codes ranked by impact on the final decision — surface these in CI logs and Slack alerts. |
| confidenceLevel | 'high' \| 'medium' \| 'low' | Banded from score: high ≥75, medium ≥50, low <50. |
| confidenceBreakdown | object | Sub-bands: executionConfidence / schemaConfidence / historyConfidence / suiteDesignConfidence, each high/medium/low. Tells you why confidence is what it is. |
| confidenceFactorCodes | string[] | Stable codes explaining the confidence score. Additive-only enum. |
| verdictReasonCodes | string[] | Stable codes behind the pass/warn/block verdict. Additive-only enum. |
| statusHeadline | string | Human-readable one-liner (e.g. SAFE TO DEPLOY — 5/5 passed (high confidence)). |
| oneLine | string | Actor-name-prefixed summary for Slack, email subjects, agent summaries. |
| context | object | { progress, progressMessage, hasTrustedBaseline, runCount } — learning maturity. |

Explainability layer (read these to understand why the decision landed)

| Field | Type | Description |
| --- | --- | --- |
| scoreBreakdown | object | Auditable scoring — { startingScore: 100, deductions[], caps[], finalScore }. Each deduction carries a code, points, count, and reason. Same inputs always produce the same breakdown. |
| remediation | RemediationItem[] | Priority-ranked fix cards. Each has whyItMatters + suggestedFix + ownerHint + affectedFields. Read top-down to fix highest-impact issues first. |
| suiteLint | object | Pre-execution lint of the test suite definition itself — catches NO_TESTS_SUPPLIED, SINGLE_INPUT_VARIANT, NO_DURATION_GUARD, NO_CRITICAL_CHECKS, SINGLE_TEST_BUT_CI_GATING_HINT. Fails fast on suite design problems before burning compute. |
| suiteCoverage | object | { score, assertionTypesUsed[], blindSpots[], testCount, hasSchemaContract }. Guards against false confidence from a thin suite. |
| driftSeverity | object | Drift findings tiered: breaking / nonBreaking / informational / expected. Breaking = required or critical-importance field removed or type-changed. Expected = field listed in allowedDriftFields or a test waiver. |
| criticalityImpact | object | { criticalFieldsHealthy, criticalFieldFailures, nonCriticalFieldFailures, affectedCriticalFields[] }. Derived from fieldImportanceProfile. |
| regressionSummary | object | { direction: 'better' \| 'worse' \| 'stable', velocity, confidence }. Null until ≥2 prior confidence snapshots exist. |
| trendSignals | string[] | Compact trend codes: confidence_regression_moderate, flaky_tests_present, breaking_drift_detected, execution_fast_all_tests. |
| fleetSignals | FleetSignal[] | Stable machine codes for fleet-wide aggregation. Additive-only enum: SCHEMA_DRIFT_CRITICAL, SCHEMA_DRIFT_NONCRITICAL, TEST_FLAKY, LOW_SUITE_COVERAGE, CRITICAL_FIELD_FAILURE, CONFIDENCE_REGRESSION, RELEASE_BLOCKED. |

confidenceFactorCodes vocabulary (additive-only — new codes may arrive; existing codes won't be renamed or removed within a major version):

  • cold_start_cap — no trusted baseline; confidence capped at 70
  • low_sample_size — fewer than 3 test cases executed
  • small_history — run history exists but has fewer than 5 prior runs
  • healthy_history — trusted baseline + zero failures this run
  • drift_detected — current field schema differs materially from baseline
  • low_suite_coverage — suite exercises fewer than 60% of the assertion surface (coverage score <60)
  • suite_lint_failed — pre-execution lint blocked the run

verdictReasonCodes vocabulary:

  • VERDICT_PASS / VERDICT_WARN / VERDICT_BLOCK — raw status
  • CRITICAL_TEST_FAILURE / WARNING_TEST_FAILURE — per-severity failure counts
  • BASELINE_DRIFT — drift detected against prior baseline
  • COLD_START — no trusted baseline yet
  • SUITE_LINT_FAILED — pre-execution lint failed; no tests ran (paired with decision: 'ignore')
  • NO_TESTS — no preset and no custom test cases supplied (paired with decision: 'ignore')

Fleet signals (for downstream aggregators)

fleetSignals[] is a stable-code array designed for Fleet Analytics / Slack routing / Zapier. Every entry carries { code, severity, scope, actionability, detail?, field? }. The enum is additive-only within a major version.

| Code | Severity | Scope | Meaning |
| --- | --- | --- | --- |
| SCHEMA_DRIFT_CRITICAL | critical | field | Breaking drift on a required or critical-importance field |
| SCHEMA_DRIFT_NONCRITICAL | info | suite | Non-breaking drift across one or more fields |
| TEST_FLAKY | warning | test | Individual test's historical pass rate below 80% |
| LOW_SUITE_COVERAGE | warning | suite | Coverage score < 60 — suite has blind spots |
| CRITICAL_FIELD_FAILURE | critical | field | Assertion failed on a fieldImportanceProfile.critical field |
| CONFIDENCE_REGRESSION | warning | run | Recent confidence scores trending down |
| RELEASE_BLOCKED | critical | run | Verdict is block — do not promote |

Verdict + analytics (existing fields)

| Field | Type | Description |
| --- | --- | --- |
| status | 'pass' \| 'warn' \| 'block' | Raw verdict before the decision layer. Use decision for automation, status for display. |
| score | integer 0–100 | Composite confidence score. Capped at 70 during cold-start. |
| summary | string | Plain-language explanation. Not machine-stable. |
| recommendations | string[] | Suggested next actions derived from the failure mix. |
| signals | object | { errorCount, warningCount, criticalCount, driftDetected, metrics } |
| actorName / actorId | string | The tested actor's display name + ID. |
| totalTests / passed / failed / expectedFailures | integer | Count breakdown. |
| totalDuration | number | Seconds across all test cases. |
| results | TestCaseResult[] | Per-test: assertions, schema contract, duration, forensics, error classification. |
| releaseDecision | object | Full detail: root cause, prioritised failures, actions, trust trend, regression velocity, early warnings, blind spots, suite health. |
| drift | DriftReport \| null | Field-level diff vs previous baseline. Null until enableBaseline is on + a baseline exists. |
| stability | TestStability[] \| null | Per-test pass rate + flakiness flag. Null on cold-start. |
| history | RunSnapshot[] \| null | Last 20 run snapshots. Null on cold-start. |
| detectedActorType | string | Heuristic: scraper / contact-scraper / api-actor / ecommerce / unknown. |
| suggestedPreset | string \| null | Preset that would give richer validation for the detected type. |
| testedAt | ISO 8601 | Timestamp of test completion. |

Key-value store outputs

  • SUMMARY — flattened decision layer + counts + failed tests + context (dashboards should read this)
  • GITHUB_SUMMARY — Markdown ready for $GITHUB_STEP_SUMMARY in Actions
  • HTML_REPORT — standalone HTML ready to upload as a CI artefact

Automation contract

| Consumer | Read this field | Why |
| --- | --- | --- |
| Slack / PagerDuty router | decision + statusHeadline | Enum routing, headline as alert title |
| CI/CD gate (GitHub Actions, etc.) | decision (exit 0 only on act_now + status: pass) | Stable enum, no prose parsing |
| LLM agent tool call | oneLine + verdictReasonCodes | One-liner for the model, codes for deterministic follow-up |
| Human debugging | releaseDecision.rootCause + results[].forensics | Traces back to the failing assertion |
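
The Slack / PagerDuty row above reduces to a small mapper: route on the enum, colour by decision, and use statusHeadline as the alert title. The colour values in this sketch are arbitrary choices, not part of the contract.

// Sketch of the Slack / PagerDuty row: route on the enum, display the headline.
// Colour hex values are arbitrary, not part of the contract.
const COLOUR_BY_DECISION: Record<string, string> = {
  act_now: '#2eb886',
  monitor: '#daa038',
  ignore: '#a0a0a0',
};

function toSlackAttachment(report: {
  decision: string;
  status: string;
  statusHeadline: string;
  oneLine: string;
}) {
  return {
    color: COLOUR_BY_DECISION[report.decision] ?? '#a0a0a0',
    title: report.statusHeadline, // human-readable; fine for display, never for branching
    text: report.oneLine,
  };
}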

Decision invariants

Deploy Guard enforces these in code — downstream consumers can rely on them without defensive checks:

decision = act_now implies:

  • context.hasTrustedBaseline = true
  • confidenceLevel != 'low'
  • status != 'warn'
  • totalTests > 0
  • suiteLint.status != 'fail'

decision = monitor implies at least one of:

  • context.hasTrustedBaseline = false (cold-start)
  • confidenceLevel = 'low'
  • status = 'warn'

decision = ignore implies:

  • totalTests = 0 OR suiteLint.status = 'fail'
  • To disambiguate why ignore fired, read verdictReasonCodes: 'SUITE_LINT_FAILED' means the suite was invalid and zero tests executed; otherwise preset and custom testCases were both empty

decisionDrivers contract:

  • max length = 3
  • ordered by absolute score-impact points (higher first)
  • ties broken by alphabetical code
  • empty only when decision = act_now with healthy history, OR decision = ignore (ignore paths already surface their reason via verdictReasonCodes: 'NO_TESTS' or 'SUITE_LINT_FAILED')

remediation[] ordering (deterministic across runs):

  1. severity (critical > warning > info)
  2. score impact (per DEDUCTION_POINTS table)
  3. presence of affected-field list
  4. stable tie-break by type

items[].priority reflects this 1..N order after sort.

Decision flow

  1. Input → resolve test cases (preset + custom + parameterized)
  2. Run each test via Actor.call() + listItems(), guarded by the 5-consecutive-failure circuit breaker (cost guard)
  3. checkAssertions()
  4. computeReleaseDecision (root cause, trust trend, drift, stability, suite health)
  5. hasTrustedBaseline? If no: score = min(score, 70) and the cold_start_cap code is added
  6. confidenceLevel = band(score)
  7. decision:
     • ignore (totalTests = 0)
     • monitor (cold-start OR low confidence OR warn verdict)
     • act_now ((pass or block) + medium/high confidence + trusted baseline)
  8. pushData → setStatusMessage → key-value store SUMMARY / GITHUB_SUMMARY / HTML_REPORT → AQP store (field-rule suggestions for Output Guard)

When to trust the decision

| Scenario | decision | Confidence | Action |
| --- | --- | --- | --- |
| 5+ prior runs, pass, high confidence, no drift | act_now | high | Deploy |
| 5+ prior runs, block, critical failure, high confidence | act_now | high | Halt + investigate |
| First run ever | monitor | ≤70 (capped) | Review manually; run establishes baseline |
| Drift detected on a previously-stable field | monitor or act_now | varies | Inspect drift.changeSummary — may be intentional |
| 1 flaky test in a 5-test suite | act_now | medium | Acceptable if expectedToFail: true |

When NOT to trust the decision

| Scenario | Why | What to do instead |
| --- | --- | --- |
| monitor + cold_start_cap code | No baseline context yet | Run on a schedule for 5+ iterations before gating CI |
| verdictReasonCodes contains BASELINE_DRIFT | Prior schema has changed | Inspect drift; may be intentional or regression |
| Single test in the suite | low_sample_size code | Add at least 3 tests; cold-start math dominates with one |
| Flakiness in stability | One test's pass rate < 80% | Fix the flake or mark expectedToFail: true |
| Fewer than 5 runs in history | small_history code | Trust trend is still warming up — wait for maturity |

Failure interpretation cheat sheet

Every failure mode maps to a stable code → a meaning → an action. Use this to route alerts and automate fixes without an LLM in the loop.

| Code | Where it appears | Meaning | Action |
| --- | --- | --- | --- |
| CRITICAL_TEST_FAILURE | verdictReasonCodes, decisionDrivers | A test marked severity: 'critical' failed — the release gate considers this blocking | Fix the underlying extractor/output before deploy |
| WARNING_TEST_FAILURE | verdictReasonCodes, decisionDrivers | A severity: 'warning' test failed — advisory | Investigate; accept if intentional |
| BASELINE_DRIFT | verdictReasonCodes | Field schema differs from prior baseline | Read driftSeverity.breaking[] + driftSeverity.nonBreaking[] |
| BASELINE_DRIFT_BREAKING | decisionDrivers, scoreBreakdown | Required or critical-importance field changed type or disappeared | Restore field OR update schemaContract.requiredFields + notify consumers |
| BASELINE_DRIFT_NONBREAKING | decisionDrivers, scoreBreakdown | New or renamed non-required fields | Usually safe — confirm consumers tolerate extras |
| COLD_START | verdictReasonCodes, decisionDrivers, confidenceFactorCodes (cold_start_cap) | No trusted baseline yet — confidence capped at 70, decision cannot be act_now | Run on a schedule with enableBaseline: true; graduate to act_now from run 2 onward |
| LOW_SAMPLE_SIZE | decisionDrivers, confidenceFactorCodes (low_sample_size) | Fewer than 3 test cases | Add tests; cold-start math dominates with one |
| LOW_SUITE_COVERAGE | decisionDrivers, confidenceFactorCodes (low_suite_coverage), fleetSignals | Suite uses fewer than 60% of assertion types | Read suiteCoverage.blindSpots[] and fix the top 2–3 |
| FLAKY_TEST | decisionDrivers, fleetSignals (TEST_FLAKY) | A test's historical pass rate is below 80% | Mark expectedFlaky: true OR fix the non-determinism |
| CRITICAL_FIELD_FAILURE | fleetSignals | Assertion failed on a field declared critical in fieldImportanceProfile | Read criticalityImpact.affectedCriticalFields[] |
| CONFIDENCE_REGRESSION | fleetSignals | Confidence score trending down over recent runs | Read regressionSummary.direction + velocity; investigate recent drift |
| SUITE_LINT_FAILED | verdictReasonCodes (paired with decision: 'ignore') | Pre-execution lint blocked the run — suite design problem | Read suiteLint.issues[].code and fix the suite definition |
| NO_TESTS | verdictReasonCodes (paired with decision: 'ignore') | Both preset and testCases were empty | Pick a preset OR provide at least one custom test case |
| RELEASE_BLOCKED | fleetSignals | Verdict is block (any reason) | Halt pipeline; do not promote |
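
Because every entry in the table above is a stable code, alert routing can be a plain lookup. This sketch covers a handful of rows and is deliberately not exhaustive.

// Partial lookup over the cheat sheet above; extend as needed.
const NEXT_STEP: Record<string, string> = {
  CRITICAL_TEST_FAILURE: 'Fix the underlying extractor/output before deploy',
  BASELINE_DRIFT_BREAKING: 'Restore the field or update schemaContract.requiredFields',
  COLD_START: 'Run on a schedule with enableBaseline: true before gating CI',
  FLAKY_TEST: 'Mark expectedFlaky: true or fix the non-determinism',
  NO_TESTS: 'Pick a preset or provide at least one custom test case',
};

function nextSteps(report: { verdictReasonCodes: string[]; decisionDrivers: string[] }): string[] {
  const codes = new Set([...report.verdictReasonCodes, ...report.decisionDrivers]);
  return [...codes]
    .map((code) => NEXT_STEP[code])
    .filter((step): step is string => Boolean(step));
}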

First run / second run / nth run

The context.progress field tells you exactly where you are.

| Runs | progress | What's active | What's still warming up |
| --- | --- | --- | --- |
| 0 (first) | cold-start | Assertions, verdict, forensic details | No baseline, drift, flakiness, or trust trend. Confidence capped at 70. decision ∈ {monitor, ignore}. |
| 1–4 | emerging | Baseline comparison (from run 2), drift fields, stability, run history begin populating | Flakiness unreliable with <5 samples. Trust trend not yet meaningful. decision can become act_now from run 2 when baseline is trusted. |
| 5–14 | developing | Trust trend, flakiness, auto-tune hints all reliable | Early warnings sharpen with more history. Suite health fully active. |
| 15+ | mature | Full intelligence: trust trend, regression velocity, blind spots, calibrated suggestions | |

Note: enableBaseline: true is required for baselines, drift, stability, history, and trust trend. Without it, Deploy Guard still runs all assertions and emits a verdict — but context.hasTrustedBaseline stays false and decision is capped at monitor.


Example — full input + output

Input:

{
  "targetActorId": "ryanclinton/website-contact-scraper",
  "preset": "contact-scraper",
  "testCases": [
    {
      "name": "Smoke — known-good site",
      "input": { "urls": ["https://example.com"] },
      "assertions": {
        "minResults": 1,
        "requiredFields": ["emails", "domain"],
        "maxDuration": 120
      }
    }
  ],
  "enableBaseline": true,
  "timeout": 180
}

Output — act_now + pass (safe to deploy):

{
  "decision": "act_now",
  "decisionReason": "pass verdict + high confidence (82/100) + 12 prior runs — act_now",
  "decisionDrivers": [],
  "confidenceLevel": "high",
  "confidenceFactorCodes": ["healthy_history"],
  "verdictReasonCodes": ["VERDICT_PASS"],
  "statusHeadline": "SAFE TO DEPLOY — 2/2 passed (high confidence)",
  "oneLine": "ryanclinton/website-contact-scraper: SAFE to deploy — 2/2 passed, 82/100 confidence",
  "status": "pass",
  "score": 82,
  "totalTests": 2,
  "passed": 2,
  "failed": 0
}

Output — act_now + block (halt release):

{
  "decision": "act_now",
  "decisionReason": "block verdict + medium confidence (58/100) + 9 prior runs — act_now",
  "decisionDrivers": ["CRITICAL_TEST_FAILURE", "BASELINE_DRIFT_BREAKING"],
  "confidenceLevel": "medium",
  "confidenceFactorCodes": ["drift_detected"],
  "verdictReasonCodes": ["VERDICT_BLOCK", "CRITICAL_TEST_FAILURE", "BASELINE_DRIFT"],
  "statusHeadline": "HALT RELEASE — 1/3 passed (medium confidence)",
  "oneLine": "ryanclinton/website-contact-scraper: HALT — 1/3 passed, 58/100 confidence",
  "status": "block",
  "score": 58,
  "totalTests": 3,
  "passed": 1,
  "failed": 2
}

Output — monitor + cold-start (first run, directional only):

{
  "decision": "monitor",
  "decisionReason": "pass verdict + medium confidence (70/100, cold-start capped) — monitor only",
  "decisionDrivers": ["COLD_START"],
  "confidenceLevel": "medium",
  "confidenceFactorCodes": ["cold_start_cap"],
  "verdictReasonCodes": ["VERDICT_PASS", "COLD_START"],
  "statusHeadline": "PASS — 2/2 passed, low trust (monitor only)",
  "oneLine": "ryanclinton/website-contact-scraper: PASS — 2/2 passed, 70/100 confidence — monitor",
  "status": "pass",
  "score": 70,
  "totalTests": 2,
  "passed": 2,
  "failed": 0,
  "context": { "progress": "cold-start", "hasTrustedBaseline": false, "runCount": 0 }
}

Using Deploy Guard in GitHub Actions

- name: Deploy Guard — pre-release check
  run: |
    RESULT=$(curl -s -X POST \
      "https://api.apify.com/v2/acts/ryanclinton~actor-test-runner/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "targetActorId": "ryanclinton/my-actor",
        "preset": "canary",
        "enableBaseline": true
      }')
    DECISION=$(echo "$RESULT" | jq -r '.[0].decision')
    HEADLINE=$(echo "$RESULT" | jq -r '.[0].statusHeadline')
    STATUS=$(echo "$RESULT" | jq -r '.[0].status')
    echo "Deploy Guard: $HEADLINE"
    if [ "$DECISION" != "act_now" ] || [ "$STATUS" != "pass" ]; then
      echo "::error::$HEADLINE"
      exit 1
    fi

The GITHUB_SUMMARY record in the default key-value store is served with Content-Type: text/markdown — ready to drop into $GITHUB_STEP_SUMMARY.


Pricing

$0.35 per test suite run (Pay-Per-Event, single test-suite event charged once per run after the report is pushed).

Your target actor's compute + that actor's own PPE charges are separate — they run on your account and bill at the target's rates. Deploy Guard only charges for the validation layer, not the underlying compute.

Deploy Guard logs the price at start:

PPE mode active — $0.35 per test suite run

And again in the final status message:

ACT NOW (deploy): 2/2 passed in 8.4s — $0.35 charged

Cost guardrail: after 5 consecutive test failures, Deploy Guard breaks the loop to stop runaway sub-actor credit spend on a clearly broken target. Remaining tests are skipped; the verdict stands on what ran.


FAQ

How is Deploy Guard different from Apify's default-input test?

Apify's built-in default-input test runs your actor with {} once a day and flags it UNDER_MAINTENANCE after 3 consecutive failures. That's a single-test binary signal with no assertion detail, no drift, no confidence scoring, no per-field forensics. Deploy Guard runs a full assertion suite against arbitrary inputs, compares against a stored baseline, emits a routable decision tag, and produces GitHub/HTML/JSON reports. Default-input test is the floor; Deploy Guard is the gate.

Why is the first run always monitor?

Cold-start safety. Without a trusted baseline, Deploy Guard has no field schema history, no drift reference, no flakiness signal, and no run history to calibrate confidence. The score is capped at 70 and decision is forced to monitor. After the first run completes with enableBaseline: true, run number 2 has something to compare against and can graduate to act_now.

Can I use this in GitHub Actions?

Yes. Call the run-sync-get-dataset-items endpoint, parse the decision field, exit non-zero on anything other than act_now + status: pass. The key-value store also contains a GITHUB_SUMMARY record (Markdown) ready for $GITHUB_STEP_SUMMARY. See the example above.

Does it keep running the remaining tests if one fails?

By default, yes — tests run sequentially in the order you provide. If 5 consecutive tests fail, a circuit breaker halts remaining tests to cap cost and the run exits cleanly with the verdict derived from what ran. Mark known-broken tests with expectedToFail: true and they won't trip the breaker.

What's the difference between verdictReasonCodes and confidenceFactorCodes?

verdictReasonCodes explain what the verdict is — pass/warn/block and the specific failures that drove it (e.g. CRITICAL_TEST_FAILURE, BASELINE_DRIFT). confidenceFactorCodes explain how much to trust the verdict — whether enough data has accumulated, whether a baseline exists, whether drift signals are active. Both are stable enums; both are additive-only within a major version.

Does it cost credits?

Yes — $0.35 per suite for the Deploy Guard layer itself, plus whatever your target actor costs per run × N test cases. A suite with 5 test cases against a $0.10-per-result scraper that returns 20 results each costs: $0.35 (Deploy Guard) + 5 × 20 × $0.10 = $10.35 total. Deploy Guard only bills the $0.35; the rest bills on the target's pricing to your account.

Can I compare two actor versions side-by-side?

No — Deploy Guard tests one actor at a time. For side-by-side A/B comparison use A/B Tester, which runs the same input against two actors in parallel and returns a pairwise decision (switch_now / canary_recommended / monitor_only / no_call).

How do I detect flaky tests?

Enable enableBaseline: true and run on a schedule. Flakiness detection activates after 5 prior runs — Deploy Guard computes a per-test pass rate across run history and flags any test with a pass rate below 80% as flaky. The stability[] array shows { name, passRate, runs, flaky } per test case. Consumers should treat flaky: true tests as non-blocking — don't gate CI on them until you've fixed the underlying non-determinism.
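
A sketch of reading that array from the report, using the per-test shape quoted above (passRate is reported as-is; no unit conversion is assumed):

// Sketch: list flaky tests from stability[]; each entry is { name, passRate, runs, flaky }.
function flakyTests(report: {
  stability: Array<{ name: string; passRate: number; runs: number; flaky: boolean }> | null;
}): string[] {
  return (report.stability ?? [])
    .filter((t) => t.flaky)
    .map((t) => `${t.name}: pass rate ${t.passRate} over ${t.runs} runs`);
}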

Can I supply different inputs per test case?

Yes — every testCase.input is independent. Use parameterizedTestCases to run the same template against many parameter sets (e.g. test the same URL shape with 20 different URLs). nameTemplate and inputTemplate support {{placeholder}} substitution.
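
A hedged sketch of that shape: nameTemplate, inputTemplate, and {{placeholder}} substitution are documented above, while the name of the parameter-set array (parameters here) and the placement of assertions are assumptions made for illustration.

// Illustrative parameterized template. The 'parameters' key name and the assertions
// placement are assumptions; nameTemplate / inputTemplate are documented above.
const parameterizedTestCases = [
  {
    nameTemplate: 'Smoke: {{url}}',
    inputTemplate: { urls: ['{{url}}'] },          // hypothetical target-actor input shape
    assertions: { minResults: 1, maxDuration: 120 },
    parameters: [
      { url: 'https://example.com' },
      { url: 'https://example.org' },
    ],
  },
];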

What happens if the sub-actor times out?

Each Actor.call() is wrapped in a wall-clock race (timeout + 60s or 5 minutes minimum). On timeout, the test case is marked failed with failureType: 'timeout', and the suite continues. Two timeouts in a row don't break the suite — but 5 consecutive failures of any type trip the circuit breaker.

Why did my first run get a monitor decision even though every test passed?

Cold-start cap. The run succeeded, every assertion passed, and the verdict is pass — but without a stored baseline there's no history to calibrate confidence, so the score is capped at 70 and decision can't promote to act_now. Run it again (scheduled or manual) with enableBaseline: true and run number 2 onward will promote to act_now when the verdict stays healthy.


What Deploy Guard does NOT do

Deploy Guard is the pre-release test gate in a fleet of specialist actors. Use siblings for these adjacent jobs:

| Need | Use instead |
| --- | --- |
| Validate schema/quality of a PRODUCTION dataset after it runs (silent data failures, coverage drops, null spikes) | Output Guard — post-run data-quality monitor with incident lifecycle and channel-aware alerts |
| Compare two actor versions side-by-side on the same input | A/B Tester — pairwise decision engine with fairness checks and decision stability |
| Score a whole fleet's quality | Quality Monitor — fleet-wide quality scorer |
| Detect PII / GDPR / TOS risks in an actor's output | Compliance Scanner |
| Consolidated dashboard across the whole fleet | Fleet Health Report |

Deploy Guard's output is designed to feed these siblings — every run appends field-rule suggestions to a shared key-value store (the AQP store) that Output Guard picks up automatically. Pre-deploy assertions that fail here become production monitoring rules there without manual sync.


License

Proprietary. Runs on Apify. Source is available inside the platform for audit but not redistributable.