Deploy Guard — Release Intelligence & Regression Detection
Deploy Guard is the pre-deploy release gate in an Apify actor execution lifecycle — it runs automated test suites against a candidate build and returns a release decision (act_now / monitor / ignore) that downstream automation can branch on, without parsing prose. It's the pre-push CI gate that converts "I think the new build works" into a routable, confidence-scored verdict.
Who it's for: Apify developers who push builds on a schedule, CI/CD operators who gate deploys on a verdict, and LLM agent tool calls that need a machine-readable answer — not a dashboard.
Branch on decision (string enum). Never parse the summary or explanation prose — the string format is not stable.
What it does — in plain language
Automatically run tests and block deployment if your scraper or Apify actor breaks.
Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release.
- Automatically block a deployment if your Apify actor breaks. Deploy Guard returns a machine-readable `decision` field (`act_now`/`monitor`/`ignore`) you can branch on directly in your CI pipeline to fail the build.
- Fail your CI pipeline automatically if your actor output is invalid. Any failing assertion (min results, required fields, regex, duration, uniqueness) surfaces in `verdictReasonCodes` and flips the verdict to `block`.
- Use this to stop bad scraper builds from being deployed. Pre-push, scheduled, or on-demand — Deploy Guard catches regressions before they reach production users.
- Check if your Apify actor is safe to deploy. Deploy Guard answers one question — "Is this Apify actor build safe to deploy?" — and returns a deterministic yes/no decision for CI/CD gating.
- Automatically detect regressions in your scraper. Deploy Guard compares each run to a stored baseline and flags breaking changes, missing fields, null-rate shifts, drift, and flaky tests.
- A testing tool for Apify actors that runs automated test suites. Presets cover common shapes (canary, scraper-smoke, api-actor, contact-scraper, ecommerce-quality, store-readiness); custom test cases handle the rest.
- Designed for AI agents to decide whether to deploy or block a build. The output is a structured, additive-only enum — an LLM agent reads `decision` + `decisionDrivers[]` and acts without parsing prose.
- Check if your code or Apify actor is safe to deploy. The same pattern works whether you're gating an actor build, a data-pipeline change, or any job that ends in a dataset assertion.
- Automatically test Apify actors by running test suites and validating their output. Presets, custom test cases, and parameterized templates all run through the same `Actor.call()` → `checkAssertions()` → decision pipeline.
- Use this in CI/CD pipelines to automatically deploy or block releases. Parse `decision` from the synchronous run result; exit non-zero unless it's `act_now` + `status: pass`.
- Run automated tests and decide whether to deploy or block a build. Deploy Guard is the decision layer — it turns raw test output into a single enum your pipeline can branch on.
- Stop broken builds from being deployed. The release gate short-circuits on `critical` test failures, breaking drift, or cold-start; the verdict flips to `block` or `decision` drops to `monitor` automatically.
- Turn test results into a deploy-or-block decision in CI/CD. Raw assertion output gets compressed into one `decision` enum + one confidence score — your pipeline never has to reason about individual test failures.
- Catch broken builds before they reach production. Pre-push, baseline-aware, drift-detecting — Deploy Guard surfaces the regression before a customer-facing run does.
How it works in 4 steps
- Run test cases against the target Apify actor via `Actor.call()`
- Validate output against assertions (min/max results, required fields, field types, regex, duration, uniqueness)
- Compare against baseline for drift, flakiness, and trust-trend signals (when `enableBaseline: true`)
- Return a deploy/block decision — `act_now`/`monitor`/`ignore` plus confidence score, stable verdict codes, and a priority-ranked remediation list
Tool Contract (LLM-first)
{"tool": "Deploy Guard","description": "Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release.","short_description": "Automatically run tests and block deployment if your scraper or Apify actor breaks.","primary_use": "Pre-release CI/CD gate for a single Apify actor build","call_via": "POST https://api.apify.com/v2/acts/ryanclinton~actor-test-runner/run-sync-get-dataset-items","input": {"targetActorId": "string (required) — 'username/actor-name' or raw actor ID","preset": "string (optional) — one of: canary, scraper-smoke, api-actor, contact-scraper, ecommerce-quality, store-readiness","testCases": "array (optional) — custom { name, input, assertions }[]","enableBaseline": "boolean (optional) — unlocks drift + flakiness + trust trend"},"output": {"decision": "'act_now' | 'monitor' | 'ignore'","decisionReason": "string (illustrative, do not parse)","decisionDrivers": "string[] (top 3 stable codes, ranked by score impact)","confidenceLevel": "'high' | 'medium' | 'low'","score": "integer 0-100","verdictReasonCodes": "string[] (stable enum)","confidenceFactorCodes": "string[] (stable enum)"},"guarantees": ["decision is always present on every record (including error / ignore paths)","decision = act_now is never produced without a trusted baseline","score capped at 70 during cold-start","stable enums are additive-only within a major version","prose fields (decisionReason, statusHeadline, oneLine, summary) are not stable"],"routing": "Branch on `decision`. If `act_now` + `status: pass` → deploy. If `act_now` + `status: block` → halt. Otherwise → human review."}
When to use this tool
Deploy Guard runs automated test suites on an Apify actor and returns a deterministic decision (act_now, monitor, or ignore) that you can use to deploy or block a release. Reach for it when:
- "Is this actor build safe to deploy?" → run with
preset: canaryor a customtestCases[]array, checkdecision - "Gate my CI/CD on a deterministic release verdict" → call from GitHub Actions / GitLab CI / Jenkins, exit non-zero unless
decision === 'act_now' && status === 'pass' - "Detect regressions before publishing a new build" → run with
enableBaseline: trueon a schedule, readdriftSeverity.breaking[]andtrendSignals - "Surface release health in a Slack channel" → post
statusHeadlineoroneLine, colour bydecision - "Let an LLM agent decide whether to promote a build" → the agent reads
decision+decisionDrivers[]+decisionReason(one-line summary) and acts
Do NOT use this to: score Store-readiness / README quality / agent-readiness (that's Quality Monitor), compare two actor versions (use A/B Tester), monitor production datasets (use Output Guard).
5-second read — decision field
| decision | What it means | What automation should do |
|---|---|---|
| `act_now` | Verdict is trusted (pass or block) with medium+ confidence AND a trusted baseline | Deploy (on pass) or halt the pipeline (on block). Safe to fire Slack/PagerDuty/webhook. |
| `monitor` | Cold-start, low confidence, or a warn verdict | Do NOT auto-deploy. Notify a human reviewer. |
| `ignore` | No tests were executed | Misconfiguration — no preset and no custom test cases. Investigate input. |
Cold-start guarantee: without a trusted baseline (first run, or baseline disabled), `decision` is never `act_now`. The confidence score is capped at 70 and `confidenceFactorCodes` carries `cold_start_cap`.
Stable machine contract vs illustrative copy
Deploy Guard separates what is guaranteed stable for automation from what's human-facing prose.
Stable (additive-only within a major version — safe to branch on):
- `decision` enum: `act_now` / `monitor` / `ignore`
- `confidenceLevel` enum: `high` / `medium` / `low`
- `status` enum: `pass` / `warn` / `block`
- `verdictReasonCodes[]` — additive enum (documented below)
- `confidenceFactorCodes[]` — additive enum (documented below; includes `low_suite_coverage` when suiteCoverage.score < 60)
- `decisionDrivers[]` — ranked subset of the above (top 3, impact-ordered)
- `scoreBreakdown.deductions[].code` — additive enum (`CRITICAL_TEST_FAILURE`, `WARNING_TEST_FAILURE`, `BASELINE_DRIFT_BREAKING`, `BASELINE_DRIFT_NONBREAKING`, `LOW_SAMPLE_SIZE`, `SMALL_HISTORY`, `LOW_SUITE_COVERAGE`, `FLAKY_TEST`)
- `suiteLint.status` enum: `'pass' | 'warn' | 'fail'`
- `suiteLint.issues[].severity` enum: `'error' | 'warning' | 'info'`
- `suiteLint.issues[].code` — additive enum (`NO_TESTS_SUPPLIED`, `SINGLE_INPUT_VARIANT`, `NO_DURATION_GUARD`, `NO_CRITICAL_CHECKS`, `SINGLE_TEST_BUT_CI_GATING_HINT`)
- `trendSignals[]` — additive-only enum (known entries: `confidence_regression_fast`/`_moderate`/`_slow`, `confidence_improving_fast`/`_moderate`/`_slow`, `flaky_tests_present`, `flakiness_clean`, `breaking_drift_detected`, `schema_expanding_noncritical`, `execution_fast_all_tests`)
- `driftSeverity` tiers: `breaking` / `nonBreaking` / `informational` / `expected`
- `fleetSignals[].code` — additive enum (documented in dataset schema)
- `confidenceBreakdown` sub-bands: same `high`/`medium`/`low` enum
- `context.progress` enum: `cold-start` / `emerging` / `developing` / `mature`
- `remediation[].type` enum: `schema_drift` / `assertion_failure` / `flaky_test` / `low_coverage` / `missing_baseline` / `suite_design`
- Dataset field names + types (declared in `dataset_schema.json`)
Illustrative only — format may evolve, do NOT parse:
- `decisionReason`, `statusHeadline`, `oneLine`, `summary`, `explanation`
- `releaseDecision.recommendation`, `releaseDecision.reason`
- Status messages (`setStatusMessage`)
- Log lines
- `recommendations[]` strings
If you need to react to something the prose contains, look for a machine code instead.
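As a concrete illustration (the report values below are invented), test for the stable code rather than matching the prose:

```typescript
// Invented example report. `summary` wording can change between versions;
// the codes in `verdictReasonCodes` cannot (additive-only enum).
const report = {
  summary: 'Schema drifted on 2 fields since the last trusted baseline.', // do NOT parse
  verdictReasonCodes: ['VERDICT_WARN', 'BASELINE_DRIFT'],
};

if (report.verdictReasonCodes.includes('BASELINE_DRIFT')) {
  // React here (e.g. queue a drift review) instead of regex-matching `summary`.
}
```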
Why this beats Apify's daily default-input test
Apify's built-in default-input test runs your actor with {} once a day and flips it to UNDER_MAINTENANCE after 3 consecutive failures. That's a single binary signal — no assertion detail, no drift, no confidence score, no per-field forensics, no CI hook. Deploy Guard runs a full assertion suite against arbitrary inputs, compares against a stored baseline, emits a routable decision tag, produces GitHub/HTML/JSON reports, and calibrates confidence over time. Default-input test is the floor. Deploy Guard is the gate.
How it works
- You call Deploy Guard with a target actor ID and either a preset (e.g. `canary`) or an array of custom test cases
- For each test case, Deploy Guard runs the target actor via `Actor.call()` with the test's input, memory, and timeout
- Dataset items from the child run are validated against assertions (min/max results, required fields, field types, regex patterns, duration limits, uniqueness, ranges)
- With `enableBaseline: true`, Deploy Guard compares the run's field schema against a stored baseline — flagging new/missing fields, type changes, null-rate shifts, and test flakiness
- The release decision is derived from: critical failures, warning failures, drift significance, trust trend, confidence factors
- The `decision` scalar is computed from verdict + confidence level + baseline trust, then emitted alongside stable machine codes
One dataset item per run (the TestSuiteReport), plus three records in the default key-value store: SUMMARY (JSON, flattened decision layer), GITHUB_SUMMARY (Markdown, text/markdown), and HTML_REPORT (HTML, text/html).
Presets
Pick a preset or write custom test cases — both can run together in the same suite.
| Preset | Best for | What it runs |
|---|---|---|
| `canary` | Pre-push confidence check | Single fast test with default input, under 10 seconds |
| `scraper-smoke` | Basic crawler health | Default input, checks results exist, 120s timeout |
| `api-actor` | API wrapper validation | Default input, response structure + timing checks |
| `contact-scraper` | Email extractors | Email format regex, domain validation, richness checks |
| `ecommerce-quality` | Product scrapers | Price is number ≥0, URL is https, title non-empty, unique URLs |
| `store-readiness` | Pre-publish audit | Default input produces output, performance guardrail (120s) |
When both a preset and testCases are supplied, Deploy Guard runs both. Total child runs = preset test count + custom test count.
Input schema — the 5 inputs that matter
{"targetActorId": "username/actor-name","preset": "canary","testCases": [{"name": "Smoke — default input","input": {},"assertions": { "minResults": 1, "maxDuration": 120 }}],"enableBaseline": true,"timeout": 300}
- `targetActorId` (required) — `username/actor-name` or the raw actor ID
- `preset` — one of the 6 presets above, or omit for custom-only runs
- `testCases` — array of `{ name, input, assertions, expectedToFail?, schemaContract? }`
- `enableBaseline` — opt into baseline + drift + flakiness + trust trend; activates the cold-start → emerging → developing → mature maturity progression
- `timeout` — seconds per test (default 300, max 3600); each child run is wrapped in a wall-clock guard at `timeout + 60s`
Also supported: `parameterizedTestCases` for `{{placeholder}}` templating across parameter sets, `memory` (MB per child run, default 512), `maxSampleItems` (default 1000, max 10000 for full-scan mode), `fieldImportanceProfile` for per-field severity overrides (drives `criticalityImpact` and `driftSeverity` tiering).
Assertion reference: `minResults`, `maxResults`, `maxDuration`, `requiredFields[]`, `fieldTypes` `{field: 'string'|'number'|'boolean'|'array'|'object'}`, `noEmptyFields[]`, `fieldPatterns` `{field: regex}`, `fieldRanges` `{field: {min, max}}`, `uniqueFields[]`, `severity: 'critical'|'warning'`.
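A sketch of a single custom test case exercising most of that assertion surface; the target's input shape and field names (`title`, `price`, `url`) are illustrative, not part of any preset:

```typescript
// Illustrative test case: the input shape and field names are hypothetical.
const testCase = {
  name: 'E-commerce shape check',
  input: { startUrl: 'https://example.com/products' },
  assertions: {
    minResults: 10,
    maxResults: 500,
    maxDuration: 120, // seconds
    requiredFields: ['title', 'price', 'url'],
    fieldTypes: { price: 'number', title: 'string' },
    noEmptyFields: ['title'],
    fieldPatterns: { url: '^https://' }, // regex supplied as a string
    fieldRanges: { price: { min: 0, max: 100000 } },
    uniqueFields: ['url'],
    severity: 'critical', // a failure here flips the verdict to block
  },
};
```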
Waivers + expected instability. Real CI pipelines need controlled exceptions without silently hiding regressions:
- Per test case: `expectedFlaky: true` (test is known non-deterministic; don't weight toward flakiness penalty), `allowedDriftFields: ["badgeText"]` (tolerate drift on listed fields), `temporaryWaiverUntil: "2026-05-15T00:00:00.000Z"` (scoped waiver with expiry), `waiverReason: "site rollout in progress"` (audit trail).
- Global: top-level `waivers: [{ testName, allowedDriftFields, temporaryWaiverUntil, reason }]` mirrors the same shape, applied by test name. Expired waivers are ignored automatically.
- Effect: fields matched by an active waiver land in `driftSeverity.expected[]` instead of `breaking`/`nonBreaking`, so the decision engine doesn't punish intentional change (see the sketch below).
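A minimal config sketch of both waiver shapes, using the illustrative values from the bullets above:

```typescript
const input = {
  targetActorId: 'ryanclinton/my-actor', // illustrative
  testCases: [
    {
      name: 'Listing page smoke',
      input: {},
      assertions: { minResults: 1 },
      expectedFlaky: true,                              // known non-deterministic
      allowedDriftFields: ['badgeText'],                // tolerate drift here
      temporaryWaiverUntil: '2026-05-15T00:00:00.000Z', // expires automatically
      waiverReason: 'site rollout in progress',         // audit trail
    },
  ],
  // Global form, matched by test name; mirrors the per-test shape.
  waivers: [
    {
      testName: 'Listing page smoke',
      allowedDriftFields: ['badgeText'],
      temporaryWaiverUntil: '2026-05-15T00:00:00.000Z',
      reason: 'site rollout in progress',
    },
  ],
};
```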
Output — decision layer first
Every run emits a single TestSuiteReport to the default dataset. Read the decision-layer fields first.
Decision layer (machine-routable — branch on these)
| Field | Type | Description |
|---|---|---|
| `decision` | `'act_now' \| 'monitor' \| 'ignore'` | Routable decision tag. Never parse prose — branch on this. |
| `decisionReason` | string | One-line plain-language justification. Usable in logs, alerts, audit trails. |
| `decisionDrivers` | string[] | Top 2–3 stable codes ranked by impact on the final decision — surface these in CI logs and Slack alerts. |
| `confidenceLevel` | `'high' \| 'medium' \| 'low'` | Banded from score: high ≥75, medium ≥50, low <50. |
| `confidenceBreakdown` | object | Sub-bands: executionConfidence / schemaConfidence / historyConfidence / suiteDesignConfidence, each high/medium/low. Tells you why confidence is what it is. |
| `confidenceFactorCodes` | string[] | Stable codes explaining the confidence score. Additive-only enum. |
| `verdictReasonCodes` | string[] | Stable codes behind the pass/warn/block verdict. Additive-only enum. |
| `statusHeadline` | string | Human-readable one-liner (e.g. SAFE TO DEPLOY — 5/5 passed (high confidence)). |
| `oneLine` | string | Actor-name-prefixed summary for Slack, email subjects, agent summaries. |
| `context` | object | `{ progress, progressMessage, hasTrustedBaseline, runCount }` — learning maturity. |
Explainability layer (read these to understand why the decision landed)
| Field | Type | Description |
|---|---|---|
| `scoreBreakdown` | object | Auditable scoring — `{ startingScore: 100, deductions[], caps[], finalScore }`. Each deduction carries a code, points, count, and reason. Same inputs always produce the same breakdown. |
| `remediation` | RemediationItem[] | Priority-ranked fix cards. Each has `whyItMatters` + `suggestedFix` + `ownerHint` + `affectedFields`. Read top-down to fix highest-impact issues first. |
| `suiteLint` | object | Pre-execution lint of the test suite definition itself — catches `NO_TESTS_SUPPLIED`, `SINGLE_INPUT_VARIANT`, `NO_DURATION_GUARD`, `NO_CRITICAL_CHECKS`, `SINGLE_TEST_BUT_CI_GATING_HINT`. Fails fast on suite design problems before burning compute. |
| `suiteCoverage` | object | `{ score, assertionTypesUsed[], blindSpots[], testCount, hasSchemaContract }`. Guards against false confidence from a thin suite. |
| `driftSeverity` | object | Drift findings tiered: breaking / nonBreaking / informational / expected. Breaking = required or critical-importance field removed or type-changed. Expected = field listed in `allowedDriftFields` or a test waiver. |
| `criticalityImpact` | object | `{ criticalFieldsHealthy, criticalFieldFailures, nonCriticalFieldFailures, affectedCriticalFields[] }`. Derived from `fieldImportanceProfile`. |
| `regressionSummary` | object | `{ direction: 'better' \| 'worse' \| 'stable', velocity, confidence }`. Null until ≥2 prior confidence snapshots exist. |
| `trendSignals` | string[] | Compact trend codes: `confidence_regression_moderate`, `flaky_tests_present`, `breaking_drift_detected`, `execution_fast_all_tests`. |
| `fleetSignals` | FleetSignal[] | Stable machine codes for fleet-wide aggregation. Additive-only enum: `SCHEMA_DRIFT_CRITICAL`, `SCHEMA_DRIFT_NONCRITICAL`, `TEST_FLAKY`, `LOW_SUITE_COVERAGE`, `CRITICAL_FIELD_FAILURE`, `CONFIDENCE_REGRESSION`, `RELEASE_BLOCKED`. |
confidenceFactorCodes vocabulary (additive-only — new codes may arrive; existing codes won't be renamed or removed within a major version):
- `cold_start_cap` — no trusted baseline; confidence capped at 70
- `low_sample_size` — fewer than 3 test cases executed
- `small_history` — run history exists but has fewer than 5 prior runs
- `healthy_history` — trusted baseline + zero failures this run
- `drift_detected` — current field schema differs materially from baseline
- `low_suite_coverage` — suite exercises fewer than 60% of the assertion surface (coverage score <60)
- `suite_lint_failed` — pre-execution lint blocked the run
verdictReasonCodes vocabulary:
- `VERDICT_PASS` / `VERDICT_WARN` / `VERDICT_BLOCK` — raw status
- `CRITICAL_TEST_FAILURE` / `WARNING_TEST_FAILURE` — per-severity failure counts
- `BASELINE_DRIFT` — drift detected against prior baseline
- `COLD_START` — no trusted baseline yet
- `SUITE_LINT_FAILED` — pre-execution lint failed; no tests ran (paired with `decision: 'ignore'`)
- `NO_TESTS` — no preset and no custom test cases supplied (paired with `decision: 'ignore'`)
Fleet signals (for downstream aggregators)
fleetSignals[] is a stable-code array designed for Fleet Analytics / Slack routing / Zapier. Every entry carries { code, severity, scope, actionability, detail?, field? }. The enum is additive-only within a major version.
| Code | Severity | Scope | Meaning |
|---|---|---|---|
| `SCHEMA_DRIFT_CRITICAL` | critical | field | Breaking drift on a required or critical-importance field |
| `SCHEMA_DRIFT_NONCRITICAL` | info | suite | Non-breaking drift across one or more fields |
| `TEST_FLAKY` | warning | test | Individual test's historical pass rate below 80% |
| `LOW_SUITE_COVERAGE` | warning | suite | Coverage score < 60 — suite has blind spots |
| `CRITICAL_FIELD_FAILURE` | critical | field | Assertion failed on a `fieldImportanceProfile.critical` field |
| `CONFIDENCE_REGRESSION` | warning | run | Recent confidence scores trending down |
| `RELEASE_BLOCKED` | critical | run | Verdict is block — do not promote |
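A sketch of how a downstream aggregator might route these signals; the channel names and the `postToSlack` transport are stand-ins for your own alerting:

```typescript
interface FleetSignal {
  code: string;
  severity: 'critical' | 'warning' | 'info';
  scope: 'field' | 'test' | 'suite' | 'run';
  actionability: string;
  detail?: string;
  field?: string;
}

// Placeholder transport: swap in your real Slack/webhook client.
function postToSlack(channel: string, text: string): void {
  console.log(`[${channel}] ${text}`);
}

// Critical signals page immediately; everything else goes to a digest.
function routeFleetSignals(signals: FleetSignal[]): void {
  for (const s of signals) {
    const channel = s.severity === 'critical' ? '#deploy-alerts' : '#deploy-digest';
    postToSlack(channel, `${s.code} (${s.scope})${s.field ? ` on ${s.field}` : ''}`);
  }
}
```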
Verdict + analytics (existing fields)
| Field | Type | Description |
|---|---|---|
| `status` | `'pass' \| 'warn' \| 'block'` | Raw verdict before the decision layer. Use decision for automation, status for display. |
| `score` | integer 0–100 | Composite confidence score. Capped at 70 during cold-start. |
| `summary` | string | Plain-language explanation. Not machine-stable. |
| `recommendations` | string[] | Suggested next actions derived from the failure mix. |
| `signals` | object | `{ errorCount, warningCount, criticalCount, driftDetected, metrics }` |
| `actorName` / `actorId` | string | The tested actor's display name + ID. |
| `totalTests` / `passed` / `failed` / `expectedFailures` | integer | Count breakdown. |
| `totalDuration` | number | Seconds across all test cases. |
| `results` | TestCaseResult[] | Per-test: assertions, schema contract, duration, forensics, error classification. |
| `releaseDecision` | object | Full detail: root cause, prioritised failures, actions, trust trend, regression velocity, early warnings, blind spots, suite health. |
| `drift` | DriftReport \| null | Field-level diff vs previous baseline. Null until enableBaseline is on + a baseline exists. |
| `stability` | TestStability[] \| null | Per-test pass rate + flakiness flag. Null on cold-start. |
| `history` | RunSnapshot[] \| null | Last 20 run snapshots. Null on cold-start. |
| `detectedActorType` | string | Heuristic: scraper / contact-scraper / api-actor / ecommerce / unknown. |
| `suggestedPreset` | string \| null | Preset that would give richer validation for the detected type. |
| `testedAt` | ISO 8601 | Timestamp of test completion. |
Key-value store outputs
- `SUMMARY` — flattened decision layer + counts + failed tests + context (dashboards should read this)
- `GITHUB_SUMMARY` — Markdown ready for `$GITHUB_STEP_SUMMARY` in Actions
- `HTML_REPORT` — standalone HTML ready to upload as a CI artefact
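A hedged sketch of pulling those records with the `apify-client` package after a run finishes; `run` is the finished Deploy Guard run object from the earlier client example:

```typescript
import { ApifyClient } from 'apify-client';
import fs from 'node:fs';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// `run` is the finished Deploy Guard run (see the client example earlier).
declare const run: { defaultKeyValueStoreId: string };

const store = client.keyValueStore(run.defaultKeyValueStoreId);

const summary = await store.getRecord('SUMMARY');          // JSON decision layer
const ghSummary = await store.getRecord('GITHUB_SUMMARY'); // Markdown report

// In GitHub Actions, append the Markdown straight into the job summary.
if (process.env.GITHUB_STEP_SUMMARY && ghSummary) {
  fs.appendFileSync(process.env.GITHUB_STEP_SUMMARY, String(ghSummary.value));
}
console.log(summary?.value);
```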
Automation contract
| Consumer | Read this field | Why |
|---|---|---|
| Slack / PagerDuty router | decision + statusHeadline | Enum routing, headline as alert title |
| CI/CD gate (GitHub Actions, etc.) | decision (exit 0 only on act_now + status: pass) | Stable enum, no prose parsing |
| LLM agent tool call | oneLine + verdictReasonCodes | One-liner for the model, codes for deterministic follow-up |
| Human debugging | releaseDecision.rootCause + results[].forensics | Traces back to the failing assertion |
Decision invariants
Deploy Guard enforces these in code — downstream consumers can rely on them without defensive checks:
```
decision = act_now implies:
  context.hasTrustedBaseline = true
  confidenceLevel != 'low'
  status != 'warn'
  totalTests > 0
  suiteLint.status != 'fail'

decision = monitor implies at least one of:
  context.hasTrustedBaseline = false (cold-start)
  confidenceLevel = 'low'
  status = 'warn'

decision = ignore implies:
  totalTests = 0 OR suiteLint.status = 'fail'
  To disambiguate why ignore fired, read verdictReasonCodes:
    'SUITE_LINT_FAILED' → suite was invalid, zero tests executed
    otherwise → preset + custom testCases both empty

decisionDrivers contract:
  - max length = 3
  - ordered by absolute score-impact points (higher first)
  - ties broken by alphabetical code
  - empty only when: decision = act_now + healthy history, OR decision = ignore
    (ignore paths already surface their reason via verdictReasonCodes:
    'NO_TESTS' or 'SUITE_LINT_FAILED')

remediation[] ordering (deterministic across runs):
  1. severity (critical > warning > info)
  2. score impact (per DEDUCTION_POINTS table)
  3. presence of affected-field list
  4. stable tie-break by type
  items[].priority reflects this 1..N order after sort.
```
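If you want a loud failure should a report ever violate these guarantees, a TypeScript guard like the following encodes them directly (defensive-only, per the note above):

```typescript
interface Report {
  decision: 'act_now' | 'monitor' | 'ignore';
  confidenceLevel: 'high' | 'medium' | 'low';
  status: 'pass' | 'warn' | 'block';
  totalTests: number;
  context: { hasTrustedBaseline: boolean };
  suiteLint: { status: 'pass' | 'warn' | 'fail' };
}

// Per the invariants, these checks should never fire; they exist to fail
// loudly if a future report ever violates the documented contract.
function assertInvariants(r: Report): void {
  if (r.decision === 'act_now') {
    const ok =
      r.context.hasTrustedBaseline &&
      r.confidenceLevel !== 'low' &&
      r.status !== 'warn' &&
      r.totalTests > 0 &&
      r.suiteLint.status !== 'fail';
    if (!ok) throw new Error('act_now emitted outside its documented invariants');
  }
  if (r.decision === 'ignore' && r.totalTests !== 0 && r.suiteLint.status !== 'fail') {
    throw new Error('ignore emitted with executed tests and a valid suite');
  }
}
```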
Decision flow
```
Input ─────▶ Resolve test cases (preset + custom + parameterized)
                    │
                    ▼
     Run each test via Actor.call()   ◀── 5-consecutive-failure
       → listItems()                      circuit breaker (cost guard)
       → checkAssertions()
                    │
                    ▼
     computeReleaseDecision
       (root cause, trust trend, drift, stability, suite health)
                    │
                    ▼
          hasTrustedBaseline ?
             ╱             ╲
           no               yes
           ▼                 ▼
 score = min(score, 70)      │
 + cold_start_cap code       │
             ╲             ╱
              ▼           ▼
     confidenceLevel = band(score)
                    │
                    ▼
     decision:
       ignore   (totalTests = 0)
       monitor  (cold-start OR low confidence OR warn verdict)
       act_now  ((pass or block) + medium/high + trusted baseline)
                    │
                    ▼
     pushData → setStatusMessage →
     KV SUMMARY / GITHUB_SUMMARY / HTML_REPORT
     → AQP store (field-rule suggestions for Output Guard)
```
When to trust the decision
| Scenario | decision | Confidence | Action |
|---|---|---|---|
| 5+ prior runs, pass, high confidence, no drift | act_now | high | Deploy |
| 5+ prior runs, block, critical failure, high confidence | act_now | high | Halt + investigate |
| First run ever | monitor | ≤70 (capped) | Review manually; run establishes baseline |
| Drift detected on a previously-stable field | monitor or act_now | varies | Inspect drift.changeSummary — may be intentional |
| 1 flaky test in a 5-test suite | act_now | medium | Acceptable if expectedToFail: true |
When NOT to trust the decision
| Scenario | Why | What to do instead |
|---|---|---|
| `monitor` + `cold_start_cap` code | No baseline context yet | Run on a schedule for 5+ iterations before gating CI |
| `verdictReasonCodes` contains `BASELINE_DRIFT` | Prior schema has changed | Inspect drift; may be intentional or regression |
| Single test in the suite | `low_sample_size` code | Add at least 3 tests; cold-start math dominates with one |
| Flakiness in `stability` | One test's pass rate < 80% | Fix the flake or mark `expectedToFail: true` |
| Fewer than 5 runs in history | `small_history` code | Trust trend is still warming up — wait for maturity |
Failure interpretation cheat sheet
Every failure mode maps to a stable code → a meaning → an action. Use this to route alerts and automate fixes without an LLM in the loop.
| Code | Where it appears | Meaning | Action |
|---|---|---|---|
| `CRITICAL_TEST_FAILURE` | verdictReasonCodes, decisionDrivers | A test marked `severity: 'critical'` failed — the release gate considers this blocking | Fix the underlying extractor/output before deploy |
| `WARNING_TEST_FAILURE` | verdictReasonCodes, decisionDrivers | A `severity: 'warning'` test failed — advisory | Investigate; accept if intentional |
| `BASELINE_DRIFT` | verdictReasonCodes | Field schema differs from prior baseline | Read `driftSeverity.breaking[]` + `driftSeverity.nonBreaking[]` |
| `BASELINE_DRIFT_BREAKING` | decisionDrivers, scoreBreakdown | Required or critical-importance field changed type or disappeared | Restore field OR update `schemaContract.requiredFields` + notify consumers |
| `BASELINE_DRIFT_NONBREAKING` | decisionDrivers, scoreBreakdown | New or renamed non-required fields | Usually safe — confirm consumers tolerate extras |
| `COLD_START` | verdictReasonCodes, decisionDrivers, confidenceFactorCodes (`cold_start_cap`) | No trusted baseline yet — confidence capped at 70, decision cannot be `act_now` | Run on a schedule with `enableBaseline: true`; graduate to `act_now` from run 2 onward |
| `LOW_SAMPLE_SIZE` | decisionDrivers, confidenceFactorCodes (`low_sample_size`) | Fewer than 3 test cases | Add tests; cold-start math dominates with one |
| `LOW_SUITE_COVERAGE` | decisionDrivers, confidenceFactorCodes (`low_suite_coverage`), fleetSignals | Suite uses fewer than 60% of assertion types | Read `suiteCoverage.blindSpots[]` and fix the top 2–3 |
| `FLAKY_TEST` | decisionDrivers, fleetSignals (`TEST_FLAKY`) | A test's historical pass rate is below 80% | Mark `expectedFlaky: true` OR fix the non-determinism |
| `CRITICAL_FIELD_FAILURE` | fleetSignals | Assertion failed on a field declared critical in `fieldImportanceProfile` | Read `criticalityImpact.affectedCriticalFields[]` |
| `CONFIDENCE_REGRESSION` | fleetSignals | Confidence score trending down over recent runs | Read `regressionSummary.direction` + `velocity`; investigate recent drift |
| `SUITE_LINT_FAILED` | verdictReasonCodes (paired with `decision: 'ignore'`) | Pre-execution lint blocked the run — suite design problem | Read `suiteLint.issues[].code` and fix the suite definition |
| `NO_TESTS` | verdictReasonCodes (paired with `decision: 'ignore'`) | Both preset and testCases were empty | Pick a preset OR provide at least one custom test case |
| `RELEASE_BLOCKED` | fleetSignals | Verdict is block (any reason) | Halt pipeline; do not promote |
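A sketch of routing codes to actions without an LLM in the loop; the action labels are illustrative, not part of the actor's contract. Unknown codes should pass through for triage because the enums are additive-only:

```typescript
// Illustrative action labels keyed by stable code (not part of the contract).
const ACTIONS: Record<string, string> = {
  CRITICAL_TEST_FAILURE: 'fix-extractor-before-deploy',
  BASELINE_DRIFT_BREAKING: 'restore-field-or-update-schema-contract',
  COLD_START: 'schedule-runs-to-build-baseline',
  LOW_SUITE_COVERAGE: 'read-blind-spots-and-add-assertions',
  FLAKY_TEST: 'mark-expectedFlaky-or-fix-nondeterminism',
  RELEASE_BLOCKED: 'halt-pipeline',
};

function actionsFor(codes: string[]): string[] {
  // Enums are additive-only, so unknown codes are expected over time;
  // pass them through for human triage instead of throwing.
  return codes.map((code) => ACTIONS[code] ?? `triage:${code}`);
}
```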
First run / second run / nth run
The context.progress field tells you exactly where you are.
| Runs | progress | What's active | What's still warming up |
|---|---|---|---|
| 0 (first) | cold-start | Assertions, verdict, forensic details | No baseline, drift, flakiness, or trust trend. Confidence capped at 70. decision ∈ {monitor, ignore}. |
| 1–4 | emerging | Baseline comparison (from run 2), drift fields, stability, run history begin populating | Flakiness unreliable with <5 samples. Trust trend not yet meaningful. decision can become act_now from run 2 when baseline is trusted. |
| 5–14 | developing | Trust trend, flakiness, auto-tune hints all reliable | Early warnings sharpen with more history. Suite health fully active. |
| 15+ | mature | Full intelligence: trust trend, regression velocity, blind spots, calibrated suggestions | — |
Note: enableBaseline: true is required for baselines, drift, stability, history, and trust trend. Without it, Deploy Guard still runs all assertions and emits a verdict — but context.hasTrustedBaseline stays false and decision is capped at monitor.
Example — full input + output
Input:
{"targetActorId": "ryanclinton/website-contact-scraper","preset": "contact-scraper","testCases": [{"name": "Smoke — known-good site","input": { "urls": ["https://example.com"] },"assertions": {"minResults": 1,"requiredFields": ["emails", "domain"],"maxDuration": 120}}],"enableBaseline": true,"timeout": 180}
Output — act_now + pass (safe to deploy):
{"decision": "act_now","decisionReason": "pass verdict + high confidence (82/100) + 12 prior runs — act_now","decisionDrivers": [],"confidenceLevel": "high","confidenceFactorCodes": ["healthy_history"],"verdictReasonCodes": ["VERDICT_PASS"],"statusHeadline": "SAFE TO DEPLOY — 2/2 passed (high confidence)","oneLine": "ryanclinton/website-contact-scraper: SAFE to deploy — 2/2 passed, 82/100 confidence","status": "pass","score": 82,"totalTests": 2,"passed": 2,"failed": 0}
Output — act_now + block (halt release):
{"decision": "act_now","decisionReason": "block verdict + medium confidence (58/100) + 9 prior runs — act_now","decisionDrivers": ["CRITICAL_TEST_FAILURE", "BASELINE_DRIFT_BREAKING"],"confidenceLevel": "medium","confidenceFactorCodes": ["drift_detected"],"verdictReasonCodes": ["VERDICT_BLOCK", "CRITICAL_TEST_FAILURE", "BASELINE_DRIFT"],"statusHeadline": "HALT RELEASE — 1/3 passed (medium confidence)","oneLine": "ryanclinton/website-contact-scraper: HALT — 1/3 passed, 58/100 confidence","status": "block","score": 58,"totalTests": 3,"passed": 1,"failed": 2}
Output — monitor + cold-start (first run, directional only):
{"decision": "monitor","decisionReason": "pass verdict + medium confidence (70/100, cold-start capped) — monitor only","decisionDrivers": ["COLD_START"],"confidenceLevel": "medium","confidenceFactorCodes": ["cold_start_cap"],"verdictReasonCodes": ["VERDICT_PASS", "COLD_START"],"statusHeadline": "PASS — 2/2 passed, low trust (monitor only)","oneLine": "ryanclinton/website-contact-scraper: PASS — 2/2 passed, 70/100 confidence — monitor","status": "pass","score": 70,"totalTests": 2,"passed": 2,"failed": 0,"context": { "progress": "cold-start", "hasTrustedBaseline": false, "runCount": 0 }}
Using Deploy Guard in GitHub Actions
```yaml
- name: Deploy Guard — pre-release check
  run: |
    RESULT=$(curl -s -X POST \
      "https://api.apify.com/v2/acts/ryanclinton~actor-test-runner/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "targetActorId": "ryanclinton/my-actor",
        "preset": "canary",
        "enableBaseline": true
      }')
    DECISION=$(echo "$RESULT" | jq -r '.[0].decision')
    HEADLINE=$(echo "$RESULT" | jq -r '.[0].statusHeadline')
    STATUS=$(echo "$RESULT" | jq -r '.[0].status')
    echo "Deploy Guard: $HEADLINE"
    if [ "$DECISION" != "act_now" ] || [ "$STATUS" != "pass" ]; then
      echo "::error::$HEADLINE"
      exit 1
    fi
```
The GITHUB_SUMMARY record in the default key-value store is served with Content-Type: text/markdown — ready to drop into $GITHUB_STEP_SUMMARY.
Pricing
$0.35 per test suite run (Pay-Per-Event, single test-suite event charged once per run after the report is pushed).
Your target actor's compute + that actor's own PPE charges are separate — they run on your account and bill at the target's rates. Deploy Guard only charges for the validation layer, not the underlying compute.
Deploy Guard logs the price at start:
PPE mode active — $0.35 per test suite run
And again in the final status message:
ACT NOW (deploy) — 2/2 passed in 8.4s — $0.35 charged
Cost guardrail: after 5 consecutive test failures, Deploy Guard breaks the loop to stop runaway sub-actor credit spend on a clearly broken target. Remaining tests are skipped; the verdict stands on what ran.
FAQ
How is Deploy Guard different from Apify's default-input test?
Apify's built-in default-input test runs your actor with {} once a day and flags it UNDER_MAINTENANCE after 3 consecutive failures. That's a single-test binary signal with no assertion detail, no drift, no confidence scoring, no per-field forensics. Deploy Guard runs a full assertion suite against arbitrary inputs, compares against a stored baseline, emits a routable decision tag, and produces GitHub/HTML/JSON reports. Default-input test is the floor; Deploy Guard is the gate.
Why is the first run always monitor?
Cold-start safety. Without a trusted baseline, Deploy Guard has no field schema history, no drift reference, no flakiness signal, and no run history to calibrate confidence. The score is capped at 70 and decision is forced to monitor. After the first run completes with enableBaseline: true, run number 2 has something to compare against and can graduate to act_now.
Can I use this in GitHub Actions?
Yes. Call the run-sync-get-dataset-items endpoint, parse the decision field, exit non-zero on anything other than act_now + status: pass. The key-value store also contains a GITHUB_SUMMARY record (Markdown) ready for $GITHUB_STEP_SUMMARY. See the example above.
Does it run all tests even if one fails?
By default, yes — tests run sequentially in the order you provide. If 5 consecutive tests fail, a circuit breaker halts remaining tests to cap cost and the run exits cleanly with the verdict derived from what ran. Mark known-broken tests with expectedToFail: true and they won't trip the breaker.
What's the difference between verdictReasonCodes and confidenceFactorCodes?
verdictReasonCodes explain what the verdict is — pass/warn/block and the specific failures that drove it (e.g. CRITICAL_TEST_FAILURE, BASELINE_DRIFT). confidenceFactorCodes explain how much to trust the verdict — whether enough data has accumulated, whether a baseline exists, whether drift signals are active. Both are stable enums; both are additive-only within a major version.
Does it cost credits?
Yes — $0.35 per suite for the Deploy Guard layer itself, plus whatever your target actor costs per run × N test cases. A suite with 5 test cases against a $0.10-per-result scraper that returns 20 results each costs: $0.35 (Deploy Guard) + 5 × 20 × $0.10 = $10.35 total. Deploy Guard only bills the $0.35; the rest bills on the target's pricing to your account.
Can I compare two actor versions side-by-side?
No — Deploy Guard tests one actor at a time. For side-by-side A/B comparison use A/B Tester, which runs the same input against two actors in parallel and returns a pairwise decision (switch_now / canary_recommended / monitor_only / no_call).
How do I detect flaky tests?
Enable enableBaseline: true and run on a schedule. Flakiness detection activates after 5 prior runs — Deploy Guard computes a per-test pass rate across run history and flags any test with a pass rate below 80% as flaky. The stability[] array shows { name, passRate, runs, flaky } per test case. Consumers should treat flaky: true tests as non-blocking — don't gate CI on them until you've fixed the underlying non-determinism.
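A small TypeScript helper along those lines; the `TestStability` shape mirrors the `stability[]` fields documented above, and the 0-to-1 `passRate` representation is an assumption (the docs state only the 80% threshold):

```typescript
// Mirrors the documented stability[] shape; null until enough history exists.
interface TestStability {
  name: string;
  passRate: number; // fraction of passing runs (representation assumed)
  runs: number;
  flaky: boolean;   // true when the pass rate falls below 80%
}

function flakyTests(stability: TestStability[] | null): string[] {
  if (!stability) return []; // cold-start: no flakiness signal yet
  return stability.filter((t) => t.flaky).map((t) => t.name);
}
```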
Can I supply different inputs per test case?
Yes — every testCase.input is independent. Use parameterizedTestCases to run the same template against many parameter sets (e.g. test the same URL shape with 20 different URLs). nameTemplate and inputTemplate support {{placeholder}} substitution.
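A hedged sketch of the shape this implies; `nameTemplate` and `inputTemplate` come from the answer above, while the key holding the parameter sets (`parameterSets` here) is an assumed name, so check the actor's input schema for the exact field:

```typescript
// `nameTemplate` / `inputTemplate` support {{placeholder}} substitution.
// "parameterSets" is an assumed key name for the parameter sets.
const parameterizedTestCases = [
  {
    nameTemplate: 'Contact scrape - {{domain}}',
    inputTemplate: { urls: ['https://{{domain}}'] },
    parameterSets: [{ domain: 'example.com' }, { domain: 'example.org' }],
    assertions: { minResults: 1, maxDuration: 120 },
  },
];
```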
What happens if the sub-actor times out?
Each Actor.call() is wrapped in a wall-clock race (timeout + 60s or 5 minutes minimum). On timeout, the test case is marked failed with failureType: 'timeout', and the suite continues. Two timeouts in a row don't break the suite — but 5 consecutive failures of any type trip the circuit breaker.
Why did my first run get a monitor decision even though every test passed?
Cold-start cap. The run succeeded, every assertion passed, and the verdict is pass — but without a stored baseline there's no history to calibrate confidence, so the score is capped at 70 and decision can't promote to act_now. Run it again (scheduled or manual) with enableBaseline: true and run number 2 onward will promote to act_now when the verdict stays healthy.
What Deploy Guard does NOT do
Deploy Guard is the pre-release test gate in a fleet of specialist actors. Use siblings for these adjacent jobs:
| Need | Use instead |
|---|---|
| Validate schema/quality of a PRODUCTION dataset after it runs (silent data failures, coverage drops, null spikes) | Output Guard — post-run data-quality monitor with incident lifecycle and channel-aware alerts |
| Compare two actor versions side-by-side on the same input | A/B Tester — pairwise decision engine with fairness checks and decision stability |
| Score a whole fleet's quality | Quality Monitor — fleet-wide quality scorer |
| Detect PII / GDPR / TOS risks in an actor's output | Compliance Scanner |
| Consolidated dashboard across the whole fleet | Fleet Health Report |
Deploy Guard's output is designed to feed these siblings — every run appends field-rule suggestions to a shared key-value store (the AQP store) that Output Guard picks up automatically. Pre-deploy assertions that fail here become production monitoring rules there without manual sync.
License
Proprietary. Runs on Apify. Source is available inside the platform for audit but not redistributable.