Dataset Quality Gate - Schema & Data QA

Pricing

Pay per usage

Validate Apify Datasets by pasted items, Dataset ID, or Run ID before delivery, automation, or AI/RAG ingestion. Catch schema drift, missing fields, duplicates, and bad URLs/emails/dates.

Developer

Juyeop Park

Maintained by Community

Dataset Quality Gate - Schema & Data QA for Apify

Stop bad JSON data before it reaches customers, automations, or AI/RAG pipelines.

Dataset Quality Gate validates Apify Dataset items and turns them into a clear pass/fail quality report. It accepts input in three ways:

  1. Paste JSON items directly.
  2. Provide an Apify Dataset ID and let the Actor fetch the items.
  3. Provide an Apify Run ID and let the Actor validate that run's default dataset.

It checks required fields, schema/type drift, duplicate IDs or URLs, URL/email/date formats, and field-level profiles — without external websites, proxies, cookies, AI tokens, or third-party APIs.

Best for

  • Scraped data delivery QA — catch missing URLs, empty titles, malformed emails, bad dates, and duplicate records before a customer sees them.
  • AI/RAG ingestion guardrails — reject malformed records before embeddings, retrieval, or agent workflows consume them.
  • Apify workflow monitoring — validate the default dataset from a finished Actor run by pasting only the runId.
  • Schema drift detection — confirm that fields remain strings, numbers, booleans, arrays, objects, or nulls as expected.
  • Automation fail gates — stop CI/CD, webhook, or scheduled pipelines when data quality falls below the expected contract.

Why use this instead of manual inspection?

Manual spot checks miss silent failures: one upstream selector change can leave hundreds of records with missing URLs, empty titles, duplicated IDs, or type changes that only break later. Dataset Quality Gate makes those checks repeatable and machine-readable while still producing a Markdown report for human review.

Input modes

1. Validate by Run ID — easiest for Apify workflows

Use this when you want to check the output of a previous Actor run.

{
  "reportName": "Daily scraper QA",
  "runId": "YOUR_ACTOR_RUN_ID",
  "maxItems": 1000,
  "validateAllItems": false,
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}

The Actor resolves the run's default dataset automatically.

2. Validate by Dataset ID

Use this when you already know the dataset to inspect.

{
  "reportName": "Customer delivery dataset QA",
  "datasetId": "YOUR_DATASET_ID",
  "maxItems": 5000,
  "validateAllItems": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email"
  }
}

With validateAllItems=true, the Actor paginates through the dataset until it is exhausted or maxItems is reached.

3. Validate pasted JSON items

Use this for a quick manual sample or local testing.

{
  "reportName": "Daily lead dataset QA",
  "items": [
    {
      "id": "lead-1",
      "url": "https://example.com/company-a",
      "email": "owner@example.com",
      "publishedAt": "2026-05-20"
    },
    {
      "id": "lead-2",
      "url": "https://example.com/company-b",
      "email": "sales@example.com",
      "publishedAt": "2026-05-21"
    }
  ],
  "requiredFields": ["id", "url", "email"],
  "expectedSchema": {
    "id": "string",
    "url": "string",
    "email": "string",
    "publishedAt": "string"
  },
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email",
    "publishedAt": "date"
  }
}

If datasetId or runId is provided, the external source is used and pasted items are ignored.
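The precedence rule above can be sketched in a few lines of Python. This is only an illustration, not the Actor's code: the function name resolve_source is hypothetical, and checking runId before datasetId is an assumption, since the text only states that either one overrides pasted items.

```python
from typing import Any


def resolve_source(actor_input: dict[str, Any]) -> str:
    """Return which source would be validated: 'run', 'dataset', or 'items'."""
    # Assumption: runId is checked before datasetId. The documentation only
    # says that either of them takes precedence over pasted items.
    if actor_input.get("runId"):
        return "run"
    if actor_input.get("datasetId"):
        return "dataset"
    if actor_input.get("items"):
        return "items"
    raise ValueError("Provide one of: runId, datasetId, or items")


# Pasted items are ignored because an external source is present.
print(resolve_source({"datasetId": "YOUR_DATASET_ID", "items": [{"id": "lead-1"}]}))
```

To avoid surprises, it is probably simplest to send exactly one of the three source fields per run.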

Sampling and full-dataset controls

  • maxItems: safety cap on how many dataset records are checked. Default: 1000, maximum: 50000.
  • validateAllItems=false: fetch a single sample page of up to maxItems records.
  • validateAllItems=true: paginate until the dataset is exhausted or maxItems is reached.
  • itemOffset: number of records to skip before validation starts.
  • itemOrder=first: validate the earliest stored items first.
  • itemOrder=last: validate the latest stored items first.

The output includes source metadata such as source type, dataset ID, run ID, checked item count, available item count, and whether the source was truncated by the configured limit.
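A minimal Python sketch of how these controls might interact is shown below. It is an illustration under assumptions, not the Actor's implementation: load_items, fetch_page, and page_size are hypothetical names, the dataset is faked with an in-memory list, itemOrder is left out, and the way sourceTruncated is derived is a guess based on the field names in the output.

```python
from typing import Callable


def load_items(
    fetch_page: Callable[[int, int], list],
    total_available: int,
    max_items: int = 1000,
    validate_all: bool = False,
    item_offset: int = 0,
    page_size: int = 250,  # hypothetical page size
) -> dict:
    """Collect items from a paged source while honouring the maxItems cap."""
    items: list = []
    offset = item_offset
    if not validate_all:
        # Sample mode: a single page of up to max_items records.
        items = fetch_page(offset, max_items)[:max_items]
    else:
        while len(items) < max_items:
            limit = min(page_size, max_items - len(items))
            page = fetch_page(offset, limit)
            if not page:
                break  # dataset exhausted
            items.extend(page)
            offset += len(page)
    remaining = max(total_available - item_offset, 0)
    return {
        "items": items,
        "sourceLoadedItems": len(items),
        "sourceTotalAvailable": total_available,
        # Assumption: truncated means fewer items were loaded than were available.
        "sourceTruncated": len(items) < remaining,
    }


# Fake dataset with 2500 records, served page by page.
dataset = [{"id": i} for i in range(2500)]
fetch = lambda offset, limit: dataset[offset:offset + limit]

result = load_items(fetch, len(dataset), max_items=1000, validate_all=True)
print(result["sourceLoadedItems"], result["sourceTruncated"])  # 1000 True
```

Raising maxItems above the dataset size in this sketch loads every record and reports sourceTruncated as false.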

What it checks

  • Required field presence and non-empty values
  • Expected top-level JSON types: string, number, boolean, object, array, null
  • Duplicate values for configured unique fields
  • URL, email, and date format checks
  • Field-level profiling:
    • present / missing counts
    • null counts
    • empty string counts
    • type distribution
    • sample values
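The field-level profile listed above could be computed roughly as follows. This is a sketch only: profile_fields, json_type, and the keys of the returned dictionaries are hypothetical names chosen for illustration, and the real OUTPUT report may be shaped differently.

```python
from collections import Counter


def json_type(value) -> str:
    """Map a value decoded from JSON to its JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def profile_fields(items: list, max_samples: int = 3) -> dict:
    """Build a per-field profile for the top-level fields of all items."""
    fields = sorted({key for item in items for key in item})
    profiles = {}
    for field in fields:
        values = [item[field] for item in items if field in item]
        profiles[field] = {
            "present": len(values),
            "missing": len(items) - len(values),
            "nullCount": sum(v is None for v in values),
            "emptyStringCount": sum(v == "" for v in values),
            "types": dict(Counter(json_type(v) for v in values)),
            "samples": values[:max_samples],
        }
    return profiles


items = [{"id": "a", "title": ""}, {"id": None}, {"id": 3, "title": "x"}]
profiles = profile_fields(items)
print(profiles["title"]["missing"])  # 1
print(profiles["id"]["types"])
```

A type distribution with more than one entry, as for id here, is usually the first visible sign of schema drift.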

Supported expectedSchema values: string, number, boolean, object, array, null.

Supported formatRules values: url, email, date.

Unsupported rule values fail fast with a clear input error.
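The type rules, format rules, and fail-fast behaviour described above might look roughly like this in Python. Everything here is an assumption made for illustration: the function names, the loose email pattern, the restriction of url to http/https, the use of ISO 8601 calendar dates for date, and the TYPE_MISMATCH code (only FORMAT_INVALID appears in the documented output).

```python
import re
from datetime import date
from urllib.parse import urlparse

SUPPORTED_TYPES = {"string", "number", "boolean", "object", "array", "null"}


def json_type(value) -> str:
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def is_url(value: str) -> bool:
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_email(value: str) -> bool:
    # Deliberately loose pattern; the Actor's actual rule is not documented.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


def is_date(value: str) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


FORMAT_CHECKS = {"url": is_url, "email": is_email, "date": is_date}


def check_rules(expected_schema: dict, format_rules: dict) -> None:
    """Fail fast on unsupported rule values before any item is inspected."""
    bad_types = sorted(set(expected_schema.values()) - SUPPORTED_TYPES)
    bad_formats = sorted(set(format_rules.values()) - set(FORMAT_CHECKS))
    if bad_types or bad_formats:
        raise ValueError(f"Unsupported rule values: {bad_types + bad_formats}")


def validate_item(item: dict, expected_schema: dict, format_rules: dict) -> list:
    issues = []
    for field, expected in expected_schema.items():
        if field in item and json_type(item[field]) != expected:
            issues.append(("TYPE_MISMATCH", field))  # hypothetical issue code
    for field, rule in format_rules.items():
        value = item.get(field)
        if value is None or value == "":
            continue  # missing or empty values are left to requiredFields
        if not isinstance(value, str) or not FORMAT_CHECKS[rule](value):
            issues.append(("FORMAT_INVALID", field))
    return issues


schema = {"id": "string", "publishedAt": "string"}
rules = {"url": "url", "email": "email", "publishedAt": "date"}
check_rules(schema, rules)
item = {"id": 7, "url": "example.com/a", "email": "owner@example.com",
        "publishedAt": "2026-05-20"}
print(validate_item(item, schema, rules))
# [('TYPE_MISMATCH', 'id'), ('FORMAT_INVALID', 'url')]
```

Checking the rule values once, before touching any item, keeps a typo such as "uri" from silently disabling a check across the whole dataset.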

Run modes

1. Report-only mode — default

Set failRunOnError=false.

The Actor writes the summary dataset item plus full OUTPUT and REPORT artifacts. The Actor run can still succeed even when the quality report status is fail.

Use this for audits, dashboards, scheduled QA reports, manual review, and exploratory checks.

2. Fail-gate mode

Set failRunOnError=true.

The Actor writes the full JSON and Markdown report first, then fails the run if validation errors are found.

Use this for CI/CD gates, webhook pipelines, production automations, and any workflow where bad data must stop the next step.
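The difference between the two modes comes down to ordering: persist the report first, then decide whether the run fails. A minimal Python sketch of that ordering follows, with a plain dictionary standing in for the key-value store. The function name finish_run and the use of a non-zero exit code to fail the run are assumptions, not the Actor's documented internals.

```python
import json


def finish_run(report: dict, markdown: str, store: dict, fail_run_on_error: bool) -> int:
    """Persist both artifacts, then return the exit code for the run."""
    # Artifacts are written first so diagnostics survive a failed run.
    store["OUTPUT"] = json.dumps(report, indent=2)
    store["REPORT"] = markdown
    if fail_run_on_error and report["errorCount"] > 0:
        return 1  # assumption: a non-zero exit code marks the run as failed
    return 0


report = {"status": "fail", "passed": False, "errorCount": 5}

report_only_store: dict = {}
print(finish_run(report, "# Daily lead dataset QA", report_only_store, False))  # 0

fail_gate_store: dict = {}
print(finish_run(report, "# Daily lead dataset QA", fail_gate_store, True))  # 1
print(sorted(fail_gate_store))  # ['OUTPUT', 'REPORT']
```

In both calls the report status is fail; only the fail-gate call turns that status into a failed run.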

Output

The Actor pushes one summary item to the default dataset:

{
  "type": "dataset_quality_report",
  "status": "fail",
  "passed": false,
  "reportName": "Daily lead dataset QA",
  "totalItems": 1000,
  "fieldCount": 12,
  "totalIssues": 5,
  "errorCount": 5,
  "warningCount": 0,
  "issueCountsByCode": {
    "REQUIRED_FIELD_MISSING": 1,
    "DUPLICATE_UNIQUE_FIELD": 1,
    "FORMAT_INVALID": 3
  },
  "sourceType": "run",
  "sourceDatasetId": "abc123",
  "sourceRunId": "run123",
  "sourceLoadedItems": 1000,
  "sourceTotalAvailable": 2500,
  "sourceTruncated": true
}

It also stores full artifacts in the default key-value store:

  • OUTPUT — full JSON report including all issues, field profiles, and source metadata
  • REPORT — Markdown report for human review and sharing
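A downstream step can act on the summary item without opening the full report. The Python sketch below shows one possible policy using only fields from the summary item shown earlier in this section. The function names and the three-way proceed/review/stop policy are hypothetical; in particular, treating a truncated pass as "review" is a design choice, not Actor behaviour.

```python
def gate_decision(summary: dict) -> str:
    """Classify a summary item as 'proceed', 'review', or 'stop'."""
    if not summary["passed"]:
        return "stop"
    if summary.get("sourceTruncated"):
        # Only part of the dataset was checked, so a pass is not conclusive.
        return "review"
    return "proceed"


def top_issues(summary: dict) -> list:
    """Return (code, count) pairs, most frequent first."""
    counts = summary.get("issueCountsByCode", {})
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


summary = {
    "status": "fail",
    "passed": False,
    "errorCount": 5,
    "issueCountsByCode": {
        "REQUIRED_FIELD_MISSING": 1,
        "DUPLICATE_UNIQUE_FIELD": 1,
        "FORMAT_INVALID": 3,
    },
    "sourceLoadedItems": 1000,
    "sourceTotalAvailable": 2500,
    "sourceTruncated": True,
}

print(gate_decision(summary))  # stop
print(top_issues(summary)[0])  # ('FORMAT_INVALID', 3)
```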

Common workflow recipes

Customer delivery QA

{
  "runId": "YOUR_SCRAPER_RUN_ID",
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["id", "url"],
  "formatRules": { "url": "url" }
}

AI/RAG ingestion QA

{
  "datasetId": "YOUR_DATASET_ID",
  "requiredFields": ["url", "title", "text"],
  "expectedSchema": {
    "url": "string",
    "title": "string",
    "text": "string"
  },
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}

Production fail gate

{
  "runId": "YOUR_UPSTREAM_RUN_ID",
  "failRunOnError": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "email"],
  "formatRules": { "url": "url", "email": "email" }
}

Pricing

This Actor is configured for pay per event (PPE) pricing.

  • Quality report: $0.05 per completed quality report
  • Actor start: $0.005 per run start

The pricing is intentionally simple and predictable: you pay for one completed dataset QA report, not for external API calls or proxy usage. The Actor does not call AI APIs or third-party scraping services.
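As a rough worked example of the two event prices above, the Python snippet below estimates event charges for a number of runs. It assumes each run triggers exactly one Actor start event and one quality report event; whether failed or aborted runs are billed differently is not stated here, and any other platform costs are not included.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")    # per run start
QUALITY_REPORT = Decimal("0.05")  # per completed quality report


def estimated_cost(runs: int) -> Decimal:
    """Event charges in USD, assuming one start and one report per run."""
    return runs * (ACTOR_START + QUALITY_REPORT)


print(estimated_cost(1))   # 0.055
print(estimated_cost(30))  # 1.650
```

Under that assumption, a daily scheduled check costs about $1.65 in event charges over a 30-day month.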

Limitations

  • Validation covers top-level JSON fields only; nested object schema validation is not included yet.
  • runId and datasetId must be accessible from the account running the Actor.
  • validateAllItems=true is still capped by maxItems to avoid runaway memory/time usage.
  • It does not perform semantic validation or AI-based judgement.
  • Future expansion candidates: Dataset Diff, PII Scanner, RAG Readiness Checker, and webhook/Slack failure notifications.

FAQ

Does it use proxies, cookies, or third-party APIs?

No. It validates JSON items and can read Apify Dataset/Run data available to the running account. It does not scrape websites, use proxies, or call external AI APIs.

Does a failed quality gate still produce a report?

Yes. In fail-gate mode, the Actor writes OUTPUT and REPORT first, then fails the run so your automation can stop while diagnostics remain available.

What should I use as uniqueFields?

Good candidates are stable identifiers such as id, url, sku, email, profileUrl, productUrl, or any field that should appear only once per dataset.
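To see what a uniqueness check on such fields involves, here is a small Python sketch. It is not the Actor's implementation: find_duplicates is a hypothetical name, and two behaviours are assumptions, namely that missing or empty values are not counted as duplicates and that only string and numeric values are compared.

```python
from collections import defaultdict


def find_duplicates(items: list, unique_fields: list) -> dict:
    """Return {field: {value: [item indexes]}} for values seen more than once."""
    seen = {field: defaultdict(list) for field in unique_fields}
    for index, item in enumerate(items):
        for field in unique_fields:
            value = item.get(field)
            # Assumption: skip missing, empty, and non-scalar values.
            if value == "" or isinstance(value, bool):
                continue
            if not isinstance(value, (str, int, float)):
                continue
            seen[field][value].append(index)
    return {
        field: {value: idx for value, idx in values.items() if len(idx) > 1}
        for field, values in seen.items()
    }


items = [
    {"id": "lead-1", "url": "https://example.com/a"},
    {"id": "lead-2", "url": "https://example.com/a"},
    {"id": "lead-3", "url": "https://example.com/b"},
]
print(find_duplicates(items, ["id", "url"]))
# {'id': {}, 'url': {'https://example.com/a': [0, 1]}}
```

The example also shows why url is often a better uniqueness key than id for scraped data: a scraper can assign fresh IDs to the same page twice.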

Is this only for AI/RAG workflows?

No. AI/RAG ingestion is a strong use case, but the Actor is equally useful for customer data delivery, lead lists, product datasets, SEO datasets, scheduled audits, and webhook-based automation.

Local development

npm install
npm test
npm run local

Smoke tests with the Apify local runtime:

apify run --purge --input-file INPUT.pass.example.json
apify run --purge --input-file INPUT.fail.example.json
apify run --purge --input-file INPUT.fail-gate.example.json