Dataset Quality Gate - Schema & Data QA
Pricing
Pay per usage
Validate Apify Datasets by pasted items, Dataset ID, or Run ID before delivery, automation, or AI/RAG ingestion. Catch schema drift, missing fields, duplicates, and bad URLs/emails/dates.
Developer
Juyeop Park
Maintained by Community
Dataset Quality Gate - Schema & Data QA for Apify
Stop bad JSON data before it reaches customers, automations, or AI/RAG pipelines.
Dataset Quality Gate validates Apify Dataset items and turns them into a clear pass/fail quality report. You can now use it in three ways:
- Paste JSON items directly.
- Provide an Apify Dataset ID and let the Actor fetch the items.
- Provide an Apify Run ID and let the Actor validate that run's default dataset.
It checks required fields, schema/type drift, duplicate IDs or URLs, URL/email/date formats, and field-level profiles — without external websites, proxies, cookies, AI tokens, or third-party APIs.
Best for
- Scraped data delivery QA: catch missing URLs, empty titles, malformed emails, bad dates, and duplicate records before a customer sees them.
- AI/RAG ingestion guardrails: reject malformed records before embeddings, retrieval, or agent workflows consume them.
- Apify workflow monitoring: validate the default dataset from a finished Actor run by pasting only the runId.
- Schema drift detection: confirm that fields remain strings, numbers, booleans, arrays, objects, or nulls as expected.
- Automation fail gates: stop CI/CD, webhook, or scheduled pipelines when data quality falls below the expected contract.
Why use this instead of manual inspection?
Manual spot checks miss silent failures: one upstream selector change can leave hundreds of records with missing URLs, empty titles, duplicated IDs, or type changes that only break later. Dataset Quality Gate makes those checks repeatable and machine-readable while still producing a Markdown report for human review.
Input modes
1. Validate by Run ID — easiest for Apify workflows
Use this when you want to check the output of a previous Actor run.
{
  "reportName": "Daily scraper QA",
  "runId": "YOUR_ACTOR_RUN_ID",
  "maxItems": 1000,
  "validateAllItems": false,
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}
The Actor resolves the run's default dataset automatically.
2. Validate by Dataset ID
Use this when you already know the dataset to inspect.
{
  "reportName": "Customer delivery dataset QA",
  "datasetId": "YOUR_DATASET_ID",
  "maxItems": 5000,
  "validateAllItems": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email"
  }
}
validateAllItems=true paginates through the dataset until it is exhausted or maxItems is reached.
3. Validate pasted JSON items
Use this for a quick manual sample or local testing.
{
  "reportName": "Daily lead dataset QA",
  "items": [
    {
      "id": "lead-1",
      "url": "https://example.com/company-a",
      "email": "owner@example.com",
      "publishedAt": "2026-05-20"
    },
    {
      "id": "lead-2",
      "url": "https://example.com/company-b",
      "email": "sales@example.com",
      "publishedAt": "2026-05-21"
    }
  ],
  "requiredFields": ["id", "url", "email"],
  "expectedSchema": {
    "id": "string",
    "url": "string",
    "email": "string",
    "publishedAt": "string"
  },
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email",
    "publishedAt": "date"
  }
}
If datasetId or runId is provided, the external source is used and pasted items are ignored.
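The precedence rule above can be sketched as a small helper. This is an illustration, not the Actor's code: the function name `resolve_source`, the `"dataset"` and `"items"` source type labels, and the choice to prefer `runId` when both IDs are supplied are all assumptions (the README only states that external sources win over pasted items).

```python
from typing import Any


def resolve_source(actor_input: dict[str, Any]) -> dict[str, Any]:
    """Pick the data source for validation from an input object (hypothetical helper)."""
    run_id = actor_input.get("runId")
    dataset_id = actor_input.get("datasetId")
    if run_id:
        # Assumption: a run ID wins if both IDs are given; the README does not say.
        return {"sourceType": "run", "runId": run_id}
    if dataset_id:
        return {"sourceType": "dataset", "datasetId": dataset_id}
    # No external source: fall back to the pasted items.
    return {"sourceType": "items", "items": actor_input.get("items", [])}


# Pasted items are ignored because a datasetId is present.
example = resolve_source({"datasetId": "YOUR_DATASET_ID", "items": [{"id": "ignored"}]})
print(example)
```

If you keep both pasted samples and an ID in the same saved input, only the ID is used, so remove the ID when you want to test against the pasted sample.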
Sampling and full-dataset controls
- maxItems: safety cap for how many dataset records to check. Default: 1000, maximum: 50000.
- validateAllItems=false: fetch one sample page using maxItems.
- validateAllItems=true: paginate until the dataset is exhausted or maxItems is reached.
- itemOffset: skip records before validation.
- itemOrder=first: validate the earliest stored items first.
- itemOrder=last: validate the latest stored items first.
The output includes source metadata such as source type, dataset ID, run ID, checked item count, available item count, and whether the source was truncated by the configured limit.
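To make the interaction between these controls concrete, here is a minimal sketch of capped loading against an in-memory stand-in for a dataset. The `load_items` function, the `fetch_page` callback, and the `page_size` value are hypothetical, and the way `sourceTruncated` is computed (fewer items loaded than remain after the offset) is an assumption about the reported flag. `itemOrder` is left out for brevity.

```python
from typing import Any, Callable

Page = list[dict[str, Any]]


def load_items(
    fetch_page: Callable[[int, int], Page],
    total_available: int,
    max_items: int = 1000,
    validate_all_items: bool = False,
    item_offset: int = 0,
    page_size: int = 250,  # hypothetical page size
) -> dict[str, Any]:
    """Load up to max_items records, starting at item_offset."""
    items: Page = []
    if not validate_all_items:
        # Sample mode: a single request sized by max_items.
        items = fetch_page(item_offset, max_items)
    else:
        offset = item_offset
        while len(items) < max_items:
            limit = min(page_size, max_items - len(items))
            page = fetch_page(offset, limit)
            if not page:
                break  # dataset exhausted
            items.extend(page)
            offset += len(page)
    remaining = max(total_available - item_offset, 0)
    return {
        "items": items,
        "sourceLoadedItems": len(items),
        "sourceTotalAvailable": total_available,
        "sourceTruncated": len(items) < remaining,
    }


# In-memory stand-in for a 2500-item dataset.
dataset = [{"id": i} for i in range(2500)]
fetch = lambda offset, limit: dataset[offset:offset + limit]

result = load_items(fetch, len(dataset), max_items=1000, validate_all_items=True)
print(result["sourceLoadedItems"], result["sourceTruncated"])  # 1000 True
```

When the truncation flag is true, the report covers only part of the dataset, so a pass does not vouch for the unchecked remainder.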
What it checks
- Required field presence and non-empty values
- Expected top-level JSON types: string, number, boolean, object, array, null
- Duplicate values for configured unique fields
- URL, email, and date format checks
- Field-level profiling:
  - present / missing counts
  - null counts
  - empty string counts
  - type distribution
  - sample values
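As a rough picture of what a field-level profile contains, the sketch below computes the listed statistics for one field. The function names and the output key names (`present`, `missing`, `nullCount`, and so on) are illustrative assumptions, not the Actor's actual report schema.

```python
from collections import Counter
from typing import Any


def json_type(value: Any) -> str:
    """Map a Python value to one of the six JSON type names."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def profile_field(items: list[dict[str, Any]], field: str, max_samples: int = 3) -> dict[str, Any]:
    """Summarize presence, nulls, empty strings, and types for one top-level field."""
    present = [item[field] for item in items if field in item]
    return {
        "present": len(present),
        "missing": len(items) - len(present),
        "nullCount": sum(v is None for v in present),
        "emptyStringCount": sum(v == "" for v in present),
        "types": dict(Counter(json_type(v) for v in present)),
        "samples": present[:max_samples],
    }


items = [{"url": "https://example.com/a"}, {"url": ""}, {"url": None}, {}]
profile = profile_field(items, "url")
print(profile["present"], profile["missing"], profile["types"])
```

A type distribution with more than one entry for a field that should always be a string is usually the first visible sign of schema drift.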
Supported expectedSchema values: string, number, boolean, object, array, null.
Supported formatRules values: url, email, date.
Unsupported rule values fail fast with a clear input error.
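The checks and the fail-fast behavior can be approximated in a few dozen lines. Everything here is a sketch under stated assumptions: the function names, the `TYPE_MISMATCH` code, the treatment of null and empty strings as missing, and the specific URL, email, and date tests are guesses at reasonable behavior, not the Actor's rules. The other three issue codes are the ones shown in this README's sample output.

```python
import json
import re
from datetime import date
from typing import Any
from urllib.parse import urlparse

SUPPORTED_TYPES = {"string", "number", "boolean", "object", "array", "null"}
SUPPORTED_FORMATS = {"url", "email", "date"}


def json_type(value: Any) -> str:
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be tested before int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def matches_format(value: Any, rule: str) -> bool:
    if not isinstance(value, str):
        return False
    try:
        if rule == "url":
            parts = urlparse(value)
            return parts.scheme in ("http", "https") and bool(parts.netloc)
        if rule == "email":
            return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None
        date.fromisoformat(value)  # rule == "date"; assumption: ISO calendar dates only
        return True
    except ValueError:
        return False


def check_items(items, required=(), schema=None, unique=(), formats=None):
    schema, formats = schema or {}, formats or {}
    # Fail fast on unsupported rule values before looking at any item.
    bad = (set(schema.values()) - SUPPORTED_TYPES) | (set(formats.values()) - SUPPORTED_FORMATS)
    if bad:
        raise ValueError(f"Unsupported rule values: {sorted(bad)}")
    issues = []
    seen = {field: set() for field in unique}

    def add(code, index, field):
        issues.append({"code": code, "index": index, "field": field})

    for index, item in enumerate(items):
        for field in required:
            if item.get(field) in (None, ""):  # assumption: null and "" count as missing
                add("REQUIRED_FIELD_MISSING", index, field)
        for field, expected in schema.items():
            if field in item and json_type(item[field]) != expected:
                add("TYPE_MISMATCH", index, field)  # hypothetical code name
        for field in unique:
            if item.get(field) is not None:
                key = json.dumps(item[field], sort_keys=True)  # hashable key for any JSON value
                if key in seen[field]:
                    add("DUPLICATE_UNIQUE_FIELD", index, field)
                seen[field].add(key)
        for field, rule in formats.items():
            if item.get(field) not in (None, "") and not matches_format(item[field], rule):
                add("FORMAT_INVALID", index, field)
    return issues


items = [
    {"id": "lead-1", "url": "https://example.com/company-a", "email": "owner@example.com"},
    {"id": "lead-1", "url": "not-a-url", "email": ""},
]
issues = check_items(
    items,
    required=["id", "url", "email"],
    schema={"id": "string"},
    unique=["id"],
    formats={"url": "url", "email": "email"},
)
print([issue["code"] for issue in issues])
```

Empty values are reported once as missing rather than again as badly formatted, which keeps the issue counts from double-counting the same defect.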
Run modes
1. Report-only mode — default
Set failRunOnError=false.
The Actor writes the summary dataset item plus full OUTPUT and REPORT artifacts. The Actor run can still succeed even when the quality report status is fail.
Use this for audits, dashboards, scheduled QA reports, manual review, and exploratory checks.
2. Fail-gate mode
Set failRunOnError=true.
The Actor writes the full JSON and Markdown report first, then fails the run if validation errors are found.
Use this for CI/CD gates, webhook pipelines, production automations, and any workflow where bad data must stop the next step.
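The difference between the two modes comes down to ordering: artifacts are written first, and only then does the run fail. A minimal sketch of that ordering, with a plain dict standing in for the key-value store and a hypothetical `QualityGateError` and `finish_run`:

```python
from typing import Any


class QualityGateError(RuntimeError):
    """Raised after artifacts are written when the gate should fail the run."""


def finish_run(
    report: dict[str, Any],
    markdown: str,
    store: dict[str, Any],  # stand-in for the default key-value store
    fail_run_on_error: bool = False,
) -> None:
    # Write both artifacts first so diagnostics survive a failed run.
    store["OUTPUT"] = report
    store["REPORT"] = markdown
    if fail_run_on_error and report["errorCount"] > 0:
        raise QualityGateError(f"{report['errorCount']} validation errors found")


store: dict[str, Any] = {}
report = {"status": "fail", "errorCount": 5}
try:
    finish_run(report, "# QA report", store, fail_run_on_error=True)
except QualityGateError as exc:
    print("run failed:", exc)
print(sorted(store))  # ['OUTPUT', 'REPORT']
```

With `fail_run_on_error=False` the same call returns normally, and the caller has to read the report status to learn that validation failed.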
Output
The Actor pushes one summary item to the default dataset:
{
  "type": "dataset_quality_report",
  "status": "fail",
  "passed": false,
  "reportName": "Daily lead dataset QA",
  "totalItems": 1000,
  "fieldCount": 12,
  "totalIssues": 5,
  "errorCount": 5,
  "warningCount": 0,
  "issueCountsByCode": {
    "REQUIRED_FIELD_MISSING": 1,
    "DUPLICATE_UNIQUE_FIELD": 1,
    "FORMAT_INVALID": 3
  },
  "sourceType": "run",
  "sourceDatasetId": "abc123",
  "sourceRunId": "run123",
  "sourceLoadedItems": 1000,
  "sourceTotalAvailable": 2500,
  "sourceTruncated": true
}
It also stores full artifacts in the default key-value store:
- OUTPUT: full JSON report including all issues, field profiles, and source metadata
- REPORT: Markdown report for human review and sharing
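In report-only mode, a downstream step has to read the summary item itself to decide whether to proceed. One possible gate is sketched below. The function `should_continue` and the policy of rejecting truncated sources by default are assumptions for illustration; the field names come from the summary item shown above.

```python
from typing import Any


def should_continue(summary: dict[str, Any], allow_truncated: bool = False) -> bool:
    """Decide whether a downstream step may consume the validated dataset."""
    if summary.get("type") != "dataset_quality_report":
        raise ValueError("not a quality report summary")
    if not summary["passed"]:
        return False
    # Assumed policy: a pass on a truncated sample does not cover the whole dataset.
    if summary.get("sourceTruncated") and not allow_truncated:
        return False
    return True


summary = {
    "type": "dataset_quality_report",
    "status": "fail",
    "passed": False,
    "errorCount": 5,
    "sourceTruncated": True,
}
print(should_continue(summary))  # False
```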
Common workflow recipes
Customer delivery QA
{
  "runId": "YOUR_SCRAPER_RUN_ID",
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["id", "url"],
  "formatRules": { "url": "url" }
}
AI/RAG ingestion QA
{
  "datasetId": "YOUR_DATASET_ID",
  "requiredFields": ["url", "title", "text"],
  "expectedSchema": {
    "url": "string",
    "title": "string",
    "text": "string"
  },
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}
Production fail gate
{
  "runId": "YOUR_UPSTREAM_RUN_ID",
  "failRunOnError": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "email"],
  "formatRules": { "url": "url", "email": "email" }
}
Pricing
This Actor is configured for pay per event (PPE) pricing.
- Quality report: $0.05 per completed quality report
- Actor start: $0.005 per run start
The pricing is intentionally simple and predictable: you pay for one completed dataset QA report, not for external API calls or proxy usage. The Actor does not call AI APIs or third-party scraping services.
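As a worked example of the two event prices, the arithmetic below estimates the event charges for a given number of runs. It assumes each run triggers exactly one start event and one completed report, and it covers only these two events; any other platform charges are outside this sketch.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")    # per run start
QUALITY_REPORT = Decimal("0.05")  # per completed quality report


def estimated_cost(runs: int) -> Decimal:
    """Event charges in USD, assuming one start and one report per run."""
    return runs * (ACTOR_START + QUALITY_REPORT)


print(estimated_cost(1))   # 0.055
print(estimated_cost(30))  # 1.650
```

A daily scheduled check therefore comes to about $1.65 in event charges over 30 days under these assumptions.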
Limitations
- Validation is top-level JSON field validation only.
- Nested object schema validation is not included yet.
- runId and datasetId must be accessible from the account running the Actor.
- validateAllItems=true is still capped by maxItems to avoid runaway memory/time usage.
- It does not perform semantic validation or AI-based judgement.
- Future expansion candidates: Dataset Diff, PII Scanner, RAG Readiness Checker, and webhook/Slack failure notifications.
FAQ
Does it use proxies, cookies, or third-party APIs?
No. It validates JSON items and can read Apify Dataset/Run data available to the running account. It does not scrape websites, use proxies, or call external AI APIs.
Does a failed quality gate still produce a report?
Yes. In fail-gate mode, the Actor writes OUTPUT and REPORT first, then fails the run so your automation can stop while diagnostics remain available.
What should I use as uniqueFields?
Good candidates are stable identifiers such as id, url, sku, email, profileUrl, productUrl, or any field that should appear only once per dataset.
Is this only for AI/RAG workflows?
No. AI/RAG ingestion is a strong use case, but the Actor is equally useful for customer data delivery, lead lists, product datasets, SEO datasets, scheduled audits, and webhook-based automation.
Local development
npm install
npm test
npm run local
Apify local runtime smoke:
apify run --purge --input-file INPUT.pass.example.json
apify run --purge --input-file INPUT.fail.example.json
apify run --purge --input-file INPUT.fail-gate.example.json