Pricing

Pay per usage

Dataset Quality Gate - Schema & Data QA

Validate Apify Datasets by pasted items, Dataset ID, or Run ID before delivery, automation, or AI/RAG ingestion. Catch schema drift, missing fields, duplicates, and bad URLs/emails/dates.

Developer

Juyeop Park

Dataset Quality Gate - Schema & Data QA for Apify

Stop bad JSON data before it reaches customers, automations, or AI/RAG pipelines.

Dataset Quality Gate validates Apify Dataset items and turns them into a clear pass/fail quality report. You can use it in three ways:

Paste JSON items directly.
Provide an Apify Dataset ID and let the Actor fetch the items.
Provide an Apify Run ID and let the Actor validate that run's default dataset.

It checks required fields, schema/type drift, duplicate IDs or URLs, URL/email/date formats, and field-level profiles — without external websites, proxies, cookies, AI tokens, or third-party APIs.

Best for

Scraped data delivery QA — catch missing URLs, empty titles, malformed emails, bad dates, and duplicate records before a customer sees them.
AI/RAG ingestion guardrails — reject malformed records before embeddings, retrieval, or agent workflows consume them.
Apify workflow monitoring — validate the default dataset from a finished Actor run by pasting only the runId.
Schema drift detection — confirm that fields remain strings, numbers, booleans, arrays, objects, or nulls as expected.
Automation fail gates — stop CI/CD, webhook, or scheduled pipelines when data quality falls below the expected contract.

Why use this instead of manual inspection?

Manual spot checks miss silent failures: one upstream selector change can leave hundreds of records with missing URLs, empty titles, duplicated IDs, or type changes that only break later. Dataset Quality Gate makes those checks repeatable and machine-readable while still producing a Markdown report for human review.

Input modes

1. Validate by Run ID — easiest for Apify workflows

Use this when you want to check the output of a previous Actor run.

{
  "reportName": "Daily scraper QA",
  "runId": "YOUR_ACTOR_RUN_ID",
  "maxItems": 1000,
  "validateAllItems": false,
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}

The Actor resolves the run's default dataset automatically.

2. Validate by Dataset ID

Use this when you already know the dataset to inspect.

{
  "reportName": "Customer delivery dataset QA",
  "datasetId": "YOUR_DATASET_ID",
  "maxItems": 5000,
  "validateAllItems": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email"
  }
}

validateAllItems=true paginates through the dataset until it is exhausted or maxItems is reached.

3. Validate pasted JSON items

Use this for a quick manual sample or local testing.

{
  "reportName": "Daily lead dataset QA",
  "items": [
    {
      "id": "lead-1",
      "url": "https://example.com/company-a",
      "email": "owner@example.com",
      "publishedAt": "2026-05-20"
    },
    {
      "id": "lead-2",
      "url": "https://example.com/company-b",
      "email": "sales@example.com",
      "publishedAt": "2026-05-21"
    }
  ],
  "requiredFields": ["id", "url", "email"],
  "expectedSchema": {
    "id": "string",
    "url": "string",
    "email": "string",
    "publishedAt": "string"
  },
  "uniqueFields": ["id", "url"],
  "formatRules": {
    "url": "url",
    "email": "email",
    "publishedAt": "date"
  }
}

If datasetId or runId is provided, the external source is used and pasted items are ignored.
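The precedence rule above can be sketched as a small resolver. This is an illustration only, not the Actor's code: the function name `resolve_source` is hypothetical, the `"dataset"` and `"items"` source type values are assumed by analogy with the documented `"run"`, and the choice to prefer runId over datasetId when both are present is an assumption the text does not confirm.

```python
from typing import Any, Dict


def resolve_source(actor_input: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the data source; external sources win over pasted items."""
    run_id = actor_input.get("runId")
    dataset_id = actor_input.get("datasetId")
    items = actor_input.get("items")

    if run_id:  # assumption: runId is checked before datasetId
        return {"sourceType": "run", "sourceRunId": run_id}
    if dataset_id:
        return {"sourceType": "dataset", "sourceDatasetId": dataset_id}
    if items:
        return {"sourceType": "items", "itemCount": len(items)}
    raise ValueError("Provide one of: runId, datasetId, or items")
```

With this shape, an input that carries both `runId` and `items` resolves to the run, so a leftover pasted sample does not mask the real dataset.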

Sampling and full-dataset controls

maxItems — safety cap for how many dataset records to check. Default: 1000, maximum: 50000.
validateAllItems=false — fetch a single page of up to maxItems records as a sample.
validateAllItems=true — paginate until the dataset is exhausted or maxItems is reached.
itemOffset — skip records before validation.
itemOrder=first — validate the earliest stored items first.
itemOrder=last — validate the latest stored items first.

The output includes source metadata such as source type, dataset ID, run ID, checked item count, available item count, and whether the source was truncated by the configured limit.
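To make the interaction between maxItems, validateAllItems, and itemOffset concrete, here is a minimal sketch of a loader. It is not the Actor's implementation: `load_items`, `make_fake_dataset`, the page size of 250, and the exact definition of "truncated" (fewer records loaded than remain after the offset) are all assumptions for illustration. The metadata keys mirror the summary item documented under Output.

```python
from typing import Callable, Dict, List, Tuple

Item = Dict[str, object]
# Hypothetical page fetcher: (offset, limit) -> (items, total_available)
FetchPage = Callable[[int, int], Tuple[List[Item], int]]


def load_items(fetch_page: FetchPage, max_items: int = 1000,
               validate_all: bool = False, item_offset: int = 0,
               page_size: int = 250) -> Dict[str, object]:
    """Collect up to max_items records and report source metadata."""
    items: List[Item] = []
    total_available = 0
    offset = item_offset
    while len(items) < max_items:
        if validate_all:
            limit = min(page_size, max_items - len(items))  # page by page
        else:
            limit = max_items  # sample mode: one request
        page, total_available = fetch_page(offset, limit)
        if not page:
            break  # dataset exhausted
        items.extend(page)
        offset += len(page)
        if not validate_all:
            break  # sample mode stops after a single page
    remaining = max(total_available - item_offset, 0)
    return {
        "items": items,
        "sourceLoadedItems": len(items),
        "sourceTotalAvailable": total_available,
        "sourceTruncated": len(items) < remaining,
    }


def make_fake_dataset(size: int) -> FetchPage:
    """In-memory stand-in for a dataset API, used only for this demo."""
    records: List[Item] = [{"id": f"row-{i}"} for i in range(size)]

    def fetch_page(offset: int, limit: int) -> Tuple[List[Item], int]:
        return records[offset:offset + limit], len(records)

    return fetch_page
```

Against a 2500-record fake dataset with the default cap of 1000 and `validate_all=True`, this loader stops at 1000 records and flags the source as truncated, which matches the shape of the sample summary item.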

What it checks

Required field presence and non-empty values
Expected top-level JSON types: string, number, boolean, object, array, null
Duplicate values for configured unique fields
URL, email, and date format checks
Field-level profiling:
- present / missing counts
- null counts
- empty string counts
- type distribution
- sample values
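As a rough picture of what a field-level profile contains, the sketch below computes the listed statistics for top-level fields. The function names and the profile key names (`present`, `missing`, `nullCount`, `emptyStringCount`, `types`, `samples`) are hypothetical; the real OUTPUT artifact may use a different layout.

```python
from collections import Counter
from typing import Any, Dict, List


def json_type(value: Any) -> str:
    """Map a Python value to its JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def profile_fields(items: List[Dict[str, Any]],
                   max_samples: int = 3) -> Dict[str, Dict[str, Any]]:
    """Build one profile per top-level field seen in any item."""
    field_names = sorted({key for item in items for key in item})
    profiles: Dict[str, Dict[str, Any]] = {}
    for name in field_names:
        values = [item[name] for item in items if name in item]
        samples: List[Any] = []
        for value in values:
            # Keep a few distinct, non-empty example values.
            if (value is not None and value != ""
                    and value not in samples and len(samples) < max_samples):
                samples.append(value)
        profiles[name] = {
            "present": len(values),
            "missing": len(items) - len(values),
            "nullCount": sum(1 for v in values if v is None),
            "emptyStringCount": sum(1 for v in values if v == ""),
            "types": dict(Counter(json_type(v) for v in values)),
            "samples": samples,
        }
    return profiles
```

A type distribution with more than one entry for a field, such as both `string` and `number`, is the kind of signal that points to schema drift.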

Supported expectedSchema values: string, number, boolean, object, array, null.

Supported formatRules values: url, email, date.

Unsupported rule values fail fast with a clear input error.
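The checks listed above can be approximated in a few dozen lines. The sketch below is an illustration under stated assumptions, not the Actor's code: the `validate` function, the loose email regex, the ISO-only date check, and the `TYPE_MISMATCH` code are hypothetical, while `REQUIRED_FIELD_MISSING`, `DUPLICATE_UNIQUE_FIELD`, and `FORMAT_INVALID` are the codes shown in the sample output.

```python
import re
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, List, Optional
from urllib.parse import urlparse

JSON_TYPES = {"string", "number", "boolean", "object", "array", "null"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose


def json_type(value: Any) -> str:
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"


def is_url(value: Any) -> bool:
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_email(value: Any) -> bool:
    return isinstance(value, str) and EMAIL_RE.match(value) is not None


def is_date(value: Any) -> bool:
    # Assumption: dates are ISO 8601 strings such as "2026-05-20".
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


FORMAT_CHECKS: Dict[str, Callable[[Any], bool]] = {
    "url": is_url, "email": is_email, "date": is_date,
}


def validate(items: List[Dict[str, Any]],
             required_fields: Iterable[str] = (),
             expected_schema: Optional[Dict[str, str]] = None,
             unique_fields: Iterable[str] = (),
             format_rules: Optional[Dict[str, str]] = None) -> List[Dict[str, Any]]:
    expected_schema = expected_schema or {}
    format_rules = format_rules or {}
    # Fail fast on unsupported rule values before touching any item.
    for field, type_name in expected_schema.items():
        if type_name not in JSON_TYPES:
            raise ValueError(f"Unsupported expectedSchema value for {field}: {type_name}")
    for field, rule in format_rules.items():
        if rule not in FORMAT_CHECKS:
            raise ValueError(f"Unsupported formatRules value for {field}: {rule}")

    issues: List[Dict[str, Any]] = []
    seen: Dict[str, set] = {field: set() for field in unique_fields}

    def add(code: str, index: int, field: str) -> None:
        issues.append({"code": code, "index": index, "field": field})

    for index, item in enumerate(items):
        for field in required_fields:
            if item.get(field) is None or item.get(field) == "":
                add("REQUIRED_FIELD_MISSING", index, field)
        for field, type_name in expected_schema.items():
            if field in item and json_type(item[field]) != type_name:
                add("TYPE_MISMATCH", index, field)  # hypothetical code name
        for field, values in seen.items():
            value = item.get(field)
            if value is None or value == "":
                continue
            key = repr(value)  # repr keeps lists and dicts hashable
            if key in values:
                add("DUPLICATE_UNIQUE_FIELD", index, field)
            values.add(key)
        for field, rule in format_rules.items():
            value = item.get(field)
            if value is not None and value != "" and not FORMAT_CHECKS[rule](value):
                add("FORMAT_INVALID", index, field)
    return issues
```

Empty and missing values are reported only by the required-field check here, so a single blank field does not also show up as a format or uniqueness issue. Whether the Actor makes the same choice is not stated.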

Run modes

1. Report-only mode — default

Set failRunOnError=false.

The Actor writes the summary dataset item plus full OUTPUT and REPORT artifacts. The Actor run can still succeed even when the quality report status is fail.

Use this for audits, dashboards, scheduled QA reports, manual review, and exploratory checks.

2. Fail-gate mode

Set failRunOnError=true.

The Actor writes the full JSON and Markdown report first, then fails the run if validation errors are found.

Use this for CI/CD gates, webhook pipelines, production automations, and any workflow where bad data must stop the next step.
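The difference between the two run modes comes down to what happens after the artifacts are written. The following sketch shows that ordering with a plain dict standing in for the key-value store. `finish_run`, `QualityGateFailed`, and the Markdown layout are hypothetical names and choices made for this example.

```python
import json
from typing import Any, Dict, List, MutableMapping


class QualityGateFailed(Exception):
    """Raised in fail-gate mode, after the artifacts have been stored."""


def finish_run(issues: List[Dict[str, Any]],
               store: MutableMapping[str, str],
               fail_run_on_error: bool = False) -> Dict[str, Any]:
    error_count = len(issues)
    report = {
        "status": "fail" if error_count else "pass",
        "passed": error_count == 0,
        "errorCount": error_count,
        "issues": issues,
    }

    # Step 1: always persist both artifacts, whatever the outcome.
    store["OUTPUT"] = json.dumps(report, indent=2)
    lines = [f"Quality report: {report['status']}", ""]
    lines += [f"- {issue['code']}" for issue in issues]
    store["REPORT"] = "\n".join(lines)

    # Step 2: only fail-gate mode turns validation errors into a failed run.
    if fail_run_on_error and error_count:
        raise QualityGateFailed(f"{error_count} validation error(s) found")
    return report
```

Because the exception is raised last, a failed gate still leaves OUTPUT and REPORT available for diagnosis.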

Output

The Actor pushes one summary item to the default dataset:

{
  "type": "dataset_quality_report",
  "status": "fail",
  "passed": false,
  "reportName": "Daily lead dataset QA",
  "totalItems": 1000,
  "fieldCount": 12,
  "totalIssues": 5,
  "errorCount": 5,
  "warningCount": 0,
  "issueCountsByCode": {
    "REQUIRED_FIELD_MISSING": 1,
    "DUPLICATE_UNIQUE_FIELD": 1,
    "FORMAT_INVALID": 3
  },
  "sourceType": "run",
  "sourceDatasetId": "abc123",
  "sourceRunId": "run123",
  "sourceLoadedItems": 1000,
  "sourceTotalAvailable": 2500,
  "sourceTruncated": true
}

It also stores full artifacts in the default key-value store:

OUTPUT — full JSON report including all issues, field profiles, and source metadata
REPORT — Markdown report for human review and sharing
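A downstream step usually needs only the summary item to decide whether to continue. Here is one possible way to read it, using the field names from the sample above. The `gate_decision` helper and the policy of rejecting truncated sources by default are assumptions for illustration, not behavior of the Actor.

```python
from typing import Any, Dict, Tuple


def gate_decision(summary: Dict[str, Any],
                  allow_truncated: bool = False) -> Tuple[bool, str]:
    """Return (proceed, reason) for a quality report summary item."""
    if summary.get("type") != "dataset_quality_report":
        return False, "not a quality report item"
    if not summary.get("passed", False):
        counts = summary.get("issueCountsByCode", {})
        detail = ", ".join(f"{code}={n}" for code, n in sorted(counts.items()))
        return False, f"{summary.get('errorCount', 0)} error(s): {detail}"
    if summary.get("sourceTruncated") and not allow_truncated:
        # A pass on a partial sample says nothing about unchecked records.
        return False, "report passed but only part of the dataset was checked"
    return True, "ok"
```

Checking `sourceTruncated` alongside `passed` keeps a clean result on 1000 of 2500 records from being read as a clean result on the whole dataset.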

Common workflow recipes

Customer delivery QA

{
  "runId": "YOUR_SCRAPER_RUN_ID",
  "requiredFields": ["id", "url", "title"],
  "uniqueFields": ["id", "url"],
  "formatRules": { "url": "url" }
}

AI/RAG ingestion QA

{
  "datasetId": "YOUR_DATASET_ID",
  "requiredFields": ["url", "title", "text"],
  "expectedSchema": {
    "url": "string",
    "title": "string",
    "text": "string"
  },
  "uniqueFields": ["url"],
  "formatRules": { "url": "url" }
}

Production fail gate

{
  "runId": "YOUR_UPSTREAM_RUN_ID",
  "failRunOnError": true,
  "requiredFields": ["id", "url", "email"],
  "uniqueFields": ["id", "email"],
  "formatRules": { "url": "url", "email": "email" }
}
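The production fail gate recipe can be triggered from a script once an upstream run has finished. The sketch below assumes the official `apify-client` Python package, whose `ApifyClient` offers `actor(...).call(run_input=...)` and returns the finished run as a dict with a `status` key. The Actor ID is a hypothetical placeholder, since this page does not state it, and the client is passed in so the function itself has no network dependency.

```python
from typing import Any, Dict, Optional

# Hypothetical placeholder: replace with this Actor's real ID from its Store page.
GATE_ACTOR_ID = "YOUR_USERNAME/dataset-quality-gate"


def build_gate_input(upstream_run_id: str) -> Dict[str, Any]:
    """Input for fail-gate mode, mirroring the recipe above."""
    return {
        "runId": upstream_run_id,
        "failRunOnError": True,
        "requiredFields": ["id", "url", "email"],
        "uniqueFields": ["id", "email"],
        "formatRules": {"url": "url", "email": "email"},
    }


def run_quality_gate(client: Any, upstream_run_id: str,
                     actor_id: str = GATE_ACTOR_ID) -> bool:
    """Start the gate and report whether the data may move on."""
    run: Optional[Dict[str, Any]] = client.actor(actor_id).call(
        run_input=build_gate_input(upstream_run_id)
    )
    # In fail-gate mode a failed run status means validation errors were found.
    return run is not None and run.get("status") == "SUCCEEDED"


# Usage (requires `pip install apify-client` and a real token):
#   from apify_client import ApifyClient
#   client = ApifyClient("YOUR_APIFY_TOKEN")
#   if not run_quality_gate(client, "YOUR_UPSTREAM_RUN_ID"):
#       raise SystemExit("Quality gate failed; see the REPORT artifact")
```

Relying on the run status keeps the calling script small. The OUTPUT and REPORT artifacts of the gate run remain available for anyone who needs the details.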

Pricing

This Actor is configured for pay per event (PPE) pricing.

Quality report: $0.05 per completed quality report
Actor start: $0.005 per run start

The pricing is intentionally simple and predictable: each run costs the start fee plus one charge per completed dataset QA report ($0.055 in total for a run that completes its report), with no charges for external API calls or proxy usage. The Actor does not call AI APIs or third-party scraping services.
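For budgeting, the two listed event prices can be combined into a quick estimate. This is a back-of-the-envelope sketch: `estimate_cost` is a hypothetical helper, and it assumes a run that does not complete its report is charged only the start event, which the page does not state explicitly.

```python
from decimal import Decimal
from typing import Optional

ACTOR_START = Decimal("0.005")    # per run start, from the price list above
QUALITY_REPORT = Decimal("0.05")  # per completed quality report


def estimate_cost(runs: int, completed_reports: Optional[int] = None) -> Decimal:
    """Estimated charge in USD for a number of runs."""
    if completed_reports is None:
        completed_reports = runs  # assume every run completes its report
    if not 0 <= completed_reports <= runs:
        raise ValueError("completed_reports must be between 0 and runs")
    return ACTOR_START * runs + QUALITY_REPORT * completed_reports


# One daily run for 30 days: 30 * (0.005 + 0.05) = 1.65 USD
monthly = estimate_cost(30)
```

Using `Decimal` avoids the floating-point rounding that would otherwise creep into sums of values such as 0.005.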

Limitations

Validation covers top-level JSON fields only.
Nested object schema validation is not included yet, so fields inside nested objects and arrays are not checked.
runId and datasetId must be accessible from the account running the Actor.
validateAllItems=true is still capped by maxItems to avoid runaway memory/time usage.
It does not perform semantic validation or AI-based judgement.
Future expansion candidates: Dataset Diff, PII Scanner, RAG Readiness Checker, and webhook/Slack failure notifications.

FAQ

Does it use proxies, cookies, or third-party APIs?

No. It validates JSON items and can read Apify Dataset/Run data available to the running account. It does not scrape websites, use proxies, or call external AI APIs.

Does a failed quality gate still produce a report?

Yes. In fail-gate mode, the Actor writes OUTPUT and REPORT first, then fails the run so your automation can stop while diagnostics remain available.

What should I use as uniqueFields?

Good candidates are stable identifiers such as id, url, sku, email, profileUrl, productUrl, or any field that should appear only once per dataset.
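If it is unclear which fields are stable identifiers, a known-good sample can be scanned for fields whose values never repeat. The helper below is a hypothetical pre-check to run on your own data, not a feature of the Actor. It treats null and empty-string values as absent, which is an assumption.

```python
import json
from typing import Any, Dict, List


def uniqueness_ratio(items: List[Dict[str, Any]], field: str) -> float:
    """Share of non-empty values for a field that are distinct (0.0 to 1.0)."""
    values = [
        json.dumps(item[field], sort_keys=True)  # serialise so any JSON value is hashable
        for item in items
        if item.get(field) is not None and item.get(field) != ""
    ]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def suggest_unique_fields(items: List[Dict[str, Any]],
                          threshold: float = 1.0) -> List[str]:
    """Fields present in every item whose values are distinct enough."""
    if not items:
        return []
    common = set(items[0])
    for item in items[1:]:
        common &= set(item)
    return sorted(f for f in common if uniqueness_ratio(items, f) >= threshold)
```

A field that scores 1.0 on a clean sample is only a candidate. Whether it should stay unique is a decision about the data, for example `email` may legitimately repeat across product records.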

Is this only for AI/RAG workflows?

No. AI/RAG ingestion is a strong use case, but the Actor is equally useful for customer data delivery, lead lists, product datasets, SEO datasets, scheduled audits, and webhook-based automation.

Local development

npm install
npm test
npm run local

Smoke tests with the local Apify runtime:

apify run --purge --input-file INPUT.pass.example.json
apify run --purge --input-file INPUT.fail.example.json
apify run --purge --input-file INPUT.fail-gate.example.json
