Pricing

from $3.90 / 1,000 records checked

Dataset QA Auditor

Validate Apify dataset rows for schema drift, null spikes, duplicate keys, type mismatches, and delivery-readiness issues. Outputs row-level QA results plus a summary report.

Developer: junipr

Last modified: 10 days ago

Store Positioning

Store title: Dataset QA Auditor

Short description: Validate Apify dataset rows for schema drift, null spikes, duplicate keys, type mismatches, and delivery-readiness issues. Outputs row-level QA results plus a summary report.

SEO title: Dataset QA Auditor — data QA, validation, and cleanup utility

SEO description: Validate Apify dataset rows for schema drift, null spikes, duplicate keys, type mismatches, and delivery-readiness issues. Outputs row-level QA results plus a summary report. Use it to validate rows, schemas, duplicates, field quality, and delivery-readiness before handing data to clients or automations.

Categories: SEO_TOOLS, DEVELOPER_TOOLS

Keywords: dataset, qa, auditor, apify actor, schema, data qa, dataset qa, api testing, data/schema qa utility

Fixed-Inclusive PPE Pricing

This actor uses pay-per-event pricing. Event prices include Apify platform usage; users are not expected to pay a separate platform-usage pass-through charge for the configured pricing model.

Tier: U2 — Data/schema QA utility
Primary event: record-checked at $0.00390 base
Default max charge: $5.00
Store discounts: FREE/BRONZE base, SILVER discounted, GOLD deepest approved discount

Event set:

actor-start: base $0.00500, GOLD $0.00400. Charged when the actor start completes.
record-checked: base $0.00390, GOLD $0.00312. Charged when a record check completes.
issue-detected: base $0.00372, GOLD $0.00298. Charged when an issue detection completes.
qa-report-generated: base $0.05000, GOLD $0.04000. Charged when QA report generation completes.

Every event price above includes Apify platform usage; no separate usage pass-through is intended.
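
As a rough way to budget a run, the event prices above can be combined into an estimate. The sketch below assumes one actor-start event per run, one record-checked event per checked row, one issue-detected event per reported issue, and at most one qa-report-generated event per run. The listing does not spell out these multiplicities, so treat them as assumptions; the function names are illustrative and not part of the actor.

```python
from decimal import Decimal

# Base (FREE/BRONZE) and GOLD event prices copied from the event set above.
PRICES = {
    "base": {
        "actor-start": Decimal("0.00500"),
        "record-checked": Decimal("0.00390"),
        "issue-detected": Decimal("0.00372"),
        "qa-report-generated": Decimal("0.05000"),
    },
    "gold": {
        "actor-start": Decimal("0.00400"),
        "record-checked": Decimal("0.00312"),
        "issue-detected": Decimal("0.00298"),
        "qa-report-generated": Decimal("0.04000"),
    },
}

DEFAULT_MAX_CHARGE = Decimal("5.00")


def estimate_run_cost(records, issues, report=True, tier="base"):
    """Estimate the charge in USD for one run (assumed event multiplicities)."""
    p = PRICES[tier]
    total = p["actor-start"]                 # assumed: once per run
    total += p["record-checked"] * records   # assumed: once per checked row
    total += p["issue-detected"] * issues    # assumed: once per reported issue
    if report:
        total += p["qa-report-generated"]    # assumed: once per run
    return total


def max_records_within(budget=DEFAULT_MAX_CHARGE, tier="base", report=True):
    """Largest record count that fits the budget if no issues are detected."""
    p = PRICES[tier]
    fixed = p["actor-start"]
    if report:
        fixed += p["qa-report-generated"]
    return int((budget - fixed) // p["record-checked"])


print(estimate_run_cost(records=50, issues=3))  # 0.26116
print(max_records_within())                     # 1267
```

Under these assumptions, a default 50-row run with three issues and a report costs about $0.26 at base prices, and the default $5.00 max charge covers roughly 1,267 clean rows. Each detected issue lowers that row count.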

Public Task Concepts

Audit Dataset QA controls on a capped public sample
Find high-priority Dataset QA issues before release
Validate Dataset QA evidence from supplied pages
Prioritize Dataset QA fixes with severity and proof
Export Dataset QA rows for client review

Validate Apify datasets for schema drift, null spikes, duplicate rows, type errors, and delivery readiness before you hand data to a client, importer, automation, or downstream model.

What This Actor Does

Dataset QA Auditor checks either inline records or an Apify dataset sample and emits row-level QA results. It is built for the common failure modes that make scraped or transformed datasets painful to deliver:

Duplicate business keys such as repeated product IDs, lead IDs, or URLs.
Missing fields, empty fields, and null spikes across a dataset.
Type drift such as a numeric field becoming a string.
Semantic field validation for URLs, emails, dates, integers, arrays, objects, booleans, and nulls.
Unexpected fields when strict schema checks are enabled.

It also stores a Markdown and JSON summary report in the key-value store.
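
To make two of these checks concrete, here is a minimal local sketch of duplicate-key detection and null-share calculation. It is not the actor's implementation. The helper names are hypothetical, key values are assumed to be hashable scalars, and treating an empty string as "empty" is an assumption.

```python
from collections import Counter


def find_duplicate_rows(rows, key_fields):
    """Return indexes of rows whose key repeats an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Composite key built from the chosen fields, in order.
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def null_shares(rows, fields):
    """Share of rows in which each field is missing, None, or an empty string."""
    if not rows:
        return {}
    empty = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                empty[field] += 1
    return {field: empty[field] / len(rows) for field in fields}


rows = [
    {"id": "a", "email": None},
    {"id": "b", "email": "x@example.com"},
    {"id": "a", "email": ""},
]
print(find_duplicate_rows(rows, ["id"]))  # [2]

shares = null_shares(rows, ["id", "email"])
threshold = 0.5  # plays the role of a null-spike threshold
print([field for field, share in shares.items() if share >= threshold])  # ['email']
```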

What This Actor Does Not Do

It does not enrich, scrape, or collect new data.
It does not validate sensitive personal data or provide legal, medical, or financial advice.
It does not guarantee a dataset is compliant with any regulation.
It does not automatically fix rows; it tells you exactly what should be fixed before delivery.

Best Use Cases

QA a scraper output before sending it to a customer.
Detect schema drift after changing selectors or parsers.
Check lead lists, product catalogs, app-review exports, job datasets, or monitoring feeds before import.
Generate a concise QA delivery report for a dataset delivery workflow.
Run a scheduled check against a stable Apify dataset sample.

Input Fields

records: Inline object rows to audit. The default sample intentionally contains duplicate, null, and type issues so the zero-config run demonstrates useful output.
datasetId: Optional Apify dataset ID. When present, the actor reads rows from that dataset instead of records.
expectedSchema: Field-to-type map. Supported types are string, number, integer, boolean, object, array, null, date, url, and email.
keyFields: One or more fields used to detect duplicate rows.
strictSchema: When enabled, missing and unexpected fields are treated as more severe readiness issues.
nullSpikeThreshold: Warn when a field is null or missing in at least this share of checked rows.
maxItems: Maximum records to check. The hard cap is 10,000 and the default is 50.
includeReport: Store Markdown and JSON reports in the default key-value store.
debug: Enable extra troubleshooting logs.
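
Since every checked row is billable, it can help to sanity-check an input before starting a run. The sketch below checks only the constraints stated in the field list (supported type names and the 10,000 maxItems cap). The function name, the lower bound of 1 for maxItems, and the rule that keyFields should appear in expectedSchema are assumptions for illustration, not documented actor behavior.

```python
# Type names and limits copied from the input field descriptions above.
SUPPORTED_TYPES = {
    "string", "number", "integer", "boolean", "object",
    "array", "null", "date", "url", "email",
}
MAX_ITEMS_HARD_CAP = 10_000
DEFAULT_MAX_ITEMS = 50


def check_run_input(run_input):
    """Return a list of problems found in a run input dict (empty if none)."""
    problems = []
    schema = run_input.get("expectedSchema") or {}
    for field, type_name in schema.items():
        if type_name not in SUPPORTED_TYPES:
            problems.append(f"expectedSchema.{field}: unsupported type {type_name!r}")

    max_items = run_input.get("maxItems", DEFAULT_MAX_ITEMS)
    is_int = isinstance(max_items, int) and not isinstance(max_items, bool)
    # Lower bound of 1 is an assumption; only the upper cap is documented.
    if not is_int or not 1 <= max_items <= MAX_ITEMS_HARD_CAP:
        problems.append(f"maxItems: expected an integer from 1 to {MAX_ITEMS_HARD_CAP}")

    # Assumed convention: key fields should be part of the expected schema.
    for key in run_input.get("keyFields") or []:
        if schema and key not in schema:
            problems.append(f"keyFields: {key!r} is not in expectedSchema")
    return problems


bad_input = {
    "expectedSchema": {"id": "string", "price": "float"},
    "keyFields": ["sku"],
    "maxItems": 20000,
}
for problem in check_run_input(bad_input):
    print(problem)
```

An input that passes returns an empty list, so a pipeline can refuse to start the run when the list is non-empty.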

Example Input

{
  "records": [
    {
      "id": "lead-001",
      "company": "Northwind Studio",
      "website": "https://example.com",
      "email": "hello@example.com",
      "employeeCount": 12,
      "country": "US"
    },
    {
      "id": "lead-002",
      "company": "Contoso Works",
      "website": "https://example.org",
      "email": null,
      "employeeCount": "27",
      "country": "US"
    }
  ],
  "expectedSchema": {
    "id": "string",
    "company": "string",
    "website": "url",
    "email": "email",
    "employeeCount": "integer",
    "country": "string"
  },
  "keyFields": ["id"],
  "maxItems": 50,
  "includeReport": true
}

Output Fields

Each dataset item represents one checked row:

auditId: Stable hash for the checked row.
sourceType and sourceId: Whether the row came from inline input or an Apify dataset.
rowIndex: Zero-based row index in the checked input.
recordKey: Duplicate-detection key assembled from keyFields.
status: pass, warn, or fail.
severity: Highest issue severity on the row.
issues: Detailed issue list with code, field, severity, and message.
duplicateKey: Duplicate key value when this row repeats an earlier row.
missingFields, extraFields, nullFields, and typeMismatches: Structured issue details.
recommendation: Short next step for the row.
issueCount, fieldCount, and checkedAt: Number of issues on the row, number of fields in the row, and the check timestamp, as they appear in the example output.

The key-value store also contains:

QA_REPORT.md: Human-readable summary.
QA_SUMMARY.json: Machine-readable summary.

Example Output

{
  "auditId": "da3735af7de3eca7",
  "sourceType": "inline",
  "sourceId": "inline-records",
  "rowIndex": 1,
  "recordKey": "lead-002",
  "status": "fail",
  "severity": "high",
  "issueCount": 2,
  "issues": [
    {
      "code": "missing-field",
      "severity": "medium",
      "field": "email",
      "message": "Expected field \"email\" is missing or empty."
    },
    {
      "code": "type-mismatch",
      "severity": "high",
      "field": "employeeCount",
      "message": "Expected integer, received string."
    }
  ],
  "duplicateKey": null,
  "missingFields": ["email"],
  "extraFields": [],
  "nullFields": ["email"],
  "typeMismatches": [
    {
      "field": "employeeCount",
      "expected": "integer",
      "actual": "string",
      "valuePreview": "27"
    }
  ],
  "fieldCount": 6,
  "checkedAt": "2026-07-02T00:00:00.000Z",
  "recommendation": "Normalize mismatched field types before delivery."
}
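
Rows shaped like the example above can be rolled up into a quick delivery gate. The sketch below assumes the result items have already been loaded into a list of dicts, for example from an exported JSON file. The function name and the shape of the returned summary are illustrative and do not describe the actor's own QA_SUMMARY.json.

```python
from collections import Counter


def summarize_results(items):
    """Count statuses and issue codes across row-level QA results."""
    status_counts = Counter(item["status"] for item in items)
    issue_codes = Counter(
        issue["code"] for item in items for issue in item.get("issues", [])
    )
    failing_keys = [item["recordKey"] for item in items if item["status"] == "fail"]
    return {
        "statusCounts": dict(status_counts),
        "issueCodes": dict(issue_codes),
        "failingKeys": failing_keys,
    }


# Abbreviated items: one passing row plus the failing row from the example output.
items = [
    {"recordKey": "lead-001", "status": "pass", "issues": []},
    {
        "recordKey": "lead-002",
        "status": "fail",
        "issues": [
            {"code": "missing-field", "severity": "medium", "field": "email"},
            {"code": "type-mismatch", "severity": "high", "field": "employeeCount"},
        ],
    },
]

summary = summarize_results(items)
print(summary["statusCounts"])  # {'pass': 1, 'fail': 1}
print(summary["failingKeys"])   # ['lead-002']

# A simple gate: block delivery when any row failed.
ready_for_delivery = summary["statusCounts"].get("fail", 0) == 0
print(ready_for_delivery)       # False
```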

Cost-Control Tips

Start with the default sample or maxItems between 10 and 100.
Use datasetId with a small cap before auditing a full production dataset.
Keep includeReport on for normal runs; turn it off only when you need row-level output without a summary artifact.
Use stable keyFields such as id, url, sku, or a composite key to avoid noisy duplicate results.

Scheduling Examples

Run after every scraper deployment to catch schema drift.
Schedule daily against a small recent sample from a production dataset.
Trigger before exporting a dataset to a client, CRM, spreadsheet, warehouse, or RAG pipeline.

Public Task Examples

This actor includes five prepared task concepts:

Lead list delivery-readiness check.
Scraper output schema-drift check.
Marketplace catalog null-spike check.
Apify dataset duplicate-key check.
Client delivery QA summary.

FAQ

Can this audit private Apify datasets?

Yes, provided the run can access the dataset through your Apify account. Keep the sample size low until the schema is configured correctly.

What happens if I do not provide an expected schema?

The actor infers a schema from the first non-null values in the checked records. For production QA, a configured expectedSchema is better because it catches drift against your intended output.
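
To illustrate why an inferred schema is weaker than a configured one, here is a minimal sketch of first-non-null inference. It is an assumption-laden approximation, not the actor's code: it maps plain JSON values to basic type names only and does not attempt semantic types such as url, email, or date.

```python
def infer_type(value):
    """Map a non-null JSON value to a basic type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def infer_schema(rows):
    """Use the first non-null value seen for each field to pick its type."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            if field not in schema and value is not None:
                schema[field] = infer_type(value)
    return schema


good_first = [{"id": "lead-001", "employeeCount": 12}, {"id": "lead-002", "employeeCount": "27"}]
bad_first = [{"id": "lead-002", "employeeCount": "27"}, {"id": "lead-001", "employeeCount": 12}]

print(infer_schema(good_first))  # {'id': 'string', 'employeeCount': 'integer'}
print(infer_schema(bad_first))   # {'id': 'string', 'employeeCount': 'string'}
```

In this sketch the result depends on row order: when the drifted value comes first, the inferred type is already wrong, which is the kind of case a configured expectedSchema guards against.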

Does it mutate or clean my dataset?

No. It only reads input rows, pushes QA result rows, and writes summary artifacts.

Are diagnostics billed as dataset rows?

Troubleshooting

No records to audit: Provide records or a datasetId that contains object rows.
Many invalid-key issues: Set keyFields to fields that actually exist and are populated.
Too many type mismatches: Check whether your expectedSchema type names match the supported type list.
Null spikes are too noisy: Raise nullSpikeThreshold, or narrow the expected schema to fields required for delivery.

Limitations

Only top-level fields are checked in this first version.
Semantic checks are intentionally conservative for URLs, emails, and dates.
Large datasets should be sampled first because every checked row is billable.
This is a QA signal, not a compliance guarantee.

Source And Safety Notes

This actor does not scrape websites or enrich records. It processes user-provided rows or datasets that the user already has permission to access. Do not upload sensitive personal data unless you are authorized to process it in Apify.

Changelog

1.0.0: Initial production build with duplicate-key detection, schema checks, null-spike reporting, KVS summaries, PPE billing, examples, and fixture tests.
