AI Data Quality Guardian

Validate, clean, and score datasets automatically. Detect anomalies, schema drift, duplicates, and data quality issues to produce reliable, structured outputs for analytics and automation workflows.

A production-ready Apify Actor for validating, scoring, and cleaning Apify datasets. It detects anomalies, schema drift, missing fields, and suspicious values, then outputs clean data, a quality report, and a quarantine dataset for invalid rows.


Table of Contents

  • Overview
  • Features
  • Quick Start
  • Input
  • Output
  • Use Cases
  • Integrations
  • Development
  • License

Overview

AI Data Quality Guardian runs as an Apify Actor and processes any dataset in your Apify account (or shared with you). It streams rows, validates them against configurable rules, detects duplicates and schema drift, scores confidence per row, and optionally applies auto-fixes. Valid rows are written to a clean dataset; invalid or low-confidence rows are written to a quarantine dataset with reasons and details. A quality report is stored in the run’s key-value store for monitoring and pipeline gates.
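
Conceptually, the per-row flow resembles the minimal sketch below. This is an illustration only, not the Actor's actual source: the named datasets clean-data and quarantine and the single required-fields check are stand-ins for the full rule set.

```javascript
// Minimal sketch of the per-row flow (illustration only, not the Actor's source).
import { Actor } from 'apify';

await Actor.init();

const { datasetId, requiredfields = [] } = (await Actor.getInput()) ?? {};
const source = await Actor.openDataset(datasetId);        // dataset to validate
const clean = await Actor.openDataset('clean-data');      // valid rows
const quarantine = await Actor.openDataset('quarantine'); // invalid rows + reasons

let validRows = 0;
let invalidRows = 0;

await source.forEach(async (row) => {
  // Example rule: every required field must be present and non-empty.
  const details = requiredfields
    .filter((field) => row[field] === undefined || row[field] === null || row[field] === '')
    .map((field) => `Missing required field: ${field}`);

  if (details.length === 0) {
    validRows += 1;
    await clean.pushData(row);
  } else {
    invalidRows += 1;
    await quarantine.pushData({ row, reason: 'Validation failed', details });
  }
});

await Actor.setValue('QUALITY_REPORT', { totalRows: validRows + invalidRows, validRows, invalidRows });
await Actor.exit();
```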


Features

| Capability | Description |
|---|---|
| Validation | Enforces required fields, non-empty values, and numeric bounds (min/max per field). |
| Schema drift | Compares each row's structure to a baseline and flags new or missing fields. |
| Duplicate detection | Identifies duplicate rows via configurable key fields (e.g. url, id). |
| Anomaly detection | Flags extreme numeric values (e.g. beyond 3σ), missing critical data, and suspicious formats (e.g. URL-like fields not starting with http). |
| Confidence scoring | Assigns each row a 0–100 score from validation, drift, duplicates, and anomalies. |
| Auto-fix | Optionally trims strings, normalizes numbers, and removes empty fields on valid rows. |
| Quarantine | Writes invalid or low-confidence rows to a separate dataset with reason and details. |
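
For intuition, the 3σ anomaly rule and the confidence score could be approximated by something like the snippet below. The weights shown are made-up examples; the Actor's actual scoring formula may differ.

```javascript
// Illustrative only: a 3-sigma outlier test and a naive 0-100 confidence score.
function findOutliers(rows, field) {
  const values = rows.map((row) => Number(row[field])).filter(Number.isFinite);
  if (values.length === 0) return [];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const std = Math.sqrt(values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length);
  return rows.filter((row) => Math.abs(Number(row[field]) - mean) > 3 * std);
}

function confidenceScore({ validationErrors = 0, driftFields = 0, isDuplicate = false, anomalies = 0 }) {
  let score = 100;
  score -= validationErrors * 20; // example weight per failed rule
  score -= driftFields * 5;
  score -= anomalies * 10;
  if (isDuplicate) score -= 25;
  return Math.max(0, Math.min(100, score));
}
```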

Quick Start

Prerequisites

  • An Apify account and a dataset to validate (any dataset in your account, or one shared with you).

Running the Actor

  1. Obtain a dataset ID
    Use the default dataset ID from a previous Actor run, or pick any dataset from Storage → Datasets. Dataset IDs use lowercase letters, digits, and hyphens (e.g. my-dataset-1).

  2. Prepare input

    • Minimal: examples/example_input_minimal.json — set datasetId only (or leave as YOUR_DATASET_ID to use the run’s default dataset).
    • Full: examples/example_input_apify.json — validation rules, duplicate detection, auto-fix, and other options.
  3. Start a run

    • Console: Open the Actor → Input tab → paste your JSON → Start.
    • CLI: apify run -p examples/example_input_apify.json (from project root; set datasetId in the file first).
    • API: POST https://api.apify.com/v2/acts/<ACTOR_ID>/runs with your input JSON as the request body.

See examples/README.md for detailed steps and API/CLI examples.
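
For the API route, a minimal Node.js sketch using the official apify-client package could look like this; the Actor ID placeholder and input values are only examples:

```javascript
// Sketch: start the Actor via the Apify API and wait for the run to finish.
// Replace <ACTOR_ID> and the dataset ID with your own values.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('<ACTOR_ID>').call({
  datasetId: 'your-dataset-id-here',
  quarantineBadRows: true,
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```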


Input

Minimal

{
  "datasetId": "your-dataset-id-here"
}

Omitting datasetId or using the placeholder YOUR_DATASET_ID causes the Actor to use the run’s default dataset (useful for quick tests or automated QA).

Full configuration

{
  "datasetId": "your-dataset-id-here",
  "requiredfields": ["url", "title", "price"],
  "numericRules": {
    "price": { "min": 0, "max": 100000 },
    "rating": { "min": 0, "max": 5 }
  },
  "schemaDriftDetection": true,
  "duplicateDetection": {
    "enabled": true,
    "keyFields": ["url", "id"]
  },
  "anomalyDetection": true,
  "confidenceScoring": true,
  "autoFix": {
    "trimStrings": true,
    "normalizeNumbers": true,
    "removeEmptyFields": true
  },
  "quarantineBadRows": true,
  "debug": false
}

All options are documented in the Actor’s .actor/input_schema.json.


Output

Clean dataset

  • One item per valid row.
  • If autoFix is enabled, items are the fixed rows (trimmed, normalized, empty fields removed).
  • A row is valid when it has no validation errors and (if confidence scoring is on) confidence ≥ 50.

Quarantine dataset

  • One item per invalid or failed row.
  • Each item includes:
    • row — the row (fixed if autoFix was applied).
    • reason — short reason (e.g. "Validation failed", "Low confidence score", "Processing error").
    • details — array of messages (validation errors, drift, duplicate, anomalies).
    • score — confidence score (if confidence scoring is enabled).
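
Put together, a quarantined item might look roughly like the following (all values here are invented for illustration):

```javascript
// Invented example of a quarantine dataset item.
const quarantinedItem = {
  row: { url: 'example.com/product/42', title: 'Widget', price: -5 },
  reason: 'Validation failed',
  details: ['price is below the configured minimum of 0', 'url does not start with http'],
  score: 35,
};
```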

Key-value store

| Key | Description |
|---|---|
| QUALITY_REPORT | Summary of the run (counts, average confidence, quarantine reasons, drift summary). |
| OUTPUT | Object with qualityReport, cleanDatasetId, and quarantineDatasetId. |
| SCHEMA_BASELINE | Baseline schema used for drift detection; can be reused in later runs. |

Example QUALITY_REPORT:

{
  "totalRows": 1000,
  "validRows": 950,
  "invalidRows": 50,
  "duplicates": 5,
  "anomalies": 12,
  "schemaChanges": 3,
  "averageConfidence": 87.5,
  "quarantineReasonCounts": { "Validation failed": 30, "Low confidence score": 20 },
  "driftFieldsSummary": ["new:extraField", "missing:oldField"]
}

Use Cases

  • Post-scrape quality control — Run after a scraper to validate and clean its dataset before downstream use.
  • Pipeline gates — Use the quality report (e.g. invalidRows, averageConfidence) to fail or alert when quality drops below a threshold.
  • Monitoring — Schedule the Actor on a dataset to track schema drift and anomalies over time.
  • Data cleaning — Use auto-fix and the clean dataset as the single source of validated records.

Integrations

n8n

Run the Actor from n8n using the Apify community node (@apify/n8n-nodes-apify).

  • Prerequisites: Install the Apify node and create Apify API credentials (your API token).
  • Basic flow: Run Actor (with “Wait for finish”) → Get Record (key OUTPUT from run’s default key-value store) → Get Items (using cleanDatasetId from OUTPUT).
  • Branching: Use qualityReport.invalidRows, qualityReport.averageConfidence, or qualityReport.schemaChanges in an IF or Switch node to alert or branch.

An importable workflow is provided: integrations/n8n-workflow-example.json. See integrations/README.md for more detail.

OpenClaw

OpenClaw can receive run summaries and trigger the Actor via skills or the Exec tool.

  • Webhook: Set OpenClaw webhook URL and OpenClaw webhook token in the Actor input. When the run completes, the Actor POSTs a short summary to that URL.
  • Triggering: Use the Exec tool with the Apify API, or install the integrations/openclaw/skill and set APIFY_TOKEN (and optionally Actor ID) in your OpenClaw config.

Full details: integrations/openclaw/README.md.

General integration tips

  1. Chain after another Actor — Use a scraper’s or API Actor’s output dataset as datasetId; consume the clean dataset in your pipeline.
  2. Branch on quality — Read QUALITY_REPORT from the run’s key-value store and branch or alert when invalidRows or averageConfidence crosses a threshold.
  3. Inspect quarantine — Use the quarantine dataset to review bad rows and tune requiredfields, numericRules, or duplicate key fields.
  4. Debug — Set "debug": true to log validation reasoning and per-row details.
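
As a sketch of tips 1 and 2, the Node.js snippet below (using apify-client; the thresholds are arbitrary examples) runs the Actor, reads OUTPUT from the run's key-value store, and stops the pipeline when quality is too low:

```javascript
// Sketch of a quality gate; threshold values are arbitrary examples, tune them to your data.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the Actor on a dataset produced earlier in the pipeline.
const run = await client.actor('<ACTOR_ID>').call({ datasetId: 'your-dataset-id-here' });

// Read the OUTPUT record: { qualityReport, cleanDatasetId, quarantineDatasetId }.
const output = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('OUTPUT');
const { qualityReport, cleanDatasetId } = output.value;

if (qualityReport.invalidRows > 100 || qualityReport.averageConfidence < 70) {
  throw new Error(`Data quality below threshold: ${JSON.stringify(qualityReport)}`);
}

// Consume only the validated rows downstream.
const { items } = await client.dataset(cleanDatasetId).listItems();
console.log(`Fetched ${items.length} clean rows`);
```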

Development

Tech stack

  • Node.js 18+
  • Apify SDK — dataset streaming via Dataset.forEach; no browser or crawler.

Local testing

npm run test:seed # Seed local storage with sample input and dataset
npm run test # Run the Actor locally (uses ./storage)

From project root, ensure CRAWLEE_STORAGE_DIR points to your storage directory (default: ./storage).

Memory considerations

Duplicate detection keeps an in-memory set of hashes for the configured key fields. For very large datasets with many unique keys, memory use can grow. If you hit limits, disable duplicate detection or reduce the number of key fields.
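
To see why memory scales with the number of unique keys, here is a rough sketch of the hash-set approach described above (illustrative, not the Actor's code):

```javascript
// Rough sketch of key-based duplicate detection; one hash per unique key stays in memory.
import { createHash } from 'node:crypto';

const seenHashes = new Set();

function isDuplicate(row, keyFields) {
  const key = keyFields.map((field) => String(row[field] ?? '')).join('|');
  const hash = createHash('sha1').update(key).digest('hex');
  if (seenHashes.has(hash)) return true;
  seenHashes.add(hash); // grows with every unique key combination over the whole run
  return false;
}
```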


License

See the LICENSE file in the repository for license terms.