AI Data Quality Guardian
Validate, clean, and score datasets automatically. Detect anomalies, schema drift, duplicates, and data quality issues to produce reliable, structured outputs for analytics and automation workflows.
A production-ready Apify Actor for validating, scoring, and cleaning Apify datasets. It detects anomalies, schema drift, missing fields, and suspicious values, then outputs clean data, a quality report, and a quarantine dataset for invalid rows.
Overview
AI Data Quality Guardian runs as an Apify Actor and processes any dataset in your Apify account (or shared with you). It streams rows, validates them against configurable rules, detects duplicates and schema drift, scores confidence per row, and optionally applies auto-fixes. Valid rows are written to a clean dataset; invalid or low-confidence rows are written to a quarantine dataset with reasons and details. A quality report is stored in the run’s key-value store for monitoring and pipeline gates.
Features
| Capability | Description |
|---|---|
| Validation | Enforces required fields, non-empty values, and numeric bounds (min/max per field). |
| Schema drift | Compares each row’s structure to a baseline and flags new or missing fields. |
| Duplicate detection | Identifies duplicate rows via configurable key fields (e.g. url, id). |
| Anomaly detection | Flags extreme numeric values (e.g. beyond 3σ), missing critical data, and suspicious formats (e.g. URL-like fields not starting with http). |
| Confidence scoring | Assigns each row a 0–100 score from validation, drift, duplicates, and anomalies. |
| Auto-fix | Optionally trims strings, normalizes numbers, and removes empty fields on valid rows. |
| Quarantine | Writes invalid or low-confidence rows to a separate dataset with reason and details. |
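The confidence weighting and anomaly heuristics are internal to the Actor, but the 3σ check mentioned above can be sketched in a few lines (the helper name, field, and sample data are illustrative):

```javascript
// Illustrative only: flag numeric values more than 3 standard deviations from the mean,
// in the spirit of the anomaly detection described above (the Actor's heuristics may differ).
function findOutliers(rows, field) {
  const values = rows.map((r) => r[field]).filter((v) => typeof v === 'number');
  if (values.length === 0) return [];
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const std = Math.sqrt(values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length);
  return rows.filter((r) => typeof r[field] === 'number' && Math.abs(r[field] - mean) > 3 * std);
}

// 100 ordinary prices plus one extreme value: only the extreme row is returned.
const rows = Array.from({ length: 100 }, (_, i) => ({ price: 10 + (i % 5) }));
rows.push({ price: 10000 });
console.log(findOutliers(rows, 'price')); // → [ { price: 10000 } ]
```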
Quick Start
Prerequisites
- An Apify account.
- A dataset to validate (e.g. from another Actor run or from Storage → Datasets).
Running the Actor
1. Obtain a dataset ID
   Use the default dataset ID from a previous Actor run, or pick any dataset from Storage → Datasets. Dataset IDs use lowercase letters, digits, and hyphens (e.g. `my-dataset-1`).
2. Prepare input
   - Minimal: examples/example_input_minimal.json — set `datasetId` only (or leave it as `YOUR_DATASET_ID` to use the run’s default dataset).
   - Full: examples/example_input_apify.json — validation rules, duplicate detection, auto-fix, and other options.
3. Start a run
   - Console: Open the Actor → Input tab → paste your JSON → Start.
   - CLI: `apify run -p examples/example_input_apify.json` (from project root; set `datasetId` in the file first).
   - API: `POST https://api.apify.com/v2/acts/<ACTOR_ID>/runs` with your input JSON as the request body.
See examples/README.md for detailed steps and API/CLI examples.
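As an example of the API route, a run can be started and awaited from Node.js with the apify-client package. A minimal sketch, assuming `<ACTOR_ID>` and the `APIFY_TOKEN` environment variable are filled in (the input values are placeholders):

```javascript
// Minimal sketch: start a run via the Apify API and wait for it to finish.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// call() starts the run and resolves once it finishes, similar to the
// "Wait for finish" option in the n8n node described under Integrations.
const run = await client.actor('<ACTOR_ID>').call({
  datasetId: 'your-dataset-id-here',
  schemaDriftDetection: true,
  confidenceScoring: true,
  quarantineBadRows: true,
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```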
Input
Minimal
{"datasetId": "your-dataset-id-here"}
Omitting datasetId or using the placeholder YOUR_DATASET_ID causes the Actor to use the run’s default dataset (useful for quick tests or automated QA).
Full configuration
{"datasetId": "your-dataset-id-here","requiredfields": ["url", "title", "price"],"numericRules": {"price": { "min": 0, "max": 100000 },"rating": { "min": 0, "max": 5 }},"schemaDriftDetection": true,"duplicateDetection": {"enabled": true,"keyFields": ["url", "id"]},"anomalyDetection": true,"confidenceScoring": true,"autoFix": {"trimStrings": true,"normalizeNumbers": true,"removeEmptyFields": true},"quarantineBadRows": true,"debug": false}
All options are documented in the Actor’s .actor/input_schema.json.
Output
Clean dataset
- One item per valid row.
- If autoFix is enabled, items are the fixed rows (trimmed, normalized, empty fields removed).
- A row is valid when it has no validation errors and (if confidence scoring is on) confidence ≥ 50.
Quarantine dataset
- One item per invalid or failed row.
- Each item includes:
  - `row` — the row (fixed if autoFix was applied).
  - `reason` — short reason (e.g. "Validation failed", "Low confidence score", "Processing error").
  - `details` — array of messages (validation errors, drift, duplicates, anomalies).
  - `score` — confidence score (if confidence scoring is enabled).
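For example, a quarantined item might look like this (values and message wording are illustrative):

```json
{
  "row": { "url": "example.com/product/42", "title": "Widget", "price": -5 },
  "reason": "Validation failed",
  "details": ["price is below the configured minimum (0)", "url does not start with http"],
  "score": 35
}
```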
Key-value store
| Key | Description |
|---|---|
| QUALITY_REPORT | Summary of the run (counts, average confidence, quarantine reasons, drift summary). |
| OUTPUT | Object with qualityReport, cleanDatasetId, and quarantineDatasetId. |
| SCHEMA_BASELINE | Baseline schema used for drift detection; can be reused in later runs. |
Example QUALITY_REPORT:
{"totalRows": 1000,"validRows": 950,"invalidRows": 50,"duplicates": 5,"anomalies": 12,"schemaChanges": 3,"averageConfidence": 87.5,"quarantineReasonCounts": { "Validation failed": 30, "Low confidence score": 20 },"driftFieldsSummary": ["new:extraField", "missing:oldField"]}
Use Cases
- Post-scrape quality control — Run after a scraper to validate and clean its dataset before downstream use.
- Pipeline gates — Use the quality report (e.g. `invalidRows`, `averageConfidence`) to fail or alert when quality drops below a threshold.
- Monitoring — Schedule the Actor on a dataset to track schema drift and anomalies over time.
- Data cleaning — Use auto-fix and the clean dataset as the single source of validated records.
Integrations
n8n
Run the Actor from n8n using the Apify community node (@apify/n8n-nodes-apify).
- Prerequisites: Install the Apify node and create Apify API credentials (your API token).
- Basic flow: Run Actor (with “Wait for finish”) → Get Record (key `OUTPUT` from the run’s default key-value store) → Get Items (using `cleanDatasetId` from `OUTPUT`).
- Branching: Use `qualityReport.invalidRows`, `qualityReport.averageConfidence`, or `qualityReport.schemaChanges` in an IF or Switch node to alert or branch.
An importable workflow is provided: integrations/n8n-workflow-example.json. See integrations/README.md for more detail.
OpenClaw
OpenClaw can receive run summaries and trigger the Actor via skills or the Exec tool.
- Webhook: Set OpenClaw webhook URL and OpenClaw webhook token in the Actor input. When the run completes, the Actor POSTs a short summary to that URL.
- Triggering: Use the Exec tool with the Apify API, or install the integrations/openclaw/skill and set `APIFY_TOKEN` (and optionally the Actor ID) in your OpenClaw config.
Full details: integrations/openclaw/README.md.
General integration tips
- Chain after another Actor — Use a scraper’s or API Actor’s output dataset as `datasetId`; consume the clean dataset in your pipeline.
- Branch on quality — Read `QUALITY_REPORT` from the run’s key-value store and branch or alert when `invalidRows` or `averageConfidence` crosses a threshold.
- Inspect quarantine — Use the quarantine dataset to review bad rows and tune `requiredfields`, `numericRules`, or duplicate key fields.
- Debug — Set `"debug": true` to log validation reasoning and per-row details.
Development
Tech stack
- Node.js 18+
- Apify SDK — dataset streaming via `Dataset.forEach`; no browser or crawler.
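As a rough sketch of that streaming pattern (not the Actor’s actual source), the SDK lets you open the input dataset and iterate it row by row without loading everything into memory:

```javascript
// Rough sketch of the streaming pattern used by the Actor (not its actual source).
import { Actor } from 'apify';

await Actor.init();

const { datasetId } = (await Actor.getInput()) ?? {};
// Opens the dataset by ID, or the run's default dataset when none is given.
const dataset = await Actor.openDataset(datasetId);

let total = 0;
await dataset.forEach(async (row) => {
  total += 1;
  // Validation, drift, duplicate, and anomaly checks run here, one row at a time,
  // pushing valid rows to the clean dataset and the rest to quarantine.
});

console.log(`Processed ${total} rows`);
await Actor.exit();
```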
Local testing
```bash
npm run test:seed   # Seed local storage with sample input and dataset
npm run test        # Run the Actor locally (uses ./storage)
```
From project root, ensure CRAWLEE_STORAGE_DIR points to your storage directory (default: ./storage).
Memory considerations
Duplicate detection keeps an in-memory set of hashes for the configured key fields. For very large datasets with many unique keys, memory use can grow. If you hit limits, disable duplicate detection or reduce the number of key fields.
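A minimal sketch of that approach, assuming the key field values are concatenated and hashed into an in-memory Set (the Actor's actual hashing scheme may differ):

```javascript
// Minimal sketch of hash-based duplicate detection; the Actor's actual scheme may differ.
import { createHash } from 'node:crypto';

const keyFields = ['url', 'id'];
const seen = new Set(); // one entry per unique key, so memory grows with dataset cardinality

function isDuplicate(row) {
  const key = createHash('sha1')
    .update(keyFields.map((f) => String(row[f] ?? '')).join('|'))
    .digest('hex');
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}

console.log(isDuplicate({ url: 'https://example.com/a', id: 1 })); // false
console.log(isDuplicate({ url: 'https://example.com/a', id: 1 })); // true
```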
License
Licensed under the terms in the LICENSE file.