AI Data Quality Guardian

Validate, clean, and score datasets automatically. Detect anomalies, schema drift, duplicates, and data quality issues to produce reliable, structured outputs for analytics and automation workflows.

A production-ready Apify Actor for validating, scoring, and cleaning Apify datasets. It detects anomalies, schema drift, missing fields, and suspicious values, then outputs clean data, a quality report, and a quarantine dataset for invalid rows.


Table of Contents

  • Overview
  • Features
  • Quick Start
  • Input
  • Output
  • Use Cases
  • Integrations
  • Development
  • License

Overview

AI Data Quality Guardian runs as an Apify Actor and processes any dataset in your Apify account (or shared with you). It streams rows, validates them against configurable rules, detects duplicates and schema drift, scores confidence per row, and optionally applies auto-fixes. Valid rows are written to a clean dataset; invalid or low-confidence rows are written to a quarantine dataset with reasons and details. A quality report is stored in the run’s key-value store for monitoring and pipeline gates.
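
Conceptually, the per-row flow resembles the minimal sketch below. This is an illustration only, not the Actor's actual source: the named datasets clean-data and quarantine and the single required-fields check are stand-ins for the full rule set.

```javascript
// Minimal sketch of the per-row flow (illustration only, not the Actor's source).
import { Actor } from 'apify';

await Actor.init();

const { datasetId, requiredfields = [] } = (await Actor.getInput()) ?? {};
const source = await Actor.openDataset(datasetId);        // dataset to validate
const clean = await Actor.openDataset('clean-data');      // valid rows
const quarantine = await Actor.openDataset('quarantine'); // invalid rows + reasons

let validRows = 0;
let invalidRows = 0;

await source.forEach(async (row) => {
  // Example rule: every required field must be present and non-empty.
  const details = requiredfields
    .filter((field) => row[field] === undefined || row[field] === null || row[field] === '')
    .map((field) => `Missing required field: ${field}`);

  if (details.length === 0) {
    validRows += 1;
    await clean.pushData(row);
  } else {
    invalidRows += 1;
    await quarantine.pushData({ row, reason: 'Validation failed', details });
  }
});

await Actor.setValue('QUALITY_REPORT', { totalRows: validRows + invalidRows, validRows, invalidRows });
await Actor.exit();
```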


Features

| Capability | Description |
|---|---|
| Validation | Enforces required fields, non-empty values, and numeric bounds (min/max per field). |
| Schema drift | Compares each row's structure to a baseline and flags new or missing fields. |
| Duplicate detection | Identifies duplicate rows via configurable key fields (e.g. url, id). |
| Anomaly detection | Flags extreme numeric values (e.g. beyond 3σ), missing critical data, and suspicious formats (e.g. URL-like fields not starting with http). |
| Confidence scoring | Assigns each row a 0–100 score from validation, drift, duplicates, and anomalies. |
| Auto-fix | Optionally trims strings, normalizes numbers, and removes empty fields on valid rows. |
| Quarantine | Writes invalid or low-confidence rows to a separate dataset with reason and details. |
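
For intuition, the 3σ anomaly rule and the confidence score could be approximated by something like the snippet below. The weights shown are made-up examples; the Actor's actual scoring formula may differ.

```javascript
// Illustrative only: a 3-sigma outlier test and a naive 0-100 confidence score.
function findOutliers(rows, field) {
  const values = rows.map((row) => Number(row[field])).filter(Number.isFinite);
  if (values.length === 0) return [];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const std = Math.sqrt(values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length);
  return rows.filter((row) => Math.abs(Number(row[field]) - mean) > 3 * std);
}

function confidenceScore({ validationErrors = 0, driftFields = 0, isDuplicate = false, anomalies = 0 }) {
  let score = 100;
  score -= validationErrors * 20; // example weight per failed rule
  score -= driftFields * 5;
  score -= anomalies * 10;
  if (isDuplicate) score -= 25;
  return Math.max(0, Math.min(100, score));
}
```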

Quick Start

Prerequisites

  • An Apify account and a dataset to validate (any dataset in your account, or one shared with you).

Running the Actor

  1. Obtain a dataset ID
    Use the default dataset ID from a previous Actor run, or pick any dataset from Storage → Datasets. Dataset IDs use lowercase letters, digits, and hyphens (e.g. my-dataset-1).

  2. Prepare input

    • Minimal: examples/example_input_minimal.json — set datasetId only (or leave as YOUR_DATASET_ID to use the run’s default dataset).
    • Full: examples/example_input_apify.json — validation rules, duplicate detection, auto-fix, and other options.
  3. Start a run

    • Console: Open the Actor → Input tab → paste your JSON → Start.
    • CLI: apify run -p examples/example_input_apify.json (from project root; set datasetId in the file first).
    • API: POST https://api.apify.com/v2/acts/<ACTOR_ID>/runs with your input JSON as the request body.

See examples/README.md for detailed steps and API/CLI examples.
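
For the API route, a minimal Node.js sketch using the official apify-client package could look like this; the Actor ID placeholder and input values are only examples:

```javascript
// Sketch: start the Actor via the Apify API and wait for the run to finish.
// Replace <ACTOR_ID> and the dataset ID with your own values.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('<ACTOR_ID>').call({
  datasetId: 'your-dataset-id-here',
  quarantineBadRows: true,
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```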


Input

Minimal

{
  "datasetId": "your-dataset-id-here"
}

Omitting datasetId or using the placeholder YOUR_DATASET_ID causes the Actor to use the run’s default dataset (useful for quick tests or automated QA).

Full configuration

{
  "datasetId": "your-dataset-id-here",
  "requiredfields": ["url", "title", "price"],
  "numericRules": {
    "price": { "min": 0, "max": 100000 },
    "rating": { "min": 0, "max": 5 }
  },
  "schemaDriftDetection": true,
  "duplicateDetection": {
    "enabled": true,
    "keyFields": ["url", "id"]
  },
  "anomalyDetection": true,
  "confidenceScoring": true,
  "autoFix": {
    "trimStrings": true,
    "normalizeNumbers": true,
    "removeEmptyFields": true
  },
  "quarantineBadRows": true,
  "debug": false
}

All options are documented in the Actor’s .actor/input_schema.json.


Output

Clean dataset

  • One item per valid row.
  • If autoFix is enabled, items are the fixed rows (trimmed, normalized, empty fields removed).
  • A row is valid when it has no validation errors and (if confidence scoring is on) confidence ≥ 50.

Quarantine dataset

  • One item per invalid or failed row.
  • Each item includes:
    • row — the row (fixed if autoFix was applied).
    • reason — short reason (e.g. "Validation failed", "Low confidence score", "Processing error").
    • details — array of messages (validation errors, drift, duplicate, anomalies).
    • score — confidence score (if confidence scoring is enabled).
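
Put together, a quarantined item might look roughly like the following (all values here are invented for illustration):

```javascript
// Invented example of a quarantine dataset item.
const quarantinedItem = {
  row: { url: 'example.com/product/42', title: 'Widget', price: -5 },
  reason: 'Validation failed',
  details: ['price is below the configured minimum of 0', 'url does not start with http'],
  score: 35,
};
```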

Key-value store

| Key | Description |
|---|---|
| QUALITY_REPORT | Summary of the run (counts, average confidence, quarantine reasons, drift summary). |
| OUTPUT | Object with qualityReport, cleanDatasetId, and quarantineDatasetId. |
| SCHEMA_BASELINE | Baseline schema used for drift detection; can be reused in later runs. |

Example QUALITY_REPORT:

{
  "totalRows": 1000,
  "validRows": 950,
  "invalidRows": 50,
  "duplicates": 5,
  "anomalies": 12,
  "schemaChanges": 3,
  "averageConfidence": 87.5,
  "quarantineReasonCounts": { "Validation failed": 30, "Low confidence score": 20 },
  "driftFieldsSummary": ["new:extraField", "missing:oldField"]
}

Use Cases

  • Post-scrape quality control — Run after a scraper to validate and clean its dataset before downstream use.
  • Pipeline gates — Use the quality report (e.g. invalidRows, averageConfidence) to fail or alert when quality drops below a threshold.
  • Monitoring — Schedule the Actor on a dataset to track schema drift and anomalies over time.
  • Data cleaning — Use auto-fix and the clean dataset as the single source of validated records.

Integrations

n8n

Run the Actor from n8n using the Apify community node (@apify/n8n-nodes-apify).

  • Prerequisites: Install the Apify node and create Apify API credentials (your API token).
  • Basic flow: Run Actor (with “Wait for finish”) → Get Record (key OUTPUT from run’s default key-value store) → Get Items (using cleanDatasetId from OUTPUT).
  • Branching: Use qualityReport.invalidRows, qualityReport.averageConfidence, or qualityReport.schemaChanges in an IF or Switch node to alert or branch.

An importable workflow is provided: integrations/n8n-workflow-example.json. See integrations/README.md for more detail.

OpenClaw

OpenClaw can receive run summaries and trigger the Actor via skills or the Exec tool.

  • Webhook: Set OpenClaw webhook URL and OpenClaw webhook token in the Actor input. When the run completes, the Actor POSTs a short summary to that URL.
  • Triggering: Use the Exec tool with the Apify API, or install the integrations/openclaw/skill and set APIFY_TOKEN (and optionally Actor ID) in your OpenClaw config.

Full details: integrations/openclaw/README.md.

General integration tips

  1. Chain after another Actor — Use a scraper’s or API Actor’s output dataset as datasetId; consume the clean dataset in your pipeline.
  2. Branch on quality — Read QUALITY_REPORT from the run’s key-value store and branch or alert when invalidRows or averageConfidence crosses a threshold.
  3. Inspect quarantine — Use the quarantine dataset to review bad rows and tune requiredfields, numericRules, or duplicate key fields.
  4. Debug — Set "debug": true to log validation reasoning and per-row details.
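
As a sketch of tips 1 and 2, the Node.js snippet below (using apify-client; the thresholds are arbitrary examples) runs the Actor, reads OUTPUT from the run's key-value store, and stops the pipeline when quality is too low:

```javascript
// Sketch of a quality gate; threshold values are arbitrary examples, tune them to your data.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the Actor on a dataset produced earlier in the pipeline.
const run = await client.actor('<ACTOR_ID>').call({ datasetId: 'your-dataset-id-here' });

// Read the OUTPUT record: { qualityReport, cleanDatasetId, quarantineDatasetId }.
const output = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('OUTPUT');
const { qualityReport, cleanDatasetId } = output.value;

if (qualityReport.invalidRows > 100 || qualityReport.averageConfidence < 70) {
  throw new Error(`Data quality below threshold: ${JSON.stringify(qualityReport)}`);
}

// Consume only the validated rows downstream.
const { items } = await client.dataset(cleanDatasetId).listItems();
console.log(`Fetched ${items.length} clean rows`);
```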

Development

Tech stack

  • Node.js 18+
  • Apify SDK — dataset streaming via Dataset.forEach; no browser or crawler.

Local testing

npm run test:seed # Seed local storage with sample input and dataset
npm run test # Run the Actor locally (uses ./storage)

From project root, ensure CRAWLEE_STORAGE_DIR points to your storage directory (default: ./storage).

Memory considerations

Duplicate detection keeps an in-memory set of hashes for the configured key fields. For very large datasets with many unique keys, memory use can grow. If you hit limits, disable duplicate detection or reduce the number of key fields.
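
To see why memory scales with the number of unique keys, here is a rough sketch of the hash-set approach described above (illustrative, not the Actor's code):

```javascript
// Rough sketch of key-based duplicate detection; one hash per unique key stays in memory.
import { createHash } from 'node:crypto';

const seenHashes = new Set();

function isDuplicate(row, keyFields) {
  const key = keyFields.map((field) => String(row[field] ?? '')).join('|');
  const hash = createHash('sha1').update(key).digest('hex');
  if (seenHashes.has(hash)) return true;
  seenHashes.add(hash); // grows with every unique key combination over the whole run
  return false;
}
```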


License

See the LICENSE file in the repository for license terms.