Dataset Quality Scorer

Pricing

Pay per event

Score ML datasets for quality (completeness, consistency, duplicates, balance). Detect data drift, outliers, and recommend improvements.

Developer

Cody Churchwell

Maintained by Community

3 days ago

Last modified

Dataset Quality Scorer for ML

Features

  • Quality Scoring: Comprehensive scoring across multiple dimensions
  • Data Drift Detection: Compare datasets to detect distribution changes
  • Outlier Detection: Multiple methods (Z-score, IQR, Isolation Forest)
  • Schema Validation: Validate data against expected schema
  • Full Reports: Generate comprehensive quality reports

Operations

1. Score Dataset Quality

Analyze dataset quality across:

  • Completeness: Missing values detection
  • Consistency: Type consistency across columns
  • Duplicates: Duplicate row detection
  • Balance: Class distribution for ML tasks
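
The four dimensions above can be illustrated with a short Python sketch. This is not the Actor's implementation: the function name `score_quality`, the 0 to 1 scale, and the exact formulas (for example, balance as the smallest class count divided by the largest) are assumptions chosen for illustration.

```python
import json
from collections import Counter


def score_quality(rows, target_column=None):
    """Return illustrative 0-1 scores (1.0 is best) for a list of dict rows.

    Hypothetical helper: the formulas are assumptions, not the Actor's own.
    """
    columns = sorted({key for row in rows for key in row})
    if not rows or not columns:
        return {}

    def present(value):
        return value is not None and value != ""

    # Completeness: share of cells that hold a value.
    filled = sum(present(row.get(col)) for row in rows for col in columns)
    completeness = filled / (len(rows) * len(columns))

    # Consistency: per column, share of values matching the most common type.
    shares = []
    for col in columns:
        types = Counter(
            type(row[col]).__name__ for row in rows if present(row.get(col))
        )
        if types:
            shares.append(max(types.values()) / sum(types.values()))
    consistency = sum(shares) / len(shares) if shares else 1.0

    # Duplicates: share of rows that are unique after canonical serialization.
    unique = {json.dumps(row, sort_keys=True, default=str) for row in rows}
    uniqueness = len(unique) / len(rows)

    scores = {
        "completeness": completeness,
        "consistency": consistency,
        "uniqueness": uniqueness,
    }

    # Balance: smallest class count divided by the largest (1.0 = balanced).
    if target_column is not None:
        counts = Counter(
            row[target_column] for row in rows if present(row.get(target_column))
        )
        if counts:
            scores["balance"] = min(counts.values()) / max(counts.values())
    return scores
```

Calling `score_quality(rows, target_column="label")` on a list of row dictionaries returns one score per dimension, which a pipeline could compare against its own thresholds.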

2. Detect Data Drift

Compare the current dataset against a baseline to detect:

  • Distribution changes in numerical features
  • Category changes in categorical features
  • Drift severity assessment
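
As a rough sketch of what such a comparison can look like, the Python below measures numerical drift as the shift in mean expressed in baseline standard deviations, and categorical drift as the total variation distance between category frequencies. These metrics, the function names, and the severity cutoffs are assumptions for illustration; the Actor may use different statistics.

```python
from collections import Counter
from statistics import mean, pstdev


def numeric_drift(baseline, current):
    """Absolute shift in mean, in units of the baseline standard deviation."""
    spread = pstdev(baseline)
    shift = abs(mean(current) - mean(baseline))
    if spread == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def categorical_drift(baseline, current):
    """Total variation distance plus categories that appeared or vanished."""
    base, cur = Counter(baseline), Counter(current)
    categories = set(base) | set(cur)
    distance = 0.5 * sum(
        abs(base[c] / len(baseline) - cur[c] / len(current)) for c in categories
    )
    return {
        "distance": distance,
        "new": sorted(set(cur) - set(base)),
        "missing": sorted(set(base) - set(cur)),
    }


def severity(score, moderate, high):
    """Map a drift score to a label; the cutoffs are caller-supplied guesses."""
    if score >= high:
        return "high"
    if score >= moderate:
        return "moderate"
    return "low"
```

For example, `severity(numeric_drift(old_values, new_values), moderate=0.5, high=1.0)` labels a numerical feature, where `0.5` and `1.0` are arbitrary placeholder cutoffs that should be tuned per feature.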

3. Find Outliers

Detect anomalous values using:

  • Z-score method
  • Interquartile Range (IQR)
  • Isolation Forest
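
The first two methods are simple enough to sketch with the Python standard library. The function names and the `k=1.5` IQR multiplier below are illustrative assumptions, not the Actor's internals. Isolation Forest is omitted because it needs a machine learning library; scikit-learn's `IsolationForest` is one widely used implementation.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


values = [10, 11, 12, 11, 10, 12, 11, 100]
print(zscore_outliers(values, threshold=3.0))
print(iqr_outliers(values))
```

The sample data shows why the choice of method matters: with only eight values, a single extreme point inflates the standard deviation so much that its Z-score cannot reach 3, while the IQR fences still flag it.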

4. Validate Schema

Validate dataset structure:

  • Column presence/absence
  • Data type conformity
  • Required field checks
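
A minimal Python sketch of these three checks follows. It assumes the schema shape used in the Example section (`columns` mapping names to type names, plus a `required` list); the `boolean` type name, the `validate_schema` function, and the issue tuples are hypothetical.

```python
# Type names "number" and "string" come from the Example section;
# "boolean" is an assumed extension.
TYPE_CHECKS = {
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_schema(rows, schema):
    """Return a list of (row_index, column, problem) tuples."""
    expected = schema.get("columns", {})
    required = schema.get("required", [])
    issues = []
    for index, row in enumerate(rows):
        # Required field checks: the column must exist and be non-null.
        for col in required:
            if row.get(col) is None:
                issues.append((index, col, "missing required field"))
        # Data type conformity: only checked when a value is present.
        for col, type_name in expected.items():
            value = row.get(col)
            check = TYPE_CHECKS.get(type_name)
            if value is not None and check and not check(value):
                issues.append((index, col, f"expected {type_name}"))
        # Column presence: flag columns the schema does not mention.
        for col in row:
            if col not in expected:
                issues.append((index, col, "unexpected column"))
    return issues
```

An empty result means every row conforms; otherwise each tuple points to the row and column that failed.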

5. Generate Full Report

Runs all of the checks above and combines the results into a single report.

Input Parameters

  • operation: Operation to perform (required)
  • datasetUrl: URL to CSV/JSON dataset or Apify dataset ID
  • datasetData: Inline dataset as JSON array
  • schemaDefinition: Expected schema for validation
  • baselineDataset: Baseline for drift detection
  • outlierMethod: Method for outlier detection (zscore, iqr, isolation_forest)
  • outlierThreshold: Cutoff used by the selected outlier method (default: 3)
  • checkDuplicates: Enable duplicate checking (default: true)
  • checkBalance: Enable class balance checking (default: true)
  • targetColumn: Target column for balance analysis

Use Cases

  1. Data Quality Assurance: Ensure data meets quality standards before training
  2. Data Drift Monitoring: Monitor production data for distribution shifts
  3. Outlier Detection: Identify anomalies that could harm model performance
  4. Schema Validation: Verify data structure before processing
  5. ML Pipeline Integration: Automated quality gates in ML workflows

Example

{
  "operation": "generateReport",
  "datasetUrl": "https://example.com/data.csv",
  "schemaDefinition": {
    "columns": {
      "id": "number",
      "feature1": "number",
      "label": "string"
    },
    "required": ["id", "label"]
  },
  "checkDuplicates": true,
  "checkBalance": true,
  "targetColumn": "label",
  "outlierMethod": "zscore",
  "outlierThreshold": 3
}
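
One way to submit this input programmatically is the official `apify-client` Python package. The sketch below makes several assumptions: the Actor ID is a placeholder that must be copied from the Store page, the API token is read from an `APIFY_TOKEN` environment variable, and results are assumed to land in the run's default dataset, which this page does not confirm.

```python
import os

# Same input as the example above, expressed as a Python dict.
RUN_INPUT = {
    "operation": "generateReport",
    "datasetUrl": "https://example.com/data.csv",
    "schemaDefinition": {
        "columns": {"id": "number", "feature1": "number", "label": "string"},
        "required": ["id", "label"],
    },
    "checkDuplicates": True,
    "checkBalance": True,
    "targetColumn": "label",
    "outlierMethod": "zscore",
    "outlierThreshold": 3,
}


def run_actor(actor_id, token, run_input=RUN_INPUT):
    """Start the Actor, wait for it to finish, and return its dataset items.

    Assumes results are written to the run's default dataset.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (placeholder Actor ID, replace with the real one):
# items = run_actor("<username>/<actor-name>", os.environ["APIFY_TOKEN"])
```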

Output

Returns quality scores, detected issues, and actionable recommendations for improving dataset quality.

Market Gap

Positioned as the first free, comprehensive dataset quality tool. Alternatives such as Great Expectations require installing and configuring a framework before the first check can run. This Actor returns quality scores from a single JSON input with no setup.

Target MAU: 800