Dataset Quality Scorer
Pricing
Pay per event
Developer
Cody Churchwell
3 days ago
Last modified
Dataset Quality Scorer for ML
Score ML datasets for quality (completeness, consistency, duplicates, balance). Detect data drift, outliers, and recommend improvements.
Features
- Quality Scoring: Comprehensive scoring across multiple dimensions
- Data Drift Detection: Compare datasets to detect distribution changes
- Outlier Detection: Multiple methods (Z-score, IQR, Isolation Forest)
- Schema Validation: Validate data against expected schema
- Full Reports: Generate comprehensive quality reports
Operations
1. Score Dataset Quality
Analyze dataset quality across:
- Completeness: Missing values detection
- Consistency: Type consistency across columns
- Duplicates: Duplicate row detection
- Balance: Class distribution for ML tasks
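The four dimensions above can be illustrated with a minimal sketch. The actor's actual formulas are not documented here, so the definitions below (share of filled cells, share of values matching a column's dominant type, share of repeated rows, and smallest-to-largest class ratio) are assumptions chosen for illustration, and every function name is hypothetical.

```python
from collections import Counter


def is_missing(value):
    """Treat None and the empty string as missing (an assumption)."""
    return value is None or value == ""


def completeness(rows, columns):
    """Fraction of cells that hold a non-missing value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for col in columns
                 if not is_missing(row.get(col)))
    return filled / total


def consistency(rows, columns):
    """Mean share of non-missing values that match each column's dominant type."""
    scores = []
    for col in columns:
        types = Counter(type(row.get(col)).__name__ for row in rows
                        if not is_missing(row.get(col)))
        if types:
            scores.append(types.most_common(1)[0][1] / sum(types.values()))
    return sum(scores) / len(scores) if scores else 1.0


def duplicate_ratio(rows, columns):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = Counter(tuple(repr(row.get(col)) for col in columns) for row in rows)
    return sum(count - 1 for count in seen.values()) / len(rows)


def balance(rows, target):
    """Smallest class count divided by largest class count (1.0 = balanced)."""
    counts = Counter(row.get(target) for row in rows
                     if not is_missing(row.get(target)))
    if not counts:
        return 1.0
    return min(counts.values()) / max(counts.values())


# Made-up sample data for illustration only.
rows = [
    {"id": 1, "feature1": 0.5, "label": "a"},
    {"id": 2, "feature1": None, "label": "a"},
    {"id": 3, "feature1": "0.7", "label": "b"},
    {"id": 1, "feature1": 0.5, "label": "a"},
]
columns = ["id", "feature1", "label"]
```

With this sample, one of twelve cells is missing, `feature1` mixes floats with a string, the last row repeats the first, and class `a` outnumbers class `b` three to one.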
2. Detect Data Drift
Compare current dataset against baseline to detect:
- Distribution changes in numerical features
- Category changes in categorical features
- Drift severity assessment
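As a rough sketch of what such a comparison could look like: the drift statistic the actor uses is not stated, so the example below assumes a simple one (the shift in mean, measured in baseline standard deviations) for numerical features and a set difference for categorical ones. The function names and severity cut-offs are hypothetical.

```python
from statistics import mean, pstdev


def numeric_drift(baseline, current):
    """Absolute shift in mean, in units of the baseline standard deviation."""
    spread = pstdev(baseline)
    shift = abs(mean(current) - mean(baseline))
    if spread == 0:
        # A constant baseline: any shift at all is treated as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def category_drift(baseline, current):
    """Categories that appeared in, or vanished from, the current data."""
    base, cur = set(baseline), set(current)
    return {"added": sorted(cur - base), "removed": sorted(base - cur)}


def severity(score, moderate=0.5, high=1.0):
    """Map a drift score to a label; the cut-offs are illustrative assumptions."""
    if score >= high:
        return "high"
    if score >= moderate:
        return "moderate"
    return "none"


# Made-up values for illustration only.
baseline_values = [10, 12, 11, 13, 9]
current_values = [14, 15, 13, 16, 12]
drift_score = numeric_drift(baseline_values, current_values)
```

A mean-shift score ignores changes in spread or shape, so a production check would likely pair it with a distribution-level test.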
3. Find Outliers
Detect anomalous values using:
- Z-score method
- Interquartile Range (IQR)
- Isolation Forest
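The first two methods are simple enough to sketch with the Python standard library. This is an illustration of the general techniques, not the actor's implementation; the function names and the IQR multiplier of 1.5 are assumptions. Isolation Forest is left out because it needs a machine learning library (scikit-learn provides one as `IsolationForest`).

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up sample: twenty values, one of them far from the rest.
values = [10, 11, 12] * 6 + [11, 100]
z_hits = zscore_outliers(values)
iqr_hits = iqr_outliers(values)
```

One practical caveat: with a population standard deviation, the largest possible z-score in a sample of n values is sqrt(n - 1), so a threshold of 3 can only flag anything when there are more than ten values. The IQR method has no such limit, which makes it the safer choice for small datasets.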
4. Validate Schema
Validate dataset structure:
- Column presence/absence
- Data type conformity
- Required field checks
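A minimal sketch of these three checks, assuming the schema shape used in the Example section (a `columns` map of name to type plus a `required` list). Only the `number` and `string` type names appear in this document, so only those are handled; the function name and the issue messages are hypothetical.

```python
# Type names taken from the example schema; other names are simply not checked.
TYPE_CHECKS = {
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def validate_schema(rows, schema):
    """Return a list of (row_index, column, problem) tuples."""
    expected = schema.get("columns", {})
    required = schema.get("required", [])
    issues = []
    for index, row in enumerate(rows):
        # Column presence/absence
        for col in expected:
            if col not in row:
                issues.append((index, col, "missing column"))
        for col in row:
            if col not in expected:
                issues.append((index, col, "unexpected column"))
        # Required field checks (only for columns that are present)
        for col in required:
            if col in row and row[col] in (None, ""):
                issues.append((index, col, "required value missing"))
        # Data type conformity (missing values are skipped)
        for col, type_name in expected.items():
            value = row.get(col)
            check = TYPE_CHECKS.get(type_name)
            if value is not None and check and not check(value):
                issues.append((index, col, f"expected {type_name}"))
    return issues


schema = {
    "columns": {"id": "number", "feature1": "number", "label": "string"},
    "required": ["id", "label"],
}
# Made-up rows for illustration only.
rows = [
    {"id": 1, "feature1": 0.5, "label": "a"},
    {"id": "2", "feature1": 0.1, "label": None},
    {"id": 3, "label": "b", "extra": 1},
]
issues = validate_schema(rows, schema)
```

Returning every issue rather than stopping at the first makes the result usable as a report, which matches how the other operations describe their output.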
5. Generate Full Report
Comprehensive analysis combining all checks
Input Parameters
- operation: Operation to perform (required)
- datasetUrl: URL to CSV/JSON dataset or Apify dataset ID
- datasetData: Inline dataset as JSON array
- schemaDefinition: Expected schema for validation
- baselineDataset: Baseline for drift detection
- outlierMethod: Method for outlier detection (zscore, iqr, isolation_forest)
- outlierThreshold: Threshold for the selected outlier method (default: 3)
- checkDuplicates: Enable duplicate checking (default: true)
- checkBalance: Enable class balance checking (default: true)
- targetColumn: Target column for balance analysis
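The parameters above can be assembled and sanity-checked before a run. The helper below is a hypothetical convenience, not part of the actor: it applies the documented defaults and rejects unknown outlier methods. The rule that at least one of datasetUrl or datasetData must be supplied is an assumption, since the list marks only operation as required.

```python
# Outlier methods listed in the parameter description.
ALLOWED_OUTLIER_METHODS = {"zscore", "iqr", "isolation_forest"}


def build_run_input(operation, dataset_url=None, dataset_data=None, **options):
    """Build an input object using the documented parameter names and defaults."""
    if not operation:
        raise ValueError("operation is required")
    # Assumption: the actor needs some dataset source to work on.
    if dataset_url is None and dataset_data is None:
        raise ValueError("provide dataset_url or dataset_data")
    method = options.get("outlierMethod")
    if method is not None and method not in ALLOWED_OUTLIER_METHODS:
        raise ValueError(f"unknown outlierMethod: {method!r}")

    run_input = {
        "operation": operation,
        "outlierThreshold": 3,    # documented default
        "checkDuplicates": True,  # documented default
        "checkBalance": True,     # documented default
    }
    if dataset_url is not None:
        run_input["datasetUrl"] = dataset_url
    if dataset_data is not None:
        run_input["datasetData"] = dataset_data
    run_input.update(options)     # explicit options override the defaults
    return run_input


example_input = build_run_input(
    "generateReport",
    dataset_url="https://example.com/data.csv",
    targetColumn="label",
    outlierMethod="iqr",
)
```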
Use Cases
- Data Quality Assurance: Ensure data meets quality standards before training
- Data Drift Monitoring: Monitor production data for distribution shifts
- Outlier Detection: Identify anomalies that could harm model performance
- Schema Validation: Verify data structure before processing
- ML Pipeline Integration: Automated quality gates in ML workflows
Example
{
  "operation": "generateReport",
  "datasetUrl": "https://example.com/data.csv",
  "schemaDefinition": {
    "columns": {
      "id": "number",
      "feature1": "number",
      "label": "string"
    },
    "required": ["id", "label"]
  },
  "checkDuplicates": true,
  "checkBalance": true,
  "targetColumn": "label",
  "outlierMethod": "zscore",
  "outlierThreshold": 3
}
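An input like this could be submitted from Python with the Apify API client. The sketch below rests on several assumptions: the actor ID is a placeholder because this page does not state it, results are assumed to land in the run's default dataset, and the run details are assumed to come back as a dictionary, as in the `apify-client` releases this was written against. The `run_scorer` helper is hypothetical and takes the client as an argument so it can be exercised without network access.

```python
RUN_INPUT = {
    "operation": "generateReport",
    "datasetUrl": "https://example.com/data.csv",
    "schemaDefinition": {
        "columns": {"id": "number", "feature1": "number", "label": "string"},
        "required": ["id", "label"],
    },
    "checkDuplicates": True,
    "checkBalance": True,
    "targetColumn": "label",
    "outlierMethod": "zscore",
    "outlierThreshold": 3,
}


def run_scorer(client, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the actor run returned no details")
    # Assumption: the actor writes its report to the default dataset.
    return client.dataset(run["defaultDatasetId"]).list_items().items


# Usage (requires `pip install apify-client` and a valid API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   items = run_scorer(client, "<username>/<actor-name>", RUN_INPUT)
```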
Output
Returns quality scores, detected issues, and actionable recommendations for improving dataset quality.
Market Gap
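Positioned as a free, comprehensive dataset quality tool. Alternatives such as Great Expectations require checks to be defined and configured before they produce results; this actor aims to return quality scores from a single input, with no setup beyond pointing it at a dataset.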
Target MAU: 800