Apify Smart Dataset Comparator
Pricing
Pay per event
Compare 2-10 Apify datasets to detect changes, new/removed records, and duplicates. Features field-level diffs, smart merging, schema validation, data cleaning, and anomaly detection. Perfect for price monitoring, lead deduplication, and data quality tracking.
Developer: Agenscrape
Smart Dataset Comparator & Change Detector
Compare 2-10 Apify datasets to detect changes, new/removed records, duplicates, and merge data with custom rules. Perfect for price monitoring, lead deduplication, SEO tracking, and data quality validation.
Facing an issue, unexpected error, edge case, or have a feature suggestion? Post it here and we'll address it within 24 hours.
Quick Start
{"datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],"primaryKey": "url"}
That's it! Get instant comparison results showing what changed, what's new, and what was removed.
What You Get
| Output | Description |
|---|---|
| Changes | Records that changed with field-level before/after diffs |
| New Records | Records only in newer datasets |
| Removed Records | Records only in older datasets |
| Merged Data | All unique records combined using your merge strategy |
| Duplicates | Exact and fuzzy duplicate detection |
| Schema Analysis | Field types, conflicts, and consistency checks |
| Anomalies | Large price changes (>50%), stock depletions |
Features
Change Detection
Compares records by primary key and shows exactly what changed:

    {
      "key": "product-123",
      "changes": {
        "price": { "old": 99.99, "new": 79.99, "type": "decreased" },
        "stock": { "old": 100, "new": 0, "type": "decreased" }
      }
    }

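To make the diff shape above concrete, here is a minimal Python sketch of a field-level comparison between two versions of one record. It is an illustration only, not the actor's implementation: the function names are hypothetical, and the `"changed"` label for non-numeric fields is an assumption (the example above only shows `"decreased"`).

```python
from typing import Any


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_records(old: dict[str, Any], new: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return field-level changes between two versions of the same record."""
    changes: dict[str, dict[str, Any]] = {}
    for field in sorted(old.keys() | new.keys()):
        before, after = old.get(field), new.get(field)
        if before == after:
            continue  # unchanged fields are not reported
        entry = {"old": before, "new": after}
        if _is_number(before) and _is_number(after):
            entry["type"] = "increased" if after > before else "decreased"
        else:
            entry["type"] = "changed"  # assumed label for non-numeric fields
        changes[field] = entry
    return changes


if __name__ == "__main__":
    old = {"name": "Widget", "price": 99.99, "stock": 100}
    new = {"name": "Widget", "price": 79.99, "stock": 0}
    print({"key": "product-123", "changes": diff_records(old, new)})
```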
Smart Presets
Pre-configured settings for common use cases:
| Preset | Best For | What It Does |
|---|---|---|
| `price_monitoring` | E-commerce, competitors | 1% price tolerance, ignores timestamps |
| `lead_lists` | CRM, marketing | Normalizes emails/phones, fuzzy dedup |
| `seo` | Content monitoring | Strict comparison, URL normalization |
| `real_estate` | Property listings | 0.5% price tolerance, phone normalization |
Data Cleaning
Normalize data before comparison:
- Emails: `Test+Spam@Gmail.com` → `test@gmail.com`
- Phones: `(555) 123-4567` → `5551234567`
- URLs: Remove tracking params (`utm_*`, `fbclid`)
- Currency: `$1,234.56` → `1234.56`
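The four normalizations listed above can be approximated locally with the Python standard library. This is a sketch that reproduces the listed examples; the helper names are hypothetical and the actor's actual rules (for example, which tracking parameters it strips beyond `utm_*` and `fbclid`) may differ.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_email(email: str) -> str:
    """Lowercase the address and drop a '+tag' suffix from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    return f"{local.split('+', 1)[0]}@{domain}"


def normalize_phone(phone: str) -> str:
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


def normalize_url(url: str) -> str:
    """Remove utm_* and fbclid query parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key.lower().startswith("utm_") or key.lower() == "fbclid")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def normalize_currency(amount: str) -> float:
    """Strip currency symbols and thousands separators (assumes '.' as decimal point)."""
    return float(re.sub(r"[^\d.\-]", "", amount))


if __name__ == "__main__":
    print(normalize_email("Test+Spam@Gmail.com"))
    print(normalize_phone("(555) 123-4567"))
    print(normalize_url("https://example.com/p?id=1&utm_source=x&fbclid=abc"))
    print(normalize_currency("$1,234.56"))
```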
Merge Strategies
When the same record exists in multiple datasets:
| Strategy | Description |
|---|---|
| `left_priority` | First dataset wins (default) |
| `right_priority` | Last/newest dataset wins |
| `most_recent` | Record with newest timestamp wins |
| `most_complete` | Record with most filled fields wins |
| `combine_arrays` | Merge array fields from all records |
| `average_numbers` | Average numeric fields |
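As a rough illustration of how these strategies differ, the sketch below merges several versions of one record locally. It is not the actor's code. Assumptions: priority strategies pick a whole record rather than merging field by field, `most_recent` compares a hypothetical `updatedAt` field holding ISO-8601 strings, and "filled" means not `None`, empty string, empty list, or empty dict.

```python
from typing import Any

Record = dict[str, Any]


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _filled(record: Record) -> int:
    # Count fields that carry a value (assumed definition of "filled").
    return sum(1 for v in record.values() if v not in (None, "", [], {}))


def merge(records: list[Record], strategy: str = "left_priority",
          timestamp_field: str = "updatedAt") -> Record:
    """Merge versions of one record, ordered from first to last dataset."""
    if strategy == "left_priority":
        return dict(records[0])
    if strategy == "right_priority":
        return dict(records[-1])
    if strategy == "most_recent":
        return dict(max(records, key=lambda r: r.get(timestamp_field, "")))
    if strategy == "most_complete":
        return dict(max(records, key=_filled))
    if strategy == "combine_arrays":
        merged = dict(records[0])
        for record in records[1:]:
            for field, value in record.items():
                if isinstance(value, list) and isinstance(merged.get(field), list):
                    extra = [v for v in value if v not in merged[field]]
                    merged[field] = merged[field] + extra  # union, order preserved
                else:
                    merged.setdefault(field, value)
        return merged
    if strategy == "average_numbers":
        merged = dict(records[0])
        for field in merged:
            values = [r[field] for r in records if _is_number(r.get(field))]
            if _is_number(merged[field]) and values:
                merged[field] = sum(values) / len(values)
        return merged
    raise ValueError(f"unknown merge strategy: {strategy}")


if __name__ == "__main__":
    a = {"id": 1, "price": 10, "tags": ["x"], "name": "", "updatedAt": "2024-01-01"}
    b = {"id": 1, "price": 20, "tags": ["x", "y"], "name": "B", "updatedAt": "2024-02-01"}
    for name in ("left_priority", "most_complete", "combine_arrays", "average_numbers"):
        print(name, merge([a, b], name))
```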
Duplicate Detection
Find duplicates within each dataset:
- Exact: Same primary key
- Fuzzy: Similar records using Levenshtein distance (configurable threshold)
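For intuition about the fuzzy threshold, the following sketch computes Levenshtein distance and turns it into a 0-1 similarity score. The normalization used here (one minus distance divided by the longer string's length) and the case-insensitive comparison are assumptions; the actor may score similarity differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    return similarity(a.lower(), b.lower()) >= threshold


if __name__ == "__main__":
    print(similarity("Acme Corporation", "Acme Corporatoin"))
    print(is_fuzzy_duplicate("Acme Corporation", "Acme Corporatoin"))
    print(is_fuzzy_duplicate("Acme", "Apex"))
```

With this scoring, a 16-character name with two wrong characters scores 0.875 and passes the default 0.85 threshold, while short strings need to match almost exactly.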
Schema Validation
Detect inconsistencies across datasets:
- Missing fields
- Type conflicts (string vs number)
- New fields added
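The three checks above can be sketched as a comparison of inferred field types between an older and a newer dataset. The result keys (`missingFields`, `newFields`, `typeConflicts`) are hypothetical names chosen for this example, not the actor's documented output format.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each field name to the set of Python type names seen for it."""
    schema: dict[str, set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def compare_schemas(older: list[dict[str, Any]], newer: list[dict[str, Any]]) -> dict[str, Any]:
    old_schema, new_schema = infer_schema(older), infer_schema(newer)
    shared = old_schema.keys() & new_schema.keys()
    return {
        # Fields present in the older dataset but absent from the newer one.
        "missingFields": sorted(old_schema.keys() - new_schema.keys()),
        # Fields that only appear in the newer dataset.
        "newFields": sorted(new_schema.keys() - old_schema.keys()),
        # Shared fields whose observed types differ between the datasets.
        "typeConflicts": {
            field: sorted(old_schema[field] | new_schema[field])
            for field in sorted(shared)
            if old_schema[field] != new_schema[field]
        },
    }


if __name__ == "__main__":
    older = [{"id": 1, "price": 9.99, "sku": "a"}]
    newer = [{"id": 1, "price": "9.99", "rating": 4}]
    print(compare_schemas(older, newer))
```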
Input Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| `datasetIds` | array | 2-10 Apify dataset IDs to compare |
| `primaryKey` | string | Field to uniquely identify records (supports dot notation, e.g. `product.id`) |
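Before starting a run, it can help to check the two required parameters locally and to confirm that a dot-notation key actually resolves on a sample record. The helpers below are a hypothetical sketch: the error messages and the choice to return `None` for an unresolved path are assumptions, not the actor's behavior.

```python
from typing import Any


def resolve_key(record: dict[str, Any], path: str) -> Any:
    """Follow a dot-notation path such as 'product.id'; return None if it is missing."""
    value: Any = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def validate_input(run_input: dict[str, Any]) -> list[str]:
    """Return a list of problems with the required parameters (empty if none)."""
    errors = []
    dataset_ids = run_input.get("datasetIds")
    if not isinstance(dataset_ids, list) or not 2 <= len(dataset_ids) <= 10:
        errors.append("datasetIds must be an array of 2-10 dataset IDs")
    primary_key = run_input.get("primaryKey")
    if not isinstance(primary_key, str) or not primary_key:
        errors.append("primaryKey must be a non-empty string")
    return errors


if __name__ == "__main__":
    print(validate_input({"datasetIds": ["DATASET_ID_1", "DATASET_ID_2"], "primaryKey": "url"}))
    print(resolve_key({"product": {"id": "p-1"}}, "product.id"))
```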
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| `preset` | string | - | `price_monitoring`, `lead_lists`, `seo`, `real_estate` |
| `ignoreFields` | array | `[]` | Fields to skip during comparison |
| `sensitivity` | string | `strict` | `strict`, `medium`, `relaxed` |
| `numericTolerance` | number | `0` | Ignore changes below this % |
| `detectDuplicates` | boolean | `false` | Find duplicates within datasets |
| `fuzzyMatching` | boolean | `false` | Enable fuzzy duplicate detection |
| `fuzzyThreshold` | number | `0.85` | Similarity threshold (0-1) |
| `validateSchema` | boolean | `false` | Compare schemas across datasets |
| `mergeStrategy` | string | `left_priority` | How to merge conflicting records |
| `webhookUrl` | string | - | URL for completion notification |
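To show what a percentage tolerance such as `numericTolerance` means in practice, here is a small sketch. It assumes the change is measured relative to the old value and that a change exactly equal to the tolerance still counts as a change ("below this %" read strictly); the actor's boundary handling is not documented here.

```python
def within_tolerance(old: float, new: float, tolerance_pct: float) -> bool:
    """Return True if the change from old to new should be ignored."""
    if old == new:
        return True
    if old == 0:
        return False  # relative change from zero is undefined; treat as a real change
    change_pct = abs(new - old) / abs(old) * 100
    return change_pct < tolerance_pct


if __name__ == "__main__":
    print(within_tolerance(100.0, 100.5, 1))   # 0.5% change, under a 1% tolerance
    print(within_tolerance(100.0, 102.0, 1))   # 2% change, reported
    print(within_tolerance(99.99, 79.99, 0))   # default tolerance 0 reports every change
```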
Cleaning Rules

    {
      "cleaningRules": {
        "trimStrings": true,
        "normalizeEmails": true,
        "normalizePhones": true,
        "normalizeUrls": true,
        "normalizeCurrency": true,
        "removeEmojis": true
      }
    }

Full Example

    {
      "datasetIds": ["abc123", "def456"],
      "primaryKey": "url",
      "preset": "price_monitoring",
      "ignoreFields": ["lastChecked", "scraperVersion"],
      "detectDuplicates": true,
      "fuzzyMatching": true,
      "fuzzyThreshold": 0.85,
      "validateSchema": true,
      "mergeStrategy": "most_recent",
      "cleaningRules": {
        "trimStrings": true,
        "normalizeCurrency": true
      },
      "webhookUrl": "https://your-webhook.com/notify"
    }

Output
Results are saved to multiple datasets (with run ID suffix for isolation):
- default - Summary + all records with `_type` marker for filtering
- changes-{runId} - Changed records with diffs
- new-records-{runId} - New records
- removed-records-{runId} - Removed records
- merged-final-{runId} - All unique records merged
- duplicates-{runId} - Detected duplicates
- schema-{runId} - Schema analysis
- stats-{runId} - Full statistics
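Once the items of the default dataset have been downloaded, they can be split by their `_type` marker, and the run-specific dataset names can be derived from the run ID. The sketch below assumes the items are already loaded as a list of dicts; the `_type` values used in the demo (`"change"`, `"new"`) are hypothetical, since the actual marker values are not listed above.

```python
from collections import defaultdict
from typing import Any

# Name prefixes taken from the list above; each is suffixed with the run ID.
DATASET_PREFIXES = [
    "changes", "new-records", "removed-records",
    "merged-final", "duplicates", "schema", "stats",
]


def dataset_names(run_id: str) -> list[str]:
    """Build the run-specific dataset names, e.g. 'changes-<runId>'."""
    return [f"{prefix}-{run_id}" for prefix in DATASET_PREFIXES]


def group_by_type(items: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group default-dataset items by whatever value their _type field holds."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        groups[item.get("_type", "unknown")].append(item)
    return dict(groups)


if __name__ == "__main__":
    items = [
        {"_type": "change", "key": "a"},  # hypothetical marker values
        {"_type": "new", "key": "b"},
        {"_type": "change", "key": "c"},
    ]
    print({kind: len(rows) for kind, rows in group_by_type(items).items()})
    print(dataset_names("RUN_ID"))
```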
Output Tabs
View results in organized tabs in Apify Console:
- Summary - Stats overview
- Changes - Modified records with diffs
- New Records - Added records
- Removed Records - Deleted records
- Duplicates - Found duplicates
- Merged Records - Final merged data
Use Cases
Price Monitoring
Track competitor prices and stock levels:
{"datasetIds": ["yesterday_scrape", "today_scrape"],"primaryKey": "productUrl","preset": "price_monitoring"}
Lead Deduplication
Clean contact lists and find new leads:
{"datasetIds": ["crm_export", "new_leads"],"primaryKey": "email","preset": "lead_lists"}
SEO Monitoring
Track page changes:
{"datasetIds": ["last_week_crawl", "this_week_crawl"],"primaryKey": "url","preset": "seo"}
Database Sync
Identify records to INSERT, UPDATE, DELETE:
{"datasetIds": ["database_export", "fresh_scrape"],"primaryKey": "id","mergeStrategy": "right_priority"}
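The mapping from comparison results to database operations can be sketched locally: keys only in the fresh scrape become INSERTs, keys in both with different content become UPDATEs, and keys only in the database export become DELETEs. This is an illustration of the idea under those assumptions, using a hypothetical `plan_sync` helper; in practice the actor's new, changed, and removed outputs would play these roles.

```python
from typing import Any

Row = dict[str, Any]


def plan_sync(database_rows: list[Row], scraped_rows: list[Row], key: str = "id") -> dict[str, list[Row]]:
    """Classify rows into insert, update, and delete sets by primary key."""
    existing = {row[key]: row for row in database_rows}
    fresh = {row[key]: row for row in scraped_rows}
    return {
        "insert": [row for k, row in fresh.items() if k not in existing],
        "update": [row for k, row in fresh.items() if k in existing and row != existing[k]],
        "delete": [row for k, row in existing.items() if k not in fresh],
    }


if __name__ == "__main__":
    database = [{"id": 1, "price": 10}, {"id": 2, "price": 20}, {"id": 3, "price": 30}]
    scrape = [{"id": 1, "price": 10}, {"id": 2, "price": 25}, {"id": 4, "price": 40}]
    plan = plan_sync(database, scrape)
    print({operation: [row["id"] for row in rows] for operation, rows in plan.items()})
```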
Pricing
Pay-per-event pricing - you only pay for value delivered:
| Event | Price | Description |
|---|---|---|
| Dataset Loaded | $0.01 | Per dataset loaded |
| Records Compared | $0.005 | Per 1,000 records |
| Change Detected | $0.002 | Per change/new/removed |
| Duplicate Found | $0.005 | Per duplicate |
| Records Merged | $0.002 | Per 1,000 records |
| Records Cleaned | $0.002 | Per 1,000 records |
| Schema Validation | $0.02 | Once per run |
| Anomaly Detected | $0.01 | Per anomaly |
| Webhook Sent | $0.005 | Per notification |
| Preset Used | $0.01 | Once per run |
| Fuzzy Matching | $0.02 | Once per run |
Example Costs
| Scenario | Records | Changes | Cost |
|---|---|---|---|
| Small comparison | 1,000 | 50 | ~$0.13 |
| Medium comparison | 10,000 | 500 | ~$1.17 |
| Large comparison | 100,000 | 5,000 | ~$10.60 |
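To estimate a run's cost from the event prices, a small calculator like the one below can help. The event key names are hypothetical labels for the rows of the pricing table, and the sketch assumes per-1,000 events are billed proportionally rather than rounded up. Counting only two loaded datasets, 1,000 compared records, and 50 changes gives $0.125, in line with the small scenario; the larger scenarios may include optional events, so an exact match with the table should not be expected.

```python
# Prices copied from the pricing table; keys are hypothetical labels.
PRICES = {
    "dataset_loaded": 0.01,
    "records_compared": 0.005,   # per 1,000 records
    "change_detected": 0.002,
    "duplicate_found": 0.005,
    "records_merged": 0.002,     # per 1,000 records
    "records_cleaned": 0.002,    # per 1,000 records
    "schema_validation": 0.02,
    "anomaly_detected": 0.01,
    "webhook_sent": 0.005,
    "preset_used": 0.01,
    "fuzzy_matching": 0.02,
}
PER_THOUSAND = {"records_compared", "records_merged", "records_cleaned"}


def estimate_cost(events: dict[str, int]) -> float:
    """Sum price x units for each event; per-1,000 events are prorated (assumption)."""
    total = 0.0
    for name, count in events.items():
        units = count / 1000 if name in PER_THOUSAND else count
        total += PRICES[name] * units
    return round(total, 4)


if __name__ == "__main__":
    small = {"dataset_loaded": 2, "records_compared": 1000, "change_detected": 50}
    print(f"${estimate_cost(small):.3f}")
```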
Webhook Payload

    {
      "status": "completed",
      "summary": {
        "datasetsCompared": 2,
        "stats": {
          "changedCount": 150,
          "newCount": 25,
          "removedCount": 10
        },
        "anomaliesCount": 5,
        "duplicatesCount": 12,
        "mergedCount": 1000
      },
      "actorRunId": "abc123..."
    }

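On the receiving side, a webhook handler only needs to parse this JSON body and read the counters. The sketch below uses the field names from the payload above; the `summarize_payload` helper and its message format are hypothetical, and the handling of non-`completed` statuses is an assumption because only the completed case is shown.

```python
import json


def summarize_payload(raw_body: str) -> str:
    """Turn the webhook JSON body into a one-line, human-readable summary."""
    payload = json.loads(raw_body)
    run_id = payload.get("actorRunId", "unknown")
    if payload.get("status") != "completed":
        # Other status values are not documented above; report them as-is.
        return f"Run {run_id}: status {payload.get('status')}"
    summary = payload["summary"]
    stats = summary["stats"]
    return (
        f"Run {run_id}: {stats['changedCount']} changed, "
        f"{stats['newCount']} new, {stats['removedCount']} removed, "
        f"{summary['anomaliesCount']} anomalies, "
        f"{summary['duplicatesCount']} duplicates"
    )


if __name__ == "__main__":
    body = json.dumps({
        "status": "completed",
        "summary": {
            "datasetsCompared": 2,
            "stats": {"changedCount": 150, "newCount": 25, "removedCount": 10},
            "anomaliesCount": 5,
            "duplicatesCount": 12,
            "mergedCount": 1000,
        },
        "actorRunId": "RUN_ID",
    })
    print(summarize_payload(body))
```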
Tips
- Use presets - They're optimized for common use cases
- Set ignoreFields - Skip timestamps and scraper metadata
- Enable fuzzyMatching - Catch near-duplicates in lead lists
- Use webhooks - Get notified when comparison completes
- Check anomalies - Large price swings might indicate data issues
Support
Questions or issues? Open an issue on the actor's GitHub repository.