Apify Smart Dataset Comparator

Pricing

Pay per event

Compare 2-10 Apify datasets to detect changes, new/removed records, and duplicates. Features field-level diffs, smart merging, schema validation, data cleaning, and anomaly detection. Perfect for price monitoring, lead deduplication, and data quality tracking.

Rating: 0.0 (0)
Developer: Agenscrape (maintained by Community)
Actor stats: 1 bookmarked, 2 total users, 1 monthly active user
Last modified: 5 days ago

Smart Dataset Comparator & Change Detector

Compare 2-10 Apify datasets to detect changes, new/removed records, duplicates, and merge data with custom rules. Perfect for price monitoring, lead deduplication, SEO tracking, and data quality validation.

Facing an issue, unexpected error, edge case, or have a feature suggestion? Post it here and we'll address it within 24 hours.

Quick Start

{
  "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
  "primaryKey": "url"
}

That's it! Get instant comparison results showing what changed, what's new, and what was removed.

What You Get

| Output | Description |
|---|---|
| Changes | Records that changed with field-level before/after diffs |
| New Records | Records only in newer datasets |
| Removed Records | Records only in older datasets |
| Merged Data | All unique records combined using your merge strategy |
| Duplicates | Exact and fuzzy duplicate detection |
| Schema Analysis | Field types, conflicts, and consistency checks |
| Anomalies | Large price changes (>50%), stock depletions |
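
The two anomaly conditions named in the last row can be illustrated with a short check. This is a minimal sketch, not the Actor's implementation: the field names price and stock are borrowed from the change example further down, and the labels large_price_change and stock_depleted are hypothetical.

```python
def find_anomalies(old, new, price_threshold_pct=50.0):
    """Return illustrative anomaly labels for one old/new record pair."""
    anomalies = []

    old_price, new_price = old.get("price"), new.get("price")
    both_numeric = isinstance(old_price, (int, float)) and isinstance(new_price, (int, float))
    if both_numeric and old_price != 0:
        # Percentage change relative to the old price (assumed reference point).
        change_pct = abs(new_price - old_price) / abs(old_price) * 100
        if change_pct > price_threshold_pct:
            anomalies.append("large_price_change")

    old_stock, new_stock = old.get("stock"), new.get("stock")
    # A stock depletion is taken to mean: previously in stock, now exactly zero.
    if isinstance(old_stock, (int, float)) and old_stock > 0 and new_stock == 0:
        anomalies.append("stock_depleted")

    return anomalies
```

A record that drops from 100 to 40 in price and from 5 to 0 in stock would be flagged for both conditions under these assumptions.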

Features

Change Detection

Compares records by primary key and shows exactly what changed:

{
  "key": "product-123",
  "changes": {
    "price": { "old": 99.99, "new": 79.99, "type": "decreased" },
    "stock": { "old": 100, "new": 0, "type": "decreased" }
  }
}
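
To make the shape of this output concrete, here is a rough sketch of how a field-level diff of this form could be produced for one pair of records. It is an illustration rather than the Actor's code. The "increased" and "changed" type values are assumptions; only "decreased" appears in the example above.

```python
def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_record(key, old, new, ignore_fields=()):
    """Compare two versions of a record and describe each changed field."""
    changes = {}
    for field in sorted(set(old) | set(new)):
        if field in ignore_fields:
            continue
        before, after = old.get(field), new.get(field)
        if before == after:
            continue
        if is_number(before) and is_number(after):
            kind = "increased" if after > before else "decreased"
        else:
            kind = "changed"  # assumed label for non-numeric changes
        changes[field] = {"old": before, "new": after, "type": kind}
    return {"key": key, "changes": changes}
```

Calling diff_record("product-123", old, new) with the price and stock values from the example reproduces the structure shown above.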

Smart Presets

Pre-configured settings for common use cases:

| Preset | Best For | What It Does |
|---|---|---|
| price_monitoring | E-commerce, competitors | 1% price tolerance, ignores timestamps |
| lead_lists | CRM, marketing | Normalizes emails/phones, fuzzy dedup |
| seo | Content monitoring | Strict comparison, URL normalization |
| real_estate | Property listings | 0.5% price tolerance, phone normalization |

Data Cleaning

Normalize data before comparison:

  • Emails: Test+Spam@Gmail.com → test@gmail.com
  • Phones: (555) 123-4567 → 5551234567
  • URLs: Remove tracking params (utm_*, fbclid)
  • Currency: $1,234.56 → 1234.56
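
The four normalizations listed above can be approximated with the Python standard library. The sketch below is an assumption about what each rule does, inferred from the examples. It is not the Actor's own logic, and the function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_email(email):
    """Lowercase the address and drop any +tag from the local part."""
    local, _, domain = email.strip().lower().partition("@")
    return f"{local.split('+', 1)[0]}@{domain}"


def normalize_phone(phone):
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


def normalize_url(url):
    """Remove utm_* and fbclid query parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith("utm_") and name.lower() != "fbclid"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def normalize_currency(amount):
    """Strip symbols and thousands separators; assumes '.' is the decimal mark."""
    return float(re.sub(r"[^\d.\-]", "", amount))
```

Amounts written with a decimal comma (for example 1.234,56) would need a locale-aware variant, which this sketch does not cover.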

Merge Strategies

When the same record exists in multiple datasets:

| Strategy | Description |
|---|---|
| left_priority | First dataset wins (default) |
| right_priority | Last/newest dataset wins |
| most_recent | Record with newest timestamp wins |
| most_complete | Record with most filled fields wins |
| combine_arrays | Merge array fields from all records |
| average_numbers | Average numeric fields |
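
As a rough illustration of how these strategies differ, the sketch below resolves several versions of one record, ordered from the first dataset to the last. It rests on assumptions the table does not settle: whole records (not individual fields) win under the priority strategies, timestamps are ISO 8601 strings in a hypothetical updatedAt field, and for the last two strategies the first record supplies all other fields.

```python
def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_filled(value):
    return value not in (None, "", [], {})


def merge_records(records, strategy="left_priority", timestamp_field="updatedAt"):
    """Resolve versions of the same record, ordered first dataset to last."""
    if strategy == "left_priority":
        return dict(records[0])
    if strategy == "right_priority":
        return dict(records[-1])
    if strategy == "most_recent":
        # ISO 8601 strings sort chronologically when compared as text.
        return dict(max(records, key=lambda r: r.get(timestamp_field, "")))
    if strategy == "most_complete":
        return dict(max(records, key=lambda r: sum(is_filled(v) for v in r.values())))

    merged = dict(records[0])
    if strategy == "combine_arrays":
        for record in records[1:]:
            for field, value in record.items():
                if isinstance(value, list) and isinstance(merged.get(field), list):
                    extra = [v for v in value if v not in merged[field]]
                    merged[field] = merged[field] + extra  # new list, no mutation
        return merged
    if strategy == "average_numbers":
        for field, value in list(merged.items()):
            values = [r[field] for r in records if is_number(r.get(field))]
            if is_number(value) and values:
                merged[field] = sum(values) / len(values)
        return merged
    raise ValueError(f"Unknown merge strategy: {strategy}")
```

Note that average_numbers as sketched here also averages numeric identifiers, so in practice key fields would need to be excluded.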

Duplicate Detection

Find duplicates within each dataset:

  • Exact: Same primary key
  • Fuzzy: Similar records using Levenshtein distance (configurable threshold)
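
For readers unfamiliar with Levenshtein-based matching, the following sketch shows one common way to turn edit distance into a 0-1 similarity score and compare it against a threshold. The normalization (one minus distance divided by the longer length) and the case-insensitive comparison are assumptions; the Actor may score similarity differently.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to a 0-1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def fuzzy_duplicates(values, threshold=0.85):
    """Return pairs of values whose similarity meets the threshold."""
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if similarity(values[i].lower(), values[j].lower()) >= threshold:
                pairs.append((values[i], values[j]))
    return pairs
```

The pairwise loop is quadratic in the number of values, so this form is only suitable for small lists.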

Schema Validation

Detect inconsistencies across datasets:

  • Missing fields
  • Type conflicts (string vs number)
  • New fields added
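
The three checks above could be approximated as follows. This is a sketch under stated assumptions: the first dataset is treated as the baseline for "new" fields, types are compared by Python type name, and the report layout is hypothetical.

```python
def analyze_schema(datasets):
    """Compare field presence and value types across a list of datasets."""
    # One mapping per dataset: field name -> set of type names seen.
    field_types = []
    for records in datasets:
        types = {}
        for record in records:
            for field, value in record.items():
                types.setdefault(field, set()).add(type(value).__name__)
        field_types.append(types)

    all_fields = set().union(*field_types)
    baseline = set(field_types[0])
    report = {
        "missing": {},          # field -> indexes of datasets lacking it
        "type_conflicts": {},   # field -> sorted type names seen
        "new_fields": sorted(all_fields - baseline),
    }
    for field in sorted(all_fields):
        missing_in = [i for i, types in enumerate(field_types) if field not in types]
        if missing_in:
            report["missing"][field] = missing_in
        seen = set().union(*(types.get(field, set()) for types in field_types))
        if len(seen) > 1:
            report["type_conflicts"][field] = sorted(seen)
    return report
```

A price stored as a float in one dataset and as a string in another shows up under type_conflicts; null values count as their own type (NoneType) in this sketch.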

Input Parameters

Required

| Parameter | Type | Description |
|---|---|---|
| datasetIds | array | 2-10 Apify dataset IDs to compare |
| primaryKey | string | Field to uniquely identify records (supports product.id dot notation) |
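
A small helper can catch invalid input before a run is started, and a second one shows what dot notation in primaryKey is taken to mean. Both functions are hypothetical illustrations built from the constraints in the table, not part of the Actor.

```python
def get_by_path(record, path):
    """Resolve a dot-notation key such as 'product.id' against a nested record."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def build_input(dataset_ids, primary_key, **options):
    """Assemble Actor input, enforcing the documented required parameters."""
    if not 2 <= len(dataset_ids) <= 10:
        raise ValueError("datasetIds must contain between 2 and 10 dataset IDs")
    if not primary_key:
        raise ValueError("primaryKey is required")
    return {"datasetIds": list(dataset_ids), "primaryKey": primary_key, **options}
```

For example, build_input(["abc123", "def456"], "url", preset="seo") yields a dictionary that can be serialized as the Actor's JSON input.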

Optional

| Parameter | Type | Default | Description |
|---|---|---|---|
| preset | string | - | price_monitoring, lead_lists, seo, real_estate |
| ignoreFields | array | [] | Fields to skip during comparison |
| sensitivity | string | strict | strict, medium, relaxed |
| numericTolerance | number | 0 | Ignore changes below this % |
| detectDuplicates | boolean | false | Find duplicates within datasets |
| fuzzyMatching | boolean | false | Enable fuzzy duplicate detection |
| fuzzyThreshold | number | 0.85 | Similarity threshold (0-1) |
| validateSchema | boolean | false | Compare schemas across datasets |
| mergeStrategy | string | left_priority | How to merge conflicting records |
| webhookUrl | string | - | URL for completion notification |
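
The effect of numericTolerance can be pictured with a percentage check like the one below. It is a sketch that assumes the percentage is measured relative to the old value; the description above does not specify the reference point.

```python
def exceeds_tolerance(old, new, tolerance_pct=0.0):
    """Return True when a numeric change should be reported."""
    if old == new:
        return False
    if old == 0:
        # No meaningful percentage exists; treat any move away from zero as a change.
        return True
    change_pct = abs(new - old) / abs(old) * 100
    return change_pct >= tolerance_pct
```

With a 1% tolerance, a price moving from 100.00 to 100.50 would be ignored, while a move to 102.00 would be reported.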

Cleaning Rules

{
  "cleaningRules": {
    "trimStrings": true,
    "normalizeEmails": true,
    "normalizePhones": true,
    "normalizeUrls": true,
    "normalizeCurrency": true,
    "removeEmojis": true
  }
}

Full Example

{
  "datasetIds": ["abc123", "def456"],
  "primaryKey": "url",
  "preset": "price_monitoring",
  "ignoreFields": ["lastChecked", "scraperVersion"],
  "detectDuplicates": true,
  "fuzzyMatching": true,
  "fuzzyThreshold": 0.85,
  "validateSchema": true,
  "mergeStrategy": "most_recent",
  "cleaningRules": {
    "trimStrings": true,
    "normalizeCurrency": true
  },
  "webhookUrl": "https://your-webhook.com/notify"
}

Output

Results are saved to multiple datasets (with run ID suffix for isolation):

  • default - Summary + all records with _type marker for filtering
  • changes-{runId} - Changed records with diffs
  • new-records-{runId} - New records
  • removed-records-{runId} - Removed records
  • merged-final-{runId} - All unique records merged
  • duplicates-{runId} - Detected duplicates
  • schema-{runId} - Schema analysis
  • stats-{runId} - Full statistics
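
When post-processing results, it can help to derive the dataset names for a given run and to group items from the default dataset by their _type marker. The sketch below assumes the naming pattern listed above; the _type values used in the usage note are hypothetical, since the listing does not enumerate them.

```python
OUTPUT_PREFIXES = [
    "changes",
    "new-records",
    "removed-records",
    "merged-final",
    "duplicates",
    "schema",
    "stats",
]


def output_dataset_names(run_id):
    """Map each output prefix to its run-specific dataset name."""
    return {prefix: f"{prefix}-{run_id}" for prefix in OUTPUT_PREFIXES}


def split_by_type(items):
    """Group default-dataset items by their _type marker."""
    groups = {}
    for item in items:
        groups.setdefault(item.get("_type", "unknown"), []).append(item)
    return groups
```

For instance, output_dataset_names("abc123")["changes"] gives "changes-abc123", and split_by_type separates items marked with, say, "change" from those marked "new".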

Output Tabs

View results in organized tabs in Apify Console:

  • Summary - Stats overview
  • Changes - Modified records with diffs
  • New Records - Added records
  • Removed Records - Deleted records
  • Duplicates - Found duplicates
  • Merged Records - Final merged data

Use Cases

Price Monitoring

Track competitor prices and stock levels:

{
  "datasetIds": ["yesterday_scrape", "today_scrape"],
  "primaryKey": "productUrl",
  "preset": "price_monitoring"
}

Lead Deduplication

Clean contact lists and find new leads:

{
  "datasetIds": ["crm_export", "new_leads"],
  "primaryKey": "email",
  "preset": "lead_lists"
}

SEO Monitoring

Track page changes:

{
  "datasetIds": ["last_week_crawl", "this_week_crawl"],
  "primaryKey": "url",
  "preset": "seo"
}

Database Sync

Identify records to INSERT, UPDATE, DELETE:

{
  "datasetIds": ["database_export", "fresh_scrape"],
  "primaryKey": "id",
  "mergeStrategy": "right_priority"
}
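
The mapping from comparison results to database operations can be sketched as follows: new records become INSERTs, changed records become UPDATEs, and removed records become DELETEs. The function below works on two in-memory lists and is an illustration of the idea, not the Actor's output format.

```python
def plan_sync(database_rows, scraped_rows, primary_key="id"):
    """Classify rows into insert/update/delete sets by primary key."""
    db = {row[primary_key]: row for row in database_rows}
    fresh = {row[primary_key]: row for row in scraped_rows}
    return {
        # Present only in the fresh scrape.
        "insert": [fresh[k] for k in fresh if k not in db],
        # Present in both, but with different content.
        "update": [fresh[k] for k in fresh if k in db and fresh[k] != db[k]],
        # Present only in the database export.
        "delete": [db[k] for k in db if k not in fresh],
    }
```

Using the fresh scrape's version for updates mirrors the right_priority strategy in the input above, where the last dataset wins.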

Pricing

Pay-per-event pricing - you only pay for value delivered:

| Event | Price | Description |
|---|---|---|
| Dataset Loaded | $0.01 | Per dataset loaded |
| Records Compared | $0.005 | Per 1,000 records |
| Change Detected | $0.002 | Per change/new/removed |
| Duplicate Found | $0.005 | Per duplicate |
| Records Merged | $0.002 | Per 1,000 records |
| Records Cleaned | $0.002 | Per 1,000 records |
| Schema Validation | $0.02 | Once per run |
| Anomaly Detected | $0.01 | Per anomaly |
| Webhook Sent | $0.005 | Per notification |
| Preset Used | $0.01 | Once per run |
| Fuzzy Matching | $0.02 | Once per run |

Example Costs

| Scenario | Records | Changes | Cost |
|---|---|---|---|
| Small comparison | 1,000 | 50 | ~$0.13 |
| Medium comparison | 10,000 | 500 | ~$1.17 |
| Large comparison | 100,000 | 5,000 | ~$10.60 |
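
To estimate a run's cost from the event prices, the per-event charges can simply be added up. The calculator below is a sketch with assumptions: per-1,000 charges are prorated rather than rounded up, and the parameter names are hypothetical. The scenarios in the table do not say which optional events they include, so results may differ from the listed figures.

```python
# Event prices in USD, copied from the pricing table.
PRICES = {
    "dataset_loaded": 0.01,
    "records_compared_per_1000": 0.005,
    "change_detected": 0.002,
    "duplicate_found": 0.005,
    "records_merged_per_1000": 0.002,
    "records_cleaned_per_1000": 0.002,
    "schema_validation": 0.02,
    "anomaly_detected": 0.01,
    "webhook_sent": 0.005,
    "preset_used": 0.01,
    "fuzzy_matching": 0.02,
}


def estimate_cost(datasets, records, changes, duplicates=0, anomalies=0,
                  merged=0, cleaned=0, schema=False, webhook=False,
                  preset=False, fuzzy=False):
    """Sum per-event charges for one run (prorated per-1,000 pricing assumed)."""
    cost = datasets * PRICES["dataset_loaded"]
    cost += records / 1000 * PRICES["records_compared_per_1000"]
    cost += changes * PRICES["change_detected"]
    cost += duplicates * PRICES["duplicate_found"]
    cost += anomalies * PRICES["anomaly_detected"]
    cost += merged / 1000 * PRICES["records_merged_per_1000"]
    cost += cleaned / 1000 * PRICES["records_cleaned_per_1000"]
    # Flat charges that apply once per run when the feature is used.
    flat_events = (
        (schema, "schema_validation"),
        (webhook, "webhook_sent"),
        (preset, "preset_used"),
        (fuzzy, "fuzzy_matching"),
    )
    for enabled, event in flat_events:
        if enabled:
            cost += PRICES[event]
    return round(cost, 4)
```

Under these assumptions, two datasets with 1,000 compared records and 50 changes come to $0.125 for the core events, which is close to the small scenario's listed ~$0.13.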

Webhook Payload

{
  "status": "completed",
  "summary": {
    "datasetsCompared": 2,
    "stats": {
      "changedCount": 150,
      "newCount": 25,
      "removedCount": 10
    },
    "anomaliesCount": 5,
    "duplicatesCount": 12,
    "mergedCount": 1000
  },
  "actorRunId": "abc123..."
}
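
On the receiving side, a webhook handler only needs to parse the JSON body and read the counts. The function below is a minimal sketch that relies solely on the fields shown in the payload above. The function name and the message format are hypothetical, and missing fields fall back to zero.

```python
import json


def summarize_webhook(body):
    """Turn a webhook request body (JSON text) into a one-line summary."""
    payload = json.loads(body)
    run_id = payload.get("actorRunId", "unknown")
    if payload.get("status") != "completed":
        return f"Run {run_id} ended with status {payload.get('status')}"

    summary = payload.get("summary", {})
    stats = summary.get("stats", {})
    return (
        f"Run {run_id}: "
        f"{stats.get('changedCount', 0)} changed, "
        f"{stats.get('newCount', 0)} new, "
        f"{stats.get('removedCount', 0)} removed, "
        f"{summary.get('anomaliesCount', 0)} anomalies"
    )
```

A web framework handler would pass the raw request body to this function and forward the result to a chat channel or log.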

Tips

  1. Use presets - They're optimized for common use cases
  2. Set ignoreFields - Skip timestamps and scraper metadata
  3. Enable fuzzyMatching - Catch near-duplicates in lead lists
  4. Use webhooks - Get notified when comparison completes
  5. Check anomalies - Large price swings might indicate data issues

Support

Questions or issues? Open an issue on the actor's GitHub repository.