Dataset Deduplicator
Developer: Stas Persiianenko
Pricing: Pay per event
Merge and deduplicate Apify datasets by any field combination. Remove duplicates, keep first or last occurrence. Case-insensitive matching, whitespace trimming. Pay per 1K items processed.
Merge and deduplicate items across one or more Apify datasets. Specify any combination of fields as the dedup key, choose whether to keep the first or last occurrence, and export clean, unique data to a new dataset — all through a simple no-code interface or the Apify API.
What does Dataset Deduplicator do?
Dataset Deduplicator is an Apify utility actor that removes duplicate items from your datasets. It reads items from one or more source datasets, compares them by the fields you specify, and outputs only unique records.
📦 Merge multiple datasets — Combine results from several scraper runs into one clean dataset
🔍 Flexible dedup keys — Use any field or combination of fields (URL, email, name + city, etc.)
🔄 First or last wins — Choose whether to keep the earliest or most recent occurrence of each duplicate
📊 Stats tracking — Get a summary of total items loaded, unique items kept, and duplicates removed
⚡ Fast batch processing — Processes items in batches of 1,000 for efficient memory usage
🎯 Case-insensitive matching — Optionally ignore case differences when comparing field values
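The core idea — build a key from the chosen fields, then keep the first or last item per key — can be sketched in a few lines of Python. This is an illustrative sketch of the technique, not the actor's actual implementation:

```python
def dedup(items, fields, keep="first", case_insensitive=False, trim=True):
    """Keep one item per unique combination of values in `fields`."""
    def key(item):
        parts = []
        for f in fields:
            v = str(item.get(f, ""))  # a missing field compares as an empty string
            if trim:
                v = v.strip()
            if case_insensitive:
                v = v.lower()
            parts.append(v)
        return tuple(parts)

    seen = {}
    for item in items:
        k = key(item)
        if keep == "last" or k not in seen:
            seen[k] = item  # "last" overwrites earlier matches; "first" keeps the original
    return list(seen.values())
```

With `case_insensitive=True`, items whose key fields differ only in case collapse into one record; with `keep="last"`, the most recently loaded match wins.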
Who is Dataset Deduplicator for?
🧑‍💻 Data engineers — Clean up datasets before loading into databases or data warehouses
🕷️ Web scraping professionals — Merge outputs from multiple scraper runs without duplicates
🤖 Automation builders — Add deduplication as a post-processing step in your workflows
📈 Analysts — Ensure data quality before running reports or feeding data into dashboards
🏢 Teams on Apify — Maintain a single source of truth across scheduled scraping jobs
Why use Dataset Deduplicator?
Running scrapers over time leads to duplicate data. Multiple runs collect overlapping results, scheduled jobs re-scrape the same pages, and merging datasets from different sources introduces repeated records. Duplicates waste storage, slow down analysis, and cause errors in downstream systems.
🛠️ No code required — Configure everything through Apify Console, no scripting needed
🔗 Works with any dataset — Compatible with output from any Apify actor or API-uploaded data
📐 Composite keys — Deduplicate on multiple fields simultaneously (e.g., name + city + email)
✂️ Whitespace handling — Automatically trims leading/trailing whitespace before comparison
💰 Pay per use — Only pay for the items you process, no monthly fees
🗂️ Custom output — Write results to the default run dataset or a specific named dataset
How much does it cost to deduplicate datasets?
Dataset Deduplicator uses Apify's pay-per-event pricing. You only pay for what you process.
| Event | Price | Description |
|---|---|---|
| Run started | $0.005 | One-time fee per run |
| Items processed | $0.002 per 1,000 items | Charged per batch of 1,000 items loaded from source datasets |
Pricing examples:
| Items processed | Cost |
|---|---|
| 1,000 | $0.007 |
| 10,000 | $0.025 |
| 100,000 | $0.205 |
| 1,000,000 | $2.005 |
The actor runs on minimal memory (256 MB default) and completes quickly, so platform compute costs are negligible.
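The examples above follow directly from the two line items: a flat run fee plus a per-batch charge. A small estimator (an assumption-based sketch that reproduces the table, assuming the per-1,000-items fee applies per started batch):

```python
import math

def estimate_cost(items, run_fee=0.005, per_1k=0.002):
    """Rough run cost in USD: one run fee plus a charge per 1,000-item batch."""
    return round(run_fee + math.ceil(items / 1000) * per_1k, 5)
```

For example, 100,000 items comes out to $0.005 + 100 × $0.002 = $0.205, matching the pricing table.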
How to deduplicate datasets step by step
1️⃣ Get your dataset IDs — Find them in Apify Console under each actor run's "Dataset" tab, or from the API response after a run completes.
2️⃣ Choose your dedup fields — Decide which fields uniquely identify a record. For URLs, use url. For contacts, try email or name + company.
3️⃣ Go to Dataset Deduplicator on Apify Store and click "Try for free".
4️⃣ Enter your configuration:
- Paste one or more dataset IDs into the "Dataset IDs" field
- Add the field names to use for dedup comparison
- Choose whether to keep the first or last occurrence
- Enable case-insensitive mode if needed
5️⃣ Click Start and wait for the run to complete.
6️⃣ Download results — Go to the "Dataset" tab in your run to preview, export as JSON/CSV/Excel, or connect via API.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| datasetIds | array of strings | Yes | — | One or more Apify dataset IDs to load and deduplicate |
| fields | array of strings | Yes | — | Field names used as the dedup key. Items matching on all fields are duplicates |
| keepOrder | string | No | first | Keep the first or last occurrence of each duplicate |
| caseInsensitive | boolean | No | false | Ignore case when comparing field values |
| trimWhitespace | boolean | No | true | Trim leading/trailing whitespace before comparison |
| outputDatasetId | string | No | — | Write results to a specific named dataset instead of the run's default dataset |
| maxItems | integer | No | 1000000 | Maximum number of items to process across all source datasets |
Input example
```json
{
  "datasetIds": ["abc123", "def456", "ghi789"],
  "fields": ["url"],
  "keepOrder": "first",
  "caseInsensitive": false,
  "trimWhitespace": true,
  "maxItems": 50000
}
```
Output data
The actor outputs deduplicated items to the run's default dataset (or a named dataset if outputDatasetId is set). The output items have the exact same structure as the input — no fields are added or removed.
Additionally, the actor stores a stats summary in the key-value store under the key stats:
| Field | Type | Description |
|---|---|---|
| totalLoaded | integer | Total number of items loaded from all source datasets |
| uniqueKept | integer | Number of unique items written to the output dataset |
| duplicatesRemoved | integer | Number of duplicate items that were discarded |
| datasetsProcessed | integer | Number of source datasets that were read |
Output example
Deduplicated dataset items (same structure as input):
```json
[
  {
    "url": "https://example.com/product/1",
    "title": "Widget A",
    "price": 29.99
  },
  {
    "url": "https://example.com/product/2",
    "title": "Widget B",
    "price": 49.99
  }
]
```
Stats (key-value store → stats):
```json
{
  "totalLoaded": 15000,
  "uniqueKept": 12350,
  "duplicatesRemoved": 2650,
  "datasetsProcessed": 3
}
```
Tips and best practices
🔑 Use composite keys for accuracy — Deduplicating on a single field like name can be too aggressive. Combine fields like name + city or email + company for precise matching.
🔤 Enable case-insensitive mode for text fields — URLs and emails often differ only in case (HTTP://Example.com vs http://example.com). Turn on caseInsensitive to catch these.
📋 Merge datasets from scheduled runs — If you run a scraper daily, pass all recent dataset IDs to combine and deduplicate them into one clean dataset.
⏱️ Use keepOrder: last for freshness — When merging datasets over time, keep the latest occurrence to ensure you have the most up-to-date data for each item.
🧪 Test with a small maxItems first — When working with large datasets, start with maxItems: 1000 to verify your field selection before processing millions of items.
📁 Use outputDatasetId for persistent storage — Write to a named dataset to build up a clean, deduplicated master dataset across multiple runs.
Common use cases
🛒 E-commerce price monitoring — Merge daily product scrapes and keep only the latest price per product URL.
📧 Lead generation — Combine contact lists from multiple sources and remove duplicate emails.
🏠 Real estate listings — Deduplicate property listings collected from different pages or search queries.
📰 News aggregation — Merge articles from multiple scraper runs and remove duplicates by headline or URL.
🔍 SEO auditing — Combine crawl results and deduplicate by page URL to get a clean site map.
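The price-monitoring case is easy to picture locally. Merging two days of scrapes while keeping only the latest record per URL mirrors running the actor with keepOrder set to last (the URLs and prices below are made up for illustration):

```python
day1 = [{"url": "/p/1", "price": 29.99}, {"url": "/p/2", "price": 49.99}]
day2 = [{"url": "/p/1", "price": 24.99}]  # re-scraped with an updated price

latest = {}
for item in day1 + day2:        # later runs overwrite earlier ones,
    latest[item["url"]] = item  # i.e. keepOrder: "last"
merged = list(latest.values())
```

The merged list holds one record per URL, each carrying the most recent price.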
Integrations
Dataset Deduplicator fits naturally into Apify workflows as a post-processing step.
🔗 Chained actors — Use Apify's actor-to-actor calling to trigger deduplication after a scraper finishes.
📅 Scheduled cleanup — Set up a schedule to periodically deduplicate accumulated data.
⚡ Webhooks — Trigger deduplication via webhooks when a scraper run succeeds.
🔌 Zapier / Make — Connect with Zapier or Make to automate dedup in multi-step workflows.
📊 Data pipelines — Use as a cleaning step before pushing data to Google Sheets, Airtable, databases, or BI tools.
How to use Dataset Deduplicator via API
You can run Dataset Deduplicator programmatically using the Apify API. Here are examples in multiple languages.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/dataset-dedup').call({
    datasetIds: ['dataset-id-1', 'dataset-id-2'],
    fields: ['url'],
    keepOrder: 'first',
    caseInsensitive: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} unique items`);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/dataset-dedup").call(run_input={
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": True,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Got {len(items)} unique items")
```
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~dataset-dedup/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": true
  }'
```
Use with Claude AI (MCP)
This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.
Setup for Claude Code
```bash
claude mcp add --transport http apify "https://mcp.apify.com"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}
```
Example prompts
- "I ran a scraper three times and got datasets abc123, def456, and ghi789 — can you merge and deduplicate them by URL?"
- "Deduplicate this dataset by email address and keep the most recent record for each duplicate"
- "I have two lead lists from different sources — combine them into one clean list with no duplicate emails"
Learn more in the Apify MCP documentation.
Is it legal to deduplicate datasets?
✅ Yes. Dataset Deduplicator operates exclusively on your own Apify datasets. It does not scrape websites, access third-party systems, or collect any external data. It is a data cleaning utility that processes data you already own and have stored on the Apify platform.
There are no legal concerns with deduplicating your own data.
Frequently asked questions
Can I deduplicate a single dataset?
Yes. Pass a single dataset ID in the datasetIds array. The actor will remove duplicates within that one dataset.
What happens if a field is missing from some items?
Missing fields are treated as empty strings. Two items that both lack a field will match on that field. Combine multiple fields to avoid false matches.
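To see why combining fields helps, consider a hypothetical key function (an illustration of the documented behavior, not the actor's internals): an absent field and an empty-string field produce the same key, so two sparse records can collide on a single field.

```python
def key(item, fields):
    # Missing fields fall back to "", matching the documented behavior
    return tuple(str(item.get(f, "")).strip() for f in fields)

a = {"name": "Ada"}               # "email" is missing entirely
b = {"name": "Ada", "email": ""}  # "email" is present but empty
```

Here `key(a, ["name", "email"])` equals `key(b, ["name", "email"])`, so the two items would be treated as duplicates; adding a field that both records actually populate avoids the false match.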
Can I deduplicate on nested fields?
Currently, only top-level fields are supported. If you need to deduplicate on nested values, consider flattening your data first.
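One simple way to flatten before running the actor is to lift nested values into dotted top-level keys. This helper is an illustrative sketch, not part of the actor:

```python
def flatten(item, sep="."):
    """Turn nested dicts into dotted top-level keys, e.g. a nested
    {"contact": {"email": ...}} becomes {"contact.email": ...}."""
    out = {}
    for k, v in item.items():
        if isinstance(v, dict):
            for sub_key, sub_val in flatten(v, sep).items():
                out[f"{k}{sep}{sub_key}"] = sub_val
        else:
            out[k] = v
    return out
```

After flattening, you could deduplicate on contact.email as an ordinary top-level field.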
Does the actor modify the original datasets?
No. Source datasets are read-only. The actor writes deduplicated items to a new output dataset and never modifies the originals.
How many items can it handle?
The default limit is 1,000,000 items. You can adjust this with the maxItems parameter. The actor processes items in memory, so very large datasets may require higher memory allocation.
Can I use it with datasets from other actors?
Yes. Any valid Apify dataset ID works, regardless of which actor created it.
What format is the output?
The output dataset contains the exact same item structure as the input. No fields are added, removed, or transformed.
How do I find my dataset IDs?
In Apify Console, go to Storage > Datasets to see all your datasets with their IDs. You can also find the dataset ID in the "Dataset" tab of any actor run.
Related tools
Browse more tools from automation-lab on Apify Store.