Dataset Deduplicator

Merge and deduplicate Apify datasets by any field combination. Remove duplicates, keep first or last occurrence. Case-insensitive matching, whitespace trimming. Pay per 1K items processed.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: Stas Persiianenko (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 2 days ago

Merge and deduplicate items across one or more Apify datasets. Specify any combination of fields as the dedup key, choose whether to keep the first or last occurrence, and export clean, unique data to a new dataset — all through a simple no-code interface or the Apify API.

What does Dataset Deduplicator do?

Dataset Deduplicator is an Apify utility actor that removes duplicate items from your datasets. It reads items from one or more source datasets, compares them by the fields you specify, and outputs only unique records.

📦 Merge multiple datasets — Combine results from several scraper runs into one clean dataset

🔍 Flexible dedup keys — Use any field or combination of fields (URL, email, name + city, etc.)

🔄 First or last wins — Choose whether to keep the earliest or most recent occurrence of each duplicate

📊 Stats tracking — Get a summary of total items loaded, unique items kept, and duplicates removed

⚡ Fast batch processing — Processes items in batches of 1,000 for efficient memory usage

🎯 Case-insensitive matching — Optionally ignore case differences when comparing field values
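Conceptually, the matching behavior described above can be sketched in a few lines of Python. This is a simplified illustration of composite keys, first/last wins, trimming, and case-insensitive comparison, not the actor's actual source code:

```python
def deduplicate(items, fields, keep="first", case_insensitive=False, trim=True):
    """Keep one item per unique combination of the given fields."""
    def make_key(item):
        parts = []
        for f in fields:
            value = str(item.get(f, ""))  # missing fields compare as empty strings
            if trim:
                value = value.strip()
            if case_insensitive:
                value = value.lower()
            parts.append(value)
        return tuple(parts)

    seen = {}
    for item in items:
        key = make_key(item)
        if keep == "last" or key not in seen:
            seen[key] = item  # "last" overwrites, "first" keeps the original
    return list(seen.values())

items = [
    {"url": "https://example.com/A"},
    {"url": "https://example.com/a"},
]
print(len(deduplicate(items, ["url"], case_insensitive=True)))  # 1
```

With case-insensitive matching the two URLs above collapse into one record; without it, both survive.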

Who is Dataset Deduplicator for?

🧑‍💻 Data engineers — Clean up datasets before loading into databases or data warehouses

🕷️ Web scraping professionals — Merge outputs from multiple scraper runs without duplicates

🤖 Automation builders — Add deduplication as a post-processing step in your workflows

📈 Analysts — Ensure data quality before running reports or feeding data into dashboards

🏢 Teams on Apify — Maintain a single source of truth across scheduled scraping jobs

Why use Dataset Deduplicator?

Running scrapers over time leads to duplicate data. Multiple runs collect overlapping results, scheduled jobs re-scrape the same pages, and merging datasets from different sources introduces repeated records. Duplicates waste storage, slow down analysis, and cause errors in downstream systems.

🛠️ No code required — Configure everything through Apify Console, no scripting needed

🔗 Works with any dataset — Compatible with output from any Apify actor or API-uploaded data

📐 Composite keys — Deduplicate on multiple fields simultaneously (e.g., name + city + email)

✂️ Whitespace handling — Automatically trims leading/trailing whitespace before comparison

💰 Pay per use — Only pay for the items you process, no monthly fees

🗂️ Custom output — Write results to the default run dataset or a specific named dataset

How much does it cost to deduplicate datasets?

Dataset Deduplicator uses Apify's pay-per-event pricing. You only pay for what you process.

Event           | Price                  | Description
Run started     | $0.005                 | One-time fee per run
Items processed | $0.002 per 1,000 items | Charged per batch of 1,000 items loaded from source datasets

Pricing examples:

Items processed | Cost
1,000           | $0.007
10,000          | $0.025
100,000         | $0.205
1,000,000       | $2.005
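In other words, the total cost of a run is the flat start fee plus a per-batch charge, which the figures above follow. A quick estimator (a sketch mirroring the published rates):

```python
import math

RUN_STARTED = 0.005   # one-time fee per run
PER_BATCH = 0.002     # charged per batch of 1,000 items
BATCH_SIZE = 1000

def estimate_cost(items_processed: int) -> float:
    """Estimate the run cost in USD for a given number of items."""
    batches = math.ceil(items_processed / BATCH_SIZE)
    return RUN_STARTED + batches * PER_BATCH

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} items -> ${estimate_cost(n):.3f}")
```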

The actor runs on minimal memory (256 MB default) and completes quickly, so platform compute costs are negligible.

How to deduplicate datasets step by step

1️⃣ Get your dataset IDs — Find them in Apify Console under each actor run's "Dataset" tab, or from the API response after a run completes.

2️⃣ Choose your dedup fields — Decide which fields uniquely identify a record. For URLs, use url. For contacts, try email or name + company.

3️⃣ Go to Dataset Deduplicator on Apify Store and click "Try for free".

4️⃣ Enter your configuration:

  • Paste one or more dataset IDs into the "Dataset IDs" field
  • Add the field names to use for dedup comparison
  • Choose whether to keep the first or last occurrence
  • Enable case-insensitive mode if needed

5️⃣ Click Start and wait for the run to complete.

6️⃣ Download results — Go to the "Dataset" tab in your run to preview, export as JSON/CSV/Excel, or connect via API.

Input parameters

Parameter       | Type             | Required | Default | Description
datasetIds      | array of strings | Yes      | -       | One or more Apify dataset IDs to load and deduplicate
fields          | array of strings | Yes      | -       | Field names used as the dedup key. Items matching on all fields are duplicates
keepOrder       | string           | No       | first   | Keep the first or last occurrence of each duplicate
caseInsensitive | boolean          | No       | false   | Ignore case when comparing field values
trimWhitespace  | boolean          | No       | true    | Trim leading/trailing whitespace before comparison
outputDatasetId | string           | No       | -       | Write results to a specific named dataset instead of the run's default dataset
maxItems        | integer          | No       | 1000000 | Maximum number of items to process across all source datasets

Input example

{
  "datasetIds": ["abc123", "def456", "ghi789"],
  "fields": ["url"],
  "keepOrder": "first",
  "caseInsensitive": false,
  "trimWhitespace": true,
  "maxItems": 50000
}

Output data

The actor outputs deduplicated items to the run's default dataset (or a named dataset if outputDatasetId is set). The output items have the exact same structure as the input — no fields are added or removed.

Additionally, the actor stores a stats summary in the key-value store under the key stats:

Field             | Type    | Description
totalLoaded       | integer | Total number of items loaded from all source datasets
uniqueKept        | integer | Number of unique items written to the output dataset
duplicatesRemoved | integer | Number of duplicate items that were discarded
datasetsProcessed | integer | Number of source datasets that were read

Output example

Deduplicated dataset items (same structure as input):

[
  {
    "url": "https://example.com/product/1",
    "title": "Widget A",
    "price": 29.99
  },
  {
    "url": "https://example.com/product/2",
    "title": "Widget B",
    "price": 49.99
  }
]

Stats (key-value store → stats):

{
  "totalLoaded": 15000,
  "uniqueKept": 12350,
  "duplicatesRemoved": 2650,
  "datasetsProcessed": 3
}

Tips and best practices

🔑 Use composite keys for accuracy — Deduplicating on a single field like name can be too aggressive. Combine fields like name + city or email + company for precise matching.

🔤 Enable case-insensitive mode for text fields — URLs and emails often differ only in case (HTTP://Example.com vs http://example.com). Turn on caseInsensitive to catch these.

📋 Merge datasets from scheduled runs — If you run a scraper daily, pass all recent dataset IDs to combine and deduplicate them into one clean dataset.

⏱️ Use keepOrder: last for freshness — When merging datasets over time, keep the latest occurrence to ensure you have the most up-to-date data for each item.

🧪 Test with a small maxItems first — When working with large datasets, start with maxItems: 1000 to verify your field selection before processing millions of items.

📁 Use outputDatasetId for persistent storage — Write to a named dataset to build up a clean, deduplicated master dataset across multiple runs.

Common use cases

🛒 E-commerce price monitoring — Merge daily product scrapes and keep only the latest price per product URL.

📧 Lead generation — Combine contact lists from multiple sources and remove duplicate emails.

🏠 Real estate listings — Deduplicate property listings collected from different pages or search queries.

📰 News aggregation — Merge articles from multiple scraper runs and remove duplicates by headline or URL.

🔍 SEO auditing — Combine crawl results and deduplicate by page URL to get a clean site map.

Integrations

Dataset Deduplicator fits naturally into Apify workflows as a post-processing step.

🔗 Chained actors — Use Apify's actor-to-actor calling to trigger deduplication after a scraper finishes.

📅 Scheduled cleanup — Set up a schedule to periodically deduplicate accumulated data.

🪝 Webhooks — Trigger deduplication via webhooks when a scraper run succeeds.

🔌 Zapier / Make — Connect with Zapier or Make to automate dedup in multi-step workflows.

📊 Data pipelines — Use as a cleaning step before pushing data to Google Sheets, Airtable, databases, or BI tools.

How to use Dataset Deduplicator via API

You can run Dataset Deduplicator programmatically using the Apify API. Here are examples in multiple languages.

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/dataset-dedup').call({
    datasetIds: ['dataset-id-1', 'dataset-id-2'],
    fields: ['url'],
    keepOrder: 'first',
    caseInsensitive: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} unique items`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/dataset-dedup").call(run_input={
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": True,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Got {len(items)} unique items")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~dataset-dedup/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": true
  }'

Use with Claude AI (MCP)

This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}

Example prompts

  • "I ran a scraper three times and got datasets abc123, def456, and ghi789 — can you merge and deduplicate them by URL?"
  • "Deduplicate this dataset by email address and keep the most recent record for each duplicate"
  • "I have two lead lists from different sources — combine them into one clean list with no duplicate emails"

Learn more in the Apify MCP documentation.

Is it legal to use Dataset Deduplicator?

Yes. Dataset Deduplicator operates exclusively on your own Apify datasets. It does not scrape websites, access third-party systems, or collect any external data. It is a data-cleaning utility that processes data you already own and have stored on the Apify platform.

There are no legal concerns with deduplicating your own data.

Frequently asked questions

Can I deduplicate a single dataset?

Yes. Pass a single dataset ID in the datasetIds array. The actor will remove duplicates within that one dataset.

What happens if a field is missing from some items?

Missing fields are treated as empty strings. Two items that both lack a field will match on that field. Combine multiple fields to avoid false matches.
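To illustrate the matching rule above, here is a simplified sketch of how a dedup key could be built (a hypothetical helper, not the actor's actual code):

```python
def dedup_key(item, fields):
    # Missing fields fall back to an empty string, so two items
    # that both lack a field produce the same key component.
    return tuple(str(item.get(f, "")).strip() for f in fields)

a = {"name": "Acme"}               # no "email" field
b = {"name": "Acme", "email": ""}  # empty "email" field
print(dedup_key(a, ["name", "email"]) == dedup_key(b, ["name", "email"]))  # True
```

Adding a second or third field to the key makes such accidental matches far less likely.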

Can I deduplicate on nested fields?

Currently, only top-level fields are supported. If you need to deduplicate on nested values, consider flattening your data first.
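A minimal flattening pass, run before deduplication, could look like the following (a hypothetical helper, not part of the actor):

```python
def flatten(item, parent_key="", sep="."):
    """Flatten nested dicts into dot-separated top-level keys."""
    flat = {}
    for key, value in item.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

item = {"url": "https://example.com", "contact": {"email": "a@b.com"}}
print(flatten(item))  # {'url': 'https://example.com', 'contact.email': 'a@b.com'}
```

After flattening, a nested value such as contact.email becomes a top-level field you can pass in fields.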

Does the actor modify the original datasets?

No. Source datasets are read-only. The actor writes deduplicated items to a new output dataset and never modifies the originals.

How many items can it handle?

The default limit is 1,000,000 items. You can adjust this with the maxItems parameter. The actor processes items in memory, so very large datasets may require higher memory allocation.

Can I use it with datasets from other actors?

Yes. Any valid Apify dataset ID works, regardless of which actor created it.

What format is the output?

The output dataset contains the exact same item structure as the input. No fields are added, removed, or transformed.

How do I find my dataset IDs?

In Apify Console, go to Storage > Datasets to see all your datasets with their IDs. You can also find the dataset ID in the "Dataset" tab of any actor run.

Legality

Dataset Deduplicator does not scrape websites or access any third-party systems; it only processes datasets you already own on the Apify platform, so the legal questions that surround web scraping do not apply to the deduplication step itself. If your source data was collected by scraping, the usual considerations apply to that collection: scraping publicly available data is generally legal in the US (HiQ Labs v. LinkedIn), but always review the target website's Terms of Service, and for personal data ensure compliance with GDPR, CCPA, and other applicable privacy regulations.

🔗 Dataset Deduplicator — This actor

Browse more tools from automation-lab on Apify Store.