Dataset Deduplicator

Merge and deduplicate Apify datasets by any field combination. Remove duplicates, keep first or last occurrence. Case-insensitive matching, whitespace trimming. Pay per 1K items processed.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: Stas Persiianenko (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 2 days ago

Merge and deduplicate items across one or more Apify datasets. Specify any combination of fields as the dedup key, choose whether to keep the first or last occurrence, and export clean, unique data to a new dataset — all through a simple no-code interface or the Apify API.

What does Dataset Deduplicator do?

Dataset Deduplicator is an Apify utility actor that removes duplicate items from your datasets. It reads items from one or more source datasets, compares them by the fields you specify, and outputs only unique records.

📦 Merge multiple datasets — Combine results from several scraper runs into one clean dataset

🔍 Flexible dedup keys — Use any field or combination of fields (URL, email, name + city, etc.)

🔄 First or last wins — Choose whether to keep the earliest or most recent occurrence of each duplicate

📊 Stats tracking — Get a summary of total items loaded, unique items kept, and duplicates removed

⚡ Fast batch processing — Processes items in batches of 1,000 for efficient memory usage

🎯 Case-insensitive matching — Optionally ignore case differences when comparing field values
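Conceptually, the matching behavior described above can be sketched in a few lines of Python. This is a simplified illustration of composite keys, first/last wins, trimming, and case-insensitive comparison, not the actor's actual source code:

```python
def deduplicate(items, fields, keep="first", case_insensitive=False, trim=True):
    """Keep one item per unique combination of the given fields."""
    def make_key(item):
        parts = []
        for f in fields:
            value = str(item.get(f, ""))  # missing fields compare as empty strings
            if trim:
                value = value.strip()
            if case_insensitive:
                value = value.lower()
            parts.append(value)
        return tuple(parts)

    seen = {}
    for item in items:
        key = make_key(item)
        if keep == "last" or key not in seen:
            seen[key] = item  # "last" overwrites, "first" keeps the original
    return list(seen.values())

items = [
    {"url": "https://example.com/A"},
    {"url": "https://example.com/a"},
]
print(len(deduplicate(items, ["url"], case_insensitive=True)))  # 1
```

With case-insensitive matching the two URLs above collapse into one record; without it, both survive.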

Who is Dataset Deduplicator for?

🧑‍💻 Data engineers — Clean up datasets before loading into databases or data warehouses

🕷️ Web scraping professionals — Merge outputs from multiple scraper runs without duplicates

🤖 Automation builders — Add deduplication as a post-processing step in your workflows

📈 Analysts — Ensure data quality before running reports or feeding data into dashboards

🏢 Teams on Apify — Maintain a single source of truth across scheduled scraping jobs

Why use Dataset Deduplicator?

Running scrapers over time leads to duplicate data. Multiple runs collect overlapping results, scheduled jobs re-scrape the same pages, and merging datasets from different sources introduces repeated records. Duplicates waste storage, slow down analysis, and cause errors in downstream systems.

🛠️ No code required — Configure everything through Apify Console, no scripting needed

🔗 Works with any dataset — Compatible with output from any Apify actor or API-uploaded data

📐 Composite keys — Deduplicate on multiple fields simultaneously (e.g., name + city + email)

✂️ Whitespace handling — Automatically trims leading/trailing whitespace before comparison

💰 Pay per use — Only pay for the items you process, no monthly fees

🗂️ Custom output — Write results to the default run dataset or a specific named dataset

How much does it cost to deduplicate datasets?

Dataset Deduplicator uses Apify's pay-per-event pricing. You only pay for what you process.

Event           | Price                  | Description
Run started     | $0.005                 | One-time fee per run
Items processed | $0.002 per 1,000 items | Charged per batch of 1,000 items loaded from source datasets

Pricing examples:

Items processed | Cost
1,000           | $0.007
10,000          | $0.025
100,000         | $0.205
1,000,000       | $2.005
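In other words, the total cost of a run is the flat start fee plus a per-batch charge, which the figures above follow. A quick estimator (a sketch mirroring the published rates):

```python
import math

RUN_STARTED = 0.005   # one-time fee per run
PER_BATCH = 0.002     # charged per batch of 1,000 items
BATCH_SIZE = 1000

def estimate_cost(items_processed: int) -> float:
    """Estimate the run cost in USD for a given number of items."""
    batches = math.ceil(items_processed / BATCH_SIZE)
    return RUN_STARTED + batches * PER_BATCH

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} items -> ${estimate_cost(n):.3f}")
```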

The actor runs on minimal memory (256 MB default) and completes quickly, so platform compute costs are negligible.

How to deduplicate datasets step by step

1️⃣ Get your dataset IDs — Find them in Apify Console under each actor run's "Dataset" tab, or from the API response after a run completes.

2️⃣ Choose your dedup fields — Decide which fields uniquely identify a record. For URLs, use url. For contacts, try email or name + company.

3️⃣ Go to Dataset Deduplicator on Apify Store and click "Try for free".

4️⃣ Enter your configuration:

  • Paste one or more dataset IDs into the "Dataset IDs" field
  • Add the field names to use for dedup comparison
  • Choose whether to keep the first or last occurrence
  • Enable case-insensitive mode if needed

5️⃣ Click Start and wait for the run to complete.

6️⃣ Download results — Go to the "Dataset" tab in your run to preview, export as JSON/CSV/Excel, or connect via API.

Input parameters

Parameter       | Type             | Required | Default | Description
datasetIds      | array of strings | Yes      | -       | One or more Apify dataset IDs to load and deduplicate
fields          | array of strings | Yes      | -       | Field names used as the dedup key. Items matching on all fields are duplicates
keepOrder       | string           | No       | first   | Keep the first or last occurrence of each duplicate
caseInsensitive | boolean          | No       | false   | Ignore case when comparing field values
trimWhitespace  | boolean          | No       | true    | Trim leading/trailing whitespace before comparison
outputDatasetId | string           | No       | -       | Write results to a specific named dataset instead of the run's default dataset
maxItems        | integer          | No       | 1000000 | Maximum number of items to process across all source datasets

Input example

{
  "datasetIds": ["abc123", "def456", "ghi789"],
  "fields": ["url"],
  "keepOrder": "first",
  "caseInsensitive": false,
  "trimWhitespace": true,
  "maxItems": 50000
}

Output data

The actor outputs deduplicated items to the run's default dataset (or a named dataset if outputDatasetId is set). The output items have the exact same structure as the input — no fields are added or removed.

Additionally, the actor stores a stats summary in the key-value store under the key stats:

Field             | Type    | Description
totalLoaded       | integer | Total number of items loaded from all source datasets
uniqueKept        | integer | Number of unique items written to the output dataset
duplicatesRemoved | integer | Number of duplicate items that were discarded
datasetsProcessed | integer | Number of source datasets that were read

Output example

Deduplicated dataset items (same structure as input):

[
  {
    "url": "https://example.com/product/1",
    "title": "Widget A",
    "price": 29.99
  },
  {
    "url": "https://example.com/product/2",
    "title": "Widget B",
    "price": 49.99
  }
]

Stats (key-value store → stats):

{
  "totalLoaded": 15000,
  "uniqueKept": 12350,
  "duplicatesRemoved": 2650,
  "datasetsProcessed": 3
}

Tips and best practices

🔑 Use composite keys for accuracy — Deduplicating on a single field like name can be too aggressive. Combine fields like name + city or email + company for precise matching.

🔤 Enable case-insensitive mode for text fields — URLs and emails often differ only in case (HTTP://Example.com vs http://example.com). Turn on caseInsensitive to catch these.

📋 Merge datasets from scheduled runs — If you run a scraper daily, pass all recent dataset IDs to combine and deduplicate them into one clean dataset.

⏱️ Use keepOrder: last for freshness — When merging datasets over time, keep the latest occurrence to ensure you have the most up-to-date data for each item.

🧪 Test with a small maxItems first — When working with large datasets, start with maxItems: 1000 to verify your field selection before processing millions of items.

📁 Use outputDatasetId for persistent storage — Write to a named dataset to build up a clean, deduplicated master dataset across multiple runs.

Common use cases

🛒 E-commerce price monitoring — Merge daily product scrapes and keep only the latest price per product URL.

📧 Lead generation — Combine contact lists from multiple sources and remove duplicate emails.

🏠 Real estate listings — Deduplicate property listings collected from different pages or search queries.

📰 News aggregation — Merge articles from multiple scraper runs and remove duplicates by headline or URL.

🔍 SEO auditing — Combine crawl results and deduplicate by page URL to get a clean site map.

Integrations

Dataset Deduplicator fits naturally into Apify workflows as a post-processing step.

🔗 Chained actors — Use Apify's actor-to-actor calling to trigger deduplication after a scraper finishes.

📅 Scheduled cleanup — Set up a schedule to periodically deduplicate accumulated data.

🪝 Webhooks — Trigger deduplication via webhooks when a scraper run succeeds.

🔌 Zapier / Make — Connect with Zapier or Make to automate dedup in multi-step workflows.

📊 Data pipelines — Use as a cleaning step before pushing data to Google Sheets, Airtable, databases, or BI tools.

How to use Dataset Deduplicator via API

You can run Dataset Deduplicator programmatically using the Apify API. Here are examples in multiple languages.

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/dataset-dedup').call({
    datasetIds: ['dataset-id-1', 'dataset-id-2'],
    fields: ['url'],
    keepOrder: 'first',
    caseInsensitive: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} unique items`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/dataset-dedup").call(run_input={
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": True,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Got {len(items)} unique items")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~dataset-dedup/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["dataset-id-1", "dataset-id-2"],
    "fields": ["url"],
    "keepOrder": "first",
    "caseInsensitive": true
  }'

Use with Claude AI (MCP)

This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}

Example prompts

  • "I ran a scraper three times and got datasets abc123, def456, and ghi789 — can you merge and deduplicate them by URL?"
  • "Deduplicate this dataset by email address and keep the most recent record for each duplicate"
  • "I have two lead lists from different sources — combine them into one clean list with no duplicate emails"

Learn more in the Apify MCP documentation.

Is it legal to use Dataset Deduplicator?

Yes. Dataset Deduplicator operates exclusively on your own Apify datasets. It does not scrape websites, access third-party systems, or collect any external data. It is a data-cleaning utility that processes data you already own and have stored on the Apify platform.

There are no legal concerns with deduplicating your own data.

Frequently asked questions

Can I deduplicate a single dataset?

Yes. Pass a single dataset ID in the datasetIds array. The actor will remove duplicates within that one dataset.

What happens if a field is missing from some items?

Missing fields are treated as empty strings. Two items that both lack a field will match on that field. Combine multiple fields to avoid false matches.
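To illustrate the matching rule above, here is a simplified sketch of how a dedup key could be built (a hypothetical helper, not the actor's actual code):

```python
def dedup_key(item, fields):
    # Missing fields fall back to an empty string, so two items
    # that both lack a field produce the same key component.
    return tuple(str(item.get(f, "")).strip() for f in fields)

a = {"name": "Acme"}               # no "email" field
b = {"name": "Acme", "email": ""}  # empty "email" field
print(dedup_key(a, ["name", "email"]) == dedup_key(b, ["name", "email"]))  # True
```

Adding a second or third field to the key makes such accidental matches far less likely.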

Can I deduplicate on nested fields?

Currently, only top-level fields are supported. If you need to deduplicate on nested values, consider flattening your data first.
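A minimal flattening pass, run before deduplication, could look like the following (a hypothetical helper, not part of the actor):

```python
def flatten(item, parent_key="", sep="."):
    """Flatten nested dicts into dot-separated top-level keys."""
    flat = {}
    for key, value in item.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

item = {"url": "https://example.com", "contact": {"email": "a@b.com"}}
print(flatten(item))  # {'url': 'https://example.com', 'contact.email': 'a@b.com'}
```

After flattening, a nested value such as contact.email becomes a top-level field you can pass in fields.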

Does the actor modify the original datasets?

No. Source datasets are read-only. The actor writes deduplicated items to a new output dataset and never modifies the originals.

How many items can it handle?

The default limit is 1,000,000 items. You can adjust this with the maxItems parameter. The actor processes items in memory, so very large datasets may require higher memory allocation.

Can I use it with datasets from other actors?

Yes. Any valid Apify dataset ID works, regardless of which actor created it.

What format is the output?

The output dataset contains the exact same item structure as the input. No fields are added, removed, or transformed.

How do I find my dataset IDs?

In Apify Console, go to Storage > Datasets to see all your datasets with their IDs. You can also find the dataset ID in the "Dataset" tab of any actor run.

Legality

Dataset Deduplicator does not scrape websites or access any third-party systems; it only processes datasets you already own on the Apify platform, so the legal questions that surround web scraping do not apply to the deduplication step itself. If your source data was collected by scraping, the usual considerations apply to that collection: scraping publicly available data is generally legal in the US (HiQ Labs v. LinkedIn), but always review the target website's Terms of Service, and for personal data ensure compliance with GDPR, CCPA, and other applicable privacy regulations.

🔗 Dataset Deduplicator — This actor

Browse more tools from automation-lab on Apify Store.