Pricing

from $0.01 / 1,000 processed items

Deduplicate, Merge & Transform Datasets

Merge multiple datasets, deduplicate items by a combination of fields, and apply custom transforms — powered by Polars.

Developer

DataCach

Last modified

2 days ago

What does the Deduplicate, Merge & Transform Datasets Actor do?

This Actor is a data-processing utility (not a web scraper) for the Apify platform. In a single run it can:

🗂️ Merge datasets — concatenate any number of Apify datasets into one stream, in the order you list them.
🧹 Deduplicate items — keep only the first unique occurrence based on the fields you choose (dedupe by url, by id, by a name + id combination, or any fields you like).
🔎 Find only new items — feed yesterday's dataset as a filter so you output only records you haven't seen before.
✏️ Transform records — filter, enrich, or restructure items with custom Python functions before and/or after deduplication.
📤 Output flexibly — return the unique items, the removed duplicates, or just the duplicate counts, to a dataset or key-value store.
⚡ Scale to 10M+ items — stream data in batches with near-constant memory for very large jobs.

Because it runs on Apify, you also get everything the platform adds on top: schedule recurring deduplication, trigger runs through the Apify API or integrations (Make, Zapier, n8n, webhooks), chain this Actor right after your scrapers, monitor runs, and download the clean data in JSON, CSV, Excel, or HTML.

Why use this dataset deduplication tool?

Duplicate and scattered records are one of the most common headaches in web scraping and data collection. This Actor solves them without writing a single loop:

Clean up scraper output — remove duplicate products, listings, profiles, or URLs that the same crawl returned more than once.
Merge results from parallel runs — consolidate dozens of scraper runs (or every run of an Actor/Task) into one deduplicated dataset.
Build a "new items only" feed — diff today's data against yesterday's so downstream steps only process fresh records.
Deduplicate before export or import — hand clean, unique data to your database, spreadsheet, CRM, or BI tool.
Audit duplicates fast — get unique/duplicate counts without exporting anything.

It's built for both no-code users (point-and-click merging and deduplication) and developers (optional Python transforms and fine-grained performance controls).

How to deduplicate and merge datasets (step-by-step)

1. Add your Dataset IDs. Paste one or more Apify dataset IDs into the Dataset IDs field, or point the Actor at an Actor/Task ID to automatically pull all of its run datasets.
2. Choose your deduplication fields. List the field(s) that make an item unique, e.g. ["url"] or ["name", "id"]. Leave it empty to merge without deduplicating.
3. Pick what to output. Keep the default Unique items, or switch to Duplicate items or Nothing (count only).
4. (Optional) Transform your data. Add a short Python transform to filter or reshape items before and/or after deduplication.
5. Click Run. When the run finishes, open the Output tab to preview and download your data in JSON, CSV, Excel, or HTML.

💡 Want to try it first? Run the Actor with an empty input. It processes a small built-in sample dataset so you can see exactly how merging and deduplication work before pointing it at your own data.

Input

Configure everything in the Input tab — no code required for the basics. The two most important fields are Dataset IDs and Deduplication fields; everything else is optional with sensible defaults. Click the Input tab for the full list of options and tooltips.

A minimal input looks like this:

{
  "datasetIds": ["dHZ7Xy9aBc...", "p9Kq2A4nMe..."],
  "fields": ["name", "id"],
  "output": "unique-items",
  "appendDatasetIds": true
}

Useful optional inputs include:

Pre/Post-deduplication transform (Python)
Actor or Task ID, with Only runs newer/older than date filters
Dataset IDs of filter items, to output only new records
Output destination (dataset or key-value store)
Deduplication mode
Limits and performance controls: offset, limit, fieldsToLoad, parallelLoads, parallelPushes, uploadBatchSize, batchSizeLoad

Output

Output is pass-through: the unique (or duplicate) items keep all the fields from your source datasets. You can download the resulting dataset as JSON, HTML, CSV, or Excel, or fetch it via the API. If you choose key-value store output, the result is written to the OUTPUT record instead; results larger than the 9 MB record limit are automatically split into OUTPUT, OUTPUT-2, OUTPUT-3, and so on.
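The record splitting described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: `split_into_records` is a hypothetical helper, "9 MB" is read as 9 MiB, and the Actor's real chunking logic is not documented here and may differ.

```python
import json

MAX_RECORD_BYTES = 9 * 1024 * 1024  # assumption: "9 MB" interpreted as 9 MiB


def split_into_records(items, max_bytes=MAX_RECORD_BYTES):
    """Group items into chunks whose JSON size stays under max_bytes.

    Returns a dict mapping record names (OUTPUT, OUTPUT-2, ...) to item lists.
    """
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for item in items:
        # +2 accounts for the ", " that json.dumps puts between list elements
        item_size = len(json.dumps(item).encode("utf-8")) + 2
        if current and size + item_size > max_bytes:
            chunks.append(current)
            current, size = [], 2
        current.append(item)
        size += item_size
    if current:
        chunks.append(current)
    names = ["OUTPUT"] + [f"OUTPUT-{i}" for i in range(2, len(chunks) + 1)]
    return dict(zip(names, chunks))


# Tiny limit so the split is visible with a handful of items
records = split_into_records([{"n": i} for i in range(5)], max_bytes=25)
print(list(records))
```

A single item larger than the limit still gets its own chunk in this sketch; how the Actor handles that case is not stated in this README.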

With fields set to ["name", "id"], the Actor keeps the first record for every unique name + id combination and drops the rest. A sample output with Append dataset IDs enabled:

[
  { "name": "Adidas Shoes", "id": 12345, "price": 100, "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Nike Air",     "id": 1,     "price": 50,  "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Puma Suede",   "id": 2,     "price": 30,  "__datasetId__": "p9Kq2A4nMe..." }
]

What data does this Actor output?

Field	Type	Description
(your fields)	any	Every field from your source datasets is preserved unchanged
`__datasetId__`	string	ID of the source dataset each item came from — added only when Append dataset IDs is enabled

How dataset deduplication works under the hood

For every item, the Actor builds a deduplication key by JSON-stringifying each of your chosen fields and joining the results with a separator. For example, with fields ["name", "id"], an item with name = "Adidas Shoes" and id = 12345 gets a key made of the JSON strings "Adidas Shoes" and 12345 joined by that separator. The separator keeps different value combinations from collapsing into the same key. Objects and arrays are compared deeply, with object keys sorted first, so the order of keys inside an object never causes false mismatches. The Actor keeps the first item it sees for each key and treats later matches as duplicates. With Treat null fields as unique enabled, items whose key field is null or missing are always kept.
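The key-building logic described above can be sketched in plain Python. This is a minimal illustration, not the Actor's source: the `SEPARATOR` value, the function names, and the choice to treat an item as "null" when any key field is null or missing are all assumptions.

```python
import json

# Assumed separator. json.dumps escapes control characters inside strings,
# so this raw character can never appear within a stringified field value.
SEPARATOR = "\x1f"


def dedup_key(item, fields):
    """JSON-stringify each chosen field (object keys sorted) and join them."""
    return SEPARATOR.join(json.dumps(item.get(f), sort_keys=True) for f in fields)


def deduplicate(items, fields, null_as_unique=False):
    """Keep the first item per key; return (unique, duplicates)."""
    seen, unique, duplicates = set(), [], []
    for item in items:
        # Assumption: any null/missing key field makes the item "unique"
        if null_as_unique and any(item.get(f) is None for f in fields):
            unique.append(item)
            continue
        key = dedup_key(item, fields)
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
            unique.append(item)
    return unique, duplicates


items = [
    {"name": "Adidas Shoes", "id": 12345, "price": 100},
    {"name": "Adidas Shoes", "id": 12345, "price": 90},
    {"name": "Nike Air", "id": 1, "price": 50},
]
unique, duplicates = deduplicate(items, ["name", "id"])
print(len(unique), len(duplicates))
```

Sorting object keys before stringifying is what makes `{"x": 1, "y": 2}` and `{"y": 2, "x": 1}` produce the same key.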

Two deduplication modes let you trade speed for memory:

Dedup after load (default) — loads everything, then deduplicates. Fastest for typical datasets.
Dedup as loading — deduplicates batch-by-batch with near-constant memory, ideal for 10M+ item jobs.
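The difference between the two modes is easiest to see in the streaming case. The sketch below shows one way batch-by-batch deduplication could work; the function names and batch size are hypothetical, and the key is simplified to a JSON list of the chosen fields. In this sketch only the current batch of items is held in memory, while the set of seen keys still grows with the number of unique items.

```python
import json
from itertools import islice


def batches(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def dedup_as_loading(item_stream, fields, batch_size=1000):
    """Yield the unique items of each batch as soon as it is processed."""
    seen = set()
    for batch in batches(item_stream, batch_size):
        fresh = []
        for item in batch:
            key = json.dumps([item.get(f) for f in fields], sort_keys=True)
            if key not in seen:
                seen.add(key)
                fresh.append(item)
        yield fresh  # could be pushed to the output right away


# A generator stands in for a large dataset that is never fully loaded
stream = ({"id": i % 3} for i in range(10))
result = [item for batch in dedup_as_loading(stream, ["id"], batch_size=4) for item in batch]
print(result)
```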

How much does it cost to deduplicate datasets?

The listed price starts from $0.01 per 1,000 processed items, so cost scales with how many items you process. Because the heavy lifting runs on Polars with parallel loading and pushing, throughput is high and deduplication jobs typically finish much faster than item-by-item loops. To estimate your cost, run it once on a representative dataset, check what the run was charged, then scale roughly linearly. New Apify accounts include free monthly usage credits, which are enough to test and run small-to-medium jobs at no cost.
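As a rough back-of-the-envelope check, the starting price shown on the listing ($0.01 per 1,000 processed items) can be turned into an estimate. This is only a sketch: `estimate_cost` is a hypothetical helper, the price is the "from" figure, and a real run may be charged differently.

```python
from decimal import Decimal

# Assumption: the listing's starting price applies to every processed item
PRICE_PER_1000_ITEMS = Decimal("0.01")


def estimate_cost(item_count, price_per_1000=PRICE_PER_1000_ITEMS):
    """Return an estimated cost in USD, rounded to cents."""
    cost = Decimal(item_count) / 1000 * price_per_1000
    return cost.quantize(Decimal("0.01"))


print(estimate_cost(250_000))     # 2.50
print(estimate_cost(10_000_000))  # 100.00
```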

Tips for faster, cheaper deduplication

Load only what you need. Set Fields to load to just your deduplication fields when you only need counts or a subset — less data to transfer and process.
Use dedup-as-loading for huge datasets. It streams in batches and keeps memory near-constant for 10M+ item jobs.
Tune concurrency. Raise Parallel loads / Parallel pushes for faster runs on large datasets, or lower them to be gentle on resources.
Audit before exporting. Use Output → Nothing to get unique/duplicate counts without writing a full dataset.

Pair this Actor with any Apify Store scraper to automatically deduplicate and merge its results as a final clean-up step.

FAQ

Can I deduplicate by more than one field?

Yes. List several fields (e.g. ["name", "id"]) and an item is considered a duplicate only when all of those field values match a previous item.

How do I keep only items I haven't seen before?

Put your previous/known datasets in Dataset IDs of filter items. Their keys seed the "already seen" set, so matching records in your new datasets are dropped and only new unique items are output. The filter items themselves are never added to the output.
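The "already seen" seeding can be sketched as follows. The helper names are hypothetical and the key is simplified to a JSON list of the chosen fields; the point is only that filter items populate the seen set without ever reaching the output.

```python
import json


def key_of(item, fields):
    """Simplified deduplication key: JSON list of the chosen field values."""
    return json.dumps([item.get(f) for f in fields], sort_keys=True)


def only_new(new_items, filter_items, fields):
    """Return items from new_items whose key is absent from filter_items."""
    seen = {key_of(item, fields) for item in filter_items}  # seed, never output
    fresh = []
    for item in new_items:
        key = key_of(item, fields)
        if key not in seen:
            seen.add(key)  # also removes repeats within the new data
            fresh.append(item)
    return fresh


yesterday = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
today = [
    {"url": "https://example.com/b"},
    {"url": "https://example.com/c"},
    {"url": "https://example.com/c"},
]
print(only_new(today, yesterday, ["url"]))
```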

Can I merge datasets without removing duplicates?

Yes. Leave the Deduplication fields empty and the Actor will simply merge all your datasets into one, in the order you list them.

Does it work with millions of items?

Yes. Switch Deduplication mode to dedup-as-loading to stream items in batches with near-constant memory, suitable for 10M+ records.

Can I transform the data while deduplicating?

Yes. Add a Pre-deduplication and/or Post-deduplication transform — short Python that receives the items list (plus your optional Custom input data) and returns a new list. Use it to filter, rename fields, or enrich records.
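A transform might look something like the sketch below. The exact function name and signature the Actor expects are not given in this README, so `transform(items, custom_input_data)` and the `minPrice` option are assumptions; check the tooltip in the Input tab for the real contract.

```python
def transform(items, custom_input_data=None):
    """Hypothetical transform: drop cheap items and rename `name` to `title`."""
    options = custom_input_data or {}
    min_price = options.get("minPrice", 0)  # assumed custom input key
    result = []
    for item in items:
        price = item.get("price") or 0  # treat missing/null price as 0
        if price < min_price:
            continue
        reshaped = dict(item)  # copy so the source item is not mutated
        if "name" in reshaped:
            reshaped["title"] = reshaped.pop("name")
        result.append(reshaped)
    return result


sample = [
    {"name": "Nike Air", "id": 1, "price": 50},
    {"name": "Puma Suede", "id": 2, "price": 30},
]
print(transform(sample, {"minPrice": 40}))
```

Returning a new list rather than editing items in place keeps the transform easy to test on a few sample records before a large run.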

Can I run it on a schedule or via the API?

Absolutely. Use Apify Schedules to run it automatically, or call it with the Apify API, CLI, or an integration (Make, Zapier, n8n). It's ideal as a final "clean-up" step after your scraping Actors.
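For the API route, a call from Python could look roughly like this. The sketch takes the client as a parameter so it can be exercised without network access; `run_dedup` is a hypothetical helper, and the token, Actor ID, and dataset IDs in the usage comment are placeholders you would replace with your own.

```python
def run_dedup(client, actor_id, dataset_ids, fields):
    """Start the Actor, wait for it to finish, and return the output items.

    `client` is expected to behave like `apify_client.ApifyClient`.
    """
    run_input = {
        "datasetIds": list(dataset_ids),
        "fields": list(fields),
        "output": "unique-items",
        "appendDatasetIds": True,
    }
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage (requires `pip install apify-client` and a real API token):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_APIFY_TOKEN>")
# items = run_dedup(client, "<username>/<actor-name>", ["<dataset-id>"], ["url"])
```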

What happens to fields that aren't deduplication keys?

They're preserved exactly as-is. Deduplication only uses your chosen fields to decide uniqueness; the full record is passed through unchanged.

Support, feedback, and disclaimers

Found a bug or have a feature request? Open an issue on the Actor's Issues tab — feedback is welcome and helps improve the tool. Need a custom data-processing workflow? Reach out for a tailored solution.

This Actor processes data you already own or are authorized to use in your Apify account; it does not scrape third-party websites. Always make sure your use of the underlying data complies with the relevant terms of service and data-protection regulations such as the GDPR.
