Deduplicate, Merge & Transform Datasets
Pricing
from $0.01 / 1,000 processed items
Merge multiple datasets, deduplicate items by a combination of fields, and apply custom transforms — powered by Polars.
Developer: DataCach (maintained by Community)
Merge, deduplicate, and transform Apify datasets in one fast pass. The Merge, Dedup & Transform Datasets Actor lets you combine multiple datasets into one, remove duplicate items by any combination of fields, and optionally reshape your data with Python — then save the clean result back to a dataset or key-value store. It's powered by Polars, so it deduplicates millions of records quickly while keeping memory usage low.
No code is required for basic merging and deduplication — just paste your dataset IDs, choose the field(s) that make an item unique, and click Run. New here? Leave the input empty to try it instantly on built-in sample data. Whether you run parallel scrapers that return overlapping results, append to the same dataset every day, or just need a single tidy export, this dataset deduplication tool turns a messy pile of records into unique, ready-to-use data.
What does the Merge, Dedup & Transform Datasets Actor do?
This Actor is a data-processing utility (not a web scraper) for the Apify platform. In a single run it can:
- 🗂️ Merge datasets — concatenate any number of Apify datasets into one stream, in the order you list them.
- 🧹 Deduplicate items: keep only the first unique occurrence based on the fields you choose (dedupe by url, by id, by a name + id combination, or any fields you like).
- 🔎 Find only new items: feed yesterday's dataset as a filter so you output only records you haven't seen before.
- ✏️ Transform records — filter, enrich, or restructure items with custom Python functions before and/or after deduplication.
- 📤 Output flexibly — return the unique items, the removed duplicates, or just the duplicate counts, to a dataset or key-value store.
- ⚡ Scale to 10M+ items — stream data in batches with near-constant memory for very large jobs.
Because it runs on Apify, you also get everything the platform adds on top: schedule recurring deduplication, trigger runs through the Apify API or integrations (Make, Zapier, n8n, webhooks), chain this Actor right after your scrapers, monitor runs, and download the clean data in JSON, CSV, Excel, or HTML.
Why use this dataset deduplication tool?
Duplicate and scattered records are one of the most common headaches in web scraping and data collection. This Actor solves them without writing a single loop:
- Clean up scraper output — remove duplicate products, listings, profiles, or URLs that the same crawl returned more than once.
- Merge results from parallel runs — consolidate dozens of scraper runs (or every run of an Actor/Task) into one deduplicated dataset.
- Build a "new items only" feed — diff today's data against yesterday's so downstream steps only process fresh records.
- Deduplicate before export or import — hand clean, unique data to your database, spreadsheet, CRM, or BI tool.
- Audit duplicates fast — get unique/duplicate counts without exporting anything.
It's built for both no-code users (point-and-click merging and deduplication) and developers (optional Python transforms and fine-grained performance controls).
How to deduplicate and merge datasets (step-by-step)
- Add your Dataset IDs. Paste one or more Apify dataset IDs into the Dataset IDs field — or point the Actor at an Actor/Task ID to automatically pull all of its run datasets.
- Choose your deduplication fields. List the field(s) that make an item unique, e.g. ["url"] or ["name", "id"]. Leave it empty to merge without deduplicating.
- Pick what to output. Keep the default Unique items, or switch to Duplicate items or Nothing (count only).
- (Optional) Transform your data. Add a Python expression to filter or reshape items before/after deduplication.
- Click Run. When the run finishes, open the Output tab to preview and download your data in JSON, CSV, Excel, or HTML.
💡 Want to try it first? Run the Actor with an empty input. It processes a small built-in sample dataset so you can see exactly how merging and deduplication work before pointing it at your own data.
Input
Configure everything in the Input tab — no code required for the basics. The two most important fields are Dataset IDs and Deduplication fields; everything else is optional with sensible defaults. Click the Input tab for the full list of options and tooltips.
A minimal input looks like this:
{
  "datasetIds": ["dHZ7Xy9aBc...", "p9Kq2A4nMe..."],
  "fields": ["name", "id"],
  "output": "unique-items",
  "appendDatasetIds": true
}
Useful optional inputs include Pre/Post-deduplication transform (Python), Actor or Task ID (with Only runs newer/older than date filters), Dataset IDs of filter items (to output only new records), Output destination (dataset or key-value store), Deduplication mode, and limits & performance controls (offset, limit, fieldsToLoad, parallelLoads, parallelPushes, uploadBatchSize, batchSizeLoad).
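For readers who start runs from code rather than the Input tab, the minimal input above can be passed through the Apify API. The sketch below assumes the official `apify-client` Python package and a version in which `call` returns the run as a dictionary. `build_run_input` and `run_dedup` are hypothetical helpers, and the Actor ID and token are placeholders you must supply yourself; neither is taken from this page. It also assumes the output destination is left as a dataset.

```python
from typing import Any


def build_run_input(
    dataset_ids: list[str],
    fields: list[str],
    output: str = "unique-items",
    append_dataset_ids: bool = True,
) -> dict[str, Any]:
    """Assemble the minimal input shown above (hypothetical helper)."""
    return {
        "datasetIds": list(dataset_ids),
        "fields": list(fields),  # an empty list merges without deduplicating
        "output": output,
        "appendDatasetIds": append_dataset_ids,
    }


def run_dedup(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return the output items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the run did not return any details")
    # Assumes dataset output, so results land in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_dedup(token, actor_id, build_run_input(["<dataset-id>"], ["name", "id"]))`, with your own token, Actor ID, and dataset IDs filled in.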
Output
Output is pass-through: the unique (or duplicate) items keep all the fields from your source datasets. You can download the resulting dataset in formats such as JSON, HTML, CSV, or Excel, or fetch it via the API. If you choose key-value store output, the result is written to the OUTPUT record instead.
For input fields ["name", "id"], the Actor keeps the first record for every unique name + id and drops the rest:
[
  { "name": "Adidas Shoes", "id": 12345, "price": 100, "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Nike Air", "id": 1, "price": 50, "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Puma Suede", "id": 2, "price": 30, "__datasetId__": "p9Kq2A4nMe..." }
]
What data does this Actor output?
| Field | Type | Description |
|---|---|---|
| (your fields) | any | Every field from your source datasets is preserved unchanged |
| __datasetId__ | string | ID of the source dataset each item came from (added only when Append dataset IDs is enabled) |
How dataset deduplication works under the hood
For every item, the Actor builds a deduplication key by JSON-stringifying each of your chosen fields and concatenating them — so ["name", "id"] with name = "Adidas Shoes" and id = 12345 becomes the key "Adidas Shoes"12345. Objects and arrays are deep-compared (object keys are sorted), so field order never causes false mismatches. The Actor keeps the first item it sees for each key and treats later matches as duplicates. With Treat null fields as unique enabled, items whose key field is null or missing are always kept.
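The key-building rule described above can be sketched in a few lines of plain Python. This is an illustration of the described behaviour, not the Actor's own code (which runs on Polars): `dedup_key` and `deduplicate` are hypothetical names, `json.dumps` stands in for JSON stringification, and the sketch assumes that one null or missing key field is enough to treat an item as unique when that option is on.

```python
import json
from typing import Any

Item = dict[str, Any]


def dedup_key(item: Item, fields: list[str]) -> str:
    """JSON-stringify each chosen field and concatenate the results."""
    # sort_keys=True makes nested objects compare equal regardless of key order.
    return "".join(json.dumps(item.get(field), sort_keys=True) for field in fields)


def deduplicate(items: list[Item], fields: list[str], null_as_unique: bool = False) -> list[Item]:
    """Keep the first item seen for each key; later matches are duplicates."""
    seen: set[str] = set()
    unique: list[Item] = []
    for item in items:
        # Assumption: any null or missing key field counts as "null" here.
        if null_as_unique and any(item.get(field) is None for field in fields):
            unique.append(item)
            continue
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


items = [
    {"name": "Adidas Shoes", "id": 12345, "price": 100},
    {"name": "Adidas Shoes", "id": 12345, "price": 90},
    {"name": "Nike Air", "id": 1, "price": 50},
]
print(dedup_key(items[0], ["name", "id"]))      # "Adidas Shoes"12345
print(len(deduplicate(items, ["name", "id"])))  # 2
```

The second record is dropped even though its price differs, because only the chosen fields take part in the key.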
Two deduplication modes let you trade speed for memory:
- Dedup after load (default) — loads everything, then deduplicates. Fastest for typical datasets.
- Dedup as loading — deduplicates batch-by-batch with near-constant memory, ideal for 10M+ item jobs.
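The difference between the two modes can be sketched as follows. This is a simplified pure-Python illustration, not the Actor's Polars implementation; the function names are hypothetical and `batches` stands in for whatever pages the Actor loads from the source datasets.

```python
import json
from typing import Any, Iterable, Iterator

Item = dict[str, Any]


def dedup_key(item: Item, fields: list[str]) -> str:
    return "".join(json.dumps(item.get(f), sort_keys=True) for f in fields)


def dedup_after_load(batches: Iterable[list[Item]], fields: list[str]) -> list[Item]:
    """Load every batch first, then deduplicate the full list in one pass."""
    all_items = [item for batch in batches for item in batch]
    seen: set[str] = set()
    unique: list[Item] = []
    for item in all_items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def dedup_as_loading(batches: Iterable[list[Item]], fields: list[str]) -> Iterator[list[Item]]:
    """Deduplicate batch by batch, yielding each cleaned batch immediately."""
    seen: set[str] = set()
    for batch in batches:
        fresh: list[Item] = []
        for item in batch:
            key = dedup_key(item, fields)
            if key not in seen:
                seen.add(key)
                fresh.append(item)
        # The caller can push this batch to the output and then discard it.
        yield fresh
```

Both functions return the same unique items in the same order. In this sketch the streaming variant never holds more than one batch of full records, although its set of seen keys still grows with the number of unique items.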
How much does it cost to deduplicate datasets?
This Actor is billed by Apify platform usage (the compute resources a run consumes), so cost scales with how many items you process. Because the heavy lifting runs on Polars with parallel loading and pushing, throughput is high and most deduplication jobs finish in a fraction of the time of naive loops. To estimate your cost, run it once on a representative dataset and check the run's usage, then scale roughly linearly. New Apify accounts include free monthly usage credits, which are enough to test and run small-to-medium jobs at no cost.
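The linear estimate described above amounts to a one-line calculation. The helper below is a hypothetical sketch, and the figures in the usage line are made-up placeholders rather than measured costs for this Actor.

```python
def estimate_cost(sample_cost_usd: float, sample_items: int, target_items: int) -> float:
    """Scale the cost of one representative run linearly to a larger job."""
    if sample_items <= 0:
        raise ValueError("sample_items must be positive")
    return sample_cost_usd / sample_items * target_items


# Placeholder numbers: a sample run over 100,000 items that cost $0.05,
# scaled to a job of 2,000,000 items.
print(round(estimate_cost(0.05, 100_000, 2_000_000), 2))
```

Treat the result as a rough guide only, since item size, transforms, and concurrency settings can all shift the real usage.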
Tips for faster, cheaper deduplication
- Load only what you need. Set Fields to load to just your deduplication fields when you only need counts or a subset — less data to transfer and process.
- Use dedup-as-loading for huge datasets. It streams in batches and keeps memory near-constant for 10M+ item jobs.
- Tune concurrency. Raise Parallel loads / Parallel pushes for faster runs on large datasets, or lower them to be gentle on resources.
- Audit before exporting. Use Output → Nothing to get unique/duplicate counts without writing a full dataset.
Related Actors
- Pair this Actor with any Apify Store scraper to automatically deduplicate and merge its results as a final clean-up step.
FAQ
Can I deduplicate by more than one field?
Yes. List several fields (e.g. ["name", "id"]) and an item is considered a duplicate only when all of those field values match a previous item.
How do I keep only items I haven't seen before?
Put your previous/known datasets in Dataset IDs of filter items. Their keys seed the "already seen" set, so matching records in your new datasets are dropped and only new unique items are output. The filter items themselves are never added to the output.
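The seeding behaviour described in this answer can be sketched like this. It is a plain-Python illustration under the assumption that keys are built the same way as for ordinary deduplication; `key_of` and `only_new_items` are hypothetical names.

```python
import json
from typing import Any

Item = dict[str, Any]


def key_of(item: Item, fields: list[str]) -> str:
    return "".join(json.dumps(item.get(f), sort_keys=True) for f in fields)


def only_new_items(filter_items: list[Item], new_items: list[Item], fields: list[str]) -> list[Item]:
    """Return items from new_items whose key was not seen in filter_items."""
    # Filter items only seed the "already seen" set; they are never output.
    seen = {key_of(item, fields) for item in filter_items}
    fresh: list[Item] = []
    for item in new_items:
        key = key_of(item, fields)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


yesterday = [{"url": "a"}, {"url": "b"}]
today = [{"url": "b"}, {"url": "c"}, {"url": "c"}]
print(only_new_items(yesterday, today, ["url"]))  # [{'url': 'c'}]
```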
Can I merge datasets without removing duplicates?
Yes. Leave the Deduplication fields empty and the Actor will simply merge all your datasets into one, in the order you list them.
Does it work with millions of items?
Yes. Switch Deduplication mode to dedup-as-loading to stream items in batches with near-constant memory, suitable for 10M+ records.
Can I transform the data while deduplicating?
Yes. Add a Pre-deduplication and/or Post-deduplication transform — short Python that receives the items list (plus your optional Custom input data) and returns a new list. Use it to filter, rename fields, or enrich records.
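As an illustration, a transform that filters and renames might look like the sketch below. The exact function name and signature the Actor expects are not stated on this page, so `transform`, its `custom_input_data` argument, and the `minPrice` key are all assumptions made for the example; check the Input tab tooltips for the real contract.

```python
from typing import Any, Optional

Item = dict[str, Any]


def transform(items: list[Item], custom_input_data: Optional[dict[str, Any]] = None) -> list[Item]:
    """Drop items below a minimum price and rename `name` to `title`."""
    # minPrice is a hypothetical key supplied through Custom input data.
    min_price = (custom_input_data or {}).get("minPrice", 0)
    result: list[Item] = []
    for item in items:
        if item.get("price", 0) < min_price:
            continue  # filter: skip items that are too cheap
        # Rename one field while keeping every other field unchanged.
        renamed = {("title" if key == "name" else key): value for key, value in item.items()}
        result.append(renamed)
    return result


items = [
    {"name": "Nike Air", "id": 1, "price": 50},
    {"name": "Puma Suede", "id": 2, "price": 30},
]
print(transform(items, {"minPrice": 40}))
# [{'title': 'Nike Air', 'id': 1, 'price': 50}]
```

Note that a rename applied before deduplication changes which field names are available as deduplication fields, so a transform like this one may fit better as a post-deduplication step.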
Can I run it on a schedule or via the API?
Absolutely. Use Apify Schedules to run it automatically, or call it with the Apify API, CLI, or an integration (Make, Zapier, n8n). It's ideal as a final "clean-up" step after your scraping Actors.
What happens to fields that aren't deduplication keys?
They're preserved exactly as-is. Deduplication only uses your chosen fields to decide uniqueness; the full record is passed through unchanged.
Support, feedback, and disclaimers
Found a bug or have a feature request? Open an issue on the Actor's Issues tab — feedback is welcome and helps improve the tool. Need a custom data-processing workflow? Reach out for a tailored solution.
This Actor processes data you already own or are authorized to use in your Apify account; it does not scrape third-party websites. Always make sure your use of the underlying data complies with the relevant terms of service and data-protection regulations such as the GDPR.