Merge, Dedup & Transform Datasets

Pricing

Pay per usage

The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Rating

5.0

(1)

Developer

Lukáš Křivka

Maintained by Community

Actor stats

Total users: 4.7K

Issues response: 18 hours

Last modified: 10 months ago

The ultimate dataset processing actor - merge, dedup & transform

A refined and optimized dataset processing actor for large-scale merging, deduplication, and transformation.

Why use this actor

Extremely fast data processing thanks to parallelized workloads (easily 20x faster than the default loading and pushing of datasets)
Reads from multiple datasets simultaneously, ideal for merging after scraping with many runs
Actor migration proof: all steps that can be persisted are persisted, so work is not repeated and no duplicate data is pushed
Dedup-as-loading mode allows near-constant memory processing even for huge datasets (think 10M+)
Deduplication supports a combination of many fields, including nested objects and arrays (those are serialized with JSON.stringify for a deep equality check)
Can store the output into key-value store records
Allows super fast blank runs that only count duplicates

Merging

You can provide more than one dataset. In that case, all items are merged into a single dataset or key-value store output. If you use the Dedup after load mode, the items keep the order of the datasets provided.
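As a rough illustration of the ordering behaviour described above, the sketch below concatenates several in-memory arrays in the order they are given. `mergeInOrder` is a hypothetical helper and plain arrays stand in for datasets; this is not the actor's own code.

```javascript
// Hypothetical helper: plain arrays stand in for Apify datasets.
function mergeInOrder(datasets) {
    // flat() concatenates the arrays one level deep,
    // keeping both the dataset order and the item order inside each dataset.
    return datasets.flat();
}

const merged = mergeInOrder([
    [{ id: 1 }, { id: 2 }], // first dataset
    [{ id: 3 }],            // second dataset
]);
console.log(merged.map((item) => item.id)); // [ 1, 2, 3 ]
```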

Deduplication

If you provide deduplication fields (optional), this actor will deduplicate the dataset items. The deduplication process checks the value of each field for equality and keeps only the first occurrence (the first item that has a given value for that field).

You can provide more than one field. In that case, a combined string of those fields is checked, e.g. "name": "Adidas Shoes", "id": "12345" is converted into "Adidas Shoes12345" for the comparison. Only items that match on all the provided fields are considered duplicates, so the more fields you add, the fewer duplicates will be found.

Fields that are objects or arrays are also deeply compared via JSON.stringify. Be aware that doing this for very large structures may hurt performance.
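The combined-key idea can be sketched in a few lines. This is a minimal illustration under assumptions, not the actor's internal implementation: `dedupByFields` is a hypothetical helper, and it works on an in-memory array rather than on a dataset.

```javascript
// Illustrative only: a hypothetical helper, not the actor's internal code.
function dedupByFields(items, fields) {
    const seen = new Set();
    return items.filter((item) => {
        // Build one combined key; objects and arrays are serialized
        // with JSON.stringify so they are compared by content.
        const key = fields
            .map((field) => {
                const value = item[field];
                return typeof value === 'object' && value !== null
                    ? JSON.stringify(value)
                    : String(value);
            })
            .join('');
        if (seen.has(key)) return false; // a later duplicate is dropped
        seen.add(key);
        return true; // the first item with this key is kept
    });
}

const products = [
    { name: 'Adidas Shoes', id: '12345', price: 50 },
    { name: 'Adidas Shoes', id: '12345', price: 55 },
    { name: 'Adidas Shoes', id: '67890', price: 50 },
];
console.log(dedupByFields(products, ['name', 'id']).length); // 2
console.log(dedupByFields(products, ['name']).length); // 1
```

Note that JSON.stringify is sensitive to property order, so in this sketch two objects with the same properties in a different order would not be treated as equal.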

Transformation

This actor lets you run arbitrary data transformations before and after deduplication via preDedupTransformFunction and postDedupTransformFunction.

These functions take an array of items and should return an array of items. You don't need to return the same number of items; you can filter some out or add new ones.

The second argument is an object with helper variables, which includes the Apify SDK reference (Apify).

The default transformation does nothing with the items:

(items, { Apify, customInputData }) => {
    return items;
}
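For comparison, a transformation that actually changes the data might look like the sketch below. It drops items without a URL and normalizes a name so that later deduplication treats near-identical values alike. The `url` and `name` item fields are hypothetical, and the helper object is ignored here.

```javascript
// Sketch of a transform; `url` and `name` are hypothetical item fields.
const preDedupTransformFunction = (items) => {
    return items
        // Filter out items that have no URL.
        .filter((item) => typeof item.url === 'string' && item.url.length > 0)
        // Return copies with a trimmed, lower-cased name.
        .map((item) => ({
            ...item,
            name: String(item.name ?? '').trim().toLowerCase(),
        }));
};

// Local dry run on sample data.
const cleaned = preDedupTransformFunction([
    { url: 'https://example.com/a', name: ' Shoes ' },
    { name: 'No URL' },
]);
console.log(cleaned.length, cleaned[0].name); // 1 shoes
```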

In dedup-as-loading mode, the function only has access to the items of the current batch. Because each batch comes from a single dataset, you can also access the datasetId and datasetOffset parameters.

(items, { Apify, datasetId, datasetOffset, customInputData }) => {
    return items;
}
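One possible use of these batch parameters is to tag each item with its origin. The sketch below assumes that datasetOffset is the position of the batch's first item within its dataset, which the text above does not state explicitly. `tagWithSource`, `sourceDatasetId` and `sourceIndex` are hypothetical names.

```javascript
// Sketch for dedup-as-loading mode.
// Assumption: datasetOffset is the index of the batch's first item in its dataset.
// `sourceDatasetId` and `sourceIndex` are hypothetical output fields.
const tagWithSource = (items, { datasetId, datasetOffset }) => {
    return items.map((item, index) => ({
        ...item,
        sourceDatasetId: datasetId,
        sourceIndex: datasetOffset + index,
    }));
};

// Local dry run with made-up batch parameters.
const tagged = tagWithSource(
    [{ title: 'first' }, { title: 'second' }],
    { datasetId: 'dataset-A', datasetOffset: 1000 },
);
console.log(tagged[1].sourceDatasetId, tagged[1].sourceIndex); // dataset-A 1001
```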

Input

A detailed input table with descriptions can be found on the actor's public page.

Changelog

Check the list of past updates here
