Merge, Dedup & Transform Datasets


lukaskrivka/dedup-datasets

The ultimate dataset processor. Extremely fast merging, deduplication & transformation, all in a single run.


Dataset IDs

datasetIds (array, optional)

Datasets that should be deduplicated and merged

Fields for deduplication

fields (array, optional)

Fields whose combination should be unique for the item to be considered unique. If none are provided, the actor does not perform deduplication.
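For example, a minimal sketch of an input that merges two datasets and treats items as duplicates when they share the same url and title (the dataset IDs and field names here are placeholders):

    {
        "datasetIds": ["dATaSeT1aBcDeF", "dATaSeT2gHiJkL"],
        "fields": ["url", "title"]
    }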

What to output

output (enum, optional)

What will be pushed to the dataset from this actor

Value options:

"unique-items": string"duplicate-items": string"nothing": string

Default value of this property is "unique-items"

Mode

mode (enum, optional)

How the loading and deduplication process will work.

Value options:

"dedup-after-load": string"dedup-as-loading": string

Default value of this property is "dedup-after-load"

Output dataset ID or name (optional)

outputDatasetId (string, optional)

Optionally, items can be pushed into a dataset of your choice. If you provide a dataset name that doesn't exist, a new named dataset will be created.

Limit fields to load

fieldsToLoad (array, optional)

You can choose to load only specific fields. Useful to speed up loading and reduce memory needs.

Pre dedup transform function

preDedupTransformFunction (string, optional)

Function to transform items before deduplication is applied. For 'dedup-after-load' mode this is done for all items at once. For 'dedup-as-loading' this is applied to each batch separately.
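As an illustration, assuming the function receives the array of loaded items and returns a transformed array (the url field is a placeholder), a pre-dedup transform could normalize URLs so that trivial variants deduplicate together:

    (items) => items.map((item) => ({
        ...item,
        // normalize the url so e.g. case and trailing slashes don't defeat dedup
        url: typeof item.url === 'string'
            ? item.url.toLowerCase().replace(/\/+$/, '')
            : item.url,
    }))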

Post dedup transform function

postDedupTransformFunction (string, optional)

Function to transform items after deduplication is applied. For 'dedup-after-load' mode this is done for all items at once. For 'dedup-as-loading' this is applied to each batch separately.
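Under the same assumed items-in, items-out signature, a post-dedup transform could clean up the surviving items, for instance dropping incomplete ones and keeping only the fields the output needs (price, url and title are illustrative field names):

    (items) => items
        // drop items that have no price after deduplication
        .filter((item) => item.price != null)
        // keep only the fields the output needs
        .map(({ url, title, price }) => ({ url, title, price }))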

Actor or Task ID (or name)

actorOrTaskId (string, optional)

Use Actor or Task ID (e.g. nwua9Gu5YrADL7ZDj) or full name (e.g. apify/instagram-scraper).

Only runs newer than

onlyRunsNewerThan (string, optional)

Use the YYYY-MM-DD date format, or YYYY-MM-DDTHH:mm:ss to include a time.

Only runs older than

onlyRunsOlderThan (string, optional)

Use the YYYY-MM-DD date format, or YYYY-MM-DDTHH:mm:ss to include a time.
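Together with actorOrTaskId, these let you dedupe all runs of one scraper within a time window. A sketch (the dates and the dedup field are placeholders; the Actor name comes from the example above):

    {
        "actorOrTaskId": "apify/instagram-scraper",
        "onlyRunsNewerThan": "2023-01-01",
        "onlyRunsOlderThan": "2023-02-01T12:00:00",
        "fields": ["id"]
    }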

Where to output

outputTo (enum, optional)

Output can go either to a single dataset, or be split into key-value store records depending on the upload batch size. Key-value store upload is much faster, but the data ends up spread across many files.

Value options:

"dataset": string"key-value-store": string

Default value of this property is "dataset"

Parallel loads

parallelLoads (integer, optional)

Datasets can be loaded in parallel batches to speed things up if needed.

Default value of this property is 10

Parallel pushes

parallelPushes (integer, optional)

Deduped data can be pushed in parallel batches to speed things up if needed. If you want the data to be in the exact same order, you need to set this to 1.

Default value of this property is 5

Upload batch size

uploadBatchSize (integer, optional)

How many items to upload in a single pushData call. Useful to avoid overloading the Apify API. Only relevant for dataset output.

Default value of this property is 500

Download batch size

batchSizeLoad (integer, optional)

How many items it will load in a single batch.

Default value of this property is 50000

Offset (how many items to skip from start)

offset (integer, optional)

By default, no items are skipped, which is the same as setting the offset to 0. With multiple datasets, the offset is applied to the sum of their item counts, which is rarely useful.

Limit (how many items to load)

limit (integer, optional)

By default, the number of loaded items is not limited.

Verbose log

verboseLog (boolean, optional)

Good for smaller runs. Large runs might run out of log space.

Default value of this property is false

Null fields are unique

nullAsUnique (boolean, optional)

Treat items whose dedup fields are null (or missing) as always unique.

Default value of this property is false

Dataset IDs for just deduping

datasetIdsOfFilterItems (array, optional)

Items from these datasets are used only as a deduplication filter for the main datasets. They are loaded first; the main datasets' items are then compared against them for uniqueness and pushed.
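For example, to push only the items of a new scrape that did not already appear in a previous one (the dataset IDs are placeholders):

    {
        "datasetIds": ["newScrapeDatasetId"],
        "datasetIdsOfFilterItems": ["previousScrapeDatasetId"],
        "fields": ["url"]
    }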

Custom input data

customInputData (object, optional)

You can pass custom data as a JSON object to be accessible in the transform functions as part of the 2nd parameter object.
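A sketch of how this could look, assuming the second parameter exposes the data under a customInputData key mirroring the input field name (minPrice and price are placeholders):

    // with input: "customInputData": { "minPrice": 100 }
    (items, { customInputData }) => items.filter(
        (item) => item.price >= customInputData.minPrice,
    )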

Append dataset IDs to items

appendDatasetIds (boolean, optional)

Useful for transform functions. Each item will get a __datasetId__ field containing the ID of the dataset it came from.

Default value of this property is false
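With this enabled, a transform function can branch on where each item came from; a sketch (the dataset ID and the source labels are placeholders):

    (items) => items.map((item) => ({
        ...item,
        // __datasetId__ is appended by the Actor when appendDatasetIds is true
        source: item.__datasetId__ === 'dATaSeT1aBcDeF' ? 'first-scrape' : 'other-scrape',
    }))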

Developer
Maintained by Apify

Actor Metrics

  • 150 monthly users

  • 54 stars

  • 98% runs succeeded

  • 3.6 days response time

  • Created in Apr 2020

  • Modified 2 months ago