Merge, Dedup & Transform Datasets
The ultimate dataset processor. Extremely fast merging, deduplication, and transformation, all in a single run.
Fields for deduplication
fields
array (optional)
Fields whose combination should be unique for the item to be considered unique. If none are provided, the actor does not perform deduplication.
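A minimal sketch of what "unique by a combination of fields" can mean in practice. This is an illustration in Python, not the Actor's own code; the helper name dedup_by_fields and the choice to keep the first occurrence of each combination are assumptions.

```python
import json
from typing import Any, Iterable


def dedup_by_fields(items: Iterable[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep one item per distinct combination of the given fields."""
    if not fields:
        # As documented above: with no fields, no deduplication is performed.
        return list(items)

    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        # Serialize the field values so unhashable values (lists, dicts) also work.
        key = json.dumps([item.get(field) for field in fields], default=str)
        if key not in seen:
            # Assumption: the first occurrence wins.
            seen.add(key)
            unique.append(item)
    return unique
```

With fields set to ["url", "id"], two items count as duplicates only when both values match.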
What to output
output
Enum (optional)
What this Actor pushes to the dataset.
Value options:
"unique-items": string
"duplicate-items": string
"nothing": string
Default value of this property is "unique-items"
Mode
mode
Enum (optional)
How the loading and deduplication process will work.
Value options:
"dedup-after-load": string
"dedup-as-loading": string
Default value of this property is "dedup-after-load"
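For orientation, here is a hedged sketch of building an input object from the three options described so far. The key names and allowed values are taken from this page; the helper build_input and the example field "url" are hypothetical.

```python
import json

# Allowed values copied from the option lists on this page.
OUTPUT_OPTIONS = {"unique-items", "duplicate-items", "nothing"}
MODE_OPTIONS = {"dedup-after-load", "dedup-as-loading"}


def build_input(fields, output="unique-items", mode="dedup-after-load"):
    """Return an input dict, rejecting values outside the documented enums."""
    if output not in OUTPUT_OPTIONS:
        raise ValueError(f"unknown output option: {output!r}")
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"fields": list(fields), "output": output, "mode": mode}


# "url" is a made-up field name used only for illustration.
example_input = build_input(["url"], output="duplicate-items")
print(json.dumps(example_input, indent=2))
```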
Output dataset ID or name (optional)
outputDatasetId
string (optional)
Optionally push the results into a dataset of your choice. If you provide a dataset name that doesn't exist, a new named dataset will be created.
Limit fields to load
fieldsToLoad
array (optional)
Load only the fields you list here. Useful for speeding up loading and reducing memory needs.
Pre dedup transform function
preDedupTransformFunction
string (optional)
Function to transform items before deduplication is applied. For 'dedup-after-load' mode this is done for all items at once. For 'dedup-as-loading' this is applied to each batch separately.
Post dedup transform function
postDedupTransformFunction
string (optional)
Function to transform items after deduplication is applied. For 'dedup-after-load' mode this is done for all items at once. For 'dedup-as-loading' this is applied to each batch separately.
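The difference between the two modes matters mostly for transforms that look at more than one item. The sketch below models both flows in Python under stated assumptions: the function names are hypothetical, the real transform functions are not written in Python, and sharing one set of seen keys across batches in 'dedup-as-loading' is an assumption rather than something this page states.

```python
import json
from typing import Callable

Items = list[dict]
Transform = Callable[[Items], Items]


def dedup(items: Items, fields: list[str], seen: set[str]) -> Items:
    """Return items whose field combination is not yet in `seen`, updating `seen`."""
    unique = []
    for item in items:
        key = json.dumps([item.get(f) for f in fields], default=str)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def run_after_load(batches: list[Items], fields: list[str], pre: Transform, post: Transform) -> Items:
    # Everything is loaded first, so each transform sees all items at once.
    all_items = [item for batch in batches for item in batch]
    return post(dedup(pre(all_items), fields, set()))


def run_as_loading(batches: list[Items], fields: list[str], pre: Transform, post: Transform) -> Items:
    # Each transform sees only one batch at a time.
    seen: set[str] = set()  # assumption: dedup state is shared across batches
    output: Items = []
    for batch in batches:
        output.extend(post(dedup(pre(batch), fields, seen)))
    return output
```

A per-item transform (for example, lowercasing a URL) gives the same result in both flows. A transform that depends on the whole collection, such as keeping only the first item, does not.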
Actor or Task ID (or name)
actorOrTaskId
string (optional)
Use Actor or Task ID (e.g. nwua9Gu5YrADL7ZDj) or full name (e.g. apify/instagram-scraper).
Only runs newer than
onlyRunsNewerThan
string (optional)
Use a date format of either YYYY-MM-DD or with time YYYY-MM-DDTHH:mm:ss.
Only runs older than
onlyRunsOlderThan
string (optional)
Use a date format of either YYYY-MM-DD or with time YYYY-MM-DDTHH:mm:ss.
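The two accepted date formats can be parsed as in the following sketch. The helper names are hypothetical, and several details are assumptions because this page does not specify them: the values are treated as timezone-naive, both bounds are treated as exclusive, and the run timestamp being compared is simply passed in by the caller.

```python
from datetime import datetime
from typing import Optional


def parse_run_date(value: str) -> datetime:
    """Parse YYYY-MM-DD or YYYY-MM-DDTHH:mm:ss into a naive datetime."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {value!r}")


def in_window(run_time: datetime,
              newer_than: Optional[str] = None,
              older_than: Optional[str] = None) -> bool:
    """Return True if run_time lies strictly between the two optional bounds."""
    if newer_than is not None and run_time <= parse_run_date(newer_than):
        return False
    if older_than is not None and run_time >= parse_run_date(older_than):
        return False
    return True
```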
Where to output
outputTo
Enum (optional)
Output either to a single dataset or split the data into key-value store (KV) records, sized according to the upload batch size. KV upload is much faster, but the data ends up in many files.
Value options:
"dataset": string
"key-value-store": string
Default value of this property is "dataset"
Parallel loads
parallelLoads
integer (optional)
Datasets can be loaded in parallel batches to speed things up if needed.
Default value of this property is 10
Parallel pushes
parallelPushes
integer (optional)
Deduped data can be pushed in parallel batches to speed things up if needed. If you want the data to be in the exact same order, you need to set this to 1.
Default value of this property is 5
Upload batch size
uploadBatchSize
integer (optional)
How many items to upload in one pushData call. Useful for avoiding overloading the Apify API. Only relevant when outputting to a dataset.
Default value of this property is 500
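Splitting the output into upload batches amounts to simple chunking. The sketch below shows the idea in Python; the function name chunked is hypothetical and the actual upload call is left out.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence, batch_size: int = 500) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With the default of 500, a result of 1,200 items would go out as three calls of 500, 500, and 200 items.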
Download batch size
batchSizeLoad
integer (optional)
How many items it will load in a single batch.
Default value of this property is 50000
Offset (how many items to skip from start)
offset
integer (optional)
By default no items are skipped, which is the same as setting offset to 0. With multiple datasets, the offset is counted against the combined item count of all of them, which is rarely useful.
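One reading of how an offset behaves across several datasets is sketched below: the offset is consumed against the datasets in order, as if they were concatenated. This interpretation and the function name apply_offset are assumptions, not confirmed behaviour.

```python
def apply_offset(datasets: list[list], offset: int = 0) -> list:
    """Skip the first `offset` items of the datasets, treated as one sequence."""
    remaining = offset
    result: list = []
    for items in datasets:
        if remaining >= len(items):
            # The whole dataset falls inside the skipped region.
            remaining -= len(items)
            continue
        result.extend(items[remaining:])
        remaining = 0
    return result
```

Under this reading, an offset larger than the first dataset skips it entirely and starts partway into the second, which is why the setting is mainly practical for a single dataset.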
Verbose log
verboseLog
boolean (optional)
Good for smaller runs. Large runs might run out of log space.
Default value of this property is false
Null fields are unique
nullAsUnique
boolean (optional)
Treat items whose deduplication fields are null (or missing) as always unique.
Default value of this property is false
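The effect of this flag can be sketched as follows. This is an illustration, not the Actor's code: the function name is hypothetical, and treating an item as unique when any one of its deduplication fields is null or missing (rather than all of them) is an assumption.

```python
import json


def dedup_with_nulls(items: list[dict], fields: list[str], null_as_unique: bool = False) -> list[dict]:
    """Deduplicate by fields, optionally letting null/missing values bypass the check."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        values = [item.get(field) for field in fields]  # missing fields become None
        if null_as_unique and any(value is None for value in values):
            # Assumption: a single null field is enough to skip deduplication.
            unique.append(item)
            continue
        key = json.dumps(values, default=str)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

With the flag off, two items that both lack the field collapse into one; with it on, both are kept.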
Dataset IDs for just deduping
datasetIdsOfFilterItems
array (optional)
Items from these datasets are used only as a deduplication filter for the main datasets. They are loaded first; items from the main datasets are then checked against them for uniqueness before being pushed.
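The filter-dataset behaviour described above can be pictured as seeding the set of already-seen keys before the main items are processed. A minimal sketch, assuming a hypothetical helper name and that filter items themselves are never part of the output:

```python
import json


def dedup_against_filter(main_items: list[dict], filter_items: list[dict], fields: list[str]) -> list[dict]:
    """Return main items that are unique among themselves and absent from filter_items."""

    def key_of(item: dict) -> str:
        return json.dumps([item.get(field) for field in fields], default=str)

    # Load the filter items first; they only populate the seen set.
    seen = {key_of(item) for item in filter_items}

    unique: list[dict] = []
    for item in main_items:
        key = key_of(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

A typical use would be passing the results of earlier runs as the filter so that only new items are pushed.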
Actor Metrics
150 monthly users
54 stars
98% runs succeeded
3.6 days response time
Created in Apr 2020
Modified 2 months ago