Merge, Dedup & Transform Datasets
The ultimate dataset processor. Extremely fast merging, deduplication & transformation, all in a single run.
Hi! This is more of a feature request (or maybe a request for help, in case it is already possible): performing merging and deduplication into a single dataset.
Let me illustrate this with an example. Say I have a schedule that runs a defined Task every 2 minutes. Each run of the task creates a small, unnamed dataset.
Let's say I want to merge all these small datasets into a single, well-defined, named dataset that I created for this purpose. Its ID will be set in "outputDatasetId".
Using your beloved tool, I pass {{resource.defaultDatasetId}} to the input, as well as the big, named dataset's ID. I also pass a "url" parameter in "fields" to perform deduplication, and I use webhooks to run this every time a run succeeds.

This performs deduplication by contrasting the new, small dataset against the old, big dataset, and the difference is then pushed into the old dataset. The problem is that this creates duplicates in the old dataset again and again. Imagine the big, named dataset and the new, small dataset have no items in common: the "difference" dataset is then the union of both sets. Pushing that into the big dataset duplicates every item that already exists there, on top of adding the small number of new items. I hope I am getting my point across, but feel free to ask.
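To make this concrete, here is roughly the payload template my webhook sends (a sketch; "BIG_DATASET_ID" is a placeholder for my named dataset's ID, and the field names are my best reading of the Actor's input schema):

  {
    "datasetIds": ["{{resource.defaultDatasetId}}", "BIG_DATASET_ID"],
    "fields": ["url"],
    "outputDatasetId": "BIG_DATASET_ID"
  }

Since both datasets are loaded and deduped together, every item that exists only in the big dataset survives the dedup and gets appended to the big dataset a second time.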
Essentially, this would not happen if we were able to either:
- perform deduplication against the outputDa... [trimmed]
Hi,
Thanks for reaching out. We'll look into it and let you know what can be done.
Hello,
I'm sorry we forgot to reply; we actually discussed this issue internally.
The Actor already has an input field, "Dataset IDs for just deduping", where you can put the big dataset's ID. That way you only get the new entries that are not yet in the big dataset (matched by the same key, e.g. "url").
You can also set the big dataset's ID as "outputDatasetId" to grow it automatically with only the new entries. If you need a separate output dataset, you will need to attach another webhook to load that dataset and push it into the big one.
For example, for a bit more complex workflow, you can check the combination of these two Actors:
hey! thanks for your answer.
it seems like my original question got trimmed, which is quite annoying since key parts of the explanation were there.
can you show me an example of a json configuration where you use said "Dataset IDs for just deduping"? i cannot seem to find it in the json editor here in apify.
also, since i am building a single serverless app, i am mainly leveraging apify's endpoints rather than the SDK, which is sometimes a tad more difficult to work with in my particular situation.
thanks much for your time!
Hello,
In the JSON input it is called:

  "datasetIdsOfFilterItems": []

Happy to help!
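For example, a full input could look like this (a sketch; "BIG_DATASET_ID" stands for your named dataset's ID, and "datasetIds" is the input field for the datasets to load and process):

  {
    "datasetIds": ["{{resource.defaultDatasetId}}"],
    "fields": ["url"],
    "datasetIdsOfFilterItems": ["BIG_DATASET_ID"],
    "outputDatasetId": "BIG_DATASET_ID"
  }

With this, only the small run dataset is loaded, any item whose "url" already appears in the big dataset is filtered out, and just the genuinely new items are appended to the big dataset. Since you are calling the API directly, you can send this JSON as the body of the Run Actor endpoint, POST https://api.apify.com/v2/acts/<actor-id>/runs?token=<token>.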
Actor Metrics
- 180 monthly users
- 55 stars
- 96% runs succeeded
- 1.1 days response time
- Created in Apr 2020
- Modified 5 days ago