Merge, Dedup & Transform Datasets

lukaskrivka/dedup-datasets
The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Merge and Dedup to a single dataset

Closed

graphext opened this issue
5 months ago

Hi! This is more of a feature request (or maybe help in case it is already possible) to perform merging and deduplication to a single dataset.

Lemme illustrate this with an example. Say I have a schedule that runs a defined Task every 2 minutes. This schedule will run the task and create a small, unnamed dataset every 2 minutes.

Let's say that I want to merge all these small datasets that get created into one single, well-defined, named dataset that I created myself for this purpose. Its ID will be set in the "outputDatasetId".

Using your beloved tool, I pass the {{resource.defaultDatasetId}} to the input, AS WELL as the big, named dataset. I also pass "url" in the "fields", to perform deduplication. I use webhooks to run this every time the run is successful.

This will perform deduplication, contrasting the small, new dataset against the old, big dataset. This difference will then be pushed into the old dataset. The problem is that this creates duplicates again and again in the old dataset. Imagine that the big, named dataset and the new, small dataset have no items in common. That means the "difference" dataset will be the union of both sets. This will be pushed to the big dataset, creating duplicates of all already existing items, as well as adding the small number of new items. I hope I am getting my point across, but feel free to ask.
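To make the problem concrete, here is a minimal sketch in plain Python that simulates the workflow described above with in-memory lists. It does not call Apify at all; `dedup_by_key` and the sample items are hypothetical stand-ins for what the Actor does with the "url" field, and it assumes the output is appended to the big dataset (datasets are append-only).

```python
def dedup_by_key(items, key):
    """Keep the first item seen for each value of `key`."""
    seen = set()
    unique = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            unique.append(item)
    return unique


# Hypothetical contents: the big, named dataset and one small scheduled run.
big = [{"url": "a"}, {"url": "b"}]
small = [{"url": "c"}]

# Both datasets are passed as regular inputs and deduplicated together.
# With no items in common, the result is the union of both.
deduped = dedup_by_key(big + small, "url")

# The whole deduplicated result is then appended to the big dataset,
# so every item that was already there ends up in it twice.
big_after = big + deduped
urls = [item["url"] for item in big_after]
print(urls)  # ['a', 'b', 'a', 'b', 'c']
```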

Essentially, this would not happen if we were able to either:

  • perform deduplication against the outputDa... [trimmed]
paja

Hi,

thanks for reaching out, we'll look into it and let you know what can be done.

lukaskrivka

Hello,

I'm sorry we forgot to reply; we did discuss this issue internally.

The Actor already has an input called "Dataset IDs for just deduping" where you can put the big dataset ID. Items from that dataset are only used for filtering, so you only get new entries that are not in the big dataset (matched by the same key, like "url").

You can also add the big dataset ID as outputDatasetId to automatically grow it with only the new entries. If you need a separate output dataset, you will need to attach another run with a webhook to load that dataset and push it into the big one.
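As a rough illustration of the behaviour described above, here is a small plain-Python simulation. It is only a sketch of the idea, not the Actor's actual implementation: `new_items_only` and the sample items are hypothetical, and the real Actor may differ in details such as which duplicate it keeps.

```python
def new_items_only(new_items, filter_items, key):
    """Return items from `new_items` whose `key` value is neither in
    `filter_items` nor already seen earlier in `new_items`."""
    seen = {item[key] for item in filter_items}
    result = []
    for item in new_items:
        if item[key] not in seen:
            seen.add(item[key])
            result.append(item)
    return result


# Hypothetical contents: the big dataset is used only as a filter.
big = [{"url": "a"}, {"url": "b"}]
small = [{"url": "b"}, {"url": "c"}, {"url": "c"}]

# Only entries that are not yet in the big dataset come out.
to_push = new_items_only(small, big, "url")

# Appending just those entries grows the big dataset without duplicates.
big_after = big + to_push
print([item["url"] for item in big_after])  # ['a', 'b', 'c']
```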

For an example of a slightly more complex workflow, you can check the combination of these 2 Actors:


graphext

2 months ago

hey! thanks for your answer.

it seems like my original question got trimmed, which is quite annoying since key parts of the explanation were there.

can you show me an example of a json configuration where you use said "Dataset IDs for just deduping"? i cannot seem to find it in the json editor here in apify.

also, since i am building a single serverless app, i am mainly using apify's endpoints rather than their SDK, which is sometimes a tad more difficult to work with in my particular situation.

thanks much for your time!

lukaskrivka

Hello,

It is called

"datasetIdsOfFilterItems": []
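For anyone calling the API endpoints directly instead of the SDK, a configuration along these lines might work. This is a hedged sketch, not a verified setup: "datasetIdsOfFilterItems", "outputDatasetId" and "fields" come from this thread, while "datasetIds" as the name of the regular input, the list shape of "fields", and the placeholder IDs are assumptions to check against the Actor's input schema. The request is only built here, not sent.

```python
import json
import urllib.request

# Actor ID in the form the API expects in the URL path (username~actor-name).
ACTOR_ID = "lukaskrivka~dedup-datasets"


def build_input(new_dataset_id, big_dataset_id):
    """Build the Actor input as a dict (serialized to JSON when sent)."""
    return {
        # Assumption: name of the regular input for datasets to process.
        "datasetIds": [new_dataset_id],
        # Assumption: "fields" takes a list of field names to dedup by.
        "fields": ["url"],
        # From the thread: the big dataset is used only for filtering.
        "datasetIdsOfFilterItems": [big_dataset_id],
        # From the thread: only the new entries are pushed into the big dataset.
        "outputDatasetId": big_dataset_id,
    }


def build_run_request(actor_input, token):
    """Prepare (but do not send) a POST request that starts an Actor run."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder IDs and token, for illustration only.
actor_input = build_input("NEW_SMALL_DATASET_ID", "BIG_NAMED_DATASET_ID")
request = build_run_request(actor_input, "YOUR_API_TOKEN")
print(json.dumps(actor_input, indent=2))
# To actually start the run: urllib.request.urlopen(request)
```

In a webhook setup, the new dataset ID would come from the finished run's payload rather than being hard-coded.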

Happy to help

Developer
Maintained by Apify

Actor Metrics

  • 174 monthly users

  • 61 stars

  • 98% runs succeeded

  • 0.57 hours response time

  • Created in Apr 2020

  • Modified a month ago