Loads data from an Apify Dataset, Key Value store, or arbitrary JSON and checks each item against all others for duplicate values of the given field.
The check takes from a few seconds up to a few minutes for larger datasets.
Produces a report so you know exactly how many problems there are and which items contain them.
It is very useful to append this actor as a webhook. You can easily chain another actor after this one to send an email or add a report to your Google Sheets, to name just a few examples. Check Apify Store for more.

How it works

Loads data into memory in batches (Key Value store records and raw data are loaded all at once).
Each item in the batch is scanned for the provided field. The actor keeps track of previous occurrences and counts duplicates.
A report is created after the whole run and saved as OUTPUT to the default Key Value store.
Between batches, the state of the actor is saved so it doesn't have to repeat the work after a restart (migration).
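
The steps above can be sketched in a few lines of JavaScript. This is only an illustration of the idea, not the actor's actual source: the helper name createDuplicateTracker and its methods are hypothetical, and state persistence between batches is left out.

```javascript
// Hypothetical helper (not the actor's real code): remembers where each value
// of `field` was seen, across any number of batches.
function createDuplicateTracker(field) {
    // value -> array of original item indexes where the value occurred
    const seen = new Map();
    let processed = 0;

    return {
        // Scan one batch; indexes continue from the previous batches.
        addBatch(items) {
            for (const item of items) {
                const value = item[field];
                if (!seen.has(value)) seen.set(value, []);
                seen.get(value).push(processed);
                processed += 1;
            }
        },
        // Keep only values that occurred at least `minDuplications` times.
        report(minDuplications = 2) {
            const result = {};
            for (const [value, indexes] of seen) {
                if (indexes.length >= minDuplications) {
                    result[value] = { count: indexes.length, originalIndexes: indexes };
                }
            }
            return result;
        },
    };
}

const tracker = createDuplicateTracker('imageUrl');
tracker.addBatch([{ imageUrl: 'a.jpg' }, { imageUrl: 'b.jpg' }]);
tracker.addBatch([{ imageUrl: 'a.jpg' }]);
// 'a.jpg' is reported with count 2 and originalIndexes [0, 2]; 'b.jpg' is not reported.
console.log(tracker.report());
```

A Map keyed by the field value keeps each lookup cheap, which fits the requirement that checked values are simple strings or numbers.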

Input

This actor expects a JSON object as an input. You can also set it up in a visual UI editor on Apify. You can find examples in the Input and Example Run tabs of the actor page in Apify Store. All the input fields (regardless of section) are top level fields.

Main input fields

datasetId <string> ID of the dataset where the data are located. If you need another input type such as a Key Value store record or raw JSON, use keyValueStoreRecord or rawData. You have to specify exactly one of datasetId, keyValueStoreRecord or rawData.
checkOnlyCleanItems <boolean> If datasetId is provided, only clean dataset items are loaded and used for the duplicate check. Default: false
fields <array> List of fields in each item that will be checked for duplicates. Each field must be a top-level (not nested) field and should contain only a simple value (string or number). For backward compatibility, you can also use the field option to pass a single <string> value. You can prepare your data with preCheckFunction. Required
preCheckFunction <stringified function> Stringified JavaScript function that can apply an arbitrary transformation to the input data before the check. See the preCheckFunction section. Optional
minDuplications <number> Minimum number of occurrences for a value to be included in the report. Default: 2

Show options

showIndexes: <boolean> Indexes of the duplicate items will be shown in the OUTPUT report. Set to false if you don't need them. Default: true
showItems: <boolean> Duplicate items will be pushed to a dataset. Set to false if you don't need them. Default: true
showMissing: <boolean> Items where the value of the field is missing, null or '' will be included in the report. Default: true

Dataset pagination options

limit: <number> How many items will be checked. Default: all
offset: <number> From which item the checking will start. Use with limit to check specific items. Default: 0
batchSize: <number> Number of items loaded and processed in each batch. Change this only if your items are really huge. Default: 1000
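
To see how the three pagination options interact, here is a rough sketch that splits the requested window into batches. The function batchRanges and the totalItems parameter are hypothetical names used for illustration; the actor's real batching logic may differ.

```javascript
// Hypothetical helper: splits the window [offset, offset + limit) into batches
// of at most `batchSize` items. `totalItems` is the dataset size (an assumption).
function batchRanges({ offset = 0, limit = Infinity, batchSize = 1000, totalItems }) {
    const end = Math.min(totalItems, offset + limit);
    const ranges = [];
    for (let start = offset; start < end; start += batchSize) {
        ranges.push({ offset: start, limit: Math.min(batchSize, end - start) });
    }
    return ranges;
}

// Check items 500..2999 of a 10000-item dataset with the default batch size.
console.log(batchRanges({ offset: 500, limit: 2500, totalItems: 10000 }));
// Three batches: offsets 500, 1500 and 2500 with limits 1000, 1000 and 500.
```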

Other data sources

keyValueStoreRecord <string> ID and record key if you want to load from a Key Value store. The format is {keyValueStoreId}+{recordKey}, e.g. s5NJ77qFv8b4osiGR+MY-KEY. You have to specify exactly one of keyValueStoreRecord, datasetId or rawData.
rawData <array> Array of objects to be checked. You have to specify exactly one of rawData, keyValueStoreRecord or datasetId.
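
The "exactly one data source" rule and the {keyValueStoreId}+{recordKey} format can be illustrated with a small validation sketch. The function resolveDataSource is a hypothetical name, and splitting on the first + is an assumption about how the record string is parsed.

```javascript
// Hypothetical helper: checks that exactly one data source is given and
// splits keyValueStoreRecord into its two parts.
function resolveDataSource(input) {
    const { datasetId, keyValueStoreRecord, rawData } = input;
    const has = (value) => value !== undefined && value !== null && value !== '';
    const providedCount = [datasetId, keyValueStoreRecord, rawData].filter(has).length;
    if (providedCount !== 1) {
        throw new Error('Specify exactly one of datasetId, keyValueStoreRecord or rawData');
    }
    if (has(datasetId)) return { type: 'dataset', datasetId };
    if (has(rawData)) return { type: 'rawData', items: rawData };

    // Assumption: the store ID never contains "+", so split on the first one.
    const plusIndex = keyValueStoreRecord.indexOf('+');
    if (plusIndex <= 0 || plusIndex === keyValueStoreRecord.length - 1) {
        throw new Error('keyValueStoreRecord must have the format {keyValueStoreId}+{recordKey}');
    }
    return {
        type: 'keyValueStore',
        keyValueStoreId: keyValueStoreRecord.slice(0, plusIndex),
        recordKey: keyValueStoreRecord.slice(plusIndex + 1),
    };
}

// Yields keyValueStoreId "s5NJ77qFv8b4osiGR" and recordKey "MY-KEY".
console.log(resolveDataSource({ keyValueStoreRecord: 's5NJ77qFv8b4osiGR+MY-KEY' }));
```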

preCheckFunction

preCheckFunction lets you transform the input data before the actual check. Its main use is to ensure that the field you are checking is a top-level (not nested) field and that its value is a simple value like a number or string. (Deep equality checks for nested structures are not supported, for simplicity and performance reasons.)

So for example, let's say you have an item with a nested field images:

[{
  "url": "https://www.bloomingdales.com/shop/product/lauren-ralph-lauren-ruffled-georgette-dress?ID=3493626&CategoryID=1005206",
  "images": [
    {
      "src": "https://images.bloomingdalesassets.com/is/image/BLM/products/9/optimized/10317399_fpx.tif",
      "cloudPath": ""
    }
  ],
  ... // more fields that you are not interested in
}]

If you want to check the first image URL for duplicates and keep the item url for reference, you can easily transform the whole data with a simple preCheckFunction:

(data) => data.map((item) => ({ url: item.url, imageUrl: item.images[0].src }))

Now set field in the input to imageUrl and the check will run on the image URLs.
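
Note that the one-liner above throws if an item has no images or an empty images array. A slightly more defensive variant is sketched below, assuming the function runs on a Node.js version that supports optional chaining (14 or newer). Here it is assigned to a constant so it can be tried locally; in the actor input you would pass just the function as a string.

```javascript
// Defensive variant of the transformation: items without images get
// imageUrl === undefined instead of crashing the whole run.
const preCheckFunction = (data) => data.map((item) => ({
    url: item.url,
    imageUrl: item.images?.[0]?.src,
}));

// Local try-out with made-up items (example.com URLs are placeholders).
const sample = [
    { url: 'https://example.com/a', images: [{ src: 'https://example.com/a.jpg' }] },
    { url: 'https://example.com/b', images: [] },
    { url: 'https://example.com/c' },
];
console.log(preCheckFunction(sample));
```

Items whose imageUrl ends up undefined are then treated as missing values, which the showMissing option controls.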

Report

At the end of the actor run, the report is saved to the default Key Value store as OUTPUT. Also, if showItems is true, the duplicate items are pushed to the default dataset.

By default, the report includes all information, but you can opt out by setting any of showIndexes, showItems or showMissing to false.

The report is an object in which every field value that appeared at least minDuplications times (twice by default, which means it was a duplicate) is included as a key. For each value, the report contains count, originalIndexes (indexes of the items in your original dataset, or in the data returned by preCheckFunction) and outputIndexes (only present when showItems is enabled). The indexes should help you navigate the duplicates in your data.

OUTPUT example

{
  "https://images.bloomingdalesassets.com/is/image/BLM/products/4/optimized/9153524_fpx.tif": {
    "count": 2,
    "originalIndexes": [
      166,
      202
    ],
    "outputIndexes": [
      0,
      1
    ]
  },
  "https://images.bloomingdalesassets.com/is/image/BLM/products/9/optimized/9832349_fpx.tif": {
    "count": 2,
    "originalIndexes": [
      1001,
      1002
    ],
    "outputIndexes": [
      2,
      3
    ]
  }
}

The items are intentionally not included in the OUTPUT report to reduce its size. Instead, they are pushed to the default dataset, where you can locate them with outputIndexes if you need to connect the OUTPUT with the dataset for deeper analysis.
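
As an illustration of connecting the two, the sketch below joins a report with the duplicate items by outputIndexes. It assumes you have already downloaded both the OUTPUT record and the items of the run's default dataset into memory; collectDuplicates is a hypothetical helper name.

```javascript
// Hypothetical helper: attaches the actual duplicate items to each report entry.
// `report` is the parsed OUTPUT record, `datasetItems` the downloaded dataset items.
function collectDuplicates(report, datasetItems) {
    return Object.entries(report).map(([value, { count, outputIndexes = [] }]) => ({
        value,
        count,
        // outputIndexes is only present when showItems was enabled.
        items: outputIndexes.map((index) => datasetItems[index]),
    }));
}

// Made-up data in the shape of the OUTPUT example above.
const report = {
    'a.tif': { count: 2, originalIndexes: [166, 202], outputIndexes: [0, 1] },
};
const datasetItems = [
    { url: 'https://example.com/1', imageUrl: 'a.tif' },
    { url: 'https://example.com/2', imageUrl: 'a.tif' },
];
console.log(collectDuplicates(report, datasetItems));
```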

Checking more fields

The first version of the actor had the option to check more fields at once, but it produced very complicated output and the implementation was too convoluted, so I decided to abandon the idea for simplicity. In case you want to check more fields, simply run the actor once for each field. Since the actor's consumption is pretty low, it is not a big deal.

More info coming soon!

Epilogue

If you find any problem or would like to add a new feature, please create an issue on the GitHub repo.

Thanks everybody for using it and giving any feedback!
