Data Deduplicator

Pricing

from $1.49 / 1,000 items processed

Merge and deduplicate Apify datasets by any field combination. Remove duplicate rows while keeping the first occurrence. Supports case-insensitive matching and whitespace trimming.

Developer

ParseBird

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 4
  • Monthly active users: 2
  • Last modified: 3 days ago

Combine multiple Apify datasets and remove duplicates by URL, email, name + company, or any other field combination. Case-insensitive matching and whitespace trimming are built in.

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parsebird/dataset-deduplicator on Apify. Call: ApifyClient("TOKEN").actor("parsebird/dataset-deduplicator").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for deduplicated results. Key inputs: datasetIds (array of strings — Apify dataset IDs to merge), jsonData (array of objects — direct JSON input, alternative to datasetIds), fields (array of strings, required — field names for dedup key). Matching is case-insensitive with whitespace trimming. First occurrence is kept. Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~dataset-deduplicator (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

What does Data Deduplicator do?

This Actor merges one or more Apify datasets and removes duplicate rows based on the fields you specify. It is a quick way to clean up scraped data before analysis or export.

  • Single-field dedup — deduplicate by url, email, phone, or any single field
  • Composite key dedup — combine multiple fields like firstName + lastName + company to identify unique records
  • Smart matching — case-insensitive comparison with automatic whitespace trimming
  • Multi-dataset merge — combine items from multiple dataset IDs before deduplication
  • Direct JSON input — pass data directly as a JSON array instead of referencing datasets
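The matching behavior listed above can be illustrated with a minimal local sketch. This is not the Actor's actual implementation; the helper names `dedup_key` and `deduplicate` are hypothetical, and treating missing or null fields as empty strings is an assumption made only for this example.

```python
from typing import Any, Iterable


def dedup_key(item: dict[str, Any], fields: list[str]) -> tuple[str, ...]:
    """Build a normalized key: stringify, trim whitespace, lowercase."""
    # Assumption: missing or null fields are treated as empty strings.
    return tuple(
        "" if item.get(f) is None else str(item.get(f)).strip().lower()
        for f in fields
    )


def deduplicate(items: Iterable[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep the first item seen for each normalized key, preserving order."""
    seen: set[tuple[str, ...]] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)  # the original item is kept unmodified
    return unique


# Two in-memory "datasets" merged before deduplication.
dataset_a = [{"email": "John@Example.com ", "name": "John Doe"}]
dataset_b = [
    {"email": "john@example.com", "name": "J. Doe"},
    {"email": "jane@example.com", "name": "Jane Smith"},
]
merged = dataset_a + dataset_b
result = deduplicate(merged, ["email"])
print(len(result))  # 2
```

The two John Doe rows collapse into one because their emails match after trimming and lowercasing, and the first occurrence is the one kept.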

How to use it (5 steps)

  1. Run your scraper(s) — collect data into one or more Apify datasets
  2. Copy the dataset ID(s) — find them in the Apify Console under your run's Storage tab
  3. Choose your dedup fields — pick the field(s) that uniquely identify each record
  4. Run this Actor — pass the dataset IDs and field names as input
  5. Get clean data — deduplicated items appear in the output dataset

Input parameters

Parameter | Type | Required | Default | Description
datasetIds | string[] | No* | - | Apify dataset IDs to merge and deduplicate
jsonData | array | No* | - | Direct JSON array of objects to deduplicate
fields | string[] | Yes | - | Field names for the dedup key

*Provide either datasetIds or jsonData (or both).
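As a rough illustration of the rule above, the following sketch builds an input object and checks it on the client side before a run is started. The helper `build_run_input` is hypothetical and is not part of the Actor or the Apify client; how the Actor itself validates input is not documented here.

```python
from typing import Any, Optional


def build_run_input(
    fields: list[str],
    dataset_ids: Optional[list[str]] = None,
    json_data: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble the Actor input, enforcing the documented requirements."""
    if not fields:
        raise ValueError("fields is required and must not be empty")
    if not dataset_ids and not json_data:
        raise ValueError("provide datasetIds, jsonData, or both")

    run_input: dict[str, Any] = {"fields": list(fields)}
    if dataset_ids:
        run_input["datasetIds"] = list(dataset_ids)
    if json_data:
        run_input["jsonData"] = list(json_data)
    return run_input


# Direct JSON input instead of dataset IDs.
run_input = build_run_input(
    fields=["email"],
    json_data=[
        {"email": "john@example.com"},
        {"email": "JOHN@example.com"},
    ],
)
print(run_input)
```

The resulting dictionary can be passed as `run_input` to the client call shown in the API section.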

Composite key examples

Use case | Fields | Effect
Unique URLs | ["url"] | One row per URL
Unique emails | ["email"] | One row per email address
Unique people | ["firstName", "lastName", "company"] | One row per person at each company
Unique products | ["sku", "marketplace"] | One row per SKU per marketplace

Output example

Deduplicated items retain their original structure — no fields are added or removed:

[
  {"name": "John Doe", "email": "john@example.com", "company": "Acme"},
  {"name": "Jane Smith", "email": "jane@example.com", "company": "Beta"},
  {"name": "Bob Wilson", "email": "bob@example.com", "company": "Gamma"}
]

A stats key is stored in the key-value store:

{
  "totalLoaded": 5000,
  "uniqueKept": 3200,
  "duplicatesRemoved": 1800,
  "datasetsProcessed": 3
}

How to use via API

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("parsebird/dataset-deduplicator").call(run_input={
    "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
    "fields": ["email"],
})

# Read the deduplicated items from the run's default dataset
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Unique items: {len(items)}")

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('parsebird/dataset-deduplicator').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['firstName', 'lastName', 'company'],
});

// Read the deduplicated items from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Unique items: ${items.length}`);

cURL

curl -X POST "https://api.apify.com/v2/acts/parsebird~dataset-deduplicator/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["DATASET_ID_1"],
    "fields": ["url"]
  }'

Tips and best practices

  • Start with a single field: url or email usually covers most use cases
  • Use composite keys carefully — the more fields, the stricter the matching (fewer duplicates found)
  • Matching is always case-insensitive with whitespace trimming — no configuration needed
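To see why adding fields makes matching stricter, here is a small local sketch that counts duplicates for two different keys over the same rows. The function `count_duplicates` and the sample data are hypothetical; the normalization mirrors the documented behavior (case-insensitive, whitespace-trimmed) but is not the Actor's own code.

```python
def count_duplicates(items: list[dict], fields: list[str]) -> int:
    """Count rows whose normalized key was already seen earlier in the list."""
    seen = set()
    duplicates = 0
    for item in items:
        # Normalize each field: stringify, trim whitespace, lowercase.
        key = tuple(str(item.get(f, "")).strip().lower() for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


people = [
    {"firstName": "John", "lastName": "Doe", "company": "Acme"},
    {"firstName": "john", "lastName": "DOE ", "company": "Beta"},
    {"firstName": "John", "lastName": "Doe", "company": "acme"},
]

by_name = count_duplicates(people, ["firstName", "lastName"])
by_name_and_company = count_duplicates(people, ["firstName", "lastName", "company"])
print(by_name, by_name_and_company)  # 2 1
```

With the name-only key all three rows count as the same person, so two are duplicates. Adding company to the key separates the Beta row, leaving only one duplicate.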

Pricing

This Actor uses a pay-per-event pricing model.

Event | Price per event | Price per 1,000
items-processed | $0.00149 | $1.49

Charged per 1,000 items loaded (not per unique item). Platform compute costs are additional.
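A quick way to estimate the event charge for a run is to multiply the number of items loaded by the per-event price from the table. The sketch below assumes one items-processed event per loaded item and excludes platform compute costs; the function name `estimate_event_cost` is hypothetical.

```python
from decimal import Decimal

# Per-event price taken from the pricing table above.
PRICE_PER_ITEM = Decimal("0.00149")


def estimate_event_cost(items_loaded: int) -> Decimal:
    """Estimate the pay-per-event charge in USD, excluding compute costs."""
    if items_loaded < 0:
        raise ValueError("items_loaded must be non-negative")
    # Charged on items loaded, not on unique items kept.
    return PRICE_PER_ITEM * items_loaded


# Using totalLoaded from the stats example (5000 items).
cost = estimate_event_cost(5000)
print(f"${cost:.2f}")  # $7.45
```

Because the charge is based on items loaded, a run that keeps 3,200 of 5,000 items is still billed for all 5,000.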