Data Deduplicator
Merge and deduplicate Apify datasets by any field combination. Remove duplicate rows while keeping the first or last occurrence. Supports case-insensitive matching, whitespace trimming, and composite dedup keys.
> Combine multiple Apify datasets and remove duplicates by URL, email, name + company, or any field combination. Case-insensitive matching and whitespace trimming built in.

**ParseBird Infra Suite** • Utility tools for data pipelines

| Actor | What it does |
|---|---|
| 🔗 HTTP Request | Send API calls from the cloud |
| 📚 Data Deduplicator | ➤ You are here |
| 🗡 Data Cleaner | Clean nulls, normalize case, format phones & emails |
Copy to your AI assistant
Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.
```text
parsebird/dataset-deduplicator on Apify. Call: ApifyClient("TOKEN").actor("parsebird/dataset-deduplicator").call(run_input={...}),
then client.dataset(run["defaultDatasetId"]).list_items().items for deduplicated results.
Key inputs: datasetIds (array of strings — Apify dataset IDs to merge), jsonData (array of objects — direct JSON input, alternative to datasetIds),
fields (array of strings, required — field names for dedup key), keepOrder (string, "first"/"last", default "first"),
caseInsensitive (boolean, default false), trimWhitespace (boolean, default true), outputDatasetId (string, optional), maxItems (integer, default 1000000).
Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~dataset-deduplicator (Bearer TOKEN).
Get token: https://console.apify.com/account/integrations
```
What does Data Deduplicator do?
This Actor merges one or more Apify datasets and removes duplicate rows based on fields you specify. It's a quick way to clean up scraped data before analysis or export.
- **Single-field dedup** — deduplicate by `url`, `email`, `phone`, or any single field
- **Composite key dedup** — combine multiple fields like `firstName` + `lastName` + `company` to identify unique records (see the sketch after this list)
- **Case-insensitive matching** — treat "John" and "john" as the same value
- **Whitespace trimming** — ignore leading/trailing spaces in field values
- **First or last** — choose whether to keep the first or last occurrence of each duplicate
- **Multi-dataset merge** — combine items from multiple dataset IDs before deduplication
- **Direct JSON input** — pass data directly as a JSON array instead of referencing datasets
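Conceptually, the dedup key is just a tuple of the selected field values after optional trimming and case folding. Here is a minimal Python sketch of that logic (illustrative only, not the Actor's actual source):

```python
# Minimal sketch of composite-key deduplication. Illustrative only;
# this is not the Actor's actual source code.
def dedupe(items, fields, keep_order="first",
           case_insensitive=False, trim_whitespace=True):
    def make_key(item):
        parts = []
        for field in fields:
            value = str(item.get(field, ""))
            if trim_whitespace:
                value = value.strip()
            if case_insensitive:
                value = value.lower()
            parts.append(value)
        return tuple(parts)

    seen = {}  # insertion-ordered in Python 3.7+
    for item in items:
        key = make_key(item)
        if keep_order == "last" or key not in seen:
            seen[key] = item  # "last" overwrites earlier occurrences
    return list(seen.values())

rows = [
    {"firstName": "John", "lastName": "Doe", "company": "Acme"},
    {"firstName": " john", "lastName": "DOE", "company": "ACME"},
]
print(dedupe(rows, ["firstName", "lastName", "company"],
             case_insensitive=True))  # only one row survives
```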
How to use it (6 steps)
1. **Run your scraper(s)** — collect data into one or more Apify datasets
2. **Copy the dataset ID(s)** — find them in the Apify Console under your run's Storage tab, or grab them programmatically (see the snippet after this list)
3. **Choose your dedup fields** — pick the field(s) that uniquely identify each record
4. **Configure matching** — enable case-insensitive matching or whitespace trimming if needed
5. **Run this Actor** — pass the dataset IDs and field names as input
6. **Get clean data** — deduplicated items appear in the output dataset
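If you script the whole pipeline, you can skip copying IDs by hand: every run object exposes its default dataset ID, so a scraper's output can feed the deduplicator directly. A minimal sketch (the scraper name is a placeholder):

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# "your-username/your-scraper" is a placeholder for any Actor
# whose run writes items to a dataset.
scraper_run = client.actor("your-username/your-scraper").call(run_input={})

# Chain the scraper's default dataset straight into the deduplicator.
dedup_run = client.actor("parsebird/dataset-deduplicator").call(run_input={
    "datasetIds": [scraper_run["defaultDatasetId"]],
    "fields": ["url"],
})
```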
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `datasetIds` | string[] | No* | — | Apify dataset IDs to merge and deduplicate |
| `jsonData` | array | No* | — | Direct JSON array of objects to deduplicate |
| `fields` | string[] | Yes | — | Field names for the dedup key |
| `keepOrder` | string | No | `first` | Keep the first or last occurrence of duplicates |
| `caseInsensitive` | boolean | No | `false` | Treat "John" and "john" as the same |
| `trimWhitespace` | boolean | No | `true` | Strip leading/trailing whitespace before comparing |
| `outputDatasetId` | string | No | — | Named output dataset (defaults to the run dataset) |
| `maxItems` | integer | No | `1000000` | Max items to load across all datasets |

*Provide either `datasetIds` or `jsonData` (or both).
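For reference, a complete input object built from the table above could look like this (the dataset IDs are placeholders, and the values are just one plausible combination):

```json
{
  "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
  "fields": ["email"],
  "keepOrder": "last",
  "caseInsensitive": true,
  "trimWhitespace": true,
  "maxItems": 500000
}
```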
Composite key examples
| Use case | Fields | Effect |
|---|---|---|
| Unique URLs | `["url"]` | One row per URL |
| Unique emails | `["email"]` | One row per email address |
| Unique people | `["firstName", "lastName", "company"]` | One row per person at each company |
| Unique products | `["sku", "marketplace"]` | One row per SKU per marketplace |
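As a concrete illustration of the "unique people" row, consider these hypothetical input rows deduplicated with `"fields": ["firstName", "lastName", "company"]` and `caseInsensitive: true`:

```json
[
  {"firstName": "John", "lastName": "Doe", "company": "Acme"},
  {"firstName": "JOHN", "lastName": "Doe", "company": "acme"},
  {"firstName": "John", "lastName": "Doe", "company": "Beta"}
]
```

With the default `keepOrder: "first"`, the second row is dropped (it matches the first after case folding), while the third is kept because `company` differs.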
Output example
Deduplicated items retain their original structure — no fields are added or removed:
[{"name": "John Doe", "email": "john@example.com", "company": "Acme"},{"name": "Jane Smith", "email": "jane@example.com", "company": "Beta"},{"name": "Bob Wilson", "email": "bob@example.com", "company": "Gamma"}]
A stats record is stored in the run's key-value store:

```json
{
  "totalLoaded": 5000,
  "uniqueKept": 3200,
  "duplicatesRemoved": 1800,
  "datasetsProcessed": 3
}
```
How to use via API
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("parsebird/dataset-deduplicator").call(run_input={
    "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
    "fields": ["email"],
    "caseInsensitive": True,
    "trimWhitespace": True,
    "keepOrder": "first",
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Unique items: {len(items)}")
```
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('parsebird/dataset-deduplicator').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['firstName', 'lastName', 'company'],
    caseInsensitive: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Unique items: ${items.length}`);
```
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/parsebird~dataset-deduplicator/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["DATASET_ID_1"],
    "fields": ["url"],
    "keepOrder": "first"
  }'
```
Tips and best practices
- **Start with a single field** — `url` or `email` usually covers most use cases
- **Use composite keys carefully** — the more fields, the stricter the matching (fewer duplicates found)
- **Enable case-insensitive matching for user-generated data** — names, emails, and titles often have inconsistent casing
- **Keep whitespace trimming on** — scraped data frequently has stray spaces
- **Use `keepOrder: "last"`** — keeps the most recent version of each duplicate, e.g. from incremental scrapes (see the example below)
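For instance, if daily incremental scrapes append fresher rows to the same dataset, deduplicating on `url` with `keepOrder: "last"` keeps only the newest row per URL (the `price` and `scrapedAt` fields here are hypothetical):

```json
[
  {"url": "https://example.com/p/1", "price": 19.99, "scrapedAt": "2024-05-01"},
  {"url": "https://example.com/p/1", "price": 17.99, "scrapedAt": "2024-05-02"}
]
```

Only the second row survives, since it is the last occurrence of that URL.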
Pricing
This Actor uses a pay-per-event pricing model.
| Event | Price per event | Price per 1,000 |
|---|---|---|
| `items-processed` | $0.00149 | $1.49 |
Charged per 1,000 items loaded (not per unique item). Platform compute costs are additional.
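For example, a run that loads 5,000 items (like the stats example above) incurs 5,000 × $0.00149 = $7.45 in event charges, no matter how many of those items turn out to be unique.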