Lead List Deduplicator & Merger

Merge and deduplicate lead lists from multiple Apify datasets, CSV files and inline JSON into one clean, outreach-ready list. Pure data processor — no scraping, no proxies, no external APIs.

Pricing

from $2.00 / 1,000 records processed

Developer

Data Runner

Last modified

25 days ago

Lead List Deduplicator & Merger — Clean & Combine Lead Lists

Merge multiple lead lists into one clean, deduplicated, outreach-ready list — in seconds. Combine the results of several scrapers (Google Maps, Instagram, TikTok, Facebook, YouTube, TripAdvisor, website email extraction and more), remove duplicate leads, and merge partial records into complete contacts. Built for cold email outreach, lead generation agencies, and sales teams who run many scrapers and end up with messy, overlapping data.

This is a pure data processor: no scraping, no browsers, no proxies, no external APIs. It can't be blocked, it can't break from website changes, and it runs fast on the smallest memory tier.

Why deduplicate and merge your lead lists?

When you scrape leads from several sources, the same business or person shows up again and again — once from Google Maps, once from Instagram, once from a website crawl. Sending to a dirty list quietly costs you money and reputation:

Protect your sender reputation & deliverability. Emailing the same contact twice (or hitting old, duplicated addresses) drives spam complaints and bounces. High bounce rates and duplicate sends are two of the fastest ways to wreck a sending domain. A clean, deduplicated list keeps your inbox placement healthy.
Stop wasting outreach credits. Email verification tools, enrichment APIs and sending platforms usually charge per contact. Every duplicate is money spent twice. Deduplicating before you verify or send is the cheapest optimization in your funnel.
Keep your CRM clean. Duplicate and fragmented records pollute your CRM, break reporting, and create awkward "didn't we already talk to them?" moments. Merge first, import once.
Get richer records. One source has the email, another has the phone, a third has the website. Merging fragments field-by-field turns three thin rows into one complete, ready-to-contact lead.

What it does

✅ Merges three input types at once — Apify datasets, public CSV file URLs, and inline JSON. Mix and match freely.
✅ Deduplicates by email, website domain, phone, or name + city — match on any combination of keys.
✅ Smart fuzzy matching across messy data — case-insensitive emails, mailto: stripping, normalized domains (https://www.Acme.com/contact → acme.com), E.164 phone formatting, accent-folding for names.
✅ Connected-component grouping — if record A matches B by email and B matches C by domain, all three collapse into a single lead automatically.
✅ Field-by-field merging — keep the most complete record and fill the gaps from its duplicates. Conflicting values are preserved in a _conflicts field so nothing is silently lost.
✅ Recognizes field variants across scrapers — email / emails[0] / businessEmail, phone / phoneNumber / phones[0], website / url / websiteUrl, name / title / businessName / channelName, and more.
✅ Never crashes on bad data — malformed CSV rows, empty datasets, mixed schemas and missing fields are skipped, counted, and reported. The run keeps going.
✅ Export anywhere — results land in a standard Apify dataset you can download as CSV, JSON, Excel, or HTML.
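
Two of the normalizations listed above, domain reduction and accent-folding, can be sketched with the Python standard library. This is only an illustration of the idea, not the Actor's own code: the function names `normalize_domain` and `fold_name` are hypothetical, and phone formatting is left out because E.164 conversion needs a dedicated phone-number library.

```python
import unicodedata
from urllib.parse import urlparse


def normalize_domain(value: str) -> str:
    """Reduce a URL or bare host to a lowercase domain without 'www.'."""
    raw = value.strip().lower()
    # urlparse only fills in the host when a scheme (or '//') is present.
    if "://" not in raw:
        raw = "//" + raw
    host = urlparse(raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


def fold_name(value: str) -> str:
    """Accent-fold and case-fold a name, collapsing repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())
```

With these helpers, `https://www.Acme.com/contact` and `acme.com` produce the same match key, as do `Café München` and `cafe munchen`.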

How it works (3 steps)

1. Point it at your data. Provide one or more Apify dataset IDs from previous scraper runs, public CSV URLs, and/or paste records as inline JSON.
2. Choose how duplicates are matched and merged. Pick your dedupe keys (email is the default and safest) and a merge strategy.
3. Run it. Get back one clean, merged record per unique lead in the output dataset, plus a summary of exactly what was combined and removed.

Input example 1 — Merge previous scraper runs (Apify datasets)

{
  "datasetIds": ["aBcD1234efGh5678", "XyZ9876wVuT5432"],
  "dedupeKeys": ["email", "domain"],
  "mergeStrategy": "most_complete"
}

Input example 2 — Merge CSV files

{
  "csvUrls": [
    "https://example.com/google-maps-leads.csv",
    "https://example.com/instagram-leads.csv"
  ],
  "dedupeKeys": ["email", "phone"],
  "normalizePhones": true,
  "defaultCountry": "US"
}

Input example 3 — Paste records directly (inline JSON)

{
  "inlineRecords": [
    { "businessName": "Acme Inc", "email": "Info@Acme.com", "phone": "(415) 555-2671" },
    { "name": "Acme", "emails": ["info@acme.com"], "website": "https://www.acme.com" }
  ],
  "dedupeKeys": ["email", "domain", "name+city"],
  "mergeStrategy": "most_complete",
  "keepSourceInfo": true
}

💡 You can combine all three sources in a single run. Everything is pooled and deduplicated together.

Input options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `datasetIds` | string[] | `[]` | Apify dataset IDs from previous runs. Invalid IDs are skipped with a warning. |
| `csvUrls` | string[] | `[]` | Public CSV URLs. Tolerates BOM, `,` `;` tab `\|` delimiters, quoted fields. |
| `inlineRecords` | object[] | `[]` | Raw records pasted as JSON. |
| `dedupeKeys` | string[] | `["email"]` | Any of `email`, `domain`, `phone`, `name+city`. Match on ANY selected key. |
| `mergeStrategy` | string | `most_complete` | `most_complete`, `first_seen`, or `last_seen`. |
| `normalizePhones` | boolean | `true` | Format output phones as E.164 (e.g. `+14155552671`). |
| `defaultCountry` | string | `US` | ISO country code for phones without a prefix. |
| `stripPlusAliases` | boolean | `false` | Treat `john+tag@x.com` and `john@x.com` as the same lead. |
| `keepSourceInfo` | boolean | `true` | Add a `_sources` array to each merged record. |
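
If you build the run input in a script, a small pre-flight helper can fill in the documented defaults and catch typos before the run starts. The sketch below is an assumption-laden example, not part of the Actor: `with_defaults` is a hypothetical helper, and the "at least one source" check is a local convenience rather than documented Actor behavior.

```python
from copy import deepcopy

# Defaults and allowed values copied from the options table above.
DEFAULTS = {
    "datasetIds": [],
    "csvUrls": [],
    "inlineRecords": [],
    "dedupeKeys": ["email"],
    "mergeStrategy": "most_complete",
    "normalizePhones": True,
    "defaultCountry": "US",
    "stripPlusAliases": False,
    "keepSourceInfo": True,
}
ALLOWED_KEYS = {"email", "domain", "phone", "name+city"}
ALLOWED_STRATEGIES = {"most_complete", "first_seen", "last_seen"}


def with_defaults(run_input: dict) -> dict:
    """Return run_input with documented defaults filled in, after basic checks."""
    merged = {**deepcopy(DEFAULTS), **run_input}
    unknown = set(merged["dedupeKeys"]) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown dedupe keys: {sorted(unknown)}")
    if merged["mergeStrategy"] not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown merge strategy: {merged['mergeStrategy']}")
    # Assumption: a local pre-flight check, not documented Actor behavior.
    if not (merged["datasetIds"] or merged["csvUrls"] or merged["inlineRecords"]):
        raise ValueError("provide at least one dataset ID, CSV URL or inline record")
    return merged
```

The returned dictionary can then be passed as the run input through whichever client or scheduler you use to start the Actor.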

Output

Each unique lead becomes one merged record in the output dataset. Original fields are preserved; clean canonical name / email / phone / website / city fields are added; merge metadata is prefixed with _.

Example output record

{
  "name": "Acme, Inc.",
  "email": "info@acme.com",
  "phone": "+14155552671",
  "website": "https://www.acme.com/contact",
  "city": "San Francisco",
  "industry": "SaaS",
  "_sources": ["dataset:aBcD1234efGh5678", "csv:example.com/instagram-leads.csv"],
  "_conflicts": { "name": ["Acme HQ"] },
  "_duplicateCount": 3
}

_sources — which dataset(s)/file(s) this lead was merged from.
_conflicts — alternate values that disagreed (e.g. two different phone numbers), so you never lose data.
_duplicateCount — how many input records were combined into this one.

Run summary (key-value store → `OUTPUT`)

{
  "totalInput": 22000,
  "totalOutput": 15840,
  "duplicatesRemoved": 6160,
  "duplicateRate": 0.28,
  "perSourceCounts": { "dataset:aBcD1234efGh5678": 12000, "csv:example.com/instagram-leads.csv": 10000 },
  "malformedSkipped": 12,
  "runtimeSeconds": 9
}
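
The figures in this example are consistent with two simple relations: `duplicatesRemoved = totalInput - totalOutput` and `duplicateRate = duplicatesRemoved / totalInput`. The sketch below recomputes them as a sanity check. The relations are inferred from the sample numbers rather than taken from the Actor's source, and `check_summary` is a hypothetical helper.

```python
def check_summary(summary: dict) -> bool:
    """Verify that the duplicate counts in a run summary add up."""
    total_in = summary["totalInput"]
    removed = total_in - summary["totalOutput"]
    rate = round(removed / total_in, 2) if total_in else 0.0
    sources_total = sum(summary.get("perSourceCounts", {}).values())
    return (
        removed == summary["duplicatesRemoved"]
        and rate == round(summary["duplicateRate"], 2)
        # Only compare per-source totals when the breakdown is present.
        and (not sources_total or sources_total == total_in)
    )
```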

Works perfectly with the rest of the suite

This Actor is the glue that ties your lead-generation stack together. Run any of these scrapers, then pipe their datasets straight into the Lead List Deduplicator & Merger:

Google Maps Lead Generator Pro — local business leads with phone, website and address.
Instagram Email Scraper — creator and business emails from Instagram.
TikTok Email Scraper — contact emails from TikTok profiles.
Facebook Page Lead Scraper — business contact details from Facebook.
YouTube Channel Email Scraper — channel and business emails from YouTube.
TripAdvisor Leads Scraper — hospitality and venue leads.
Website Email Extractor — emails and contact data crawled from any website list.
Email Verifier & Enricher Pro — verify and enrich your final list after deduplicating (dedupe first to save credits!).

Recommended workflow: scrape with several Actors → deduplicate & merge here → verify & enrich → import to your CRM or sending tool.

Pricing

This Actor uses simple, transparent pay-per-event pricing:

$2.00 per 1,000 input records processed ($0.002 per record).

You are charged per input record ingested — the rows you feed in — as they are processed, so partial runs only bill for what was actually handled. Malformed/skipped rows are never charged. There is no per-run start fee. Deduplication typically shrinks your list, so every downstream tool (verification, enrichment, sending) costs less afterwards — this Actor usually pays for itself on the very first run.
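
As a quick way to budget a run, the per-record price can be applied directly to the number of records you expect to be processed. This is a minimal sketch based only on the price stated above; `estimate_cost` is a hypothetical helper and ignores any platform usage costs outside this Actor's event pricing.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("0.002")  # $2.00 per 1,000 input records


def estimate_cost(billable_records: int) -> Decimal:
    """Estimated charge in USD for records that were actually processed."""
    if billable_records < 0:
        raise ValueError("billable_records must be non-negative")
    return PRICE_PER_RECORD * billable_records
```

For example, 22,000 billable records come to $44, and a 1,000-record test run to $2.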

FAQ

What counts as a duplicate? Two records are duplicates if they match on any of your selected dedupe keys — same email, same website domain, same phone number, or same name + city. Matching is transitive: if A matches B and B matches C, all three are merged into one lead.
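
Transitive matching of this kind is commonly implemented with a union-find (disjoint-set) structure. The sketch below shows the technique under stated assumptions: records are already normalized, each dedupe key is a plain field on the record, and `group_duplicates` is a hypothetical name. It is not the Actor's actual implementation.

```python
from collections import defaultdict


def group_duplicates(records: list[dict], keys: list[str]) -> list[list[int]]:
    """Group record indexes that share a value on ANY key, transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen: dict[tuple[str, str], int] = {}
    for idx, record in enumerate(records):
        for key in keys:
            value = record.get(key)
            if not value:
                continue  # empty values never match each other
            owner = first_seen.setdefault((key, value), idx)
            union(idx, owner)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())
```

Given A and B sharing an email and B and C sharing a domain, all three indexes end up in one group even though A and C have no key in common.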

Will it merge john.doe@gmail.com and johndoe@gmail.com? No. By design we do not apply Gmail-style dot-folding, because for business (B2B) addresses those can be different real inboxes. Emails are matched after lowercasing, trimming and mailto: removal only. (You can optionally treat +tag aliases as the same address with stripPlusAliases.)
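
The email rule described here can be sketched as follows. This is an illustrative reading of the answer above, not the Actor's code: `email_match_key` is a hypothetical function, and the `strip_plus_aliases` argument mirrors the `stripPlusAliases` option.

```python
def email_match_key(value: str, strip_plus_aliases: bool = False) -> str:
    """Build the key used to compare two emails. Dots are never folded."""
    email = value.strip().lower()
    if email.startswith("mailto:"):
        email = email[len("mailto:"):].strip()
    if strip_plus_aliases and "@" in email:
        local, _, domain = email.rpartition("@")
        local = local.split("+", 1)[0]  # drop everything after the first '+'
        email = f"{local}@{domain}"
    return email
```

Under this rule `john.doe@gmail.com` and `johndoe@gmail.com` stay distinct, while `john+tag@x.com` collapses to `john@x.com` only when alias stripping is switched on.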

How does field mapping work across different scrapers? The Actor recognizes common field-name variants automatically — for example email / emails[0] / businessEmail, phone / phoneNumber / phones[0], website / url / websiteUrl, and name / title / businessName / channelName. Unknown fields are passed through to the output untouched, so nothing is lost.
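
One way to picture this mapping is a lookup table of variants per canonical field, checked in order. The sketch below uses only the variants named above. The priority order, the `FIELD_VARIANTS` table and the `pick_canonical` helper are assumptions made for illustration.

```python
# Variants named in the documentation; the order here is an assumption.
FIELD_VARIANTS = {
    "email": ["email", "emails", "businessEmail"],
    "phone": ["phone", "phoneNumber", "phones"],
    "website": ["website", "url", "websiteUrl"],
    "name": ["name", "title", "businessName", "channelName"],
}


def pick_canonical(record: dict, field: str):
    """Return the first non-empty variant of a canonical field, or None."""
    for variant in FIELD_VARIANTS[field]:
        value = record.get(variant)
        if isinstance(value, list):
            value = value[0] if value else None  # emails[0], phones[0]
        if value not in (None, ""):
            return value
    return None
```

Fields that are not listed in the table would simply be copied to the output unchanged.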

What happens to conflicting values when records are merged? The winning value (per your merge strategy) becomes the field value, and any differing alternates are stored in a _conflicts object on the record. You keep full visibility into every value that was seen.
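
A minimal sketch of a `most_complete` style merge is shown below. It assumes that the record with the most filled fields wins as a whole and that the others only fill gaps. The Actor may resolve conflicts differently, and `merge_group` is a hypothetical name.

```python
EMPTY = (None, "", [])


def merge_group(records: list[dict]) -> dict:
    """Merge duplicate records into one, keeping alternates in _conflicts."""

    def filled(record: dict) -> int:
        return sum(1 for v in record.values() if v not in EMPTY)

    # max() returns the first maximal record, so ties keep input order.
    base = max(records, key=filled)
    merged = dict(base)
    conflicts: dict[str, list] = {}
    for other in records:
        if other is base:
            continue
        for field, value in other.items():
            if value in EMPTY:
                continue
            if merged.get(field) in EMPTY:
                merged[field] = value  # fill a gap from a duplicate
            elif merged[field] != value and value not in conflicts.get(field, []):
                conflicts.setdefault(field, []).append(value)
    if conflicts:
        merged["_conflicts"] = conflicts
    merged["_duplicateCount"] = len(records)
    return merged
```

Merging a thin `Acme HQ` row into a fuller `Acme, Inc.` row keeps `Acme, Inc.` as the name and records `Acme HQ` under `_conflicts`, so the alternate value stays visible.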

Is there a file size or record limit? The Actor streams data in batches and keeps only one record per unique lead in memory, so it scales to large lists. For very large jobs (hundreds of thousands of records) simply bump the memory in the run options. CSV inputs are fetched whole, so for massive files prefer pushing them through an Apify dataset.

What if a dataset ID is wrong or a CSV fails to download? The Actor warns and continues with the remaining sources. It only fails the run if zero valid records were found across everything — with a clear, actionable message.

What happens on a partial run (timeout, stop, charge limit)? Records are processed and billed incrementally, and output is written at the end of processing, so a partial run bills only for the records it actually handled.

Do I need proxies or any special setup? No. This is a pure data tool — no scraping, no proxies, no anti-bot concerns. Point it at your data and run.

Get started

1. Add this Actor to your account.
2. Paste your dataset IDs, CSV URLs, or inline records.
3. Pick your dedupe keys and merge strategy.
4. Run, then download your clean list as CSV, JSON or Excel.

One clean list. No duplicates. Ready for outreach.
