Lead List Deduplicator & Merger

Pricing

from $2.00 / 1,000 records processed

Merge and deduplicate lead lists from multiple Apify datasets, CSV files and inline JSON into one clean, outreach-ready list. Pure data processor — no scraping, no proxies, no external APIs.

Developer

Data Runner

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a day ago

Lead List Deduplicator & Merger — Clean & Combine Lead Lists

Merge multiple lead lists into one clean, deduplicated, outreach-ready list — in seconds. Combine the results of several scrapers (Google Maps, Instagram, TikTok, Facebook, YouTube, TripAdvisor, website email extraction and more), remove duplicate leads, and merge partial records into complete contacts. Built for cold email outreach, lead generation agencies, and sales teams who run many scrapers and end up with messy, overlapping data.

This is a pure data processor: no scraping, no browsers, no proxies, no external APIs. It can't be blocked, it can't break from website changes, and it runs fast on the smallest memory tier.


Why deduplicate and merge your lead lists?

When you scrape leads from several sources, the same business or person shows up again and again — once from Google Maps, once from Instagram, once from a website crawl. Sending to a dirty list quietly costs you money and reputation:

  • Protect your sender reputation & deliverability. Emailing the same contact twice (or hitting old, duplicated addresses) drives spam complaints and bounces. High bounce rates and duplicate sends are two of the fastest ways to wreck a sending domain. A clean, deduplicated list keeps your inbox placement healthy.
  • Stop wasting outreach credits. Email verification tools, enrichment APIs and sending platforms usually charge per contact. Every duplicate is money spent twice. Deduplicating before you verify or send is the cheapest optimization in your funnel.
  • Keep your CRM clean. Duplicate and fragmented records pollute your CRM, break reporting, and create awkward "didn't we already talk to them?" moments. Merge first, import once.
  • Get richer records. One source has the email, another has the phone, a third has the website. Merging fragments field-by-field turns three thin rows into one complete, ready-to-contact lead.

What it does

  • Merges three input types at once — Apify datasets, public CSV file URLs, and inline JSON. Mix and match freely.
  • Deduplicates by email, website domain, phone, or name + city — match on any combination of keys.
  • Smart fuzzy matching across messy data: case-insensitive emails, mailto: stripping, normalized domains (https://www.Acme.com/contact becomes acme.com), E.164 phone formatting, accent-folding for names.
  • Connected-component grouping — if record A matches B by email and B matches C by domain, all three collapse into a single lead automatically.
  • Field-by-field merging — keep the most complete record and fill the gaps from its duplicates. Conflicting values are preserved in a _conflicts field so nothing is silently lost.
  • Recognizes field variants across scrapers: email / emails[0] / businessEmail, phone / phoneNumber / phones[0], website / url / websiteUrl, name / title / businessName / channelName, and more.
  • Never crashes on bad data — malformed CSV rows, empty datasets, mixed schemas and missing fields are skipped, counted, and reported. The run keeps going.
  • Export anywhere — results land in a standard Apify dataset you can download as CSV, JSON, Excel, or HTML.
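The connected-component grouping described above can be sketched with a small union-find structure. This is a minimal illustration, not the Actor's actual implementation: the helper names (`normalize_domain`, `match_keys`, `group_duplicates`) are hypothetical, and only the email and domain keys are covered.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Reduce a URL to its bare host, without scheme, 'www.' or path."""
    url = url.strip().lower()
    if "://" not in url:
        url = "//" + url  # lets urlparse treat a bare value as a host
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def match_keys(record: dict) -> list[str]:
    """Build comparable keys for one record (assumed field names)."""
    keys = []
    if record.get("email"):
        keys.append("email:" + record["email"].strip().lower())
    if record.get("website"):
        keys.append("domain:" + normalize_domain(record["website"]))
    return keys


def group_duplicates(records: list[dict]) -> list[list[int]]:
    """Group record indexes that are linked directly or transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    first_seen: dict[str, int] = {}
    for index, record in enumerate(records):
        for key in match_keys(record):
            if key in first_seen:
                # Same key seen before: join the two components.
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    groups: dict[int, list[int]] = {}
    for index in range(len(records)):
        groups.setdefault(find(index), []).append(index)
    return list(groups.values())
```

With three records where A and B share an email and B and C share a domain, `group_duplicates` returns a single group holding all three indexes, which is the transitive behavior the list describes.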

How it works (3 steps)

  1. Point it at your data. Provide one or more Apify dataset IDs from previous scraper runs, public CSV URLs, and/or paste records as inline JSON.
  2. Choose how duplicates are matched and merged. Pick your dedupe keys (email is the default and safest) and a merge strategy.
  3. Run it. Get back one clean, merged record per unique lead in the output dataset, plus a summary of exactly what was combined and removed.

Input example 1 — Merge previous scraper runs (Apify datasets)

{
  "datasetIds": ["aBcD1234efGh5678", "XyZ9876wVuT5432"],
  "dedupeKeys": ["email", "domain"],
  "mergeStrategy": "most_complete"
}

Input example 2 — Merge CSV files

{
  "csvUrls": [
    "https://example.com/google-maps-leads.csv",
    "https://example.com/instagram-leads.csv"
  ],
  "dedupeKeys": ["email", "phone"],
  "normalizePhones": true,
  "defaultCountry": "US"
}

Input example 3 — Paste records directly (inline JSON)

{
  "inlineRecords": [
    { "businessName": "Acme Inc", "email": "Info@Acme.com", "phone": "(415) 555-2671" },
    { "name": "Acme", "emails": ["info@acme.com"], "website": "https://www.acme.com" }
  ],
  "dedupeKeys": ["email", "domain", "name+city"],
  "mergeStrategy": "most_complete",
  "keepSourceInfo": true
}

💡 You can combine all three sources in a single run. Everything is pooled and deduplicated together.
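If you generate run inputs from a script, a small builder can catch typos before a run starts. The sketch below is an assumption-laden helper, not part of the Actor: `build_input` is a hypothetical name, the allowed values are copied from the input options listed on this page, and the "at least one source" check is a convenience added here.

```python
ALLOWED_KEYS = {"email", "domain", "phone", "name+city"}
ALLOWED_STRATEGIES = {"most_complete", "first_seen", "last_seen"}


def build_input(dataset_ids=(), csv_urls=(), inline_records=(),
                dedupe_keys=("email",), merge_strategy="most_complete"):
    """Assemble a run input dict and reject values this page does not list."""
    if not (dataset_ids or csv_urls or inline_records):
        raise ValueError("provide at least one dataset ID, CSV URL, or inline record")
    unknown = set(dedupe_keys) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown dedupe keys: {sorted(unknown)}")
    if merge_strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown merge strategy: {merge_strategy}")
    return {
        "datasetIds": list(dataset_ids),
        "csvUrls": list(csv_urls),
        "inlineRecords": list(inline_records),
        "dedupeKeys": list(dedupe_keys),
        "mergeStrategy": merge_strategy,
    }
```

Passing the result through `json.dumps` gives a payload in the same shape as the input examples above.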

Input options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| datasetIds | string[] | [] | Apify dataset IDs from previous runs. Invalid IDs are skipped with a warning. |
| csvUrls | string[] | [] | Public CSV URLs. Tolerates a BOM, comma, semicolon, tab, or pipe delimiters, and quoted fields. |
| inlineRecords | object[] | [] | Raw records pasted as JSON. |
| dedupeKeys | string[] | ["email"] | Any of email, domain, phone, name+city. Match on ANY selected key. |
| mergeStrategy | string | most_complete | most_complete, first_seen, or last_seen. |
| normalizePhones | boolean | true | Format output phones as E.164 (e.g. +14155552671). |
| defaultCountry | string | US | ISO country code for phones without a prefix. |
| stripPlusAliases | boolean | false | Treat john+tag@x.com and john@x.com as the same lead. |
| keepSourceInfo | boolean | true | Add a _sources array to each merged record. |
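To illustrate what normalizePhones with defaultCountry set to US produces, here is a deliberately simplified sketch that handles only North American numbers. The function name `to_e164_nanp` is hypothetical, and real-world phone parsing needs per-country rules that this example does not attempt.

```python
import re
from typing import Optional


def to_e164_nanp(raw: str) -> Optional[str]:
    """Format a US/Canada number as E.164, or return None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses
    if not digits:
        return None
    if raw.strip().startswith("+"):
        return "+" + digits          # already carries a country prefix
    if len(digits) == 10:
        return "+1" + digits         # assume the default country (US)
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None                      # too short or ambiguous: leave unmatched
```

For example, `to_e164_nanp("(415) 555-2671")` gives `+14155552671`, matching the format shown in the table.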

Output

Each unique lead becomes one merged record in the output dataset. Original fields are preserved; clean canonical name / email / phone / website / city fields are added; merge metadata is prefixed with _.

Example output record

{
  "name": "Acme, Inc.",
  "email": "info@acme.com",
  "phone": "+14155552671",
  "website": "https://www.acme.com/contact",
  "city": "San Francisco",
  "industry": "SaaS",
  "_sources": ["dataset:aBcD1234efGh5678", "csv:example.com/instagram-leads.csv"],
  "_conflicts": { "name": ["Acme HQ"] },
  "_duplicateCount": 3
}
  • _sources — which dataset(s)/file(s) this lead was merged from.
  • _conflicts — alternate values that disagreed (e.g. two different phone numbers), so you never lose data.
  • _duplicateCount — how many input records were combined into this one.
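A field-by-field merge in the spirit of the most_complete strategy might look like the sketch below. It is an illustration under assumptions: `merge_group` and `completeness` are hypothetical names, values are assumed to be normalized already, and ties between equally complete records go to the first one seen.

```python
EMPTY = (None, "", [])


def completeness(record: dict) -> int:
    """Count the fields that carry a real value."""
    return sum(1 for value in record.values() if value not in EMPTY)


def merge_group(records: list[dict], sources: list[str]) -> dict:
    """Merge one group of duplicates into a single lead."""
    base_index = max(range(len(records)), key=lambda i: completeness(records[i]))
    merged = dict(records[base_index])  # start from the fullest record
    conflicts: dict[str, list] = {}
    for index, record in enumerate(records):
        if index == base_index:
            continue
        for field, value in record.items():
            if value in EMPTY:
                continue
            if merged.get(field) in EMPTY:
                merged[field] = value            # fill a gap
            elif merged[field] != value:
                alternates = conflicts.setdefault(field, [])
                if value not in alternates:
                    alternates.append(value)     # keep the losing value visible
    merged["_sources"] = list(dict.fromkeys(sources))  # unique, in first-seen order
    if conflicts:
        merged["_conflicts"] = conflicts
    merged["_duplicateCount"] = len(records)
    return merged
```

Because differing alternates are collected rather than overwritten, the merged record keeps every distinct value that appeared in the group.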

Run summary (key-value store → OUTPUT)

{
  "totalInput": 22000,
  "totalOutput": 15840,
  "duplicatesRemoved": 6160,
  "duplicateRate": 0.28,
  "perSourceCounts": { "dataset:aBcD1234efGh5678": 12000, "csv:example.com/instagram-leads.csv": 10000 },
  "malformedSkipped": 12,
  "runtimeSeconds": 9
}
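The numbers in the summary relate to each other in a simple way, which the sketch below reproduces. It assumes, consistent with the example values, that totalInput is the sum of perSourceCounts and that duplicateRate is rounded to two decimals; `summarize_run` is a hypothetical name.

```python
def summarize_run(per_source_counts: dict, total_output: int) -> dict:
    """Derive the headline summary figures from per-source counts."""
    total_input = sum(per_source_counts.values())
    removed = total_input - total_output
    rate = round(removed / total_input, 2) if total_input else 0.0
    return {
        "totalInput": total_input,
        "totalOutput": total_output,
        "duplicatesRemoved": removed,
        "duplicateRate": rate,
    }
```

Feeding in the example's 12,000 and 10,000 source rows with 15,840 output leads yields 6,160 duplicates removed and a rate of 0.28.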

Works perfectly with the rest of the suite

This Actor is the glue that ties your lead-generation stack together. Run any of these scrapers, then pipe their datasets straight into the Lead List Deduplicator & Merger:

  • Google Maps Lead Generator Pro — local business leads with phone, website and address.
  • Instagram Email Scraper — creator and business emails from Instagram.
  • TikTok Email Scraper — contact emails from TikTok profiles.
  • Facebook Page Lead Scraper — business contact details from Facebook.
  • YouTube Channel Email Scraper — channel and business emails from YouTube.
  • TripAdvisor Leads Scraper — hospitality and venue leads.
  • Website Email Extractor — emails and contact data crawled from any website list.
  • Email Verifier & Enricher Pro — verify and enrich your final list after deduplicating (dedupe first to save credits!).

Recommended workflow: scrape with several Actors → deduplicate & merge here → verify & enrich → import to your CRM or sending tool.


Pricing

This Actor uses simple, transparent pay-per-event pricing:

  • $2.00 per 1,000 input records processed ($0.002 per record).

You are charged per input record ingested — the rows you feed in — as they are processed, so partial runs only bill for what was actually handled. Malformed/skipped rows are never charged. There is no per-run start fee. Deduplication typically shrinks your list, so every downstream tool (verification, enrichment, sending) costs less afterwards — this Actor usually pays for itself on the very first run.
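As a quick way to estimate a run's cost from the rate above, the following sketch multiplies billable records by the per-thousand price. The function name `estimate_cost` is hypothetical, and the caller is assumed to pass only records that were actually processed (skipped rows excluded).

```python
from decimal import Decimal

PRICE_PER_THOUSAND = Decimal("2.00")  # USD, from the pricing stated above


def estimate_cost(billable_records: int) -> Decimal:
    """Return the estimated charge in USD for the given record count."""
    if billable_records < 0:
        raise ValueError("record count cannot be negative")
    return PRICE_PER_THOUSAND * billable_records / 1000
```

A 22,000-record run, for instance, comes to $44.00, and a single record to $0.002.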


FAQ

What counts as a duplicate? Two records are duplicates if they match on any of your selected dedupe keys — same email, same website domain, same phone number, or same name + city. Matching is transitive: if A matches B and B matches C, all three are merged into one lead.

Will it merge john.doe@gmail.com and johndoe@gmail.com? No. By design we do not apply Gmail-style dot-folding, because for business (B2B) addresses those can be different real inboxes. Emails are matched after lowercasing, trimming and mailto: removal only. (You can optionally treat +tag aliases as the same address with stripPlusAliases.)
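The matching rule in this answer can be sketched as a short normalizer. This is an illustrative reading of the description, not the Actor's code; `normalize_email` and its `strip_plus_aliases` parameter are hypothetical names mirroring the stripPlusAliases option.

```python
def normalize_email(raw: str, strip_plus_aliases: bool = False) -> str:
    """Lowercase, trim, and drop mailto:, with optional +tag removal."""
    email = raw.strip().lower()
    if email.startswith("mailto:"):
        email = email[len("mailto:"):].strip()
    if strip_plus_aliases and "@" in email:
        local, _, domain = email.rpartition("@")
        local = local.split("+", 1)[0]  # john+tag -> john
        email = f"{local}@{domain}"
    # Dots in the local part are left alone on purpose (no Gmail-style folding).
    return email
```

With this rule, john.doe@gmail.com and johndoe@gmail.com stay distinct, while john+tag@x.com collapses to john@x.com only when the alias option is on.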

How does field mapping work across different scrapers? The Actor recognizes common field-name variants automatically — for example email / emails[0] / businessEmail, phone / phoneNumber / phones[0], website / url / websiteUrl, and name / title / businessName / channelName. Unknown fields are passed through to the output untouched, so nothing is lost.
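One plausible way to resolve field-name variants is a lookup table tried in order, as sketched below. The table lists only the variants named in this answer; the priority order, the `FIELD_VARIANTS` name, and the helper functions are assumptions made for illustration.

```python
FIELD_VARIANTS = {
    "email": ["email", "emails", "businessEmail"],
    "phone": ["phone", "phoneNumber", "phones"],
    "website": ["website", "url", "websiteUrl"],
    "name": ["name", "title", "businessName", "channelName"],
}


def first_value(value):
    """Unwrap lists to their first element and treat blank strings as missing."""
    if isinstance(value, list):
        return first_value(value[0]) if value else None
    if isinstance(value, str):
        return value.strip() or None
    return value


def canonical_fields(record: dict) -> dict:
    """Pick the first non-empty variant for each canonical field."""
    result = {}
    for canonical, variants in FIELD_VARIANTS.items():
        for variant in variants:
            value = first_value(record.get(variant))
            if value is not None:
                result[canonical] = value
                break
    return result
```

Applied to a record such as `{"name": "Acme", "emails": ["info@acme.com"]}`, this returns canonical `name` and `email` values, while the original record remains available for pass-through of unknown fields.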

What happens to conflicting values when records are merged? The winning value (per your merge strategy) becomes the field value, and any differing alternates are stored in a _conflicts object on the record. You keep full visibility into every value that was seen.

Is there a file size or record limit? The Actor streams data in batches and keeps only one record per unique lead in memory, so it scales to large lists. For very large jobs (hundreds of thousands of records) simply bump the memory in the run options. CSV inputs are fetched whole, so for massive files prefer pushing them through an Apify dataset.

What if a dataset ID is wrong or a CSV fails to download? The Actor warns and continues with the remaining sources. It only fails the run if zero valid records were found across everything — with a clear, actionable message.

What happens on a partial run (timeout, stop, charge limit)? Records are processed and billed incrementally, and output is written at the end of processing, so a partial run bills only for the records it actually handled.

Do I need proxies or any special setup? No. This is a pure data tool — no scraping, no proxies, no anti-bot concerns. Point it at your data and run.


Get started

  1. Add this Actor to your account.
  2. Paste your dataset IDs, CSV URLs, or inline records.
  3. Pick your dedupe keys and merge strategy.
  4. Run, then download your clean list as CSV, JSON or Excel.

One clean list. No duplicates. Ready for outreach.