# Lead List Deduplicator & Normalizer (`webdata_labs/lead-list-deduplicator`) Actor

\[💵 $0.05 / 1K] Clean messy B2B lead lists into CRM-ready company/contact records with duplicate clusters, confidence scores, match reasons, normalized domains, emails, and phones.

- **URL**: https://apify.com/webdata_labs/lead-list-deduplicator.md
- **Developed by:** [Open Web Team](https://apify.com/webdata_labs) (community)
- **Categories:** Lead generation, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

from $0.05 / 1,000 results

This Actor is paid per event and usage. You are charged both a fixed price for specific events and for Apify platform usage.
This Actor supports Apify Store discounts, so the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
"Actor" is written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Lead List Deduplicator & Normalizer - CRM-Ready Leads, Not Messy Dumps

Turn messy scraped B2B lead lists into canonical, CRM-ready company and contact records, not duplicate-filled dumps.

This Actor takes inline JSON records or an Apify dataset ID, normalizes common lead fields, groups duplicates, and outputs one canonical row per lead/company cluster with confidence scores, match reasons, source row IDs, and warnings.
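
To make the idea of grouping duplicates into clusters concrete, here is a minimal Python sketch that links records sharing an exact domain or email using a union-find structure. It is an illustration under assumptions, not the Actor's actual implementation: the function name `cluster_by_shared_keys` and the choice of keys are hypothetical, and only the `clusterId`, `clusterSize`, and `sourceRowIds` field names come from this README.

```python
from collections import defaultdict


def cluster_by_shared_keys(records):
    """Group records that share an exact domain or email (illustrative only)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as root so cluster order follows input order.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (key name, value) -> index of the first record carrying it
    for idx, record in enumerate(records):
        for key in ("domain", "email"):
            value = record.get(key)
            if not value:
                continue
            token = (key, value.lower())
            if token in first_seen:
                union(idx, first_seen[token])
            else:
                first_seen[token] = idx

    groups = defaultdict(list)
    for idx, record in enumerate(records):
        groups[find(idx)].append(record["id"])

    return [
        {"clusterId": f"cluster_{n:04d}", "clusterSize": len(ids), "sourceRowIds": ids}
        for n, ids in enumerate(groups.values(), start=1)
    ]


records = [
    {"id": "1", "domain": "acme.com", "email": "sales@acme.com"},
    {"id": "2", "domain": "acme.com"},
    {"id": "3", "domain": "betalabs.io", "email": "hello@betalabs.io"},
]
clusters = cluster_by_shared_keys(records)
```

In this sketch, rows `1` and `2` end up in one cluster because they share a domain, while row `3` stays on its own.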

Use it after Google Maps scrapers, directory scrapers, website contact scrapers, exhibitor-list scrapers, Apollo-style lead exports, or any workflow where several sources produce overlapping leads.

### Launch Pricing

Launch pricing is currently **$0.05 per 1,000 cleaned rows**.

The launch version supports up to **5,000 input records per run**. Larger datasets can be processed in batches while high-volume matching is optimized.
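
One way to process a larger list in batches is to split it client-side into chunks that respect the 5,000-record cap and submit one run per chunk. The sketch below only builds the input objects; the helper names `chunk_records` and `build_batch_inputs` are hypothetical. Note that each run sees only its own batch, so duplicates that land in different batches would presumably not be merged with each other.

```python
MAX_RECORDS_PER_RUN = 5000  # cap stated in this README


def chunk_records(records, batch_size=MAX_RECORDS_PER_RUN):
    """Yield consecutive slices of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def build_batch_inputs(records, dedup_mode="balanced", batch_size=MAX_RECORDS_PER_RUN):
    """Build one Actor input object per batch (hypothetical helper)."""
    return [
        {"dedupMode": dedup_mode, "records": batch}
        for batch in chunk_records(records, batch_size)
    ]


# 12,001 placeholder records split into batches of 5,000, 5,000, and 2,001.
all_records = [{"id": str(i)} for i in range(12001)]
batch_inputs = build_batch_inputs(all_records)
```

Sorting the list by domain before chunking may keep more same-domain rows in the same batch, although a group can still straddle a batch boundary.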

### Quick Preview

| Messy input | Clean output |
|---|---|
| `Acme Inc`, `ACME LLC`, `https://www.acme.com`, `sales@acme.com` | one canonical `acme.com` cluster |
| duplicate domains, emails, phones, or similar company names | `clusterId`, `clusterSize`, `mergeConfidence`, `matchReasons` |
| records from multiple Apify datasets or CSV/JSON imports | CRM-ready rows with normalized company, domain, email, and phone |

### Why Use This Actor

- Merge overlapping exports from multiple scrapers.
- Remove duplicate companies, domains, emails, and phone numbers before CRM import.
- Normalize company names, domains, emails, and phones.
- Keep source row IDs so every merge is auditable.
- Get confidence scores and match reasons instead of a black-box cleanup.
- Use deterministic rules first, so costs stay predictable.
- No browser, proxies, or external enrichment APIs.

### Common Use Cases

- Merge lead lists from several Apify scrapers.
- Clean a CSV before importing it into HubSpot, Pipedrive, Salesforce, Clay, Instantly, Smartlead, or Airtable.
- Remove duplicate outreach targets before spending credits on email verification or enrichment.
- Create a canonical company list from multiple scraped directories.
- Audit which rows were merged and why.

### Input Example

```json
{
  "dedupMode": "balanced",
  "records": [
    {
      "id": "1",
      "company": "Acme Inc",
      "website": "https://www.acme.com",
      "email": "sales@acme.com"
    },
    {
      "id": "2",
      "companyName": "ACME LLC",
      "domain": "acme.com",
      "phone": "(415) 555-2671"
    }
  ]
}
````

You can also provide an Apify `datasetId` instead of inline `records`.

If no input is provided, the Actor runs with sample records so you can test the output immediately.
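
As a rough sketch, an input object for either source could be assembled like this in Python. The helper `build_run_input` is hypothetical; the field names, allowed modes, and the `maxRecords` range of 1 to 5,000 come from the input schema in this document. Rejecting inputs that set both `records` and `datasetId` is a choice made by this sketch to avoid ambiguity, not a documented rule of the Actor.

```python
ALLOWED_MODES = {"conservative", "balanced", "aggressive"}
MAX_RECORDS_CAP = 5000


def build_run_input(records=None, dataset_id=None, dedup_mode="balanced",
                    max_records=MAX_RECORDS_CAP):
    """Assemble and sanity-check an input object (hypothetical helper)."""
    if records is not None and dataset_id is not None:
        raise ValueError("pass either records or dataset_id, not both")
    if dedup_mode not in ALLOWED_MODES:
        raise ValueError(f"dedup_mode must be one of {sorted(ALLOWED_MODES)}")
    if not 1 <= max_records <= MAX_RECORDS_CAP:
        raise ValueError(f"max_records must be between 1 and {MAX_RECORDS_CAP}")

    run_input = {"dedupMode": dedup_mode, "maxRecords": max_records}
    if dataset_id is not None:
        run_input["datasetId"] = dataset_id
    elif records is not None:
        run_input["records"] = records
    return run_input


run_input = build_run_input(dataset_id="<YOUR_DATASET_ID>", dedup_mode="conservative")

# The result can be passed to the Python client from the API section:
# client.actor("webdata_labs/lead-list-deduplicator").call(run_input=run_input)
```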

### Output Example

```json
{
  "recordType": "canonicalLead",
  "clusterId": "cluster_0001",
  "clusterSize": 2,
  "mergeDecision": "merged",
  "mergeConfidence": 0.9,
  "matchReasons": ["same_domain", "similar_company"],
  "sourceRowIds": ["1", "2"],
  "canonicalCompanyName": "Acme Inc",
  "normalizedCompanyName": "acme",
  "normalizedDomain": "acme.com",
  "normalizedEmail": "sales@acme.com",
  "normalizedPhone": "4155552671",
  "warnings": []
}
```
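
The normalized values in the example above could be produced by rules along the following lines. This is a minimal sketch, not the Actor's code: the function names, the list of legal suffixes, and the exact rules are assumptions chosen so that the sample inputs yield the sample outputs shown in this README.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of legal suffixes to drop from company names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def normalize_domain(value):
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    value = value.strip().lower()
    if "//" not in value:
        value = "//" + value  # lets urlsplit treat a bare domain as a host
    host = urlsplit(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def normalize_email(value):
    """Trim whitespace and lowercase the address."""
    return value.strip().lower()


def normalize_phone(value):
    """Keep digits only; no country-specific handling."""
    return re.sub(r"\D", "", value)


def normalize_company(value):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", " ", value.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


domain = normalize_domain("https://www.acme.com")
phone = normalize_phone("(415) 555-2671")
company = normalize_company("ACME LLC")
```

With these rules, both `Acme Inc` and `ACME LLC` reduce to `acme`, which is what allows the two sample rows to be compared by company name.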

### Deduplication Modes

| Mode | Best for | Behavior |
|---|---|---|
| `conservative` | Avoiding false merges | Requires exact email, phone, or domain match |
| `balanced` | Most lead lists | Uses exact email/phone/domain plus strong company-name similarity |
| `aggressive` | Very messy lists | Uses looser company-name matching; review warnings before importing |
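
The difference between the modes can be pictured as a similarity threshold on normalized company names that is disabled, strict, or loose. The sketch below uses Python's `difflib.SequenceMatcher` as the similarity measure. The threshold values, the function name `match_reasons`, and the `same_phone` label are hypothetical; the README does not state which similarity measure or cutoffs the Actor uses.

```python
from difflib import SequenceMatcher

# Hypothetical cutoffs: None disables name matching entirely.
NAME_SIMILARITY_THRESHOLD = {
    "conservative": None,
    "balanced": 0.9,
    "aggressive": 0.75,
}

EXACT_FIELDS = (
    ("email", "same_email"),
    ("phone", "same_phone"),  # label assumed by analogy with the documented ones
    ("domain", "same_domain"),
)


def match_reasons(a, b, mode="balanced"):
    """Return the reasons two normalized records would match in a given mode."""
    reasons = [
        reason
        for field, reason in EXACT_FIELDS
        if a.get(field) and a.get(field) == b.get(field)
    ]
    threshold = NAME_SIMILARITY_THRESHOLD[mode]
    if threshold is not None and a.get("company") and b.get("company"):
        ratio = SequenceMatcher(None, a["company"], b["company"]).ratio()
        if ratio >= threshold:
            reasons.append("similar_company")
    return reasons


left = {"company": "acme robotics"}
right = {"company": "acme robotic"}
conservative_reasons = match_reasons(left, right, "conservative")
balanced_reasons = match_reasons(left, right, "balanced")
```

Here the two names differ by one character, so they match in `balanced` mode but not in `conservative` mode, where only exact email, phone, or domain matches count.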

### Dataset Views

| View | Best for |
|---|---|
| `Canonical` | CRM-ready rows after deduplication |
| `Duplicate clusters` | Auditing source rows, match reasons, and confidence |

### Output Fields

| Field | Meaning |
|---|---|
| `clusterId` | Stable cluster identifier for the canonical row |
| `clusterSize` | Number of source rows merged into the canonical row |
| `mergeDecision` | `unique`, `merged`, or `ambiguous` |
| `mergeConfidence` | Confidence score from 0 to 1 |
| `matchReasons` | Why records matched, such as `same_email`, `same_domain`, or `similar_company` |
| `sourceRowIds` | Original row IDs or indexes used in the merge |
| `normalizedDomain` | Clean domain value such as `acme.com` |
| `warnings` | Flags such as `low_confidence_merge` or `missing_domain_or_email` |
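
These fields make it straightforward to separate rows that look safe to import from rows that deserve a manual check. The following Python sketch shows one possible triage step over downloaded dataset items. The `triage` function and the `0.8` confidence cutoff are assumptions for illustration, not recommendations from the Actor's author.

```python
def triage(items, min_confidence=0.8):
    """Split output rows into (ready, review) lists using the documented fields."""
    ready, review = [], []
    for item in items:
        decision = item.get("mergeDecision")
        low_confidence_merge = (
            decision == "merged" and item.get("mergeConfidence", 0.0) < min_confidence
        )
        needs_review = (
            decision == "ambiguous"
            or low_confidence_merge
            or bool(item.get("warnings"))
        )
        (review if needs_review else ready).append(item)
    return ready, review


items = [
    {"clusterId": "cluster_0001", "mergeDecision": "merged",
     "mergeConfidence": 0.9, "warnings": []},
    {"clusterId": "cluster_0002", "mergeDecision": "ambiguous",
     "mergeConfidence": 0.6, "warnings": ["low_confidence_merge"]},
    {"clusterId": "cluster_0003", "mergeDecision": "unique", "warnings": []},
]
ready, review = triage(items)
```

In this example, `cluster_0001` and `cluster_0003` land in `ready`, and `cluster_0002` lands in `review`.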

### Limits and Caveats

- This MVP uses deterministic rules and fuzzy string similarity, not paid LLM adjudication.
- Review `ambiguous` rows before importing them into a CRM.
- Email/phone/domain normalization is conservative and may not cover every country-specific format.
- The Actor does not scrape or enrich missing contact data; it cleans the records you provide.
- It does not verify email deliverability or MX records in the first version.
- Current runs are capped at 5,000 input records while the deduplication engine is optimized for larger files.

### Pricing

This Actor is designed for pay-per-row pricing. You pay for cleaned output rows plus Apify platform usage.

Because it does not launch a browser or call external enrichment APIs, runs should stay inexpensive for bulk cleanup.
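
For a rough budget, the per-row part of the bill can be estimated from the launch price of $0.05 per 1,000 cleaned rows. The sketch below is a back-of-the-envelope calculation only: the function name is hypothetical, it assumes charges scale linearly with output rows, and it ignores Apify platform usage and any plan discounts.

```python
from decimal import Decimal

PRICE_PER_1000_ROWS = Decimal("0.05")  # launch price stated in this README


def estimate_row_charge(output_rows, price_per_1000=PRICE_PER_1000_ROWS):
    """Estimate the per-row charge in USD, excluding platform usage."""
    if output_rows < 0:
        raise ValueError("output_rows must not be negative")
    return price_per_1000 * output_rows / 1000


# 20,000 cleaned rows at the launch price come to 1 USD before platform usage.
charge = estimate_row_charge(20000)
```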

# Actor input Schema

## `records` (type: `array`):

Optional inline lead/company records. Use this for quick tests or small lists.

## `datasetId` (type: `string`):

Optional Apify dataset ID to read records from. If provided, records are loaded from this dataset.

## `dedupMode` (type: `string`):

Conservative prefers exact matches. Balanced also groups likely company/domain/name matches. Aggressive uses looser fuzzy matching.

## `companyFields` (type: `array`):

Field names to inspect for company names.

## `domainFields` (type: `array`):

Field names to inspect for website or domain values.

## `emailFields` (type: `array`):

Field names to inspect for email values.

## `phoneFields` (type: `array`):

Field names to inspect for phone values.

## `includeOriginalRecord` (type: `boolean`):

Include the selected canonical source record in each output row.

## `maxRecords` (type: `integer`):

Maximum records to process from inline input or dataset.

## Actor input object example

```json
{
  "records": [
    {
      "id": "sample-1",
      "company": "Acme Inc",
      "website": "https://www.acme.com",
      "email": "sales@acme.com"
    },
    {
      "id": "sample-2",
      "companyName": "ACME LLC",
      "domain": "acme.com",
      "phone": "(415) 555-2671"
    },
    {
      "id": "sample-3",
      "company": "Beta Labs",
      "website": "betalabs.io",
      "email": "hello@betalabs.io"
    }
  ],
  "dedupMode": "balanced",
  "companyFields": [
    "company",
    "companyName",
    "organization",
    "organizationName",
    "businessName",
    "name"
  ],
  "domainFields": [
    "domain",
    "website",
    "url",
    "companyWebsite",
    "organizationDomain"
  ],
  "emailFields": [
    "email",
    "businessEmail",
    "workEmail",
    "contactEmail"
  ],
  "phoneFields": [
    "phone",
    "phoneNumber",
    "mobile",
    "businessPhone"
  ],
  "includeOriginalRecord": false,
  "maxRecords": 5000
}
```
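
The `companyFields`, `domainFields`, `emailFields`, and `phoneFields` lists let records from different sources use different column names. A plausible reading is that the first listed field with a non-empty value wins. The Python sketch below illustrates that reading only: the priority-by-list-order rule and the `first_non_empty` helper are assumptions, since this document does not specify how the Actor resolves multiple candidates.

```python
DEFAULT_COMPANY_FIELDS = [
    "company",
    "companyName",
    "organization",
    "organizationName",
    "businessName",
    "name",
]


def first_non_empty(record, field_names):
    """Return the first non-blank string value among the candidate fields."""
    for name in field_names:
        value = record.get(name)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


# "companyName" is found before "name", so the company wins over the contact name.
company = first_non_empty(
    {"companyName": "ACME LLC", "name": "Jane Doe"}, DEFAULT_COMPANY_FIELDS
)
```

If a source uses a column such as `employer`, adding it to `companyFields` in the input would be the way to have it inspected.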

# Actor output Schema

## `canonical` (type: `string`):

No description

## `duplicates` (type: `string`):

No description

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("webdata_labs/lead-list-deduplicator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("webdata_labs/lead-list-deduplicator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call webdata_labs/lead-list-deduplicator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=webdata_labs/lead-list-deduplicator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Lead List Deduplicator & Normalizer",
        "description": "[💵 $0.05 / 1K] Clean messy B2B lead lists into CRM-ready company/contact records with duplicate clusters, confidence scores, match reasons, normalized domains, emails, and phones.",
        "version": "0.1",
        "x-build-id": "8bnOC3cG1slpl8kEQ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/webdata_labs~lead-list-deduplicator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-webdata_labs-lead-list-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/webdata_labs~lead-list-deduplicator/runs": {
            "post": {
                "operationId": "runs-sync-webdata_labs-lead-list-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/webdata_labs~lead-list-deduplicator/run-sync": {
            "post": {
                "operationId": "run-sync-webdata_labs-lead-list-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "records": {
                        "title": "Inline records",
                        "type": "array",
                        "description": "Optional inline lead/company records. Use this for quick tests or small lists.",
                        "items": {
                            "type": "object"
                        },
                        "default": [
                            {
                                "id": "sample-1",
                                "company": "Acme Inc",
                                "website": "https://www.acme.com",
                                "email": "sales@acme.com"
                            },
                            {
                                "id": "sample-2",
                                "companyName": "ACME LLC",
                                "domain": "acme.com",
                                "phone": "(415) 555-2671"
                            },
                            {
                                "id": "sample-3",
                                "company": "Beta Labs",
                                "website": "betalabs.io",
                                "email": "hello@betalabs.io"
                            }
                        ]
                    },
                    "datasetId": {
                        "title": "Apify dataset ID",
                        "type": "string",
                        "description": "Optional Apify dataset ID to read records from. If provided, records are loaded from this dataset."
                    },
                    "dedupMode": {
                        "title": "Deduplication strictness",
                        "enum": [
                            "conservative",
                            "balanced",
                            "aggressive"
                        ],
                        "type": "string",
                        "description": "Conservative prefers exact matches. Balanced also groups likely company/domain/name matches. Aggressive uses looser fuzzy matching.",
                        "default": "balanced"
                    },
                    "companyFields": {
                        "title": "Company name fields",
                        "type": "array",
                        "description": "Field names to inspect for company names.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "company",
                            "companyName",
                            "organization",
                            "organizationName",
                            "businessName",
                            "name"
                        ]
                    },
                    "domainFields": {
                        "title": "Domain / website fields",
                        "type": "array",
                        "description": "Field names to inspect for website or domain values.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "domain",
                            "website",
                            "url",
                            "companyWebsite",
                            "organizationDomain"
                        ]
                    },
                    "emailFields": {
                        "title": "Email fields",
                        "type": "array",
                        "description": "Field names to inspect for email values.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "email",
                            "businessEmail",
                            "workEmail",
                            "contactEmail"
                        ]
                    },
                    "phoneFields": {
                        "title": "Phone fields",
                        "type": "array",
                        "description": "Field names to inspect for phone values.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "phone",
                            "phoneNumber",
                            "mobile",
                            "businessPhone"
                        ]
                    },
                    "includeOriginalRecord": {
                        "title": "Include original record",
                        "type": "boolean",
                        "description": "Include the selected canonical source record in each output row.",
                        "default": false
                    },
                    "maxRecords": {
                        "title": "Maximum records",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Maximum records to process from inline input or dataset.",
                        "default": 5000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
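
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be targeted with Python's standard library. This sketch only builds the request object and does not send it. The helper name `build_sync_request` is hypothetical, and the token is passed as a query parameter because that is how the specification declares it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/webdata_labs~lead-list-deduplicator/run-sync-get-dataset-items"


def build_sync_request(token, run_input):
    """Build a POST request that runs the Actor and returns dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {
        "dedupMode": "conservative",
        "records": [{"id": "1", "company": "Acme Inc", "domain": "acme.com"}],
    },
)

# To send it, pass the request to urllib.request.urlopen and parse the JSON body.
# The specification documents only a 200 "OK" response for this path, so add your
# own error and timeout handling.
```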
