Superclean Dedupe

Pricing

from $0.35 / 1,000 records processed

Deduplicate lead records using fuzzy matching. Exact email/phone matching, Jaro-Winkler for names, token-set ratio for companies. Smart field auto-detection. Input: array of record objects. Output: clustered records with isCanonical, matchScore, matchReasons. Batch or instant pair API.

Developer

Superlative

Maintained by Community

Deduplicate lead records using fuzzy matching. Find and merge duplicates across company names, person names, emails, phone numbers, and more.

What does Superclean Dedupe do?

Superclean Dedupe finds duplicate records in your lead data using multi-field fuzzy matching — not just exact string comparison.

  • Exact matching — Normalized email, phone, and domain matching catches obvious duplicates
  • Fuzzy name matching — Jaro-Winkler algorithm matches "Jim Smith" with "James Smith", "McDonald" with "Mcdonald"
  • Fuzzy company matching — Token-set ratio matches "ACME Corp" with "Acme Corporation", handles word-order differences
  • Smart auto-detection — Automatically detects which fields to match on (email, name, company, phone, domain, etc.)
  • Transitive clustering — If A matches B and B matches C, all three are grouped together
  • Canonical selection — Picks the most complete record as the "winner" in each duplicate cluster
  • Instant pair comparison — Sub-second comparison of two records via Standby HTTP API
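For readers who want to see what Jaro-Winkler scoring involves, the following is a minimal sketch of the textbook algorithm in Python. It assumes the standard Winkler parameters (prefix scale 0.1, common prefix capped at 4 characters) and a hypothetical `name_similarity` wrapper that lowercases its inputs. The Actor's own implementation, normalization, and any nickname handling are not documented here and may differ.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def name_similarity(a: str, b: str) -> float:
    """Hypothetical wrapper: case-insensitive comparison of two names."""
    return jaro_winkler(a.strip().lower(), b.strip().lower())


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(name_similarity("McDonald", "Mcdonald"))
```

The prefix boost is why Jaro-Winkler suits short strings such as personal names: two names that start the same way score higher than two names with the same number of differing characters elsewhere.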

Use with AI Agents

Available as an MCP tool via the Apify MCP Server.

| Property | Value |
| --- | --- |
| Actor ID | superlativetech/superclean-dedupe |
| Standby URL | https://superlativetech--superclean-dedupe.apify.actor |
| Input | records (object[]) or a + b (objects, Standby pair comparison) |
| Output | {id, record, clusterId, clusterSize, isCanonical, duplicateOf, matchScore, matchReasons, confidence} per record |
| Pricing | $0.50 per 1,000 records |
| Idempotent | Yes: same input always produces same output |

Output schema

{ "id": 1, "record": {"name": "Jim Smith", "email": "jim@acme.com"}, "clusterId": "c_1", "clusterSize": 2, "isCanonical": true, "duplicateOf": null, "matchScore": null, "matchReasons": [], "confidence": 1 }

Pipeline composability

This actor works in lead enrichment pipelines:

  1. Scrape → 2. Clean (Superclean actors) → 3. Dedupe (this actor) → 4. Enrich (DNS/WHOIS) → 5. Score (ICP Scorer)

Pro tip: Clean names and companies with Superclean actors first, then dedupe. Normalized values produce much better match results.

Standby (instant pair comparison)

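GET https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a={"name":"Jim Smith","email":"jim@acme.com"}&b={"name":"James Smith","email":"jim@acme.com"}

The JSON is shown unencoded for readability. In a real request, the a and b values must be URL-encoded.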

Why deduplicate lead data?

Your lead data comes from many sources with duplicate problems:

  • Same person scraped from LinkedIn, Apollo, and a website — three records
  • "ACME Corp" from one source, "Acme Corporation, Inc." from another
  • "jim@acme.com" and "jim+linkedin@acme.com" — same person, different emails
  • CRM imports from multiple campaigns with overlapping prospects

Duplicate records mean:

  • Wasted outreach — Same prospect gets the same email twice
  • Inflated metrics — Pipeline looks bigger than it is
  • CRM pollution — Duplicate contacts create confusion for sales reps
  • Higher costs — Enrichment and scoring costs multiply with duplicates

How to use Superclean Dedupe

  1. Paste your lead records as JSON into the input field
  2. Click Start and download your deduplicated results
  3. Filter to isCanonical === true to get the deduplicated set

Each record is compared against potential matches using multi-field fuzzy scoring, then grouped into clusters of duplicates.

How matching works

Smart field auto-detection

If you don't specify matchFields, the actor automatically detects fields from your records:

| Field | Method | Weight | Notes |
| --- | --- | --- | --- |
| email | Exact (normalized) | 1.0 | Strips plus addressing, Gmail dots |
| phone | Exact (digits only) | 0.9 | Strips all formatting |
| name / fullName | Jaro-Winkler | 0.8 | Best for short strings |
| firstName / lastName | Jaro-Winkler | 0.5 each | Combined when both present |
| company / companyName | Token-set ratio | 0.7 | Handles word-order differences |
| domain / website | Exact (extracted) | 0.7 | Ignores free email domains |
| city | Exact | 0.3 | Supplementary signal |
| title / jobTitle | Token-set ratio | 0.3 | Weak signal, helps disambiguation |
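To make the "Exact (normalized)" rows concrete, here is a small Python sketch of the kind of normalization the table describes: stripping plus addressing and Gmail dots from emails, and reducing phone numbers to digits. The function names and the set of domains treated as Gmail are assumptions for illustration; the Actor's actual rules may be broader or narrower.

```python
import re

# Assumption: only these domains get the "dots are ignored" treatment.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email(email: str) -> str:
    """Lowercase, drop any +tag, and drop dots in Gmail local parts."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


def normalize_phone(phone: str) -> str:
    """Keep digits only, so formatting differences do not matter."""
    return re.sub(r"\D", "", phone)


print(normalize_email("Jim+LinkedIn@Acme.com"))   # jim@acme.com
print(normalize_email("j.i.m+news@gmail.com"))    # jim@gmail.com
print(normalize_phone("+1 (555) 010-2000"))       # 15550102000
```

Two records whose normalized values are equal would count as an exact match on that field, which is how "jim@acme.com" and "jim+linkedin@acme.com" can be treated as the same address.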

Custom match fields

Override auto-detection with explicit field configuration:

{
  "matchFields": [
    { "field": "email", "method": "exact", "weight": 1.0 },
    { "field": "name", "method": "jaro-winkler", "weight": 0.8 },
    { "field": "company", "method": "token-set", "weight": 0.6 }
  ]
}

Available methods: exact, jaro-winkler, levenshtein, token-set
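As an illustration of the token-set method, the sketch below follows the common token-set ratio formulation (compare the shared tokens against each side's full sorted token string and keep the best score), using Python's standard difflib for the underlying string ratio. This is an approximation written for this page, not the Actor's code, and its scores should not be expected to match the Actor's output exactly.

```python
import re
from difflib import SequenceMatcher


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, with punctuation dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Order-insensitive similarity in the range 0..1."""
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    full_a = f"{shared} {only_a}".strip()
    full_b = f"{shared} {only_b}".strip()
    return max(_ratio(shared, full_a), _ratio(shared, full_b), _ratio(full_a, full_b))


print(token_set_ratio("Smith Consulting Group", "Group Smith Consulting"))
print(token_set_ratio("ACME Corp", "Acme Corporation"))
print(token_set_ratio("ACME Corp", "Globex"))
```

Because tokens are sorted before comparison, reordered company names score as identical, while unrelated names stay near zero.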

Threshold

The threshold parameter (default 0.85) controls how similar two records must be to be considered duplicates:

  • 0.95 — Very strict. Only near-identical records merge. Minimizes false positives.
  • 0.85 — Balanced default. Catches most real duplicates with few false positives.
  • 0.70 — Aggressive. Catches more duplicates but may over-merge.
  • 0.50 — Very aggressive. Only use for exploration or when you'll manually review.

When in doubt, start strict (0.90) and lower if you're missing duplicates.
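The effect of the threshold is easiest to see with numbers. The sketch below assumes the composite score is a weighted average of per-field similarities; the exact formula the Actor uses is not documented here, so `composite_score` is a hypothetical stand-in and will not necessarily reproduce the matchScore values shown in the examples on this page.

```python
def composite_score(similarities: dict, weights: dict) -> float:
    """Assumed formula: weighted average over fields that have both a
    similarity value and a weight."""
    shared = [field for field in similarities if field in weights]
    total_weight = sum(weights[field] for field in shared)
    if total_weight == 0:
        return 0.0
    return sum(weights[field] * similarities[field] for field in shared) / total_weight


# Weights taken from the auto-detection table; similarities are illustrative.
weights = {"email": 1.0, "name": 0.8, "company": 0.7}
similarities = {"email": 1.0, "name": 0.87, "company": 0.91}

score = composite_score(similarities, weights)
for threshold in (0.95, 0.85, 0.70):
    print(threshold, score >= threshold)
```

Under this assumed formula the pair scores about 0.93, so it would be merged at the 0.85 default but kept separate at 0.95.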

How many records can you deduplicate?

The actor handles up to 100,000 records per run. Multi-pass blocking avoids the O(n^2) comparison problem — for 10,000 records, only ~100K-300K comparisons are needed instead of 50 million.
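The idea behind multi-pass blocking can be sketched in a few lines of Python: each pass groups records by a cheap key, and only records that share a key in at least one pass become candidate pairs. The two blocking keys used here (lowercased email, and the first three letters of the last name token) are assumptions chosen for illustration; the Actor's actual blocking keys are not documented here.

```python
from collections import defaultdict
from itertools import combinations


def email_key(record: dict):
    """Blocking key for pass 1: the lowercased email, if any."""
    email = record.get("email", "").strip().lower()
    return email or None


def name_key(record: dict):
    """Blocking key for pass 2: first 3 letters of the last name token."""
    tokens = record.get("name", "").lower().split()
    return tokens[-1][:3] if tokens else None


def candidate_pairs(records: list, key_funcs: list) -> set:
    """Union of within-block index pairs across all blocking passes."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for index, record in enumerate(records):
            key = key_func(record)
            if key:
                blocks[key].append(index)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


records = [
    {"name": "Jim Smith", "email": "jim@acme.com"},
    {"name": "James Smith", "email": "jim@acme.com"},
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "J. Doe", "email": "jane@example.com"},
    {"name": "Bob Wilson", "email": "bob@other.com"},
]

pairs = candidate_pairs(records, [email_key, name_key])
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

For n records, comparing every pair costs n(n-1)/2 comparisons, which is 49,995,000 for n = 10,000. Blocking only scores the pairs that land in the same block.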

For best performance, clean your data with Superclean actors first. Normalized names and companies produce dramatically better match results.

How much will it cost you?

| Records | Cost |
| --- | --- |
| 1,000 | $0.50 |
| 10,000 | $5.00 |
| 100,000 | $50.00 |

Volume discounts apply automatically:

  • Bronze (100+ records): $0.45/1K
  • Silver (1,000+ records): $0.40/1K
  • Gold (10,000+ records): $0.35/1K
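For budgeting, the arithmetic is simply records divided by 1,000, multiplied by the per-1,000 rate. The Python helper below is a hypothetical convenience, not part of any Apify library. It takes the rate as a parameter because this page does not spell out how a discount tier is applied to a given run.

```python
BASE_RATE_PER_1K = 0.50  # base price per 1,000 records, as listed on this page


def estimate_cost(record_count: int, rate_per_1k: float = BASE_RATE_PER_1K) -> float:
    """Estimated cost in USD for one run at a flat per-1,000 rate."""
    return round(record_count / 1000 * rate_per_1k, 2)


for count in (1_000, 10_000, 100_000):
    print(count, estimate_cost(count))

# The same run priced at the Gold rate quoted on this page.
print(estimate_cost(100_000, rate_per_1k=0.35))
```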

Input parameters

| Parameter | Type | Description |
| --- | --- | --- |
| records | array | Array of record objects to deduplicate |
| record | object | Single record (API shorthand). If both record and records are provided, record is prepended to the list |
| threshold | number | Similarity threshold for dedup (0-1, default 0.85) |
| matchFields | array | Optional field configuration (auto-detected if omitted) |

Input example

{
  "records": [
    { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    { "name": "Jane Doe", "email": "jane@example.com", "company": "Example Inc" },
    { "name": "J. Doe", "email": "jane@example.com", "company": "Example" },
    { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" }
  ]
}

During the Actor run

The Actor processes records in four stages:

  1. Auto-detect — Scan record keys and select matching fields/methods
  2. Block — Multi-pass blocking groups candidate pairs (avoids comparing every pair)
  3. Score — Weighted multi-field similarity scoring for each candidate pair
  4. Cluster — Union-Find groups transitive matches; select canonical records
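The clustering stage can be illustrated with a compact Union-Find in Python. The sketch takes pairs that have already been judged duplicates, merges them transitively, and then picks the record with the most non-empty fields as canonical. The function names and the tie-break rule (lowest index wins) are assumptions for illustration; only the use of Union-Find and "most complete record" comes from this page.

```python
def cluster(record_count: int, matched_pairs: list) -> list:
    """Group record indices into clusters using Union-Find."""
    parent = list(range(record_count))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root index under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for index in range(record_count):
        clusters.setdefault(find(index), []).append(index)
    return list(clusters.values())


def pick_canonical(records: list, members: list) -> int:
    """Most complete record wins; ties go to the lowest index (assumption)."""
    def completeness(index: int) -> int:
        return sum(1 for value in records[index].values() if value not in (None, ""))

    return max(members, key=lambda index: (completeness(index), -index))


records = [
    {"name": "Jim Smith", "email": "", "company": "ACME Corp"},
    {"name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation"},
    {"name": "J. Smith", "email": "jim@acme.com", "company": None},
    {"name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC"},
]

# 0 matches 1 and 1 matches 2, so 0, 1, and 2 end up in one cluster.
groups = cluster(len(records), [(0, 1), (1, 2)])
print(groups)
print([pick_canonical(records, members) for members in groups])
```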

You'll see progress logs as records are processed. Results are available in real-time.

Output format

Results are saved to the default dataset. Each input record appears in the output, annotated with its cluster membership.

Output example

[
  {
    "id": 1,
    "record": { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  },
  {
    "id": 2,
    "record": { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": false,
    "duplicateOf": 1,
    "matchScore": 0.92,
    "matchReasons": ["email_exact", "name_jaro-winkler:0.87", "company_token-set:0.91"],
    "confidence": 0.92
  },
  {
    "id": 5,
    "record": { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" },
    "clusterId": "c_3",
    "clusterSize": 1,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  }
]

| Field | Description |
| --- | --- |
| id | Row number (1-based, matches input order) |
| record | Original input record object |
| clusterId | Cluster identifier. Records in the same cluster are duplicates |
| clusterSize | Number of records in this cluster (1 = unique, >1 = has duplicates) |
| isCanonical | True if this is the "best" (most complete) record in the cluster |
| duplicateOf | ID of the canonical record this is a duplicate of (null if canonical) |
| matchScore | Composite similarity score against the canonical record (0-1) |
| matchReasons | Which fields matched and their individual scores |
| confidence | Overall confidence in the duplicate match (0-1) |

How to get the deduplicated set

Filter the output to isCanonical === true:

const deduped = results.filter(r => r.isCanonical);
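If you also want to review what was merged, the output can be grouped by clusterId. The Python sketch below is one possible way to do that, using only the output fields documented above; `group_by_cluster` is a hypothetical helper name and the sample items are abbreviated from the output example.

```python
from collections import defaultdict


def group_by_cluster(items: list) -> dict:
    """Map each clusterId to its canonical record and its duplicates."""
    clusters = defaultdict(lambda: {"canonical": None, "duplicates": []})
    for item in items:
        entry = clusters[item["clusterId"]]
        if item["isCanonical"]:
            entry["canonical"] = item["record"]
        else:
            entry["duplicates"].append(item["record"])
    return dict(clusters)


# Abbreviated items in the shape of the Actor's output.
items = [
    {"id": 1, "record": {"name": "Jim Smith"}, "clusterId": "c_1", "isCanonical": True},
    {"id": 2, "record": {"name": "James Smith"}, "clusterId": "c_1", "isCanonical": False},
    {"id": 5, "record": {"name": "Bob Wilson"}, "clusterId": "c_3", "isCanonical": True},
]

grouped = group_by_cluster(items)
deduped = [entry["canonical"] for entry in grouped.values()]
print(deduped)
```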

Standby mode (instant pair comparison)

Standby mode provides instant comparison of two records without running a full batch. Ideal for real-time dedup in agent workflows, Clay enrichment steps, and form validation.

Standby URL

https://superlativetech--superclean-dedupe.apify.actor?token=YOUR_API_TOKEN

Compare two records

curl "https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a=%7B%22name%22%3A%22Jim+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D&b=%7B%22name%22%3A%22James+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D"

Query parameters

| Parameter | Required | Description |
| --- | --- | --- |
| a | Yes | First record as URL-encoded JSON |
| b | Yes | Second record as URL-encoded JSON |
| threshold | No | Similarity threshold (default 0.85) |
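Hand-encoding JSON into a query string is error-prone, so it can help to build the Standby URL programmatically. The Python sketch below uses only the standard library; `build_pair_url` is a hypothetical helper, and the query parameter names (token, a, b, threshold) are the ones listed above. It only constructs the URL and does not send the request.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

STANDBY_URL = "https://superlativetech--superclean-dedupe.apify.actor"


def build_pair_url(a: dict, b: dict, token: str, threshold=None) -> str:
    """Build a Standby pair-comparison URL with URL-encoded JSON records."""
    params = {
        "token": token,
        # Compact separators keep the encoded JSON short.
        "a": json.dumps(a, separators=(",", ":")),
        "b": json.dumps(b, separators=(",", ":")),
    }
    if threshold is not None:
        params["threshold"] = threshold
    return f"{STANDBY_URL}?{urlencode(params)}"


url = build_pair_url(
    {"name": "Jim Smith", "email": "jim@acme.com"},
    {"name": "James Smith", "email": "jim@acme.com"},
    token="TOKEN",
)
print(url)

# Round trip: decoding the query string recovers the original record.
decoded = json.loads(parse_qs(urlsplit(url).query)["a"][0])
print(decoded)
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen(url)`, with a real API token in place of TOKEN.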

Response format

{
  "isDuplicate": true,
  "score": 0.92,
  "reasons": ["email_exact", "name_jaro-winkler:0.87"],
  "confidence": 0.92,
  "fieldsUsed": ["email", "name", "company"]
}

Limitations

  • Does not perform cross-run dedup (each run is independent)
  • Fuzzy matching works best on English text; non-Latin scripts may produce lower accuracy
  • Maximum ~100,000 records per run (256 MB memory constraint)
  • Auto-detection requires standard field names (email, name, company, phone, etc.)
  • Custom field names need explicit matchFields configuration

Integrations

Superclean Dedupe works with any tool that can call Apify Actors:

  • Clay — Add as a dedup step after enrichment
  • Make — Use the Apify module to run the Actor
  • Zapier — Trigger runs and retrieve results automatically
  • n8n — Self-hosted workflow automation

Using Superclean Dedupe with the Apify API

Node.js:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('superlativetech/superclean-dedupe').call({
    records: [
        { name: 'Jim Smith', email: 'jim@acme.com', company: 'ACME Corp' },
        { name: 'James Smith', email: 'jim@acme.com', company: 'Acme Corporation' },
        { name: 'Bob Wilson', email: 'bob@other.com', company: 'Other LLC' },
    ],
});

// Fetch the annotated records, then keep one canonical record per cluster
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const deduped = items.filter(r => r.isCanonical);
console.log(`${items.length} records → ${deduped.length} unique`);

Python:

from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

# Start the Actor and wait for the run to finish
run = client.actor('superlativetech/superclean-dedupe').call(run_input={
    'records': [
        {'name': 'Jim Smith', 'email': 'jim@acme.com', 'company': 'ACME Corp'},
        {'name': 'James Smith', 'email': 'jim@acme.com', 'company': 'Acme Corporation'},
        {'name': 'Bob Wilson', 'email': 'bob@other.com', 'company': 'Other LLC'},
    ]
})

# Fetch the annotated records, then keep one canonical record per cluster
items = client.dataset(run['defaultDatasetId']).list_items().items
deduped = [r for r in items if r['isCanonical']]
print(f"{len(items)} records → {len(deduped)} unique")

Check out the Apify API reference for full details, or click the API tab above for more code examples.

Your feedback

We're always improving Superclean Actors. If you have feature requests, find a bug, or need help with a specific use case, please open an issue in the Actor's Issues tab.

Leave a review

If Superclean Dedupe saves you time or improves your lead data, please leave a review. Your feedback helps other users discover the tool and helps us understand what's working well.


Built by Superlative