Superclean Dedupe
Pricing
from $0.35 / 1,000 records processed
Deduplicate lead records using fuzzy matching. Exact email/phone matching, Jaro-Winkler for names, token-set ratio for companies. Smart field auto-detection. Input: array of record objects. Output: clustered records with isCanonical, matchScore, matchReasons. Batch or instant pair API.
Developer: Superlative
Last modified: 4 days ago
Deduplicate lead records using fuzzy matching. Find and merge duplicates across company names, person names, emails, phone numbers, and more.
What does Superclean Dedupe do?
Superclean Dedupe finds duplicate records in your lead data using multi-field fuzzy matching — not just exact string comparison.
- Exact matching — Normalized email, phone, and domain matching catches obvious duplicates
- Fuzzy name matching — Jaro-Winkler algorithm matches "Jim Smith" with "James Smith", "McDonald" with "Mcdonald"
- Fuzzy company matching — Token-set ratio matches "ACME Corp" with "Acme Corporation", handles word-order differences
- Smart auto-detection — Automatically detects which fields to match on (email, name, company, phone, domain, etc.)
- Transitive clustering — If A matches B and B matches C, all three are grouped together
- Canonical selection — Picks the most complete record as the "winner" in each duplicate cluster
- Instant pair comparison — Sub-second comparison of two records via Standby HTTP API
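The exact-match normalization described above (plus-addressing, Gmail dots, digits-only phones) can be sketched as follows. This is an illustrative simplification, not the actor's actual code; its rules may differ in detail.

```python
import re

def normalize_email(email: str) -> str:
    """Illustrative email normalization: lowercase, strip plus-addressing,
    and remove dots in the local part for Gmail addresses."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]          # jim+linkedin -> jim
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")      # j.im -> jim
    return f"{local}@{domain}"

def normalize_phone(phone: str) -> str:
    """Digits-only comparison key: strips all formatting."""
    return re.sub(r"\D", "", phone)

print(normalize_email("Jim+linkedin@Acme.com"))   # jim@acme.com
print(normalize_phone("+1 (555) 010-2030"))       # 15550102030
```

With keys like these, "jim@acme.com" and "jim+linkedin@acme.com" collapse to the same value before any fuzzy comparison runs.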
Use with AI Agents
Available as an MCP tool via the Apify MCP Server.
| Property | Value |
|---|---|
| Actor ID | superlativetech/superclean-dedupe |
| Standby URL | https://superlativetech--superclean-dedupe.apify.actor |
| Input | records (object[]) or a + b (objects, Standby pair comparison) |
| Output | {id, record, clusterId, clusterSize, isCanonical, duplicateOf, matchScore, matchReasons, confidence} per record |
| Pricing | $0.50 per 1,000 records |
| Idempotent | Yes — same input always produces same output |
Output schema
{
  "id": 1,
  "record": { "name": "Jim Smith", "email": "jim@acme.com" },
  "clusterId": "c_1",
  "clusterSize": 2,
  "isCanonical": true,
  "duplicateOf": null,
  "matchScore": null,
  "matchReasons": [],
  "confidence": 1
}
Pipeline composability
This actor works in lead enrichment pipelines:
1. Scrape → 2. Clean (Superclean actors) → 3. Dedupe (this actor) → 4. Enrich (DNS/WHOIS) → 5. Score (ICP Scorer)
Pro tip: Clean names and companies with Superclean actors first, then dedupe. Normalized values produce much better match results.
Standby (instant pair comparison)
GET https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a={"name":"Jim Smith","email":"jim@acme.com"}&b={"name":"James Smith","email":"jim@acme.com"}
What else can Superclean do?
If you're cleaning lead data, you might also need:
- Superclean Company Names — Clean messy company names for cold emails and CRM
- Superclean Person Names — Clean person names for cold email personalization
- Superclean Emails — Validate emails, fix typos, detect disposable providers
- Superclean Job Titles — Normalize job titles for lead scoring and personalization
- Superclean Phone Numbers — Format and validate phone numbers
- Superclean URLs — Clean and normalize URLs from lead data
- Superclean Places — Parse and normalize location data from lead exports
- Superlead ICP Scorer — Score leads against your Ideal Customer Profile with AI
Why deduplicate lead data?
Your lead data comes from many sources with duplicate problems:
- Same person scraped from LinkedIn, Apollo, and a website — three records
- "ACME Corp" from one source, "Acme Corporation, Inc." from another
- "jim@acme.com" and "jim+linkedin@acme.com" — same person, different emails
- CRM imports from multiple campaigns with overlapping prospects
Duplicate records mean:
- Wasted outreach — Same prospect gets the same email twice
- Inflated metrics — Pipeline looks bigger than it is
- CRM pollution — Duplicate contacts create confusion for sales reps
- Higher costs — Enrichment and scoring costs multiply with duplicates
How to use Superclean Dedupe
- Paste your lead records as JSON into the input field
- Click Start and download your deduplicated results
- Filter to isCanonical === true to get the deduplicated set
Each record is compared against potential matches using multi-field fuzzy scoring, then grouped into clusters of duplicates.
How matching works
Smart field auto-detection
If you don't specify matchFields, the actor automatically detects fields from your records:
| Field | Method | Weight | Notes |
|---|---|---|---|
| email | Exact (normalized) | 1.0 | Strips plus addressing, Gmail dots |
| phone | Exact (digits only) | 0.9 | Strips all formatting |
| name / fullName | Jaro-Winkler | 0.8 | Best for short strings |
| firstName / lastName | Jaro-Winkler | 0.5 each | Combined when both present |
| company / companyName | Token-set ratio | 0.7 | Handles word-order differences |
| domain / website | Exact (extracted) | 0.7 | Ignores free email domains |
| city | Exact | 0.3 | Supplementary signal |
| title / jobTitle | Token-set ratio | 0.3 | Weak signal, helps disambiguation |
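Token-set matching can be approximated with a minimal sketch like the one below. Real token-set ratio implementations (e.g. RapidFuzz's token_set_ratio) also compare tokens at the character level, so "Corp" vs "Corporation" would partially match; this simplification only counts whole shared tokens and is not the actor's actual scorer.

```python
def token_set_score(a: str, b: str) -> float:
    """Word-order-insensitive similarity sketch: compares the sets of
    lowercase tokens, ignoring duplicates and ordering."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    # Score against the smaller token set so a short name matching a
    # longer official name ("Acme" vs "Acme Corporation") stays high.
    return len(ta & tb) / min(len(ta), len(tb))

print(token_set_score("ACME Corp", "Corp Acme"))     # 1.0 -- order ignored
print(token_set_score("Acme", "Acme Corporation"))   # 1.0 -- subset match
```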
Custom match fields
Override auto-detection with explicit field configuration:
{
  "matchFields": [
    { "field": "email", "method": "exact", "weight": 1.0 },
    { "field": "name", "method": "jaro-winkler", "weight": 0.8 },
    { "field": "company", "method": "token-set", "weight": 0.6 }
  ]
}
Available methods: exact, jaro-winkler, levenshtein, token-set
Threshold
The threshold parameter (default 0.85) controls how similar two records must be to be considered duplicates:
- 0.95 — Very strict. Only near-identical records merge. Minimizes false positives.
- 0.85 — Balanced default. Catches most real duplicates with few false positives.
- 0.70 — Aggressive. Catches more duplicates but may over-merge.
- 0.50 — Very aggressive. Only use for exploration or when you'll manually review.
When in doubt, start strict (0.90) and lower if you're missing duplicates.
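One plausible way to combine per-field scores into a single number to compare against the threshold is a weighted average over the fields present in both records. This is a sketch under assumptions, not the actor's actual formula; the per-field similarity function is stubbed here with exact comparison just for the demo.

```python
def pair_score(a: dict, b: dict, fields: list, sim) -> float:
    """Weighted average of per-field similarities over fields present in
    both records. `sim(method, x, y)` is assumed to return 0-1."""
    total = weighted = 0.0
    for f in fields:
        x, y = a.get(f["field"]), b.get(f["field"])
        if x is None or y is None:
            continue  # missing fields carry no weight
        total += f["weight"]
        weighted += f["weight"] * sim(f["method"], x, y)
    return weighted / total if total else 0.0

fields = [
    {"field": "email", "method": "exact", "weight": 1.0},
    {"field": "name", "method": "jaro-winkler", "weight": 0.8},
]
# Stub similarity: exact comparison for every method, demo only.
sim = lambda method, x, y: 1.0 if x.lower() == y.lower() else 0.0
a = {"name": "Jim Smith", "email": "jim@acme.com"}
b = {"name": "James Smith", "email": "jim@acme.com"}
print(pair_score(a, a, fields, sim))            # identical records score 1.0
print(round(pair_score(a, b, fields, sim), 2))  # 0.56 with the exact-match stub
```

With a real fuzzy name scorer, the second pair would score well above 0.56, which is exactly why the actor uses Jaro-Winkler on names rather than exact comparison.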
How many records can you deduplicate?
The actor handles up to 100,000 records per run. Multi-pass blocking avoids the O(n^2) comparison problem — for 10,000 records, only ~100K-300K comparisons are needed instead of 50 million.
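Blocking means records are only compared when they share a cheap "blocking key"; everything else is never scored. The keys below (email domain, last-name prefix) are illustrative assumptions, not the actor's actual passes.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records: list) -> set:
    """Multi-pass blocking sketch: records sharing any blocking key become
    candidate pairs, avoiding the O(n^2) all-pairs comparison."""
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        if "@" in r.get("email", ""):
            blocks["dom:" + r["email"].split("@")[1].lower()].append(i)
        name = r.get("name", "")
        if name:
            blocks["nm:" + name.lower().split()[-1][:3]].append(i)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

records = [
    {"name": "Jim Smith", "email": "jim@acme.com"},
    {"name": "James Smith", "email": "jim2@acme.com"},
    {"name": "Jane Doe", "email": "jane@example.com"},
]
print(candidate_pairs(records))  # {(0, 1)} -- Jane is never compared
```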
For best performance, clean your data with Superclean actors first. Normalized names and companies produce dramatically better match results.
How much will it cost you?
| Records | Cost (base rate, before volume discounts) |
|---|---|
| 1,000 | $0.50 |
| 10,000 | $5.00 |
| 100,000 | $50.00 |
Volume discounts apply automatically:
- Bronze (100+ records): $0.45/1K
- Silver (1,000+ records): $0.40/1K
- Gold (10,000+ records): $0.35/1K
Input parameters
| Parameter | Type | Description |
|---|---|---|
| records | array | Array of record objects to deduplicate |
| record | object | Single record — API shorthand. If both record and records are provided, record is prepended to the list |
| threshold | number | Similarity threshold for dedup (0-1, default 0.85) |
| matchFields | array | Optional field configuration (auto-detected if omitted) |
Input example
{
  "records": [
    { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    { "name": "Jane Doe", "email": "jane@example.com", "company": "Example Inc" },
    { "name": "J. Doe", "email": "jane@example.com", "company": "Example" },
    { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" }
  ]
}
During the Actor run
The Actor processes records in four stages:
- Auto-detect — Scan record keys and select matching fields/methods
- Block — Multi-pass blocking groups candidate pairs (avoids comparing every pair)
- Score — Weighted multi-field similarity scoring for each candidate pair
- Cluster — Union-Find groups transitive matches; select canonical records
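The Union-Find step above can be sketched with a minimal disjoint-set structure. This is a generic illustration of the technique, not the actor's internal code.

```python
class UnionFind:
    """Minimal Union-Find for transitive clustering: if A~B and B~C,
    find() places all three in one cluster."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind(4)
uf.union(0, 1)   # A matches B
uf.union(1, 2)   # B matches C
clusters = {}
for i in range(4):
    clusters.setdefault(uf.find(i), []).append(i)
print(sorted(clusters.values()))  # [[0, 1, 2], [3]]
```

After clustering, canonical selection picks the most complete record in each group as the winner.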
You'll see progress logs as records are processed. Results are available in real-time.
Output format
Results are saved to the default dataset. Each input record appears in the output, annotated with its cluster membership.
Output example
[
  {
    "id": 1,
    "record": { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  },
  {
    "id": 2,
    "record": { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": false,
    "duplicateOf": 1,
    "matchScore": 0.92,
    "matchReasons": ["email_exact", "name_jaro-winkler:0.87", "company_token-set:0.91"],
    "confidence": 0.92
  },
  {
    "id": 5,
    "record": { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" },
    "clusterId": "c_3",
    "clusterSize": 1,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  }
]
| Field | Description |
|---|---|
| id | Row number (1-based, matches input order) |
| record | Original input record object |
| clusterId | Cluster identifier — records in the same cluster are duplicates |
| clusterSize | Number of records in this cluster (1 = unique, >1 = has duplicates) |
| isCanonical | True if this is the "best" (most complete) record in the cluster |
| duplicateOf | ID of the canonical record this is a duplicate of (null if canonical) |
| matchScore | Composite similarity score against the canonical record (0-1) |
| matchReasons | Which fields matched and their individual scores |
| confidence | Overall confidence in the duplicate match (0-1) |
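Beyond filtering to canonicals, it is often useful to see each canonical record next to its duplicates. A small Python helper (hypothetical, operating on the output shape documented above) might group rows by clusterId:

```python
from collections import defaultdict

def group_clusters(results: list) -> dict:
    """Group dedupe output rows by clusterId, canonical record first."""
    clusters = defaultdict(list)
    for row in results:
        clusters[row["clusterId"]].append(row)
    for rows in clusters.values():
        rows.sort(key=lambda r: not r["isCanonical"])  # canonical sorts first
    return dict(clusters)

results = [
    {"id": 2, "clusterId": "c_1", "isCanonical": False},
    {"id": 1, "clusterId": "c_1", "isCanonical": True},
    {"id": 5, "clusterId": "c_3", "isCanonical": True},
]
grouped = group_clusters(results)
print([r["id"] for r in grouped["c_1"]])  # [1, 2]
```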
How to get the deduplicated set
Filter the output to isCanonical === true:
const deduped = results.filter(r => r.isCanonical);
Standby mode (instant pair comparison)
Standby mode provides instant comparison of two records without running a full batch. Ideal for real-time dedup in agent workflows, Clay enrichment steps, and form validation.
Standby URL
https://superlativetech--superclean-dedupe.apify.actor?token=YOUR_API_TOKEN
Compare two records
curl "https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a=%7B%22name%22%3A%22Jim+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D&b=%7B%22name%22%3A%22James+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D"
Query parameters
| Parameter | Required | Description |
|---|---|---|
a | Yes | First record as URL-encoded JSON |
b | Yes | Second record as URL-encoded JSON |
threshold | No | Similarity threshold (default 0.85) |
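Hand-encoding JSON into a query string is error-prone, so a small helper can build the Standby URL. The pair_url function is a hypothetical convenience wrapper; only the endpoint and the a/b/threshold/token parameters come from the documentation above.

```python
import json
from urllib.parse import urlencode

BASE = "https://superlativetech--superclean-dedupe.apify.actor"

def pair_url(a: dict, b: dict, token: str, threshold=None) -> str:
    """Build the Standby pair-comparison URL with URL-encoded JSON records."""
    params = {"token": token, "a": json.dumps(a), "b": json.dumps(b)}
    if threshold is not None:
        params["threshold"] = threshold
    return f"{BASE}?{urlencode(params)}"

url = pair_url({"name": "Jim Smith"}, {"name": "James Smith"}, "TOKEN")
print(url)  # JSON braces, quotes, and spaces arrive safely percent-encoded
```

The resulting URL can then be fetched with any HTTP client to get the response shown below.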
Response format
{
  "isDuplicate": true,
  "score": 0.92,
  "reasons": ["email_exact", "name_jaro-winkler:0.87"],
  "confidence": 0.92,
  "fieldsUsed": ["email", "name", "company"]
}
Limitations
- Does not perform cross-run dedup (each run is independent)
- Fuzzy matching works best on English text; non-Latin scripts may produce lower accuracy
- Maximum ~100,000 records per run (256 MB memory constraint)
- Auto-detection requires standard field names (email, name, company, phone, etc.)
- Custom field names need explicit matchFields configuration
Integrations
Superclean Dedupe works with any tool that can call Apify Actors:
- Clay — Add as a dedup step after enrichment
- Make — Use the Apify module to run the Actor
- Zapier — Trigger runs and retrieve results automatically
- n8n — Self-hosted workflow automation
Using Superclean Dedupe with the Apify API
Node.js:
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('superlativetech/superclean-dedupe').call({
  records: [
    { name: 'Jim Smith', email: 'jim@acme.com', company: 'ACME Corp' },
    { name: 'James Smith', email: 'jim@acme.com', company: 'Acme Corporation' },
    { name: 'Bob Wilson', email: 'bob@other.com', company: 'Other LLC' },
  ],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
const deduped = items.filter(r => r.isCanonical);
console.log(`${items.length} records → ${deduped.length} unique`);
Python:
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('superlativetech/superclean-dedupe').call(run_input={
    'records': [
        {'name': 'Jim Smith', 'email': 'jim@acme.com', 'company': 'ACME Corp'},
        {'name': 'James Smith', 'email': 'jim@acme.com', 'company': 'Acme Corporation'},
        {'name': 'Bob Wilson', 'email': 'bob@other.com', 'company': 'Other LLC'},
    ]
})

items = client.dataset(run['defaultDatasetId']).list_items().items
deduped = [r for r in items if r['isCanonical']]
print(f"{len(items)} records → {len(deduped)} unique")
Check out the Apify API reference for full details, or click the API tab above for more code examples.
Your feedback
We're always improving Superclean Actors. If you have feature requests, find a bug, or need help with a specific use case, please open an issue in the Actor's Issues tab.
Leave a review
If Superclean Dedupe saves you time or improves your lead data, please leave a review. Your feedback helps other users discover the tool and helps us understand what's working well.
Built by Superlative