Pricing

from $0.35 / 1,000 records processed

Superclean Dedupe

Deduplicate lead records using fuzzy matching. Exact email/phone matching, Jaro-Winkler for names, token-set ratio for companies. Smart field auto-detection. Input: array of record objects. Output: clustered records with isCanonical, matchScore, matchReasons. Batch or instant pair API.

Developer

Superlative

Last modified

5 days ago

What does Superclean Dedupe do?

Superclean Dedupe finds duplicate records in your lead data using multi-field fuzzy matching — not just exact string comparison.

Exact matching — Normalized email, phone, and domain matching catches obvious duplicates
Fuzzy name matching — Jaro-Winkler algorithm matches "Jim Smith" with "James Smith", "McDonald" with "Mcdonald"
Fuzzy company matching — Token-set ratio matches "ACME Corp" with "Acme Corporation", handles word-order differences
Smart auto-detection — Automatically detects which fields to match on (email, name, company, phone, domain, etc.)
Transitive clustering — If A matches B and B matches C, all three are grouped together
Canonical selection — Picks the most complete record as the "winner" in each duplicate cluster
Instant pair comparison — Sub-second comparison of two records via Standby HTTP API

Use with AI Agents

Available as an MCP tool via the Apify MCP Server.

Property	Value
Actor ID	`superlativetech/superclean-dedupe`
Standby URL	`https://superlativetech--superclean-dedupe.apify.actor`
Input	`records` (object[]) or `a` + `b` (objects, Standby pair comparison)
Output	`{id, record, clusterId, clusterSize, isCanonical, duplicateOf, matchScore, matchReasons, confidence}` per record
Pricing	$0.50 per 1,000 records
Idempotent	Yes — same input always produces same output

Output schema

{ "id": 1, "record": {"name": "Jim Smith", "email": "jim@acme.com"}, "clusterId": "c_1", "clusterSize": 2, "isCanonical": true, "duplicateOf": null, "matchScore": null, "matchReasons": [], "confidence": 1 }

Pipeline composability

This actor works in lead enrichment pipelines:

1. Scrape → 2. Clean (Superclean actors) → 3. Dedupe (this actor) → 4. Enrich (DNS/WHOIS) → 5. Score (ICP Scorer)

Pro tip: Clean names and companies with Superclean actors first, then dedupe. Normalized values produce much better match results.

Standby (instant pair comparison)

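GET https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a={"name":"Jim Smith","email":"jim@acme.com"}&b={"name":"James Smith","email":"jim@acme.com"}

The JSON values are shown unencoded for readability. In a real request, URL-encode both `a` and `b`, as in the curl example under "Standby mode (instant pair comparison)".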

What else can Superclean do?

If you're cleaning lead data, you might also need:

Superclean Company Names — Clean messy company names for cold emails and CRM
Superclean Person Names — Clean person names for cold email personalization
Superclean Emails — Validate emails, fix typos, detect disposable providers
Superclean Job Titles — Normalize job titles for lead scoring and personalization
Superclean Phone Numbers — Format and validate phone numbers
Superclean URLs — Clean and normalize URLs from lead data
Superclean Places — Parse and normalize location data from lead exports
Superlead ICP Scorer — Score leads against your Ideal Customer Profile with AI

Why deduplicate lead data?

Your lead data comes from many sources with duplicate problems:

Same person scraped from LinkedIn, Apollo, and a website — three records
"ACME Corp" from one source, "Acme Corporation, Inc." from another
"jim@acme.com" and "jim+linkedin@acme.com" — same person, different emails
CRM imports from multiple campaigns with overlapping prospects

Duplicate records mean:

Wasted outreach — Same prospect gets the same email twice
Inflated metrics — Pipeline looks bigger than it is
CRM pollution — Duplicate contacts create confusion for sales reps
Higher costs — Enrichment and scoring costs multiply with duplicates

How to use Superclean Dedupe

Paste your lead records as JSON into the input field
Click Start and download your deduplicated results
Filter to isCanonical === true to get the deduplicated set

Each record is compared against potential matches using multi-field fuzzy scoring, then grouped into clusters of duplicates.

How matching works

Smart field auto-detection

If you don't specify matchFields, the actor automatically detects fields from your records:

Field	Method	Weight	Notes
`email`	Exact (normalized)	1.0	Strips plus addressing, Gmail dots
`phone`	Exact (digits only)	0.9	Strips all formatting
`name` / `fullName`	Jaro-Winkler	0.8	Best for short strings
`firstName` / `lastName`	Jaro-Winkler	0.5 each	Combined when both present
`company` / `companyName`	Token-set ratio	0.7	Handles word-order differences
`domain` / `website`	Exact (extracted)	0.7	Ignores free email domains
`city`	Exact	0.3	Supplementary signal
`title` / `jobTitle`	Token-set ratio	0.3	Weak signal, helps disambiguation
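
The "Exact (normalized)" rows above depend on normalizing values before comparing them. The following Python sketch shows one plausible way to do that. It is not the actor's actual code: the function names and the `FREE_EMAIL_DOMAINS` set are illustrative assumptions, and deriving a domain from an email address is a guess based on the "Ignores free email domains" note.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Illustrative, incomplete list; the actor's real list is not documented.
FREE_EMAIL_DOMAINS = {"gmail.com", "googlemail.com", "yahoo.com", "outlook.com", "hotmail.com"}


def normalize_email(email: str) -> str:
    """Lowercase, strip plus addressing, and drop dots in Gmail local parts."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]          # jim+linkedin -> jim
    if domain in {"gmail.com", "googlemail.com"}:
        local = local.replace(".", "")      # j.i.m -> jim (Gmail ignores dots)
    return f"{local}@{domain}"


def normalize_phone(phone: str) -> str:
    """Keep digits only, discarding spaces, dashes, parentheses and '+'."""
    return re.sub(r"\D", "", phone)


def extract_domain(value: str) -> Optional[str]:
    """Extract a bare domain from a URL or email; return None for free email domains."""
    value = value.strip().lower()
    if "@" in value:
        host = value.rsplit("@", 1)[1]
    else:
        if "//" not in value:
            value = "//" + value            # lets urlsplit parse scheme-less input
        host = urlsplit(value).hostname or ""
    host = host.removeprefix("www.")
    if not host or host in FREE_EMAIL_DOMAINS:
        return None
    return host
```

With these helpers, "jim@acme.com" and "jim+linkedin@acme.com" normalize to the same key, so an exact comparison treats them as a match.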

Custom match fields

Override auto-detection with explicit field configuration:

{
  "matchFields": [
    { "field": "email", "method": "exact", "weight": 1.0 },
    { "field": "name", "method": "jaro-winkler", "weight": 0.8 },
    { "field": "company", "method": "token-set", "weight": 0.6 }
  ]
}

Available methods: exact, jaro-winkler, levenshtein, token-set
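
For readers unfamiliar with the `jaro-winkler` method, here is a minimal textbook implementation in Python. It illustrates the algorithm only. The actor's own implementation, and any preprocessing such as case folding, may differ. The prefix scale of 0.1 and maximum prefix length of 4 are the commonly used defaults, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

The prefix boost is why Jaro-Winkler suits short strings such as names: two names that start the same way score higher than two that merely share letters. Lowercasing both inputs first makes "McDonald" and "Mcdonald" identical.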

Threshold

The threshold parameter (default 0.85) controls how similar two records must be to be considered duplicates:

0.95 — Very strict. Only near-identical records merge. Minimizes false positives.
0.85 — Balanced default. Catches most real duplicates with few false positives.
0.70 — Aggressive. Catches more duplicates but may over-merge.
0.50 — Very aggressive. Only use for exploration or when you'll manually review.

When in doubt, start strict (0.90) and lower if you're missing duplicates.
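
To make the effect of the threshold concrete, the sketch below combines per-field scores into one composite score and compares it against a threshold. It assumes a simple weighted average, which is only a guess: the actor's real formula is not documented here, and the output example reports 0.92 for the same per-field scores, so the real calculation evidently differs. `composite_score` and `is_duplicate` are hypothetical names.

```python
from typing import Mapping

# Weights taken from the auto-detection table.
DEFAULT_WEIGHTS = {"email": 1.0, "name": 0.8, "company": 0.7}


def composite_score(field_scores: Mapping[str, float],
                    weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-field similarities (assumed formula)."""
    used = [f for f in field_scores if f in weights]
    total_weight = sum(weights[f] for f in used)
    if total_weight == 0:
        return 0.0
    return sum(weights[f] * field_scores[f] for f in used) / total_weight


def is_duplicate(field_scores: Mapping[str, float], threshold: float = 0.85) -> bool:
    """A pair is a duplicate when its composite score reaches the threshold."""
    return composite_score(field_scores) >= threshold


# Per-field scores from the output example: exact email, fuzzy name and company.
scores = {"email": 1.0, "name": 0.87, "company": 0.91}
score = composite_score(scores)          # (1.0 + 0.696 + 0.637) / 2.5 = 0.9332
merged_at_default = is_duplicate(scores, threshold=0.85)   # True
merged_when_strict = is_duplicate(scores, threshold=0.95)  # False
```

Under this assumed formula the pair merges at the default threshold but not at 0.95, which is the trade-off the list above describes.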

How many records can you deduplicate?

The actor handles up to 100,000 records per run. Multi-pass blocking avoids the O(n^2) comparison problem — for 10,000 records, only ~100K-300K comparisons are needed instead of 50 million.
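
The 50 million figure is the number of unordered pairs, n(n-1)/2, which for n = 10,000 is 49,995,000. The Python sketch below shows that count and a simple form of multi-pass blocking. The blocking keys used here (email domain and the first three letters of the name) are illustrative assumptions, not the actor's real keys.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of comparisons needed without blocking."""
    return n * (n - 1) // 2


def blocking_keys(record: dict) -> list:
    """One key per blocking pass; records sharing any key become candidates."""
    keys = []
    email = record.get("email", "").lower()
    if "@" in email:
        keys.append(("domain", email.split("@", 1)[1]))
    name = record.get("name", "").lower()
    if name:
        keys.append(("name3", name[:3]))
    return keys


def candidate_pairs(records: list) -> set:
    """Return index pairs (i, j), i < j, that share at least one blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))  # the set removes cross-pass repeats
    return pairs


records = [
    {"name": "Jim Smith", "email": "jim@acme.com"},
    {"name": "James Smith", "email": "jim@acme.com"},
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "J. Doe", "email": "jane@example.com"},
    {"name": "Bob Wilson", "email": "bob@other.com"},
]
pairs = candidate_pairs(records)  # 2 candidate pairs instead of all 10
```

Only candidate pairs go on to the more expensive fuzzy scoring, so the cost grows with block sizes rather than with the square of the record count.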

For best performance, clean your data with Superclean actors first. Normalized names and companies produce dramatically better match results.

How much will it cost you?

Records	Cost at the base rate ($0.50 per 1,000)
1,000	$0.50
10,000	$5.00
100,000	$50.00

Volume discounts apply automatically:

Bronze (100+ records): $0.45/1K
Silver (1,000+ records): $0.40/1K
Gold (10,000+ records): $0.35/1K

Input parameters

Parameter	Type	Description
`records`	array	Array of record objects to deduplicate
`record`	object	Single record — API shorthand. If both `record` and `records` are provided, `record` is prepended to the list
`threshold`	number	Similarity threshold for dedup (0-1, default 0.85)
`matchFields`	array	Optional field configuration (auto-detected if omitted)

Input example

{
  "records": [
    { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    { "name": "Jane Doe", "email": "jane@example.com", "company": "Example Inc" },
    { "name": "J. Doe", "email": "jane@example.com", "company": "Example" },
    { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" }
  ]
}

During the Actor run

The Actor processes records in four stages:

Auto-detect — Scan record keys and select matching fields/methods
Block — Multi-pass blocking groups candidate pairs (avoids comparing every pair)
Score — Weighted multi-field similarity scoring for each candidate pair
Cluster — Union-Find groups transitive matches; select canonical records
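
The Cluster stage can be sketched in Python as follows. This is a generic Union-Find with canonical selection by "most non-empty fields", written under the assumption that completeness means the count of filled fields, with ties going to the earliest record. The actor's real tie-breaking rules are not documented here, and the names `UnionFind`, `completeness`, and `cluster_records` are illustrative.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure that merges matched records transitively."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def completeness(record: dict) -> int:
    """Number of fields holding a non-empty value."""
    return sum(1 for value in record.values() if value not in (None, ""))


def cluster_records(records: list, matched_pairs: list) -> list:
    """Group record indices into clusters and pick a canonical index for each."""
    uf = UnionFind(len(records))
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[uf.find(idx)].append(idx)
    clusters = []
    for members in groups.values():
        # Most complete record wins; ties go to the lowest index.
        canonical = max(members, key=lambda i: (completeness(records[i]), -i))
        clusters.append({"canonical": canonical, "members": members})
    return clusters


records = [
    {"name": "A"},
    {"name": "A", "email": "a@x.com"},
    {"name": "A", "email": "a@x.com", "company": "X"},
    {"name": "B"},
]
# Record 0 matches 1 and 1 matches 2, so all three land in one cluster.
clusters = cluster_records(records, [(0, 1), (1, 2)])
```

Records 0 and 2 are never compared directly, yet they share a cluster because both are linked to record 1. This is the transitive grouping described in the feature list.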

You'll see progress logs as records are processed. Results are available in real time.

Output format

Results are saved to the default dataset. Each input record appears in the output, annotated with its cluster membership.

Output example

[
  {
    "id": 1,
    "record": { "name": "Jim Smith", "email": "jim@acme.com", "company": "ACME Corp" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  },
  {
    "id": 2,
    "record": { "name": "James Smith", "email": "jim@acme.com", "company": "Acme Corporation" },
    "clusterId": "c_1",
    "clusterSize": 2,
    "isCanonical": false,
    "duplicateOf": 1,
    "matchScore": 0.92,
    "matchReasons": ["email_exact", "name_jaro-winkler:0.87", "company_token-set:0.91"],
    "confidence": 0.92
  },
  {
    "id": 5,
    "record": { "name": "Bob Wilson", "email": "bob@other.com", "company": "Other LLC" },
    "clusterId": "c_3",
    "clusterSize": 1,
    "isCanonical": true,
    "duplicateOf": null,
    "matchScore": null,
    "matchReasons": [],
    "confidence": 1
  }
]

Field	Description
`id`	Row number (1-based, matches input order)
`record`	Original input record object
`clusterId`	Cluster identifier — records in the same cluster are duplicates
`clusterSize`	Number of records in this cluster (1 = unique, >1 = has duplicates)
`isCanonical`	True if this is the "best" (most complete) record in the cluster
`duplicateOf`	ID of the canonical record this is a duplicate of (null if canonical)
`matchScore`	Composite similarity score against the canonical record (0-1)
`matchReasons`	Which fields matched and their individual scores
`confidence`	Overall confidence in the duplicate match (0-1)

How to get the deduplicated set

Filter the output to isCanonical === true:

const deduped = results.filter(r => r.isCanonical);
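
If you also want to see which duplicates were folded into each canonical record, you can group the output rows by `clusterId`. The Python sketch below assumes `results` is the list of output rows in the shape documented above. `group_by_cluster` is a hypothetical helper, not part of any Apify client.

```python
def group_by_cluster(results: list) -> dict:
    """Map each clusterId to its canonical record and the duplicates merged into it."""
    clusters = {}
    for row in results:
        entry = clusters.setdefault(row["clusterId"], {"canonical": None, "duplicates": []})
        if row["isCanonical"]:
            entry["canonical"] = row["record"]
        else:
            entry["duplicates"].append(row["record"])
    return clusters


# Trimmed rows from the output example.
results = [
    {"id": 1, "record": {"name": "Jim Smith"}, "clusterId": "c_1", "isCanonical": True},
    {"id": 2, "record": {"name": "James Smith"}, "clusterId": "c_1", "isCanonical": False},
    {"id": 5, "record": {"name": "Bob Wilson"}, "clusterId": "c_3", "isCanonical": True},
]
grouped = group_by_cluster(results)
deduped = [entry["canonical"] for entry in grouped.values()]
```

Keeping the duplicates next to their canonical record makes it easier to review merges before deleting anything from a CRM.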

Standby mode (instant pair comparison)

Standby mode provides instant comparison of two records without running a full batch. Ideal for real-time dedup in agent workflows, Clay enrichment steps, and form validation.

Standby URL

https://superlativetech--superclean-dedupe.apify.actor?token=YOUR_API_TOKEN

Compare two records

curl "https://superlativetech--superclean-dedupe.apify.actor?token=TOKEN&a=%7B%22name%22%3A%22Jim+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D&b=%7B%22name%22%3A%22James+Smith%22%2C%22email%22%3A%22jim%40acme.com%22%7D"

Query parameters

Parameter	Required	Description
`a`	Yes	First record as URL-encoded JSON
`b`	Yes	Second record as URL-encoded JSON
`threshold`	No	Similarity threshold (default 0.85)

Response format

{
  "isDuplicate": true,
  "score": 0.92,
  "reasons": ["email_exact", "name_jaro-winkler:0.87"],
  "confidence": 0.92,
  "fieldsUsed": ["email", "name", "company"]
}
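
A minimal Python sketch for calling the Standby endpoint with only the standard library might look like this. It builds the same URL-encoded query as the curl example. `build_pair_url` and `compare_pair` are illustrative helper names, and the response is assumed to be the JSON object shown above.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

STANDBY_URL = "https://superlativetech--superclean-dedupe.apify.actor"


def build_pair_url(token: str, a: dict, b: dict, threshold: Optional[float] = None) -> str:
    """Serialize both records as compact JSON and URL-encode them as query parameters."""
    params = {
        "token": token,
        "a": json.dumps(a, separators=(",", ":")),
        "b": json.dumps(b, separators=(",", ":")),
    }
    if threshold is not None:
        params["threshold"] = str(threshold)
    return f"{STANDBY_URL}?{urlencode(params)}"


def compare_pair(token: str, a: dict, b: dict,
                 threshold: Optional[float] = None, timeout: float = 30.0) -> dict:
    """Send the GET request and return the parsed JSON response."""
    with urlopen(build_pair_url(token, a, b, threshold), timeout=timeout) as response:
        return json.load(response)


url = build_pair_url(
    "TOKEN",
    {"name": "Jim Smith", "email": "jim@acme.com"},
    {"name": "James Smith", "email": "jim@acme.com"},
)
```

Calling `compare_pair("YOUR_API_TOKEN", record_a, record_b)` would then return a dictionary whose `isDuplicate` and `score` keys can drive a form-validation or agent step.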

Limitations

Does not perform cross-run dedup (each run is independent)
Fuzzy matching works best on English text; non-Latin scripts may produce lower accuracy
Maximum ~100,000 records per run (256 MB memory constraint)
Auto-detection requires standard field names (email, name, company, phone, etc.)
Custom field names need explicit matchFields configuration

Integrations

Superclean Dedupe works with any tool that can call Apify Actors:

Clay — Add as a dedup step after enrichment
Make — Use the Apify module to run the Actor
Zapier — Trigger runs and retrieve results automatically
n8n — Self-hosted workflow automation

Using Superclean Dedupe with the Apify API

Node.js:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('superlativetech/superclean-dedupe').call({
  records: [
    { name: 'Jim Smith', email: 'jim@acme.com', company: 'ACME Corp' },
    { name: 'James Smith', email: 'jim@acme.com', company: 'Acme Corporation' },
    { name: 'Bob Wilson', email: 'bob@other.com', company: 'Other LLC' },
  ]
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
const deduped = items.filter(r => r.isCanonical);
console.log(`${items.length} records → ${deduped.length} unique`);

Python:

from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('superlativetech/superclean-dedupe').call(run_input={
    'records': [
        {'name': 'Jim Smith', 'email': 'jim@acme.com', 'company': 'ACME Corp'},
        {'name': 'James Smith', 'email': 'jim@acme.com', 'company': 'Acme Corporation'},
        {'name': 'Bob Wilson', 'email': 'bob@other.com', 'company': 'Other LLC'},
    ]
})

items = client.dataset(run['defaultDatasetId']).list_items().items
deduped = [r for r in items if r['isCanonical']]
print(f"{len(items)} records → {len(deduped)} unique")

Check out the Apify API reference for full details, or click the API tab above for more code examples.

Your feedback

We're always improving Superclean Actors. If you have feature requests, find a bug, or need help with a specific use case, please open an issue in the Actor's Issues tab.

Leave a review

If Superclean Dedupe saves you time or improves your lead data, please leave a review. Your feedback helps other users discover the tool and helps us understand what's working well.

Built by Superlative
