Upload any lead CSV and get a CRM-ready dataset: email validation, name/company cleanup, job-title bucketing, and dedupe by email or domain+name.

GTM Leads Cleaner - CSV Lead Deduplication & Email Validation

What is GTM Leads Cleaner?

GTM Leads Cleaner is an Apify Actor that cleans, normalizes, and deduplicates GTM (Go-To-Market) lead data from CSV files. Built for sales teams, RevOps professionals, and marketers who need to prepare leads for CRM import with validated emails, standardized names, and categorized job titles.

See It In Action

🎬 Video demo coming soon!

Why Use GTM Leads Cleaner?

  • ✅ Save hours of manual work - Process 10,000 leads in 2-3 minutes
  • ✅ Improve CRM data quality - Validated emails, standardized names, clean formatting
  • ✅ Better lead routing - GTM-focused job title categorization for accurate scoring
  • ✅ Smart deduplication - Match by email or domain+name combination
  • ✅ Pay only for what you use - Just $0.001 per lead processed

Use Cases

Clean CRM Exports Before Re-Import

Export your HubSpot, Salesforce, or Pipedrive contacts, run them through the cleaner, and re-import with normalized data and duplicates removed.

Deduplicate Leads from Multiple Sources

Combine leads from trade shows, webinars, content downloads, and scraped data into a single clean list without duplicates.

Prepare Sales Intelligence Exports

Clean exports from Apollo.io, ZoomInfo, or LinkedIn Sales Navigator before loading into your CRM or sales engagement platform.

Standardize Job Titles for Lead Scoring

Categorize job titles into consistent GTM buckets (Founder/C-level, Sales leadership, Marketing IC, etc.) for accurate lead scoring and routing.

Features

  • 📧 Email Validation & Normalization - Trims whitespace, lowercases, validates format, and extracts the first email from multi-email fields (sketched after this list)
  • 👤 Name Processing - Splits full names into first/last, normalizes whitespace
  • 🏢 Company Normalization - Cleans company names, removes extra whitespace
  • 🌐 Domain Extraction - Derives domain from email or website column
  • 🎯 Job Title Bucketing - Categorizes job titles into 10 GTM-focused buckets
  • 🔄 Lead Deduplication - Finds duplicates by email or domain+name combination
  • 🔍 Auto Column Detection - Automatically detects column mappings from various header formats
  • 💰 Pay-Per-Event Pricing - Only pay for leads you actually process
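
The email normalization step can be approximated in a few lines of Python. This is an illustrative sketch of the idea, not the Actor's actual implementation; the regex and the comma/semicolon splitting are assumptions.

import re

# Simple format check; the Actor's real validation rules may differ.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalize_email(raw: str) -> tuple[str, bool]:
    """Trim, lowercase, keep the first address of a multi-email field, and validate the format."""
    candidate = raw.strip().lower()
    candidate = re.split(r"[;,]", candidate)[0].strip()  # assumed separators for multi-email fields
    return candidate, bool(EMAIL_RE.match(candidate))

print(normalize_email("  JANE@ACME.COM; jane.d@acme.com "))  # ('jane@acme.com', True)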

How Much Does It Cost to Clean Leads?

The GTM Leads Cleaner uses Apify's pay-per-event pricing model:

Volume | Cost per Lead | Example
Any volume | $0.001 | 1,000 leads = $1.00

Cost Comparison

Method | Cost for 10,000 Leads | Time
GTM Leads Cleaner | ~$10 | 2-3 minutes
Manual cleaning | $200-500 (VA time) | 8-20 hours
Custom script | $0 + dev time | Hours to build

Typical run times:

  • 1,000 rows: ~30 seconds
  • 10,000 rows: ~2-3 minutes
  • 100,000 rows: ~15-20 minutes

Tutorial: How to Clean Your Lead CSV

Step 1: Prepare Your CSV

Ensure your CSV file:

  • Is UTF-8 encoded
  • Has a header row
  • Contains at minimum an email column

Step 2: Upload Your File

You have three options:

  1. File Upload - Use the file upload button in the Apify Console
  2. URL - Provide a direct URL to your CSV file
  3. Key-Value Store - Reference a file already in your Apify Key-Value Store

Step 3: Configure Options

{
  "inputFile": "leads.csv",
  "dedupeStrategy": "email",
  "outputFormat": "dataset",
  "includeDuplicates": false
}

Key options:

  • dedupeStrategy: Choose "email" for email-based matching or "domain+name" for fuzzy matching
  • outputFormat: "dataset" for API access or "csv" for downloadable file
  • includeDuplicates: Set to true if you want to see duplicate rows (marked with is_duplicate=true)
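
For example, to match duplicates by domain plus name, keep the flagged duplicate rows for review, and get a downloadable CSV, the input could look like this:

{
  "inputFile": "leads.csv",
  "dedupeStrategy": "domain+name",
  "outputFormat": "csv",
  "includeDuplicates": true
}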

Step 4: Run and Download Results

  1. Click "Start" to run the Actor
  2. Wait for completion (check the "Runs" tab for progress)
  3. Download results from the "Storage" tab:
    • Dataset: Clean leads in JSON format
    • Key-Value Store: cleaned_leads.csv (if CSV output enabled) and SUMMARY stats

Input Schema

Parameter | Type | Default | Description
inputFile | string | required | CSV file (upload, URL, or KV store key)
dedupeStrategy | enum | "email" | "email" or "domain+name"
outputFormat | enum | "dataset" | "dataset" or "csv"
includeDuplicates | boolean | false | Keep duplicate rows in output
autoDetectPreference | enum | "first" | Tie-breaking: "first", "last", or "fail"
emailColumn | string | auto | Manual email column override
nameColumn | string | auto | Manual name column override
companyColumn | string | auto | Manual company column override
jobTitleColumn | string | auto | Manual job title column override
fieldMap | object | {} | Programmatic column mapping (highest priority)

Deduplication Strategies

  • email - Matches on normalized email address. First occurrence is primary, subsequent matches are marked as duplicates.
  • domain+name - Matches on normalized full name + domain combination. Useful when the same person appears with different email addresses.
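
Conceptually, both strategies boil down to building a match key per row; rows sharing a key are treated as duplicates of the first occurrence. The Python below is a rough sketch under that assumption, not the Actor's exact code:

def dedupe_key(row: dict, strategy: str) -> str | None:
    """Build the key used to detect duplicates; rows with an identical key match."""
    if strategy == "email":
        email = (row.get("normalized_email") or "").strip()
        return email or None
    # "domain+name": combine the extracted domain with the whitespace-normalized full name.
    domain = (row.get("domain") or "").strip().lower()
    name = " ".join((row.get("full_name") or "").lower().split())
    return f"{domain}|{name}" if domain and name else None

print(dedupe_key({"domain": "acme.com", "full_name": "Jane  Doe"}, "domain+name"))  # acme.com|jane doe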

Auto-Detection Preferences

When multiple columns match a pattern (e.g., both "Email" and "Work Email"):

  • first - Uses the first matching column (leftmost in CSV)
  • last - Uses the last matching column (rightmost in CSV)
  • fail - Aborts with error listing candidates

Output Format

Dataset Output (default)

Each row is pushed to the Apify default dataset with canonical fields:

{
  "original_row_index": 1,
  "email": "JANE@ACME.COM",
  "normalized_email": "jane@acme.com",
  "email_is_valid": true,
  "full_name": "Jane Doe",
  "first_name": "Jane",
  "last_name": "Doe",
  "company": "Acme Inc",
  "domain": "acme.com",
  "role_raw": "Head of Growth",
  "role_bucket": "Marketing leadership",
  "is_duplicate": false,
  "duplicate_of_index": null,
  "dedupe_strategy_used": null,
  "source_file": "leads.csv",
  "error_message": null
}

CSV Export

When outputFormat: "csv", a cleaned_leads.csv file is written to the Key-Value Store with:

  1. Canonical GTM fields (fixed order)
  2. Original columns (preserved order)

Summary Statistics

A SUMMARY JSON is always written to the Key-Value Store:

{
  "total_rows": 1000,
  "processed_rows": 1000,
  "duplicate_rows": 50,
  "unique_leads": 950,
  "invalid_email_rows": 25,
  "input_file_name": "leads.csv",
  "dedupe_strategy": ["email"],
  "warnings": [],
  "created_at": "2024-01-15T10:30:00Z"
}

Job Title Buckets for Lead Categorization

Job titles are automatically categorized into GTM-focused buckets (9 defined buckets plus an "Other" fallback); a simplified matching sketch follows the table:

Bucket | Example Keywords
Founder / C-level | founder, ceo, cto, cfo, chief, president, owner
RevOps / SalesOps | revops, revenue operations, sales operations, crm manager
Marketing leadership | head of marketing, vp marketing, marketing director, growth lead
Sales leadership | head of sales, vp sales, sales director, sales manager
Marketing IC | marketing specialist, demand gen specialist, content marketer
Sales IC | account executive, sdr, bdr, business development
Product | product manager, product owner, product lead
Engineering / Technical | engineer, developer, architect, devops
Customer Success | customer success, csm, account manager, onboarding
Other | (default fallback for unmatched titles)
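
Here is a simplified sketch of first-match keyword bucketing, using a trimmed version of the keyword lists above; the Actor's actual matching order and keyword sets may differ:

ROLE_BUCKETS = [
    ("Founder / C-level", ["founder", "ceo", "cto", "cfo", "chief", "president", "owner"]),
    ("RevOps / SalesOps", ["revops", "revenue operations", "sales operations", "crm manager"]),
    ("Sales leadership", ["head of sales", "vp sales", "sales director", "sales manager"]),
    ("Sales IC", ["account executive", "sdr", "bdr", "business development"]),
]

def bucket_job_title(title: str) -> str:
    """Return the first bucket whose keywords appear in the lowercased title, else 'Other'."""
    lowered = title.lower()
    for bucket, keywords in ROLE_BUCKETS:
        if any(keyword in lowered for keyword in keywords):
            return bucket
    return "Other"

print(bucket_job_title("Senior Account Executive"))  # Sales IC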

CSV Column Auto-Detection

The Actor recognizes common header variations:

Field | Recognized Headers
Email | email, e-mail, work email, contact email
Full Name | name, full name, contact, person
First Name | first name, given name, first
Last Name | last name, surname, family name, last
Company | company, organization, org, employer
Job Title | title, job title, position, role
Domain | domain, website, url, company domain

Headers are matched case-insensitively.
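
The detection plus the tie-breaking preference from the earlier section can be pictured roughly like this; the actual header patterns and matching logic are internal to the Actor:

EMAIL_HEADERS = {"email", "e-mail", "work email", "contact email"}

def detect_column(headers: list[str], recognized: set[str], preference: str = "first") -> str:
    """Return the header mapped to a canonical field, applying the tie-breaking preference."""
    matches = [h for h in headers if h.strip().lower() in recognized]
    if not matches:
        raise ValueError("no matching column found")
    if len(matches) > 1 and preference == "fail":
        raise ValueError(f"ambiguous columns: {matches}")
    return matches[-1] if preference == "last" else matches[0]

print(detect_column(["Work Email", "Email", "Name"], EMAIL_HEADERS))  # 'Work Email' (leftmost match wins)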

Error Handling

Fatal Errors (Actor fails)

  • Invalid file format (not .csv)
  • UTF-8 decode failure
  • Missing required email column
  • Empty input file
  • Tie-breaking with "fail" preference when multiple candidates exist

Row-Level Errors

Rows with processing errors continue through the pipeline with:

  • error_message field set
  • email_is_valid set to false
  • Other fields populated where possible
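
For illustration only, such a row might come out looking roughly like this (the exact error_message wording is an assumption):

{
  "original_row_index": 42,
  "email": "not-an-email",
  "normalized_email": "not-an-email",
  "email_is_valid": false,
  "is_duplicate": false,
  "error_message": "invalid email format"
}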

Warnings

Non-fatal issues are logged and included in the summary:

  • High duplicate rate (>30%)
  • High invalid email rate (>20%)
  • Column detection ambiguities

Integrations & API Access

Zapier Integration

  1. Use the "Apify" app in Zapier
  2. Select "Run Actor" action
  3. Choose "gtm-leads-cleaner" Actor
  4. Map your CSV file URL to the inputFile parameter
  5. Use "Get Dataset Items" to retrieve cleaned leads

Make.com (Integromat)

  1. Add the Apify module to your scenario
  2. Use "Run an Actor" action
  3. Configure input with your CSV file
  4. Use "Get Dataset Items" to retrieve results
  5. Route cleaned leads to your CRM module

n8n Workflow

  1. Use the Apify node
  2. Set operation to "Run Actor"
  3. Configure the Actor ID and input parameters
  4. Use HTTP Request node to fetch dataset results
  5. Connect to your CRM node (HubSpot, Salesforce, etc.)

Python SDK

from apify_client import ApifyClient

client = ApifyClient("your-api-token")
actor = client.actor("your-username/gtm-leads-cleaner")
run = actor.call(run_input={
    "inputFile": "https://example.com/leads.csv",
    "dedupeStrategy": "email",
    "outputFormat": "dataset"
})

# Get results
dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(item["normalized_email"], item["is_duplicate"])
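
If you set outputFormat to "csv", the exported file and the SUMMARY stats live in the run's default Key-Value Store and can be fetched with the same client; a minimal continuation of the example above:

# Fetch the CSV export and summary from the run's default Key-Value Store
store = client.key_value_store(run["defaultKeyValueStoreId"])
summary = store.get_record("SUMMARY")            # JSON record; None if the key is missing
csv_file = store.get_record("cleaned_leads.csv")
if summary:
    print(summary["value"]["unique_leads"], "unique leads")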

JavaScript / Node.js SDK

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your-api-token' });
const run = await client.actor('your-username/gtm-leads-cleaner').call({
    inputFile: 'https://example.com/leads.csv',
    dedupeStrategy: 'email',
    outputFormat: 'dataset'
});

// Get results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(item.normalized_email, item.is_duplicate);
});

Direct API Call

curl -X POST "https://api.apify.com/v2/acts/your-username~gtm-leads-cleaner/runs?token=your-api-token" \
  -H "Content-Type: application/json" \
  -d '{
    "inputFile": "https://example.com/leads.csv",
    "dedupeStrategy": "email"
  }'

FAQ

What CSV formats are supported?

The Actor supports standard UTF-8 encoded CSV files with a header row. Files must have the .csv extension. The Actor handles various delimiters and quote characters automatically.

Can I use custom column mappings?

Yes! You have three options:

  1. Individual overrides: Use emailColumn, nameColumn, companyColumn, or jobTitleColumn to specify exact header names
  2. Field map: Use the fieldMap parameter for programmatic mapping of all fields at once
  3. Auto-detection: Let the Actor detect columns automatically (works with most common header formats)
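
For example, option 1 with individual overrides for non-standard headers might look like this (the header names here are just sample values):

{
  "inputFile": "leads.csv",
  "emailColumn": "Work Email Address",
  "nameColumn": "Contact Person",
  "companyColumn": "Account Name",
  "jobTitleColumn": "Position Title"
}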

How does deduplication work?

The Actor supports two deduplication strategies:

  • Email-based: Compares normalized email addresses (lowercased, trimmed). First occurrence is kept as the primary record.
  • Domain+Name: Compares the combination of domain (from email or website) and normalized full name. Useful when the same person has multiple email addresses.

Duplicates are either filtered out (default) or marked with is_duplicate=true and duplicate_of_index pointing to the primary record (when includeDuplicates=true).

What happens to invalid emails?

Rows with invalid emails are still processed and included in the output. They are marked with:

  • email_is_valid: false
  • normalized_email: The original email (lowercased and trimmed)
  • All other fields are processed normally

You can filter these out in your downstream system or use the email_is_valid field for conditional logic.
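
As a minimal sketch with the Python client, assuming you only want valid, non-duplicate rows:

from apify_client import ApifyClient

client = ApifyClient("your-api-token")
# Keep only rows that passed email validation and were not flagged as duplicates
valid_leads = [
    item for item in client.dataset("your-dataset-id").iterate_items()
    if item["email_is_valid"] and not item["is_duplicate"]
]
print(len(valid_leads), "clean leads ready for import")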

Does it support pay-per-event pricing?

Yes! The Actor uses Apify's pay-per-event model. You're charged $0.001 per processed lead, meaning you only pay for what you use. The pricing appears as "Charged for X events" in your Apify billing.

Can I keep duplicate rows in the output?

Yes, set includeDuplicates: true in your input. Duplicates will be included but marked with is_duplicate: true and duplicate_of_index showing which record they duplicate.

What's the maximum file size?

There's no hard limit, but for optimal performance:

  • Files under 100MB process quickly
  • Larger files may require more memory (adjust in Actor settings)
  • For very large files (1M+ rows), consider splitting into chunks
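
If you do split a very large export first, a generic Python sketch (independent of the Actor) could look like this:

import csv

def split_csv(path: str, rows_per_chunk: int = 250_000) -> None:
    """Split a large CSV into numbered chunks, repeating the header row in each file."""
    base = path.rsplit(".", 1)[0]
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) >= rows_per_chunk:
                _write_chunk(base, part, header, chunk)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(base, part, header, chunk)

def _write_chunk(base: str, part: int, header: list[str], rows: list[list[str]]) -> None:
    with open(f"{base}_part{part}.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)

split_csv("all_leads.csv")  # produces all_leads_part0.csv, all_leads_part1.csv, ...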

Development

Local Development

# Install dependencies
uv sync
# Run tests
uv run pytest tests/ -v
# Run locally
apify run

Test Commands

# Run all tests
uv run pytest tests/ -v
# Run with coverage
uv run pytest tests/ --cov=src --cov-report=html
# Run specific test file
uv run pytest tests/test_integration.py -v

License

Apache 2.0