Changelog
All notable changes to the GTM Leads Cleaner Actor will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[1.0.0] - 2024-12-14
Added
- CSV Parsing
  - Auto-delimiter detection (comma or semicolon)
  - UTF-8 encoding support
  - Header row extraction
- Column Auto-Detection
  - Automatic detection for email, name, company, job title, and domain columns
  - Case-insensitive header matching
  - Support for common header variations (e.g., "Work Email", "E-mail", "Contact Email")
  - Configurable tie-breaking preference (first, last, fail)
  - Manual column override support
  - Programmatic field mapping via fieldMap parameter
- Email Normalization
  - Whitespace trimming
  - Lowercase conversion
  - Email format validation (RFC-compliant regex)
  - Multi-email field handling (extracts first valid email)
- Name Processing
  - Full name splitting into first/last components
  - Whitespace normalization
  - Support for first_name + last_name column fallback to full_name
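One simple way to split a full name is sketched below, assuming the first token is the first name and everything after it is the last name. This rule and the `split_full_name` helper are assumptions for illustration; the Actor may handle middle names, prefixes, or suffixes differently.

```python
def split_full_name(full_name: str) -> tuple[str, str]:
    """Split on whitespace: first token is the first name, the rest the last."""
    # str.split() with no arguments also collapses repeated whitespace.
    parts = full_name.split()
    if not parts:
        return "", ""
    return parts[0], " ".join(parts[1:])
```

Under this rule, `"Mary Ann Smith"` becomes `("Mary", "Ann Smith")` and a single-word name leaves the last name empty.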
- Company Normalization
  - Whitespace trimming and normalization
- Domain Derivation
  - Extraction from website/domain column
  - Fallback extraction from email address
  - URL parsing and www prefix removal
- Job Title Bucketing
  - 10 GTM-focused categories: Marketing, Sales, Executive, Product, Customer Success, RevOps, Engineering, HR, Other, Unknown
  - Keyword-based classification
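Keyword-based bucketing could be sketched as below. The keyword lists, their priority order, and the choice of `Unknown` for empty titles versus `Other` for unmatched ones are all assumptions; only the ten category names come from this changelog. Keywords are matched at word starts so that, for instance, "cto" does not match inside "director".

```python
import re

# Hypothetical keywords and priority order; earlier buckets win on overlap.
BUCKET_KEYWORDS: list[tuple[str, tuple[str, ...]]] = [
    ("Executive", ("ceo", "cfo", "cto", "chief", "founder", "president")),
    ("RevOps", ("revops", "revenue operations", "sales operations")),
    ("Customer Success", ("customer success", "account manager", "support")),
    ("Marketing", ("marketing", "growth", "brand", "content")),
    ("Sales", ("sales", "account executive", "business development")),
    ("Product", ("product",)),
    ("Engineering", ("engineer", "developer", "devops")),
    ("HR", ("human resources", "recruit", "talent", "people ops")),
]


def bucket_title(title: str) -> str:
    """Map a job title to one of the ten buckets via keyword search."""
    normalized = " ".join(title.lower().split())
    if not normalized:
        return "Unknown"  # assumed meaning: no title was provided
    for bucket, keywords in BUCKET_KEYWORDS:
        for keyword in keywords:
            # \b anchors the keyword to the start of a word.
            if re.search(rf"\b{re.escape(keyword)}", normalized):
                return bucket
    return "Other"  # assumed meaning: a title exists but matched nothing
```

Because the list is ordered, a title such as "Sales Operations Manager" lands in RevOps before the broader Sales keywords are checked.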
- Deduplication Engine
  - Email-based deduplication strategy
  - Domain+name combination strategy
  - First occurrence tracking (primary record)
  - Duplicate linking to primary record index
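The two strategies and the link back to the primary record can be illustrated with a single pass over the rows. The output field names `is_duplicate` and `duplicate_of`, the `strategy` values, and the rule that rows with an incomplete key are never treated as duplicates are assumptions made for this sketch.

```python
def mark_duplicates(rows: list[dict[str, str]],
                    strategy: str = "email") -> list[dict]:
    """Flag later rows sharing a key with an earlier (primary) row."""
    first_seen: dict[object, int] = {}  # dedup key -> index of primary record
    results = []
    for index, row in enumerate(rows):
        if strategy == "email":
            key = row.get("email") or None
        else:  # "domain_name": combine domain with first and last name
            parts = (row.get("domain"), row.get("first_name"),
                     row.get("last_name"))
            key = parts if all(parts) else None
        # Rows without a usable key are kept as their own primary record.
        primary = index if key is None else first_seen.setdefault(key, index)
        results.append({
            **row,
            "is_duplicate": primary != index,
            "duplicate_of": primary if primary != index else None,
        })
    return results
```

For three rows whose emails are `a@x.com`, `b@x.com`, and `a@x.com`, the third row would be marked as a duplicate of index 0.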
- Output Generation
  - Apify Dataset output with all canonical fields
  - Optional CSV export to Key-Value Store
  - SUMMARY JSON with processing statistics
  - Original column preservation alongside canonical fields
- Error Handling
  - Fatal errors for invalid input (non-CSV, encoding issues, missing email column)
  - Row-level error handling with error_message field
  - Warning collection for non-fatal issues
  - Threshold-based warnings (high duplicate rate, high invalid email rate)
Infrastructure
- Python 3.13+ support
- Pydantic v2 models for input validation
- Comprehensive test suite (400+ tests)
- Integration tests with fixture files