- CRITICAL: Playwright email fallback was dead code: routes.js called extractEmailFast directly, bypassing extractEmailFromWebsite. routes.js now calls extractEmailFromWebsite, which tries fast HTTP extraction first, then falls back to Playwright for JS-heavy sites (~20-30% more emails recovered)
- CRITICAL: Deobfuscation rule /\s+at\s+/ was too broad: it rewrote "look at this" as "look@this". It now uses context-aware lookbehind/lookahead in both email-extractor.js and fast-email-extractor.js. Same fix applied to /\s+dot\s+/, /\s+DOT\s+/, /\s+AT\s+/
- successRate comparison in main.js was string vs number (.toFixed() returns a string); now wrapped with Number()
- postmaster@ scoring penalty was documented in CHANGELOG but never implemented; scoreEmail() now applies a -10 score penalty
- CONTACT_KEYWORDS.SL (Slovenian) was referenced in fast-email-extractor.js but missing from keywords.js; added Slovenian contact, legal, and path keywords
- Cache cleanup (shouldRunCleanup/cleanupCache) was never called proactively; it now runs every 5 places in the SEARCH handler loop
- Default router handler passed partial {page, request} instead of the full crawling context; it now forwards the complete context object
- CSS selectors reordered to prefer stable role-based/aria selectors over fragile Google class names (ZkP5Je, ceNzKf, HlvSq)
- extractEmailFromWebsite returned early with needsPlaywright: true when fast extract found socials but no email, so the Playwright fallback was skipped. It now continues to Playwright and preserves fastSocials through all return paths
- Added missing BUSINESS_EMAIL_PREFIXES.SL for Slovenian (was added to contact/legal/paths but missed in email prefixes)
- Dead exported functions never imported anywhere: getProxyStats, resetProxyStats (email-extractor.js), getCachedWebsite, setCachedWebsite, updateRateLimitConfig, getCleanupState, getMemoryPressureLevel (cache-utils.js), validateAndCleanBusinessData (validators.js)
- checkRateLimit unexported (only used internally by enforceRateLimit within cache-utils.js)
- ESLint quality rules: eqeqeq, prefer-const, no-var, no-throw-literal, no-implicit-coercion, no-duplicate-imports
- Fallback selectors for Google Maps end-of-list detection (role-based and aria-based)
- Error handling for Impit HTTP client initialization
- Rejected emails cap (MAX_REJECTED_EMAILS = 100) to prevent memory leaks
- MEMORY_THRESHOLDS.minHeapMBForPressure centralized constant
- Targeted hover selectors for email-related elements only
- Targeted pseudo-element selectors for email extraction
- Shadow DOM scanning with targeted selectors instead of wildcard
- All 35+ language keywords now used in fast email contact page discovery
- Updated Playwright from 1.49.1 to 1.58.2 (Chromium 145+, Firefox 146+, WebKit 26+)
- Updated Apify SDK from 3.5.2 to 3.6.0
- Updated Crawlee from 3.15.3 to 3.16.0
- Updated Apify CLI from 1.1.1 to 1.2.1
- Updated Docker base image to apify/actor-node-playwright-chrome:22-1.58.2
- Updated check-playwright-version.mjs to expect 1.58.2
- Email TLD regex expanded from {2,6} to {2,10} to support longer TLDs
- Replaced fragile Google Maps CSS class selectors with role-based/aria selectors
- Replaced querySelectorAll('*') plus getComputedStyle scanning with targeted selectors (performance fix)
- Replaced mouseOver dispatch on ALL DOM elements with targeted email-related selectors only
- Replaced unsafe Math.min(...array) spread with reduce() to prevent stack overflow
- Replaced complex nested ternary timeout logic with lookup table in main.js
- Suspicious email pattern detection made less aggressive (fewer false rejections)
- postmaster@ emails moved from hard rejection to scoring penalty only
- Fast email extractor now collects from all sources instead of early-returning on Schema.org
- Removed unused _domain parameter from extractEmailsFromHtml
- Contact page keyword matching uses all 35+ languages instead of only EN+DE
- All log imports unified to use logger.js wrapper (email-extractor, fast-email-extractor, cache-utils)
- Actor.fail() is now followed by a return statement so execution actually stops (prevents a null.trim() crash)
- Removed invalid --max-old-space-size=4096 from browser --js-flags (not a V8 flag)
- Removed unused _scanAborted variable from email-extractor.js
- Removed unused BASE_DELAY constant from config.js
- Removed duplicate 'ansprechpartner' keyword in DE contact keywords
- Removed duplicate 'find us' keyword in EN contact keywords
- Removed unnecessary input = rawInput assignment in main.js
- Removed duplicate closeImpit() call in catch block (already in finally)
- Fixed deobfuscation rule \bat\b replacing standalone word "at" with @ (now context-aware)
- Dockerfile now runs npm cache clean --force before removing ~/.npm
- Playwright 1.58.2 includes latest Chromium security patches
- All dependencies updated to latest stable versions
- Fixed ajv ReDoS vulnerability (GHSA-2g4f-4pwh-qvx6) via npm override to >=8.18.0
- Fixed 4 HIGH-severity tar vulnerabilities (GHSA-r6q2, GHSA-34x7, GHSA-8qq5, GHSA-83g3) via npm override to >=7.5.8
- Fixed tmp symlink vulnerability (GHSA-52f5) via npm override to >=0.2.4
- Eliminated deprecated rimraf@3/glob@7/inflight chain via npm overrides (rimraf>=6.1.0, glob>=10.4.0)
- npm audit: 0 vulnerabilities (was 4: 1 moderate + 3 high)
- Separate input fields for businessType and location
- Hybrid email extraction using impit + cheerio as primary method
- Playwright fallback for JavaScript-heavy websites in email extraction
- Schema.org and JSON-LD structured data email extraction
- vCard and hCard microformat email detection
- Footer-first scanning for faster email discovery
- Social media profile extraction (LinkedIn, Facebook, Twitter, Instagram, TikTok)
- Google Maps place URL extraction
- Norwegian, Danish and Indonesian language support for contact page detection
- Node.js engine requirement (>=22.0.0)
- ESLint flat configuration for ES modules
- Prettier configuration for consistent code formatting
- Enhanced email validation with false positive filtering
- Input now uses businessType and location instead of single searchQuery field
- Email extraction is now 3-5x faster with the hybrid approach
- Improved data extraction accuracy and reliability
- Optimized extraction speed (2.5 seconds per result)
- Reduced wait times for faster extraction
- Rating extraction now uses F7nice selector as primary method
- Email regex updated with word boundaries to prevent garbage capture
- Email validation now filters local parts longer than 40 characters
- Email validation detects and rejects CamelCase concatenated text patterns
- Updated Playwright to 1.49.1
- Updated all dependencies to latest stable versions
- Replaced legacy ESLint config with modern flat config
- Improved input validation with proper type checking
- Simplified architecture for better performance
- Input schema default and prefill values now consistent (both 3)
- Output schema type corrected to array
- Removed unused imports and dead code
- Search query validation now properly fails when not provided
- Rating extraction improved with better fallback logic
- Reviews count extraction accuracy improved
- Email extraction no longer captures garbage text after TLD
- False positive emails like concatenated page text now filtered
- Fixed Playwright SSL certificate verification vulnerability
- Fixed body-parser denial of service vulnerability
- Fixed glob CLI command injection vulnerability
- Fixed js-yaml prototype pollution vulnerability
- Resolved all npm audit vulnerabilities (0 remaining)
- Nominatim geocoding integration (no longer needed)
- Grid search functionality (replaced with scroll-based loading)
- Zone-based request splitting
- Unused CONSENT_KEYWORDS arrays
- Unused OUTPUT_FORMATS constant
- Conflicting eslint-config-airbnb dependency
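
The deobfuscation entries above describe making the " at " and " dot " rules context-aware. The following is a minimal sketch of that idea, not the actor's actual code. The helper name `deobfuscateEmails` and the pattern are hypothetical. It uses capture groups rather than the lookbehind/lookahead the entry mentions, and it assumes that " at " only counts as obfuscation when followed by a "domain dot tld" tail.

```javascript
// Hypothetical helper for illustration; not taken from email-extractor.js.
// " at " is rewritten only when it is followed by "<label> dot <label>",
// so ordinary prose such as "look at this" is left untouched.
const OBFUSCATED_EMAIL =
  /\b([a-z0-9._%+-]+)\s+at\s+([a-z0-9-]+(?:\s+dot\s+[a-z0-9-]+)+)\b/gi;

function deobfuscateEmails(text) {
  return text.replace(OBFUSCATED_EMAIL, (_match, local, domain) => {
    // Join the domain labels: "example dot co dot uk" -> "example.co.uk"
    const joinedDomain = domain.replace(/\s+dot\s+/gi, '.');
    return `${local}@${joinedDomain}`;
  });
}

console.log(deobfuscateEmails('look at this'));
// -> look at this
console.log(deobfuscateEmails('write to info at example dot com'));
// -> write to info@example.com
console.log(deobfuscateEmails('sales AT example DOT co DOT uk'));
// -> sales@example.co.uk
```

Requiring the "dot" tail trades recall for precision: an address written as "info at example.com" would not be rewritten by this sketch and would need its own rule.
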
Local benchmark results with 97 businesses:
- Total extraction time: 6.5 minutes
- Average per result: 2.5 seconds
- Email extraction: 0.3 seconds per website
- Address coverage: 100%
- Phone coverage: 98%
- Website coverage: 99%
- Rating coverage: 100%
- Simplified actor to focus on core functionality: search query, results count, and email extraction
- Migrated Power Mode to dedicated actor (Unified Serper.dev ETL Processor)
- Power Mode functionality (moved to separate actor)
- Advanced configuration options
- Backward compatibility for categories parameter names
- Playwright version compatibility validation script
- Dockerfile CMD command reference
- Actor schema file paths
- Dataset schema types for rating and reviews
- Unused variables and redundant conditions
- Mutable rate limit config causing shared state issues
- Duplicate variable declarations
- Moved apify-cli from dependencies to devDependencies
- Category extraction from Google Maps
- Resource blocking for email extraction (50-60% bandwidth reduction)
- Dual-proxy architecture (DATACENTER for email, RESIDENTIAL for Maps)
- Bandwidth monitoring per extraction
- Safeguard in category filter for standard crawler
- Race condition in email extraction
- ChargingManager crash in failedRequestHandler
- Browser closed error during keyboard input
- Timeout inconsistency between handlers
- TypeError when reading null innerText
- Navigation to mailto links
- Frame detached errors
- Category filter eliminating all results
- Request handler timeout capped at 540 seconds
- Batch size reduced to 5 when email extraction is enabled
- Improved proxy stats and bandwidth logging
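
The successRate entry earlier in this changelog notes that .toFixed() returns a string. As a minimal sketch of why that matters (the helper names `formatRate` and `numericRate` are hypothetical, and the exact comparison used in main.js is not shown in this changelog), the snippet below contrasts comparing the raw .toFixed() result with comparing the Number()-wrapped value.

```javascript
// Hypothetical helpers for illustration; names are not taken from main.js.
function formatRate(succeeded, total) {
  // toFixed() always returns a string, e.g. "100.0"
  return ((succeeded / total) * 100).toFixed(1);
}

function numericRate(succeeded, total) {
  // Wrapping with Number() restores a numeric value for comparisons
  return Number(formatRate(succeeded, total));
}

console.log(typeof formatRate(97, 97));          // -> string
console.log(formatRate(97, 97) === 100);         // -> false (string vs number)
console.log(numericRate(97, 97) === 100);        // -> true

// Two formatted strings compare character by character, not numerically:
console.log(formatRate(19, 200) < formatRate(1, 10));   // "9.5" < "10.0" -> false
console.log(numericRate(19, 200) < numericRate(1, 10)); // 9.5 < 10 -> true
```

Keeping the numeric value for logic and calling .toFixed() only when formatting log output avoids this class of bug.
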