1"""
2Google Maps Lead Generation Scraper for Apify
3Uses Crawlee for better anti-detection and Apify's built-in proxy management
4"""
5
6from __future__ import annotations
7
8import asyncio
9import json
10import random
11import re
12from datetime import datetime, timedelta
13from typing import Any, Dict, List, Optional
14from urllib.parse import quote
15
16from apify import Actor
17from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
18from crawlee import Request
19
20
21from . import constants as const
22from .validators import validate_and_fail_if_invalid, validate_extracted_data, sanitize_user_agent
23from .lead_enrichment import LeadEnrichmentPipeline, enrich_leads_batch
24
25
class GoogleMapsLeadScraper:
    """Stateful helper for extracting business leads from Google Maps pages.

    Holds cross-request state for one crawl run:

    * ``results`` — per-search result dicts appended by the crawler handler.
    * ``processed_businesses`` — "name_address" keys already seen, used for
      de-duplication across scroll passes.
    * ``stats`` — attempt/success/failure counters for the email, phone and
      website extractors, reported as success rates at the end of the run.

    All extraction methods expect a Playwright ``page`` object and run
    JavaScript inside the page via ``page.evaluate``.
    """

    def __init__(self) -> None:
        # Accumulated per-search result dicts (one per processed request).
        self.results: List[Dict] = []
        # De-duplication keys of businesses already collected.
        self.processed_businesses: set = set()


        # Extraction telemetry; incremented by the extract_* methods below.
        self.stats: Dict[str, int] = {
            'email_attempts': 0,
            'email_successes': 0,
            'email_failures': 0,
            'phone_attempts': 0,
            'phone_successes': 0,
            'phone_failures': 0,
            'website_attempts': 0,
            'website_successes': 0,
            'website_failures': 0,
        }

    async def extract_email_from_website(self, website_url: str, page, timeout_ms: int = 5000, retry_count: int = 0) -> str:
        """
        Extract email from business website with retry logic and multiple fallback methods

        Args:
            website_url: The URL to extract email from
            page: Playwright page object
            timeout_ms: Timeout in milliseconds
            retry_count: Current retry attempt number

        Returns:
            Extracted email address or empty string
        """
        # Guard: only navigate to absolute http(s) URLs.
        if not website_url or not website_url.startswith('http'):
            return ''

        # NOTE(review): this counter is also incremented on each recursive
        # retry, so 'email_attempts' counts attempts, not distinct websites.
        self.stats['email_attempts'] += 1

        try:
            await page.goto(website_url, wait_until='domcontentloaded', timeout=timeout_ms)
            await page.wait_for_timeout(const.EMAIL_QUICK_WAIT_MS)

            # Strategy 1: mailto: links matched by the configured selectors.
            email = await page.evaluate(f'''() => {{
                const mailtoLinks = Array.from(document.querySelectorAll('{", ".join(const.EMAIL_LINK_SELECTORS)}'));
                for (const link of mailtoLinks) {{
                    const href = link.href || link.getAttribute('href') || '';
                    if (href.startsWith('mailto:')) {{
                        const email = href.replace('mailto:', '').split('?')[0];
                        if (email && email.includes('@')) {{
                            return email;
                        }}
                    }}
                }}
                return '';
            }}''')

            if email:
                self.stats['email_successes'] += 1
                return email

            # Strategy 2: regex-scan the whole page text, filtering out the
            # configured non-business patterns (EMAIL_EXCLUDE_PATTERNS).
            email = await page.evaluate(f'''() => {{
                const emailRegex = /{const.EMAIL_REGEX_PATTERN}/g;
                const pageText = document.body.textContent || '';
                const emails = pageText.match(emailRegex) || [];

                // Filter out common non-business emails
                const excludePatterns = {json.dumps(const.EMAIL_EXCLUDE_PATTERNS)};
                const filtered = emails.filter(email => {{
                    const lower = email.toLowerCase();
                    return !excludePatterns.some(pattern => lower.includes(pattern));
                }});

                return filtered[0] || '';
            }}''')

            if email:
                self.stats['email_successes'] += 1
                return email

            # Strategy 3: locate a "contact"-style link (by link text or href)
            # and repeat the regex scan on that page.
            contact_link = await page.evaluate(f'''() => {{
                const contactPatterns = {json.dumps(const.CONTACT_PAGE_PATTERNS)};
                const links = Array.from(document.querySelectorAll('a'));

                for (const link of links) {{
                    const text = (link.textContent || '').toLowerCase();
                    const href = (link.href || '').toLowerCase();

                    for (const pattern of contactPatterns) {{
                        if (text.includes(pattern) || href.includes(pattern)) {{
                            return link.href;
                        }}
                    }}
                }}
                return '';
            }}''')

            if contact_link:
                try:
                    await page.goto(contact_link, wait_until='domcontentloaded', timeout=timeout_ms)
                    await page.wait_for_timeout(const.EMAIL_QUICK_WAIT_MS)

                    # Same filtered regex scan as strategy 2, on the contact page.
                    email = await page.evaluate(f'''() => {{
                        const emailRegex = /{const.EMAIL_REGEX_PATTERN}/g;
                        const pageText = document.body.textContent || '';
                        const emails = pageText.match(emailRegex) || [];

                        const excludePatterns = {json.dumps(const.EMAIL_EXCLUDE_PATTERNS)};
                        const filtered = emails.filter(email => {{
                            const lower = email.toLowerCase();
                            return !excludePatterns.some(pattern => lower.includes(pattern));
                        }});

                        return filtered[0] || '';
                    }}''')

                    if email:
                        self.stats['email_successes'] += 1
                        return email
                except Exception as contact_error:
                    # Best-effort: a failed contact-page visit is not fatal.
                    Actor.log.debug(f"Contact page navigation failed: {contact_error}")

            # Nothing found: retry the whole extraction (bounded by
            # EMAIL_RETRY_ATTEMPTS) before recording a failure.
            if retry_count < const.EMAIL_RETRY_ATTEMPTS:
                Actor.log.debug(f"Email extraction failed, retrying ({retry_count + 1}/{const.EMAIL_RETRY_ATTEMPTS})...")
                await page.wait_for_timeout(const.EMAIL_RETRY_DELAY_MS)
                return await self.extract_email_from_website(website_url, page, timeout_ms, retry_count + 1)

            self.stats['email_failures'] += 1
            return ''

        except Exception as e:
            Actor.log.debug(f"Email extraction error: {e}")
            self.stats['email_failures'] += 1
            return ''

    async def extract_phone_from_business_page(self, page, business_url: str) -> str:
        """Extract phone number from individual business page with international format support"""
        self.stats['phone_attempts'] += 1

        try:
            # Navigate to the individual business detail page first.
            await page.goto(business_url, wait_until='domcontentloaded', timeout=const.BUSINESS_PAGE_TIMEOUT_MS)
            await page.wait_for_timeout(const.PHONE_EXTRACTION_WAIT_MS)

            # Try the configured selectors first, then fall back to scanning
            # the page text with a list of international phone regexes.
            phone = await page.evaluate(f'''() => {{
                // Current LeadLocator Pro business page phone selectors
                const phoneSelectors = {json.dumps(const.PHONE_SELECTORS)};

                for (const selector of phoneSelectors) {{
                    const phoneEl = document.querySelector(selector);
                    if (phoneEl) {{
                        const phoneText = phoneEl.textContent || phoneEl.getAttribute('aria-label') || phoneEl.href || '';
                        // Extract phone number pattern (international support)
                        const phoneMatch = phoneText.match(/[\\+]?[(]?[\\d\\s\\-\\(\\)]{{10,}}/);
                        if (phoneMatch) {{
                            return phoneMatch[0].trim();
                        }}
                    }}
                }}

                // Fallback: look for phone patterns in page text (international formats)
                const pageText = document.body.textContent || '';

                // Try multiple international phone patterns
                const patterns = [
                    /\\+1[-\\.\\s]?\\(?\\d{{3}}\\)?[-\\.\\s]?\\d{{3}}[-\\.\\s]?\\d{{4}}/g,  // US/CA with +1
                    /\\(?\\d{{3}}\\)?[-\\.\\s]?\\d{{3}}[-\\.\\s]?\\d{{4}}/g,  // US/CA format
                    /\\+44\\s?7\\d{{3}}\\s?\\d{{3}}\\s?\\d{{3}}/g,  // UK mobile
                    /\\+61\\s?\\d{{1}}\\s?\\d{{4}}\\s?\\d{{4}}/g,  // Australian
                    /\\+\\d{{1,3}}[-\\.\\s]?\\(?\\d{{1,4}}\\)?[-\\.\\s]?\\d{{1,5}}[-\\.\\s]?\\d{{1,5}}/g,  // Generic international
                ];

                for (const pattern of patterns) {{
                    const phoneMatch = pageText.match(pattern);
                    if (phoneMatch) {{
                        return phoneMatch[0].trim();
                    }}
                }}

                return '';
            }}''')

            if phone:
                self.stats['phone_successes'] += 1
            else:
                self.stats['phone_failures'] += 1

            return phone

        except Exception as e:
            Actor.log.debug(f"Phone extraction failed: {e}")
            self.stats['phone_failures'] += 1
            return ''

    async def extract_website_from_business_page(self, page) -> str:
        """Extract website URL from business page with enhanced selectors"""
        self.stats['website_attempts'] += 1

        try:
            # Assumes the page is already on a business detail page (see
            # extract_phone_from_business_page); google.com links are skipped
            # so Maps-internal links are not mistaken for the business site.
            website = await page.evaluate(f'''() => {{
                const websiteSelectors = {json.dumps(const.WEBSITE_SELECTORS)};

                for (const selector of websiteSelectors) {{
                    const websiteEl = document.querySelector(selector);
                    if (websiteEl && websiteEl.href && !websiteEl.href.includes('google.com')) {{
                        return websiteEl.href;
                    }}
                }}
                return '';
            }}''')

            if website:
                self.stats['website_successes'] += 1
            else:
                self.stats['website_failures'] += 1

            return website

        except Exception as e:
            Actor.log.debug(f"Website extraction failed: {e}")
            self.stats['website_failures'] += 1
            return ''


    async def extract_businesses_from_page(self, page) -> List[Dict]:
        """Extract business listings using 2025-current 🌐📍 LeadLocator Pro selectors

        Returns a list of dicts with name/rating/review_count/address/category/
        price_level/maps_url; phone, website and hours are returned empty
        because they are not present in the search-results view.
        """
        return await page.evaluate('''() => {
            const results = [];

            // 2025 Current 🌐📍 LeadLocator Pro business container selectors
            const businessContainerSelectors = [
                'div.Nv2PK a',  // Primary business container with link
                'div.tH5CWc a',  // Alternative container
                'div.THOPZb a',  // Another container variant
                'a.hfpxzc'  // Direct business link class
            ];

            const businessLinks = new Set();

            // Find all business links using current selectors
            businessContainerSelectors.forEach(selector => {
                document.querySelectorAll(selector).forEach(el => {
                    if (el.href && el.href.includes('/maps/place/')) {
                        businessLinks.add(el);
                    }
                });
            });

            console.log(`Found ${businessLinks.size} business links`);

            businessLinks.forEach((element) => {
                try {
                    // Find the business container (parent)
                    const container = element.closest('div.Nv2PK, div.tH5CWc, div.THOPZb') || element.parentElement;
                    if (!container) return;

                    // Extract business name using current selectors
                    let name = '';
                    const nameSelectors = [
                        '.qBF1Pd',  // Current primary name class
                        '[class*="fontHeadlineSmall"]',  // Fallback
                        '[aria-level="3"]',  // Accessibility fallback
                        'h3'  // HTML fallback
                    ];

                    for (const selector of nameSelectors) {
                        const nameEl = container.querySelector(selector);
                        if (nameEl && nameEl.textContent.trim()) {
                            name = nameEl.textContent.trim();
                            break;
                        }
                    }

                    if (!name || name.length < 2) return;

                    // Extract rating using current class
                    let rating = 0;
                    const ratingEl = container.querySelector('.MW4etd');
                    if (ratingEl) {
                        const ratingText = ratingEl.textContent || ratingEl.getAttribute('aria-label') || '';
                        const ratingMatch = ratingText.match(/([0-9][.,][0-9])/);
                        if (ratingMatch) {
                            rating = parseFloat(ratingMatch[1].replace(',', '.'));
                        }
                    }

                    // Extract review count using current class
                    let reviewCount = 0;
                    const reviewEl = container.querySelector('.UY7F9');
                    if (reviewEl) {
                        const reviewText = reviewEl.textContent || '';
                        const reviewMatch = reviewText.match(/([\\d,]+)/);
                        if (reviewMatch) {
                            reviewCount = parseInt(reviewMatch[1].replace(/,/g, ''));
                        }
                    }

                    // Extract address - look for address patterns in spans
                    let address = '';
                    const spans = container.querySelectorAll('span');
                    for (const span of spans) {
                        const text = span.textContent.trim();
                        // Look for address patterns
                        if ((text.includes('Street') || text.includes('Boulevard') || text.includes('Avenue') ||
                             text.includes('St') || text.includes('Blvd') || text.includes('Ave') ||
                             text.includes('Drive') || text.includes('Dr') || text.includes('Road') || text.includes('Rd')) &&
                            text.match(/\\d+/)) {
                            address = text.replace(/^[·•]\\s*/, '').trim();  // Remove leading bullet points
                            break;
                        }
                    }

                    // Extract category - look for business type descriptions
                    let category = '';
                    for (const span of spans) {
                        const text = span.textContent.trim();
                        // Look for category patterns (longer descriptive text)
                        if (text.length > 10 && text.length < 60 &&
                            !text.match(/^[0-9.(),·•\\s]+$/) &&  // Not just numbers/symbols
                            !text.includes('AM') && !text.includes('PM') &&  // Not hours
                            !address.includes(text) &&  // Not the address
                            !text.includes('Yamedhaminiwa') &&  // Skip translated terms
                            text !== name) {  // Not the business name
                            category = text;
                            break;
                        }
                    }

                    // Extract price level
                    let priceLevel = '';
                    const priceEl = container.querySelector('.wcldff.fontHeadlineSmall.Cbys4b');
                    if (priceEl) {
                        priceLevel = priceEl.textContent.trim();
                    }

                    // Note: Phone and website are typically NOT available in search results
                    // They require individual business page navigation

                    results.push({
                        name: name,
                        rating: rating,
                        review_count: reviewCount,
                        address: address,
                        category: category,
                        phone: '',  // Not available in search results
                        website: '',  // Not available in search results
                        price_level: priceLevel,
                        hours: '',  // Not available in search results
                        maps_url: element.href,
                        extracted_at: new Date().toISOString()
                    });

                } catch (e) {
                    console.error('Error extracting business:', e);
                }
            });

            return results;
        }''')

    async def scroll_results_panel(self, page) -> bool:
        """Scroll the results panel using current Google Maps structure

        Returns True when the scroll position actually advanced (i.e. more
        results may have loaded); False when nothing moved or an error occurred.
        """
        try:
            scrolled = await page.evaluate('''() => {
                // Current 🌐📍 LeadLocator Pro scrollable container selectors
                const scrollContainerSelectors = [
                    '[role="feed"]',  // Primary results feed
                    'div[style*="overflow-y: scroll"]',  // Scrollable div
                    'div[aria-label*="Results"]',  // Results container by aria-label
                    '.m6QErb .DxyBCb',  // Legacy fallback
                    '[data-value="Search results"]'  // Data attribute approach
                ];

                for (const selector of scrollContainerSelectors) {
                    const container = document.querySelector(selector);
                    if (container) {
                        const scrollTarget = container.parentElement || container;
                        const oldScrollTop = scrollTarget.scrollTop;

                        // Fast scroll to bottom
                        scrollTarget.scrollTo({
                            top: scrollTarget.scrollHeight,
                            behavior: 'instant'  // Fastest possible scroll
                        });

                        const newScrollTop = scrollTarget.scrollTop;
                        console.log(`Scrolled from ${oldScrollTop} to ${newScrollTop}`);
                        return newScrollTop > oldScrollTop;
                    }
                }

                // Last resort: page scroll
                const oldY = window.scrollY;
                window.scrollTo({
                    top: document.body.scrollHeight,
                    behavior: 'instant'
                });
                return window.scrollY > oldY;
            }''')

            if scrolled:
                # Give lazily-loaded results a moment to render.
                await page.wait_for_timeout(800)

            return scrolled
        except Exception as e:
            Actor.log.debug(f"Scroll attempt failed: {e}")
            return False
435
436
437
async def main() -> None:
    """Main entry point for the 🌐📍 LeadLocator Pro scraper.

    Reads actor input, configures a PlaywrightCrawler with optional Apify
    proxies, scrapes Google Maps search results (with optional per-business
    phone/website enhancement and website email extraction), optionally runs
    premium enrichment, charges per-lead events, and pushes each business to
    the dataset.
    """

    async with Actor:
        Actor.log.info(f'{const.EMOJI_START} LeadLocator Pro Lead Scraper starting with Crawlee...')

        # --- Input -------------------------------------------------------
        actor_input = await Actor.get_input() or {}
        Actor.log.info(f'📥 Input received: {json.dumps(actor_input, indent=2)}')

        # Fails the run with a clear message on invalid input.
        await validate_and_fail_if_invalid(actor_input)

        searches = actor_input.get('searches', [])
        max_results = actor_input.get('max_results_per_search', 20)
        use_proxies = actor_input.get('use_proxies', True)
        extract_emails = actor_input.get('extract_emails', False)
        enhanced_extraction = actor_input.get('enhanced_extraction', False)
        delay_between_requests = actor_input.get('delay_between_requests', 2)  # NOTE(review): currently unused
        # NOTE(review): input appears to be rating*10 (e.g. 45 -> 4.5 stars) — confirm against the input schema.
        minimum_rating = actor_input.get('minimum_rating', 0) / 10.0
        minimum_reviews = actor_input.get('minimum_reviews', 0)
        timeout_per_search = actor_input.get('timeout_per_search', 5) * 60  # minutes -> seconds

        # Premium feature flags.
        verify_emails = actor_input.get('verify_emails', False)
        find_social_profiles = actor_input.get('find_social_profiles', False)
        score_leads = actor_input.get('score_leads', False)
        premium_enrichment = actor_input.get('premium_enrichment', False)

        # Heuristic: small, short runs are treated as QA runs and skip the
        # slow per-business enhancement phase.
        total_expected = sum(s.get('max_results', max_results) for s in searches if isinstance(s, dict))
        is_qa_run = total_expected <= 50 and timeout_per_search <= 300

        # Premium enrichment implies all individual premium features.
        if premium_enrichment:
            verify_emails = True
            find_social_profiles = True
            score_leads = True
            Actor.log.info('💎 PREMIUM ENRICHMENT MODE: All premium features enabled')

        Actor.log.info(f'⚙️ Configuration: extract_emails={extract_emails}, enhanced_extraction={enhanced_extraction}')
        Actor.log.info(f'💎 Premium Features: verify_emails={verify_emails}, find_social={find_social_profiles}, score_leads={score_leads}')
        Actor.log.info(f'⏱️ Timeout per search: {timeout_per_search}s, Expected results: {total_expected}')
        if not extract_emails and not enhanced_extraction:
            Actor.log.info(f'🚀 FAST MODE: Basic extraction only (completes in 2-3 min). Enable email/enhanced extraction for premium leads.')

        # Backward-compat: accept a single query/location pair as input.
        if not searches:
            query = actor_input.get('query', '')
            location = actor_input.get('location', '')

            if query:
                searches = [{'query': query, 'location': location}]
                Actor.log.info(f'✅ Converted single search: "{query}" in "{location}"')
            else:
                Actor.log.error('❌ No search queries provided!')
                Actor.log.error('Expected format: {"searches": [{"query": "coffee shops", "location": "Seattle, WA"}]}')
                raise ValueError('No search queries provided')

        Actor.log.info(f'📊 Will process {len(searches)} search(es), max {max_results} results each')

        # --- Proxy configuration (best-effort) ---------------------------
        proxy_configuration = None
        if use_proxies:
            try:
                proxy_configuration = await Actor.create_proxy_configuration()
                if proxy_configuration:
                    Actor.log.info('✅ Apify proxy configuration created')
                else:
                    Actor.log.warning('⚠️ No proxy configuration available')
            except Exception as e:
                Actor.log.warning(f'⚠️ Proxy setup failed: {e}')

        # Shared scraper state (dedup, stats, results).
        scraper = GoogleMapsLeadScraper()

        import os
        is_headless = os.getenv('APIFY_HEADLESS', '1') == '1'

        # APIFY_IS_AT_HOME is set on the Apify platform; absent => local run.
        is_local = not os.getenv('APIFY_IS_AT_HOME')

        browser_launch_options = {}
        if is_local:
            # Local Chromium often needs the sandbox disabled.
            browser_launch_options['args'] = ['--no-sandbox', '--disable-setuid-sandbox']

        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            browser_type='chromium',
            headless=is_headless,
            request_handler_timeout=timedelta(minutes=8),
            max_request_retries=const.MAX_REQUEST_RETRIES,
            browser_launch_options=browser_launch_options,
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            """Process one Google Maps search URL: scroll, extract, enhance."""
            page = context.page
            request = context.request

            try:
                # Search metadata was JSON-encoded into user_data when the
                # request was created (see start_urls below).
                search_info = json.loads(request.user_data.get('search_info', '{}'))
                search_term = search_info.get('term', 'unknown')
                max_results_for_search = search_info.get('max_results', max_results)

                search_start_time = datetime.now()
                search_timeout = timedelta(seconds=timeout_per_search)

                Actor.log.info(f'🔍 Processing: {search_term} (timeout: {timeout_per_search}s)')

                await page.set_viewport_size({'width': 1920, 'height': 1080})

                # Navigation with one retry and a randomized backoff.
                navigation_successful = False
                for attempt in range(2):
                    try:
                        await page.goto(request.url, wait_until='domcontentloaded', timeout=30000)
                        navigation_successful = True
                        Actor.log.info(f'✅ Navigation successful on attempt {attempt + 1}')
                        break
                    except Exception as nav_error:
                        Actor.log.warning(f'⚠️ Navigation attempt {attempt + 1} failed: {nav_error}')
                        if attempt < 1:
                            await page.wait_for_timeout(random.randint(2000, 5000))

                if not navigation_successful:
                    Actor.log.error(f'❌ Failed to navigate to {request.url}')
                    return

                await page.wait_for_timeout(3000)

                # Wait for any of the known results-list selectors.
                results_found = False
                selectors_to_try = [
                    '[role="feed"]',
                    'a[href*="/maps/place/"]',
                    'div[jsaction*="mouseover"]',
                    '[data-result-index]',
                    'div[role="article"]'
                ]

                for selector in selectors_to_try:
                    try:
                        await page.wait_for_selector(selector, timeout=const.SELECTOR_WAIT_MS)
                        Actor.log.info(f'{const.EMOJI_SUCCESS} Results found with selector: {selector}')
                        results_found = True
                        break
                    except Exception as e:
                        Actor.log.debug(f"Selector '{selector}' not found: {e}")
                        continue

                if not results_found:
                    page_title = await page.title()
                    Actor.log.error(f'❌ No results found. Page title: {page_title}')
                    return

                # --- Phase 1: scroll + collect basic listings ------------
                all_businesses = []
                scroll_attempts = 0
                max_scroll_attempts = 3
                last_count = 0

                max_businesses_to_extract = max_results_for_search
                if is_qa_run:
                    max_businesses_to_extract = min(max_results_for_search, 50)

                basic_businesses = []
                while len(basic_businesses) < max_businesses_to_extract and scroll_attempts < max_scroll_attempts:
                    # Hard per-search timeout.
                    if datetime.now() - search_start_time > search_timeout:
                        Actor.log.warning(f'⏱️ Search timeout reached for: {search_term}')
                        break

                    businesses = await scraper.extract_businesses_from_page(page)

                    for business in businesses:
                        # Apply quality filters before dedup/collection.
                        if business.get('rating', 0) < minimum_rating:
                            continue

                        if business.get('review_count', 0) < minimum_reviews:
                            continue

                        business_key = f"{business['name']}_{business.get('address', '')}"
                        if business_key not in scraper.processed_businesses:
                            scraper.processed_businesses.add(business_key)
                            Actor.log.info(f'📋 Found business: {business["name"]} ({business.get("rating", 0)}⭐ {business.get("review_count", 0)} reviews)')
                            basic_businesses.append(business)

                    Actor.log.info(f'📈 Found {len(basic_businesses)} businesses so far...')

                    if len(basic_businesses) >= max_businesses_to_extract:
                        break

                    # Stop after two consecutive scrolls without new results.
                    if len(basic_businesses) == last_count:
                        scroll_attempts += 1
                        if scroll_attempts >= 2:
                            Actor.log.info('🔚 No new results after scrolling')
                            break
                    else:
                        scroll_attempts = 0
                        last_count = len(basic_businesses)

                    scrolled = await scraper.scroll_results_panel(page)
                    if not scrolled:
                        Actor.log.info('🔚 Cannot scroll further')
                        break

                    await page.wait_for_timeout(random.randint(300, 800))

                # --- Phase 2: optional phone/website enhancement ---------
                all_businesses = []
                time_remaining = search_timeout - (datetime.now() - search_start_time)

                if enhanced_extraction and not is_qa_run:
                    # Visit detail pages for the highest-value businesses first
                    # (ranked by rating * review_count).
                    high_value_businesses = sorted(basic_businesses,
                        key=lambda b: (b.get('rating', 0) * b.get('review_count', 0)), reverse=True)

                    Actor.log.info(f'💎 Phase 2: Enhancing high-value businesses with phone/website data')

                    enhancement_budget = min(10, len(high_value_businesses))
                    if time_remaining.total_seconds() > 60:

                        for i, business in enumerate(high_value_businesses[:enhancement_budget]):
                            enhancement_start = datetime.now()

                            # Stop enhancing at 85% of the search budget; keep
                            # remaining businesses with basic data only.
                            if (datetime.now() - search_start_time).total_seconds() > timeout_per_search * 0.85:
                                Actor.log.warning(f'⏱️ Enhancement time limit reached, processing remaining {len(high_value_businesses) - i} businesses with basic data')
                                all_businesses.extend(high_value_businesses[i:])
                                break

                            try:
                                Actor.log.info(f'📞 Extracting phone/website for high-value business: {business["name"]} ({business.get("rating", 0)}⭐)')

                                phone = await scraper.extract_phone_from_business_page(page, business['maps_url'])
                                if phone:
                                    business['phone'] = phone
                                    Actor.log.info(f'✅ Found phone: {phone}')

                                if not business.get('website'):
                                    website = await scraper.extract_website_from_business_page(page)
                                    if website:
                                        business['website'] = website
                                        Actor.log.info(f'✅ Found website: {website}')

                                all_businesses.append(business)

                                # Return to the results list for the next one.
                                await page.go_back()
                                await page.wait_for_timeout(500)

                            except Exception as e:
                                Actor.log.warning(f'❌ Enhancement failed for {business["name"]}: {e}')
                                all_businesses.append(business)

                            # Bail out if a single enhancement is too slow.
                            elapsed = (datetime.now() - enhancement_start).total_seconds()
                            if elapsed > 15:
                                Actor.log.warning(f'⏱️ Enhancement taking too long ({elapsed:.1f}s), switching to basic mode')
                                all_businesses.extend(high_value_businesses[i+1:])
                                break
                        else:
                            # BUGFIX: when the loop completed normally, businesses
                            # beyond the enhancement budget were previously dropped
                            # from the results entirely. Keep them with basic data.
                            all_businesses.extend(high_value_businesses[enhancement_budget:])
                    else:
                        Actor.log.info('⚡ Insufficient time for enhancements, using basic data only')
                        all_businesses = basic_businesses
                else:
                    Actor.log.info('⚡ Enhanced extraction disabled or QA mode, using basic data only')
                    all_businesses = basic_businesses

                # --- Phase 3: optional email extraction ------------------
                if extract_emails and all_businesses:
                    time_left = search_timeout - (datetime.now() - search_start_time)
                    if time_left.total_seconds() > 15:
                        Actor.log.info('📧 Phase 3: Extracting emails from business websites...')

                        # Prioritize businesses that already have a phone —
                        # phone+email is the highest-value combination.
                        businesses_with_websites = [b for b in all_businesses if b.get('website')]
                        businesses_with_phone = [b for b in businesses_with_websites if b.get('phone')]
                        businesses_without_phone = [b for b in businesses_with_websites if not b.get('phone')]

                        email_candidates = businesses_with_phone + businesses_without_phone

                        Actor.log.info(f'📧 Email extraction plan: {len(businesses_with_phone)} businesses with phones, {len(businesses_without_phone)} without phones')

                        # Budget at most 5 sites; split remaining time evenly,
                        # clamped to a 2-4s per-site timeout.
                        email_budget = min(5, len(email_candidates))
                        if email_budget > 0:
                            time_per_email = min(4000, max(2000, time_left.total_seconds() * 1000 / email_budget))
                        else:
                            Actor.log.info('⚠️ No websites found to extract emails from')
                            time_per_email = 2000

                        for i, business in enumerate(email_candidates[:email_budget]):
                            # Stop at 95% of the search budget.
                            if (datetime.now() - search_start_time).total_seconds() > timeout_per_search * 0.95:
                                Actor.log.warning('⏱️ Email extraction time limit reached')
                                break

                            try:
                                priority_marker = "🔥" if business.get('phone') else "📧"
                                Actor.log.info(f'{priority_marker} Extracting email for: {business["name"]}')

                                email = await scraper.extract_email_from_website(business['website'], page, int(time_per_email))
                                if email:
                                    business['email'] = email
                                    Actor.log.info(f'✅ Found email: {email}')
                                else:
                                    business['email'] = ''

                                # Small randomized pause between sites.
                                if i < email_budget - 1:
                                    await page.wait_for_timeout(random.randint(500, 1000))

                            except Exception as e:
                                Actor.log.debug(f'Email extraction failed for {business["name"]}: {e}')
                                business['email'] = ''

                        # Ensure every record has an 'email' key for the dataset.
                        for business in all_businesses:
                            if not business.get('email'):
                                business['email'] = ''

                    else:
                        Actor.log.info('⏱️ Insufficient time for email extraction, skipping')
                        for business in all_businesses:
                            business['email'] = ''

                # --- Result assembly -------------------------------------
                result = {
                    'search_query': search_info.get('query', ''),
                    'location': search_info.get('location', ''),
                    'search_term': search_term,
                    'businesses': all_businesses[:max_businesses_to_extract],
                    'total_results': len(all_businesses[:max_businesses_to_extract]),
                    'timestamp': datetime.now().isoformat(),
                    'scraping_info': {
                        'max_requested': max_results_for_search,
                        'extracted': len(all_businesses[:max_businesses_to_extract]),
                        'scroll_attempts': scroll_attempts,
                        'extraction_method': 'hybrid_smart_extraction',
                        'businesses_with_phones': len([b for b in all_businesses if b.get('phone')]),
                        'businesses_with_emails': len([b for b in all_businesses if b.get('email')]),
                        'businesses_with_websites': len([b for b in all_businesses if b.get('website')]),
                        'high_value_leads': len([b for b in all_businesses if b.get('phone') and b.get('email')])
                    }
                }

                await context.push_data(result)
                scraper.results.append(result)

                # Per-search quality summary.
                phones_count = len([b for b in result["businesses"] if b.get('phone')])
                emails_count = len([b for b in result["businesses"] if b.get('email')])
                websites_count = len([b for b in result["businesses"] if b.get('website')])
                premium_leads = len([b for b in result["businesses"] if b.get('phone') and b.get('email')])

                Actor.log.info(f'✅ Extracted {len(result["businesses"])} businesses for "{search_term}"')
                Actor.log.info(f'💎 Lead Quality: {phones_count} phones | {emails_count} emails | {websites_count} websites | {premium_leads} premium leads (phone+email)')

                if premium_leads > 0:
                    Actor.log.info(f'🔥 HIGH VALUE: {premium_leads} businesses have both phone AND email - these are your money-making leads!')

            except Exception as e:
                Actor.log.exception(f'❌ Error processing {request.url}: {e}')

        # --- Build start requests (one per search) -----------------------
        start_urls = []
        for search_item in searches:
            try:
                # Accept both plain-string and {'query', 'location'} items.
                if isinstance(search_item, str):
                    query = search_item
                    location = ''
                else:
                    query = search_item.get('query', '')
                    location = search_item.get('location', '')

                search_term = f"{query} {location}".strip()
                maps_url = f'https://www.google.com/maps/search/{quote(search_term)}'

                request = Request.from_url(
                    maps_url,
                    user_data={
                        'search_info': json.dumps({
                            'query': query,
                            'location': location,
                            'term': search_term,
                            'max_results': search_item.get('max_results', max_results) if isinstance(search_item, dict) else max_results
                        })
                    }
                )
                start_urls.append(request)

            except Exception as e:
                Actor.log.error(f'❌ Error preparing search "{search_item}": {e}')

        Actor.log.info('🚀 Starting crawler...')
        await crawler.run(start_urls)

        # Flatten per-search results into one list of business records.
        all_extracted_businesses = []
        for result in scraper.results:
            all_extracted_businesses.extend(result.get('businesses', []))

        enriched_businesses = all_extracted_businesses

        # --- Optional premium enrichment ---------------------------------
        if (verify_emails or find_social_profiles or score_leads) and all_extracted_businesses:
            Actor.log.info(f'💎 Starting Premium Enrichment for {len(all_extracted_businesses)} leads...')

            enriched_businesses, _ = await enrich_leads_batch(
                all_extracted_businesses,
                verify_emails=verify_emails,
                find_social=find_social_profiles,
                score_leads=score_leads,
                max_concurrent=3
            )

            Actor.log.info(f'✅ Enrichment complete: {len(enriched_businesses)} leads processed')

            verified_count = len([b for b in enriched_businesses if b.get('email_verification', {}).get('is_deliverable')])
            social_count = len([b for b in enriched_businesses if b.get('social_profiles')])
            scored_count = len([b for b in enriched_businesses if b.get('lead_score')])
            premium_count = len([b for b in enriched_businesses if b.get('is_premium_lead')])

            Actor.log.info(f'📊 Enrichment Stats: {verified_count} verified emails | {social_count} with social | {scored_count} scored | {premium_count} premium leads')

        # --- Pay-per-event charging --------------------------------------
        Actor.log.info('💰 Processing charges...')

        # A lead counts as "enriched" if it carries any premium signal.
        enriched_count = len([b for b in enriched_businesses
                              if b.get('email') or b.get('social_profiles') or b.get('lead_score')])
        basic_count = len(enriched_businesses) - enriched_count

        if basic_count > 0:
            await Actor.charge(event_name='basic_lead', count=basic_count)
            Actor.log.info(f"  💵 basic_lead: {basic_count} × $0.01 = ${basic_count * 0.01:.3f}")

        if enriched_count > 0:
            await Actor.charge(event_name='enriched_lead', count=enriched_count)
            Actor.log.info(f"  💵 enriched_lead: {enriched_count} × $0.03 = ${enriched_count * 0.03:.3f}")

        total_revenue = (basic_count * 0.01) + (enriched_count * 0.03)
        Actor.log.info(f'💰 Total charges: ${total_revenue:.3f}')

        # Push one dataset item per business.
        for business in enriched_businesses:
            await Actor.push_data(business)

        # --- Final run summary -------------------------------------------
        total_businesses = len(enriched_businesses)
        Actor.log.info('=' * 60)
        Actor.log.info(f'{const.EMOJI_INFO} SCRAPING COMPLETED')
        Actor.log.info(f'{const.EMOJI_SUCCESS} Searches completed: {len(scraper.results)}/{len(searches)}')
        Actor.log.info(f'{const.EMOJI_EXTRACTION} Total businesses extracted: {total_businesses}')

        if scraper.stats['email_attempts'] > 0:
            email_success_rate = (scraper.stats['email_successes'] / scraper.stats['email_attempts']) * 100
            Actor.log.info(f'{const.EMOJI_EMAIL} Email extraction: {scraper.stats["email_successes"]}/{scraper.stats["email_attempts"]} ({email_success_rate:.1f}% success rate)')

        if scraper.stats['phone_attempts'] > 0:
            phone_success_rate = (scraper.stats['phone_successes'] / scraper.stats['phone_attempts']) * 100
            Actor.log.info(f'{const.EMOJI_PHONE} Phone extraction: {scraper.stats["phone_successes"]}/{scraper.stats["phone_attempts"]} ({phone_success_rate:.1f}% success rate)')

        if scraper.stats['website_attempts'] > 0:
            website_success_rate = (scraper.stats['website_successes'] / scraper.stats['website_attempts']) * 100
            Actor.log.info(f'{const.EMOJI_WEBSITE} Website extraction: {scraper.stats["website_successes"]}/{scraper.stats["website_attempts"]} ({website_success_rate:.1f}% success rate)')

        Actor.log.info('=' * 60)
936
937
if __name__ == "__main__":
    # Script entry point: run the async actor main loop to completion.
    asyncio.run(main())