
Website Contact Scraper - AI-Powered Lead Finder
Pricing: Pay per event
AI-powered website scraper that extracts real contact data from company websites. It finds people, positions, emails, and phone numbers using LLM technology, scanning team pages, contact sections, and company info. Suited to B2B lead generation and sales research.
Last modified: 21 days ago
Advanced Contact and Company Data Scraper with LLM (JavaScript)
This advanced template scrapes company websites using AI-powered analysis to extract structured contact information and company data. The actor uses OpenAI's GPT models to intelligently identify real people, their positions, and contact details while filtering out marketing content and placeholder text.
Unlike basic scrapers that rely on pattern matching, this actor leverages Large Language Models (LLMs) to understand context and extract only genuine contact information from Swiss and German company websites.
Key Features
- Apify SDK – toolkit for building Actors
- Input schema – validates input parameters including OpenAI API key
- Dataset – stores structured company and contact data
- Axios client – HTTP client with retry logic and proper headers
- Cheerio – HTML parsing and DOM manipulation
- OpenAI API – GPT models for intelligent content analysis
- Multi-page crawling – discovers and analyzes multiple pages from the same domain
- Smart URL evaluation – AI-powered scoring of page relevance for contact data
- Swiss/German phone formatting – proper formatting for +41 and +49 numbers
- Aggressive content filtering – removes navigation, ads, and irrelevant content
How it works
Phase 1: URL Discovery
- collectAllUrls() crawls the website starting from the main URL
- Discovers internal links up to the specified depth (default: 2 levels)
- Filters out non-content URLs (images, PDFs, etc.)
- Collects page titles and H1 tags for context
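The discovery phase above can be sketched as a depth-limited, breadth-first walk over internal links. This is only an illustration, not the Actor's actual collectAllUrls() code: the names collectUrlsSketch, filterInternalContentUrls, and the getLinks callback are hypothetical, and the extension list is an assumption. In the real Actor the link lookup would be an asynchronous HTTP request; here it is a synchronous callback so the logic stays easy to follow.

```javascript
// Assumed list of non-content file types; the Actor's real list is not documented.
const NON_CONTENT_EXTENSIONS = /\.(png|jpe?g|gif|svg|webp|ico|pdf|zip|css|js|mp4|mp3)$/i;

// Hypothetical helper: keeps only same-host, content-like links found on a page.
function filterInternalContentUrls(pageUrl, hrefs) {
    const base = new URL(pageUrl);
    const kept = new Set();
    for (const href of hrefs) {
        let resolved;
        try {
            resolved = new URL(href, base); // resolves relative links against the page
        } catch {
            continue; // skip malformed hrefs
        }
        if (!/^https?:$/.test(resolved.protocol)) continue; // drops mailto:, tel:, javascript:
        if (resolved.hostname !== base.hostname) continue; // internal links only
        if (NON_CONTENT_EXTENSIONS.test(resolved.pathname)) continue; // images, PDFs, etc.
        resolved.hash = ''; // treat #fragments as the same page
        kept.add(resolved.href);
    }
    return [...kept];
}

// Hypothetical crawl loop: getLinks(url) returns the raw hrefs found on that page.
function collectUrlsSketch(startUrl, getLinks, maxDepth = 2) {
    const start = new URL(startUrl).href; // normalize, e.g. adds the trailing slash
    const visited = new Set([start]);
    let frontier = [start];
    for (let depth = 0; depth < maxDepth; depth++) {
        const next = [];
        for (const url of frontier) {
            for (const link of filterInternalContentUrls(url, getLinks(url))) {
                if (!visited.has(link)) {
                    visited.add(link);
                    next.push(link);
                }
            }
        }
        frontier = next; // links found at this level are expanded at the next one
    }
    return [...visited];
}
```

With maxDepth set to 2, pages linked from the start URL and pages linked from those are collected, but links found on the second level are not followed further.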
Phase 2: Intelligent Page Evaluation
- evaluateUrlsWithLLM() uses GPT to score each discovered URL
- Evaluates likelihood of containing contact information (0-10 scale)
- Prioritizes pages like "Team", "Kontakt", "About Us", "Impressum"
- Homepage always receives high priority score
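Once the model has returned a score per URL, the selection step might look like the sketch below. It assumes the scores arrive as an array of { url, score } objects and that the homepage is pinned to the maximum score of 10; the source only says the homepage gets a "high priority score", so the exact value is a guess. The function name selectRelevantPages is hypothetical.

```javascript
// Hypothetical selection step applied after the LLM has scored each URL.
function selectRelevantPages(scoredPages, { homepageUrl, minRelevanceScore = 7, maxPages = 50 } = {}) {
    return scoredPages
        .map((page) => {
            // LLM output can be a string or garbage, so coerce and clamp to 0-10.
            const raw = Number(page.score);
            const score = Number.isFinite(raw) ? Math.min(10, Math.max(0, raw)) : 0;
            // Assumption: the homepage is always given the top score.
            return { ...page, score: page.url === homepageUrl ? 10 : score };
        })
        .filter((page) => page.score >= minRelevanceScore)
        .sort((a, b) => b.score - a.score) // most relevant first
        .slice(0, maxPages);
}
```

Clamping and defaulting unparseable scores to 0 keeps a malformed model response from pulling irrelevant pages into the extraction phase.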
Phase 3: AI-Powered Content Extraction
- scrapeRelevantPages() processes only high-scoring pages
- extractContactsWithLLM() uses GPT to identify real people:
  - Filters out marketing slogans and placeholder text
  - Recognizes proper name structures (first and last name, "Vor- und Nachname")
  - Identifies positions and contact details near names
  - Removes fake names like "Max Mustermann"
- extractCompanyInfo() finds general company contact data:
  - mailto: and tel: links
  - JSON-LD structured data
  - Footer and header contact information
- extractCompanyNameWithLLM() determines the official company name
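The non-LLM part of this phase, pulling general contact data from mailto:/tel: links and JSON-LD, can be approximated as follows. This is a dependency-free sketch that uses regular expressions as a stand-in for Cheerio; it is not the Actor's extractCompanyInfo() implementation. The function name is hypothetical, and nested JSON-LD structures such as @graph are deliberately not handled.

```javascript
// Hypothetical, regex-based stand-in for the Cheerio extraction described above.
function extractCompanyInfoSketch(html) {
    const emails = new Set();
    const phones = new Set();

    // 1) mailto: and tel: links; anything after "?" (e.g. ?subject=) is ignored.
    for (const match of html.matchAll(/href\s*=\s*["'](mailto|tel):([^"'?]+)/gi)) {
        const value = match[2].trim();
        if (match[1].toLowerCase() === 'mailto') emails.add(value.toLowerCase());
        else phones.add(value);
    }

    // 2) JSON-LD blocks: read top-level "email" and "telephone" properties.
    const jsonLdPattern = /<script[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
    for (const match of html.matchAll(jsonLdPattern)) {
        let data;
        try {
            data = JSON.parse(match[1]);
        } catch {
            continue; // invalid JSON-LD is common in the wild; skip it
        }
        for (const node of [].concat(data)) { // accepts a single object or an array
            if (typeof node?.email === 'string') {
                emails.add(node.email.replace(/^mailto:/i, '').toLowerCase());
            }
            if (typeof node?.telephone === 'string') phones.add(node.telephone);
        }
    }

    return { emails: [...emails], phones: [...phones] };
}
```

Phone numbers are returned as found, so the same number can appear in two spellings until the formatting step normalizes them.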
Phase 4: Data Validation and Deduplication
- Validates email addresses and phone numbers
- Formats Swiss phone numbers (+41 xx xxx xx xx)
- Deduplicates contacts by name (allows multiple people with same email)
- Filters out invalid or placeholder data
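A minimal sketch of the formatting and deduplication steps, assuming Swiss numbers have nine digits after the country code and that contacts are plain objects with a name field. Both function names are hypothetical. The placeholder list contains "Max Mustermann" from the description above plus two further entries added here as assumptions.

```javascript
// Hypothetical formatter: returns "+41 xx xxx xx xx" or null if the input does not fit.
function formatSwissPhone(raw) {
    if (typeof raw !== 'string') return null;
    let digits = raw.replace(/\D/g, ''); // keep digits only
    if (digits.startsWith('0041')) digits = digits.slice(4);
    else if (digits.startsWith('41') && digits.length > 9) digits = digits.slice(2);
    if (digits.startsWith('0')) digits = digits.slice(1); // national trunk prefix, also "(0)"
    if (digits.length !== 9) return null;
    return `+41 ${digits.slice(0, 2)} ${digits.slice(2, 5)} ${digits.slice(5, 7)} ${digits.slice(7)}`;
}

// Assumed placeholder names; only "Max Mustermann" is named in the description.
const PLACEHOLDER_NAMES = new Set(['max mustermann', 'erika mustermann', 'john doe']);

// Hypothetical dedupe: the key is the name, so two people sharing one email both survive.
function dedupeContactsByName(contacts) {
    const byName = new Map();
    for (const contact of contacts) {
        const key = (contact.name || '').trim().toLowerCase();
        if (!key || PLACEHOLDER_NAMES.has(key)) continue; // drop empty and fake names
        if (!byName.has(key)) byName.set(key, contact); // first occurrence wins
    }
    return [...byName.values()];
}
```

For example, formatSwissPhone('062 849 49 49') yields '+41 62 849 49 49', while a string with too few digits yields null so it can be filtered out.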
Input schema
{
  "type": "object",
  "properties": {
    "url": {
      "type": "string",
      "description": "The target website to scrape (must be a non-empty string)."
    },
    "maxPages": {
      "type": "integer",
      "description": "Maximum number of pages to process.",
      "default": 50
    },
    "openaiApiKey": {
      "type": "string",
      "description": "OpenAI API key (must start with 'sk-'). Required for LLM analysis."
    },
    "llmModel": {
      "type": "string",
      "description": "OpenAI model to use for analysis.",
      "default": "gpt-3.5-turbo"
    },
    "minRelevanceScore": {
      "type": "integer",
      "description": "Minimum relevance score for pages to be scraped (0-10).",
      "default": 7
    },
    "useJavaScript": {
      "type": "boolean",
      "description": "Enable JavaScript rendering (currently not implemented).",
      "default": true
    },
    "title": {
      "type": "string",
      "description": "Optional custom title for the output.",
      "default": null
    }
  },
  "required": ["url", "openaiApiKey"]
}
Required Parameters:
- url (string): The website to scrape
- openaiApiKey (string): Valid OpenAI API key starting with "sk-"
Optional Parameters:
- maxPages (integer): Maximum pages to analyze (default: 50)
- llmModel (string): OpenAI model (default: "gpt-3.5-turbo")
- minRelevanceScore (integer): Minimum page score to scrape, 0-10 (default: 7)
- useJavaScript (boolean): Enable JavaScript rendering; currently not implemented (default: true)
- title (string): Custom title for output
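The rules in the schema can be checked up front, before any request is made. The sketch below is one way to do that; validateInput is a hypothetical name, and the Actor may rely on Apify's own input-schema validation instead.

```javascript
// Hypothetical input check that mirrors the schema's required fields and defaults.
function validateInput(input) {
    const { url, openaiApiKey } = input || {};
    if (typeof url !== 'string' || url.trim() === '') {
        throw new Error('Input "url" must be a non-empty string.');
    }
    if (typeof openaiApiKey !== 'string' || !openaiApiKey.startsWith('sk-')) {
        throw new Error('Input "openaiApiKey" must be a string starting with "sk-".');
    }
    return {
        // Defaults taken from the input schema; explicit input values override them.
        maxPages: 50,
        llmModel: 'gpt-3.5-turbo',
        minRelevanceScore: 7,
        useJavaScript: true,
        title: null,
        ...input,
        url: url.trim(),
    };
}
```

Failing fast on a missing or malformed key avoids crawling a site only to have the first LLM call rejected.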
Example output
The scraper produces one comprehensive dataset item per run:
{
  "title": "IBS Haustechnik AG",
  "companyName": "IBS Haustechnik AG",
  "website": "https://ibs-haustechnik.ch",
  "generalEmail": "info@ibs-haustechnik.ch",
  "generalPhone": "062 849 49 49",
  "contacts": [
    {
      "name": "Gabriel Ziegler",
      "position": "Geschäftsführer",
      "email": "gabriel.ziegler@ibs-haustechnik.ch",
      "phone": "078 966 88 41"
    },
    {
      "name": "Anna Meier",
      "position": "Marketing Manager",
      "email": "anna.meier@ibs-haustechnik.ch",
      "phone": null
    }
  ]
}
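Because each run yields a single nested item, downstream tools such as a CRM import or a CSV export usually want one row per person. The sketch below shows one possible flattening; toLeadRows is a hypothetical helper, and falling back to the general email or phone when a contact has none is a design assumption, not behavior of the Actor.

```javascript
// Hypothetical post-processing: one flat row per contact in the dataset item.
function toLeadRows(item) {
    return (item.contacts || []).map((contact) => ({
        company: item.companyName,
        website: item.website,
        name: contact.name,
        position: contact.position ?? null,
        // Assumption: fall back to the company-wide address or number when missing.
        email: contact.email ?? item.generalEmail ?? null,
        phone: contact.phone ?? item.generalPhone ?? null,
    }));
}
```

Items without a contacts array, such as malformed results, simply produce no rows.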
Error handling:
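If an error occurs, the run still produces one dataset item. Its title and companyName are set to the German placeholder "Fehler beim Scraping" ("Error while scraping"), the contact fields are empty, and an error field carries the message:
{
  "title": "Fehler beim Scraping",
  "companyName": "Fehler beim Scraping",
  "website": "https://invalid-url.test",
  "generalEmail": null,
  "generalPhone": null,
  "contacts": [],
  "error": "getaddrinfo ENOTFOUND invalid-url.test"
}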
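An error item of this shape could be produced by a small helper called from a catch block around the scraping logic. The function name buildErrorItem is hypothetical; only the field names and the placeholder text come from the example output.

```javascript
// Hypothetical helper that builds the error-shaped dataset item shown above.
function buildErrorItem(url, error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
        title: 'Fehler beim Scraping', // German: "Error while scraping"
        companyName: 'Fehler beim Scraping',
        website: url,
        generalEmail: null,
        generalPhone: null,
        contacts: [],
        error: message,
    };
}

// Typical use, with scrapeWebsite standing in for the real (async) scraping function:
//   try { item = await scrapeWebsite(url); } catch (err) { item = buildErrorItem(url, err); }
```

Consumers can then detect failed runs by checking for the error field instead of parsing logs.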
Key Differences from Basic Scraper
Advanced AI-Powered Features:
- LLM Content Analysis: Uses GPT to understand context and identify real people
- Intelligent Filtering: Removes marketing content, slogans, and placeholder text
- Multi-page Discovery: Automatically finds and evaluates relevant pages
- Company Data Extraction: Finds official company names and general contact info
- Swiss/German Optimization: Specialized for DACH region websites and phone formats
Validation and Quality:
- Strict Name Validation: Filters out fake names and marketing terms
- Email Validation: Removes placeholder and invalid email addresses
- Phone Formatting: Proper Swiss (+41) and German (+49) number formatting
- Content Cleaning: Aggressive removal of navigation, ads, and irrelevant content
Development and local testing
- Clone the Actor
  apify pull <ActorId>
  cd <ActorDirectory>
- Install dependencies
  npm install
- Set up OpenAI API Key
  - Get an API key from OpenAI
  - Add it to your INPUT.json file
- Configure input
  {
    "url": "https://example-company.ch",
    "openaiApiKey": "sk-your-api-key-here",
    "maxPages": 20,
    "minRelevanceScore": 7
  }
- Run locally
  npx apify run
- Inspect output
  Check apify_storage/datasets/default/*.json for results
Performance and Costs
Processing Time:
- Single page: ~30-60 seconds
- Full website (10-20 pages): ~5-15 minutes
- Large websites (50+ pages): ~20-45 minutes
OpenAI API Costs:
- GPT-3.5-turbo: ~$0.10-0.50 per website
- GPT-4: ~$2.00-10.00 per website
- Cost depends on website size and content complexity
Rate Limiting:
- Built-in delays between requests (1-2 seconds)
- Respects website robots.txt (recommended)
- OpenAI API rate limits handled automatically
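The pacing described above can be sketched as a randomized 1-2 second pause between page requests, plus capped exponential backoff for retrying rate-limited API calls. The source does not say how the Actor handles OpenAI rate limits, so the backoff policy, its parameters, and all function names below are assumptions.

```javascript
// Random pause between page requests; bounds are inclusive and given in milliseconds.
function politeDelayMs(minMs = 1000, maxMs = 2000) {
    return Math.floor(minMs + Math.random() * (maxMs - minMs + 1));
}

// Assumed retry policy: wait 1 s, 2 s, 4 s, ... up to a cap, for attempt 0, 1, 2, ...
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
    return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical sequential fetch loop; fetchPage(url) is whatever performs the request.
async function fetchSequentially(urls, fetchPage) {
    const results = [];
    for (const [index, url] of urls.entries()) {
        if (index > 0) await sleep(politeDelayMs()); // no wait before the first page
        results.push(await fetchPage(url));
    }
    return results;
}
```

Fetching sequentially with a pause explains much of the processing time listed above: 20 pages at 1-2 seconds each add roughly 20-40 seconds of waiting before any LLM latency.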
Best Practices
For optimal results:
- Use specific URLs: Point to company websites with clear contact sections
- Adjust relevance score: Lower for smaller sites, higher for large corporate sites
- Monitor API usage: Track OpenAI costs, especially with GPT-4
- Test with sample sites: Verify output quality before bulk processing
Common use cases:
- Lead generation: Extract contacts from prospect company websites
- Data enrichment: Enhance existing company databases
- Market research: Analyze competitor team structures
- Sales automation: Build contact lists for outreach campaigns