
Website Contact Scraper - AI-Powered Lead Finder
AI-powered website scraper that extracts real contact data from company sites! Finds people, positions, emails & phone numbers using LLM technology. Scans team pages, contact sections & company info. Perfect for B2B lead generation and sales research.
LLM-Guided Corporate Website Scraper
An advanced Apify actor that uses large language models (LLMs) to identify and extract high-value business contact information from corporate websites.
🚀 Overview
This scraper goes far beyond traditional crawling. It:
- Uses GPT (OpenAI) to intelligently rank internal URLs based on their relevance to contact data
- Maximizes content extraction, including hidden and modal content
- Parses and validates contact fields using LLMs and custom regex preprocessing
- Aggregates data across multiple pages for higher confidence
💡 Key Features
- 🧰 LLM-based URL Evaluation: Scores and selects only the most promising URLs per domain
- 🔍 Maximum Content Extraction: Scrapes visible and hidden elements, emails, phone numbers, and text sections (see the extraction sketch after this list)
- 🔧 Custom Prompt Engineering: Tailored prompts for URL scoring and field extraction
- 📊 Smart Aggregation: Merges multiple extractions into one confident, enriched result per domain
- 🚪 Resilient Parsing: Handles edge cases, malformed responses, and fallback scoring
- ✅ GDPR-friendly Proxy Support: With optional German residential proxies
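As a rough illustration of the content-extraction and regex-preprocessing ideas above, the snippet below pulls the full DOM text (including hidden containers and modals) and pre-extracts emails and phone numbers before any LLM call. The regexes and the `extract_contacts` helper are simplified assumptions for this sketch, not the actor's internal code.

```python
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?[\d\s()/.-]{7,}\d")  # deliberately loose; the LLM validates later

def extract_contacts(url: str) -> dict:
    """Collect page text (visible and hidden) and pre-extract emails/phones via regex."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # get_text() also returns text from hidden containers and modals, since they are in the DOM
    text = soup.get_text(separator=" ", strip=True)
    # mailto: links often carry addresses that never appear in the visible text
    mailtos = [a["href"][len("mailto:"):] for a in soup.select('a[href^="mailto:"]')]
    return {
        "emails": sorted(set(EMAIL_RE.findall(text) + mailtos)),
        "phones": sorted(set(PHONE_RE.findall(text))),
        "text": text,
    }
```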
⚙️ Input
This actor expects the following input:
{"urls": ["https://example.com"],"openaiApiKey": "sk-...","maxRequests": 50,"useProxy": true,"enableUrlEvaluation": true,"aggregateResults": true,"includeExtendedFields": true,"costLimit": 1.0}
🔄 Workflow
- Main page is loaded
- LLM evaluates internal links for contact relevance (see the prompt sketch after this list)
- Top N URLs are crawled (contact, Impressum, team, etc.)
- Content is extracted (even from modals, hidden fields, footers)
- Text is preprocessed for LLM efficiency
- LLM parses the data into a structured JSON object
- Data is validated, weighted, and aggregated into one high-confidence result
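To give a feel for the link-evaluation step, here is a minimal sketch of how internal URLs could be scored with gpt-3.5-turbo. The prompt wording, the `score_urls` helper, and the fallback score are illustrative assumptions, not the actor's actual implementation.

```python
import json
from openai import OpenAI

client = OpenAI(api_key="sk-...")

def score_urls(internal_links: list[str]) -> list[dict]:
    """Ask the model to rate each internal URL for contact-data relevance (0-1)."""
    prompt = (
        "Rate each URL from 0 to 1 for how likely it is to contain company contact data "
        "(contact, Impressum, team or about pages). Reply only with a JSON array of "
        '{"url": ..., "score": ...} objects.\n' + "\n".join(internal_links)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    content = response.choices[0].message.content or ""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fallback scoring when the model returns malformed JSON (see "Resilient Parsing")
        return [{"url": url, "score": 0.5} for url in internal_links]
```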
🌐 Output Format
Each record pushed to the dataset contains:
{"executive_name": "Max Mustermann","executive_title": "Geschäftsführer","company_email": "info@example.com","company_phone": "+41 44 123 45 67","company_address": "Musterstrasse 1, 8000 Zürich","confidence_score": 0.92,"sources": [...],"aggregated_from_pages": 6,"domain": "example.com"}
📈 Performance & Cost
- Roughly 40 websites per $0.07 on average (at gpt-3.5-turbo rates)
- Each domain result is based on up to 8 evaluated subpages
- Internal cost tracking included
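The internal cost tracking can be pictured as a simple accumulator over the token usage reported by the OpenAI API, compared against the costLimit input. The per-token prices and the CostTracker class below are placeholder assumptions for this sketch; check current OpenAI pricing.

```python
# Placeholder prices in USD per 1K tokens -- assumed values, not current OpenAI pricing
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

class CostTracker:
    """Accumulates token usage per run and signals when costLimit is reached."""

    def __init__(self, cost_limit: float):
        self.cost_limit = cost_limit
        self.total_cost = 0.0

    def add_usage(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.total_cost += (
            prompt_tokens / 1000 * PRICE_PER_1K_INPUT
            + completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
        )

    def limit_reached(self) -> bool:
        return self.total_cost >= self.cost_limit

# Usage after each chat completion:
#   tracker.add_usage(response.usage.prompt_tokens, response.usage.completion_tokens)
#   if tracker.limit_reached(): stop crawling further pages
```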
🔐 Notes
- Requires a valid OpenAI API key (gpt-3.5-turbo)
- Proxy use is optional but recommended for stable scraping
- Works well for companies based in Germany, Switzerland, and Austria (Impressum detection)
🚪 Limitations
- Not optimized for dynamic SPAs
- Some LLM responses may still need fallback handling (included)
🚧 Future Improvements
- Add multilingual prompt switching (based on the `targetLanguage` input)
- Upgrade to gpt-4-turbo for more robust data quality
- Add custom scoring model for aggregation weighting
🌟 Created by Timo Sieber — for smarter, LLM-powered scraping at scale.