Website Contact Scraper - AI-Powered Lead Finder

Developed by Timo Sieber · Maintained by Community

AI-powered website scraper that extracts real contact data from company sites! Finds people, positions, emails & phone numbers using LLM technology. Scans team pages, contact sections & company info. Perfect for B2B lead generation and sales research.

Rating: 0.0 (0)
Pricing: Pay per event
Total users: 14 · Monthly users: 12 · Runs succeeded: >99% · Last modified: 9 days ago

LLM-Guided Corporate Website Scraper

An advanced Apify actor that uses large language models (LLMs) to identify and extract high-value business contact information from corporate websites.

🚀 Overview

This scraper goes far beyond traditional crawling. It:

  • Uses GPT (OpenAI) to intelligently rank internal URLs based on their relevance to contact data
  • Maximizes content extraction, including hidden and modal content
  • Parses and validates contact fields using LLMs and custom regex preprocessing
  • Aggregates data across multiple pages for higher confidence

💡 Key Features

  • 🧰 LLM-based URL Evaluation: Scores and selects only the most promising URLs per domain
  • 🔍 Maximum Content Extraction: Scrapes visible and hidden elements, emails, phone numbers, and text sections
  • 🔧 Custom Prompt Engineering: Tailored prompts for URL scoring and field extraction
  • 📊 Smart Aggregation: Merges multiple extractions into one confident, enriched result per domain
  • 🚪 Resilient Parsing: Handles edge cases, malformed responses, and fallback scoring
  • GDPR-friendly Proxy Support: With optional German residential proxies
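
The regex preprocessing mentioned above can be sketched as follows. This is an illustrative example, not the actor's actual code: obvious email and phone candidates are pulled out of the raw page text before it is sent to the LLM, so the prompt stays short and the LLM output can be cross-checked against these candidates.

```python
import re

# Email pattern covering the common local@domain.tld shape.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Loose international phone pattern (e.g. "+41 44 123 45 67"); real-world
# phone matching is messier, this is only a sketch.
PHONE_RE = re.compile(r"\+\d{1,3}(?:[ \-/]?\d{2,4}){2,5}")

def preextract_contacts(text: str) -> dict:
    """Return deduplicated email/phone candidates found in raw page text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(m.strip() for m in PHONE_RE.findall(text))),
    }
```

Pre-extracted candidates like these also make the "Resilient Parsing" step cheaper: if the LLM returns a malformed response, the regex hits still provide a usable fallback.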

⚙️ Input

This actor expects the following input:

{
  "urls": ["https://example.com"],
  "openaiApiKey": "sk-...",
  "maxRequests": 50,
  "useProxy": true,
  "enableUrlEvaluation": true,
  "aggregateResults": true,
  "includeExtendedFields": true,
  "costLimit": 1.0
}
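
Before starting a run it can help to sanity-check the input payload on the client side. The sketch below mirrors the field names of the schema above; the validation rules themselves are invented for illustration and are not part of the actor.

```python
# Fields the actor cannot run without, per the input schema above.
REQUIRED_FIELDS = {"urls", "openaiApiKey"}

def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = REQUIRED_FIELDS - run_input.keys()
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    if not run_input.get("urls"):
        problems.append("urls must be a non-empty list")
    if not str(run_input.get("openaiApiKey", "")).startswith("sk-"):
        problems.append("openaiApiKey does not look like an OpenAI key")
    if run_input.get("costLimit", 0) <= 0:
        problems.append("costLimit should be a positive dollar amount")
    return problems
```

A payload that passes these checks can then be submitted as the run input via the Apify console or API as usual.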

🔄 Workflow

  1. Main page is loaded
  2. LLM evaluates internal links for contact relevance
  3. Top N URLs are crawled (contact, impressum, team, etc.)
  4. Content is extracted (even from modals, hidden fields, footers)
  5. Text is preprocessed for LLM efficiency
  6. LLM parses the data into a structured JSON object
  7. Data is validated, weighted, and aggregated into one high-confidence result
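
Steps 2–3 above rely on the LLM, but the actor also mentions fallback scoring for unusable responses. A keyword-based fallback could look like the sketch below; the keywords and weights are invented for illustration, not taken from the actor.

```python
from urllib.parse import urlparse

# Hypothetical weights for contact-relevant URL path keywords; the actor's
# primary scoring comes from the LLM, this mimics only a fallback heuristic.
KEYWORD_WEIGHTS = {
    "contact": 1.0, "kontakt": 1.0, "impressum": 0.9,
    "team": 0.8, "about": 0.6, "ueber-uns": 0.6,
}

def fallback_score(url: str) -> float:
    """Score an internal URL by contact-related keywords in its path."""
    path = urlparse(url).path.lower()
    return max((w for kw, w in KEYWORD_WEIGHTS.items() if kw in path), default=0.0)

def top_n_urls(urls: list[str], n: int = 8) -> list[str]:
    """Pick the N most promising subpages, mirroring steps 2-3 above."""
    return sorted(urls, key=fallback_score, reverse=True)[:n]
```

The cap of 8 matches the "up to 8 evaluated subpages" figure quoted in the performance section.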

🌐 Output Format

Each record pushed to the dataset contains:

{
  "executive_name": "Max Mustermann",
  "executive_title": "Geschäftsführer",
  "company_email": "info@example.com",
  "company_phone": "+41 44 123 45 67",
  "company_address": "Musterstrasse 1, 8000 Zürich",
  "confidence_score": 0.92,
  "sources": [...],
  "aggregated_from_pages": 6,
  "domain": "example.com"
}
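
Because each record carries a confidence_score, downstream consumers can filter the dataset before feeding it into a CRM. A minimal sketch; the 0.8 threshold is an arbitrary example, not a recommendation from the actor:

```python
def high_confidence(records: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep only records whose confidence_score meets the threshold."""
    return [r for r in records if r.get("confidence_score", 0.0) >= threshold]
```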

📈 Performance & Cost

  • Processes roughly 40 websites per $0.07 (at gpt-3.5-turbo rates)
  • Each domain result is based on up to 8 evaluated subpages
  • Internal cost tracking included
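
Extrapolating linearly from the figure above (~40 websites per $0.07), a batch cost can be estimated up front. Pure back-of-envelope arithmetic, not the actor's internal cost tracker:

```python
# Quoted rate from the performance section above.
COST_PER_40_SITES_USD = 0.07

def estimated_cost(num_websites: int) -> float:
    """Rough USD cost for a batch, extrapolated linearly from the quoted rate."""
    return round(num_websites * COST_PER_40_SITES_USD / 40, 4)
```

The costLimit input can then be set slightly above this estimate as a safety margin.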

🔐 Notes

  • Requires a valid OpenAI API key (gpt-3.5-turbo)
  • Proxy use is optional but recommended for stable scraping
  • Works well for companies based in Germany, Switzerland, and Austria (Impressum detection)

🚪 Limitations

  • Not optimized for dynamic single-page applications (SPAs)
  • Some LLM responses may still need fallback handling (included)

🚧 Future Improvements

  • Add multilingual prompt switching (based on targetLanguage input)
  • Upgrade to gpt-4-turbo for more robust data quality
  • Add custom scoring model for aggregation weighting

🌟 Created by Timo Sieber — for smarter, LLM-powered scraping at scale.