Website Contact Scraper - AI-Powered Lead Finder

Pricing

Pay per event

Developed by Timo Sieber

Maintained by Community

AI-powered website scraper that extracts real contact data from company sites! Finds people, positions, emails & phone numbers using LLM technology. Scans team pages, contact sections & company info. Perfect for B2B lead generation and sales research.

Total users: 5
Monthly users: 5
Runs succeeded: >99%
Last modified: 21 days ago

Advanced Contact and Company Data Scraper with LLM (JavaScript)

This advanced template scrapes company websites using AI-powered analysis to extract structured contact information and company data. The actor uses OpenAI's GPT models to intelligently identify real people, their positions, and contact details while filtering out marketing content and placeholder text.

Unlike basic scrapers that rely on pattern matching, this actor leverages Large Language Models (LLMs) to understand context and extract only genuine contact information from Swiss and German company websites.

Key Features

  • Apify SDK – toolkit for building Actors
  • Input schema – validates input parameters including OpenAI API key
  • Dataset – stores structured company and contact data
  • Axios client – HTTP client with retry logic and proper headers
  • Cheerio – HTML parsing and DOM manipulation
  • OpenAI API – GPT models for intelligent content analysis
  • Multi-page crawling – discovers and analyzes multiple pages from the same domain
  • Smart URL evaluation – AI-powered scoring of page relevance for contact data
  • Swiss/German phone formatting – proper formatting for +41 and +49 numbers
  • Aggressive content filtering – removes navigation, ads, and irrelevant content
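
The sketch below shows how these building blocks typically fit together in the Actor's main script. It is illustrative only: the Apify SDK calls are real, but the phase functions (collectAllUrls, evaluateUrlsWithLLM, scrapeRelevantPages) are the ones described under "How it works" below, and the exact wiring may differ from the published code.

import { Actor } from 'apify';

await Actor.init();

const {
    url,
    openaiApiKey,
    maxPages = 50,
    llmModel = 'gpt-3.5-turbo',
    minRelevanceScore = 7,
} = await Actor.getInput();

try {
    // Phase 1: discover candidate URLs on the target domain
    const pages = await collectAllUrls(url, maxPages);

    // Phase 2: let the LLM score each page for contact-data relevance
    const scored = await evaluateUrlsWithLLM(pages, openaiApiKey, llmModel);

    // Phase 3 + 4: scrape high-scoring pages, then validate and deduplicate
    const relevant = scored.filter((p) => p.score >= minRelevanceScore);
    const result = await scrapeRelevantPages(relevant, openaiApiKey, llmModel);

    await Actor.pushData(result);
} catch (err) {
    // Error output mirrors the error-handling example further down
    await Actor.pushData({
        title: 'Fehler beim Scraping',
        companyName: 'Fehler beim Scraping',
        website: url,
        generalEmail: null,
        generalPhone: null,
        contacts: [],
        error: err.message,
    });
}

await Actor.exit();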

How it works

Phase 1: URL Discovery

  1. collectAllUrls() crawls the website starting from the main URL
  2. Discovers internal links up to specified depth (default: 2 levels)
  3. Filters out non-content URLs (images, PDFs, etc.)
  4. Collects page titles and H1 tags for context
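
A minimal sketch of this discovery phase, using the Axios and Cheerio dependencies listed above. The queue handling and the file-extension filter are assumptions based on the description, not the Actor's exact implementation:

import axios from 'axios';
import * as cheerio from 'cheerio';

// Breadth-first crawl of internal links up to maxDepth, skipping non-content URLs.
async function collectAllUrls(startUrl, maxPages = 50, maxDepth = 2) {
    const origin = new URL(startUrl).origin;
    const seen = new Set([startUrl]);
    const pages = [];
    const frontier = [{ url: startUrl, depth: 0 }];

    while (frontier.length > 0 && pages.length < maxPages) {
        const { url, depth } = frontier.shift();

        let html;
        try {
            ({ data: html } = await axios.get(url, {
                timeout: 15000,
                headers: { 'User-Agent': 'Mozilla/5.0 (compatible; ContactScraper)' },
            }));
        } catch (err) {
            continue; // skip pages that fail to load
        }

        const $ = cheerio.load(html);
        // Collect title and H1 so the LLM has context for scoring (Phase 2)
        pages.push({ url, title: $('title').text().trim(), h1: $('h1').first().text().trim() });

        if (depth >= maxDepth) continue;

        $('a[href]').each((_, a) => {
            try {
                const link = new URL($(a).attr('href'), url);
                const sameDomain = link.origin === origin;
                const isContent = !/\.(jpe?g|png|gif|svg|pdf|zip|docx?)$/i.test(link.pathname);
                if (sameDomain && isContent && !seen.has(link.href)) {
                    seen.add(link.href);
                    frontier.push({ url: link.href, depth: depth + 1 });
                }
            } catch (e) {
                // ignore malformed hrefs
            }
        });
    }

    return pages;
}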

Phase 2: Intelligent Page Evaluation

  1. evaluateUrlsWithLLM() uses GPT to score each discovered URL
  2. Evaluates likelihood of containing contact information (0-10 scale)
  3. Prioritizes pages like "Team", "Kontakt", "About Us", "Impressum"
  4. Homepage always receives high priority score
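
A hedged sketch of this scoring step using the official openai Node.js client. The prompt wording, the JSON reply format, and the homepage handling are illustrative assumptions; only the general approach (GPT rates each URL on a 0-10 scale) comes from the description above:

import OpenAI from 'openai';

async function evaluateUrlsWithLLM(pages, apiKey, model = 'gpt-3.5-turbo') {
    const openai = new OpenAI({ apiKey });

    const list = pages
        .map((p, i) => `${i}: ${p.url} | ${p.title} | ${p.h1}`)
        .join('\n');

    const response = await openai.chat.completions.create({
        model,
        temperature: 0,
        messages: [{
            role: 'user',
            content:
                'Rate each page from 0 to 10 for how likely it is to contain contact ' +
                'information (pages like "Team", "Kontakt", "About Us", "Impressum" score high). ' +
                'Answer only with JSON: [{"index": 0, "score": 9}, ...]\n\n' + list,
        }],
    });

    // A production version would guard against non-JSON replies
    const scores = JSON.parse(response.choices[0].message.content);

    return pages.map((p, i) => ({
        ...p,
        // The homepage (index 0) always receives a high priority score
        score: i === 0 ? 10 : (scores.find((s) => s.index === i)?.score ?? 0),
    }));
}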

Phase 3: AI-Powered Content Extraction

  1. scrapeRelevantPages() processes only high-scoring pages
  2. extractContactsWithLLM() uses GPT to identify real people:
    • Filters out marketing slogans and placeholder text
    • Recognizes proper name structures (first and last name)
    • Identifies positions and contact details near names
    • Removes fake names like "Max Mustermann"
  3. extractCompanyInfo() finds general company contact data:
    • mailto: and tel: links
    • JSON-LD structured data
    • Footer and header contact information
  4. extractCompanyNameWithLLM() determines official company name
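
Of these steps, the company-data extraction is the most mechanical, so here is an illustrative Cheerio sketch of it: mailto:/tel: links plus schema.org JSON-LD. Field names and fallback order are assumptions; the LLM-based steps (extractContactsWithLLM, extractCompanyNameWithLLM) follow the same chat-completion pattern as Phase 2 and are omitted here.

import * as cheerio from 'cheerio';

// Illustrative sketch of extractCompanyInfo(): general contact data from a page.
function extractCompanyInfo(html) {
    const $ = cheerio.load(html);
    const info = { companyName: null, generalEmail: null, generalPhone: null };

    // mailto: and tel: links (footer and header included)
    const mailto = $('a[href^="mailto:"]').first().attr('href');
    const tel = $('a[href^="tel:"]').first().attr('href');
    if (mailto) info.generalEmail = mailto.replace(/^mailto:/i, '').split('?')[0].trim();
    if (tel) info.generalPhone = tel.replace(/^tel:/i, '').trim();

    // JSON-LD structured data (schema.org Organization)
    $('script[type="application/ld+json"]').each((_, el) => {
        try {
            const data = JSON.parse($(el).contents().text());
            const org = Array.isArray(data)
                ? data.find((d) => d['@type'] === 'Organization')
                : data;
            if (org && org['@type'] === 'Organization') {
                info.companyName = info.companyName || org.name || null;
                info.generalEmail = info.generalEmail || org.email || null;
                info.generalPhone = info.generalPhone || org.telephone || null;
            }
        } catch (e) {
            // ignore invalid JSON-LD blocks
        }
    });

    return info;
}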

Phase 4: Data Validation and Deduplication

  1. Validates email addresses and phone numbers
  2. Formats Swiss phone numbers (+41 xx xxx xx xx)
  3. Deduplicates contacts by name (allows multiple people with same email)
  4. Filters out invalid or placeholder data
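
A sketch of this validation step under the same caveat: the exact normalization and deduplication rules in the Actor may differ. It shows one plausible way to format a Swiss number into the +41 xx xxx xx xx grouping and to deduplicate contacts by name while keeping people who share an email address:

// Normalize a Swiss phone number to the +41 xx xxx xx xx grouping.
function formatSwissPhone(raw) {
    const digits = raw.replace(/[^\d+]/g, '');
    // '062 849 49 49' -> '+41628494949'; already-international numbers pass through
    const intl = digits.startsWith('0') ? `+41${digits.slice(1)}` : digits;
    const match = intl.match(/^\+41(\d{2})(\d{3})(\d{2})(\d{2})$/);
    return match ? `+41 ${match[1]} ${match[2]} ${match[3]} ${match[4]}` : raw;
}

// Deduplicate by name only, so two people sharing info@ stay separate contacts.
function dedupeContacts(contacts) {
    const byName = new Map();
    for (const contact of contacts) {
        const key = contact.name.trim().toLowerCase();
        if (!byName.has(key)) byName.set(key, contact);
    }
    return [...byName.values()];
}

// formatSwissPhone('062 849 49 49') -> '+41 62 849 49 49'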

Input schema

{
    "type": "object",
    "properties": {
        "url": {
            "type": "string",
            "description": "The target website to scrape (must be a non-empty string)."
        },
        "maxPages": {
            "type": "integer",
            "description": "Maximum number of pages to process.",
            "default": 50
        },
        "openaiApiKey": {
            "type": "string",
            "description": "OpenAI API key (must start with 'sk-'). Required for LLM analysis."
        },
        "llmModel": {
            "type": "string",
            "description": "OpenAI model to use for analysis.",
            "default": "gpt-3.5-turbo"
        },
        "minRelevanceScore": {
            "type": "integer",
            "description": "Minimum relevance score for pages to be scraped (0-10).",
            "default": 7
        },
        "useJavaScript": {
            "type": "boolean",
            "description": "Enable JavaScript rendering (currently not implemented).",
            "default": true
        },
        "title": {
            "type": "string",
            "description": "Optional custom title for the output.",
            "default": null
        }
    },
    "required": ["url", "openaiApiKey"]
}

Required Parameters:

  • url (string): The website to scrape
  • openaiApiKey (string): Valid OpenAI API key starting with "sk-"

Optional Parameters:

  • maxPages (integer): Maximum pages to analyze (default: 50)
  • llmModel (string): OpenAI model (default: "gpt-3.5-turbo")
  • minRelevanceScore (integer): Minimum page score to scrape (default: 7)
  • useJavaScript (boolean): Enable JavaScript rendering (default: true; currently not implemented)
  • title (string): Custom title for the output

Example output

The scraper produces one comprehensive dataset item per run:

{
    "title": "IBS Haustechnik AG",
    "companyName": "IBS Haustechnik AG",
    "website": "https://ibs-haustechnik.ch",
    "generalEmail": "info@ibs-haustechnik.ch",
    "generalPhone": "062 849 49 49",
    "contacts": [
        {
            "name": "Gabriel Ziegler",
            "position": "Geschäftsführer",
            "email": "gabriel.ziegler@ibs-haustechnik.ch",
            "phone": "078 966 88 41"
        },
        {
            "name": "Anna Meier",
            "position": "Marketing Manager",
            "email": "anna.meier@ibs-haustechnik.ch",
            "phone": null
        }
    ]
}

Error handling:

If errors occur, the output includes error information:

{
    "title": "Fehler beim Scraping",
    "companyName": "Fehler beim Scraping",
    "website": "https://invalid-url.test",
    "generalEmail": null,
    "generalPhone": null,
    "contacts": [],
    "error": "getaddrinfo ENOTFOUND invalid-url.test"
}

Key Differences from Basic Scraper

Advanced AI-Powered Features:

  • LLM Content Analysis: Uses GPT to understand context and identify real people
  • Intelligent Filtering: Removes marketing content, slogans, and placeholder text
  • Multi-page Discovery: Automatically finds and evaluates relevant pages
  • Company Data Extraction: Finds official company names and general contact info
  • Swiss/German Optimization: Specialized for DACH region websites and phone formats

Validation and Quality:

  • Strict Name Validation: Filters out fake names and marketing terms
  • Email Validation: Removes placeholder and invalid email addresses
  • Phone Formatting: Proper Swiss (+41) and German (+49) number formatting
  • Content Cleaning: Aggressive removal of navigation, ads, and irrelevant content

Development and local testing

  1. Clone the Actor

    apify pull <ActorId>
    cd <ActorDirectory>
  2. Install dependencies

    npm install
  3. Set up OpenAI API Key

    • Get an API key from OpenAI
    • Add it to your INPUT.json file
  4. Configure input

    {
        "url": "https://example-company.ch",
        "openaiApiKey": "sk-your-api-key-here",
        "maxPages": 20,
        "minRelevanceScore": 7
    }
  5. Run locally

    npx apify run
  6. Inspect output

     Check apify_storage/datasets/default/*.json for the results.

Performance and Costs

Processing Time:

  • Single page: ~30-60 seconds
  • Full website (10-20 pages): ~5-15 minutes
  • Large websites (50+ pages): ~20-45 minutes

OpenAI API Costs:

  • GPT-3.5-turbo: ~$0.10-0.50 per website
  • GPT-4: ~$2.00-10.00 per website
  • Cost depends on website size and content complexity

Rate Limiting:

  • Built-in delays between requests (1-2 seconds)
  • Respecting the target website's robots.txt is recommended
  • OpenAI API rate limits handled automatically
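
The 1-2 second delay can be as simple as a randomized sleep before each request; this is a sketch of the idea, not the Actor's exact implementation:

import axios from 'axios';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait a random 1-2 seconds before each page request to avoid hammering the site.
async function politeGet(url) {
    await sleep(1000 + Math.random() * 1000);
    return axios.get(url, { timeout: 15000 });
}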

Best Practices

For optimal results:

  1. Use specific URLs: Point to company websites with clear contact sections
  2. Adjust relevance score: Lower for smaller sites, higher for large corporate sites
  3. Monitor API usage: Track OpenAI costs, especially with GPT-4
  4. Test with sample sites: Verify output quality before bulk processing

Common use cases:

  • Lead generation: Extract contacts from prospect company websites
  • Data enrichment: Enhance existing company databases
  • Market research: Analyze competitor team structures
  • Sales automation: Build contact lists for outreach campaigns