πŸ•΅οΈβ€β™‚οΈ Intelligent Email Extractor

An advanced Apify crawler to automatically extract email addresses from any website, with anti-detection protection and Cloudflare decoding.

✨ Features

πŸ” Multi-Source Extraction

  • Mailto links: Direct extraction from mailto: links
  • Cloudflare protection: Automatic decoding of emails obfuscated in data-cfemail attributes (see the sketch after this list)
  • Smart regex: Email detection in raw HTML content
  • Metadata: Extraction from <meta> tags
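
To illustrate the Cloudflare step: the data-cfemail attribute carries a hex string whose first byte is an XOR key for the remaining bytes. A minimal decoding sketch (decodeCfEmail is an illustrative name, not necessarily the actor's internal helper):

function decodeCfEmail(encoded) {
  // The first byte (two hex characters) is the XOR key
  const key = parseInt(encoded.slice(0, 2), 16);
  let email = "";
  for (let i = 2; i < encoded.length; i += 2) {
    email += String.fromCharCode(parseInt(encoded.slice(i, i + 2), 16) ^ key);
  }
  return email;
}

// decodeCfEmail("54373b3a2035372014312c35392438317a373b39") === "contact@example.com"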

🎯 Intelligent Navigation

  • Contact pages: Automatic detection of contact pages via keywords
  • Multilingual keywords: French, English, German support
  • Smart scoring: Prioritization of most relevant pages
  • Configurable limits: Control the number of pages to analyze

πŸ›‘οΈ Anti-Detection Protection

  • Fingerprinting: Realistic browser fingerprint
  • Random delays: Avoids detectable navigation patterns
  • Human headers: Realistic User-Agent and headers
  • Session management: Session pool to avoid bans (see the configuration sketch after this list)
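
These protections map onto standard Crawlee options. A minimal configuration sketch, assuming Crawlee's PlaywrightCrawler (the concrete values are illustrative, not the actor's exact settings):

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxConcurrency: 1,                        // one tab at a time
  useSessionPool: true,                     // session pool to avoid bans
  sessionPoolOptions: { maxPoolSize: 10 },  // pool size is an assumed value
  browserPoolOptions: {
    useFingerprints: true,                  // realistic browser fingerprints
  },
  navigationTimeoutSecs: 30,                // matches the default navigationTimeoutMs input
  requestHandler: async ({ page, request }) => {
    // email extraction happens here
  },
});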

πŸ“§ Filtering and Prioritization

  • Whitelist: Priority emails (contact@, hello@, info@, etc.)
  • Blacklist: Filter unwanted emails (no-reply@, etc.)
  • Validation: Email validity verification
  • Deduplication: Automatic duplicate removal

πŸš€ Installation

# Clone the project
git clone <repository-url>
cd apify-get-emails-from-site
# Install dependencies
npm install
# Install Playwright (automatic via postinstall)
npx playwright install --with-deps chromium

πŸ“– Usage

Input Configuration

{
  "baseUrl": "https://example.com",
  "maxContactPages": 2,
  "navigationTimeoutMs": 30000
}

Parameters:

  • baseUrl (required): Base URL to analyze
  • maxContactPages (optional, default: 2): Maximum number of contact pages to analyze
  • navigationTimeoutMs (optional, default: 30000): Navigation timeout in ms
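
On the Apify platform these values arrive as the actor input. A minimal sketch of how main.js might read them with the defaults above (the exact variable handling is an assumption, not the actor's actual code):

import { Actor } from "apify";

await Actor.init();

const {
  baseUrl,                      // required
  maxContactPages = 2,          // optional
  navigationTimeoutMs = 30000,  // optional, in milliseconds
} = (await Actor.getInput()) ?? {};

if (!baseUrl) {
  throw new Error("Missing required input: baseUrl");
}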

Execution

# Local execution
npm start
# Or directly
node main.js

Output

{
  "hit": true,
  "primaryEmail": "contact@example.com",
  "emails": ["contact@example.com", "info@example.com"],
  "sourceUrl": "https://example.com/contact",
  "scanned": ["https://example.com", "https://example.com/contact"],
  "baseUrl": "https://example.com"
}

πŸ”§ Advanced Configuration

Contact Keywords

The crawler automatically detects contact pages via these keywords:

const contactKeywords = [
  "contact", "contact-us", "contactez", "kontakt", "mentions",
  "mentions-legales", "legal", "imprint", "impressum", "privacy",
  "privacy-policy", "confidentialite", "support", "help", "aide",
  "about", "a-propos", "team", "equipe", "cgu", "cgv", "faq"
];
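
One way to turn these keywords into the "smart scoring" described above is to score every internal link by keyword matches and keep only the top candidates. The sketch below is illustrative and assumes a links array of hrefs plus the maxContactPages input; it is not necessarily the actor's exact heuristic:

function scoreContactLink(href) {
  const url = href.toLowerCase();
  let score = 0;
  for (const keyword of contactKeywords) {
    // Longer keywords are more specific, so they weigh more
    if (url.includes(keyword)) score += keyword.length;
  }
  return score;
}

// Keep the most promising links, up to maxContactPages
const contactPages = links
  .map((href) => ({ href, score: scoreContactLink(href) }))
  .filter((link) => link.score > 0)
  .sort((a, b) => b.score - a.score)
  .slice(0, maxContactPages)
  .map((link) => link.href);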

Priority Emails

const roleWhitelist = [
  "contact@", "hello@", "info@", "support@", "sales@",
  "partners@", "partnership@", "team@", "hi@", "help@"
];

Filtered Emails

const roleBlacklist = [
  "no-reply@", "noreply@", "donotreply@"
];
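
Combining the two lists, the primaryEmail in the output can be chosen by dropping blacklisted addresses and then preferring whitelisted role prefixes. A minimal sketch (pickPrimaryEmail is an illustrative helper name):

function pickPrimaryEmail(emails) {
  // Drop unwanted role addresses
  const kept = emails.filter(
    (email) => !roleBlacklist.some((prefix) => email.toLowerCase().startsWith(prefix))
  );
  // Prefer whitelisted role addresses, otherwise fall back to the first remaining one
  const preferred = kept.find((email) =>
    roleWhitelist.some((prefix) => email.toLowerCase().startsWith(prefix))
  );
  return preferred ?? kept[0] ?? null;
}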

πŸ—οΈ Architecture

Execution Flow

  1. Initialization: Crawler and request queue configuration
  2. Navigation: Main page loading
  3. Extraction: Multi-source email analysis
  4. Decision: If emails are found → stop, otherwise → crawl contact pages
  5. Result: Return the prioritized emails
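
Put together, the crawler's request handler might look roughly like this. The helpers extractEmails, findContactLinks and pickPrimaryEmail are hypothetical names used only to illustrate the flow:

const requestHandler = async ({ page, request, enqueueLinks }) => {
  const html = await page.content();
  const emails = extractEmails(html); // mailto, data-cfemail, regex, <meta>

  if (emails.length > 0) {
    // Smart stop: record the result and crawl no further
    await Actor.pushData({
      hit: true,
      primaryEmail: pickPrimaryEmail(emails),
      emails,
      sourceUrl: request.loadedUrl,
    });
    return;
  }

  // No email yet: enqueue the most promising contact pages
  await enqueueLinks({ urls: findContactLinks(html, request.loadedUrl) });
};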

Anti-Detection Protection

  • Limited concurrency: 1 tab at a time
  • Random delays: 300-800 ms before navigation, 200-600 ms after
  • Fingerprinting: Unique browser fingerprint
  • Realistic headers: Recent Chrome User-Agent
  • Error handling: Automatic retry on failures
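
The random delays can be implemented with a small helper awaited from Crawlee's pre- and post-navigation hooks. A minimal sketch (the hook wiring shown is illustrative):

const randomDelay = (minMs, maxMs) =>
  new Promise((resolve) =>
    setTimeout(resolve, minMs + Math.random() * (maxMs - minMs))
  );

// Wired into the crawler, for example:
// preNavigationHooks:  [async () => randomDelay(300, 800)],
// postNavigationHooks: [async () => randomDelay(200, 600)],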

πŸ“Š Performance

  • Configurable timeout: Avoids blocking
  • Request limiting: Server load control
  • Smart stop: Stop as soon as a valid email is found
  • Deduplication: Avoids redundant analysis

πŸ› οΈ Dependencies

  • apify (^3.5.0): Apify SDK (platform integration, input and dataset handling)
  • crawlee (^3.15.1): Crawling library
  • playwright (^1.46.0): Browser automation
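
For reference, the matching dependencies block in package.json would look like the following. The scripts shown simply mirror the Installation and Execution sections, and "type": "module" follows from the ES6-modules note below; the project's actual package.json may differ:

{
  "type": "module",
  "scripts": {
    "start": "node main.js",
    "postinstall": "npx playwright install --with-deps chromium"
  },
  "dependencies": {
    "apify": "^3.5.0",
    "crawlee": "^3.15.1",
    "playwright": "^1.46.0"
  }
}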

πŸ“ Technical Notes

  • ES6 modules: Modern import/export syntax
  • Error handling: Try/catch blocks for robustness
  • Email validation: Regex and domain verification
  • Relative URLs: Automatic link resolution
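
Two of these notes are easy to make concrete. A basic email regex and relative-URL resolution could look like the sketch below (the actor's exact pattern may differ, and html stands for the fetched page content):

// Loose email pattern used to scan raw HTML
const EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

const found = html.match(EMAIL_RE) ?? [];
const unique = [...new Set(found.map((email) => email.toLowerCase()))]; // deduplication

// Resolve a relative link against the page it was found on
const absoluteUrl = new URL("/contact", "https://example.com").href;
// -> "https://example.com/contact"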

🀝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest improvements
  • Add new contact keywords
  • Optimize performance

πŸ“„ License

This project is under MIT license.