Automae Email Extractor

An advanced Apify crawler to automatically extract email addresses from any website, with anti-detection protection and Cloudflare decoding.

Pricing: Pay per event
Rating: 0.0 (0 reviews)
Developer: Theo Jim (Maintained by Community)
Actor stats: 0 bookmarks Β· 6 total users Β· 3 monthly active users Β· last modified 17 days ago

πŸ•΅οΈβ€β™‚οΈ Intelligent Email Extractor

An advanced Apify crawler to automatically extract email addresses from any website, with anti-detection protection and Cloudflare decoding.

✨ Features

πŸ” Multi-Source Extraction

  • Mailto links: Direct extraction from mailto: links
  • Cloudflare protection: Automatic decoding of data-cfemail-obfuscated emails
  • Smart regex: Email detection in raw HTML content
  • Metadata: Extraction from <meta> tags
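Cloudflare's email obfuscation stores the address as a hex string in the data-cfemail attribute: the first byte is an XOR key, and every following byte is the corresponding character XORed with that key. A minimal sketch of the decoding step (decodeCfEmail is an illustrative name, not necessarily the actor's internal function):

```javascript
// Decode a Cloudflare-obfuscated email from a data-cfemail attribute value.
// The first hex byte is the XOR key; each subsequent byte is a character
// XORed with that key.
function decodeCfEmail(encoded) {
  const key = parseInt(encoded.slice(0, 2), 16);
  let email = "";
  for (let i = 2; i < encoded.length; i += 2) {
    email += String.fromCharCode(parseInt(encoded.slice(i, i + 2), 16) ^ key);
  }
  return email;
}
```

Because XOR is its own inverse, the same transform with the same key would re-encode the address.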

🎯 Intelligent Navigation

  • Contact pages: Automatic detection of contact pages via keywords
  • Multilingual keywords: French, English, and German support
  • Smart scoring: Prioritization of the most relevant pages
  • Configurable limits: Control over the number of pages to analyze
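The scoring idea can be sketched as follows. This is an assumption about how keyword scoring might work, not the actor's exact logic: an exact path segment match (e.g. /contact) is weighted above a mere substring match, and only the top maxContactPages candidates are kept. The keyword list here is abridged from the full one shown under Advanced Configuration.

```javascript
// Hypothetical contact-link scorer: exact path segments beat substring hits.
const contactKeywords = ["contact", "contact-us", "kontakt", "about", "support"];

function scoreContactLink(url) {
  const path = new URL(url).pathname.toLowerCase();
  const segments = path.split("/").filter(Boolean);
  let score = 0;
  for (const kw of contactKeywords) {
    if (segments.includes(kw)) score += 10; // exact segment, e.g. /contact
    else if (path.includes(kw)) score += 3; // substring, e.g. /our-contacts
  }
  return score;
}

// Keep only the best-scoring candidate pages, up to maxContactPages.
function pickContactPages(urls, maxContactPages = 2) {
  return urls
    .map((u) => [u, scoreContactLink(u)])
    .filter(([, s]) => s > 0)
    .sort((a, b) => b[1] - a[1])
    .slice(0, maxContactPages)
    .map(([u]) => u);
}
```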

πŸ›‘οΈ Anti-Detection Protection

  • Fingerprinting: Realistic browser fingerprint
  • Random delays: Avoids detectable navigation patterns
  • Human-like headers: Realistic User-Agent and request headers
  • Session management: Session pool to avoid bans

πŸ“§ Filtering and Prioritization

  • Whitelist: Priority emails (contact@, hello@, info@, etc.)
  • Blacklist: Filters out unwanted emails (no-reply@, etc.)
  • Validation: Email validity verification
  • Deduplication: Automatic duplicate removal

πŸš€ Installation

# Clone the project
git clone <repository-url>
cd apify-get-emails-from-site
# Install dependencies
npm install
# Install Playwright browsers (also runs automatically via the postinstall hook)
npx playwright install --with-deps chromium

πŸ“– Usage

Input Configuration

{
  "baseUrl": "https://example.com",
  "maxContactPages": 2,
  "navigationTimeoutMs": 30000
}

Parameters:

  • baseUrl (required): Base URL to analyze
  • maxContactPages (optional, default: 2): Maximum number of contact pages to analyze
  • navigationTimeoutMs (optional, default: 30000): Navigation timeout in milliseconds
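Applying the documented defaults could look like the sketch below. normalizeInput is a hypothetical helper, not the actor's actual code; it validates baseUrl by constructing a URL and fills in the defaults stated above.

```javascript
// Hypothetical input normalization matching the documented parameters.
function normalizeInput(input) {
  if (!input || !input.baseUrl) {
    throw new Error("baseUrl is required");
  }
  return {
    baseUrl: new URL(input.baseUrl).href, // throws on an invalid URL
    maxContactPages: input.maxContactPages ?? 2, // documented default
    navigationTimeoutMs: input.navigationTimeoutMs ?? 30000, // documented default
  };
}
```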

Execution

# Local execution
npm start
# Or directly
node main.js

Output

{
  "hit": true,
  "primaryEmail": "contact@example.com",
  "emails": ["contact@example.com", "info@example.com"],
  "sourceUrl": "https://example.com/contact",
  "scanned": ["https://example.com", "https://example.com/contact"],
  "baseUrl": "https://example.com"
}

πŸ”§ Advanced Configuration

Contact Keywords

The crawler automatically detects contact pages via these keywords:

const contactKeywords = [
  "contact", "contact-us", "contactez", "kontakt", "mentions",
  "mentions-legales", "legal", "imprint", "impressum", "privacy",
  "privacy-policy", "confidentialite", "support", "help", "aide",
  "about", "a-propos", "team", "equipe", "cgu", "cgv", "faq"
];

Priority Emails

const roleWhitelist = [
  "contact@", "hello@", "info@", "support@", "sales@",
  "partners@", "partnership@", "team@", "hi@", "help@"
];

Filtered Emails

const roleBlacklist = [
  "no-reply@", "noreply@", "donotreply@"
];
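Putting the two lists together, the deduplication and prioritization step could be sketched like this. prioritizeEmails is an illustrative helper under the assumption that the whitelist picks the primary address and the blacklist filters addresses out entirely; the real actor may differ in detail.

```javascript
const roleWhitelist = ["contact@", "hello@", "info@", "support@"];
const roleBlacklist = ["no-reply@", "noreply@", "donotreply@"];

// Dedupe case-insensitively, drop blacklisted addresses, and choose a
// whitelisted address as primaryEmail (falling back to the first survivor).
function prioritizeEmails(emails) {
  const unique = [...new Set(emails.map((e) => e.toLowerCase()))];
  const kept = unique.filter(
    (e) => !roleBlacklist.some((prefix) => e.startsWith(prefix))
  );
  const primary =
    kept.find((e) => roleWhitelist.some((prefix) => e.startsWith(prefix))) ??
    kept[0] ?? null;
  return { primaryEmail: primary, emails: kept };
}
```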

πŸ—οΈ Architecture

Execution Flow

  1. Initialization: Configure the crawler and request queue
  2. Navigation: Load the main page
  3. Extraction: Analyze all email sources on the page
  4. Decision: If emails are found, stop; otherwise queue the detected contact pages
  5. Result: Return the prioritized emails

Anti-Detection Protection

  • Limited concurrency: One tab at a time
  • Random delays: 300-800 ms before navigation, 200-600 ms after
  • Fingerprinting: Unique browser fingerprint
  • Realistic headers: Recent Chrome User-Agent
  • Error handling: Automatic retry on failure
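The random-delay technique amounts to sleeping a uniformly random number of milliseconds inside the configured window. A minimal sketch (randomDelayMs and randomDelay are illustrative names):

```javascript
// Pick a uniformly random delay in [minMs, maxMs], inclusive.
function randomDelayMs(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Await this before/after navigation to avoid a machine-regular rhythm,
// e.g. randomDelay(300, 800) before and randomDelay(200, 600) after.
function randomDelay(minMs, maxMs) {
  return new Promise((resolve) => setTimeout(resolve, randomDelayMs(minMs, maxMs)));
}
```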

πŸ“Š Performance

  • Configurable timeout: Avoids blocking on slow pages
  • Request limiting: Controls load on the target server
  • Smart stop: Stops as soon as a valid email is found
  • Deduplication: Avoids redundant analysis

πŸ› οΈ Dependencies

  • apify (^3.5.0): Apify SDK for actor integration
  • crawlee (^3.15.1): Web crawling and scraping library
  • playwright (^1.46.0): Browser automation

πŸ“ Technical Notes

  • ES6 modules: Modern import/export usage
  • Error handling: Try/catch blocks for robustness
  • Email validation: Regex and domain checks
  • Relative URLs: Automatic link resolution
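Relative-link resolution needs no custom parsing in Node: the WHATWG URL constructor takes a base URL and resolves relative hrefs against it, while leaving absolute links (including mailto:) untouched. A small sketch (resolveLink is an illustrative wrapper):

```javascript
// Resolve a possibly-relative link against the page it was found on,
// using Node's built-in WHATWG URL implementation.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}
```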

🀝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest improvements
  • Add new contact keywords
  • Optimize performance

πŸ“„ License

This project is under MIT license.