Automae Email Extractor

Developer: Theo Jim (Maintained by Community)
Pricing: Pay per event
Last modified: 18 days ago
🕵️‍♂️ Intelligent Email Extractor

An advanced Apify crawler to automatically extract email addresses from any website, with anti-detection protection and Cloudflare decoding.

✨ Features

🔍 Multi-Source Extraction

  • Mailto links: Direct extraction from mailto: links
  • Cloudflare protection: Automatic decoding of data-cfemail-encoded emails
  • Smart regex: Email detection in HTML content
  • Metadata: Extraction from <meta> tags
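For context on the Cloudflare case: the protection stores a hex string in the `data-cfemail` attribute, where the first byte is an XOR key applied to every following byte. A minimal standalone decoder, as a sketch (the name `decodeCfEmail` is illustrative, not taken from the actor's source), could look like:

```javascript
// Decode a Cloudflare data-cfemail value: the first hex byte is an XOR key;
// XORing each subsequent hex byte with the key yields one character.
function decodeCfEmail(encoded) {
  const key = parseInt(encoded.slice(0, 2), 16);
  let email = "";
  for (let i = 2; i < encoded.length; i += 2) {
    email += String.fromCharCode(parseInt(encoded.slice(i, i + 2), 16) ^ key);
  }
  return email;
}

// Example: "422302206c21" decodes to "a@b.c" (key 0x42).
```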

🎯 Intelligent Navigation

  • Contact pages: Automatic detection of contact pages via keywords
  • Multilingual keywords: French, English, and German support
  • Smart scoring: Prioritization of the most relevant pages
  • Configurable limits: Control the number of pages to analyze
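One way to picture the scoring step, as an illustrative sketch (the keyword list here is truncated, and `scoreLink` is a placeholder name assuming that earlier keywords are considered more relevant):

```javascript
// Illustrative scoring: a link whose path matches an earlier keyword in the
// list scores higher, so /contact-us outranks /faq.
const contactKeywords = ["contact", "kontakt", "mentions", "about", "faq"];

function scoreLink(href) {
  const path = new URL(href).pathname.toLowerCase();
  const idx = contactKeywords.findIndex((kw) => path.includes(kw));
  return idx === -1 ? -1 : contactKeywords.length - idx; // higher = more relevant
}

const links = ["https://example.com/faq", "https://example.com/contact-us"];
links.sort((a, b) => scoreLink(b) - scoreLink(a));
// links[0] → "https://example.com/contact-us"
```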

🛡️ Anti-Detection Protection

  • Fingerprinting: Realistic browser fingerprint
  • Random delays: Avoids detectable navigation patterns
  • Human-like headers: Realistic User-Agent and headers
  • Session management: Session pool to avoid bans

📧 Filtering and Prioritization

  • Whitelist: Priority emails (contact@, hello@, info@, etc.)
  • Blacklist: Filters out unwanted emails (no-reply@, etc.)
  • Validation: Email validity verification
  • Deduplication: Automatic duplicate removal
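The combined behavior can be sketched in a few lines (a simplified illustration, not the actor's exact code):

```javascript
// Deduplicate, drop blacklisted patterns, then sort whitelisted
// role addresses (contact@, hello@, ...) to the front.
function filterAndPrioritize(emails, blacklist, whitelist) {
  const unique = [...new Set(emails.map((e) => e.toLowerCase()))]; // deduplicate
  const kept = unique.filter((e) => !blacklist.some((p) => e.includes(p)));
  return kept.sort(
    (a, b) =>
      Number(whitelist.some((p) => b.startsWith(p))) -
      Number(whitelist.some((p) => a.startsWith(p)))
  );
}

// filterAndPrioritize(
//   ["no-reply@x.com", "Contact@x.com", "bob@x.com", "contact@x.com"],
//   ["no-reply@"], ["contact@"]
// ) → ["contact@x.com", "bob@x.com"]
```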

🚀 Installation

# Clone the project
git clone <repository-url>
cd apify-get-emails-from-site
# Install dependencies
npm install
# Install Playwright (automatic via postinstall)
npx playwright install --with-deps chromium

📖 Usage

Input Configuration

{
  "baseUrl": "https://example.com",
  "maxContactPages": 2,
  "navigationTimeoutMs": 30000,
  "blacklist": ["spam@", "test@", "@example.org"]
}

Parameters:

  • baseUrl (required): Base URL to analyze
  • maxContactPages (optional, default: 2): Maximum number of contact pages to analyze
  • navigationTimeoutMs (optional, default: 30000): Navigation timeout in ms
  • blacklist (optional, default: []): Array of email patterns to filter out. These patterns are added to the default blacklist (no-reply@, noreply@, donotreply@, @mail.com); any email containing one of them is excluded from the results.

Execution

# Local execution
npm start
# Or directly
node main.js

Output

{
  "hit": true,
  "primaryEmail": "contact@example.com",
  "domain": "example.com",
  "emails": ["contact@example.com", "info@example.com"],
  "sourceUrl": "https://example.com/contact",
  "scanned": ["https://example.com", "https://example.com/contact"],
  "baseUrl": "https://example.com"
}

🔧 Advanced Configuration

Contact Keywords

The crawler automatically detects contact pages via these keywords:

const contactKeywords = [
  "contact", "contact-us", "contactez", "kontakt", "mentions",
  "mentions-legales", "legal", "imprint", "impressum", "privacy",
  "privacy-policy", "confidentialite", "support", "help", "aide",
  "about", "a-propos", "team", "equipe", "cgu", "cgv", "faq"
];

Priority Emails

const roleWhitelist = [
  "contact@", "hello@", "info@", "support@", "sales@",
  "partners@", "partnership@", "team@", "hi@", "help@"
];

Filtered Emails

The crawler uses a default blacklist to filter unwanted emails:

const roleBlacklist = [
  "no-reply@", "noreply@", "donotreply@", "@mail.com"
];

You can extend this blacklist by providing a custom blacklist parameter in the input:

{
  "baseUrl": "https://example.com",
  "blacklist": ["spam@", "test@", "@example.org"]
}

Any email containing a pattern from either the default blacklist or your custom blacklist will be filtered out.

๐Ÿ—๏ธ Architecture

Execution Flow

  1. Initialization: Crawler and queue configuration
  2. Navigation: Main page loading
  3. Extraction: Multi-source email analysis
  4. Decision: If emails are found → stop; otherwise → visit contact pages
  5. Result: Return prioritized emails
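The flow above can be sketched as follows (the function names and the `extract`/`findContactLinks` callbacks are placeholders, not the actor's actual API):

```javascript
// Simplified sketch of the execution flow: scan the base page first,
// stop as soon as emails are found, otherwise visit contact pages.
function crawl(baseUrl, maxContactPages, extract, findContactLinks) {
  const queue = [baseUrl];
  const scanned = [];
  while (queue.length > 0 && scanned.length < 1 + maxContactPages) {
    const url = queue.shift();
    scanned.push(url);
    const emails = extract(url); // step 3: multi-source extraction
    if (emails.length > 0) {
      return { hit: true, emails, sourceUrl: url, scanned }; // step 4: stop early
    }
    if (url === baseUrl) {
      // step 4, no hit: enqueue the highest-scoring contact pages
      queue.push(...findContactLinks(url).slice(0, maxContactPages));
    }
  }
  return { hit: false, emails: [], scanned };
}
```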

Anti-Detection Protection

  • Limited concurrency: 1 tab at a time
  • Random delays: 300-800 ms before navigation, 200-600 ms after
  • Fingerprinting: Unique browser fingerprint
  • Realistic headers: Recent Chrome User-Agent
  • Error handling: Automatic retry on failures
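The random-delay pattern can be as simple as the following sketch (`randomDelayMs` and `sleep` are illustrative helper names, not part of the actor's code):

```javascript
// Pick a random delay within an inclusive range, e.g. 300-800 ms.
const randomDelayMs = (min, max) =>
  Math.floor(min + Math.random() * (max - min + 1));
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside a crawler handler:
// await sleep(randomDelayMs(300, 800)); // before navigation
// await page.goto(url);
// await sleep(randomDelayMs(200, 600)); // after navigation
```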

📊 Performance

  • Configurable timeout: Avoids blocking
  • Request limiting: Server load control
  • Smart stop: Stops as soon as a valid email is found
  • Deduplication: Avoids redundant analysis

🛠️ Dependencies

  • apify (^3.5.0): Scraping framework
  • crawlee (^3.15.1): Crawling library
  • playwright (^1.46.0): Browser automation

📝 Technical Notes

  • ES6 modules: Modern import/export usage
  • Error handling: Try/catch for robustness
  • Email validation: Regex and domain verification
  • Relative URLs: Automatic link resolution
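A validation check of the kind mentioned above might look like this simplified sketch (the actor's actual regex may differ):

```javascript
// Simplified email-shape check: local part, "@", domain with a TLD.
const EMAIL_RE = /^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/;
const isValidEmail = (e) => EMAIL_RE.test(e);

// isValidEmail("contact@example.com") → true
// isValidEmail("not-an-email")        → false
```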

🤝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest improvements
  • Add new contact keywords
  • Optimize performance

📄 License

This project is licensed under the MIT License.