Email Extractor Pro — Bulk Website Emails, No Hunter.io Cap

Pricing: Pay per usage

Outreach-ready email lists in 2 min — emails + page title + URL + role hint as CSV. No Hunter.io cap, no Apollo seat fee. 109 lifetime runs · 10 paying users. For B2B prospecting + sales outreach + recruiter sourcing. spinov001@gmail.com · blog.spinov.online · t.me/scraping_ai

Rating: 0.0 (0 reviews)

Developer: Alex (Maintained by Community)

Actor stats: 0 bookmarks · 10 total users · 5 monthly active users · last modified 28 minutes ago

Extract emails, phone numbers, and social media links from any list of websites. The crawler walks each domain with a same-host strategy, deduplicates results, filters out junk, and returns a flat dataset ready for outreach. No 25-results-per-month cap, no per-lookup pricing — pay per Apify compute unit and run as many domains as you need.

107 lifetime production runs on this exact actor as of 2026-04-30.


What this actor returns (verified against src/main.js)

When at least one email is found, one record is emitted per email:

```json
{
  "email": "contact@example.com",
  "source": "https://example.com/contact",
  "domain": "example.com",
  "fromMailto": true,
  "phones": ["+1 (555) 123-4567", "+44 20 7946 0958"],
  "socialLinks": [
    { "platform": "linkedin", "url": "https://linkedin.com/company/example", "foundOn": "https://example.com/contact" }
  ],
  "scrapedAt": "2026-04-29T12:00:00.000Z"
}
```

When zero emails are found, a single summary record is emitted (still includes phones + socials if any):

```json
{
  "email": null,
  "message": "No emails found on the provided URLs",
  "phones": ["..."],
  "socialLinks": [{ "platform": "...", "url": "...", "foundOn": "..." }],
  "urlsScanned": 5,
  "scrapedAt": "2026-04-29T12:00:00.000Z"
}
```

If neither emails, phones, nor social links are found, the record contains only `email: null`, `message`, `urlsScanned`, and `scrapedAt`.

fromMailto: true flags emails extracted from <a href="mailto:..."> links — these are the highest-confidence captures. Regex-extracted emails do not carry fromMailto: false; the field is simply absent, so read it defensively, e.g. record.get('fromMailto', False) in Python.
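A minimal Python sketch of that defensive read (record shapes follow the examples above; the helper name is mine):

```python
records = [
    {"email": "contact@example.com", "fromMailto": True},  # mailto: capture
    {"email": "info@example.com"},  # regex capture: the key is simply absent
]

def is_mailto_capture(record: dict) -> bool:
    """True only for emails lifted from <a href="mailto:..."> links."""
    return record.get("fromMailto", False)

high_confidence = [r["email"] for r in records if is_mailto_capture(r)]
# high_confidence == ["contact@example.com"]
```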


Features (verified against src/main.js)

  • Deep crawling — follows internal links up to maxDepth (configurable 0–5 in input_schema).
  • mailto: and tel: priority — <a href="mailto:..."> and <a href="tel:..."> links are extracted directly via a DOM selector first, then a text regex sweeps the rest.
  • 7 social platforms — LinkedIn, Twitter / X, Facebook, Instagram, YouTube (@, /channel/, /c/), GitHub, TikTok (@handle form only).
  • Junk filter — drops noreply@, no-reply@, donotreply@, *@example.com, *@test.com, *@localhost, sentry.io, wixpress.com, wordpress.com, and image-suffix false positives (*.png, *.jpg, *.gif, *.svg, *@2x.*, *@3x.*). Also drops anything > 100 chars.
  • Email dedup — emails are lowercased + Map-based dedup runs across the whole crawl regardless of the deduplicateEmails flag (see Honest Limitations).
  • Phone regex — requires either a +CC international prefix, a (NNN) group, or NNN-NNN-NNNN separators — avoids matching plain numeric IDs / SKU strings.
  • Same-domain enforcement — strategy: 'same-domain' plus a transformRequestFunction that re-checks linkDomain === targetDomain. Won't wander into ad networks or analytics CDNs.
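The junk filter above can be approximated in Python (an illustrative sketch, not the production JavaScript in src/main.js; treating the listed providers as exact-domain matches is my assumption):

```python
import re

JUNK_LOCALPARTS = {"noreply", "no-reply", "donotreply"}
JUNK_DOMAINS = {"example.com", "test.com", "localhost",
                "sentry.io", "wixpress.com", "wordpress.com"}
IMAGE_SUFFIX = re.compile(r"\.(png|jpe?g|gif|svg)$", re.IGNORECASE)
RETINA_MARKER = re.compile(r"@[23]x\.", re.IGNORECASE)  # logo@2x.png etc.

def is_junk(email: str) -> bool:
    if len(email) > 100:                 # oversized matches are never real
        return True
    local, _, domain = email.lower().partition("@")
    if local in JUNK_LOCALPARTS or domain in JUNK_DOMAINS:
        return True
    # image filenames regex-match as "emails"; drop them
    return bool(IMAGE_SUFFIX.search(email) or RETINA_MARKER.search(email))
```

For example, `is_junk("logo@2x.png")` is true while a normal business address passes through.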

Use cases

  • Lead generation — build email lists from category-targeted seed company sites
  • Sales prospecting — get decision-maker contacts across hundreds of target companies in one run
  • PR / link building — collect journalist + blogger emails for outreach
  • Recruitment — extract recruiter contacts from agency / freelancer sites
  • Data enrichment — append email + phone + social to an existing CRM dump

Input parameters (verified against .actor/input_schema.json)

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| urls | Array | [] (required) | | Website URLs to scan, with or without https:// prefix |
| maxPagesPerDomain | Integer | 20 | 1–100 | Pages per domain (see budget caveat below) |
| maxDepth | Integer | 2 | 0–5 | Link depth (0 = provided URLs only) |
| includePhones | Boolean | true | | Extract phone numbers (text + tel: href) |
| includeSocialLinks | Boolean | true | | Extract social profile URLs |
| deduplicateEmails | Boolean | true | | Currently a no-op — see Honest Limitations |

Honest limitations

  • deduplicateEmails is a dead input. It is destructured from input (line 11 of src/main.js) but never referenced again. Email dedup via the in-memory Map keyed by lowercase email always runs regardless of this flag's value. Setting it to false does not produce duplicate rows. The schema field is kept for backwards compatibility; treat it as informational only.
  • The "priority routing" for /contact / /about / /team / /imprint / /impressum / /privacy / /legal paths is cosmetic. The transformRequestFunction sets request.userData.priority = 1 on those URLs, but Crawlee's RequestQueue does not consult userData.priority for queue ordering — pages are processed FIFO. To genuinely front-load contact pages, request a custom build using forefront: true on addRequests or a dedicated priority queue.
  • maxRequestsPerCrawl = urls.length × maxPagesPerDomain. This is a shared budget pool, not per-domain. If domain #1 has many internal links, it can consume far more than maxPagesPerDomain requests and starve domains #2-#N. To get a hard per-domain cap, run domains in separate Apify runs.
  • maxConcurrency: 10 is global, not per-domain. Five domains share the same 10-request concurrency budget. Larger fan-outs need a custom build.
  • Per-email socialLinks are filtered by foundOn === source URL. A social link discovered on /about while the email lives on /contact will not appear in that email's record (it does still appear in the global crawl state but is not joined to that email).
  • Phone numbers are pooled globally per run, not per email. Every email record gets the same phones: [...allPhones] array. There is no email→phone proximity join.
  • No proxy by default. CheerioCrawler is constructed without proxyConfiguration. Targets that block datacenter IPs (Cloudflare-protected, anti-bot-walled sites) will return 403/empty. For proxy-routed runs, request a custom build.
  • Static HTML only. No headless browser → JavaScript-rendered email/contact pages return no data. About 30 % of modern marketing sites move contact info into client-rendered React; this actor cannot extract those.
  • Phone regex over-fires on long invoice numbers / part numbers that match NNN-NNN-NNNN shapes. The output is additionally filtered by digit count (7–15), but borderline values still slip through.
  • Junk filter is conservative. Real emails on wordpress.com or wixpress.com mailbox subdomains are filtered out (false negatives) — a tradeoff for cleaner output.
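The shared-budget limitation above can be worked around today by fanning each domain into its own run. A hedged sketch (the helper name is mine; each generated input is then passed to apify-client's `.call()` as shown in the Python example further down):

```python
def per_domain_inputs(urls, max_pages=20, max_depth=2):
    """Build one run input per domain so maxPagesPerDomain becomes a
    hard per-domain cap instead of a shared pool across all domains."""
    return [
        {"urls": [u], "maxPagesPerDomain": max_pages, "maxDepth": max_depth}
        for u in urls
    ]

inputs = per_domain_inputs(["https://a.com", "https://b.com"])
# two inputs, each run getting its own 20-page budget
```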

Cost

Apify charges per compute unit and per result, not per email. Concrete cost depends on page weight, redirects, and maxDepth. As of 2026-04-29, a 50-domain run with maxPagesPerDomain: 20, maxDepth: 2 typically completes inside Apify's $5 free-tier credit; heavier runs scale linearly with crawled pages. Run a small test (1–2 domains) and check Console → Run Cost before scaling.
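A quick sanity check on the request budget for that example run (arithmetic only; actual compute-unit cost also depends on page weight and redirects):

```python
domains = 50
max_pages_per_domain = 20

# maxRequestsPerCrawl = urls.length * maxPagesPerDomain (a shared pool)
max_requests = domains * max_pages_per_domain
# upper bound of 1000 pages fetched across the whole run
```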


Quick start

  1. Open the actor → Try for free.
  2. Paste target URLs:

```json
{
  "urls": ["https://competitor1.com", "https://competitor2.com"],
  "maxPagesPerDomain": 20,
  "maxDepth": 2
}
```

  3. Click Start. Results stream into the dataset (JSON / CSV / Excel export).

Pulling results from your code

Python (apify-client):

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("knotless_cadence/email-extractor-pro").call(run_input={
    "urls": ["https://example.com"],
    "maxDepth": 2,
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["email"], "—", item.get("domain"), "—", item.get("fromMailto", False))
```

JavaScript (fetch):

```javascript
const res = await fetch(
  `https://api.apify.com/v2/acts/knotless_cadence~email-extractor-pro/runs/last/dataset/items?token=YOUR_TOKEN`
);
const contacts = await res.json();
```

How it works

  1. Seed — homepage HTML is fetched.
  2. Link discovery — internal links extracted via Crawlee enqueueLinks({ strategy: 'same-domain' }). The transformRequestFunction filters cross-domain links and tags priority (cosmetic).
  3. Email pass — regex over rendered HTML + body.text() + mailto: href extraction.
  4. Phone pass — tel: href + strict text regex (separator-or-prefix required).
  5. Social pass — 7-platform regex over raw HTML (catches links inside footer markup, JSON-LD, attribute strings).
  6. Filter + dedup — junk patterns rejected, lowercase normalization, Map-based dedup.
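Steps 3, 4, and 6 can be sketched together in Python (illustrative only; the actor itself uses Cheerio in JavaScript, and these regexes are my assumptions — the sketch also omits the 7–15 digit-count guard and the junk filter):

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)
PHONE_RE = re.compile(
    r"\+\d{1,3}[\d\s().-]{6,14}"            # +CC international prefix
    r"|\(\d{3}\)[\s.-]?\d{3}[\s.-]?\d{4}"   # (NNN) NNN-NNNN
    r"|\d{3}-\d{3}-\d{4}"                   # NNN-NNN-NNNN separators
)

def extract_contacts(html: str) -> dict:
    """Lowercase-keyed email dict mimics the Map-based dedup; the value
    records whether the hit came from a mailto: link."""
    emails = {}
    for m in MAILTO_RE.finditer(html):       # mailto: pass runs first
        emails.setdefault(m.group(1).lower(), True)
    for m in EMAIL_RE.finditer(html):        # then the text regex sweep
        emails.setdefault(m.group().lower(), False)
    phones = [m.group().strip() for m in PHONE_RE.finditer(html)]
    return {"emails": emails, "phones": phones}

html = '<a href="mailto:Sales@Acme.com">mail</a> or call (555) 123-4567'
result = extract_contacts(html)
# result["emails"] == {"sales@acme.com": True}
# result["phones"] == ["(555) 123-4567"]
```

Note how a bare digit run like `invoice 123456789` is rejected because it carries none of the three phone signals.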

Combine with other actors in this portfolio

  1. Google Maps Scraper Pro — find businesses by category + location.
  2. Email Extractor Pro (this one) — pull their contact emails.
  3. Email Validator — verify deliverability before outreach.
  4. Website Tech Stack Detector — qualify leads by their stack.

Proof of delivery: 31 published Apify actors (78 total in portfolio). The flagship Trustpilot scraper has 951 lifetime production runs; this Email Extractor has 107+ runs. One paid 3-article series shipped in March 2026 ($150, proxy industry). Pilot pricing locked through May 2026.

Sample request? Reply sample to spinov001@gmail.com and we'll send 2 published case-study articles within 24 hours.


Need a custom build instead of self-serve?

| Tier | Price | Includes |
|---|---|---|
| Pilot | $97 | 1 actor or modification, 7-day support |
| Standard | $297 | Custom actor + Slack/email alerts on results, 30-day support |
| Premium | $797 | Custom actor + dashboard + 90-day support + 1 modification round |

Drop specs, schema, or target URLs in an email — quote back same day.

Email: spinov001@gmail.com

Proof of work: 31 published Apify actors / 78 total in portfolio — 951 lifetime runs on the Trustpilot scraper, paid 3-article series delivered for a client in the proxy industry ($150).

Blog (case studies + writeups): blog.spinov.online

Telegram channel (scraping & data engineering tips): t.me/scraping_ai


Honest disclosure

  • 107 lifetime production runs on this specific actor as of 2026-04-30 — well past prototype, but smaller-volume than the Trustpilot flagship.
  • No private data scraped. No login bypass. robots.txt is not explicitly checked (Crawlee default). For strict compliance with robots.txt, request a custom build that wires it in.
  • Independent project — not affiliated with Hunter.io, Apollo, or any other lead-gen vendor.
  • This actor is maintained by the same author who runs apify.com/knotless_cadence (78 actors, 31 public).