Website Contact Crawler

Pricing

Pay per usage

Crawls websites to extract emails, phones, and social links.

Developer

Man Mohit verma

Maintained by Community

Last modified

4 days ago

A Python Apify Actor that crawls a list of start URLs, follows links up to a configurable depth, and extracts:

  • email addresses
  • phone numbers
  • Facebook, X/Twitter, WhatsApp, YouTube, Instagram, and LinkedIn links
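
As a rough illustration of this kind of extraction, the sketch below finds emails with a regular expression and classifies social links by hostname. It is not the actor's actual code: the `SOCIAL_HOSTS` mapping, the type labels, and both function names are assumptions made for the example, and phone extraction (which the actor appears to do with the phonenumbers library) is left out.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical hostname -> contact type mapping; labels are illustrative only.
SOCIAL_HOSTS = {
    "facebook.com": "facebook",
    "x.com": "twitter",
    "twitter.com": "twitter",
    "wa.me": "whatsapp",
    "api.whatsapp.com": "whatsapp",
    "youtube.com": "youtube",
    "instagram.com": "instagram",
    "linkedin.com": "linkedin",
}

# Deliberately simple email pattern; real-world address syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify_link(href: str) -> Optional[str]:
    """Return a social network label for href, or None if it is not one."""
    host = (urlparse(href).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return SOCIAL_HOSTS.get(host)


def extract_emails(text: str) -> List[str]:
    """Return lower-cased emails found in text, de-duplicated in first-seen order."""
    return list(dict.fromkeys(match.lower() for match in EMAIL_RE.findall(text)))
```

For example, `classify_link("https://www.instagram.com/acme")` yields the assumed label `"instagram"`, while a link to an ordinary page yields `None`.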

Each extracted contact is stored with:

  • startingUrl — seed URL for the crawl branch
  • currentPage — URL that was requested and crawled (e.g. /pages/contact-us)
  • pageFetched — final URL after HTTP redirects, where the HTML was parsed
  • type
  • value
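
To make the record shape concrete, here is a minimal sketch of one contact as a Python dataclass. The five field names come from the list above; the class name, the sample URLs, and the `"email"` type label are assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ContactRecord:
    startingUrl: str  # seed URL for the crawl branch
    currentPage: str  # URL that was requested
    pageFetched: str  # final URL after redirects
    type: str         # kind of contact; the exact labels are assumed here
    value: str        # the extracted email, phone number, or link


record = ContactRecord(
    startingUrl="https://shop.example.com/",
    currentPage="https://shop.example.com/pages/contact-us",
    pageFetched="https://www.example.com/pages/contact-us",
    type="email",
    value="hello@example.com",
)

# A plain dict like this is what would end up as one dataset row.
row = asdict(record)
```

Here `currentPage` and `pageFetched` differ because the hypothetical request was redirected to another host.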

Output

  1. Default dataset — one row per unique contact (standard Apify export as JSON/CSV).
  2. Key-Value Store
    • contacts.json — full aggregated array of all contacts from the run.
    • pages-scraped.json — per seed URL, all HTML pages that were successfully scraped (startingUrl + pagesScraped array).
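
The two Key-Value Store files can be summarized with a few lines of Python. This sketch assumes that contacts.json is a list of objects with a `type` field and that pages-scraped.json is a list of objects with `startingUrl` and `pagesScraped` keys; the top-level list shape and the function name are assumptions, so adjust them if a real run looks different.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Tuple


def summarize_outputs(store_dir: Path) -> Tuple[Counter, Dict[str, int]]:
    """Count contacts per type and scraped pages per seed URL."""
    contacts = json.loads((store_dir / "contacts.json").read_text(encoding="utf-8"))
    pages = json.loads((store_dir / "pages-scraped.json").read_text(encoding="utf-8"))

    contacts_by_type = Counter(contact["type"] for contact in contacts)
    pages_per_seed = {
        entry["startingUrl"]: len(entry["pagesScraped"]) for entry in pages
    }
    return contacts_by_type, pages_per_seed
```

After a local run, calling `summarize_outputs(Path("storage/key_value_stores/default"))` should give a quick view of what was collected.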

Input

  • startUrls: list of seed URLs (JSON array; supports large lists such as ~1,000 sites)
  • depthOfPages: crawl depth from each seed URL
  • defaultPhoneRegion: default region that the phonenumbers library assumes when a number has no country code
  • maxConcurrencyPerIp: concurrent fetches per worker band (default 50)
  • proxyPoolSize: number of worker bands (default 10); total workers = maxConcurrencyPerIp × proxyPoolSize
  • maxConcurrencyPerHost: cap simultaneous requests per website host (default 5; set 0 to disable)
  • dedupeScope: global (one row per value) or perStartingUrl (same value allowed under different seeds)
  • proxyConfiguration: Apify Proxy or custom proxy settings (RESIDENTIAL recommended on Apify)
  • additionalPaths: extra paths added as depth-1 pages for each seed
  • excludeKeywords: keywords that block matching URLs at every depth
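
The difference between the two dedupeScope values can be sketched as a choice of de-duplication key. This is an assumed implementation, not the actor's code: in particular, including `type` in the key and the function name `dedupe` are guesses.

```python
from typing import Dict, Iterable, Iterator


def dedupe(records: Iterable[Dict[str, str]], scope: str = "global") -> Iterator[Dict[str, str]]:
    """Yield each record the first time its de-duplication key is seen."""
    seen = set()
    for record in records:
        if scope == "global":
            # One row per value across the whole run.
            key = (record["type"], record["value"])
        elif scope == "perStartingUrl":
            # The same value may appear once under each seed.
            key = (record["startingUrl"], record["type"], record["value"])
        else:
            raise ValueError(f"unknown dedupeScope: {scope!r}")
        if key not in seen:
            seen.add(key)
            yield record
```

With the same email found under two different seeds, `global` keeps one row and `perStartingUrl` keeps two.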

Concurrency and proxy

  • Worker bands: proxyPoolSize × maxConcurrencyPerIp async workers (default 500) share a global crawl queue.
  • Per-request IP rotation: when Apify Proxy is enabled, every HTTP request uses a new residential proxy session (session_id is unique per fetch). Worker bands organize parallelism; they do not pin 10 fixed IPs.
  • Per-host limit: maxConcurrencyPerHost caps simultaneous requests to a single host, so one domain is not overloaded when many seeds or pages target it.
  • Cost: high concurrency with residential proxies can be expensive; lower maxConcurrencyPerIp or proxyPoolSize if you hit rate limits or budget limits.
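
The worker and per-host model described above can be sketched with asyncio: a pool of workers drains one shared queue, and a semaphore per host caps simultaneous requests. This is a simplified illustration under assumptions: the `crawl` function and its `fetch` callback are hypothetical, link discovery is omitted, and proxy session rotation is not modeled.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List
from urllib.parse import urlparse


async def crawl(
    urls: List[str],
    fetch: Callable[[str], Awaitable[object]],
    total_workers: int = 500,  # proxyPoolSize x maxConcurrencyPerIp
    max_per_host: int = 5,     # 0 disables the per-host cap
) -> Dict[str, object]:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    # One semaphore per host, created lazily on first use.
    host_limits = defaultdict(lambda: asyncio.Semaphore(max(1, max_per_host)))
    results: Dict[str, object] = {}

    async def worker() -> None:
        while True:
            url = await queue.get()
            try:
                if max_per_host > 0:
                    host = urlparse(url).hostname or ""
                    async with host_limits[host]:
                        results[url] = await fetch(url)
                else:
                    results[url] = await fetch(url)
            except Exception as exc:
                # Record the failure so one bad URL does not stop the worker.
                results[url] = exc
            finally:
                queue.task_done()

    count = min(total_workers, max(1, len(urls)))
    workers = [asyncio.create_task(worker()) for _ in range(count)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Lowering `total_workers` or `max_per_host` in this sketch corresponds to lowering the actor's concurrency inputs when a site rate limits or costs grow.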

Notes

  • The crawler stays on the seed URL's host or subdomain family. It also follows links to other hosts already seen in that crawl branch. This is common for Shopify: a *.myshopify.com seed redirects to a custom domain while the HTML still links to myshopify.com pages.
  • Static assets, mailto:, tel:, javascript:, and fragment-only links are ignored for crawling.
  • additionalPaths are applied when the seed page at depth 0 is fetched, so they become depth-1 pages alongside links discovered from that page. excludeKeywords blocks matching URLs at every depth.
  • 429 responses trigger host-specific cooldowns and respect Retry-After. Lower concurrency if a site still rate limits heavily.
  • Local runs work without Apify Proxy credentials; on Apify, the actor uses the residential proxy pool when available.
  • Default run options: 2-hour timeout, 8 GB memory (see .actor/actor.json). Increase timeout for very large seed lists and depth.
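
The link rules in these notes can be sketched as a single filtering step. This is an assumed implementation for illustration only: the function name, the list of static extensions, and the substring matching for excludeKeywords are guesses, and the host restriction from the first note is not modeled.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of extensions treated as static assets.
STATIC_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".ico",
    ".css", ".js", ".pdf", ".zip", ".woff", ".woff2",
)
SKIPPED_SCHEMES = ("mailto:", "tel:", "javascript:")


def next_urls(
    page_url: str,
    hrefs: Iterable[str],
    depth: int,
    additional_paths: Iterable[str] = (),
    exclude_keywords: Iterable[str] = (),
) -> List[str]:
    """Return the URLs to crawl next from one fetched page."""
    candidates = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # empty or fragment-only link
        if href.lower().startswith(SKIPPED_SCHEMES):
            continue  # mailto:, tel:, javascript:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        candidates.append(absolute)

    # additionalPaths are only added when the seed page (depth 0) is fetched.
    if depth == 0:
        candidates.extend(urljoin(page_url, path) for path in additional_paths)

    keywords = [keyword.lower() for keyword in exclude_keywords]
    kept: List[str] = []
    for url in candidates:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.path.lower().endswith(STATIC_EXTENSIONS):
            continue  # static asset
        if any(keyword in url.lower() for keyword in keywords):
            continue  # excludeKeywords apply at every depth
        if url not in kept:
            kept.append(url)
    return kept
```

In this sketch the returned URLs would be queued at `depth + 1`, so paths from additionalPaths become depth-1 pages alongside links discovered on the seed page.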

Local run

python -m pip install -r requirements.txt
python -m src

For local testing, put an INPUT.json file under storage/key_value_stores/default/ or set APIFY_LOCAL_STORAGE_DIR to a folder with that structure.
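
A small script can write that file for you. The input keys below are the ones listed under Input, but every value shape is an assumption: in particular, startUrls may need to be a list of objects rather than plain strings, so check the actor's input schema before relying on this. The helper name `write_local_input` is hypothetical.

```python
import json
from pathlib import Path


def write_local_input(storage_dir: Path, actor_input: dict) -> Path:
    """Write INPUT.json into the default local key-value store."""
    target = storage_dir / "key_value_stores" / "default" / "INPUT.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return target


# Small values suited to a first smoke test; shapes are assumed, not confirmed.
EXAMPLE_INPUT = {
    "startUrls": ["https://example.com"],
    "depthOfPages": 1,
    "defaultPhoneRegion": "US",
    "maxConcurrencyPerIp": 5,
    "proxyPoolSize": 2,
    "maxConcurrencyPerHost": 5,
    "dedupeScope": "global",
    "additionalPaths": ["/contact"],
    "excludeKeywords": ["logout"],
}
```

Calling `write_local_input(Path("storage"), EXAMPLE_INPUT)` before `python -m src` would place the file where the local run expects it.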

After a run, check storage/datasets/default/ for dataset rows, and storage/key_value_stores/default/ for the aggregated contacts.json and pages-scraped.json files.

Publish to Apify

apify login
apify push

Smoke-test with a few startUrls and depthOfPages=1, then scale up gradually before running ~1,000 seeds at full concurrency.