Website Contact Crawler
Pricing
Pay per usage
Developer
Man Mohit verma
Maintained by Community
Python Apify Actor that crawls a list of start URLs, follows links up to a configurable depth, and extracts:
- email addresses
- phone numbers
- Facebook, X/Twitter, WhatsApp, YouTube, Instagram, and LinkedIn links
Each extracted contact is stored with:
- startingUrl: seed URL for the crawl branch
- currentPage: URL that was requested and crawled (e.g. /pages/contact-us)
- pageFetched: final URL after HTTP redirects, where the HTML was parsed
- type
- value
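As a rough illustration of the record shape described above, the following sketch extracts email addresses and social links from an HTML string and wraps each hit in the five fields. It is not the actor's implementation: the regular expressions, the `SOCIAL_HOSTS` mapping, the `type` labels, and the function names are all assumptions made for this example. Phone extraction is left out because it depends on the third-party phonenumbers library.

```python
import re
from urllib.parse import urlparse

# Simplified patterns for illustration; production extraction needs more filtering.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

# Hypothetical host-to-type mapping; the actor's real "type" labels may differ.
SOCIAL_HOSTS = {
    "facebook.com": "facebook",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "wa.me": "whatsapp",
    "youtube.com": "youtube",
    "instagram.com": "instagram",
    "linkedin.com": "linkedin",
}


def social_type(url):
    """Return the assumed type label for a social link, or None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return SOCIAL_HOSTS.get(host)


def extract_contacts(html, starting_url, current_page, page_fetched):
    """Build one record per email or social link found in the HTML."""
    found = [("email", match) for match in EMAIL_RE.findall(html)]
    for href in HREF_RE.findall(html):
        kind = social_type(href)
        if kind:
            found.append((kind, href))
    return [
        {
            "startingUrl": starting_url,
            "currentPage": current_page,
            "pageFetched": page_fetched,
            "type": kind,
            "value": value,
        }
        for kind, value in found
    ]
```

Calling `extract_contacts(html, seed, requested_url, final_url)` returns a list of dictionaries that can be pushed to a dataset one row at a time.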
Output
- Default dataset: one row per unique contact (standard Apify export as JSON/CSV).
- Key-Value Store:
  - contacts.json: full aggregated array of all contacts from the run.
  - pages-scraped.json: per seed URL, all HTML pages that were successfully scraped (startingUrl + pagesScraped array).
Input
- startUrls: list of seed URLs (JSON array; supports large lists such as ~1,000 sites)
- depthOfPages: crawl depth from each seed URL
- defaultPhoneRegion: default region passed to the phonenumbers library, used for numbers written without an international prefix
- maxConcurrencyPerIp: concurrent fetches per worker band (default 50)
- proxyPoolSize: number of worker bands (default 10); total workers = maxConcurrencyPerIp × proxyPoolSize
- maxConcurrencyPerHost: cap on simultaneous requests per website host (default 5; set 0 to disable)
- dedupeScope: global (one row per value) or perStartingUrl (same value allowed under different seeds)
- proxyConfiguration: Apify Proxy or custom proxy settings (RESIDENTIAL recommended on Apify)
- additionalPaths / excludeKeywords: add depth-1 paths and filter URLs
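The difference between the two dedupeScope values can be sketched as a change in the deduplication key. This is only an illustration of the idea: the function name is hypothetical, and keying on the (type, value) pair rather than on the value alone is an assumption, since the source does not say how the actor builds its key.

```python
def dedupe(contacts, scope="global"):
    """Keep the first occurrence of each contact under the given scope."""
    seen = set()
    unique = []
    for contact in contacts:
        # Assumed key: the (type, value) pair identifies a contact.
        key = (contact["type"], contact["value"])
        if scope == "perStartingUrl":
            # The same value may appear once under each seed URL.
            key = (contact["startingUrl"],) + key
        if key not in seen:
            seen.add(key)
            unique.append(contact)
    return unique
```

With `scope="global"`, an email found under two different seeds yields one row; with `scope="perStartingUrl"`, it yields one row per seed.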
Concurrency and proxy
- Worker bands: proxyPoolSize × maxConcurrencyPerIp async workers (default 500) share a global crawl queue.
- Per-request IP rotation: when Apify Proxy is enabled, every HTTP request uses a new residential proxy session (session_id is unique per fetch). Worker bands organize parallelism; they do not pin 10 fixed IPs.
- Per-host limit: maxConcurrencyPerHost reduces hammering a single domain when many seeds or pages target the same host.
- Cost: high concurrency with residential proxies can be expensive; lower maxConcurrencyPerIp or proxyPoolSize if you hit rate limits or budget limits.
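The worker and per-host model described above can be sketched with asyncio: a pool of proxyPoolSize × maxConcurrencyPerIp workers drains one shared queue, and a per-host semaphore caps simultaneous requests to each host. This is a minimal sketch, not the actor's code. The `crawl` function, its parameter names, and the caller-supplied `fetch` coroutine are hypothetical, and proxy session handling is omitted.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse


async def crawl(urls, fetch, pool_size=10, per_band=50, per_host=5):
    """Fetch every URL with pool_size * per_band workers sharing one queue."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    # One semaphore per host limits simultaneous requests to that host.
    host_limits = defaultdict(lambda: asyncio.Semaphore(per_host))
    results = []

    async def worker():
        while True:
            url = await queue.get()
            try:
                host = urlparse(url).hostname or ""
                if per_host > 0:
                    async with host_limits[host]:
                        results.append(await fetch(url))
                else:
                    # per_host == 0 disables the per-host cap.
                    results.append(await fetch(url))
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(pool_size * per_band)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

With the defaults, 10 × 50 = 500 workers are created, but no more than 5 of them talk to the same host at once. A real crawler would also catch fetch errors inside the worker so that one failure does not end that worker.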
Notes
- The crawler stays on the same host or subdomain family as the seed URL, and also follows links on other hosts seen in that crawl branch (common for Shopify: a *.myshopify.com seed redirecting to a custom domain while HTML still links to myshopify.com pages).
- Static assets, mailto:, tel:, javascript:, and fragment-only links are ignored for crawling.
- additionalPaths are applied when the seed page at depth 0 is fetched, so they become depth-1 pages alongside links discovered from that page.
- excludeKeywords blocks matching URLs at every depth.
- 429 responses trigger host-specific cooldowns and respect Retry-After. Lower concurrency if a site still rate limits heavily.
- Local runs work without Apify Proxy credentials; on Apify, the actor uses the residential proxy pool when available.
- Default run options: 2-hour timeout, 8 GB memory (see .actor/actor.json). Increase timeout for very large seed lists and depth.
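The link-filtering rules in the notes above can be sketched as a single function that returns either a normalized URL to enqueue or None. This is an illustrative sketch only: the function name, the list of static-asset extensions, and the choice to treat excludeKeywords as case-insensitive substring matches are assumptions, not details confirmed by the source.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Assumed list of static-asset extensions; the actor's own list is not documented here.
STATIC_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
    ".svg", ".ico", ".pdf", ".woff", ".woff2",
)
SKIPPED_SCHEMES = ("mailto:", "tel:", "javascript:")


def crawlable_link(href, base_url, exclude_keywords=()):
    """Return an absolute URL worth crawling, or None if the link is skipped."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty or fragment-only link
    if href.lower().startswith(SKIPPED_SCHEMES):
        return None
    # Resolve relative links and drop any trailing fragment.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.path.lower().endswith(STATIC_EXTENSIONS):
        return None
    # Assumption: a keyword excludes a URL when it appears anywhere in it.
    if any(keyword.lower() in absolute.lower() for keyword in exclude_keywords):
        return None
    return absolute
```

Host scoping (same host, subdomain family, or hosts already seen in the crawl branch) would be a separate check applied to the returned URL.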
Local run
    python -m pip install -r requirements.txt
    python -m src
For local testing, put an INPUT.json file under storage/key_value_stores/default/ or set APIFY_LOCAL_STORAGE_DIR to a folder with that structure.
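A small helper can create that folder structure and write the input file. The sketch below is an assumption-laden example: the helper name is hypothetical, the values are placeholders, and the source does not state whether startUrls expects plain strings or objects such as `{"url": ...}`, so check the actor's input schema before relying on this shape.

```python
import json
from pathlib import Path

# Placeholder values for a small smoke test; adjust to the actor's input schema.
SAMPLE_INPUT = {
    "startUrls": ["https://example.com"],  # assumed to be plain strings
    "depthOfPages": 1,
    "defaultPhoneRegion": "US",
    "maxConcurrencyPerIp": 5,
    "proxyPoolSize": 2,
    "maxConcurrencyPerHost": 2,
    "dedupeScope": "global",
}


def write_local_input(input_data, storage_dir="storage"):
    """Write INPUT.json under <storage_dir>/key_value_stores/default/."""
    path = Path(storage_dir) / "key_value_stores" / "default" / "INPUT.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(input_data, indent=2), encoding="utf-8")
    return path
```

Running `write_local_input(SAMPLE_INPUT)` from the project root, followed by `python -m src`, gives a low-concurrency local run of 2 × 5 = 10 workers.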
After a run, check storage/datasets/default/ for dataset rows and storage/key_value_stores/default/contacts.json and pages-scraped.json for aggregated JSON files.
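To get a quick overview of a finished local run, the aggregated file can be loaded and counted by contact type. This sketch assumes contacts.json is a JSON array of objects that each carry a `type` field, as described in the output section; the function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_contacts(storage_dir="storage"):
    """Count contacts per type in the aggregated contacts.json file."""
    path = Path(storage_dir) / "key_value_stores" / "default" / "contacts.json"
    contacts = json.loads(path.read_text(encoding="utf-8"))
    return Counter(contact["type"] for contact in contacts)
```

For example, `summarize_contacts().most_common()` lists the contact types found in the run, most frequent first.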
Publish to Apify
    apify login
    apify push
Smoke-test with a few startUrls and depthOfPages=1, then scale up gradually before running ~1,000 seeds at full concurrency.