Website Contact & Social Discovery Crawler
High-throughput crawler that extracts emails, phone numbers, and social media profiles from websites using HTTP-first Crawlee crawling with Selectolax parsing and Playwright SPA fallback.
Pricing
Pay per usage
Developer
Man Mohit verma
Maintained by Community
What does Website Contact & Social Discovery Crawler do?
This Actor crawls websites and discovers contact details and social media profiles. For each seed URL you provide, it searches high-value pages (contact, about, team, support, and pages found in sitemaps), extracts emails, phone numbers, and social links, and writes every finding as a separate dataset record.
Use it for lead generation, company research, enrichment pipelines, or building contact databases from public web pages.
Features
- Discover emails, phone numbers, and social profiles (LinkedIn, X/Twitter, Facebook, Instagram, YouTube, TikTok, Threads, GitHub, and more)
- Crawl multiple websites in one run
- Sitemap discovery: finds contact-related pages faster via `robots.txt` and `sitemap.xml`
- Multi-site friendly: balances load across domains with round-robin scheduling and per-host rate limits
- Direct-first proxy: direct requests first; proxy after HTTP 403/429 when configured, otherwise the site is skipped
- Event-based output: one row per discovered email, phone, or social URL
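The direct-first proxy rule can be read as a small decision function. The sketch below is illustrative only and is not taken from the Actor's source: the names `NextStep` and `next_step` are hypothetical, and it only encodes the behavior stated in this README (direct first, proxy after 403/429 when configured, session rotation on 403/429, skip when no proxy is set).

```python
from enum import Enum

# Statuses that this README treats as "blocked".
BLOCK_STATUSES = {403, 429}


class NextStep(Enum):
    CONTINUE = "continue"                # not blocked; normal handling applies
    RETRY_VIA_PROXY = "retry_via_proxy"  # blocked on a direct request, proxy is available
    ROTATE_SESSION = "rotate_session"    # blocked while already using a proxy session
    SKIP_SITE = "skip_site"              # blocked and no proxy configured


def next_step(status: int, via_proxy: bool, proxy_configured: bool) -> NextStep:
    """Hypothetical helper: decide what to do after a response with `status`."""
    if status not in BLOCK_STATUSES:
        return NextStep.CONTINUE
    if not proxy_configured:
        return NextStep.SKIP_SITE
    if via_proxy:
        return NextStep.ROTATE_SESSION
    return NextStep.RETRY_VIA_PROXY
```

For example, `next_step(429, via_proxy=False, proxy_configured=False)` evaluates to `NextStep.SKIP_SITE`, which matches the "otherwise the site is skipped" case.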
Input
Configure the Actor in the Input tab. Main fields:
| Field | Description |
|---|---|
| `websites` | Required. One or more website URLs to crawl. Each entry may be a URL string or `{ "url": "…", "countryCode": "IN" }` for phone parsing. |
| `defaultCountryCode` | Default ISO country code for phone parsing when a website entry omits `countryCode` (default: `US`). |
| `maxPagesPerSite` | Maximum pages to crawl per website (default: 25). |
| `maxDepthPerSite` | Maximum link hops from the seed URL (default: 10; 0 = seed pages only). |
| `terminationStrategy` | `early` stops when email, phone, and social are found; `lazy` crawls until page/depth limits (default: `early`). |
| `maxConcurrency` | Max parallel requests across all sites (default: 10). |
| `maxConcurrencyPerDomain` | Max in-flight requests per host (default: 2). |
| `maxRequestsPerDomainPerSecond` | Per-domain request rate limit (default: 2). Lower if you see HTTP 429 errors. |
| `minEnqueueScore` | How selective the crawler is when following links (default: 0.333, raw ≥ 50 on the /160 scale when semantic scoring is active, /120 otherwise). Higher = fewer, more contact-focused pages. |
| `useSemanticScoring` | Improves link selection on sites with generic URLs and descriptive link text (default: `false`). |
| `useSitemapDiscovery` | Resolve redirects and import URLs from `robots.txt` / `sitemap.xml` before crawling (default: `true`). |
| `maxSitemapUrls` | Cap on sitemap URLs imported per site (default: 50). |
| `treatSubdomainsAsSameSite` | Follow links on subdomains of the same brand domain (default: `false`). |
| `additionalPaths` | Extra path suffixes probed per site (e.g. contact and policy pages). |
| `proxyConfiguration` | Optional. Direct first; proxy after HTTP 403/429 when set. Sites without proxy are skipped on 403/429. Sessions rotate on 403/429. |
| `maxProxySessions` | Max active proxy sessions at once (default: 10). Domains share sessions; rotation moves every domain on that session together. |
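To make the effect of `maxRequestsPerDomainPerSecond` concrete, here is a minimal sketch of a per-host limiter that spaces requests evenly. It is an assumption about how such a limit could work, not the Actor's actual scheduler; the class `PerHostRateLimiter` and its `delay_for` method are hypothetical names.

```python
class PerHostRateLimiter:
    """Hypothetical limiter: at most `max_per_second` requests per host, evenly spaced."""

    def __init__(self, max_per_second: float = 2.0) -> None:
        # With the default of 2 requests per second, slots are 0.5 s apart.
        self.min_interval = 1.0 / max_per_second
        self._next_allowed: dict[str, float] = {}

    def delay_for(self, host: str, now: float) -> float:
        """Reserve the next slot for `host` and return how long to wait from `now`."""
        earliest = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = earliest + self.min_interval
        return earliest - now


limiter = PerHostRateLimiter(max_per_second=2)
first = limiter.delay_for("example.com", now=0.0)   # no wait for the first request
second = limiter.delay_for("example.com", now=0.0)  # must wait one interval
other = limiter.delay_for("example.org", now=0.0)   # other hosts are independent
```

Under this model, halving `maxRequestsPerDomainPerSecond` doubles the spacing between requests to the same host, which is why lowering it is the suggested response to HTTP 429 errors.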
Website examples
["https://www.apify.com","https://example.com"]
With per-site phone region (recommended for non-US sites):
[{ "url": "https://www.kalyansilks.com/", "countryCode": "IN" },{ "url": "https://example.co.uk/", "countryCode": "GB" }]
Object form without countryCode (falls back to defaultCountryCode):
[{ "url": "https://www.apify.com" }]
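The three accepted shapes of a `websites` entry can be reduced to one form before a run. The following sketch is a client-side illustration under the rules described above (a string or an object, with `defaultCountryCode` as the fallback); `normalize_websites` is a hypothetical helper, not part of the Actor.

```python
from typing import Union

WebsiteEntry = Union[str, dict]


def normalize_websites(
    websites: list[WebsiteEntry], default_country_code: str = "US"
) -> list[dict]:
    """Hypothetical helper: turn every entry into {"url": ..., "countryCode": ...}."""
    normalized = []
    for entry in websites:
        if isinstance(entry, str):
            # Plain URL string: the default country code applies.
            url, country = entry, default_country_code
        elif isinstance(entry, dict) and "url" in entry:
            # Object form: countryCode is optional.
            url = entry["url"]
            country = entry.get("countryCode") or default_country_code
        else:
            raise ValueError(f"Unsupported website entry: {entry!r}")
        normalized.append({"url": url, "countryCode": country})
    return normalized


sites = normalize_websites(
    [
        "https://example.com",
        {"url": "https://example.co.uk/", "countryCode": "GB"},
        {"url": "https://www.apify.com"},
    ],
    default_country_code="US",
)
```

Here the first and third entries end up with `countryCode` set to `US`, and the second keeps `GB`.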
Output
Each discovered entity is saved as one dataset record. Download results as JSON, CSV, Excel, HTML, XML, or RSS from the run's Storage tab.
Output fields
| Field | Description |
|---|---|
| `startingUrl` | The seed URL you provided for this website |
| `currentPage` | The page where the entity was found |
| `pageFetched` | The actual URL that was fetched (may differ after redirects) |
| `type` | Entity type: `email`, `phone`, `twitter`, `linkedin`, `facebook`, `instagram`, `youtube`, `tiktok`, `threads`, `github`, `whatsapp`, `telegram`, `discord`, or `contact_form` |
| `value` | The extracted email, phone number, social profile URL, or contact page URL |
Output example
{"startingUrl": "https://www.example.com/","currentPage": "https://www.example.com/contact-us","pageFetched": "https://www.example.com/contact-us","type": "email","value": "hello@example.com"}
{"startingUrl": "https://www.example.com/","currentPage": "https://www.example.com/about","pageFetched": "https://www.example.com/about","type": "linkedin","value": "https://www.linkedin.com/company/example"}
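Because the output is event-based (one record per finding), downstream code usually needs to regroup records per website. A minimal sketch of that step, assuming the records have already been downloaded as a list of dicts with the fields listed above; `group_by_site` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_site(records: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Hypothetical helper: collect unique values per seed URL and entity type."""
    grouped: dict[str, dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for record in records:
        grouped[record["startingUrl"]][record["type"]].add(record["value"])
    # Convert sets to sorted lists so the result is stable and JSON-serializable.
    return {
        site: {entity_type: sorted(values) for entity_type, values in by_type.items()}
        for site, by_type in grouped.items()
    }


records = [
    {
        "startingUrl": "https://www.example.com/",
        "currentPage": "https://www.example.com/contact-us",
        "pageFetched": "https://www.example.com/contact-us",
        "type": "email",
        "value": "hello@example.com",
    },
    {
        "startingUrl": "https://www.example.com/",
        "currentPage": "https://www.example.com/about",
        "pageFetched": "https://www.example.com/about",
        "type": "linkedin",
        "value": "https://www.linkedin.com/company/example",
    },
]
contacts = group_by_site(records)
```

Using sets also deduplicates values that appear on several pages of the same site.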
Tips
- Start with a low `maxPagesPerSite` when testing new domains.
- Set `countryCode` (or `defaultCountryCode`) to match each site's market so local phone numbers parse correctly.
- Use `terminationStrategy: "lazy"` to collect more contacts within your page and depth limits.
- Use `terminationStrategy: "early"` (default) for faster runs when one email, phone, and social per site is enough.
- Set `proxyConfiguration` if sites return HTTP 403 or 429 without proxy.
- Lower `maxRequestsPerDomainPerSecond` or `maxConcurrencyPerDomain` if you encounter rate limiting (HTTP 429).
- Set `useSitemapDiscovery` to `false` if you only want to crawl pages discovered via links from the homepage.
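The difference between the two termination strategies can be sketched as a stop condition. This is an illustration of the documented behavior, not the Actor's code: `should_stop` is a hypothetical function, depth limits are left out for brevity, and treating every non-email, non-phone type except `contact_form` as "social" is an assumption.

```python
# Assumption: these output types count as "social" for early termination.
SOCIAL_TYPES = {
    "twitter", "linkedin", "facebook", "instagram", "youtube", "tiktok",
    "threads", "github", "whatsapp", "telegram", "discord",
}


def should_stop(
    found_types: set[str],
    pages_crawled: int,
    strategy: str = "early",
    max_pages: int = 25,
) -> bool:
    """Hypothetical stop rule for one site."""
    # The page limit applies to both strategies.
    if pages_crawled >= max_pages:
        return True
    # "lazy" keeps crawling until a limit is hit.
    if strategy == "lazy":
        return False
    # "early" stops once an email, a phone number, and a social profile are found.
    has_social = bool(SOCIAL_TYPES & found_types)
    return "email" in found_types and "phone" in found_types and has_social
```

With `strategy="early"`, a site that yields `{"email", "phone", "linkedin"}` after three pages stops immediately, while `strategy="lazy"` continues until `max_pages` is reached.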
Limitations
- Extracts only publicly visible contact information on crawled pages.
- Phone numbers without a country code need the correct `countryCode` or `defaultCountryCode` for your target market.
- Some sites block automated access; proxy may be required.
- Respects `maxPagesPerSite`, `maxDepthPerSite`, and termination strategy; lazy mode still does not guarantee every contact on large sites.