Website Contact & Social Discovery Crawler

Under maintenance

Pricing

Pay per usage

High-throughput crawler that extracts emails, phone numbers, and social media profiles from websites using HTTP-first Crawlee crawling with Selectolax parsing and Playwright SPA fallback.

Rating

0.0

(0)

Developer

Man Mohit verma

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 0
  • Last modified: 4 days ago

What does Website Contact & Social Discovery Crawler do?

This Actor crawls websites and discovers contact details and social media profiles. For each seed URL you provide, it searches high-value pages (contact, about, team, support, and pages found in sitemaps), extracts emails, phone numbers, and social links, and writes every finding as a separate dataset record.

Use it for lead generation, company research, enrichment pipelines, or building contact databases from public web pages.

Features

  • Discover emails, phone numbers, and social profiles (LinkedIn, X/Twitter, Facebook, Instagram, YouTube, TikTok, Threads, GitHub, and more)
  • Crawl multiple websites in one run
  • Sitemap discovery: finds contact-related pages faster via robots.txt and sitemap.xml
  • Multi-site friendly: balances load across domains with round-robin scheduling and per-host rate limits
  • Direct-first proxy: requests go out directly first; after an HTTP 403 or 429 the crawler retries through the proxy if one is configured, otherwise it skips the site
  • Event-based output: one row per discovered email, phone, or social URL
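
The direct-first proxy behaviour listed above can be read as a small decision rule. The following is only a sketch of that rule as the feature list describes it, not the Actor's actual code; `NextStep`, `BLOCK_STATUSES`, and `next_step` are hypothetical names introduced for illustration.

```python
from enum import Enum


class NextStep(Enum):
    CONTINUE = "continue"                  # no block detected, keep the current route
    RETRY_WITH_PROXY = "retry_with_proxy"  # re-issue the request through a proxy session
    SKIP_SITE = "skip_site"                # blocked and no proxy available


# Status codes the README names as triggers for the proxy fallback.
BLOCK_STATUSES = {403, 429}


def next_step(status_code: int, proxy_configured: bool) -> NextStep:
    """Illustrative decision rule for a direct-first crawl (an assumption, not the Actor's code)."""
    if status_code not in BLOCK_STATUSES:
        return NextStep.CONTINUE
    if proxy_configured:
        return NextStep.RETRY_WITH_PROXY
    return NextStep.SKIP_SITE


for status, has_proxy in [(200, False), (403, True), (429, False)]:
    print(status, has_proxy, next_step(status, has_proxy).value)
```

Keeping the rule as a pure function makes the three outcomes easy to check without issuing any requests.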

Input

Configure the Actor in the Input tab. Main fields:

| Field | Description |
| --- | --- |
| websites | Required. One or more website URLs to crawl. Each entry may be a URL string or { "url": "…", "countryCode": "IN" } for phone parsing. |
| defaultCountryCode | Default ISO country code for phone parsing when a website entry omits countryCode (default: US). |
| maxPagesPerSite | Maximum pages to crawl per website (default: 25). |
| maxDepthPerSite | Maximum link hops from the seed URL (default: 10; 0 = seed pages only). |
| terminationStrategy | early stops once an email, a phone, and a social profile are found; lazy crawls until the page/depth limits (default: early). |
| maxConcurrency | Max parallel requests across all sites (default: 10). |
| maxConcurrencyPerDomain | Max in-flight requests per host (default: 2). |
| maxRequestsPerDomainPerSecond | Per-domain request rate limit (default: 2). Lower if you see HTTP 429 errors. |
| minEnqueueScore | How selective the crawler is when following links (default: 0.333; raw ≥ 50 on the /160 scale when semantic scoring is active, /120 otherwise). Higher = fewer, more contact-focused pages. |
| useSemanticScoring | Improves link selection on sites with generic URLs and descriptive link text (default: false). |
| useSitemapDiscovery | Resolve redirects and import URLs from robots.txt / sitemap.xml before crawling (default: true). |
| maxSitemapUrls | Cap on sitemap URLs imported per site (default: 50). |
| treatSubdomainsAsSameSite | Follow links on subdomains of the same brand domain (default: false). |
| additionalPaths | Extra path suffixes probed per site (e.g. contact and policy pages). |
| proxyConfiguration | Optional. Direct first; proxy after HTTP 403/429 when set. Sites without proxy are skipped on 403/429. Sessions rotate on 403/429. |
| maxProxySessions | Max active proxy sessions at once (default: 10). Domains share sessions; rotation moves every domain on that session together. |
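
As a rough illustration of how these fields fit together, the sketch below assembles an input object from the documented defaults plus a few overrides. `DEFAULTS`, `OPTIONAL_FIELDS`, and `build_run_input` are hypothetical helpers, not part of the Actor; the default values are copied from the field descriptions above.

```python
import json

# Defaults as documented in the field descriptions above.
DEFAULTS = {
    "defaultCountryCode": "US",
    "maxPagesPerSite": 25,
    "maxDepthPerSite": 10,
    "terminationStrategy": "early",
    "maxConcurrency": 10,
    "maxConcurrencyPerDomain": 2,
    "maxRequestsPerDomainPerSecond": 2,
    "minEnqueueScore": 0.333,
    "useSemanticScoring": False,
    "useSitemapDiscovery": True,
    "maxSitemapUrls": 50,
    "treatSubdomainsAsSameSite": False,
    "maxProxySessions": 10,
}

# Documented fields that have no stated default.
OPTIONAL_FIELDS = {"additionalPaths", "proxyConfiguration"}


def build_run_input(websites, **overrides):
    """Merge overrides onto the documented defaults (hypothetical helper)."""
    if not websites:
        raise ValueError("websites is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"websites": list(websites), **DEFAULTS, **overrides}
    if run_input["terminationStrategy"] not in ("early", "lazy"):
        raise ValueError("terminationStrategy must be 'early' or 'lazy'")
    return run_input


run_input = build_run_input(
    ["https://example.com"],
    maxPagesPerSite=5,             # small page budget for a first test run
    terminationStrategy="lazy",
)
print(json.dumps(run_input, indent=2))
```

The resulting dictionary could then be pasted into the Input tab as JSON or, assuming the `apify-client` Python package, passed as `run_input` to an Actor call; the Actor ID is not given in this README, so it is left out here.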

Website examples

[
  "https://www.apify.com",
  "https://example.com"
]

With per-site phone region (recommended for non-US sites):

[
  { "url": "https://www.kalyansilks.com/", "countryCode": "IN" },
  { "url": "https://example.co.uk/", "countryCode": "GB" }
]

Object form without countryCode (falls back to defaultCountryCode):

[
  { "url": "https://www.apify.com" }
]
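
The three entry shapes above all reduce to a URL plus a phone region. A minimal sketch of that resolution follows, assuming the fallback behaves as described for defaultCountryCode; `normalize_websites` is a hypothetical helper, not the Actor's code.

```python
def normalize_websites(entries, default_country_code="US"):
    """Resolve each websites entry to a (url, country_code) pair."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # Plain URL string: use the default region.
            url, country = entry, default_country_code
        elif isinstance(entry, dict) and "url" in entry:
            # Object form: countryCode is optional.
            url = entry["url"]
            country = entry.get("countryCode") or default_country_code
        else:
            raise ValueError(f"unsupported website entry: {entry!r}")
        normalized.append((url, country.upper()))
    return normalized


sites = normalize_websites(
    [
        "https://example.com",
        {"url": "https://example.co.uk/", "countryCode": "GB"},
        {"url": "https://www.apify.com"},
    ],
    default_country_code="US",
)
for url, country in sites:
    print(country, url)
```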

Output

Each discovered entity is saved as one dataset record. Download results as JSON, CSV, Excel, HTML, XML, or RSS from the run's Storage tab.

Output fields

| Field | Description |
| --- | --- |
| startingUrl | The seed URL you provided for this website |
| currentPage | The page where the entity was found |
| pageFetched | The actual URL that was fetched (may differ after redirects) |
| type | Entity type: email, phone, twitter, linkedin, facebook, instagram, youtube, tiktok, threads, github, whatsapp, telegram, discord, or contact_form |
| value | The extracted email, phone number, social profile URL, or contact page URL |

Output example

[
  {
    "startingUrl": "https://www.example.com/",
    "currentPage": "https://www.example.com/contact-us",
    "pageFetched": "https://www.example.com/contact-us",
    "type": "email",
    "value": "hello@example.com"
  },
  {
    "startingUrl": "https://www.example.com/",
    "currentPage": "https://www.example.com/about",
    "pageFetched": "https://www.example.com/about",
    "type": "linkedin",
    "value": "https://www.linkedin.com/company/example"
  }
]
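
Because each finding is its own record, downstream code usually regroups the rows per website. The sketch below shows one way to do that, using the two records from the output example; `group_by_site` is a hypothetical helper and assumes the records have already been downloaded as JSON.

```python
from collections import defaultdict


def group_by_site(records):
    """Collapse one-row-per-finding records into {startingUrl: {type: [values]}}."""
    grouped = defaultdict(lambda: defaultdict(set))
    for record in records:
        # A set drops duplicates when the same value appears on several pages.
        grouped[record["startingUrl"]][record["type"]].add(record["value"])
    return {
        site: {kind: sorted(values) for kind, values in kinds.items()}
        for site, kinds in grouped.items()
    }


records = [
    {
        "startingUrl": "https://www.example.com/",
        "currentPage": "https://www.example.com/contact-us",
        "pageFetched": "https://www.example.com/contact-us",
        "type": "email",
        "value": "hello@example.com",
    },
    {
        "startingUrl": "https://www.example.com/",
        "currentPage": "https://www.example.com/about",
        "pageFetched": "https://www.example.com/about",
        "type": "linkedin",
        "value": "https://www.linkedin.com/company/example",
    },
]

contacts = group_by_site(records)
print(contacts)
```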

Tips

  • Start with a low maxPagesPerSite when testing new domains.
  • Set countryCode (or defaultCountryCode) to match each site's market so local phone numbers parse correctly.
  • Use terminationStrategy: "lazy" to collect more contacts within your page and depth limits.
  • Use terminationStrategy: "early" (default) for faster runs when one email, phone, and social per site is enough.
  • Set proxyConfiguration if sites return HTTP 403 or 429 without proxy.
  • Lower maxRequestsPerDomainPerSecond or maxConcurrencyPerDomain if you encounter rate limiting (HTTP 429).
  • Set useSitemapDiscovery to false if you only want to crawl pages discovered via links from the homepage.
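
The difference between the two termination strategies can be pictured as a stop condition checked after each page. This is a sketch under stated assumptions, not the Actor's implementation: `should_stop` and `SOCIAL_TYPES` are hypothetical, and treating every non-email, non-phone type except contact_form as "social" is a guess at how the Actor classifies findings. Depth limits are omitted for brevity.

```python
# Assumption: these output types count as "social" for the early-stop check.
SOCIAL_TYPES = {
    "twitter", "linkedin", "facebook", "instagram", "youtube", "tiktok",
    "threads", "github", "whatsapp", "telegram", "discord",
}


def should_stop(found_types, pages_crawled, strategy="early", max_pages=25):
    """Return True when crawling of one site should end (illustrative only)."""
    if pages_crawled >= max_pages:
        return True   # the page budget applies to both strategies
    if strategy == "lazy":
        return False  # lazy keeps going until a limit is hit
    found = set(found_types)
    has_social = bool(found & SOCIAL_TYPES)
    return "email" in found and "phone" in found and has_social


print(should_stop({"email", "phone", "linkedin"}, pages_crawled=3))
print(should_stop({"email", "phone", "linkedin"}, pages_crawled=3, strategy="lazy"))
print(should_stop({"email"}, pages_crawled=25, strategy="lazy"))
```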

Limitations

  • Extracts only publicly visible contact information on crawled pages.
  • Phone numbers without a country code need the correct countryCode or defaultCountryCode for your target market.
  • Some sites block automated access; proxy may be required.
  • Crawling is bounded by maxPagesPerSite, maxDepthPerSite, and the termination strategy, so even lazy mode does not guarantee that every contact on a large site is found.