Adaptive Website Lead Extractor

Pricing

Pay per event

Crawl public business websites with Scrapling to extract emails, phones, social profiles, contact pages, automation gaps, and lead scores for CRM-ready outreach.

Developer

Solutions Smart

Maintained by Community

Adaptive Website Lead Extractor is an Apify Actor that turns public business websites into structured lead intelligence records.

It crawls one or more websites, inspects a limited number of same-site pages, and returns one clean dataset item per input website with contact details, social profiles, contact-page signals, automation gaps, media assets when enabled, lead score, confidence, and crawl summary.

The Actor uses Scrapling as the core scraping and parsing engine. Scrapling is used for fetching pages, parsing HTML, selector-based extraction, adaptive element lookup where useful, and optional stealth fetching for public pages that need browser-style rendering.

This is not a generic Scrapling wrapper. Scrapling is the engine; the product is lead intelligence for agencies, sales teams, CRM enrichment, and automation workflows.

What It Extracts

  • Company name estimate
  • Page title and meta description
  • Public emails and phone numbers
  • Primary email and primary phone
  • Contact page and about page URLs
  • Social profile links
  • Address-like text when confidently detected
  • Contact form signals
  • Booking, chat, WhatsApp, and contact automation signals
  • Opportunity signals for missing contact automation
  • Optional image, video, document, audio, archive, and embed URLs
  • Lead score from 0 to 100
  • Extraction confidence from 0 to 1
  • Pages crawled, source pages, and non-fatal errors
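
To illustrate the "public emails" and "primary email" fields above, here is a minimal sketch of pulling deduplicated email addresses out of page HTML with a regular expression. It is not the Actor's actual implementation: the function names, the regex, and the rule for choosing a primary address (prefer generic mailboxes such as info@) are all assumptions made for this example.

```python
import re
from typing import List, Optional

# Simple pattern for common address shapes; real-world extraction needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: generic mailboxes make better "primary" contacts than personal ones.
PREFERRED_LOCAL_PARTS = ("info", "contact", "hello", "sales")


def extract_emails(html: str) -> List[str]:
    """Return lowercase, deduplicated emails in order of first appearance."""
    seen = {}
    for match in EMAIL_RE.findall(html):
        seen.setdefault(match.lower(), None)
    return list(seen)


def pick_primary(emails: List[str]) -> Optional[str]:
    """Pick a preferred generic mailbox if present, else the first email found."""
    for local_part in PREFERRED_LOCAL_PARTS:
        for email in emails:
            if email.split("@", 1)[0] == local_part:
                return email
    return emails[0] if emails else None
```

A pattern this simple also matches strings such as retina image names (`logo@2x.png`), so a production extractor would need extra filtering.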

Best Use Cases

  • Enrich company website lists with public contact data
  • Find businesses with weak contact or booking infrastructure
  • Build review queues for AI receptionist, local SEO, web design, or CRM automation outreach
  • Send structured lead records into n8n, Make, Zapier, Google Sheets, Airtable, HubSpot, Pipedrive, or a custom CRM
  • Discover public media and document URLs referenced by crawled pages when media mode is enabled

Input

{
  "startUrls": [
    {
      "url": "https://example.com"
    }
  ],
  "maxPagesPerDomain": 20,
  "maxConcurrency": 5,
  "useStealth": false,
  "respectRobotsTxt": true,
  "extractEmails": true,
  "extractPhones": true,
  "extractSocialLinks": true,
  "extractContactPages": true,
  "extractAutomationSignals": true,
  "extractMedia": false,
  "extractImages": true,
  "extractVideos": true,
  "extractDocuments": true,
  "extractOtherMedia": true,
  "maxMediaPerDomain": 100,
  "crawlSameDomainOnly": true,
  "requestTimeoutSecs": 30,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
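
An input like the one above could be sent from Python roughly as follows. This is a sketch that assumes the `apify-client` package and its `ApifyClient` class; `ACTOR_ID` is a placeholder (the Actor's real ID is not given here), and the helper names `build_run_input` and `run_actor` are made up for this example.

```python
from typing import Any, Dict, Iterable, List

# Placeholder: replace with the Actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(urls: Iterable[str], **overrides: Any) -> Dict[str, Any]:
    """Build a run input from a few documented fields, then apply overrides."""
    run_input: Dict[str, Any] = {
        "startUrls": [{"url": url} for url in urls],
        "maxPagesPerDomain": 20,
        "maxConcurrency": 5,
        "respectRobotsTxt": True,
        "extractMedia": False,
    }
    run_input.update(overrides)
    return run_input


def run_actor(client: Any, actor_id: str, urls: Iterable[str], **overrides: Any) -> List[dict]:
    """Start a run, wait for it to finish, and return the dataset items.

    `client` is expected to be an apify_client.ApifyClient instance,
    for example ApifyClient("<APIFY_TOKEN>").
    """
    run = client.actor(actor_id).call(run_input=build_run_input(urls, **overrides))
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before spending platform credits on a real run.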

Important Input Options

| Field | Default | Description |
| --- | --- | --- |
| startUrls | required | Websites or domains to crawl. |
| maxPagesPerDomain | 20 | Hard limit for pages crawled per input website. |
| maxConcurrency | 5 | Number of websites processed in parallel. |
| useStealth | false | Uses Scrapling's stealth browser fetcher. Slower, intended only for public pages that need browser rendering. |
| respectRobotsTxt | true | Skips URLs disallowed by robots.txt. |
| extractAutomationSignals | true | Detects public booking, form, chat, and WhatsApp signals. |
| extractMedia | false | Enables media URL discovery. Files are not downloaded. |
| maxMediaPerDomain | 100 | Maximum media asset URLs returned per website. |
| crawlSameDomainOnly | true | Stays on the same normalized host. docs.example.com does not crawl blog.example.com. |
| proxyConfiguration | disabled | Optional Apify Proxy configuration for public sites that rate-limit datacenter traffic. |
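
The same-host rule described for crawlSameDomainOnly can be pictured with a small check like the one below. It is a sketch only: how the Actor actually normalizes hosts is not documented here, so lowercasing and stripping a leading `www.` are assumptions, and both function names are hypothetical.

```python
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Lowercase the host and drop a leading 'www.' (assumed normalization)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_same_site(start_url: str, candidate_url: str) -> bool:
    """True when both URLs resolve to the same normalized host."""
    return normalized_host(start_url) == normalized_host(candidate_url)
```

Both URLs need a scheme (`https://...`); without one, `urlsplit` does not recognize the host part.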

Media Extraction

Media discovery is disabled by default because the Actor is primarily a lead intelligence tool.

Set extractMedia to true to collect public URLs for:

  • images from img, source, srcset, Open Graph, Twitter image, icons, and CSS url(...)
  • videos from video tags and public embeds such as YouTube, Vimeo, Wistia, Loom, and Vidyard
  • documents such as PDF, DOCX, PPTX, XLSX, CSV, and TXT
  • audio files, archives, and other recognized media file URLs

The Actor records media URLs only. It does not download, store, transform, or rehost media files.
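
One simple way to sort discovered URLs into the categories above is by file extension. The sketch below assumes an illustrative extension map and a hypothetical `classify_media_url` helper; the Actor's real rules (for example, how it recognizes video embeds) are not shown in this document.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Illustrative, incomplete extension map. Audio and archives are grouped
# under "other" to mirror the mediaAssets keys in the example output.
MEDIA_TYPES = {
    "images": {"png", "jpg", "jpeg", "gif", "webp", "svg"},
    "videos": {"mp4", "webm", "mov"},
    "documents": {"pdf", "docx", "pptx", "xlsx", "csv", "txt"},
    "other": {"mp3", "wav", "zip"},
}


def classify_media_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (category, extension) for a recognized media URL, else None."""
    # Use only the URL path so query strings such as ?v=2 are ignored.
    extension = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    for category, extensions in MEDIA_TYPES.items():
        if extension in extensions:
            return category, extension
    return None
```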

Output

The Actor pushes one item per input website to the default dataset and stores a run summary in the default Key-Value Store under OUTPUT_SUMMARY.

Example dataset item:

{
  "startUrl": "https://example.com",
  "domain": "example.com",
  "siteHost": "example.com",
  "companyName": "Example GmbH",
  "title": "Example GmbH - Digital Services",
  "description": "Example company description...",
  "primaryEmail": "info@example.com",
  "primaryPhone": "+49 30 123456",
  "emails": ["info@example.com"],
  "phones": ["+49 30 123456"],
  "socialLinks": {
    "linkedin": "https://linkedin.com/company/example",
    "instagram": "https://instagram.com/example"
  },
  "mediaSummary": {
    "images": 12,
    "videos": 1,
    "documents": 2,
    "other": 0,
    "total": 15
  },
  "mediaAssets": {
    "images": [
      {
        "url": "https://example.com/assets/logo.png",
        "sourcePage": "https://example.com",
        "extension": "png"
      }
    ],
    "videos": [
      {
        "url": "https://www.youtube.com/embed/example",
        "sourcePage": "https://example.com"
      }
    ],
    "documents": [
      {
        "url": "https://example.com/company-brochure.pdf",
        "sourcePage": "https://example.com/about",
        "extension": "pdf"
      }
    ],
    "other": []
  },
  "contactPage": "https://example.com/contact",
  "aboutPage": "https://example.com/about",
  "addressLikeText": ["Example Street 12, 10115 Berlin"],
  "contactMethods": {
    "hasEmail": true,
    "hasPhone": true,
    "hasContactPage": true,
    "hasContactForm": true,
    "hasSocialProfile": true
  },
  "automationSignals": {
    "hasOnlineBooking": false,
    "hasChatWidget": false,
    "hasContactForm": true,
    "hasWhatsappLink": false
  },
  "opportunitySignals": {
    "missingOnlineBooking": true,
    "missingChatWidget": true,
    "missingWhatsappLink": true,
    "missingContactForm": false,
    "hasMessagingGap": true,
    "hasAutomationGap": true
  },
  "siteClassification": {
    "type": "business_website",
    "businessWebsiteLikely": true,
    "reason": "Business contact or outreach signals were detected on crawled pages."
  },
  "recommendedAction": "Prioritize outbound: public email found and automation gap detected.",
  "leadScore": 78,
  "leadScoreLabel": "high",
  "confidence": 0.84,
  "confidenceLabel": "high",
  "confidenceReasons": [
    "Crawled 12 public page(s).",
    "Public email address found.",
    "Public phone number found.",
    "Likely contact page found.",
    "Company identity inferred from page metadata, title, schema, logo, or domain."
  ],
  "pagesCrawled": 12,
  "errors": [],
  "sourcePages": [
    "https://example.com",
    "https://example.com/contact",
    "https://example.com/about"
  ],
  "crawlSummary": {
    "pagesCrawled": 12,
    "emailsFound": 1,
    "phonesFound": 1,
    "socialProfilesFound": 2,
    "mediaAssetsFound": 15,
    "contactPageFound": true,
    "aboutPageFound": true,
    "errorsFound": 0
  }
}
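
After a run finishes, the OUTPUT_SUMMARY record and the per-website items could be read along these lines. The sketch assumes an `apify_client.ApifyClient` instance and a run dictionary as returned by the client; `read_output_summary` and `headline` are hypothetical helper names, and the structure of the summary record itself is not documented here.

```python
from typing import Any, Optional


def read_output_summary(client: Any, run: dict) -> Optional[Any]:
    """Fetch the OUTPUT_SUMMARY record from the run's default Key-Value Store.

    `client` is expected to be an apify_client.ApifyClient instance.
    Returns None when the record does not exist.
    """
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("OUTPUT_SUMMARY")
    return record["value"] if record else None


def headline(item: dict) -> str:
    """One-line summary of a dataset item for logs or review queues."""
    return (
        f"{item.get('domain')}: score {item.get('leadScore')} "
        f"({item.get('leadScoreLabel')}), confidence {item.get('confidence')}"
    )
```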

Output Fields

| Field | Description |
| --- | --- |
| domain | Registered domain, for example example.com. |
| siteHost | Actual host crawled, for example docs.example.com. |
| companyName | Best-effort company name from title, metadata, schema, logo alt text, or domain. |
| primaryEmail, primaryPhone | First selected contact candidates for workflow-friendly use. |
| emails, phones | Deduplicated public contact data found on crawled pages. |
| socialLinks | Public social profile URLs grouped by platform. |
| mediaSummary, mediaAssets | Media counts and URLs when extractMedia is enabled. |
| contactMethods | Boolean summary of reachable contact methods. |
| automationSignals | Detected booking, chat, form, and WhatsApp signals. |
| opportunitySignals | Missing automation/contact signals useful for outreach review. |
| siteClassification | Best-effort site type classification: business_website, documentation, blog, ecommerce, or unknown. |
| leadScore | Transparent opportunity score from 0 to 100. |
| confidence, confidenceReasons | Extraction confidence from 0 to 1 and short reasons explaining the confidence. |
| engine, engineRepository | Scraping engine metadata for auditability and workflow routing. |
| crawlSummary | Compact summary for dashboards and automation filters. |

Reliability

  • Uses an input schema so Apify validates required input before the run starts.
  • Uses an output schema so users, API clients, and AI agents know where to find results.
  • Pushes one dataset item per input website, even when no contact data is found.
  • Fails gracefully per URL and records non-fatal crawl errors in the output item.
  • Stores a run-level OUTPUT_SUMMARY record in the default Key-Value Store.
  • Uses bounded crawling with maxPagesPerDomain, maxConcurrency, and request timeouts.
  • Runs under Apify limited permissions and does not require account credentials.

Automated Test Readiness

Apify's automated Store test expects the Actor's default/prefilled input to finish successfully and produce a non-empty default dataset within a short time window.

Recommended smoke-test input:

{
  "startUrls": [
    {
      "url": "https://docs.apify.com/"
    }
  ],
  "maxPagesPerDomain": 10,
  "maxConcurrency": 1,
  "respectRobotsTxt": true,
  "extractMedia": false
}

Expected smoke-test result:

  • run status: succeeded
  • default dataset: non-empty
  • one domain-level item pushed
  • no uncaught ReferenceError, TypeError, or Python traceback
  • OUTPUT_SUMMARY present in the Key-Value Store
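
The expected results above can be turned into a small post-run check. This sketch assumes the run dictionary exposes Apify's `status` field (`SUCCEEDED` for a successful run) and that the dataset items and summary have already been fetched; `smoke_check` is a hypothetical helper, not part of the Actor.

```python
from typing import Any, List, Optional


def smoke_check(run: dict, items: List[dict], summary: Optional[Any]) -> List[str]:
    """Return a list of problems; an empty list means the smoke test passed."""
    problems = []
    if run.get("status") != "SUCCEEDED":
        problems.append(f"run status is {run.get('status')!r}, expected 'SUCCEEDED'")
    if not items:
        problems.append("default dataset is empty")
    elif len(items) != 1:
        # The smoke-test input has one start URL, so one item is expected.
        problems.append(f"expected 1 domain-level item, got {len(items)}")
    if summary is None:
        problems.append("OUTPUT_SUMMARY missing from the Key-Value Store")
    return problems
```

Uncaught exceptions and tracebacks would still need to be checked in the run log; this helper only covers the structured outputs.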

Ease of Use

  • Provides form-friendly input controls for URLs, crawling limits, concurrency, robots.txt, contact extraction, automation signals, media extraction, timeout, and proxy settings.
  • Uses conservative defaults for normal public website enrichment.
  • Keeps media extraction disabled by default to reduce output size and cost.
  • Returns CRM-friendly fields such as primaryEmail, primaryPhone, leadScore, confidence, recommendedAction, and crawlSummary.

Trust and Safety

  • Crawls public pages only.
  • Respects robots.txt when enabled.
  • Avoids authenticated, private, checkout, account, and obvious sensitive paths.
  • Does not submit forms.
  • Does not solve CAPTCHAs.
  • Does not perform aggressive anti-bot bypassing.
  • Does not download or rehost media files; media mode records public URLs only.
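
The robots.txt and sensitive-path rules above can be illustrated with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions, not the Actor's code: the user agent string and the list of sensitive path segments are invented for the example, and the robots.txt text is passed in rather than fetched.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleLeadBot"  # hypothetical user agent name
# Assumed list of path segments treated as private or sensitive.
SENSITIVE_SEGMENTS = ("login", "account", "checkout", "cart", "admin")


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(url: str, robots: RobotFileParser, respect_robots_txt: bool = True) -> bool:
    """Skip sensitive paths always, and disallowed paths when robots.txt is respected."""
    segments = urlsplit(url).path.lower().split("/")
    if any(segment in SENSITIVE_SEGMENTS for segment in segments):
        return False
    if respect_robots_txt and not robots.can_fetch(USER_AGENT, url):
        return False
    return True
```

In this sketch the sensitive-path filter applies even when robots.txt checking is turned off, which matches the intent that those paths are never crawled.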

Congruency

The Actor title, description, input schema, output schema, dataset view, README, and monetization events use the same terminology:

  • website/domain lead record
  • basic lead record
  • qualified lead record
  • media assets
  • automation signals
  • lead score
  • confidence
  • crawl summary

This consistency is intentional because Apify's quality score considers whether an Actor's text, schemas, and behavior align.

Lead Score

The score is intentionally simple and transparent. It is an outreach opportunity score, not a business quality score.

Example scoring factors:

  • +20 email found
  • +20 phone found
  • +15 contact page found
  • +10 social profile found
  • +10 contact form found
  • +15 appointment-based business appears to lack online booking
  • +15 no chat, WhatsApp, or similar messaging automation detected
  • capped at 100

Use the score for prioritization and human review, not automated eligibility decisions.
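
The example factors above translate into a simple additive score. The sketch below uses only the points listed in this section; the flat `signals` mapping and the `appointmentBusinessLacksBooking` flag name are assumptions, and because the list is described as examples, the result will not necessarily match the leadScore the Actor reports.

```python
from typing import Mapping

# (signal name, points) pairs taken from the example factors listed above.
SCORING_FACTORS = (
    ("hasEmail", 20),
    ("hasPhone", 20),
    ("hasContactPage", 15),
    ("hasSocialProfile", 10),
    ("hasContactForm", 10),
    ("appointmentBusinessLacksBooking", 15),  # hypothetical flag name
    ("hasMessagingGap", 15),
)


def lead_score(signals: Mapping[str, bool]) -> int:
    """Sum the points for every signal that is present, capped at 100."""
    total = sum(points for name, points in SCORING_FACTORS if signals.get(name))
    return min(total, 100)
```

With every factor present the raw sum is 105, which is why the cap matters.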

Confidence

Confidence is separate from lead score. It increases when more useful pages are crawled, contact/about pages are found, contact details are detected, and multiple signals confirm the company identity. It decreases when pages fail, data is sparse, or identity/contact signals are weak.

CRM Enrichment

  1. Upload a list of company websites.
  2. Extract email, phone, social profiles, contact page, score, and confidence.
  3. Export the dataset to CSV or send it to HubSpot, Pipedrive, Airtable, or Google Sheets.
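
For step 3, dataset items can be flattened into CSV rows with the standard library. The column selection and the `to_csv` helper are illustrative choices, not a prescribed CRM mapping; the field names come from the example dataset item.

```python
import csv
import io
from typing import Iterable

# Top-level fields copied as-is; "linkedin" is pulled from the nested socialLinks.
CRM_COLUMNS = [
    "domain", "companyName", "primaryEmail", "primaryPhone",
    "contactPage", "leadScore", "confidence",
]


def to_csv(items: Iterable[dict]) -> str:
    """Render dataset items as CSV text with one row per website."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CRM_COLUMNS + ["linkedin"])
    writer.writeheader()
    for item in items:
        row = {column: item.get(column, "") for column in CRM_COLUMNS}
        row["linkedin"] = (item.get("socialLinks") or {}).get("linkedin", "")
        writer.writerow(row)
    return buffer.getvalue()
```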

AI Receptionist or Booking Automation Leads

Filter for websites with:

  • phone number present
  • email or contact page present
  • opportunitySignals.hasAutomationGap = true
  • missing booking, chat, WhatsApp, or contact form
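
The criteria above map directly onto fields of the dataset item. A minimal sketch, assuming the field names from the example output and a hypothetical `is_receptionist_lead` helper:

```python
MISSING_FLAGS = (
    "missingOnlineBooking",
    "missingChatWidget",
    "missingWhatsappLink",
    "missingContactForm",
)


def is_receptionist_lead(item: dict) -> bool:
    """Apply the four filter criteria to one dataset item."""
    contact = item.get("contactMethods") or {}
    gaps = item.get("opportunitySignals") or {}
    reachable = contact.get("hasEmail") or contact.get("hasContactPage")
    missing_any = any(gaps.get(flag) for flag in MISSING_FLAGS)
    return bool(
        contact.get("hasPhone")
        and reachable
        and gaps.get("hasAutomationGap")
        and missing_any
    )
```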

n8n Automation

  1. Trigger this Actor from n8n.
  2. Read the default dataset items.
  3. Filter by leadScore, confidence, and opportunitySignals.
  4. Send qualified records to Google Sheets, Airtable, HubSpot, Slack, or an outreach queue.

Example filter:

leadScore >= 70
AND confidence >= 0.7
AND opportunitySignals.hasAutomationGap = true
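
Outside n8n, the same filter can be expressed in a few lines of Python. The thresholds mirror the example above; `qualifies` is a hypothetical helper name.

```python
def qualifies(item: dict, min_score: int = 70, min_confidence: float = 0.7) -> bool:
    """True when an item passes the example leadScore/confidence/gap filter."""
    gaps = item.get("opportunitySignals") or {}
    return (
        item.get("leadScore", 0) >= min_score
        and item.get("confidence", 0.0) >= min_confidence
        and gaps.get("hasAutomationGap") is True
    )
```

Applied to a list of dataset items, `[item for item in items if qualifies(item)]` yields the records to forward to a sheet, CRM, or outreach queue.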

Proxy and Stealth Use

The default run does not use Apify Proxy and does not use stealth fetching.

For larger public crawls or sites that rate-limit datacenter traffic, enable Apify Proxy:

{
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Set useStealth to true only when public pages need browser rendering. This Actor does not solve CAPTCHAs, submit forms, scrape authenticated content, or perform aggressive anti-bot bypassing.

Performance Tips

  • Keep maxPagesPerDomain between 5 and 20 for quick enrichment.
  • Use 20 to 50 pages for deeper lead analysis.
  • Disable extractMedia unless you need media URLs.
  • Keep crawlSameDomainOnly enabled for cleaner results.
  • Use moderate concurrency for large input lists.
  • Enable proxy only when needed.

Limitations

Websites vary widely. Some sites hide contact details behind JavaScript, publish contact data as images, block automated requests, use ambiguous phone/address formats, or disallow crawling in robots.txt.

Automation signals are best-effort public-page signals. They should be treated as review hints, not guarantees.

Ethical Usage

Use this Actor only on public web pages and for legitimate business purposes. Respect robots.txt when enabled and comply with applicable privacy, marketing, platform, and data protection rules.

Do not use this Actor for spam, harassment, credential collection, sensitive profiling, scraping private or authenticated data, bypassing access restrictions, or deceptive outreach.

Always review leads before contacting them.