Adaptive Website Lead Extractor
Pricing
Pay per event
Crawl public business websites with Scrapling to extract emails, phones, social profiles, contact pages, automation gaps, and lead scores for CRM-ready outreach.
Developer
Solutions Smart
Maintained by Community
Adaptive Website Lead Extractor is an Apify Actor that turns public business websites into structured lead intelligence records.
It crawls one or more websites, inspects a limited number of same-site pages, and returns one clean dataset item per input website with contact details, social profiles, contact-page signals, automation gaps, media assets when enabled, lead score, confidence, and crawl summary.
The Actor uses Scrapling as the core scraping and parsing engine. Scrapling is used for fetching pages, parsing HTML, selector-based extraction, adaptive element lookup where useful, and optional stealth fetching for public pages that need browser-style rendering.
This is not a generic Scrapling wrapper. Scrapling is the engine; the product is lead intelligence for agencies, sales teams, CRM enrichment, and automation workflows.
What It Extracts
- Company name estimate
- Page title and meta description
- Public emails and phone numbers
- Primary email and primary phone
- Contact page and about page URLs
- Social profile links
- Address-like text when confidently detected
- Contact form signals
- Booking, chat, WhatsApp, and contact automation signals
- Opportunity signals for missing contact automation
- Optional image, video, document, audio, archive, and embed URLs
- Lead score from 0 to 100
- Extraction confidence from 0 to 1
- Pages crawled, source pages, and non-fatal errors
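As a rough illustration of the contact fields in this list, the sketch below pulls emails and phone-like strings out of page text and picks the first of each as the primary candidate. The regular expressions and the `extract_contacts` helper are assumptions made for this example; the Actor's actual extraction rules are not documented here and are likely stricter.

```python
import re

# Deliberately simple patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text: str) -> dict:
    """Return deduplicated emails and phone-like strings in first-seen order."""
    # dict.fromkeys keeps insertion order while dropping duplicates.
    emails = list(dict.fromkeys(m.lower() for m in EMAIL_RE.findall(text)))
    phones = list(dict.fromkeys(m.strip() for m in PHONE_RE.findall(text)))
    return {
        "emails": emails,
        "phones": phones,
        "primaryEmail": emails[0] if emails else None,
        "primaryPhone": phones[0] if phones else None,
    }


sample = "Write to Info@Example.com or info@example.com, call +49 30 123456."
result = extract_contacts(sample)
```

Lowercasing before deduplication keeps `Info@Example.com` and `info@example.com` from appearing as two separate leads.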
Best Use Cases
- Enrich company website lists with public contact data
- Find businesses with weak contact or booking infrastructure
- Build review queues for AI receptionist, local SEO, web design, or CRM automation outreach
- Send structured lead records into n8n, Make, Zapier, Google Sheets, Airtable, HubSpot, Pipedrive, or a custom CRM
- Discover public media and document URLs referenced by crawled pages when media mode is enabled
Input
{
  "startUrls": [{"url": "https://example.com"}],
  "maxPagesPerDomain": 20,
  "maxConcurrency": 5,
  "useStealth": false,
  "respectRobotsTxt": true,
  "extractEmails": true,
  "extractPhones": true,
  "extractSocialLinks": true,
  "extractContactPages": true,
  "extractAutomationSignals": true,
  "extractMedia": false,
  "extractImages": true,
  "extractVideos": true,
  "extractDocuments": true,
  "extractOtherMedia": true,
  "maxMediaPerDomain": 100,
  "crawlSameDomainOnly": true,
  "requestTimeoutSecs": 30,
  "proxyConfiguration": {"useApifyProxy": false}
}
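When the input is assembled from a script rather than typed into the form, a small helper can apply the defaults shown above and reject misspelled option names before a run is started. This is a minimal sketch; `DEFAULT_INPUT` and `build_run_input` are hypothetical names for this example, not part of the Actor.

```python
import copy

# Defaults copied from the example input above.
DEFAULT_INPUT = {
    "maxPagesPerDomain": 20,
    "maxConcurrency": 5,
    "useStealth": False,
    "respectRobotsTxt": True,
    "extractEmails": True,
    "extractPhones": True,
    "extractSocialLinks": True,
    "extractContactPages": True,
    "extractAutomationSignals": True,
    "extractMedia": False,
    "extractImages": True,
    "extractVideos": True,
    "extractDocuments": True,
    "extractOtherMedia": True,
    "maxMediaPerDomain": 100,
    "crawlSameDomainOnly": True,
    "requestTimeoutSecs": 30,
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_run_input(urls, **overrides):
    """Build an input object for the given websites, applying overrides."""
    if not urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"unknown input options: {sorted(unknown)}")
    # Deep copy so nested defaults such as proxyConfiguration are not shared.
    run_input = copy.deepcopy(DEFAULT_INPUT)
    run_input.update(overrides)
    run_input["startUrls"] = [{"url": url} for url in urls]
    return run_input


run_input = build_run_input(["https://example.com"], maxPagesPerDomain=10)
```

With the Apify API client for Python, a dictionary like this would typically be passed as `run_input` when calling the Actor; the Actor ID to use is not stated in this document.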
Important Input Options
| Field | Default | Description |
|---|---|---|
| startUrls | required | Websites or domains to crawl. |
| maxPagesPerDomain | 20 | Hard limit for pages crawled per input website. |
| maxConcurrency | 5 | Number of websites processed in parallel. |
| useStealth | false | Uses Scrapling's stealth browser fetcher. Slower, intended only for public pages that need browser rendering. |
| respectRobotsTxt | true | Skips URLs disallowed by robots.txt. |
| extractAutomationSignals | true | Detects public booking, form, chat, and WhatsApp signals. |
| extractMedia | false | Enables media URL discovery. Files are not downloaded. |
| maxMediaPerDomain | 100 | Maximum media asset URLs returned per website. |
| crawlSameDomainOnly | true | Stays on the same normalized host. docs.example.com does not crawl blog.example.com. |
| proxyConfiguration | disabled | Optional Apify Proxy configuration for public sites that rate-limit datacenter traffic. |
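The crawlSameDomainOnly rule can be pictured as a host comparison. The sketch below assumes that "normalized" means lowercased with a leading `www.` removed; the Actor's exact normalization is not specified in this document, and `normalized_host` and `is_same_site` are hypothetical helpers.

```python
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Lowercased host without a leading 'www.' (assumed normalization)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_same_site(start_url: str, candidate_url: str) -> bool:
    """True when both URLs resolve to the same normalized host."""
    return normalized_host(start_url) == normalized_host(candidate_url)


# Different subdomains are treated as different sites.
cross_subdomain = is_same_site("https://docs.example.com/", "https://blog.example.com/post")
# Under the assumption above, 'www.' does not change the site.
www_variant = is_same_site("https://example.com", "https://www.example.com/contact")
```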
Media Extraction
Media discovery is disabled by default because the Actor is primarily a lead intelligence tool.
Set extractMedia to true to collect public URLs for:
- images from img, source, srcset, Open Graph, Twitter image, icons, and CSS url(...)
- videos from video tags and public embeds such as YouTube, Vimeo, Wistia, Loom, and Vidyard
- documents such as PDF, DOCX, PPTX, XLSX, CSV, and TXT
- audio files, archives, and other recognized media file URLs
The Actor records media URLs only. It does not download, store, transform, or rehost media files.
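One way to picture URL-only media discovery is a lookup from file extension to media group, with no request made for the file itself. The extension sets and the `classify_media_url` helper below are assumptions for illustration; the group names follow the mediaAssets keys in the example output.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Illustrative extension sets; the Actor's real lists are not documented here.
MEDIA_GROUPS = {
    "images": {"png", "jpg", "jpeg", "gif", "webp", "svg", "ico"},
    "videos": {"mp4", "webm", "mov"},
    "documents": {"pdf", "docx", "pptx", "xlsx", "csv", "txt"},
    "other": {"mp3", "wav", "zip", "tar", "gz"},
}


def classify_media_url(url: str, source_page: str):
    """Return a media record for a recognized file URL, or None for other URLs."""
    # Only the URL path is inspected; query strings are ignored, nothing is downloaded.
    extension = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    for group, extensions in MEDIA_GROUPS.items():
        if extension in extensions:
            return group, {"url": url, "sourcePage": source_page, "extension": extension}
    return None


brochure = classify_media_url(
    "https://example.com/company-brochure.pdf?v=2", "https://example.com/about"
)
page = classify_media_url("https://example.com/about", "https://example.com")
```

Embeds such as YouTube or Vimeo players have no file extension, so they would need a separate host-based check that this sketch leaves out.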
Output
The Actor pushes one item per input website to the default dataset and stores a run summary in the default Key-Value Store under OUTPUT_SUMMARY.
Example dataset item:
{
  "startUrl": "https://example.com",
  "domain": "example.com",
  "siteHost": "example.com",
  "companyName": "Example GmbH",
  "title": "Example GmbH - Digital Services",
  "description": "Example company description...",
  "primaryEmail": "info@example.com",
  "primaryPhone": "+49 30 123456",
  "emails": ["info@example.com"],
  "phones": ["+49 30 123456"],
  "socialLinks": {
    "linkedin": "https://linkedin.com/company/example",
    "instagram": "https://instagram.com/example"
  },
  "mediaSummary": {"images": 12, "videos": 1, "documents": 2, "other": 0, "total": 15},
  "mediaAssets": {
    "images": [{"url": "https://example.com/assets/logo.png", "sourcePage": "https://example.com", "extension": "png"}],
    "videos": [{"url": "https://www.youtube.com/embed/example", "sourcePage": "https://example.com"}],
    "documents": [{"url": "https://example.com/company-brochure.pdf", "sourcePage": "https://example.com/about", "extension": "pdf"}],
    "other": []
  },
  "contactPage": "https://example.com/contact",
  "aboutPage": "https://example.com/about",
  "addressLikeText": ["Example Street 12, 10115 Berlin"],
  "contactMethods": {
    "hasEmail": true,
    "hasPhone": true,
    "hasContactPage": true,
    "hasContactForm": true,
    "hasSocialProfile": true
  },
  "automationSignals": {
    "hasOnlineBooking": false,
    "hasChatWidget": false,
    "hasContactForm": true,
    "hasWhatsappLink": false
  },
  "opportunitySignals": {
    "missingOnlineBooking": true,
    "missingChatWidget": true,
    "missingWhatsappLink": true,
    "missingContactForm": false,
    "hasMessagingGap": true,
    "hasAutomationGap": true
  },
  "siteClassification": {
    "type": "business_website",
    "businessWebsiteLikely": true,
    "reason": "Business contact or outreach signals were detected on crawled pages."
  },
  "recommendedAction": "Prioritize outbound: public email found and automation gap detected.",
  "leadScore": 78,
  "leadScoreLabel": "high",
  "confidence": 0.84,
  "confidenceLabel": "high",
  "confidenceReasons": [
    "Crawled 12 public page(s).",
    "Public email address found.",
    "Public phone number found.",
    "Likely contact page found.",
    "Company identity inferred from page metadata, title, schema, logo, or domain."
  ],
  "pagesCrawled": 12,
  "errors": [],
  "sourcePages": ["https://example.com", "https://example.com/contact", "https://example.com/about"],
  "crawlSummary": {
    "pagesCrawled": 12,
    "emailsFound": 1,
    "phonesFound": 1,
    "socialProfilesFound": 2,
    "mediaAssetsFound": 15,
    "contactPageFound": true,
    "aboutPageFound": true,
    "errorsFound": 0
  }
}
Output Fields
| Field | Description |
|---|---|
| domain | Registered domain, for example example.com. |
| siteHost | Actual host crawled, for example docs.example.com. |
| companyName | Best-effort company name from title, metadata, schema, logo alt text, or domain. |
| primaryEmail, primaryPhone | First selected contact candidates for workflow-friendly use. |
| emails, phones | Deduplicated public contact data found on crawled pages. |
| socialLinks | Public social profile URLs grouped by platform. |
| mediaSummary, mediaAssets | Media counts and URLs when extractMedia is enabled. |
| contactMethods | Boolean summary of reachable contact methods. |
| automationSignals | Detected booking, chat, form, and WhatsApp signals. |
| opportunitySignals | Missing automation/contact signals useful for outreach review. |
| siteClassification | Best-effort site type classification: business_website, documentation, blog, ecommerce, or unknown. |
| leadScore | Transparent opportunity score from 0 to 100. |
| confidence, confidenceReasons | Extraction confidence from 0 to 1 and short reasons explaining the confidence. |
| engine, engineRepository | Scraping engine metadata for auditability and workflow routing. |
| crawlSummary | Compact summary for dashboards and automation filters. |
Reliability
- Uses an input schema so Apify validates required input before the run starts.
- Uses an output schema so users, API clients, and AI agents know where to find results.
- Pushes one dataset item per input website, even when no contact data is found.
- Fails gracefully per URL and records non-fatal crawl errors in the output item.
- Stores a run-level OUTPUT_SUMMARY record in the default Key-Value Store.
- Uses bounded crawling with maxPagesPerDomain, maxConcurrency, and request timeouts.
- Runs under Apify limited permissions and does not require account credentials.
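The combination of bounded concurrency and per-URL graceful failure described above can be sketched with `asyncio`. Here `crawl_site` is a hypothetical stand-in that fails for one URL on purpose; it is not the Actor's crawler, and the error string format is an assumption.

```python
import asyncio


async def crawl_site(url: str) -> dict:
    """Hypothetical stand-in for a real crawl; fails for URLs containing 'broken'."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise TimeoutError("request timed out")
    return {"startUrl": url, "pagesCrawled": 1, "errors": []}


async def crawl_all(urls, max_concurrency: int = 5) -> list:
    """Crawl every URL with at most max_concurrency sites in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                return await crawl_site(url)
            except Exception as exc:
                # A failing site still yields one item, with the error recorded.
                return {
                    "startUrl": url,
                    "pagesCrawled": 0,
                    "errors": [f"{type(exc).__name__}: {exc}"],
                }

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))


items = asyncio.run(
    crawl_all(["https://example.com", "https://broken.example"], max_concurrency=2)
)
```

Catching the exception inside the guarded task is what keeps the "one item per input website" promise even when a site times out.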
Automated Test Readiness
Apify's automated Store test expects the Actor's default/prefilled input to finish successfully and produce a non-empty default dataset within a short time window.
Recommended smoke-test input:
{"startUrls": [{"url": "https://docs.apify.com/"}],"maxPagesPerDomain": 10,"maxConcurrency": 1,"respectRobotsTxt": true,"extractMedia": false}
Expected smoke-test result:
- run status: succeeded
- default dataset: non-empty
- one domain-level item pushed
- no uncaught ReferenceError, TypeError, or Python traceback
- OUTPUT_SUMMARY present in the Key-Value Store
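The checklist above can be turned into a small assertion helper for your own test runs. This is a sketch under assumptions: `check_smoke_result` is a hypothetical function, and it expects you to supply the run status, dataset items, Key-Value Store keys, and log text yourself.

```python
def check_smoke_result(status: str, items: list, kv_keys: set, log_text: str = "") -> list:
    """Return a list of problems; an empty list means the smoke test passed."""
    problems = []
    if status.lower() != "succeeded":
        problems.append(f"run status is {status!r}, expected succeeded")
    if not items:
        problems.append("default dataset is empty")
    elif len(items) != 1:
        problems.append(f"expected one domain-level item, got {len(items)}")
    if "OUTPUT_SUMMARY" not in kv_keys:
        problems.append("OUTPUT_SUMMARY missing from the Key-Value Store")
    # Simple substring check for uncaught errors in the run log.
    for marker in ("ReferenceError", "TypeError", "Traceback (most recent call last)"):
        if marker in log_text:
            problems.append(f"log contains {marker}")
    return problems


ok = check_smoke_result("SUCCEEDED", [{"domain": "apify.com"}], {"OUTPUT_SUMMARY"})
failed = check_smoke_result("FAILED", [], set(), "Traceback (most recent call last)")
```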
Ease of Use
- Provides form-friendly input controls for URLs, crawling limits, concurrency, robots.txt, contact extraction, automation signals, media extraction, timeout, and proxy settings.
- Uses conservative defaults for normal public website enrichment.
- Keeps media extraction disabled by default to reduce output size and cost.
- Returns CRM-friendly fields such as primaryEmail, primaryPhone, leadScore, confidence, recommendedAction, and crawlSummary.
Trust and Safety
- Crawls public pages only.
- Respects robots.txt when enabled.
- Avoids authenticated, private, checkout, account, and obvious sensitive paths.
- Does not submit forms.
- Does not solve CAPTCHAs.
- Does not perform aggressive anti-bot bypassing.
- Does not download or rehost media files; media mode records public URLs only.
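The robots.txt and sensitive-path rules above can be approximated with Python's standard `urllib.robotparser`. The sketch parses robots.txt text that has already been fetched, so it makes no network request. The `SENSITIVE_HINTS` list and the `make_url_filter` helper are assumptions for illustration, not the Actor's actual rules.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical path fragments treated as sensitive in this sketch.
SENSITIVE_HINTS = ("/login", "/account", "/checkout", "/cart", "/admin")


def make_url_filter(robots_txt: str, user_agent: str = "*", respect_robots: bool = True):
    """Return a function that says whether a URL may be crawled."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        path = urlsplit(url).path.lower()
        if any(hint in path for hint in SENSITIVE_HINTS):
            return False
        if respect_robots and not parser.can_fetch(user_agent, url):
            return False
        return True

    return allowed


robots_txt = """User-agent: *
Disallow: /private/
"""
allowed = make_url_filter(robots_txt)
contact_ok = allowed("https://example.com/contact")
private_ok = allowed("https://example.com/private/report")
checkout_ok = allowed("https://example.com/checkout/step-1")
```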
Congruency
The Actor title, description, input schema, output schema, dataset view, README, and monetization events use the same terminology:
- website/domain lead record
- basic lead record
- qualified lead record
- media assets
- automation signals
- lead score
- confidence
- crawl summary
This consistency is intentional because Apify's quality score considers whether an Actor's text, schemas, and behavior align.
Lead Score
The score is intentionally simple and transparent. It is an outreach opportunity score, not a business quality score.
Example scoring factors:
- +20 email found
- +20 phone found
- +15 contact page found
- +10 social profile found
- +10 contact form found
- +15 appointment-based business appears to lack online booking
- +15 no chat, WhatsApp, or similar messaging automation detected
- capped at 100
Use the score for prioritization and human review, not automated eligibility decisions.
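The example factors can be read as a capped additive score. The sketch below maps each factor to a boolean signal name borrowed from the example output (contactMethods and opportunitySignals); that mapping is an assumption, and because the factors are described only as examples, this sketch does not necessarily reproduce the leadScore values the Actor returns.

```python
# Points per factor, taken from the example scoring factors in this section.
# The signal names on the left are an assumed mapping onto output fields.
SCORE_FACTORS = {
    "hasEmail": 20,
    "hasPhone": 20,
    "hasContactPage": 15,
    "hasSocialProfile": 10,
    "hasContactForm": 10,
    "missingOnlineBooking": 15,
    "hasMessagingGap": 15,
}


def lead_score(signals: dict) -> int:
    """Sum the points for every true signal and cap the total at 100."""
    total = sum(points for name, points in SCORE_FACTORS.items() if signals.get(name))
    return min(total, 100)


contact_only = lead_score({"hasEmail": True, "hasPhone": True})
everything = lead_score({name: True for name in SCORE_FACTORS})
```

All factors together add up to 105, which is why the cap at 100 matters.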
Confidence
Confidence is separate from lead score. It increases when more useful pages are crawled, contact/about pages are found, contact details are detected, and multiple signals confirm the company identity. It decreases when pages fail, data is sparse, or identity/contact signals are weak.
Recommended Workflows
CRM Enrichment
- Upload a list of company websites.
- Extract email, phone, social profiles, contact page, score, and confidence.
- Export the dataset to CSV or send it to HubSpot, Pipedrive, Airtable, or Google Sheets.
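For the CSV step, dataset items can be flattened to a handful of CRM columns with the standard `csv` module. The column selection and the `items_to_csv` helper are assumptions for this sketch; field names come from the example dataset item.

```python
import csv
import io

# Columns chosen for illustration; adjust to what your CRM import expects.
CRM_COLUMNS = [
    "domain",
    "companyName",
    "primaryEmail",
    "primaryPhone",
    "contactPage",
    "leadScore",
    "confidence",
]


def items_to_csv(items: list) -> str:
    """Render dataset items as CSV text with one row per website."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CRM_COLUMNS)
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({column: item.get(column, "") for column in CRM_COLUMNS})
    return buffer.getvalue()


csv_text = items_to_csv([
    {
        "domain": "example.com",
        "companyName": "Example GmbH",
        "primaryEmail": "info@example.com",
        "primaryPhone": "+49 30 123456",
        "contactPage": "https://example.com/contact",
        "leadScore": 78,
        "confidence": 0.84,
        "emails": ["info@example.com"],
    }
])
```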
AI Receptionist or Booking Automation Leads
Filter for websites with:
- phone number present
- email or contact page present
- opportunitySignals.hasAutomationGap = true
- missing booking, chat, WhatsApp, or contact form
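These criteria can be expressed as a predicate over a dataset item. The sketch below is one possible reading, using field names from the example output; `is_receptionist_lead` is a hypothetical helper.

```python
MISSING_KEYS = (
    "missingOnlineBooking",
    "missingChatWidget",
    "missingWhatsappLink",
    "missingContactForm",
)


def is_receptionist_lead(item: dict) -> bool:
    """True when the item matches all four filter criteria listed above."""
    methods = item.get("contactMethods", {})
    gaps = item.get("opportunitySignals", {})
    has_phone = bool(methods.get("hasPhone"))
    reachable = bool(methods.get("hasEmail") or methods.get("hasContactPage"))
    has_gap = bool(gaps.get("hasAutomationGap"))
    missing_any = any(gaps.get(key) for key in MISSING_KEYS)
    return has_phone and reachable and has_gap and missing_any


candidate = {
    "contactMethods": {"hasPhone": True, "hasEmail": False, "hasContactPage": True},
    "opportunitySignals": {"hasAutomationGap": True, "missingOnlineBooking": True},
}
matches = is_receptionist_lead(candidate)
```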
n8n Automation
- Trigger this Actor from n8n.
- Read the default dataset items.
- Filter by leadScore, confidence, and opportunitySignals.
- Send qualified records to Google Sheets, Airtable, HubSpot, Slack, or an outreach queue.
Example filter:
leadScore >= 70 AND confidence >= 0.7 AND opportunitySignals.hasAutomationGap = true
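Outside n8n, the same filter can be applied to downloaded dataset items in a few lines. This is a sketch; `is_qualified` is a hypothetical helper, and the thresholds are parameters so they can be tuned per campaign.

```python
def is_qualified(item: dict, min_score: int = 70, min_confidence: float = 0.7) -> bool:
    """Apply the example filter: score, confidence, and automation gap."""
    return (
        item.get("leadScore", 0) >= min_score
        and item.get("confidence", 0.0) >= min_confidence
        and item.get("opportunitySignals", {}).get("hasAutomationGap") is True
    )


dataset_items = [
    {"domain": "example.com", "leadScore": 78, "confidence": 0.84,
     "opportunitySignals": {"hasAutomationGap": True}},
    {"domain": "example.org", "leadScore": 85, "confidence": 0.55,
     "opportunitySignals": {"hasAutomationGap": True}},
]
qualified = [item["domain"] for item in dataset_items if is_qualified(item)]
```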
Proxy and Stealth Use
The default run does not use Apify Proxy and does not use stealth fetching.
For larger public crawls or sites that rate-limit datacenter traffic, enable Apify Proxy:
{"proxyConfiguration": {"useApifyProxy": true}}
Use useStealth: true only when public pages need browser rendering. This Actor does not solve CAPTCHAs, submit forms, scrape authenticated content, or perform aggressive anti-bot bypassing.
Performance Tips
- Keep maxPagesPerDomain between 5 and 20 for quick enrichment.
- Use 20 to 50 pages for deeper lead analysis.
- Disable extractMedia unless you need media URLs.
- Keep crawlSameDomainOnly enabled for cleaner results.
- Use moderate concurrency for large input lists.
- Enable proxy only when needed.
Limitations
Websites vary widely. Some sites hide contact details behind JavaScript, publish contact data as images, block automated requests, use ambiguous phone/address formats, or disallow crawling in robots.txt.
Automation signals are best-effort public-page signals. They should be treated as review hints, not guarantees.
Ethical Usage
Use this Actor only on public web pages and for legitimate business purposes. Respect robots.txt when enabled and comply with applicable privacy, marketing, platform, and data protection rules.
Do not use this Actor for spam, harassment, credential collection, sensitive profiling, scraping private or authenticated data, bypassing access restrictions, or deceptive outreach.
Always review leads before contacting them.