Website Key Pages Finder
Find key pages (pricing, docs, status, security, privacy, terms) on any website. Crawls start URLs and returns structured URLs with confidence scores and evidence. Great for competitor analysis, lead enrichment, and audits.
Automatically find and scrape key pages from any website: pricing pages, documentation, status pages, security information, privacy policies, and terms of service. This Apify Actor crawls websites intelligently and returns structured data with confidence scores for each discovered page.
🔍 What does Website Key Pages Finder do?
Website Key Pages Finder is an Apify Actor that automatically discovers important pages on any website. Given a list of URLs, it crawls each site and returns the URLs of six key page types along with confidence scores and evidence explaining how each page was found.
Key page types discovered:
- Pricing - Plans, costs, and billing information
- Documentation - API docs, guides, and developer resources
- Status - System uptime and incident pages
- Security - Trust centers, compliance, and security policies
- Privacy - Privacy policies and data protection information
- Terms - Terms of service and legal agreements
The Actor uses a multi-phase discovery approach that combines URL pattern probing, homepage link extraction, and intelligent crawling to find pages even on sites with non-standard structures.
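To illustrate the fast-path probing phase, here is a minimal sketch (not the Actor's actual implementation; `PATTERNS` and `probe` are illustrative names) that checks the common URL patterns listed in the Page Types table below before falling back to crawling:

```javascript
// Illustrative sketch of fast-path URL probing; not the Actor's real code.
// Pattern lists are taken from the Page Types table later in this README.
const PATTERNS = {
  pricing: ['/pricing', '/plans', '/price'],
  docs: ['/docs', '/documentation', '/api', '/developer'],
  status: ['/status', '/uptime'],
  security: ['/security', '/trust', '/compliance'],
  privacy: ['/privacy', '/privacy-policy', '/data-protection'],
  terms: ['/terms', '/tos', '/terms-of-service', '/legal'],
};

// Probe each candidate path with a HEAD request; a 2xx response is a hit.
// (Requires Node.js 18+, where fetch is a global.)
async function probe(baseUrl, type) {
  for (const path of PATTERNS[type]) {
    const url = new URL(path, baseUrl).href;
    try {
      const res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
      if (res.ok) return url;
    } catch {
      // Network error or unsupported HEAD: try the next pattern.
    }
  }
  return null; // Fall back to homepage link extraction / crawling.
}
```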
🎯 Why scrape key pages from websites?
Finding key pages manually across dozens or hundreds of websites is time-consuming and error-prone. This Actor automates the process, making it valuable for:
- Competitor Analysis - Quickly gather pricing pages and documentation from competitor websites to understand their offerings and positioning
- Sales Intelligence - Enrich lead data with links to company pricing, security, and compliance pages before outreach
- Website Auditing - Verify that your own sites have discoverable key pages and assess how competitors structure their information architecture
- Market Research - Collect pricing pages across an industry to analyze pricing trends and strategies
- Due Diligence - Gather legal documents (privacy policies, terms) from potential partners or acquisition targets
- Compliance Monitoring - Track privacy policy and terms changes across a portfolio of vendors
🚀 How to use Website Key Pages Finder
Follow these steps to find key pages on any website:
1. Open the Actor in Apify Console or via the API
2. Add your URLs to the Start URLs field (homepage URLs work best)
3. Configure options (optional) - adjust crawl depth, page limits, and timeout as needed
4. Run the Actor by clicking "Start" or calling the API
5. Download results from the Dataset tab in JSON, CSV, or Excel format
Example Input
{"startUrls": [{ "url": "https://apify.com" },{ "url": "https://stripe.com" },{ "url": "https://github.com" }],"maxDepth": 1,"maxPagesPerDomain": 12,"includeSubdomains": true}
💰 How much does it cost to find key pages?
Website Key Pages Finder uses Pay Per Event (PPE) pricing, so you only pay for the websites you analyze.
Pricing:
- Per website analyzed: $0.005 per site
- Start fee: $0.005 per run
- No hidden compute costs - the price per website includes all crawling and processing
Cost control:
- Set a maximum spend per run in the Actor input to limit costs
- The Actor stops gracefully when your spending limit is reached
- Remaining URLs are skipped (not charged) when the limit is hit
Example costs:
| Websites | Cost |
|---|---|
| 10 | $0.055 |
| 100 | $0.505 |
| 1,000 | $5.005 |
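These totals follow directly from the flat fee structure above:

```javascript
// Cost model from the pricing above: one $0.005 start fee plus $0.005 per website.
const costUsd = (websites) => 0.005 + websites * 0.005;
console.log(costUsd(1000)); // 5.005
```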
Free tier: Apify provides a free tier with monthly credits, typically sufficient for testing and small-scale usage. Check your Apify account for current free tier limits.
📥 Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs of websites to analyze. Each URL should be a homepage or any page from the domain. |
| `maxDepth` | integer | `1` | Crawl depth. `0` = homepage only, `1` = homepage + priority pages (recommended). |
| `maxPagesPerDomain` | integer | `12` | Maximum pages to fetch per domain. Controls costs and processing time. |
| `includeSubdomains` | boolean | `true` | Whether to include subdomains when discovering pages (e.g., docs.example.com). |
| `returnTopN` | integer | `1` | Number of top candidates to return per page type. Set higher to see alternative candidates. |
| `timeoutSecs` | integer | `30` | Timeout in seconds for processing each site. |
| `proxyConfiguration` | object | `{ "useApifyProxy": false }` | Proxy settings for sites that block direct access. |
| `debug` | boolean | `false` | Include debug information (raw candidates) in output. |
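For reference, an input object with every documented field set explicitly (all values except `startUrls` are the defaults from the table above):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxDepth": 1,
  "maxPagesPerDomain": 12,
  "includeSubdomains": true,
  "returnTopN": 1,
  "timeoutSecs": 30,
  "proxyConfiguration": { "useApifyProxy": false },
  "debug": false
}
```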
📤 Output
Each website produces one result object in the dataset:
{"schemaVersion": "1.0.0","inputUrl": "https://apify.com","finalUrl": "https://apify.com/","domain": "apify.com","pages": {"pricing": {"url": "https://apify.com/pricing","confidence": 0.95,"evidence": ["exact_path:/pricing", "anchor:Pricing", "footer_link"]},"docs": {"url": "https://docs.apify.com","confidence": 0.92,"evidence": ["subdomain:docs", "anchor:Documentation"]},"status": {"url": "https://status.apify.com","confidence": 0.88,"evidence": ["subdomain:status", "anchor:Status"]},"security": {"url": "https://apify.com/security","confidence": 0.85,"evidence": ["path_token:security", "footer_link"]},"privacy": {"url": "https://apify.com/privacy-policy","confidence": 0.90,"evidence": ["path_token:privacy", "anchor:Privacy Policy", "footer_link"]},"terms": {"url": "https://apify.com/terms-of-service","confidence": 0.88,"evidence": ["path_token:terms", "anchor:Terms of Service", "footer_link"]}},"crawlStats": {"pagesFetched": 8,"timeMs": 2340,"errors": [],"likelyJsRendered": false},"timestamp": "2024-01-15T10:30:00.000Z"}
Output Fields
| Field | Description |
|---|---|
| `inputUrl` | The URL you provided |
| `finalUrl` | The URL after following redirects |
| `domain` | The root domain extracted from the URL |
| `pages` | Object containing discovered pages for each type |
| `pages.[type].url` | URL of the discovered page |
| `pages.[type].confidence` | Confidence score from 0 to 1 |
| `pages.[type].evidence` | Array of signals that contributed to the score |
| `crawlStats.pagesFetched` | Number of pages fetched during discovery |
| `crawlStats.timeMs` | Processing time in milliseconds |
| `crawlStats.errors` | Any errors encountered during crawling |
| `crawlStats.likelyJsRendered` | Whether the site appears to be JavaScript-rendered |
| `topCandidates` | (Optional) When `returnTopN` > 1, contains all top candidates per type |
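As a quick post-processing example, the sketch below (assuming `items` is the dataset array returned by one of the SDK calls shown later) keeps only pages in the "very confident" band described in the scoring section:

```javascript
// Keep only high-confidence discoveries (>= 0.80) per site.
for (const item of items) {
  const confident = Object.entries(item.pages)
    .filter(([, page]) => page.confidence >= 0.8)
    .map(([type, page]) => `${type}: ${page.url}`);
  console.log(item.domain, confident);
}
```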
Page Types
| Type | Common URL Patterns | Description |
|---|---|---|
| `pricing` | `/pricing`, `/plans`, `/price` | Pricing and plan information |
| `docs` | `/docs`, `/documentation`, `/api`, `/developer` | Documentation and API reference |
| `status` | `/status`, `/uptime`, status.example.com | System status and uptime pages |
| `security` | `/security`, `/trust`, `/compliance` | Security and compliance information |
| `privacy` | `/privacy`, `/privacy-policy`, `/data-protection` | Privacy policy |
| `terms` | `/terms`, `/tos`, `/terms-of-service`, `/legal` | Terms of service |
📊 How does confidence scoring work?
Each discovered page includes a confidence score between 0 and 1 that indicates how certain the Actor is that the page is correct.
| Score | Meaning |
|---|---|
| 0.80 - 1.00 | Very confident - strong signals from URL path, anchor text, and page location |
| 0.50 - 0.79 | Probable match - good evidence but some ambiguity |
| 0.30 - 0.49 | Best guess - limited evidence, may need manual verification |
| Below 0.30 | Not returned - insufficient confidence |
Scoring Factors
- Discovery Source - Base score from how the page was found
  - Fast-path (direct URL probe): +0.40
  - Homepage link: +0.30
  - Depth-1 crawl: +0.20
  - Sitemap: +0.10
- Positive Signals - Added to the score
  - Exact path match (e.g., `/pricing`): +0.30
  - Token in path (e.g., `/pricing-plans`): +0.20
  - Anchor text match: +0.25
  - Footer/nav location: +0.12 to +0.15
  - Subdomain match (e.g., `docs.example.com`): +0.25
- Verification - Final adjustment after checking page content
  - Title matches expected keywords: +0.20
  - Content verified: +0.15
  - HTTP error: -0.50
  - Wrong content type: -0.30
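To make the arithmetic concrete, here is an illustrative sketch of how these factors could combine (the Actor's exact weighting and clamping logic is internal; `SIGNAL_WEIGHTS` and `score` are assumed names for illustration):

```javascript
// Illustrative scoring sketch using the positive-signal weights listed above.
const SIGNAL_WEIGHTS = {
  exact_path: 0.30,
  path_token: 0.20,
  anchor: 0.25,
  footer_link: 0.12,
  subdomain: 0.25,
};

function score(baseSource, signals, verificationAdjustment = 0) {
  const raw = baseSource
    + signals.reduce((sum, s) => sum + (SIGNAL_WEIGHTS[s] ?? 0), 0)
    + verificationAdjustment;
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}

// e.g. a fast-path probe (+0.40) with an exact path match (+0.30) and a
// verified title (+0.20) lands in the "very confident" band:
console.log(score(0.40, ['exact_path'], 0.20)); // 0.9
```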
🔗 Integrations and API access
REST API
Run the Actor via the Apify API:
```bash
curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~website-key-pages-finder/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://example.com" }],
    "maxDepth": 1,
    "maxPagesPerDomain": 12
  }'
```
JavaScript SDK
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_USERNAME/website-key-pages-finder').call({
    startUrls: [
        { url: 'https://example.com' },
        { url: 'https://another-site.com' },
    ],
    maxDepth: 1,
    maxPagesPerDomain: 12,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python SDK
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('YOUR_USERNAME/website-key-pages-finder').call(run_input={
    'startUrls': [
        {'url': 'https://example.com'},
        {'url': 'https://another-site.com'},
    ],
    'maxDepth': 1,
    'maxPagesPerDomain': 12,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
Webhooks and Integrations
Apify supports webhooks to notify your systems when a run completes. You can also integrate with:
- Zapier - Trigger workflows when new data is available
- Make (Integromat) - Build automated pipelines
- Google Sheets - Export results directly to spreadsheets
- Slack - Get notifications when runs complete
See Apify Integrations for more options.
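As a concrete example of the webhook setup mentioned above, you can register a run-completion webhook with the JavaScript client (a sketch; the endpoint URL and Actor ID are placeholders you must replace):

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Notify your endpoint whenever a run of this Actor finishes successfully.
await client.webhooks().create({
    eventTypes: ['ACTOR.RUN.SUCCEEDED'],
    condition: { actorId: 'YOUR_ACTOR_ID' },
    requestUrl: 'https://your-server.example.com/apify-webhook',
});
```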
❓ FAQ
What happens if a page type isn't found?
If the Actor cannot find a page type with sufficient confidence (score >= 0.30), that type is omitted from the `pages` object in the output. This is normal - not all websites have all page types.
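In practice, treat every page type as optional when consuming results, e.g.:

```javascript
// "item" is one dataset record; each page type may be absent.
const pricing = item.pages.pricing;
if (pricing) {
    console.log(`Pricing: ${pricing.url} (confidence ${pricing.confidence})`);
} else {
    console.log('No pricing page found above the 0.30 confidence threshold');
}
```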
Why are some confidence scores lower than expected?
Confidence scores depend on the signals found during crawling. Sites with non-standard URL structures, unusual navigation, or pages behind authentication may have lower scores. Check the evidence array to understand what signals were detected.
Can this Actor handle JavaScript-rendered websites?
This Actor uses HTTP-based crawling (CheerioCrawler) for speed and efficiency. Sites that rely heavily on JavaScript for rendering may have incomplete results. The output includes a `likelyJsRendered` flag to indicate when this might be an issue. For such sites, consider using a browser-based scraper.
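You can use that flag to route affected sites to a browser-based follow-up, for example (assuming `items` is the dataset array from the SDK examples above):

```javascript
// Collect domains that likely need a browser-based scraper instead.
const needsBrowser = items
    .filter((item) => item.crawlStats.likelyJsRendered)
    .map((item) => item.domain);
console.log(needsBrowser);
```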
How do I increase accuracy for specific sites?
- Increase `maxPagesPerDomain` to allow more thorough crawling
- Set `returnTopN` > 1 to see alternative candidates
- Enable `debug` mode to see all candidates and their scores
What's the difference between maxDepth 0 and 1?
- `maxDepth: 0` - Only analyzes the homepage (fastest, cheapest)
- `maxDepth: 1` - Analyzes the homepage plus follows promising links (recommended for best results)
Does this work with sites behind login?
No, this Actor only crawls publicly accessible pages. It cannot handle authentication or login flows.
⚖️ Is it legal to scrape websites for key pages?
Web scraping legality varies by jurisdiction and use case. When using this Actor:
- Respect robots.txt - The Actor follows standard web crawling conventions
- Review Terms of Service - Some websites explicitly prohibit scraping in their ToS
- Use reasonable rate limits - The Actor includes delays to avoid overwhelming servers
- Public data only - Only scrape publicly accessible information
- Intended use - Ensure your use case complies with applicable laws (GDPR, CCPA, etc.)
This Actor is designed for legitimate business purposes such as competitive research, lead enrichment, and website auditing. Users are responsible for ensuring their use complies with applicable laws and website terms of service.
Disclaimer: This information is not legal advice. Consult with a legal professional for guidance specific to your jurisdiction and use case.
⚠️ Limitations
- JavaScript-rendered content - Uses HTTP-based crawling (CheerioCrawler), so heavily JavaScript-rendered sites may have incomplete results. Check the `likelyJsRendered` flag.
- Rate limiting - Some sites may block rapid requests. The Actor includes retry logic, but sites with aggressive anti-bot measures may cause failures.
- Page budget - Limited to `maxPagesPerDomain` fetches per site to control costs. Increase this for complex sites.
- Crawl depth - Currently supports depth 0 (homepage only) or depth 1 (homepage + one level). Deep recursive crawling is not supported.
- Authentication - Cannot access pages behind login or authentication.
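For sites that block direct requests (see the rate-limiting note above), enabling Apify Proxy through the documented `proxyConfiguration` input may help:

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "proxyConfiguration": { "useApifyProxy": true }
}
```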
🔄 Related Actors
Looking for more web scraping solutions? Check out these related Actors:
- Website Content Crawler - Extract all text content from websites
- Web Scraper - General-purpose web scraping with custom selectors
- Cheerio Scraper - Fast HTTP-based scraping for static sites
📚 Resources and support
- Apify Platform Documentation - Learn how to use Apify
- Report Issues - Found a bug? Let us know
- Apify Discord - Join the community for help and discussions
🛠️ Local Development
Prerequisites
- Node.js 18+
- npm
Setup
```bash
# Install dependencies
npm install

# Run locally
apify run

# Run with custom input
apify run --input='{"startUrls":[{"url":"https://example.com"}]}'
```
Deploy
```bash
# Log in to Apify
apify login

# Deploy to the Apify platform
apify push
```