Website Key Pages Finder

Find key pages (pricing, docs, status, security, privacy, terms) on any website. Crawls start URLs and returns structured URLs with confidence scores and evidence. Great for competitor analysis, lead enrichment, and audits.

Automatically find and scrape key pages from any website: pricing pages, documentation, status pages, security information, privacy policies, and terms of service. This Apify Actor crawls websites intelligently and returns structured data with confidence scores for each discovered page.

🔍 What does Website Key Pages Finder do?

Website Key Pages Finder is an Apify Actor that automatically discovers important pages on any website. Given a list of URLs, it crawls each site and returns the URLs of six key page types along with confidence scores and evidence explaining how each page was found.

Key page types discovered:

  • Pricing - Plans, costs, and billing information
  • Documentation - API docs, guides, and developer resources
  • Status - System uptime and incident pages
  • Security - Trust centers, compliance, and security policies
  • Privacy - Privacy policies and data protection information
  • Terms - Terms of service and legal agreements

The Actor uses a multi-phase discovery approach that combines URL pattern probing, homepage link extraction, and intelligent crawling to find pages even on sites with non-standard structures.
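
As a rough illustration of the first phase, URL pattern probing amounts to requesting a handful of well-known paths per page type and treating a successful response as evidence. The following is a simplified sketch, not the Actor's actual source; probeFastPath and CANDIDATE_PATHS are hypothetical names, and the paths mirror the "Common URL Patterns" table below:

// Illustrative sketch of fast-path URL probing (hypothetical, not the Actor's source).
const CANDIDATE_PATHS = {
    pricing: ['/pricing', '/plans', '/price'],
    docs: ['/docs', '/documentation', '/api', '/developer'],
    status: ['/status', '/uptime'],
    security: ['/security', '/trust', '/compliance'],
    privacy: ['/privacy', '/privacy-policy', '/data-protection'],
    terms: ['/terms', '/tos', '/terms-of-service', '/legal'],
};

// Probe each candidate path with a HEAD request; a 2xx response is strong evidence.
async function probeFastPath(baseUrl) {
    const found = {};
    for (const [type, paths] of Object.entries(CANDIDATE_PATHS)) {
        for (const path of paths) {
            const url = new URL(path, baseUrl).href;
            const res = await fetch(url, { method: 'HEAD', redirect: 'follow' });
            if (res.ok) {
                found[type] = { url: res.url, evidence: [`exact_path:${path}`] };
                break; // first hit wins for this type
            }
        }
    }
    return found;
}

Pages found this way still go through link extraction and content verification before scoring, which is why the evidence arrays in the output often combine several signals.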

🎯 Why scrape key pages from websites?

Finding key pages manually across dozens or hundreds of websites is time-consuming and error-prone. This Actor automates the process, making it valuable for:

  • Competitor Analysis - Quickly gather pricing pages and documentation from competitor websites to understand their offerings and positioning
  • Sales Intelligence - Enrich lead data with links to company pricing, security, and compliance pages before outreach
  • Website Auditing - Verify that your own sites have discoverable key pages and assess how competitors structure their information architecture
  • Market Research - Collect pricing pages across an industry to analyze pricing trends and strategies
  • Due Diligence - Gather legal documents (privacy policies, terms) from potential partners or acquisition targets
  • Compliance Monitoring - Track privacy policy and terms changes across a portfolio of vendors

🚀 How to use Website Key Pages Finder

Follow these steps to find key pages on any website:

  1. Open the Actor in Apify Console or via the API
  2. Add your URLs to the Start URLs field (homepage URLs work best)
  3. Configure options (optional) - adjust crawl depth, page limits, and timeout as needed
  4. Run the Actor by clicking "Start" or calling the API
  5. Download results from the Dataset tab in JSON, CSV, or Excel format

Example Input

{
  "startUrls": [
    { "url": "https://apify.com" },
    { "url": "https://stripe.com" },
    { "url": "https://github.com" }
  ],
  "maxDepth": 1,
  "maxPagesPerDomain": 12,
  "includeSubdomains": true
}

💰 How much does it cost to find key pages?

Website Key Pages Finder uses Pay Per Event (PPE) pricing, so you only pay for the websites you analyze.

Pricing:

  • Per website analyzed: $0.005 per site
  • Start fee: $0.005 per run
  • No hidden compute costs - the price per website includes all crawling and processing

Cost control:

  • Set a maximum spend per run in the Actor input to limit costs
  • The Actor stops gracefully when your spending limit is reached
  • Remaining URLs are skipped (not charged) when the limit is hit

Example costs:

| Websites | Cost   |
|----------|--------|
| 10       | $0.055 |
| 100      | $0.505 |
| 1,000    | $5.005 |
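
These totals follow directly from the pricing above: total = $0.005 start fee + $0.005 × websites analyzed, so 100 websites cost 0.005 + 100 × 0.005 = $0.505.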

Free tier: Apify provides a free tier with monthly credits, typically sufficient for testing and small-scale usage. Check your Apify account for current free tier limits.

📥 Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| startUrls | array | required | URLs of websites to analyze. Each URL should be a homepage or any page from the domain. |
| maxDepth | integer | 1 | Crawl depth. 0 = homepage only, 1 = homepage + priority pages (recommended). |
| maxPagesPerDomain | integer | 12 | Maximum pages to fetch per domain. Controls costs and processing time. |
| includeSubdomains | boolean | true | Whether to include subdomains when discovering pages (e.g., docs.example.com). |
| returnTopN | integer | 1 | Number of top candidates to return per page type. Set higher to see alternative candidates. |
| timeoutSecs | integer | 30 | Timeout in seconds for processing each site. |
| proxyConfiguration | object | { "useApifyProxy": false } | Proxy settings for sites that block direct access. |
| debug | boolean | false | Include debug information (raw candidates) in output. |
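
For instance, a minimal input that enables Apify Proxy and a longer timeout for a stubborn site might look like this (the field values are illustrative, using only the fields documented above):

{
  "startUrls": [{ "url": "https://example.com" }],
  "proxyConfiguration": { "useApifyProxy": true },
  "timeoutSecs": 60
}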

📤 Output

Each website produces one result object in the dataset:

{
  "schemaVersion": "1.0.0",
  "inputUrl": "https://apify.com",
  "finalUrl": "https://apify.com/",
  "domain": "apify.com",
  "pages": {
    "pricing": {
      "url": "https://apify.com/pricing",
      "confidence": 0.95,
      "evidence": ["exact_path:/pricing", "anchor:Pricing", "footer_link"]
    },
    "docs": {
      "url": "https://docs.apify.com",
      "confidence": 0.92,
      "evidence": ["subdomain:docs", "anchor:Documentation"]
    },
    "status": {
      "url": "https://status.apify.com",
      "confidence": 0.88,
      "evidence": ["subdomain:status", "anchor:Status"]
    },
    "security": {
      "url": "https://apify.com/security",
      "confidence": 0.85,
      "evidence": ["path_token:security", "footer_link"]
    },
    "privacy": {
      "url": "https://apify.com/privacy-policy",
      "confidence": 0.90,
      "evidence": ["path_token:privacy", "anchor:Privacy Policy", "footer_link"]
    },
    "terms": {
      "url": "https://apify.com/terms-of-service",
      "confidence": 0.88,
      "evidence": ["path_token:terms", "anchor:Terms of Service", "footer_link"]
    }
  },
  "crawlStats": {
    "pagesFetched": 8,
    "timeMs": 2340,
    "errors": [],
    "likelyJsRendered": false
  },
  "timestamp": "2024-01-15T10:30:00.000Z"
}

Output Fields

| Field | Description |
|-------|-------------|
| inputUrl | The URL you provided |
| finalUrl | The URL after following redirects |
| domain | The root domain extracted from the URL |
| pages | Object containing discovered pages for each type |
| pages.[type].url | URL of the discovered page |
| pages.[type].confidence | Confidence score from 0 to 1 |
| pages.[type].evidence | Array of signals that contributed to the score |
| crawlStats.pagesFetched | Number of pages fetched during discovery |
| crawlStats.timeMs | Processing time in milliseconds |
| crawlStats.errors | Any errors encountered during crawling |
| crawlStats.likelyJsRendered | Whether the site appears to be JavaScript-rendered |
| topCandidates | (Optional) When returnTopN > 1, contains all top candidates per type |
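
As an example of consuming this schema, the sketch below pulls a finished run's dataset with apify-client and keeps only pricing pages found with confidence of at least 0.8; the threshold is an arbitrary choice for illustration, and YOUR_DATASET_ID is a placeholder:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Fetch results from a finished run's dataset (ID is a placeholder).
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

// Keep only pricing pages discovered with high confidence.
const pricingPages = items
    .filter((item) => item.pages?.pricing?.confidence >= 0.8)
    .map((item) => ({ domain: item.domain, url: item.pages.pricing.url }));

console.log(pricingPages);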

Page Types

| Type | Common URL Patterns | Description |
|------|---------------------|-------------|
| pricing | /pricing, /plans, /price | Pricing and plan information |
| docs | /docs, /documentation, /api, /developer | Documentation and API reference |
| status | /status, /uptime, status.example.com | System status and uptime pages |
| security | /security, /trust, /compliance | Security and compliance information |
| privacy | /privacy, /privacy-policy, /data-protection | Privacy policy |
| terms | /terms, /tos, /terms-of-service, /legal | Terms of service |

📊 How does confidence scoring work?

Each discovered page includes a confidence score between 0 and 1 that indicates how certain the Actor is that the page is correct.

| Score | Meaning |
|-------|---------|
| 0.80 - 1.00 | Very confident - strong signals from URL path, anchor text, and page location |
| 0.50 - 0.79 | Probable match - good evidence but some ambiguity |
| 0.30 - 0.49 | Best guess - limited evidence, may need manual verification |
| Below 0.30 | Not returned - insufficient confidence |

Scoring Factors

  1. Discovery Source - Base score from how the page was found

    • Fast-path (direct URL probe): +0.40
    • Homepage link: +0.30
    • Depth-1 crawl: +0.20
    • Sitemap: +0.10
  2. Positive Signals - Added to the score

    • Exact path match (e.g., /pricing): +0.30
    • Token in path (e.g., /pricing-plans): +0.20
    • Anchor text match: +0.25
    • Footer/nav location: +0.12 to +0.15
    • Subdomain match (e.g., docs.example.com): +0.25
  3. Verification - Final adjustment after checking page content

    • Title matches expected keywords: +0.20
    • Content verified: +0.15
    • HTTP error: -0.50
    • Wrong content type: -0.30
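
Putting these together, a page's confidence is roughly the sum of its base score, positive signals, and verification adjustment. Here is a minimal sketch of that idea, assuming the result is clamped to [0, 1]; the Actor's exact arithmetic may differ:

// Simplified illustration of how the documented factors could combine.
function estimateConfidence(baseScore, signalScores, verificationAdjustment) {
    // Sum the documented factors and clamp to the valid confidence range (clamping is assumed).
    const raw = baseScore + signalScores.reduce((sum, s) => sum + s, 0) + verificationAdjustment;
    return Math.min(1, Math.max(0, raw));
}

// Fast-path probe (+0.40) with an exact path match (+0.30) and matching anchor text (+0.25):
console.log(estimateConfidence(0.40, [0.30, 0.25], 0).toFixed(2)); // "0.95"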

🔗 Integrations and API access

REST API

Run the Actor via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~website-key-pages-finder/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      { "url": "https://example.com" }
    ],
    "maxDepth": 1,
    "maxPagesPerDomain": 12
  }'
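
For small batches, you can also start the run and fetch its results in a single request via Apify's run-sync-get-dataset-items endpoint; note that synchronous runs are subject to a platform time limit, so this suits short runs best:

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~website-key-pages-finder/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "startUrls": [{ "url": "https://example.com" }] }'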

JavaScript SDK

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_USERNAME/website-key-pages-finder').call({
    startUrls: [
        { url: 'https://example.com' },
        { url: 'https://another-site.com' }
    ],
    maxDepth: 1,
    maxPagesPerDomain: 12
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python SDK

from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

run = client.actor('YOUR_USERNAME/website-key-pages-finder').call(run_input={
    'startUrls': [
        {'url': 'https://example.com'},
        {'url': 'https://another-site.com'},
    ],
    'maxDepth': 1,
    'maxPagesPerDomain': 12,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)

Webhooks and Integrations

Apify supports webhooks to notify your systems when a run completes. You can also integrate with:

  • Zapier - Trigger workflows when new data is available
  • Make (Integromat) - Build automated pipelines
  • Google Sheets - Export results directly to spreadsheets
  • Slack - Get notifications when runs complete

See Apify Integrations for more options.
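
For example, one way to register a webhook through the Apify API so that a finished run notifies your endpoint looks like this; the requestUrl and Actor ID are placeholders, and Apify's webhook documentation has the full payload schema:

curl -X POST "https://api.apify.com/v2/webhooks?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
    "condition": { "actorId": "YOUR_ACTOR_ID" },
    "requestUrl": "https://your-server.example.com/apify-webhook"
  }'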

❓ FAQ

What happens if a page type isn't found?

If the Actor cannot find a page type with sufficient confidence (score >= 0.30), that type will be omitted from the pages object in the output. This is normal - not all websites have all page types.

Why are some confidence scores lower than expected?

Confidence scores depend on the signals found during crawling. Sites with non-standard URL structures, unusual navigation, or pages behind authentication may have lower scores. Check the evidence array to understand what signals were detected.

Can this Actor handle JavaScript-rendered websites?

This Actor uses HTTP-based crawling (CheerioCrawler) for speed and efficiency. Sites that heavily rely on JavaScript for rendering may have incomplete results. The output includes a likelyJsRendered flag to indicate when this might be an issue. For such sites, consider using a browser-based scraper.
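
The flag is easy to act on. This short sketch collects the affected domains for a second pass with a browser-based scraper (the dataset ID is a placeholder):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

// Domains where HTTP-only crawling probably missed JavaScript-rendered content.
const needsBrowser = items
    .filter((item) => item.crawlStats?.likelyJsRendered)
    .map((item) => item.domain);

console.log('Re-run with a browser-based scraper:', needsBrowser);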

How do I increase accuracy for specific sites?

  • Increase maxPagesPerDomain to allow more thorough crawling
  • Set returnTopN > 1 to see alternative candidates
  • Enable debug mode to see all candidates and their scores

What's the difference between maxDepth 0 and 1?

  • maxDepth: 0 - Only analyzes the homepage (fastest, cheapest)
  • maxDepth: 1 - Analyzes homepage plus follows promising links (recommended for best results)

Does this work with sites behind login?

No, this Actor only crawls publicly accessible pages. It cannot handle authentication or login flows.

⚖️ Is it legal to scrape key pages?

Web scraping legality varies by jurisdiction and use case. When using this Actor:

  • Respect robots.txt - The Actor follows standard web crawling conventions
  • Review Terms of Service - Some websites explicitly prohibit scraping in their ToS
  • Use reasonable rate limits - The Actor includes delays to avoid overwhelming servers
  • Public data only - Only scrape publicly accessible information
  • Intended use - Ensure your use case complies with applicable laws (GDPR, CCPA, etc.)

This Actor is designed for legitimate business purposes such as competitive research, lead enrichment, and website auditing. Users are responsible for ensuring their use complies with applicable laws and website terms of service.

Disclaimer: This information is not legal advice. Consult with a legal professional for guidance specific to your jurisdiction and use case.

⚠️ Limitations

  • JavaScript-rendered content - Uses HTTP-based crawling (CheerioCrawler), so heavily JavaScript-rendered sites may have incomplete results. Check the likelyJsRendered flag.
  • Rate limiting - Some sites may block rapid requests. The Actor includes retry logic, but sites with aggressive anti-bot measures may cause failures.
  • Page budget - Limited to maxPagesPerDomain fetches per site to control costs. Increase this for complex sites.
  • Crawl depth - Currently supports depth 0 (homepage only) or depth 1 (homepage + one level). Deep recursive crawling is not supported.
  • Authentication - Cannot access pages behind login or authentication.

🛠️ Local Development

Prerequisites

  • Node.js 18+
  • npm

Setup

# Install dependencies
npm install
# Run locally
apify run
# Run with custom input
apify run --input='{"startUrls":[{"url":"https://example.com"}]}'

Deploy

# Login to Apify
apify login
# Deploy to Apify platform
apify push