
Website Crawler - SEO Audit Crawler with LLM-Ready Markdown Extraction

Fast, reliable website crawler built for SEO audits and AI/LLM content analysis. Auto-discovers sitemaps, extracts metadata, headings, links, images, and LLM-ready markdown content from every page. Uses Mozilla Readability and Turndown for Firecrawl-like markdown extraction without external API costs.
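
Curious how that pipeline fits together? Below is a minimal sketch of a Readability + Turndown conversion in TypeScript. It illustrates the general approach only (not the actor's actual source) and assumes the page HTML has already been fetched:

// Minimal sketch: strip boilerplate with Mozilla Readability, then
// convert the cleaned article HTML to markdown with Turndown.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

function htmlToMarkdown(html: string, url: string): string {
  // Parse the raw HTML; passing the URL lets relative links resolve.
  const dom = new JSDOM(html, { url });
  // Readability drops navigation, ads, and page chrome, keeping the main content.
  const article = new Readability(dom.window.document).parse();
  if (!article?.content) return "";
  // Turndown converts the cleaned HTML fragment into markdown.
  return new TurndownService({ headingStyle: "atx" }).turndown(article.content);
}

Because the conversion runs locally, there is no per-request API charge for the markdown step.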

Features

  • LLM-Ready Markdown - Firecrawl-like extraction using Mozilla Readability + Turndown
  • Sitemap Discovery - Automatic detection of sitemap.xml, sitemap_index.xml, wp-sitemap.xml
  • Full HTML Extraction - Complete page HTML for custom parsing
  • Metadata Extraction - Title, description, OG tags, canonical, robots directives
  • Heading Structure - H1, H2, H3 hierarchy analysis
  • Link Analysis - Internal and external link mapping
  • Image Extraction - All images with alt text and src URLs
  • Word Count - Text and markdown word counts per page
  • Load Time Metrics - Page load time in milliseconds
  • Status Code Tracking - HTTP status codes for broken link detection
  • Bot Evasion - Fingerprint injection for reliable crawling
  • Configurable Depth - Set crawl depth and page limits
  • Demo Mode - Test with sample data before going live

Who Should Use This Actor?

SEO Agencies

Run technical SEO audits at scale. Extract metadata, heading structure, link architecture, and content from every page of a client's website in one crawl.

Content Teams

Extract clean markdown from any website for AI/LLM processing, content analysis, or migration projects. No Firecrawl API key required.

AI/ML Engineers

Build training datasets from websites with clean markdown extraction. Each page outputs structured data ready for LLM fine-tuning or RAG pipelines.

Web Developers

Audit site structure before migrations. Map internal links, find broken pages, and inventory all URLs with metadata.

Digital Marketing Agencies

Create comprehensive site audits for client onboarding. Analyze meta tags, heading hierarchy, and content structure across entire websites.

Competitive Intelligence Teams

Crawl competitor websites to analyze their content structure, internal linking strategy, and page architecture.

Quick Start

Demo Mode (Free Test)

{
  "demoMode": true
}

Basic Website Crawl

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxCrawlPages": 25,
  "maxCrawlDepth": 2,
  "crawlSitemap": true,
  "demoMode": false
}

Deep SEO Audit Crawl

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxCrawlPages": 500,
  "maxCrawlDepth": 5,
  "crawlSitemap": true,
  "demoMode": false
}

Multi-Site Crawl

{
  "startUrls": [
    { "url": "https://site1.com" },
    { "url": "https://site2.com" },
    { "url": "https://site3.com" }
  ],
  "maxCrawlPages": 100,
  "maxCrawlDepth": 3,
  "crawlSitemap": true,
  "demoMode": false
}

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | - | URLs to start crawling from (required unless demoMode) |
| maxCrawlPages | number | 25 | Maximum pages to crawl per site |
| maxCrawlDepth | number | 2 | Maximum link depth to follow |
| crawlSitemap | boolean | true | Auto-discover and parse sitemaps |
| proxyConfiguration | object | Residential | Proxy settings |
| demoMode | boolean | true | Return sample data for testing |
| webhookUrl | string | - | Webhook URL for results delivery |
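
As a usage illustration, these parameters map directly onto a programmatic run via the official apify-client package. This is a hedged sketch: the actor ID is a placeholder, so substitute the ID from this actor's store page.

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor("<ACTOR_ID>").call({
  startUrls: [{ url: "https://example.com" }],
  maxCrawlPages: 25,
  maxCrawlDepth: 2,
  crawlSitemap: true,
  demoMode: false,
});

// Fetch the crawled pages from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Crawled ${items.length} pages`);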

Output Format

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "html": "<html>...</html>",
  "text": "Page text content...",
  "markdown": "# Page Title\n\nClean markdown content ready for LLMs...",
  "statusCode": 200,
  "loadTimeMs": 1234,
  "metadata": {
    "description": "Meta description",
    "keywords": "seo, crawler",
    "ogTitle": "Open Graph Title",
    "ogDescription": "OG description",
    "ogImage": "https://example.com/og.jpg",
    "canonical": "https://example.com/page",
    "robots": "index, follow"
  },
  "headings": {
    "h1": ["Main Heading"],
    "h2": ["Subheading 1", "Subheading 2"],
    "h3": []
  },
  "links": {
    "internal": ["https://example.com/other"],
    "external": ["https://external.com"]
  },
  "images": [
    { "src": "https://example.com/image.jpg", "alt": "Alt text" }
  ],
  "wordCount": 1500,
  "markdownWordCount": 1200,
  "crawledAt": "2026-01-28T12:00:00.000Z"
}
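
A typical consumer of this output is an audit script. The sketch below flags broken pages, missing meta descriptions, and H1 problems; the field names are taken from the format above, while the CrawledPage interface is a simplified assumption:

// Simplified view of a dataset item; only the fields the audit needs.
interface CrawledPage {
  url: string;
  statusCode: number;
  metadata: { description?: string };
  headings: { h1: string[] };
}

function auditPages(items: CrawledPage[]): string[] {
  const issues: string[] = [];
  for (const page of items) {
    if (page.statusCode >= 400) issues.push(`${page.url}: HTTP ${page.statusCode}`);
    if (!page.metadata.description) issues.push(`${page.url}: missing meta description`);
    if (page.headings.h1.length !== 1)
      issues.push(`${page.url}: expected 1 h1, found ${page.headings.h1.length}`);
  }
  return issues;
}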

Pricing (Pay-Per-Event)

| Event | Description | Price |
|---|---|---|
| page_crawled | Per page crawled with full extraction | $0.005 |
| sitemap_discovered | Per sitemap discovered and parsed | $0.01 |

Example costs:

  • 25 pages (no sitemap): 25 x $0.005 = $0.125
  • 100 pages + sitemap: (100 x $0.005) + $0.01 = $0.51
  • 500 pages + sitemap: (500 x $0.005) + $0.01 = $2.51
  • Demo mode: $0.00
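
The arithmetic above reduces to a one-line helper, assuming one sitemap_discovered event per site when sitemap crawling is enabled:

// Estimate a run's cost in USD from the event prices listed above.
function estimateCostUsd(pages: number, sitemapDiscovered: boolean): number {
  const PAGE_PRICE = 0.005;    // per page_crawled event
  const SITEMAP_PRICE = 0.01;  // per sitemap_discovered event
  return pages * PAGE_PRICE + (sitemapDiscovered ? SITEMAP_PRICE : 0);
}

// estimateCostUsd(100, true) === 0.51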

Common Scenarios

Scenario 1: Technical SEO Audit

{
  "startUrls": [{ "url": "https://client-website.com" }],
  "maxCrawlPages": 500,
  "maxCrawlDepth": 5,
  "crawlSitemap": true,
  "demoMode": false
}

Crawl an entire client website to audit meta tags, headings, links, and content structure.

Scenario 2: LLM Content Extraction

{
  "startUrls": [{ "url": "https://documentation-site.com" }],
  "maxCrawlPages": 200,
  "maxCrawlDepth": 3,
  "crawlSitemap": true,
  "demoMode": false
}

Extract clean markdown from documentation sites for RAG pipelines or AI training data.

Scenario 3: Pre-Migration URL Inventory

{
  "startUrls": [{ "url": "https://old-website.com" }],
  "maxCrawlPages": 1000,
  "maxCrawlDepth": 10,
  "crawlSitemap": true,
  "demoMode": false
}

Create a complete URL inventory with metadata before a site migration or redesign.

Webhook & Automation Integration

Zapier / Make.com / n8n

  1. Create a webhook trigger in your automation platform
  2. Paste the generated webhook URL into the actor's webhookUrl input parameter
  3. Route results to Google Sheets, databases, or analysis tools
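
If you would rather receive results at your own endpoint than through an automation platform, a bare-bones receiver looks like the sketch below. The payload shape is an assumption (the exact delivery format is not documented here), so the handler simply logs whatever JSON body the actor posts to webhookUrl:

import { createServer } from "node:http";

// Accept POSTed JSON on port 3000 and log it.
createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    console.log("Crawl results delivered:", JSON.parse(body));
    res.writeHead(200).end("ok");
  });
}).listen(3000);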

Popular automations:

  • Crawl data -> Google Sheets (SEO audit spreadsheet)
  • Broken pages (4xx/5xx) -> Slack alert (site health monitoring; sketched after this list)
  • Markdown content -> Database (AI training data pipeline)
  • Page metadata -> Airtable (content inventory)
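
As an example, here is a hedged sketch of the broken-pages Slack alert above, posting to a Slack incoming webhook. slackWebhookUrl is your own incoming webhook URL, and items is the dataset result from an earlier run (see the apify-client example above):

// Post a summary of 4xx/5xx pages to Slack via an incoming webhook.
async function alertBrokenPages(
  items: { url: string; statusCode: number }[],
  slackWebhookUrl: string,
): Promise<void> {
  const broken = items.filter((p) => p.statusCode >= 400);
  if (broken.length === 0) return;
  await fetch(slackWebhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `Found ${broken.length} broken pages:\n` +
        broken.map((p) => `• ${p.url} (${p.statusCode})`).join("\n"),
    }),
  });
}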

Apify Scheduled Runs

Schedule weekly or monthly crawls to track site changes and detect regressions.

FAQ

Q: How is this different from Firecrawl?

A: This crawler provides Firecrawl-like markdown extraction using Mozilla Readability + Turndown without requiring an external API key. You get clean, LLM-ready markdown at a fraction of the cost.

Q: Does it handle JavaScript-rendered pages?

A: Yes. The crawler uses a headless browser with fingerprint injection to render JavaScript-heavy pages before extracting content.

Q: Can I crawl password-protected pages?

A: Currently, only publicly accessible pages are supported. The crawler does not handle authentication.

Q: How does sitemap discovery work?

A: The crawler automatically checks for sitemap.xml, sitemap_index.xml, and wp-sitemap.xml at the domain root. Discovered URLs are added to the crawl queue.
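
As an illustration of that discovery logic (not the actor's actual implementation), probing the well-known paths could look like this:

// Probe the standard sitemap locations at the domain root and return
// the first one that responds successfully, or null if none do.
const SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/wp-sitemap.xml"];

async function findSitemap(origin: string): Promise<string | null> {
  for (const path of SITEMAP_PATHS) {
    const url = new URL(path, origin).toString();
    const res = await fetch(url, { method: "HEAD" });
    if (res.ok) return url;
  }
  return null;
}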

Q: What happens with redirects?

A: Redirects are followed automatically. The final URL and status code are recorded.

Common Problems & Solutions

"Pages not loading"

  • Some sites require JavaScript rendering - this is handled automatically
  • Check if the site has aggressive bot protection
  • Try with residential proxy configuration

"Crawl stops early"

  • Check maxCrawlPages and maxCrawlDepth limits
  • Some sites have few internal links, limiting discovery
  • Enable crawlSitemap: true to discover more URLs

"Missing markdown content"

  • Pages with very little text content produce minimal markdown
  • Image-heavy pages may have low markdownWordCount
  • Check that the page has actual text content

"Demo data showing"

  • Set demoMode: false in your input; no API keys are required for live crawls

📞 Support


Built by John Rippy | Actor Arsenal