Website Content Crawler
Pricing: from $0.01 / 1,000 results
Crawl websites for SEO audits. Extracts HTML, title, meta tags, headings, links, and text content from every page, with automatic sitemap detection and parsing, metadata extraction (title, description, OG tags), heading structure (H1, H2, H3), internal and external link analysis, image extraction with alt text, and word counts.
Developer: John Rippy
Website Crawler - SEO Audit Crawler with LLM-Ready Markdown Extraction
Fast, reliable website crawler built for SEO audits and AI/LLM content analysis. Auto-discovers sitemaps, extracts metadata, headings, links, images, and LLM-ready markdown content from every page. Uses Mozilla Readability and Turndown for Firecrawl-like markdown extraction without external API costs.
Features
- LLM-Ready Markdown - Firecrawl-like extraction using Mozilla Readability + Turndown
- Sitemap Discovery - Automatic detection of sitemap.xml, sitemap_index.xml, wp-sitemap.xml
- Full HTML Extraction - Complete page HTML for custom parsing
- Metadata Extraction - Title, description, OG tags, canonical, robots directives
- Heading Structure - H1, H2, H3 hierarchy analysis
- Link Analysis - Internal and external link mapping
- Image Extraction - All images with alt text and src URLs
- Word Count - Text and markdown word counts per page
- Load Time Metrics - Page load time in milliseconds
- Status Code Tracking - HTTP status codes for broken link detection
- Bot Evasion - Fingerprint injection for reliable crawling
- Configurable Depth - Set crawl depth and page limits
- Demo Mode - Test with sample data before going live
Who Should Use This Actor?
SEO Agencies
Run technical SEO audits at scale. Extract metadata, heading structure, link architecture, and content from every page of a client's website in one crawl.
Content Teams
Extract clean markdown from any website for AI/LLM processing, content analysis, or migration projects. No Firecrawl API key required.
AI/ML Engineers
Build training datasets from websites with clean markdown extraction. Each page outputs structured data ready for LLM fine-tuning or RAG pipelines.
Web Developers
Audit site structure before migrations. Map internal links, find broken pages, and inventory all URLs with metadata.
Digital Marketing Agencies
Create comprehensive site audits for client onboarding. Analyze meta tags, heading hierarchy, and content structure across entire websites.
Competitive Intelligence Teams
Crawl competitor websites to analyze their content structure, internal linking strategy, and page architecture.
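Whichever workflow applies, the starting point is the same: run the Actor and read its dataset. A minimal sketch using the Apify Python client (`pip install apify-client`); the Actor ID `your-username/website-content-crawler` is a placeholder, so substitute the real ID from the store page:

```python
def build_run_input(start_url: str, max_pages: int = 25, max_depth: int = 2) -> dict:
    """Assemble the run input documented in the Input Parameters table below."""
    return {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        "crawlSitemap": True,
        "demoMode": False,
    }

def run_crawl(start_url: str, token: str) -> list[dict]:
    """Run the Actor and return its dataset items (network call)."""
    from apify_client import ApifyClient  # imported lazily: pip install apify-client
    client = ApifyClient(token)
    # Placeholder Actor ID -- replace with the real one from the store page.
    run = client.actor("your-username/website-content-crawler").call(
        run_input=build_run_input(start_url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```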
Quick Start
Demo Mode (Free Test)
```json
{ "demoMode": true }
```
Basic Website Crawl
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxCrawlPages": 25,
  "maxCrawlDepth": 2,
  "crawlSitemap": true,
  "demoMode": false
}
```
Deep SEO Audit Crawl
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxCrawlPages": 500,
  "maxCrawlDepth": 5,
  "crawlSitemap": true,
  "demoMode": false
}
```
Multi-Site Crawl
```json
{
  "startUrls": [
    { "url": "https://site1.com" },
    { "url": "https://site2.com" },
    { "url": "https://site3.com" }
  ],
  "maxCrawlPages": 100,
  "maxCrawlDepth": 3,
  "crawlSitemap": true,
  "demoMode": false
}
```
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | - | URLs to start crawling from (required unless `demoMode` is enabled) |
| `maxCrawlPages` | number | 25 | Maximum pages to crawl per site |
| `maxCrawlDepth` | number | 2 | Maximum link depth to follow |
| `crawlSitemap` | boolean | true | Auto-discover and parse sitemaps |
| `proxyConfiguration` | object | Residential | Proxy settings |
| `demoMode` | boolean | true | Return sample data for testing |
| `webhookUrl` | string | - | Webhook URL for results delivery |
Output Format
```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "html": "<html>...</html>",
  "text": "Page text content...",
  "markdown": "# Page Title\n\nClean markdown content ready for LLMs...",
  "statusCode": 200,
  "loadTimeMs": 1234,
  "metadata": {
    "description": "Meta description",
    "keywords": "seo, crawler",
    "ogTitle": "Open Graph Title",
    "ogDescription": "OG description",
    "ogImage": "https://example.com/og.jpg",
    "canonical": "https://example.com/page",
    "robots": "index, follow"
  },
  "headings": {
    "h1": ["Main Heading"],
    "h2": ["Subheading 1", "Subheading 2"],
    "h3": []
  },
  "links": {
    "internal": ["https://example.com/other"],
    "external": ["https://external.com"]
  },
  "images": [{ "src": "https://example.com/image.jpg", "alt": "Alt text" }],
  "wordCount": 1500,
  "markdownWordCount": 1200,
  "crawledAt": "2026-01-28T12:00:00.000Z"
}
```
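Each dataset item can be audited programmatically. A small sketch, using only field names from the output format above, that flags common on-page SEO issues (the thresholds are illustrative, not part of the Actor):

```python
def audit_page(item: dict) -> list[str]:
    """Flag common on-page SEO issues in one crawled-page record."""
    issues = []
    if not item.get("metadata", {}).get("description"):
        issues.append("missing meta description")
    h1s = item.get("headings", {}).get("h1", [])
    if len(h1s) == 0:
        issues.append("no H1")
    elif len(h1s) > 1:
        issues.append("multiple H1s")
    if item.get("statusCode", 200) >= 400:
        issues.append(f"HTTP {item['statusCode']}")
    if item.get("wordCount", 0) < 300:  # illustrative thin-content threshold
        issues.append("thin content (<300 words)")
    return issues
```

Running this over every dataset item yields a per-URL issue list ready for a spreadsheet export.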
Pricing (Pay-Per-Event)
| Event | Description | Price |
|---|---|---|
| `page_crawled` | Per page crawled with full extraction | $0.005 |
| `sitemap_discovered` | Per sitemap discovered and parsed | $0.01 |
Example costs:
- 25 pages (no sitemap): 25 x $0.005 = $0.125
- 100 pages + sitemap: (100 x $0.005) + $0.01 = $0.51
- 500 pages + sitemap: (500 x $0.005) + $0.01 = $2.51
- Demo mode: $0.00
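The arithmetic above generalizes to a one-line estimator. A sketch that works in tenths of a cent to avoid floating-point rounding surprises:

```python
def estimate_cost(pages: int, sitemaps: int = 0) -> float:
    """Pay-per-event cost: $0.005 per page_crawled + $0.01 per sitemap_discovered.
    Computed in tenths of a cent (integers), then converted to dollars."""
    tenths_of_cents = pages * 5 + sitemaps * 10
    return tenths_of_cents / 1000
```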
Common Scenarios
Scenario 1: Technical SEO Audit
```json
{
  "startUrls": [{ "url": "https://client-website.com" }],
  "maxCrawlPages": 500,
  "maxCrawlDepth": 5,
  "crawlSitemap": true,
  "demoMode": false
}
```
Crawl an entire client website to audit meta tags, headings, links, and content structure.
Scenario 2: LLM Content Extraction
```json
{
  "startUrls": [{ "url": "https://documentation-site.com" }],
  "maxCrawlPages": 200,
  "maxCrawlDepth": 3,
  "crawlSitemap": true,
  "demoMode": false
}
```
Extract clean markdown from documentation sites for RAG pipelines or AI training data.
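In practice, "ready for RAG pipelines" usually means splitting each page's `markdown` field into retrievable chunks. One hedged sketch: split on headings, then pack sections up to a size budget (the splitting strategy is illustrative, not part of the Actor):

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split markdown on H1-H3 headings, then pack consecutive sections
    into chunks of at most max_chars characters. Illustrative RAG prep."""
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)  # keep headings via lookahead
    chunks, current = [], ""
    for section in sections:
        if not section.strip():
            continue
        if current and len(current) + len(section) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks
```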
Scenario 3: Pre-Migration URL Inventory
```json
{
  "startUrls": [{ "url": "https://old-website.com" }],
  "maxCrawlPages": 1000,
  "maxCrawlDepth": 10,
  "crawlSitemap": true,
  "demoMode": false
}
```
Create a complete URL inventory with metadata before a site migration or redesign.
Webhook & Automation Integration
Zapier / Make.com / n8n
1. Create a webhook trigger in your automation platform
2. Copy the webhook URL into the `webhookUrl` input field
3. Route results to Google Sheets, databases, or analysis tools
Popular automations:
- Crawl data -> Google Sheets (SEO audit spreadsheet)
- Broken pages (4xx/5xx) -> Slack alert (site health monitoring)
- Markdown content -> Database (AI training data pipeline)
- Page metadata -> Airtable (content inventory)
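The "broken pages -> Slack alert" automation reduces to a small filter over the delivered page records. A sketch (the payload shape, a list of page records as in the Output Format section, is an assumption; verify what your `webhookUrl` endpoint actually receives):

```python
def broken_pages(items: list[dict]) -> list[str]:
    """Return URLs whose HTTP status indicates a broken page (4xx/5xx)."""
    return [it["url"] for it in items if it.get("statusCode", 200) >= 400]
```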
Apify Scheduled Runs
Schedule weekly or monthly crawls to track site changes and detect regressions.
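"Detect regressions" between scheduled runs can be as simple as diffing the URL sets of two crawls. A sketch over two result lists keyed by `url` and `statusCode`:

```python
def crawl_diff(previous: list[dict], current: list[dict]) -> dict:
    """Compare two crawl result sets by URL; report pages that vanished,
    appeared, or started returning an error status between runs."""
    prev = {it["url"]: it for it in previous}
    curr = {it["url"]: it for it in current}
    return {
        "removed": sorted(set(prev) - set(curr)),
        "added": sorted(set(curr) - set(prev)),
        "newly_broken": sorted(
            url for url in set(prev) & set(curr)
            if prev[url].get("statusCode", 200) < 400
            and curr[url].get("statusCode", 200) >= 400
        ),
    }
```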
FAQ
Q: How is this different from Firecrawl?
A: This crawler provides Firecrawl-like markdown extraction using Mozilla Readability + Turndown without requiring an external API key. You get clean, LLM-ready markdown at a fraction of the cost.
Q: Does it handle JavaScript-rendered pages?
A: Yes. The crawler uses a headless browser with fingerprint injection to render JavaScript-heavy pages before extracting content.
Q: Can I crawl password-protected pages?
A: Currently, only publicly accessible pages are supported. The crawler does not handle authentication.
Q: How does sitemap discovery work?
A: The crawler automatically checks for sitemap.xml, sitemap_index.xml, and wp-sitemap.xml at the domain root. Discovered URLs are added to the crawl queue.
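The discovery step described above can be sketched as building candidate sitemap URLs from the start URL's domain root (the three filenames are the ones this FAQ lists):

```python
from urllib.parse import urlsplit, urlunsplit

def sitemap_candidates(start_url: str) -> list[str]:
    """Candidate sitemap locations at the domain root, per the FAQ."""
    parts = urlsplit(start_url)
    root = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return [f"{root}/{name}"
            for name in ("sitemap.xml", "sitemap_index.xml", "wp-sitemap.xml")]
```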
Q: What happens with redirects?
A: Redirects are followed automatically. The final URL and status code are recorded.
Common Problems & Solutions
"Pages not loading"
- Some sites require JavaScript rendering; this is handled automatically
- Check if the site has aggressive bot protection
- Try with residential proxy configuration
"Crawl stops early"
- Check the `maxCrawlPages` and `maxCrawlDepth` limits
- Some sites have few internal links, limiting discovery
- Enable `crawlSitemap: true` to discover more URLs
"Missing markdown content"
- Pages with very little text content produce minimal markdown
- Image-heavy pages may have a low `markdownWordCount`
- Check that the page has actual text content
"Demo data showing"
- Set `demoMode: false` (no API keys required)
📞 Support
- Actor Arsenal: Full Actor Catalog
- Developer: John Rippy
Built by John Rippy | Actor Arsenal