Website Content Crawler — AI & RAG Ready
Pricing
Pay per event
Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navs, and footers. Configurable depth, concurrency, and page limits. Pay-per-page.
Developer
Ale
Maintained by Community
Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main content extraction strips navigation, footers, sidebars, and ads so your AI gets only the content that matters.
Why This Actor?
- AI-optimized output — Markdown + plain text per page, with content type detection
- Main content extraction — Readability-style selectors remove noise (nav, footer, ads, sidebars)
- Flexible crawl modes — Fetch a list of URLs directly (depth=0) or crawl entire sites (depth=1-5)
- Concurrent processing — Up to 20 parallel workers for high-throughput extraction
- Pay-per-page pricing — Only pay for pages successfully crawled
Use Cases
- Build RAG knowledge bases from company documentation sites
- Feed LLMs with up-to-date content from blog posts and news articles
- Extract article text for AI summarization pipelines
- Crawl competitor sites for content analysis
- Bulk-convert web pages to Markdown for offline use
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl. Use maxDepth=0 for flat fetch, maxDepth>0 to follow links |
| maxDepth | integer | 0 | Crawl depth. 0 = start pages only, 1 = start pages + their links, 2 = two levels, etc. |
| maxPagesPerCrawl | integer | 100 | Maximum total pages to process across all start URLs |
| maxPagesPerDomain | integer | 50 | Maximum pages per unique domain |
| maxConcurrency | integer | 5 | Number of parallel workers (1–20) |
| extractMainContent | boolean | true | Strip nav/footer/ads using readability-style selectors |
| proxyConfiguration | object | Apify proxy | Proxy settings |
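The limits in the table can be checked before a run is started. The following is a minimal sketch, not part of the actor: `build_run_input` is a hypothetical helper, the 0 to 5 depth range is taken from the "depth=1-5" wording in the feature list, and `proxyConfiguration` is left out on the assumption that the default applies when it is omitted.

```python
from typing import Any


def build_run_input(
    start_urls: list[str],
    max_depth: int = 0,
    max_pages_per_crawl: int = 100,
    max_pages_per_domain: int = 50,
    max_concurrency: int = 5,
    extract_main_content: bool = True,
) -> dict[str, Any]:
    """Return an input dict keyed by the parameter names from the Input table."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    for url in start_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Assumption: the README mentions depth 0 (flat fetch) and depth 1-5 (crawl).
    if not 0 <= max_depth <= 5:
        raise ValueError("maxDepth is expected to be between 0 and 5")
    # The table documents 1 to 20 parallel workers.
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if max_pages_per_crawl < 1 or max_pages_per_domain < 1:
        raise ValueError("page limits must be positive")
    return {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "maxPagesPerCrawl": max_pages_per_crawl,
        "maxPagesPerDomain": max_pages_per_domain,
        "maxConcurrency": max_concurrency,
        "extractMainContent": extract_main_content,
    }
```

Calling `build_run_input(["https://docs.example.com"], max_depth=1)` returns a dict that can be serialized to JSON and used as the actor input.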
Output
One record per crawled page:
| Field | Type | Description |
|---|---|---|
| url | string | URL of the crawled page |
| title | string | Page title (og:title or HTML title tag) |
| description | string | Meta description (description or og:description) |
| markdown | string | Clean Markdown output, up to 50,000 characters |
| text | string | Plain text with all HTML removed, up to 10,000 characters |
| word_count | integer | Number of words in the extracted plain text |
| content_type | string | Detected type: article, blog, documentation, or generic |
| depth | integer | Crawl depth (0 = start URL) |
| start_url | string | Start URL that led to this page |
| links_found | integer | New internal links discovered and added to crawl queue |
| status_code | integer | HTTP status code |
| scraped_at | string | ISO 8601 UTC timestamp |
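For RAG use, the records usually need filtering and chunking before embedding. The sketch below shows one possible way to do that with the `status_code`, `word_count`, `markdown`, `url`, and `title` fields from the table. The helper names, the 50-word threshold, and the 1,000-character chunk size are illustrative assumptions, not behavior of the actor.

```python
from typing import Iterable, Iterator


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated blocks into chunks of at most max_chars.

    A single block longer than max_chars is kept whole rather than cut.
    """
    chunks: list[str] = []
    current = ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks


def records_to_chunks(
    records: Iterable[dict], min_words: int = 50, max_chars: int = 1000
) -> Iterator[dict]:
    """Yield one dict per chunk, skipping failed or near-empty pages."""
    for record in records:
        if record.get("status_code") != 200:
            continue  # keep only pages that returned HTTP 200
        if record.get("word_count", 0) < min_words:
            continue  # hypothetical threshold for "too little content"
        pieces = chunk_markdown(record.get("markdown", ""), max_chars)
        for index, piece in enumerate(pieces):
            yield {
                "url": record["url"],
                "title": record.get("title", ""),
                "chunk_index": index,
                "text": piece,
            }
```

Keeping `url` and `chunk_index` on every chunk makes it possible to cite the source page when a chunk is retrieved later.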
Example Input
Fetch a list of documentation pages (no crawling):
```json
{
  "startUrls": [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication"
  ],
  "maxDepth": 0,
  "extractMainContent": true
}
```
Crawl an entire blog up to 2 levels deep:
```json
{
  "startUrls": ["https://blog.example.com"],
  "maxDepth": 2,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 10,
  "extractMainContent": true
}
```
Pricing
| Event | Price |
|---|---|
| Actor start | $0.25 (flat) |
| Per 1,000 pages crawled | $1.00 |
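As a worked example of the two events in the table, the sketch below estimates the charge for a run. It assumes the per-page charge is prorated (200 pages cost one fifth of the 1,000-page price) rather than rounded up to whole blocks of 1,000. The function name is hypothetical.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.25")     # flat fee per run, from the pricing table
PER_1000_PAGES_USD = Decimal("1.00")  # price per 1,000 pages crawled


def estimate_run_cost(pages_crawled: int) -> Decimal:
    """Estimate the charge in USD for one run, assuming prorated page pricing."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled must not be negative")
    return ACTOR_START_USD + PER_1000_PAGES_USD * pages_crawled / 1000
```

Under that assumption, a 200-page run comes to $0.25 + $0.20 = $0.45, and a 1,000-page run to $1.25.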
MCP Integration
Use this actor directly from Claude or any MCP-compatible AI tool:
Use apify/santamaria-automations/website-content-crawler to crawl https://docs.example.com with maxDepth=1 and extractMainContent=true, then summarize the documentation
Actor URL: apify/santamaria-automations/website-content-crawler
Notes
- Challenge pages (Cloudflare, DataDome, PerimeterX) are detected and skipped automatically
- Deduplication prevents the same URL from being crawled twice in the same run
- Content type detection identifies articles, blog posts, and documentation pages
- Main content extraction uses CSS selector priority: article-specific classes → semantic tags → body fallback
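The interaction between deduplication, `maxDepth`, and the two page limits can be illustrated with a breadth-first traversal over an in-memory link graph. This is a sketch of the documented semantics only; the actor's real implementation is not published here. `plan_crawl` and the `links` mapping are hypothetical, stripping URL fragments before deduplication is an assumption, and filtering to internal links is omitted for brevity.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


def plan_crawl(
    start_urls: list[str],
    links: dict[str, list[str]],
    max_depth: int = 0,
    max_pages_per_crawl: int = 100,
    max_pages_per_domain: int = 50,
) -> list[tuple[str, int]]:
    """Return (url, depth) pairs in breadth-first order, visiting each URL once."""
    seen: set[str] = set()            # deduplication: URLs already queued
    per_domain: dict[str, int] = {}   # pages processed per domain
    visited: list[tuple[str, int]] = []
    queue: deque[tuple[str, int]] = deque()

    for url in start_urls:
        url = urldefrag(url).url      # assumption: "#fragment" is ignored
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    while queue and len(visited) < max_pages_per_crawl:
        url, depth = queue.popleft()
        domain = urlsplit(url).netloc
        if per_domain.get(domain, 0) >= max_pages_per_domain:
            continue                  # per-domain budget exhausted
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append((url, depth))
        if depth >= max_depth:
            continue                  # do not follow links past max_depth
        for link in links.get(url, []):
            link = urldefrag(link).url
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With `max_depth=0` only the start URLs are returned, which matches the flat-fetch mode. Each extra level adds the pages linked from the previous level until one of the page limits is reached.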