Website Content Crawler — AI & RAG Ready

Pricing

Pay per event

Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navs, and footers. Configurable depth, concurrency, and page limits. Pay-per-page.

Developer

Ale

Maintained by Community

Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main content extraction strips navigation, footers, sidebars, and ads so your AI gets only the content that matters.

Why This Actor?

  • AI-optimized output — Markdown + plain text per page, with content type detection
  • Main content extraction — Readability-style selectors remove noise (nav, footer, ads, sidebars)
  • Flexible crawl modes — Fetch a list of URLs directly (maxDepth=0) or crawl entire sites (maxDepth=1-5)
  • Concurrent processing — Up to 20 parallel workers for high-throughput extraction
  • Pay-per-page pricing — Only pay for pages successfully crawled

Use Cases

  • Build RAG knowledge bases from company documentation sites
  • Feed LLMs with up-to-date content from blog posts and news articles
  • Extract article text for AI summarization pipelines
  • Crawl competitor sites for content analysis
  • Bulk-convert web pages to Markdown for offline use

Input

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to crawl. Use maxDepth=0 for flat fetch, maxDepth>0 to follow links |
| maxDepth | integer | 0 | Crawl depth. 0 = start pages only, 1 = start pages + their links, 2 = two levels, etc. |
| maxPagesPerCrawl | integer | 100 | Maximum total pages to process across all start URLs |
| maxPagesPerDomain | integer | 50 | Maximum pages per unique domain |
| maxConcurrency | integer | 5 | Number of parallel workers (1–20) |
| extractMainContent | boolean | true | Strip nav/footer/ads using readability-style selectors |
| proxyConfiguration | object | Apify proxy | Proxy settings |
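
To make the interaction between maxDepth, maxPagesPerCrawl, and maxPagesPerDomain concrete, here is a minimal Python sketch of a breadth-first crawl plan over an in-memory link graph. It is an illustration under assumptions, not the actor's implementation: the function name `plan_crawl`, the `links` dictionary, and the order in which the limits are checked are all hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start_urls, links, max_depth=0,
               max_pages_per_crawl=100, max_pages_per_domain=50):
    """Return (url, depth) pairs in breadth-first visiting order.

    `links` is a hypothetical in-memory link graph: {url: [linked urls]}.
    The limit-checking order below is an assumption for illustration.
    """
    unique_starts = list(dict.fromkeys(start_urls))  # drop duplicate start URLs
    seen = set(unique_starts)
    queue = deque((url, 0) for url in unique_starts)
    per_domain = {}
    visited = []

    while queue and len(visited) < max_pages_per_crawl:
        url, depth = queue.popleft()
        domain = urlsplit(url).netloc
        if per_domain.get(domain, 0) >= max_pages_per_domain:
            continue  # this domain's page budget is used up
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append((url, depth))

        # Only follow links while we are above the depth limit.
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return visited
```

With max_depth=0 the plan contains only the start URLs, which matches the "flat fetch" mode described in the table; raising max_depth adds one level of linked pages per step until one of the page limits is reached.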

Output

One record per crawled page:

| Field | Type | Description |
|---|---|---|
| url | string | URL of the crawled page |
| title | string | Page title (og:title or HTML title tag) |
| description | string | Meta description (description or og:description) |
| markdown | string | Clean Markdown output, up to 50,000 characters |
| text | string | Plain text with all HTML removed, up to 10,000 characters |
| word_count | integer | Number of words in the extracted plain text |
| content_type | string | Detected type: article, blog, documentation, or generic |
| depth | integer | Crawl depth (0 = start URL) |
| start_url | string | Start URL that led to this page |
| links_found | integer | New internal links discovered and added to crawl queue |
| status_code | integer | HTTP status code |
| scraped_at | string | ISO 8601 UTC timestamp |
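
For RAG ingestion, each record's markdown field usually needs to be split into smaller chunks before embedding. The following Python sketch shows one way to do that while carrying url and title along as metadata. The function name `chunk_record`, the `max_chars` default, and the paragraph-based splitting strategy are assumptions for illustration and are not part of the actor.

```python
def chunk_record(record, max_chars=2000):
    """Split a record's Markdown into paragraph-aligned chunks with metadata.

    A single paragraph longer than `max_chars` is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in record.get("markdown", "").split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)

    return [
        {
            "text": chunk,
            "url": record.get("url"),
            "title": record.get("title"),
            "chunk_index": index,
        }
        for index, chunk in enumerate(chunks)
    ]
```

Splitting on blank lines keeps headings and their paragraphs intact more often than a fixed character window would, at the cost of uneven chunk sizes.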

Example Input

Fetch a list of documentation pages (no crawling):

{
  "startUrls": [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication"
  ],
  "maxDepth": 0,
  "extractMainContent": true
}

Crawl an entire blog up to 2 levels deep:

{
  "startUrls": ["https://blog.example.com"],
  "maxDepth": 2,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 10,
  "extractMainContent": true
}
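
The same inputs can be built and submitted from Python. The sketch below assumes the `apify-client` package is installed and that the actor ID is `santamaria-automations/website-content-crawler` (inferred from the path listed on this page, so verify it before use). The helper names `build_run_input` and `run_crawler` are hypothetical.

```python
# Assumption: actor ID inferred from the path shown on this page.
ACTOR_ID = "santamaria-automations/website-content-crawler"


def build_run_input(start_urls, max_depth=0, max_pages_per_crawl=100,
                    max_concurrency=5, extract_main_content=True):
    """Build an input object using the parameter names from the Input table."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    return {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "maxPagesPerCrawl": max_pages_per_crawl,
        "maxConcurrency": max_concurrency,
        "extractMainContent": extract_main_content,
    }


def run_crawler(token, run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported here so the input helper works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_crawler(token, build_run_input(["https://blog.example.com"], max_depth=2, max_pages_per_crawl=200, max_concurrency=10))` would correspond to the blog example above and return one dictionary per crawled page.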

Pricing

| Event | Price |
|---|---|
| Actor start | $0.25 (flat) |
| Per 1,000 pages crawled | $1.00 |
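
As a rough budgeting aid, the two price points above can be combined into a per-run estimate. This Python sketch assumes the per-page charge is prorated linearly ($0.001 per page) and that platform or proxy costs, if any, are billed separately. The function name `estimate_cost` is hypothetical.

```python
def estimate_cost(pages, start_fee=0.25, price_per_1000=1.00):
    """Estimate the cost in USD of one run that crawls `pages` pages.

    Assumes linear proration of the per-1,000-pages price.
    """
    if pages < 0:
        raise ValueError("pages must be 0 or greater")
    return round(start_fee + pages * price_per_1000 / 1000, 4)


print(estimate_cost(200))   # 0.45
print(estimate_cost(1000))  # 1.25
```

Because of the flat start fee, many small runs cost more than one large run covering the same pages under these assumptions.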

MCP Integration

Use this actor directly from Claude or any MCP-compatible AI tool:

Use apify/santamaria-automations/website-content-crawler to crawl https://docs.example.com with maxDepth=1 and extractMainContent=true, then summarize the documentation

Actor URL: apify/santamaria-automations/website-content-crawler

Notes

  • Challenge pages (Cloudflare, DataDome, PerimeterX) are detected and skipped automatically
  • Deduplication prevents the same URL from being crawled twice in the same run
  • Content type detection identifies articles, blog posts, and documentation pages
  • Main content extraction uses CSS selector priority: article-specific classes → semantic tags → body fallback
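
The deduplication note above can be illustrated with a small Python sketch. How the actor actually normalizes URLs is not documented here, so the rules below (lowercased host, dropped fragment, trailing slash removed) are assumptions, and the names `normalize_url` and `SeenUrls` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a comparison key (assumed rules, for illustration)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment, lowercase scheme and host.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenUrls:
    """Track URLs already queued during one run."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Return True if the URL is new, False if it was seen before."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With these rules, `https://Docs.Example.com/guide/#intro` and `https://docs.example.com/guide` map to the same key, so only the first one would be crawled.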