Website Content Crawler — AI & RAG Ready

Crawl any website and extract clean Markdown and plain text optimized for AI ingestion, RAG pipelines, and LLM context. Readability-style main content extraction removes ads, navs, and footers. Configurable depth, concurrency, and page limits. Pay-per-page.

Pricing

Pay per event

Developer

NanoScrape

Last modified

a month ago

Why This Actor?

AI-optimized output: Markdown and plain text per page, with content type detection
Main content extraction: readability-style selectors remove noise (nav, footer, ads, sidebars)
Flexible crawl modes: fetch a list of URLs directly (`maxDepth=0`) or crawl entire sites (`maxDepth` 1 to 5)
Concurrent processing: up to 20 parallel workers for high-throughput extraction
Pay-per-page pricing: per-page charges apply only to pages successfully crawled, on top of the flat Actor start fee

Use Cases

Build RAG knowledge bases from company documentation sites
Feed LLMs with up-to-date content from blog posts and news articles
Extract article text for AI summarization pipelines
Crawl competitor sites for content analysis
Bulk-convert web pages to Markdown for offline use

Input

Parameter	Type	Default	Description
`startUrls`	array	required	URLs to crawl. Use `maxDepth=0` for flat fetch, `maxDepth>0` to follow links
`maxDepth`	integer	`0`	Crawl depth. 0 = start pages only, 1 = start pages + their links, 2 = two levels, etc.
`maxPagesPerCrawl`	integer	`100`	Maximum total pages to process across all start URLs
`maxPagesPerDomain`	integer	`50`	Maximum pages per unique domain
`maxConcurrency`	integer	`5`	Number of parallel workers (1–20)
`extractMainContent`	boolean	`true`	Strip nav/footer/ads using readability-style selectors
`proxyConfiguration`	object	Apify proxy	Proxy settings
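
The way `maxDepth`, `maxPagesPerCrawl`, and `maxPagesPerDomain` interact can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the `plan_crawl` function and the in-memory `links` map are hypothetical stand-ins for real fetching, and the Actor's actual traversal order and link filtering are not documented here.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, links, max_depth=0, max_pages=100, max_pages_per_domain=50):
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them.

    `links` maps a URL to the URLs found on that page (hypothetical test data).
    """
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:  # deduplicate start URLs as well
            seen.add(url)
            queue.append((url, 0))

    per_domain = {}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        domain = urlparse(url).netloc
        if per_domain.get(domain, 0) >= max_pages_per_domain:
            continue  # per-domain budget used up
        per_domain[domain] = per_domain.get(domain, 0) + 1
        visited.append((url, depth))

        if depth >= max_depth:
            continue  # maxDepth=0 means start pages only
        for nxt in links.get(url, []):
            if nxt not in seen:  # never queue the same URL twice
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

With `max_depth=0` only the start URLs are returned; each increment adds one more level of discovered links until a page limit is reached.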

Output

One record per crawled page:

Field	Type	Description
`url`	string	URL of the crawled page
`title`	string	Page title (og:title or HTML title tag)
`description`	string	Meta description (description or og:description)
`markdown`	string	Clean Markdown output, up to 50,000 characters
`text`	string	Plain text with all HTML removed, up to 10,000 characters
`word_count`	integer	Number of words in the extracted plain text
`content_type`	string	Detected type: `article`, `blog`, `documentation`, or `generic`
`depth`	integer	Crawl depth (0 = start URL)
`start_url`	string	Start URL that led to this page
`links_found`	integer	Number of new internal links discovered on this page and added to the crawl queue
`status_code`	integer	HTTP status code
`scraped_at`	string	ISO 8601 UTC timestamp
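
As a rough sketch of how these records might be consumed downstream, the snippet below filters for usable pages and flags records that may have hit the documented length caps. The `PageRecord` type, the helper names, and the `min_words` threshold are illustrative assumptions, not part of the Actor.

```python
from typing import TypedDict

MARKDOWN_LIMIT = 50_000  # documented cap on the `markdown` field
TEXT_LIMIT = 10_000      # documented cap on the `text` field


class PageRecord(TypedDict):
    url: str
    title: str
    description: str
    markdown: str
    text: str
    word_count: int
    content_type: str
    depth: int
    start_url: str
    links_found: int
    status_code: int
    scraped_at: str


def usable_records(records, min_words=50):
    """Keep pages that returned HTTP 200 and contain at least `min_words` words.

    `min_words` is an arbitrary example threshold.
    """
    return [
        r for r in records
        if r["status_code"] == 200 and r["word_count"] >= min_words
    ]


def possibly_truncated(record):
    """True when a field has reached its cap, so content may have been cut off."""
    return (
        len(record["markdown"]) >= MARKDOWN_LIMIT
        or len(record["text"]) >= TEXT_LIMIT
    )
```

Because `text` is capped at a lower length than `markdown`, a long page can be truncated in one field and complete in the other, so it is worth checking both.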

Example Input

Fetch a list of documentation pages (no crawling):

{
  "startUrls": [
    "https://docs.example.com/api/overview",
    "https://docs.example.com/api/authentication"
  ],
  "maxDepth": 0,
  "extractMainContent": true
}

Crawl an entire blog up to 2 levels deep:

{
  "startUrls": ["https://blog.example.com"],
  "maxDepth": 2,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 10,
  "extractMainContent": true
}
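
The same inputs can be assembled and submitted programmatically. The sketch below assumes the `apify-client` Python package and assumes the Actor ID is `santamaria-automations/website-content-crawler` (taken from the Actor URL in the MCP Integration section with the `apify/` prefix dropped, which may not be correct). The helper names are hypothetical.

```python
ACTOR_ID = "santamaria-automations/website-content-crawler"  # assumption, see above


def build_run_input(start_urls, max_depth=0, max_pages_per_crawl=100,
                    max_pages_per_domain=50, max_concurrency=5,
                    extract_main_content=True):
    """Build an input object using the parameter names and defaults from the Input table."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    return {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "maxPagesPerCrawl": max_pages_per_crawl,
        "maxPagesPerDomain": max_pages_per_domain,
        "maxConcurrency": max_concurrency,
        "extractMainContent": extract_main_content,
    }


def run_crawler(token, run_input, actor_id=ACTOR_ID):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not complete")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `build_run_input(["https://blog.example.com"], max_depth=2, max_pages_per_crawl=200, max_concurrency=10)` reproduces the blog crawl input above, with `maxPagesPerDomain` left at its default.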

Pricing

Event	Price
Actor start	$0.25 (flat)
Per 1,000 pages crawled	$1.00
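
A quick way to estimate the cost of a run from the table above. This sketch assumes pages are billed linearly at $1.00 per 1,000 (that is, $0.001 per page) rather than in whole blocks of 1,000; the function name is illustrative.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.25")       # flat fee per run, in USD
PRICE_PER_1000_PAGES = Decimal("1.00")  # USD per 1,000 pages crawled


def estimate_run_cost(pages_crawled):
    """Estimated cost in USD for one run, assuming linear per-page billing."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled must be non-negative")
    return ACTOR_START_FEE + PRICE_PER_1000_PAGES * pages_crawled / 1000
```

Under that assumption, a 200-page crawl comes to 0.25 + 200 × 0.001 = 0.45 USD, so the start fee dominates the cost of small runs.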

MCP Integration

Use this actor directly from Claude or any MCP-compatible AI tool:

Use apify/santamaria-automations/website-content-crawler to crawl https://docs.example.com with maxDepth=1 and extractMainContent=true, then summarize the documentation

Actor URL: apify/santamaria-automations/website-content-crawler

Notes

Challenge pages (Cloudflare, DataDome, PerimeterX) are detected and skipped automatically
Deduplication prevents the same URL from being crawled twice in the same run
Content type detection identifies articles, blog posts, and documentation pages
Main content extraction uses CSS selector priority: article-specific classes → semantic tags → body fallback
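
The selector priority in the last note can be illustrated with a small standard-library sketch. The class names and tag sets below are hypothetical examples, since the Actor's actual selectors are not listed, and this simplified parser does not skip script or style content.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
ARTICLE_CLASSES = {"article-content", "post-content", "entry-content"}  # hypothetical
SEMANTIC_TAGS = {"article", "main"}                                     # hypothetical
PRIORITY = ("article_class", "semantic", "body")  # highest priority first


class _TierCollector(HTMLParser):
    """Collects text separately for each priority tier it is nested inside."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, tier or None)
        self.text = {tier: [] for tier in PRIORITY}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        classes = set((dict(attrs).get("class") or "").split())
        if classes & ARTICLE_CLASSES:
            tier = "article_class"
        elif tag in SEMANTIC_TAGS:
            tier = "semantic"
        elif tag == "body":
            tier = "body"
        else:
            tier = None
        self.stack.append((tag, tier))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            for tier in {t for _, t in self.stack if t}:
                self.text[tier].append(data.strip())


def extract_main_text(html):
    """Return (tier, text) from the highest-priority tier that contains text."""
    collector = _TierCollector()
    collector.feed(html)
    collector.close()
    for tier in PRIORITY:
        if collector.text[tier]:
            return tier, " ".join(collector.text[tier])
    return "body", ""
```

Navigation and footer text only ever lands in the `body` tier, so it is dropped whenever a higher-priority container is present on the page.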
