Website to RAG Markdown Crawler
Pricing
from $0.50 / 1,000 results
Developer
Ralph T
Maintained by Community
Crawl any website, documentation site, blog, or sitemap and export clean Markdown plus JSONL-style chunks for RAG, LLM apps, AI agents, and vector database pipelines.
Quick start
Use a focused sitemap or docs URL first, keep maxPages low, inspect the RAG chunks view, then scale up:
{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 5,
  "maxDepth": 0,
  "expandSitemaps": true,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "includePageRecords": true
}
What it does
- Starts from one or more web pages or sitemap.xml URLs.
- Expands sitemap indexes and sitemap URL sets into crawlable page URLs.
- Follows same-domain links up to a configurable depth.
- Removes navigation/footer/script/style noise.
- Converts HTML to clean Markdown.
- Emits both full-page records and smaller RAG chunk records.
- Adds estimated token counts for pages and chunks.
- Includes source URL, title, description, timestamps, character counts, token estimates, and chunk metadata.
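The noise-removal step in the list above can be illustrated with a minimal sketch. It is not the Actor's implementation: it assumes noise is identified purely by tag name (the tag set mirrors the removeSelectors input example), it uses only the Python standard library, and it extracts plain text rather than Markdown. The names NoiseStripper and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumed set of noisy tags, mirroring the removeSelectors input example.
NOISE_TAGS = {"nav", "footer", "script", "style", "noscript", "svg"}


class NoiseStripper(HTMLParser):
    """Collects visible text while skipping everything inside NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noisy element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every noisy element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text of `html` with noisy elements dropped, one block per line."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    sample = "<nav>Menu</nav><h1>Title</h1><p>Body text</p><footer>Footer</footer>"
    print(visible_text(sample))
```

A real converter would also map headings, lists, and links to Markdown syntax; this sketch only shows how boilerplate elements can be skipped before conversion.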
Best for
- Preparing documentation sites for RAG.
- Building AI chatbot or AI support-bot knowledge bases.
- Creating clean Markdown from help centers, blogs, changelogs, and product docs.
- Turning competitor docs/blogs into structured internal research data.
- Feeding LangChain, LlamaIndex, Supabase, Chroma, Pinecone, Qdrant, or custom vector pipelines.
Input example
{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 25,
  "maxDepth": 1,
  "expandSitemaps": true,
  "maxSitemapUrls": 5000,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "excludePatterns": ["/login", "/signup", "#"],
  "removeSelectors": ["nav", "footer", "script", "style", "noscript", "svg"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "sameDomainOnly": true,
  "includePageRecords": true
}
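To preview which URLs an includePatterns/excludePatterns pair would keep before spending a run on it, a local check like the following may help. The matching semantics here are an assumption, not confirmed by the Actor: each pattern is treated as a regular expression searched anywhere in the URL, a URL must match at least one include pattern (when any are given), and must match no exclude pattern. The function name url_allowed is hypothetical.

```python
import re


def url_allowed(url, include_patterns, exclude_patterns):
    """Assumed filter semantics: at least one include match (if any includes
    are given) and no exclude match. Patterns are regexes searched in the URL."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


if __name__ == "__main__":
    include = [r"^https://docs\.apify\.com/platform/actors"]
    exclude = ["/login", "/signup", "#"]
    for candidate in [
        "https://docs.apify.com/platform/actors/running",
        "https://docs.apify.com/platform/storage",
        "https://docs.apify.com/platform/actors#intro",
    ]:
        print(candidate, url_allowed(candidate, include, exclude))
```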
Output records
The Actor defines Apify dataset and output schemas, so the Output tab has dedicated views for RAG chunks, full pages, and metadata. By default, the dataset contains two record types: page and chunk.
page
Full-page Markdown record:
{
  "recordType": "page",
  "url": "https://example.com/",
  "requestedUrl": "https://example.com/",
  "title": "Example Domain",
  "description": "",
  "source": "sitemap",
  "sitemapUrl": "https://example.com/sitemap.xml",
  "markdown": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "chunkCount": 1,
  "crawledAt": "2026-07-04T00:00:00.000Z"
}
chunk
RAG-ready chunk record:
{
  "recordType": "chunk",
  "url": "https://example.com/",
  "title": "Example Domain",
  "chunkIndex": 0,
  "chunkCount": 1,
  "text": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "metadata": {
    "source": "https://example.com/",
    "title": "Example Domain",
    "crawledAt": "2026-07-04T00:00:00.000Z",
    "sourceType": "sitemap",
    "sitemapUrl": "https://example.com/sitemap.xml"
  }
}
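To get a feel for how chunkSize and chunkOverlap interact, here is a minimal sketch of a character-based sliding window that emits records shaped like the chunk example above. It assumes both settings are measured in characters and that chunks are cut at fixed offsets; the Actor may split on headings or paragraphs instead. The function names chunk_text and to_chunk_records are hypothetical.

```python
def chunk_text(text, chunk_size=1200, chunk_overlap=150):
    """Split text into windows of chunk_size characters, where each window
    starts chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def to_chunk_records(url, title, text, chunk_size=1200, chunk_overlap=150):
    """Wrap each chunk in a dict using a subset of the chunk record fields."""
    pieces = chunk_text(text, chunk_size, chunk_overlap)
    return [
        {
            "recordType": "chunk",
            "url": url,
            "title": title,
            "chunkIndex": index,
            "chunkCount": len(pieces),
            "text": piece,
            "charCount": len(piece),
        }
        for index, piece in enumerate(pieces)
    ]


if __name__ == "__main__":
    records = to_chunk_records("https://example.com/", "Example Domain", "x" * 2500)
    print([r["charCount"] for r in records])
```

With the settings from the input example (1200 and 150), consecutive windows share 150 characters, so a sentence cut at one boundary still appears whole in one of the two neighboring chunks when it is shorter than the overlap.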
Sitemap support
If a start URL looks like a sitemap, for example https://example.com/sitemap.xml, the Actor extracts URLs from <loc> entries and crawls the matching pages. Sitemap indexes are followed recursively. Use includePatterns and excludePatterns to focus large sitemaps before crawling.
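The sitemap expansion described above can be sketched roughly as follows. This is an illustration under assumptions, not the Actor's code: it uses the standard library XML parser, reads only the <loc> that is a direct child of each <url> or <sitemap> entry, and takes a caller-supplied fetch function so that no particular HTTP client is implied. The names parse_sitemap and collect_page_urls are hypothetical; max_urls plays the role of the maxSitemapUrls input.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs); kind is the root tag name, e.g. 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # each <url> or <sitemap> element
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return local_name(root.tag), locs


def collect_page_urls(sitemap_url, fetch, max_urls=5000, _seen=None):
    """Follow sitemap indexes recursively and return up to max_urls page URLs.
    `fetch` is any callable that maps a URL to the sitemap's XML text."""
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:
        return []  # guard against sitemap indexes that reference each other
    seen.add(sitemap_url)
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind != "sitemapindex":
        return locs[:max_urls]
    pages = []
    for child_url in locs:
        pages.extend(collect_page_urls(child_url, fetch, max_urls - len(pages), seen))
        if len(pages) >= max_urls:
            break
    return pages[:max_urls]
```

Passing fetch as a parameter keeps the sketch testable with an in-memory dictionary of sitemap documents and leaves the choice of HTTP client open.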
Token counts
The Actor includes an estimatedTokenCount field for each page and chunk using a fast approx_chars_per_4 method. This is useful for budgeting embedding jobs and sizing RAG chunks. Treat it as an estimate rather than an exact model-specific tokenizer count.
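As a rough illustration of a characters-per-4 estimate, the sketch below divides the character count by four and rounds up. The rounding direction is an assumption: rounding up is consistent with the sample records (167 characters, 42 estimated tokens), but rounding to the nearest integer would give the same result for that example. The function name estimate_tokens is hypothetical.

```python
import math


def estimate_tokens(text: str) -> int:
    """Approximate token count as characters / 4, rounded up (assumed rounding)."""
    return math.ceil(len(text) / 4)


if __name__ == "__main__":
    # A 167-character string, matching charCount in the sample records.
    print(estimate_tokens("x" * 167))
```

For budgeting, summing this estimate over all chunk records gives an approximate upper-level figure for an embedding job; a model-specific tokenizer is still needed for exact counts.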
Example workflow
- Enter a website URL or sitemap URL.
- Set maxPages and maxDepth to control crawl size.
- Use includePatterns/excludePatterns to keep the crawl focused.
- Run the Actor.
- Export chunk records as JSON/JSONL.
- Load those chunks into your vector database or RAG pipeline.
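The last two steps of the workflow above can be sketched as follows, assuming the dataset was exported as JSONL with one record per line. The output shape (id, text, metadata) is a generic, hypothetical document format rather than the API of any specific vector database, and the names load_chunks and to_documents are also hypothetical.

```python
import json


def load_chunks(jsonl_lines):
    """Yield only the chunk records from an iterable of JSONL lines."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        if record.get("recordType") == "chunk":
            yield record


def to_documents(chunks):
    """Map chunk records to a generic {id, text, metadata} shape for upserting."""
    return [
        {
            "id": f'{chunk["url"]}#chunk-{chunk["chunkIndex"]}',
            "text": chunk["text"],
            "metadata": chunk.get("metadata", {}),
        }
        for chunk in chunks
    ]


if __name__ == "__main__":
    # "export.jsonl" is a placeholder path for a downloaded dataset export.
    with open("export.jsonl", encoding="utf-8") as handle:
        documents = to_documents(load_chunks(handle))
    print(len(documents), "documents ready for embedding")
```

Filtering on recordType keeps the loader usable even when includePageRecords is left enabled and page records are mixed into the same export.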
Notes
- This Actor is optimized for regular HTML pages, blogs, documentation sites, and help centers.
- JavaScript-heavy single-page apps may need a browser-based crawler variant.
- Keep maxPages low for first runs, inspect output, then scale up.
- Disable includePageRecords if you only want chunk records for embedding.