Website to RAG Markdown Crawler

Pricing

from $0.50 / 1,000 results

Crawl any website or docs site and export clean Markdown plus JSONL-style chunks for RAG, LLM apps, and AI agents.

Rating

0.0

(0)

Developer

Ralph T

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 hours ago

Crawl any website, documentation site, blog, or sitemap and export clean Markdown plus JSONL-style chunks for RAG, LLM apps, AI agents, and vector database pipelines.

Quick start

Use a focused sitemap or docs URL first, keep maxPages low, inspect the RAG chunks view, then scale up:

{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 5,
  "maxDepth": 0,
  "expandSitemaps": true,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "includePageRecords": true
}

What it does

  • Starts from one or more web pages or sitemap.xml URLs.
  • Expands sitemap indexes and sitemap URL sets into crawlable page URLs.
  • Follows same-domain links up to a configurable depth.
  • Removes navigation/footer/script/style noise.
  • Converts HTML to clean Markdown.
  • Emits both full-page records and smaller RAG chunk records.
  • Adds estimated token counts for pages and chunks.
  • Includes source URL, title, description, timestamps, character counts, token estimates, and chunk metadata.
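
The chunking step in the list above can be pictured as a sliding window over each page's Markdown. The sketch below is only an illustration under stated assumptions, not the Actor's actual implementation: it assumes chunkSize and chunkOverlap are character counts and cuts at fixed offsets, whereas the Actor may well split on headings or paragraph boundaries. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one,
    # so consecutive windows share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

With the defaults used in the input examples (1200 and 150), each chunk repeats the last 150 characters of the previous one, which helps keep sentences that straddle a boundary retrievable from either side.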

Best for

  • Preparing documentation sites for RAG.
  • Building AI chatbot or AI support-bot knowledge bases.
  • Creating clean Markdown from help centers, blogs, changelogs, and product docs.
  • Turning competitor docs/blogs into structured internal research data.
  • Feeding LangChain, LlamaIndex, Supabase, Chroma, Pinecone, Qdrant, or custom vector pipelines.

Input example

{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 25,
  "maxDepth": 1,
  "expandSitemaps": true,
  "maxSitemapUrls": 5000,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "excludePatterns": ["/login", "/signup", "#"],
  "removeSelectors": ["nav", "footer", "script", "style", "noscript", "svg"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "sameDomainOnly": true,
  "includePageRecords": true
}
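
To check how includePatterns and excludePatterns would treat a given URL before starting a run, a small local test can help. This is a minimal sketch that assumes the patterns are regular expressions matched anywhere in the URL, that an empty include list allows everything, and that any exclude match wins. The Actor's exact matching rules are not documented here, and url_allowed is a hypothetical helper name.

```python
import re


def url_allowed(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Return True if the URL passes the include and exclude filters (assumed semantics)."""
    # Assumption: with include patterns present, at least one must match.
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    # Assumption: a single exclude match rejects the URL.
    return not any(re.search(p, url) for p in exclude_patterns)


# Patterns taken from the input example above (JSON escaping removed).
include = [r"^https://docs\.apify\.com/platform/actors"]
exclude = ["/login", "/signup", "#"]

for candidate in [
    "https://docs.apify.com/platform/actors/running",
    "https://docs.apify.com/platform/storage",
    "https://docs.apify.com/platform/actors#intro",
]:
    print(candidate, url_allowed(candidate, include, exclude))
```

Note that the doubled backslashes in the JSON input are JSON string escaping. The regular expression itself contains a single backslash before each dot.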

Output records

The Actor defines Apify dataset/output schemas so the Output tab has dedicated views for RAG chunks, full pages, and metadata. By default, the dataset contains two record types: page and chunk.

page

Full-page Markdown record:

{
  "recordType": "page",
  "url": "https://example.com/",
  "requestedUrl": "https://example.com/",
  "title": "Example Domain",
  "description": "",
  "source": "sitemap",
  "sitemapUrl": "https://example.com/sitemap.xml",
  "markdown": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "chunkCount": 1,
  "crawledAt": "2026-07-04T00:00:00.000Z"
}

chunk

RAG-ready chunk record:

{
  "recordType": "chunk",
  "url": "https://example.com/",
  "title": "Example Domain",
  "chunkIndex": 0,
  "chunkCount": 1,
  "text": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "metadata": {
    "source": "https://example.com/",
    "title": "Example Domain",
    "crawledAt": "2026-07-04T00:00:00.000Z",
    "sourceType": "sitemap",
    "sitemapUrl": "https://example.com/sitemap.xml"
  }
}

Sitemap support

If a start URL looks like a sitemap, for example https://example.com/sitemap.xml, the Actor extracts URLs from <loc> entries and crawls the matching pages. Sitemap indexes are followed recursively. Use includePatterns and excludePatterns to focus large sitemaps before crawling.
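
The sitemap handling described above can be sketched with the Python standard library. This is an illustration rather than the Actor's code: it assumes a standard sitemaps.org document whose root is either urlset (page URLs) or sitemapindex (nested sitemaps to follow), and extract_sitemap_locs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def extract_sitemap_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return the root element name and the <loc> URL of each url/sitemap entry."""
    root = ET.fromstring(xml_text)
    kind = _local_name(root.tag)  # "urlset" or "sitemapindex"
    locs: list[str] = []
    for entry in root:
        if _local_name(entry.tag) not in ("url", "sitemap"):
            continue
        # Only read the entry's direct <loc> child, so nested extension
        # elements (for example image entries) are not mistaken for pages.
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text and child.text.strip():
                locs.append(child.text.strip())
                break
    return kind, locs
```

When the first return value is sitemapindex, each returned URL is itself a sitemap and would be fetched and parsed the same way. When it is urlset, the URLs are candidate pages to filter with includePatterns and excludePatterns.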

Token counts

The Actor includes an estimatedTokenCount field for each page and chunk using a fast approx_chars_per_4 method. This is useful for budgeting embedding jobs and sizing RAG chunks. Treat it as an estimate rather than an exact model-specific tokenizer count.
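
As a rough picture of the approx_chars_per_4 method, the estimate is about one token per four characters. The sketch below assumes the result is rounded up, which is consistent with the sample records (167 characters, 42 estimated tokens) but is not confirmed by the Actor. The function name estimate_tokens is hypothetical.

```python
import math


def estimate_tokens(text: str) -> int:
    """Approximate token count as characters / 4, rounded up (assumed rounding)."""
    return math.ceil(len(text) / 4)


# Budgeting example: total estimated tokens across a few chunk texts.
chunk_texts = ["# Example Domain...", "Some longer chunk text " * 10]
total = sum(estimate_tokens(t) for t in chunk_texts)
print(f"estimated tokens to embed: {total}")
```

For exact counts, run the chunk text through the tokenizer of the embedding model you plan to use.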

Example workflow

  1. Enter a website URL or sitemap URL.
  2. Set maxPages and maxDepth to control crawl size.
  3. Use includePatterns / excludePatterns to keep the crawl focused.
  4. Run the Actor.
  5. Export chunk records as JSON/JSONL.
  6. Load those chunks into your vector database or RAG pipeline.
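
Steps 5 and 6 can also be done locally once the dataset items are downloaded as a list of records. The sketch below keeps only chunk records, relying on the recordType field shown in the output examples, and writes them as JSONL. The helper name write_chunks_jsonl and the output path are hypothetical, and how the records are fetched from Apify is left out.

```python
import json
from typing import Iterable


def write_chunks_jsonl(records: Iterable[dict], path: str) -> int:
    """Write records with recordType == "chunk" to a JSONL file; return how many."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            if record.get("recordType") != "chunk":
                continue  # skip full-page records and anything unexpected
            # One JSON object per line; keep non-ASCII text readable.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Each line of the resulting file holds one chunk with its text and metadata, a shape that most vector database loaders can ingest line by line.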

Notes

  • This Actor is optimized for regular HTML pages, blogs, documentation sites, and help centers.
  • JavaScript-heavy single-page apps may need a browser-based crawler variant.
  • Keep maxPages low for first runs, inspect output, then scale up.
  • Disable includePageRecords if you only want chunk records for embedding.