Pricing

from $0.50 / 1,000 results

Website to RAG Markdown Crawler

Crawl any website or docs site and export clean Markdown plus JSONL-style chunks for RAG, LLM apps, and AI agents.

Developer

Ralph T

Last modified

19 days ago

Quick start

Use a focused sitemap or docs URL first, keep maxPages low, inspect the RAG chunks view, then scale up:

{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 5,
  "maxDepth": 0,
  "expandSitemaps": true,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "includePageRecords": true
}
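The same first run could also be started from Python. The sketch below assumes the `apify-client` package is installed and that `APIFY_TOKEN` and `ACTOR_ID` are set in the environment; both variable names are placeholders chosen for this example, and the Actor ID has to be copied from the Actor's own Store page because it is not stated here. The helper names `build_run_input` and `run_actor` are hypothetical.

```python
import json
import os


def build_run_input(sitemap_url, include_pattern, max_pages=5):
    """Build a small first-run input using the field names from the quick start."""
    return {
        "startUrls": [{"url": sitemap_url}],
        "maxPages": max_pages,
        "maxDepth": 0,
        "expandSitemaps": True,
        "includePatterns": [include_pattern],
        "chunkSize": 1200,
        "chunkOverlap": 150,
        "includePageRecords": True,
    }


def run_actor(token, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    run_input = build_run_input(
        "https://docs.apify.com/sitemap.xml",
        r"^https://docs\.apify\.com/platform/actors",
    )
    token = os.environ.get("APIFY_TOKEN")
    actor_id = os.environ.get("ACTOR_ID")  # placeholder: not given in this README
    if token and actor_id:
        items = run_actor(token, actor_id, run_input)
        print(len(items), "records")
    else:
        # Without credentials, just show the input that would be sent.
        print(json.dumps(run_input, indent=2))
```

Keeping `maxPages` as a parameter makes it easy to rerun with a larger value once the first few chunks look right.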

What it does

Starts from one or more web pages or sitemap.xml URLs.
Expands sitemap indexes and sitemap URL sets into crawlable page URLs.
Follows same-domain links up to a configurable depth.
Removes navigation/footer/script/style noise.
Converts HTML to clean Markdown.
Emits both full-page records and smaller RAG chunk records.
Adds estimated token counts for pages and chunks.
Includes source URL, title, description, timestamps, character counts, token estimates, and chunk metadata.

Best for

Preparing documentation sites for RAG.
Building AI chatbot or AI support-bot knowledge bases.
Creating clean Markdown from help centers, blogs, changelogs, and product docs.
Turning competitor docs/blogs into structured internal research data.
Feeding LangChain, LlamaIndex, Supabase, Chroma, Pinecone, Qdrant, or custom vector pipelines.

Input example

{
  "startUrls": [{ "url": "https://docs.apify.com/sitemap.xml" }],
  "maxPages": 25,
  "maxDepth": 1,
  "expandSitemaps": true,
  "maxSitemapUrls": 5000,
  "includePatterns": ["^https://docs\\.apify\\.com/platform/actors"],
  "excludePatterns": ["/login", "/signup", "#"],
  "removeSelectors": ["nav", "footer", "script", "style", "noscript", "svg"],
  "chunkSize": 1200,
  "chunkOverlap": 150,
  "sameDomainOnly": true,
  "includePageRecords": true
}

Output records

The Actor defines Apify dataset/output schemas so the Output tab has dedicated views for RAG chunks, full pages, and metadata. By default, the dataset contains two record types, `page` and `chunk`; disabling includePageRecords leaves only chunk records.

`page`

Full-page Markdown record:

{
  "recordType": "page",
  "url": "https://example.com/",
  "requestedUrl": "https://example.com/",
  "title": "Example Domain",
  "description": "",
  "source": "sitemap",
  "sitemapUrl": "https://example.com/sitemap.xml",
  "markdown": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "chunkCount": 1,
  "crawledAt": "2026-07-04T00:00:00.000Z"
}

`chunk`

RAG-ready chunk record:

{
  "recordType": "chunk",
  "url": "https://example.com/",
  "title": "Example Domain",
  "chunkIndex": 0,
  "chunkCount": 1,
  "text": "# Example Domain...",
  "charCount": 167,
  "estimatedTokenCount": 42,
  "tokenCountMethod": "approx_chars_per_4",
  "metadata": {
    "source": "https://example.com/",
    "title": "Example Domain",
    "crawledAt": "2026-07-04T00:00:00.000Z",
    "sourceType": "sitemap",
    "sitemapUrl": "https://example.com/sitemap.xml"
  }
}
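To illustrate how chunkSize, chunkOverlap, chunkIndex, and chunkCount relate, here is a minimal sliding-window sketch. It is not the Actor's actual chunker, which is not described here and may well split on headings or paragraphs. It assumes both settings are measured in characters, and the function names `chunk_text` and `to_chunk_records` are hypothetical.

```python
import math


def chunk_text(text, chunk_size=1200, chunk_overlap=150):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces


def to_chunk_records(url, title, markdown, chunk_size=1200, chunk_overlap=150):
    """Wrap each window in a record shaped like the chunk example above."""
    pieces = chunk_text(markdown, chunk_size, chunk_overlap)
    return [
        {
            "recordType": "chunk",
            "url": url,
            "title": title,
            "chunkIndex": index,
            "chunkCount": len(pieces),
            "text": piece,
            "charCount": len(piece),
            # Assumed rounding for the approx_chars_per_4 estimate.
            "estimatedTokenCount": math.ceil(len(piece) / 4),
        }
        for index, piece in enumerate(pieces)
    ]
```

With these defaults each new window starts 1,050 characters after the previous one, so consecutive chunks share 150 characters of context.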

Sitemap support

If a start URL looks like a sitemap, for example https://example.com/sitemap.xml, the Actor extracts URLs from <loc> entries and crawls the matching pages. Sitemap indexes are followed recursively. Use includePatterns and excludePatterns to focus large sitemaps before crawling.
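The sitemap handling described above can be approximated with the Python standard library. This is a sketch under assumptions, not the Actor's code: it treats includePatterns and excludePatterns as regular expressions matched anywhere in the URL, and the helper names (`extract_locs`, `expand_sitemap`, `filter_urls`) and the injected `fetch` callable are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace, e.g. '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def extract_locs(sitemap_xml):
    """Return (kind, urls), where kind is 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root:            # <url> or <sitemap> elements
        for child in entry:       # direct children only, so nested image locs are skipped
            if _local(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return _local(root.tag), urls


def expand_sitemap(url, fetch, max_urls=5000, _seen=None):
    """Recursively expand a sitemap or sitemap index into page URLs.

    fetch is any callable that maps a URL to its XML text.
    """
    seen = _seen if _seen is not None else set()
    if url in seen:
        return []
    seen.add(url)
    kind, locs = extract_locs(fetch(url))
    if kind != "sitemapindex":
        return locs[:max_urls]
    pages = []
    for child in locs:
        if len(pages) >= max_urls:
            break
        pages.extend(expand_sitemap(child, fetch, max_urls - len(pages), seen))
    return pages


def filter_urls(urls, include_patterns=(), exclude_patterns=()):
    """Keep URLs matching any include pattern and no exclude pattern."""
    includes = [re.compile(p) for p in include_patterns]
    excludes = [re.compile(p) for p in exclude_patterns]
    kept = []
    for url in urls:
        if includes and not any(p.search(url) for p in includes):
            continue
        if any(p.search(url) for p in excludes):
            continue
        kept.append(url)
    return kept


SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.apify.com/platform/actors</loc></url>
  <url><loc>https://docs.apify.com/platform/actors/running</loc></url>
  <url><loc>https://docs.apify.com/platform/storage</loc></url>
  <url><loc>https://docs.apify.com/login</loc></url>
</urlset>"""
```

Passing `fetch` in as a parameter keeps the parsing logic testable offline; in a real script it could be a small wrapper around an HTTP client.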

Token counts

The Actor adds an estimatedTokenCount field to each page and chunk, computed with a fast approx_chars_per_4 method: roughly one token per four characters, so the 167-character sample records above get an estimate of 42 tokens. This is useful for budgeting embedding jobs and sizing RAG chunks. Treat it as an estimate rather than an exact model-specific tokenizer count.
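A chars-per-4 estimate is simple enough to reproduce locally, for example to budget an embedding job before running it. The sketch below assumes the result is rounded up, which matches the sample records (167 characters, 42 tokens) but is not confirmed by the Actor. `estimate_tokens` and `embedding_budget` are hypothetical helper names.

```python
import math


def estimate_tokens(text):
    """Approximate token count as one token per four characters, rounded up."""
    return math.ceil(len(text) / 4)


def embedding_budget(records):
    """Sum the estimated tokens of all chunk records in a dataset export."""
    return sum(
        record.get("estimatedTokenCount", estimate_tokens(record.get("text", "")))
        for record in records
        if record.get("recordType") == "chunk"
    )
```

Only chunk records are summed, on the assumption that page records would not be embedded as well.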

Example workflow

Enter a website URL or sitemap URL.
Set maxPages and maxDepth to control crawl size.
Use includePatterns / excludePatterns to keep the crawl focused.
Run the Actor.
Export chunk records as JSON/JSONL.
Load those chunks into your vector database or RAG pipeline.
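The last two steps can be sketched as a small conversion from dataset items to JSONL rows ready for an embedding pipeline. The input shape follows the chunk record shown earlier. The row layout (`id`, `text`, `metadata`) and the `url#chunk-N` ID scheme are assumptions made for this example, not something the Actor emits.

```python
import json


def chunks_to_jsonl(items):
    """Serialize chunk records as JSONL, skipping page records."""
    lines = []
    for item in items:
        if item.get("recordType") != "chunk":
            continue
        row = {
            # Hypothetical stable ID: source URL plus chunk position.
            "id": f'{item["url"]}#chunk-{item["chunkIndex"]}',
            "text": item["text"],
            "metadata": item.get("metadata", {}),
        }
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")
```

The returned string can be written to a `.jsonl` file with UTF-8 encoding, or split into lines and passed to whichever vector-store client the pipeline uses.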

Notes

This Actor is optimized for regular HTML pages, blogs, documentation sites, and help centers.
JavaScript-heavy single-page apps may need a browser-based crawler variant.
Keep maxPages low for first runs, inspect output, then scale up.
Disable includePageRecords if you only want chunk records for embedding.
