Docs-to-RAG Crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) and output chunked markdown ready for RAG/vector DB ingestion. Splits by heading hierarchy, strips nav/sidebar chrome.

Crawl any documentation website and export chunked markdown optimized for RAG pipelines, vector databases, and AI knowledge bases. Supports ReadTheDocs, GitBook, Docusaurus, Mintlify, and any generic HTML documentation site.

No API key required. No browser overhead. Pure HTTP crawling for maximum speed.

What does Docs-to-RAG Crawler do?

Docs-to-RAG Crawler visits every page of a documentation site, strips navigation/sidebar/footer chrome, extracts the core content, and splits it into heading-bounded chunks ready for vector DB ingestion.

Each chunk includes: a stable chunkId (hash of URL + heading), the heading hierarchy as a breadcrumb (e.g., "Getting Started > Installation > Requirements"), clean markdown content, word count, and metadata (site name, crawl timestamp, detected platform).
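
For illustration, an ID with those properties can be derived by hashing the URL together with the heading. The snippet below is only a sketch of the idea in Python; the actor's exact scheme isn't documented beyond "SHA-1 based on URL + heading", and the slug format here is an assumption.

import hashlib

def make_chunk_id(url: str, heading: str) -> str:
    # A readable slug from the heading plus a short SHA-1 of URL + heading:
    # the same page/section always produces the same ID across re-crawls.
    slug = heading.lower().replace(" ", "-")
    digest = hashlib.sha1(f"{url}#{heading}".encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}"

print(make_chunk_id("https://docs.example.com/getting-started/install",
                    "Installation Requirements"))
# -> "installation-requirements-" followed by 8 hex characters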

What makes it different from a generic web crawler:

  • 🧠 Understands documentation structure – splits on H2/H3 boundaries, not arbitrary character limits (see the sketch after this list)
  • 🏗️ Detects docs platforms automatically (Docusaurus, ReadTheDocs, GitBook, Mintlify) and applies platform-specific noise removal
  • 🔗 Follows sidebar/nav links intelligently – finds the full documentation tree, not just pages linked from the start URL
  • ⚡ Pure HTTP (Cheerio) – 10-50x faster than browser-based crawlers, minimal memory usage
  • 📦 Outputs stable chunk IDs – safe to re-crawl and upsert into vector DBs without duplicates
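
To make the first point concrete, heading-bounded splitting means cutting a page at every H2/H3 line so each chunk starts with its own heading. The snippet below illustrates the general idea only; it is not the actor's actual implementation.

import re

def split_on_headings(markdown_text: str) -> list[str]:
    # Cut at the start of every H2/H3 line; each chunk keeps its own heading.
    pieces = re.split(r"(?m)^(?=#{2,3} )", markdown_text)
    return [piece.strip() for piece in pieces if piece.strip()]

page = (
    "# Quick Start\n\nIntro paragraph.\n\n"
    "## Installation\n\nRun the installer.\n\n"
    "### Requirements\n\nNode 18 or newer.\n"
)
for chunk in split_on_headings(page):
    print(chunk.splitlines()[0])
# -> "# Quick Start", "## Installation", "### Requirements"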

Who is Docs-to-RAG Crawler for?

AI engineers building RAG applications

  • Index your product's own documentation into a vector DB for customer-facing chatbots
  • Build a "chat with any docs" feature by crawling third-party docs at startup
  • Keep your knowledge base fresh by scheduling weekly re-crawls

Developers building internal tools

  • Turn scattered internal wikis into a searchable knowledge base
  • Feed company documentation into LLM-powered search
  • Build code assistants that understand your specific framework's docs

Data teams and researchers

  • Benchmark embedding models across real documentation corpora
  • Build multi-doc retrieval datasets for fine-tuning
  • Compare documentation quality across competing libraries

DevOps and platform teams

  • Automate documentation ingestion into Confluence, Notion, or Slack bots
  • Trigger re-ingestion automatically on new releases via webhooks
  • Monitor documentation coverage gaps by tracking chunk counts per section

Why use Docs-to-RAG Crawler?

  • ✅ Works with modern docs sites – Docusaurus v2/v3, ReadTheDocs, GitBook, Mintlify, plain HTML
  • ✅ Heading-aware chunking – splits at semantic H2/H3 boundaries, not mid-paragraph
  • ✅ Stable chunk IDs – SHA-1 based on URL + heading, safe for upserts on re-crawl
  • ✅ Platform-specific noise removal – strips breadcrumbs, version badges, "On this page" widgets, edit buttons
  • ✅ Code block control – keep or strip code blocks depending on your embedding model's needs
  • ✅ Fast – pure HTTP, no browser, processes 400+ pages/minute
  • ✅ No proxy required – documentation sites are public and rarely block scrapers
  • ✅ Exclude patterns – skip changelog, blog, or release notes pages with glob patterns
  • ✅ Scheduling support – set up weekly re-crawls to keep your knowledge base fresh

What data can you extract?

Each output chunk contains:

Field               Type     Example
chunkId             string   "installation-requirements-a3f8b1c2"
title               string   "Installation Requirements"
url                 string   "https://docs.example.com/getting-started/install"
headingHierarchy    string   "Getting Started > Installation > Requirements"
content             string   Full markdown content of this chunk
wordCount           number   142
metadata.siteName   string   "My Project Docs"
metadata.scrapedAt  string   "2026-04-05T10:00:00.000Z"
metadata.platform   string   "docusaurus"

Example output chunk:

{
  "chunkId": "authentication-api-keys-d4e9f1a3",
  "title": "API Keys",
  "url": "https://docs.example.com/authentication",
  "headingHierarchy": "Authentication > API Keys",
  "content": "## API Keys\n\nTo authenticate with the API, pass your API key in the `Authorization` header:\n\n```\ncurl -H 'Authorization: Bearer YOUR_KEY' https://api.example.com/data\n```\n\nAPI keys are scoped to your account and can be revoked at any time from your dashboard.",
  "wordCount": 48,
  "metadata": {
    "siteName": "Example Docs",
    "scrapedAt": "2026-04-05T10:00:00.000Z",
    "platform": "docusaurus"
  }
}

How much does it cost to crawl a documentation site?

Docs-to-RAG Crawler uses pay-per-event (PPE) pricing: each run is charged a small start fee plus a fee for every page crawled, so cost scales with the size of the site rather than with a flat per-run price.

Plan       Start fee   Per-page fee   Example: 100 pages
FREE       $0.010      $0.003         ~$0.31
BRONZE     $0.0095     $0.0027        ~$0.28
SILVER     $0.0085     $0.0024        ~$0.25
GOLD       $0.0075     $0.00195       ~$0.20
PLATINUM   $0.006      $0.0015        ~$0.155
DIAMOND    $0.005      $0.0012        ~$0.125

Real-world examples:

  • Small library docs (50 pages) ≈ $0.16 on FREE plan
  • Medium framework docs (200 pages) ≈ $0.61 on FREE plan
  • Large documentation site (500 pages) ≈ $1.51 on FREE plan
  • Enterprise docs with 1000+ pages ≈ $3.01 on FREE plan

Free plan credit: New Apify accounts get $5 in free credits – enough to crawl ~1,600 pages at FREE tier pricing.
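
To budget a run up front, the cost is simply the start fee plus pages multiplied by the per-page fee. A quick sketch using the FREE-tier figures from the table above reproduces the examples:

# FREE-tier figures from the pricing table above.
START_FEE = 0.010
PER_PAGE_FEE = 0.003

for pages in (50, 200, 500, 1000):
    print(f"{pages} pages -> ~${START_FEE + pages * PER_PAGE_FEE:.2f}")
# 50 pages -> ~$0.16, 200 -> ~$0.61, 500 -> ~$1.51, 1000 -> ~$3.01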

How to crawl documentation for RAG

  1. Go to Docs-to-RAG Crawler on Apify Store
  2. Click Try for free
  3. In the Documentation URL field, enter the root URL of the docs you want to crawl (e.g., https://docs.example.com)
  4. Set Max pages (start with 20-50 to preview results, then increase)
  5. Choose Chunk mode: heading (recommended) splits at H2/H3 boundaries; page outputs one chunk per full page
  6. Click Start and wait for the crawl to finish
  7. Click Export → JSON or CSV to download your chunks
  8. Load the JSON into your vector database (Pinecone, Weaviate, Chroma, Qdrant, etc.)
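
Step 8 depends on your vector database. As one possible example, the sketch below loads the exported JSON into a local Chroma collection and uses chunkId as the record ID, so re-running it upserts instead of duplicating. The file name chunks.json and the collection name are placeholders.

import json

import chromadb  # pip install chromadb

# chunks.json is the dataset you exported in step 7.
with open("chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

client = chromadb.PersistentClient(path="./chroma-db")
collection = client.get_or_create_collection("docs")

# chunkId as the record ID means a re-crawl overwrites existing entries
# instead of creating duplicates.
collection.upsert(
    ids=[c["chunkId"] for c in chunks],
    documents=[c["content"] for c in chunks],
    metadatas=[{"url": c["url"], "headingHierarchy": c["headingHierarchy"]} for c in chunks],
)
print(f"Indexed {collection.count()} chunks")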

Example inputs for common scenarios:

Crawl a Docusaurus site, skip blog and changelog:

{
  "startUrl": "https://docusaurus.io/docs",
  "maxPages": 200,
  "chunkMode": "heading",
  "maxChunkWords": 300,
  "excludePatterns": ["*/blog/*", "*/changelog/*"]
}

Crawl a ReadTheDocs project, text-only (no code):

{
  "startUrl": "https://docs.python-requests.org/en/latest/",
  "maxPages": 50,
  "chunkMode": "heading",
  "includeCodeBlocks": false
}

Crawl a GitBook space, one chunk per page:

{
  "startUrl": "https://docs.myapp.gitbook.io/",
  "maxPages": 100,
  "chunkMode": "page",
  "maxChunkWords": 1000
}

Input parameters

Parameter          Type      Default     Description
startUrl           string    –           Root URL of the documentation site to crawl (required)
maxPages           integer   100         Maximum number of pages to crawl
includeCodeBlocks  boolean   true        Whether to include code blocks in chunks
chunkMode          string    "heading"   "heading" splits at H2/H3; "page" outputs one chunk per full page
maxChunkWords      integer   300         Maximum words per chunk; oversized sections are split at paragraph boundaries
linkSelector       string    –           Custom CSS selector for navigation links (leave empty for auto-detection)
excludePatterns    array     []          URL glob patterns to skip (e.g., "*/blog/*", "*/release-notes/*")
waitForSelector    string    –           Reserved for a future JS-rendering mode

Output examples

Heading-mode output (chunkMode: "heading"):

[
  {
    "chunkId": "quick-start-5fd5eded",
    "title": "Quick Start",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start",
    "content": "# Quick Start\n\nWith this short tutorial you can start scraping with Crawlee in a minute or two.",
    "wordCount": 22,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  },
  {
    "chunkId": "cheerio-crawler-b1a2c3d4",
    "title": "CheerioCrawler",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start > Choose your crawler > CheerioCrawler",
    "content": "### CheerioCrawler\n\nCheerioCrawler downloads each URL using a plain HTTP request and parses the HTML with Cheerio...",
    "wordCount": 64,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  }
]

Tips for best results

  • 🎯 Start small – set maxPages: 20 first to preview chunk quality, then increase
  • 🔗 Use the docs root – set startUrl to the documentation root (e.g., /docs or /docs/intro), not a specific page
  • ✂️ Tune chunk size – 200-400 words per chunk works well for most embedding models (~270-530 tokens). Smaller chunks = more precise retrieval but more DB entries
  • 🚫 Exclude noise pages – use excludePatterns to skip changelog, blog, API reference (auto-generated), and release notes pages
  • 💻 Code blocks – keep includeCodeBlocks: true for code-heavy docs (frameworks, SDKs). Set to false for prose-heavy docs (tutorials, guides) where code snippets reduce embedding quality
  • 🔄 Schedule re-crawls – set up a weekly cron to keep your knowledge base fresh; stable chunkIds let you upsert without duplicates
  • 📊 Page mode for long-form content – use chunkMode: "page" for docs with very long pages and few headings (like API references with a single long page)

Integrations

Docs-to-RAG Crawler → Pinecone (auto-indexed knowledge base)

  • Run actor on a schedule (weekly) → export dataset as JSON → load into Pinecone with chunk content as text and chunkId as vector ID → upsert without worrying about duplicates
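
A sketch of that flow in Python, assuming an existing Pinecone index and using OpenAI's text-embedding-3-small as an example embedding model. The index name, file name, and metadata fields are placeholders, and large exports should be split into batches to stay under the embedding API's input limits.

import json

from openai import OpenAI      # pip install openai
from pinecone import Pinecone  # pip install pinecone

with open("chunks.json", encoding="utf-8") as f:  # exported dataset
    chunks = json.load(f)

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("docs-rag")  # hypothetical index created with dimension 1536

# Embed all chunk texts in one request (batch this for large exports).
resp = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["content"] for c in chunks],
)

# chunkId doubles as the Pinecone vector ID, so re-crawls upsert cleanly.
index.upsert(vectors=[
    {
        "id": c["chunkId"],
        "values": e.embedding,
        "metadata": {"url": c["url"], "headingHierarchy": c["headingHierarchy"]},
    }
    for c, e in zip(chunks, resp.data)
])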

Docs-to-RAG Crawler → OpenAI Embeddings → Weaviate

  • Export chunks as JSON → batch-process through text-embedding-3-small → store vectors with headingHierarchy and url as metadata for rich retrieval

Docs-to-RAG Crawler → Make/Zapier → Slack bot

  • Trigger on new dataset items → embed each chunk → surface answers in a Slack chatbot that cites source URLs

Docs-to-RAG Crawler → Google Sheets

  • Use Apify's Google Sheets integration to export all chunks to a spreadsheet – useful for reviewing chunk quality before ingesting into a vector DB

Docs-to-RAG Crawler → Webhook → LlamaIndex pipeline

  • Set up a webhook to trigger your LlamaIndex indexing pipeline immediately when the crawl completes – zero manual steps for fresh documentation indexing

Scheduled re-indexing workflow

  • Create a daily/weekly schedule on Apify → actor outputs only pages that exist in the current crawl → diff against your vector DB to add new, update changed, and remove deleted chunks using the stable chunkId
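
A minimal sketch of that diff, assuming you keep a snapshot of the chunkIds already in your vector DB (read here from a local file as a stand-in for whatever listing mechanism your DB provides):

import json

# Latest crawl export.
with open("chunks.json", encoding="utf-8") as f:
    new_chunks = {c["chunkId"]: c for c in json.load(f)}

# chunkIds currently indexed in your vector DB.
with open("indexed-ids.json", encoding="utf-8") as f:
    old_ids = set(json.load(f))

to_upsert = list(new_chunks.values())   # new and changed chunks; upserts are idempotent
to_delete = old_ids - set(new_chunks)   # chunks whose page or section disappeared

print(f"{len(to_upsert)} chunks to upsert, {len(to_delete)} stale IDs to delete")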

Using the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/docs-rag-crawler').call({
    startUrl: 'https://docs.example.com',
    maxPages: 100,
    chunkMode: 'heading',
    maxChunkWords: 300,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Collected ${items.length} chunks`);

Python

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 100,
    'chunkMode': 'heading',
    'maxChunkWords': 300,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(items)} chunks')

cURL

curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~docs-rag-crawler/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "maxPages": 100,
    "chunkMode": "heading",
    "maxChunkWords": 300
  }'

Use with AI agents via MCP

Docs-to-RAG Crawler is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client – this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?token=YOUR_APIFY_TOKEN"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?token=YOUR_APIFY_TOKEN&tools=automation-lab/docs-rag-crawler"
    }
  }
}

Example prompts

Once connected, you can ask your AI assistant:
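
  • "Crawl https://docs.example.com with Docs-to-RAG Crawler, limit it to 50 pages, and summarize what the documentation covers."
  • "Run Docs-to-RAG Crawler on https://crawlee.dev/docs and list the chunks that explain how CheerioCrawler works."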

Is it legal to crawl documentation sites?

Docs-to-RAG Crawler is designed for ethical use of publicly available documentation.

Most documentation sites explicitly encourage crawling and indexing – that's the point of public documentation. However, always check the site's robots.txt and Terms of Service before crawling at scale.
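
If you want to check a site's robots.txt programmatically before a large crawl, Python's standard library is enough. A small sketch; the URLs and the "*" user agent are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")
rp.read()

# "*" checks the rules that apply to generic crawlers.
print(rp.can_fetch("*", "https://docs.example.com/getting-started/install"))
print(rp.crawl_delay("*"))  # None if the site doesn't declare a crawl delay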

Best practices:

  • Only crawl documentation you have permission to use
  • Respect robots.txt crawl delays and disallow rules
  • Don't crawl private or authenticated documentation without authorization
  • GDPR: documentation sites rarely contain personal data, but be mindful if internal wikis do

This actor crawls public pages using standard HTTP requests, the same as a web browser. It does not bypass authentication, CAPTCHA systems, or access controls.

FAQ

How many chunks does a typical documentation site produce?

A small library docs site (50 pages) typically produces 200-800 chunks. A large framework like React or Next.js (300+ pages) can produce 2,000-5,000 chunks with heading-mode chunking.

How long does a crawl take?

Very fast – pure HTTP, no JavaScript rendering needed. A 100-page docs site typically completes in 20-30 seconds, a 500-page site in under 2 minutes.

What is the recommended chunk size for RAG?

200-400 words per chunk (~270-530 tokens with the GPT-4 tokenizer) is the sweet spot for most retrieval scenarios. Smaller chunks improve precision but require more DB storage and API calls. Larger chunks preserve more context but may dilute retrieval accuracy.
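
If you want to verify the word-to-token ratio for your own content, a quick check with the tiktoken package (cl100k_base is the GPT-4-family encoding) looks like this:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family tokenizer

chunk_text = "To authenticate with the API, pass your API key in the Authorization header."
words = len(chunk_text.split())
tokens = len(enc.encode(chunk_text))
print(f"{words} words -> {tokens} tokens (ratio ~{tokens / words:.2f})")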

Why are some chunks very short or empty?

Short chunks usually come from heading-only sections or navigation artifacts. The actor automatically skips chunks with fewer than 5 words. Very short but non-empty chunks (10-50 words) are kept because they may contain important introductory text.

Why does the crawler only find some pages and not others?

The crawler discovers pages by following links in the navigation sidebar and content area. If your docs site uses JavaScript to load the sidebar dynamically, Cheerio mode won't see those links. In that case, try setting linkSelector to a CSS selector that targets the nav links in the raw HTML (check the source with Ctrl+U in your browser). GitBook sites that render entirely via JS may need a future Playwright-mode upgrade.
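
For example, if the sidebar links are present in the raw HTML inside a nav element, a run input along these lines points the crawler at them. The selector value is hypothetical; inspect your site's source to find the right one:

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')
run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 50,
    # Hypothetical selector: replace it with whatever wraps the nav links
    # in your site's raw HTML.
    'linkSelector': 'nav.sidebar a[href]',
})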

The chunks contain "Version: 3.x" or "On this page" text. How do I remove it?

These artifacts are stripped automatically by the actor's Docusaurus-specific noise filters. If you see them, please open an issue with the URL and we'll add a filter for that platform.

The actor crawled pages from a different subdomain – why?

The crawler stays within the same domain as the startUrl, but docs that link across sibling subdomains (e.g., docs.example.com linking to api.example.com) can pull those pages into the crawl. Use excludePatterns to filter them out, or open a feature request for stricter subdomain scoping.

Other documentation and AI tools

Explore more automation-lab actors for AI and data workflows: