Docs-to-RAG Crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) and output chunked markdown ready for RAG/vector DB ingestion. Splits by heading hierarchy, strips nav/sidebar chrome.

Crawl any documentation website and export chunked markdown optimized for RAG pipelines, vector databases, and AI knowledge bases. Supports ReadTheDocs, GitBook, Docusaurus, Mintlify, and any generic HTML documentation site.

No API key required. No browser overhead. Pure HTTP crawling for maximum speed.

What does Docs-to-RAG Crawler do?

Docs-to-RAG Crawler visits every page of a documentation site, strips navigation/sidebar/footer chrome, extracts the core content, and splits it into heading-bounded chunks ready for vector DB ingestion.

Each chunk includes: a stable chunkId (hash of URL + heading), the heading hierarchy as a breadcrumb (e.g., "Getting Started > Installation > Requirements"), clean markdown content, word count, and metadata (site name, crawl timestamp, detected platform).
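
For illustration, an ID with those properties can be derived by hashing the URL together with the heading. The snippet below is only a sketch of the idea in Python; the actor's exact scheme isn't documented beyond "SHA-1 based on URL + heading", and the slug format here is an assumption.

import hashlib

def make_chunk_id(url: str, heading: str) -> str:
    # A readable slug from the heading plus a short SHA-1 of URL + heading:
    # the same page/section always produces the same ID across re-crawls.
    slug = heading.lower().replace(" ", "-")
    digest = hashlib.sha1(f"{url}#{heading}".encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}"

print(make_chunk_id("https://docs.example.com/getting-started/install",
                    "Installation Requirements"))
# -> "installation-requirements-" followed by 8 hex characters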

What makes it different from a generic web crawler:

  • 🧠 Understands documentation structure – splits on H2/H3 boundaries, not arbitrary character limits (see the sketch after this list)
  • 🏗️ Detects docs platforms automatically (Docusaurus, ReadTheDocs, GitBook, Mintlify) and applies platform-specific noise removal
  • 🔗 Follows sidebar/nav links intelligently – finds the full documentation tree, not just pages linked from the start URL
  • ⚡ Pure HTTP (Cheerio) – 10-50x faster than browser-based crawlers, minimal memory usage
  • 📦 Outputs stable chunk IDs – safe to re-crawl and upsert into vector DBs without duplicates
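
To make the first point concrete, heading-bounded splitting means cutting a page at every H2/H3 line so each chunk starts with its own heading. The snippet below illustrates the general idea only; it is not the actor's actual implementation.

import re

def split_on_headings(markdown_text: str) -> list[str]:
    # Cut at the start of every H2/H3 line; each chunk keeps its own heading.
    pieces = re.split(r"(?m)^(?=#{2,3} )", markdown_text)
    return [piece.strip() for piece in pieces if piece.strip()]

page = (
    "# Quick Start\n\nIntro paragraph.\n\n"
    "## Installation\n\nRun the installer.\n\n"
    "### Requirements\n\nNode 18 or newer.\n"
)
for chunk in split_on_headings(page):
    print(chunk.splitlines()[0])
# -> "# Quick Start", "## Installation", "### Requirements"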

Who is Docs-to-RAG Crawler for?

AI engineers building RAG applications

  • Index your product's own documentation into a vector DB for customer-facing chatbots
  • Build a "chat with any docs" feature by crawling third-party docs at startup
  • Keep your knowledge base fresh by scheduling weekly re-crawls

Developers building internal tools

  • Turn scattered internal wikis into a searchable knowledge base
  • Feed company documentation into LLM-powered search
  • Build code assistants that understand your specific framework's docs

Data teams and researchers

  • Benchmark embedding models across real documentation corpora
  • Build multi-doc retrieval datasets for fine-tuning
  • Compare documentation quality across competing libraries

DevOps and platform teams

  • Automate documentation ingestion into Confluence, Notion, or Slack bots
  • Trigger re-ingestion automatically on new releases via webhooks
  • Monitor documentation coverage gaps by tracking chunk counts per section

Why use Docs-to-RAG Crawler?

  • ✅ Works with modern docs sites – Docusaurus v2/v3, ReadTheDocs, GitBook, Mintlify, plain HTML
  • ✅ Heading-aware chunking – splits at semantic H2/H3 boundaries, not mid-paragraph
  • ✅ Stable chunk IDs – SHA-1 based on URL + heading, safe for upserts on re-crawl
  • ✅ Platform-specific noise removal – strips breadcrumbs, version badges, "On this page" widgets, edit buttons
  • ✅ Code block control – keep or strip code blocks depending on your embedding model's needs
  • ✅ Fast – pure HTTP, no browser, processes 400+ pages/minute
  • ✅ No proxy required – documentation sites are public and rarely block scrapers
  • ✅ Exclude patterns – skip changelog, blog, or release notes pages with glob patterns
  • ✅ Scheduling support – set up weekly re-crawls to keep your knowledge base fresh

What data can you extract?

Each output chunk contains:

Field               Type     Example
chunkId             string   "installation-requirements-a3f8b1c2"
title               string   "Installation Requirements"
url                 string   "https://docs.example.com/getting-started/install"
headingHierarchy    string   "Getting Started > Installation > Requirements"
content             string   Full markdown content of this chunk
wordCount           number   142
metadata.siteName   string   "My Project Docs"
metadata.scrapedAt  string   "2026-04-05T10:00:00.000Z"
metadata.platform   string   "docusaurus"

Example output chunk:

{
  "chunkId": "authentication-api-keys-d4e9f1a3",
  "title": "API Keys",
  "url": "https://docs.example.com/authentication",
  "headingHierarchy": "Authentication > API Keys",
  "content": "## API Keys\n\nTo authenticate with the API, pass your API key in the `Authorization` header:\n\n```\ncurl -H 'Authorization: Bearer YOUR_KEY' https://api.example.com/data\n```\n\nAPI keys are scoped to your account and can be revoked at any time from your dashboard.",
  "wordCount": 48,
  "metadata": {
    "siteName": "Example Docs",
    "scrapedAt": "2026-04-05T10:00:00.000Z",
    "platform": "docusaurus"
  }
}

How much does it cost to crawl a documentation site?

Docs-to-RAG Crawler uses pay-per-event (PPE) pricing: each run is charged a small start fee plus a fee for every page crawled, so cost scales with the size of the site rather than with a flat per-run price.

Plan       Start fee   Per-page fee   Example: 100 pages
FREE       $0.010      $0.003         ~$0.31
BRONZE     $0.0095     $0.0027        ~$0.28
SILVER     $0.0085     $0.0024        ~$0.25
GOLD       $0.0075     $0.00195       ~$0.20
PLATINUM   $0.006      $0.0015        ~$0.155
DIAMOND    $0.005      $0.0012        ~$0.125

Real-world examples:

  • Small library docs (50 pages) ≈ $0.16 on FREE plan
  • Medium framework docs (200 pages) ≈ $0.61 on FREE plan
  • Large documentation site (500 pages) ≈ $1.51 on FREE plan
  • Enterprise docs with 1000+ pages ≈ $3.01 on FREE plan

Free plan credit: New Apify accounts get $5 in free credits – enough to crawl ~1,600 pages at FREE tier pricing.
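
To budget a run up front, the cost is simply the start fee plus pages multiplied by the per-page fee. A quick sketch using the FREE-tier figures from the table above reproduces the examples:

# FREE-tier figures from the pricing table above.
START_FEE = 0.010
PER_PAGE_FEE = 0.003

for pages in (50, 200, 500, 1000):
    print(f"{pages} pages -> ~${START_FEE + pages * PER_PAGE_FEE:.2f}")
# 50 pages -> ~$0.16, 200 -> ~$0.61, 500 -> ~$1.51, 1000 -> ~$3.01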

How to crawl documentation for RAG

  1. Go to Docs-to-RAG Crawler on Apify Store
  2. Click Try for free
  3. In the Documentation URL field, enter the root URL of the docs you want to crawl (e.g., https://docs.example.com)
  4. Set Max pages (start with 20-50 to preview results, then increase)
  5. Choose Chunk mode: heading (recommended) splits at H2/H3 boundaries; page outputs one chunk per full page
  6. Click Start and wait for the crawl to finish
  7. Click Export → JSON or CSV to download your chunks
  8. Load the JSON into your vector database (Pinecone, Weaviate, Chroma, Qdrant, etc.)
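
Step 8 depends on your vector database. As one possible example, the sketch below loads the exported JSON into a local Chroma collection and uses chunkId as the record ID, so re-running it upserts instead of duplicating. The file name chunks.json and the collection name are placeholders.

import json

import chromadb  # pip install chromadb

# chunks.json is the dataset you exported in step 7.
with open("chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

client = chromadb.PersistentClient(path="./chroma-db")
collection = client.get_or_create_collection("docs")

# chunkId as the record ID means a re-crawl overwrites existing entries
# instead of creating duplicates.
collection.upsert(
    ids=[c["chunkId"] for c in chunks],
    documents=[c["content"] for c in chunks],
    metadatas=[{"url": c["url"], "headingHierarchy": c["headingHierarchy"]} for c in chunks],
)
print(f"Indexed {collection.count()} chunks")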

Example inputs for common scenarios:

Crawl a Docusaurus site, skip blog and changelog:

{
  "startUrl": "https://docusaurus.io/docs",
  "maxPages": 200,
  "chunkMode": "heading",
  "maxChunkWords": 300,
  "excludePatterns": ["*/blog/*", "*/changelog/*"]
}

Crawl a ReadTheDocs project, text-only (no code):

{
  "startUrl": "https://docs.python-requests.org/en/latest/",
  "maxPages": 50,
  "chunkMode": "heading",
  "includeCodeBlocks": false
}

Crawl a GitBook space, one chunk per page:

{
  "startUrl": "https://docs.myapp.gitbook.io/",
  "maxPages": 100,
  "chunkMode": "page",
  "maxChunkWords": 1000
}

Input parameters

Parameter          Type      Default     Description
startUrl           string    –           Root URL of the documentation site to crawl (required)
maxPages           integer   100         Maximum number of pages to crawl
includeCodeBlocks  boolean   true        Whether to include code blocks in chunks
chunkMode          string    "heading"   "heading" splits at H2/H3; "page" outputs one chunk per full page
maxChunkWords      integer   300         Maximum words per chunk; oversized sections are split at paragraph boundaries
linkSelector       string    –           Custom CSS selector for navigation links (leave empty for auto-detection)
excludePatterns    array     []          URL glob patterns to skip (e.g., "*/blog/*", "*/release-notes/*")
waitForSelector    string    –           Reserved for a future JS-rendering mode

Output examples

Heading-mode output (chunkMode: "heading"):

[
  {
    "chunkId": "quick-start-5fd5eded",
    "title": "Quick Start",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start",
    "content": "# Quick Start\n\nWith this short tutorial you can start scraping with Crawlee in a minute or two.",
    "wordCount": 22,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  },
  {
    "chunkId": "cheerio-crawler-b1a2c3d4",
    "title": "CheerioCrawler",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start > Choose your crawler > CheerioCrawler",
    "content": "### CheerioCrawler\n\nCheerioCrawler downloads each URL using a plain HTTP request and parses the HTML with Cheerio...",
    "wordCount": 64,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  }
]

Tips for best results

  • 🎯 Start small – set maxPages: 20 first to preview chunk quality, then increase
  • 🔗 Use the docs root – set startUrl to the documentation root (e.g., /docs or /docs/intro), not a specific page
  • ✂️ Tune chunk size – 200-400 words per chunk works well for most embedding models (~270-530 tokens). Smaller chunks = more precise retrieval but more DB entries
  • 🚫 Exclude noise pages – use excludePatterns to skip changelog, blog, API reference (auto-generated), and release notes pages
  • 💻 Code blocks – keep includeCodeBlocks: true for code-heavy docs (frameworks, SDKs). Set to false for prose-heavy docs (tutorials, guides) where code snippets reduce embedding quality
  • 🔄 Schedule re-crawls – set up a weekly cron to keep your knowledge base fresh; stable chunkIds let you upsert without duplicates
  • 📊 Page mode for long-form content – use chunkMode: "page" for docs with very long pages and few headings (like API references with a single long page)

Integrations

Docs-to-RAG Crawler → Pinecone (auto-indexed knowledge base)

  • Run actor on a schedule (weekly) → export dataset as JSON → load into Pinecone with chunk content as text and chunkId as vector ID → upsert without worrying about duplicates
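
A sketch of that flow in Python, assuming an existing Pinecone index and using OpenAI's text-embedding-3-small as an example embedding model. The index name, file name, and metadata fields are placeholders, and large exports should be split into batches to stay under the embedding API's input limits.

import json

from openai import OpenAI      # pip install openai
from pinecone import Pinecone  # pip install pinecone

with open("chunks.json", encoding="utf-8") as f:  # exported dataset
    chunks = json.load(f)

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("docs-rag")  # hypothetical index created with dimension 1536

# Embed all chunk texts in one request (batch this for large exports).
resp = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["content"] for c in chunks],
)

# chunkId doubles as the Pinecone vector ID, so re-crawls upsert cleanly.
index.upsert(vectors=[
    {
        "id": c["chunkId"],
        "values": e.embedding,
        "metadata": {"url": c["url"], "headingHierarchy": c["headingHierarchy"]},
    }
    for c, e in zip(chunks, resp.data)
])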

Docs-to-RAG Crawler → OpenAI Embeddings → Weaviate

  • Export chunks as JSON → batch-process through text-embedding-3-small → store vectors with headingHierarchy and url as metadata for rich retrieval

Docs-to-RAG Crawler → Make/Zapier → Slack bot

  • Trigger on new dataset items → embed each chunk → surface answers in a Slack chatbot that cites source URLs

Docs-to-RAG Crawler → Google Sheets

  • Use Apify's Google Sheets integration to export all chunks to a spreadsheet – useful for reviewing chunk quality before ingesting into a vector DB

Docs-to-RAG Crawler → Webhook → LlamaIndex pipeline

  • Set up a webhook to trigger your LlamaIndex indexing pipeline immediately when the crawl completes – zero manual steps for fresh documentation indexing

Scheduled re-indexing workflow

  • Create a daily/weekly schedule on Apify → actor outputs only pages that exist in the current crawl → diff against your vector DB to add new, update changed, and remove deleted chunks using the stable chunkId
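
A minimal sketch of that diff, assuming you keep a snapshot of the chunkIds already in your vector DB (read here from a local file as a stand-in for whatever listing mechanism your DB provides):

import json

# Latest crawl export.
with open("chunks.json", encoding="utf-8") as f:
    new_chunks = {c["chunkId"]: c for c in json.load(f)}

# chunkIds currently indexed in your vector DB.
with open("indexed-ids.json", encoding="utf-8") as f:
    old_ids = set(json.load(f))

to_upsert = list(new_chunks.values())   # new and changed chunks; upserts are idempotent
to_delete = old_ids - set(new_chunks)   # chunks whose page or section disappeared

print(f"{len(to_upsert)} chunks to upsert, {len(to_delete)} stale IDs to delete")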

Using the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/docs-rag-crawler').call({
    startUrl: 'https://docs.example.com',
    maxPages: 100,
    chunkMode: 'heading',
    maxChunkWords: 300,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Collected ${items.length} chunks`);

Python

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 100,
    'chunkMode': 'heading',
    'maxChunkWords': 300,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(items)} chunks')

cURL

curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~docs-rag-crawler/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "maxPages": 100,
    "chunkMode": "heading",
    "maxChunkWords": 300
  }'

Use with AI agents via MCP

Docs-to-RAG Crawler is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client – this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?token=YOUR_APIFY_TOKEN"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?token=YOUR_APIFY_TOKEN&tools=automation-lab/docs-rag-crawler"
    }
  }
}

Example prompts

Once connected, you can ask your AI assistant:
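
  • "Crawl https://docs.example.com with Docs-to-RAG Crawler, limit it to 50 pages, and summarize what the documentation covers."
  • "Run Docs-to-RAG Crawler on https://crawlee.dev/docs and list the chunks that explain how CheerioCrawler works."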

Is it legal to crawl documentation sites?

Docs-to-RAG Crawler is designed for ethical use of publicly available documentation.

Most documentation sites explicitly encourage crawling and indexing – that's the point of public documentation. However, always check the site's robots.txt and Terms of Service before crawling at scale.
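
If you want to check a site's robots.txt programmatically before a large crawl, Python's standard library is enough. A small sketch; the URLs and the "*" user agent are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://docs.example.com/robots.txt")
rp.read()

# "*" checks the rules that apply to generic crawlers.
print(rp.can_fetch("*", "https://docs.example.com/getting-started/install"))
print(rp.crawl_delay("*"))  # None if the site doesn't declare a crawl delay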

Best practices:

  • Only crawl documentation you have permission to use
  • Respect robots.txt crawl delays and disallow rules
  • Don't crawl private or authenticated documentation without authorization
  • GDPR: documentation sites rarely contain personal data, but be mindful if internal wikis do

This actor crawls public pages using standard HTTP requests, the same as a web browser. It does not bypass authentication, CAPTCHA systems, or access controls.

FAQ

How many chunks does a typical documentation site produce?

A small library docs site (50 pages) typically produces 200-800 chunks. A large framework like React or Next.js (300+ pages) can produce 2,000-5,000 chunks with heading-mode chunking.

How long does a crawl take?

Very fast – pure HTTP, no JavaScript rendering needed. A 100-page docs site typically completes in 20-30 seconds, a 500-page site in under 2 minutes.

What is the recommended chunk size for RAG?

200-400 words per chunk (~270-530 tokens with the GPT-4 tokenizer) is the sweet spot for most retrieval scenarios. Smaller chunks improve precision but require more DB storage and API calls. Larger chunks preserve more context but may dilute retrieval accuracy.
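
If you want to verify the word-to-token ratio for your own content, a quick check with the tiktoken package (cl100k_base is the GPT-4-family encoding) looks like this:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family tokenizer

chunk_text = "To authenticate with the API, pass your API key in the Authorization header."
words = len(chunk_text.split())
tokens = len(enc.encode(chunk_text))
print(f"{words} words -> {tokens} tokens (ratio ~{tokens / words:.2f})")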

Why are some chunks very short or empty?

Short chunks usually come from heading-only sections or navigation artifacts. The actor automatically skips chunks with fewer than 5 words. Very short but non-empty chunks (10-50 words) are kept because they may contain important introductory text.

Why does the crawler only find some pages and not others?

The crawler discovers pages by following links in the navigation sidebar and content area. If your docs site uses JavaScript to load the sidebar dynamically, Cheerio mode won't see those links. In that case, try setting linkSelector to a CSS selector that targets the nav links in the raw HTML (check the source with Ctrl+U in your browser). GitBook sites that render entirely via JS may need a future Playwright-mode upgrade.
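
For example, if the sidebar links are present in the raw HTML inside a nav element, a run input along these lines points the crawler at them. The selector value is hypothetical; inspect your site's source to find the right one:

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')
run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 50,
    # Hypothetical selector: replace it with whatever wraps the nav links
    # in your site's raw HTML.
    'linkSelector': 'nav.sidebar a[href]',
})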

The chunks contain "Version: 3.x" or "On this page" text. How do I remove it?

These artifacts are stripped automatically by the actor's Docusaurus-specific noise filters. If you see them, please open an issue with the URL and we'll add a filter for that platform.

The actor crawled pages from a different subdomain – why?

The crawler stays within the same domain as the startUrl, but docs that link across sibling subdomains (e.g., docs.example.com linking to api.example.com) can pull those pages into the crawl. Use excludePatterns to filter them out, or open a feature request for stricter subdomain scoping.

Other documentation and AI tools

Explore more automation-lab actors for AI and data workflows: