Docs-to-RAG Crawler
Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) and output chunked markdown ready for RAG/vector DB ingestion. Splits by heading hierarchy, strips nav/sidebar chrome.
Pricing: pay per event
Developer: Stas Persiianenko
Last modified: 8 days ago
Crawl any documentation website and export chunked markdown optimized for RAG pipelines, vector databases, and AI knowledge bases. Supports ReadTheDocs, GitBook, Docusaurus, Mintlify, and any generic HTML documentation site.
No API key required. No browser overhead. Pure HTTP crawling for maximum speed.
What does Docs-to-RAG Crawler do?
Docs-to-RAG Crawler visits every page of a documentation site, strips navigation/sidebar/footer chrome, extracts the core content, and splits it into heading-bounded chunks ready for vector DB ingestion.
Each chunk includes: a stable chunkId (hash of URL + heading), the heading hierarchy as a breadcrumb (e.g., "Getting Started > Installation > Requirements"), clean markdown content, word count, and metadata (site name, crawl timestamp, detected platform).
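The stable chunkId can be reproduced with a scheme like the following. The exact recipe (a slugified heading plus the first 8 hex characters of a SHA-1 over URL + heading) is an assumption inferred from the example IDs on this page, not the actor's actual code:

```python
import hashlib
import re

def make_chunk_id(url: str, heading: str) -> str:
    """Hypothetical stable chunk ID: heading slug + short SHA-1 of URL + heading."""
    slug = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
    digest = hashlib.sha1(f"{url}#{heading}".encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}"

# The same URL + heading always yields the same ID, so a re-crawl
# upserts over the existing vector instead of creating a duplicate.
cid = make_chunk_id("https://docs.example.com/getting-started/install",
                    "Installation Requirements")
```

Because the ID depends only on the URL and heading, re-crawling a site after a docs update overwrites changed chunks in place.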
What makes it different from a generic web crawler:
- Understands documentation structure: splits on H2/H3 boundaries, not arbitrary character limits
- Detects docs platforms automatically (Docusaurus, ReadTheDocs, GitBook, Mintlify) and applies platform-specific noise removal
- Follows sidebar/nav links intelligently: finds the full documentation tree, not just pages linked from the start URL
- Pure HTTP (Cheerio): 10-50x faster than browser-based crawlers, minimal memory usage
- Outputs stable chunk IDs: safe to re-crawl and upsert into vector DBs without duplicates
Who is Docs-to-RAG Crawler for?
AI engineers building RAG applications
- Index your product's own documentation into a vector DB for customer-facing chatbots
- Build a "chat with any docs" feature by crawling third-party docs at startup
- Keep your knowledge base fresh by scheduling weekly re-crawls
Developers building internal tools
- Turn scattered internal wikis into a searchable knowledge base
- Feed company documentation into LLM-powered search
- Build code assistants that understand your specific framework's docs
Data teams and researchers
- Benchmark embedding models across real documentation corpora
- Build multi-doc retrieval datasets for fine-tuning
- Compare documentation quality across competing libraries
DevOps and platform teams
- Automate documentation ingestion into Confluence, Notion, or Slack bots
- Trigger re-ingestion automatically on new releases via webhooks
- Monitor documentation coverage gaps by tracking chunk counts per section
Why use Docs-to-RAG Crawler?
- Works with modern docs sites: Docusaurus v2/v3, ReadTheDocs, GitBook, Mintlify, plain HTML
- Heading-aware chunking: splits at semantic H2/H3 boundaries, not mid-paragraph
- Stable chunk IDs: SHA-1 based on URL + heading, safe for upserts on re-crawl
- Platform-specific noise removal: strips breadcrumbs, version badges, "On this page" widgets, edit buttons
- Code block control: keep or strip code blocks depending on your embedding model's needs
- Fast: pure HTTP, no browser, processes 400+ pages/minute
- No proxy required: documentation sites are public and typically don't block scrapers
- Exclude patterns: skip changelog, blog, or release notes pages with glob patterns
- Scheduling support: set up weekly re-crawls to keep your knowledge base fresh
What data can you extract?
Each output chunk contains:
| Field | Type | Example |
|---|---|---|
| `chunkId` | string | `"installation-requirements-a3f8b1c2"` |
| `title` | string | `"Installation Requirements"` |
| `url` | string | `"https://docs.example.com/getting-started/install"` |
| `headingHierarchy` | string | `"Getting Started > Installation > Requirements"` |
| `content` | string | Full markdown content of this chunk |
| `wordCount` | number | `142` |
| `metadata.siteName` | string | `"My Project Docs"` |
| `metadata.scrapedAt` | string | `"2026-04-05T10:00:00.000Z"` |
| `metadata.platform` | string | `"docusaurus"` |
Example output chunk:
```json
{
  "chunkId": "authentication-api-keys-d4e9f1a3",
  "title": "API Keys",
  "url": "https://docs.example.com/authentication",
  "headingHierarchy": "Authentication > API Keys",
  "content": "## API Keys\n\nTo authenticate with the API, pass your API key in the `Authorization` header:\n\n```\ncurl -H 'Authorization: Bearer YOUR_KEY' https://api.example.com/data\n```\n\nAPI keys are scoped to your account and can be revoked at any time from your dashboard.",
  "wordCount": 48,
  "metadata": {
    "siteName": "Example Docs",
    "scrapedAt": "2026-04-05T10:00:00.000Z",
    "platform": "docusaurus"
  }
}
```
How much does it cost to crawl a documentation site?
Docs-to-RAG Crawler uses pay-per-event (PPE) pricing: you pay per page crawled, not per run. There is a small one-time start fee plus a per-page fee.
| Plan | Start fee | Per page fee | Example: 100 pages |
|---|---|---|---|
| FREE | $0.010 | $0.003 | ~$0.31 |
| BRONZE | $0.0095 | $0.0027 | ~$0.28 |
| SILVER | $0.0085 | $0.0024 | ~$0.25 |
| GOLD | $0.0075 | $0.00195 | ~$0.20 |
| PLATINUM | $0.006 | $0.0015 | ~$0.155 |
| DIAMOND | $0.005 | $0.0012 | ~$0.125 |
Real-world examples:
- Small library docs (50 pages): $0.16 on the FREE plan
- Medium framework docs (200 pages): $0.61 on the FREE plan
- Large documentation site (500 pages): $1.51 on the FREE plan
- Enterprise docs (1,000 pages): $3.01 on the FREE plan
Free plan credit: new Apify accounts get $5 in free credits, enough to crawl ~1,600 pages at FREE-tier pricing.
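The PPE totals above follow directly from the pricing table: the total is the one-time start fee plus the per-page fee times pages crawled. A quick sanity check (`crawl_cost` is an illustrative helper, not part of any API):

```python
def crawl_cost(pages: int, start_fee: float, per_page_fee: float) -> float:
    """Total pay-per-event cost: one-time start fee + per-page fee * pages."""
    return round(start_fee + pages * per_page_fee, 4)

free_100 = crawl_cost(100, 0.010, 0.003)      # FREE plan, 100 pages -> 0.31
gold_100 = crawl_cost(100, 0.0075, 0.00195)   # GOLD plan, 100 pages
```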
How to crawl documentation for RAG
1. Go to Docs-to-RAG Crawler on Apify Store
2. Click Try for free
3. In the Documentation URL field, enter the root URL of the docs you want to crawl (e.g., https://docs.example.com)
4. Set Max pages (start with 20-50 to preview results, then increase)
5. Choose Chunk mode: `heading` (recommended) splits at H2/H3 boundaries; `page` outputs one chunk per full page
6. Click Start and wait for the crawl to finish
7. Click Export → JSON or CSV to download your chunks
8. Load the JSON into your vector database (Pinecone, Weaviate, Chroma, Qdrant, etc.)
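The last step of the walkthrough boils down to mapping each exported chunk onto your vector DB's upsert format. A minimal, database-agnostic sketch using the output fields from the table above (`to_upsert_records` is illustrative, not part of the actor):

```python
import json

def to_upsert_records(chunks: list[dict]) -> list[dict]:
    """Map actor output chunks onto generic (id, text, metadata) upsert records."""
    records = []
    for chunk in chunks:
        records.append({
            "id": chunk["chunkId"],        # stable: re-crawls overwrite, not duplicate
            "text": chunk["content"],      # the text you embed
            "metadata": {
                "url": chunk["url"],
                "breadcrumb": chunk["headingHierarchy"],
                "words": chunk["wordCount"],
            },
        })
    return records

# Example with a single exported chunk (shape taken from this page's output docs).
chunks = json.loads(
    '[{"chunkId": "quick-start-5fd5eded", "title": "Quick Start", '
    '"url": "https://crawlee.dev/docs/quick-start", '
    '"headingHierarchy": "Quick Start", "content": "# Quick Start", '
    '"wordCount": 22}]'
)
records = to_upsert_records(chunks)
```

Most vector DB clients accept exactly this triple of ID, text, and metadata, so the same records can feed Pinecone, Weaviate, Chroma, or Qdrant with only the final client call differing.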
Example inputs for common scenarios:
Crawl a Docusaurus site, skip blog and changelog:
```json
{
  "startUrl": "https://docusaurus.io/docs",
  "maxPages": 200,
  "chunkMode": "heading",
  "maxChunkWords": 300,
  "excludePatterns": ["*/blog/*", "*/changelog/*"]
}
```
Crawl a ReadTheDocs project, text-only (no code):
```json
{
  "startUrl": "https://docs.python-requests.org/en/latest/",
  "maxPages": 50,
  "chunkMode": "heading",
  "includeCodeBlocks": false
}
```
Crawl a GitBook space, one chunk per page:
```json
{
  "startUrl": "https://docs.myapp.gitbook.io/",
  "maxPages": 100,
  "chunkMode": "page",
  "maxChunkWords": 1000
}
```
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | – | Root URL of the documentation site to crawl (required) |
| `maxPages` | integer | `100` | Maximum number of pages to crawl |
| `includeCodeBlocks` | boolean | `true` | Whether to include code blocks in chunks |
| `chunkMode` | string | `"heading"` | `"heading"` splits at H2/H3; `"page"` outputs one chunk per full page |
| `maxChunkWords` | integer | `300` | Maximum words per chunk; oversized sections are split at paragraph boundaries |
| `linkSelector` | string | – | Custom CSS selector for navigation links (leave empty for auto-detection) |
| `excludePatterns` | array | `[]` | URL glob patterns to skip (e.g., `"*/blog/*"`, `"*/release-notes/*"`) |
| `waitForSelector` | string | – | Reserved for a future JS-rendering mode |
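To predict which URLs an `excludePatterns` entry will skip, fnmatch-style glob matching is a reasonable mental model. This is an assumption: the actor may use a different glob implementation, so treat this as an approximation rather than its exact matching rules:

```python
from fnmatch import fnmatch

def is_excluded(url: str, patterns: list[str]) -> bool:
    """True if the URL matches any glob-style exclude pattern."""
    return any(fnmatch(url, pattern) for pattern in patterns)

patterns = ["*/blog/*", "*/release-notes/*"]
skip = is_excluded("https://docs.example.com/blog/2026-roadmap", patterns)  # matched
keep = is_excluded("https://docs.example.com/docs/intro", patterns)         # not matched
```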
Output examples
Heading-mode output (`chunkMode: "heading"`):
```json
[
  {
    "chunkId": "quick-start-5fd5eded",
    "title": "Quick Start",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start",
    "content": "# Quick Start\n\nWith this short tutorial you can start scraping with Crawlee in a minute or two.",
    "wordCount": 22,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  },
  {
    "chunkId": "cheerio-crawler-b1a2c3d4",
    "title": "CheerioCrawler",
    "url": "https://crawlee.dev/docs/quick-start",
    "headingHierarchy": "Quick Start > Choose your crawler > CheerioCrawler",
    "content": "### CheerioCrawler\n\nCheerioCrawler downloads each URL using a plain HTTP request and parses the HTML with Cheerio...",
    "wordCount": 64,
    "metadata": {
      "siteName": "Crawlee",
      "scrapedAt": "2026-04-05T10:00:00.000Z",
      "platform": "docusaurus"
    }
  }
]
```
Tips for best results
- Start small: set `maxPages: 20` first to preview chunk quality, then increase
- Use the docs root: set `startUrl` to the documentation root (e.g., `/docs` or `/docs/intro`), not a specific page
- Tune chunk size: 200-400 words per chunk works well for most embedding models (~270-530 tokens). Smaller chunks = more precise retrieval but more DB entries
- Exclude noise pages: use `excludePatterns` to skip changelog, blog, API reference (auto-generated), and release notes pages
- Code blocks: keep `includeCodeBlocks: true` for code-heavy docs (frameworks, SDKs); set it to `false` for prose-heavy docs (tutorials, guides) where code snippets reduce embedding quality
- Schedule re-crawls: set up a weekly cron to keep your knowledge base fresh; stable `chunkId`s let you upsert without duplicates
- Page mode for long-form content: use `chunkMode: "page"` for docs with very long pages and few headings (like API references with a single long page)
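The "oversized sections are split at paragraph boundaries" behavior behind `maxChunkWords` can be sketched roughly like this. It is an illustrative simplification; the actor's internal splitter may differ in detail:

```python
def split_at_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than
    cut mid-sentence (illustrative simplification).
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Flush the current chunk before this paragraph would overflow it.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only between paragraphs is what keeps each chunk a coherent unit of prose for embedding, rather than a fragment cut at an arbitrary character offset.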
Integrations
Docs-to-RAG Crawler → Pinecone (auto-indexed knowledge base)
- Run the actor on a schedule (weekly) → export the dataset as JSON → load into Pinecone with chunk `content` as text and `chunkId` as vector ID → upsert without worrying about duplicates
Docs-to-RAG Crawler → OpenAI Embeddings → Weaviate
- Export chunks as JSON → batch-process through `text-embedding-3-small` → store vectors with `headingHierarchy` and `url` as metadata for rich retrieval
Docs-to-RAG Crawler → Make/Zapier → Slack bot
- Trigger on new dataset items → embed each chunk → surface answers in a Slack chatbot that cites source URLs
Docs-to-RAG Crawler → Google Sheets
- Use Apify's Google Sheets integration to export all chunks to a spreadsheet; useful for reviewing chunk quality before ingesting into a vector DB
Docs-to-RAG Crawler → Webhook → LlamaIndex pipeline
- Set up a webhook to trigger your LlamaIndex indexing pipeline immediately when the crawl completes; zero manual steps for fresh documentation indexing
Scheduled re-indexing workflow
- Create a daily/weekly schedule on Apify → the actor outputs only pages that exist in the current crawl → diff against your vector DB to add new, update changed, and remove deleted chunks using the stable `chunkId`
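The diff step in the scheduled re-indexing workflow reduces to set operations over `chunkId`s. A sketch of deciding what to add, re-embed, and delete; comparing stored content strings for "changed" is one option among several (you could equally compare content hashes):

```python
def diff_chunks(old: dict[str, str], new: dict[str, str]):
    """Compare {chunkId: content} maps from the last index vs. the fresh crawl.

    Returns chunk IDs to add, re-embed (changed), and delete.
    """
    added = [cid for cid in new if cid not in old]
    removed = [cid for cid in old if cid not in new]
    changed = [cid for cid in new if cid in old and new[cid] != old[cid]]
    return added, changed, removed

old_index = {"a-1": "alpha", "b-2": "beta"}
fresh_crawl = {"a-1": "alpha v2", "c-3": "gamma"}
added, changed, removed = diff_chunks(old_index, fresh_crawl)
```

Because the IDs are stable across crawls, "changed" means the page content moved under the same URL and heading, so you re-embed in place instead of accumulating stale vectors.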
Using the Apify API
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/docs-rag-crawler').call({
    startUrl: 'https://docs.example.com',
    maxPages: 100,
    chunkMode: 'heading',
    maxChunkWords: 300,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Collected ${items.length} chunks`);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 100,
    'chunkMode': 'heading',
    'maxChunkWords': 300,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(items)} chunks')
```
cURL
```shell
curl -X POST "https://api.apify.com/v2/acts/automation-lab~docs-rag-crawler/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "maxPages": 100,
    "chunkMode": "heading",
    "maxChunkWords": 300
  }'
```
Use with AI agents via MCP
Docs-to-RAG Crawler is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client; this gives you access to all Apify actors, including this one:
Setup for Claude Code
```shell
claude mcp add --transport http apify "https://mcp.apify.com?token=YOUR_APIFY_TOKEN"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
```json
{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?token=YOUR_APIFY_TOKEN&tools=automation-lab/docs-rag-crawler"
    }
  }
}
```
Example prompts
Once connected, you can ask your AI assistant:
- "Crawl https://docs.fastapi.tiangolo.com and output 300-word heading chunks for RAG ingestion"
- "Index the Stripe API docs at https://stripe.com/docs, skip the changelog pages, max 500 pages"
- "Crawl https://docs.langchain.com, text only (no code blocks), and export as chunks for embedding"
Is it legal to crawl documentation sites?
Docs-to-RAG Crawler is designed for ethical use of publicly available documentation.
Most documentation sites explicitly encourage crawling and indexing; that's the point of public documentation. However, always check the site's `robots.txt` and Terms of Service before crawling at scale.
Best practices:
- Only crawl documentation you have permission to use
- Respect `robots.txt` crawl delays and disallow rules
- Don't crawl private or authenticated documentation without authorization
- GDPR: documentation sites rarely contain personal data, but be mindful if internal wikis do
This actor crawls public pages using standard HTTP requests, the same as a web browser. It does not bypass authentication, CAPTCHA systems, or access controls.
FAQ
How many chunks does a typical documentation site produce?
A small library docs site (50 pages) typically produces 200-800 chunks. A large framework like React or Next.js (300+ pages) can produce 2,000-5,000 chunks with heading-mode chunking.
How long does a crawl take?
Very fast β pure HTTP, no JavaScript rendering needed. A 100-page docs site typically completes in 20-30 seconds. A 500-page site in under 2 minutes.
What is the recommended chunk size for RAG?
200-400 words per chunk (~270-530 tokens for GPT-4 tokenizer) is the sweet spot for most retrieval scenarios. Smaller chunks improve precision but require more DB storage and API calls. Larger chunks preserve more context but may dilute retrieval accuracy.
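The word-to-token conversion quoted above uses the common rough ratio of ~1.33 tokens per English word. This is an approximation; actual counts depend on the tokenizer and the text (code-heavy chunks tokenize less efficiently):

```python
def approx_tokens(words: int, tokens_per_word: float = 1.33) -> int:
    """Rough token estimate for sizing chunks against an embedding model's limit."""
    return round(words * tokens_per_word)

low = approx_tokens(200)   # lower end of the recommended chunk size
high = approx_tokens(400)  # upper end of the recommended chunk size
```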
Why are some chunks very short or empty?
Short chunks usually come from heading-only sections or navigation artifacts. The actor automatically skips chunks with fewer than 5 words. Very short but non-empty chunks (10-50 words) are kept because they may contain important introductory text.
Why does the crawler only find some pages and not others?
The crawler discovers pages by following links in the navigation sidebar and content area. If your docs site uses JavaScript to load the sidebar dynamically, Cheerio mode won't see those links. In that case, try setting `linkSelector` to a CSS selector that targets the nav links in the raw HTML (check the source with Ctrl+U in your browser). GitBook sites that render entirely via JS may need a future Playwright-mode upgrade.
The chunks contain "Version: 3.x" or "On this page" text. How do I remove it?
These artifacts are stripped automatically by the actor's Docusaurus-specific noise filters. If you see them, please open an issue with the URL and we'll add a filter for that platform.
Why did the actor crawl pages from a different subdomain?
The crawler only follows links to the same hostname as the `startUrl`. If you see pages from a different subdomain, the docs site has a subdomain structure (e.g., docs.example.com linking to api.example.com). Use `excludePatterns` to filter those out, or open a feature request for subdomain scoping.
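The same-hostname scoping described in this answer can be checked with the standard library (a sketch of the comparison; the actor's actual scoping logic is not published here):

```python
from urllib.parse import urlparse

def same_host(start_url: str, link: str) -> bool:
    """True if link shares the exact hostname of start_url (subdomains count as different)."""
    return urlparse(start_url).hostname == urlparse(link).hostname

ok = same_host("https://docs.example.com/intro", "https://docs.example.com/api")
cross = same_host("https://docs.example.com/intro", "https://api.example.com/ref")
```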
Other documentation and AI tools
Explore more automation-lab actors for AI and data workflows:
- Web Scraper: general-purpose web scraping
- JSON Schema Generator: auto-generate JSON schemas from sample data
- Color Contrast Checker: WCAG 2.1 AA/AAA accessibility validation