Documentation Crawler for RAG
Specialized crawler for developer documentation sites. Detects frameworks (Docusaurus, GitBook, ReadTheDocs, MkDocs, Sphinx), extracts clean content, and outputs semantically chunked Markdown optimized for RAG pipelines.
Developer: Izz
Last modified: 8 days ago
What is Documentation Crawler for RAG?
This Actor crawls developer documentation websites and converts them into semantically chunked Markdown — ready for RAG pipelines, vector databases, and LLM applications. It automatically detects whether a site is built with Docusaurus, GitBook, ReadTheDocs, MkDocs, or Sphinx, and uses framework-specific selectors to extract only the documentation content — no sidebars, navigation, footers, or version badges.
Generic crawlers dump entire pages into one text blob, including all the UI noise. That degrades your RAG retrieval quality. This Actor solves that by understanding how documentation frameworks structure their HTML, and extracting only what matters.
What can this Actor do?
- Detect documentation frameworks automatically. Analyzes meta tags, CSS classes, URL patterns, and HTML structure to identify the framework. Each framework gets its own content selectors — Sphinx uses `.body`, Docusaurus uses `.theme-doc-markdown`, MkDocs uses `.md-content__inner`, and so on.
- Split content into semantic chunks. Splits at heading boundaries (H1 → H2 → H3) instead of by character count. Each chunk carries its heading path as metadata — e.g., `"section": "API > Authentication > OAuth"`. Code blocks are never split across chunks.
- Preserve documentation hierarchy. Each chunk includes breadcrumbs (page position in the docs tree) and section paths (heading hierarchy within the page). Your RAG system can filter by topic or weight results by depth.
- Use llms.txt when available. Checks for `/llms-full.txt` before crawling. If found, uses it directly — one HTTP request instead of crawling hundreds of pages. About 52% of major docs sites support this.
- Deduplicate with content hashes. SHA-256 hash per chunk. Compare hashes between crawl runs to detect changes without diffing full text.
- Work on any site. Unknown frameworks fall back to generic selectors (`article`, `main`, `[role='main']`). The Actor is not limited to the five supported frameworks.
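The heading-boundary splitting described above — chunks that carry their heading path and a SHA-256 `contentHash` — can be sketched in plain Python. This is an illustrative simplification, not the Actor's actual code (the real chunker also enforces a size target and keeps code blocks intact):

```python
import hashlib
import re

def chunk_by_headings(markdown: str) -> list:
    """Split Markdown at heading boundaries, tracking the heading path."""
    chunks, path, lines = [], [], []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({
                "section": " > ".join(path),
                "content": text,
                "contentHash": hashlib.sha256(text.encode()).hexdigest(),
            })
        lines.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                        # a new heading closes the previous chunk
            level = len(m.group(1))
            del path[level - 1:]           # drop headings at this level or deeper
            path.append(m.group(2).strip())
        lines.append(line)
    flush()
    return chunks
```

Given `"# API\n\nintro\n\n## Auth\n\ndetails"`, this yields two chunks with section paths `"API"` and `"API > Auth"`.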
What data does this Actor extract?
| Field | Example | Description |
|---|---|---|
| `url` | https://docs.python.org/3/library/json.html | Source page URL |
| `title` | json — JSON encoder and decoder | Page title (cleaned) |
| `content` | `## Basic Usage\n\njson.dump(obj, fp)...` | Clean Markdown, one chunk per item |
| `breadcrumbs` | `["Python Standard Library", "Internet Data Handling"]` | Page position in the docs tree |
| `framework` | sphinx | Detected framework |
| `section` | json > Basic Usage | Heading path within the page |
| `contentHash` | a9925077f5be3d02... | SHA-256 for deduplication |
| `chunkIndex` / `totalChunks` | 2 / 22 | Chunk position within the page |
How to crawl documentation for your RAG pipeline
- Click Try for free to open the Actor in Apify Console.
- Enter one or more documentation URLs in Start URLs — e.g., `https://docs.python.org/3/library/` or `https://react.dev/learn`.
- Set Max Pages (default: 100). The Actor discovers pages via sitemap.xml and link-following, so you only need the root URL.
- Choose Output Format: `json` for semantic chunks (best for RAG), `markdown` for full pages, or `jsonl` for streaming.
- Click Start. The Actor detects the framework, crawls, chunks, and pushes results to the dataset.
- Download from the Dataset tab as JSON, CSV, or Excel — or fetch via API.
Using the Apify API
```bash
curl -X POST "https://api.apify.com/v2/acts/liquid_bark~docs-crawler-for-rag/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://docs.python.org/3/" }],
    "maxPages": 50,
    "outputFormat": "json",
    "chunkSize": 1500
  }'
```
Also available via Python SDK and JavaScript SDK.
Integrating with a vector database
The JSON output is designed for direct ingestion into Pinecone, Weaviate, Qdrant, Chroma, or pgvector:
- Run the Actor to crawl a documentation site.
- Fetch the dataset: `GET /v2/datasets/{datasetId}/items`.
- For each item: embed `content`, use `contentHash` as dedup key, and store `section` + `breadcrumbs` as metadata.
- On subsequent crawls, compare hashes to upsert only changed chunks.
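The upsert-by-hash step can be sketched with an in-memory dict standing in for the vector database. This is illustrative only: `embed` is a placeholder for your embedding call, and a real pipeline would use the database's own upsert API:

```python
def upsert_changed(store: dict, items: list, embed) -> int:
    """Upsert only chunks whose contentHash changed since the last run.

    `store` maps a chunk key (url + chunk index) to its stored record.
    Returns the number of chunks that were (re-)embedded.
    """
    changed = 0
    for item in items:
        meta = item["metadata"]
        key = f'{item["url"]}#{meta["chunkIndex"]}'
        prev = store.get(key)
        if prev and prev["contentHash"] == meta["contentHash"]:
            continue                      # unchanged: skip the embedding cost
        store[key] = {
            "contentHash": meta["contentHash"],
            "vector": embed(item["content"]),
            "section": meta["section"],
            "breadcrumbs": meta.get("breadcrumbs", []),
        }
        changed += 1
    return changed
```

On a second run with identical hashes, `upsert_changed` embeds nothing, which is what makes scheduled re-crawls cheap.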
How much does it cost to crawl documentation?
$2.00 per 1,000 output items (pay-per-event). One item = one chunk in JSON mode or one page in Markdown mode. A typical page produces 3–5 chunks.
| Scenario | Pages | Items | Event cost | Compute | Total |
|---|---|---|---|---|---|
| Single library (e.g., Python json) | 1 | ~22 | ~$0.04 | <$0.01 | ~$0.05 |
| Small docs site | 50 | ~200 | ~$0.40 | ~$0.03 | ~$0.43 |
| Medium docs site | 100 | ~400 | ~$0.80 | ~$0.05 | ~$0.85 |
| Large docs site | 500 | ~2,000 | ~$4.00 | ~$0.25 | ~$4.25 |
| Full page export (Markdown) | 100 | 100 | ~$0.20 | ~$0.05 | ~$0.25 |
Sites crawled with `renderJs: true` use more compute (headless browser).
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs to start crawling. Follows links within the same domain. Max 100 URLs. |
| `maxPages` | integer | 100 | Maximum pages to crawl. 0 = unlimited (capped at 50,000). |
| `outputFormat` | string | json | `json` (chunks + metadata), `markdown` (full pages), `jsonl` (streaming). |
| `chunkSize` | integer | 1500 | Target chunk size in characters (200–10,000). May exceed target to keep code blocks intact. |
| `framework` | string | auto | `auto`, `docusaurus`, `gitbook`, `readthedocs`, `mkdocs`, `sphinx`. Override if auto-detection fails. |
| `renderJs` | boolean | false | Headless browser for JS-heavy sites (GitBook, newer Docusaurus). Slower, more compute. |
| `useSitemap` | boolean | true | Discover pages via sitemap.xml. Disable if the sitemap is broken. |
Output
Real example from crawling Python json docs — this single page produced 22 chunks:
```json
{
  "url": "https://docs.python.org/3/library/json.html",
  "title": "json — JSON encoder and decoder",
  "content": "## Basic Usage\n\njson.dump(*obj*, *fp*, ***, *skipkeys=False*, ...)\n\nSerialize *obj* as a JSON formatted stream to *fp*...",
  "metadata": {
    "breadcrumbs": [],
    "framework": "sphinx",
    "section": "json — JSON encoder and decoder > Basic Usage",
    "contentHash": "a9925077f5be3d02e8f1c4a7b6d8e9f0...",
    "chunkIndex": 2,
    "totalChunks": 22
  }
}
```
Each chunk knows its section path, framework, and position. Your RAG pipeline can use `section` to filter by topic and `contentHash` for incremental updates.
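For example, selecting all chunks under one heading path, in page order, is a few lines of plain Python over a downloaded dataset (a sketch; `items` is assumed to be the list fetched from the dataset API):

```python
def chunks_for_topic(items: list, prefix: str) -> list:
    """Select chunks whose section path starts with `prefix`, in page order."""
    selected = [
        it for it in items
        if it["metadata"]["section"].startswith(prefix)
    ]
    # Sort by page URL, then by position within the page.
    return sorted(
        selected,
        key=lambda it: (it["url"], it["metadata"]["chunkIndex"]),
    )
```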
Supported frameworks
| Framework | Detected via | Extracts from | Example sites |
|---|---|---|---|
| Docusaurus | Meta generator, data-theme, theme CSS classes | .theme-doc-markdown | React, Crawlee, Jest |
| GitBook | URL domain, gitbook-root class, OpenGraph | .page-body, .markdown-section | Startup docs (often need `renderJs: true`) |
| ReadTheDocs | URL domain, rst-content class | .rst-content | Django, Flask, Python projects |
| MkDocs | Generator meta tag, md-content, data-md-component | .md-content__inner | Pydantic, FastAPI |
| Sphinx | Generator meta tag, sphinxsidebar, _static/ links | .body | Python stdlib, Linux kernel |
| Unknown | — | article, main, [role='main'] | Any site with semantic HTML |
For each framework, the Actor removes framework-specific noise (sidebars, TOC, pagination, edit links, version selectors) before extracting content. This is the key difference from generic crawlers: a Docusaurus sidebar and a Sphinx sidebar have completely different HTML — the Actor knows how to strip both.
If you know which framework your target site uses, you can set the `framework` parameter to skip auto-detection. This is useful for sites where detection signals are hidden behind client-side rendering.
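Detection of this kind typically boils down to ordered checks against the signals in the table above. A toy version, using simple substring matching rather than the Actor's real logic:

```python
def detect_framework(html: str, url: str = "") -> str:
    """Toy framework detector based on the signals listed above."""
    signals = [
        ("docusaurus", ["docusaurus", "theme-doc-markdown"]),
        ("gitbook", ["gitbook-root", ".gitbook.io"]),
        ("readthedocs", ["rst-content", "readthedocs.io"]),
        ("mkdocs", ["mkdocs", "md-content"]),
        ("sphinx", ["sphinx", "sphinxsidebar"]),
    ]
    haystack = (html + " " + url).lower()
    for name, markers in signals:
        if any(marker in haystack for marker in markers):
            return name
    return "unknown"      # falls back to generic selectors
```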
llms.txt support
Before crawling, the Actor checks if the site provides llms.txt files:
- `llms-full.txt` found — uses it directly as the content source. No crawling needed — one HTTP request instead of hundreds of pages. The content is chunked using the same semantic splitting as crawled pages.
- `llms.txt` found (no `llms-full.txt`) — logged, but the site is crawled normally. `llms.txt` is typically just an index.
- Neither found — crawls via sitemap + link-following as usual.
Sites with llms-full.txt include Astro, Vue, AWS, LangChain, Crawlee, and Svelte documentation.
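The fallback order above can be expressed as a small decision function. This is a sketch of the described behavior, not the Actor's code; the `exists` check is injected (e.g. an HTTP HEAD probe) so the logic stays testable without network access:

```python
from urllib.parse import urljoin

def pick_content_source(base_url: str, exists) -> str:
    """Probe for llms.txt variants before crawling (sketch)."""
    full = urljoin(base_url, "/llms-full.txt")
    index = urljoin(base_url, "/llms.txt")
    if exists(full):
        return full              # use directly: one request, same chunking
    if exists(index):
        # llms.txt alone is just an index, so we still crawl.
        print(f"note: {index} found, but it is only an index; crawling")
    return "crawl"
```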
Use cases
RAG pipelines. Crawl documentation sites, ingest chunks into a vector database (Pinecone, Weaviate, Qdrant, Chroma, pgvector), and use retrieval-augmented generation to answer questions. The `section` metadata lets you filter by topic, and `contentHash` makes incremental updates efficient.
Coding assistant context. Feed chunks into Cursor, GitHub Copilot, or Claude so they reference up-to-date documentation when generating code. The heading hierarchy helps the model understand where each piece fits.
Documentation search. Build semantic search over technical docs. `breadcrumbs` and `section` provide faceted filtering, and clean Markdown ensures results without UI noise.
Change monitoring. Run on a schedule and compare `contentHash` between runs to detect API changes, deprecations, or new features in libraries you depend on.
Documentation migration. Export from multiple sources into uniform Markdown for import into a knowledge base or wiki.
Examples
Python docs (Sphinx) — semantic chunks
```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/library/" }],
  "maxPages": 50,
  "outputFormat": "json"
}
```
Outputs ~200 chunks with section paths like "json > Basic Usage" and "os > File and Directory Access".
GitBook site — with JavaScript rendering
```json
{
  "startUrls": [{ "url": "https://docs.example.gitbook.io" }],
  "renderJs": true,
  "framework": "gitbook"
}
```
Full page export — Markdown, no chunking
```json
{
  "startUrls": [{ "url": "https://react.dev/learn" }],
  "maxPages": 100,
  "outputFormat": "markdown"
}
```
Multiple sites in one run
```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" },
    { "url": "https://docs.pydantic.dev/latest/" }
  ],
  "maxPages": 200
}
```
Auto-detects Sphinx for Python and MkDocs for Pydantic. The `framework` field in each chunk tells you which site it came from.
Tips for best results
- Start with `auto` framework detection. Only override if extraction looks wrong.
- Use `renderJs: true` only when needed. Try without it first — it is slower and costs more.
- Set `maxPages` low initially (10–20) to verify output quality before full crawls.
- `chunkSize: 1500` is a good default for RAG. Use 500–800 for precise Q&A, 3000–5000 for summarization.
FAQ
Is it legal to crawl documentation sites?
This Actor accesses only publicly available pages and respects robots.txt via Crawlee's built-in compliance. Developer documentation is published to be read and referenced. The Actor does not bypass authentication, paywalls, or access restrictions. Always check the target site's terms of service.
What if my site is not one of the supported frameworks?
The Actor still works. Unknown frameworks fall back to generic selectors (`article`, `main`, `[role='main']`) with standard noise removal. Output quality is usually good for any site with semantic HTML; framework detection just makes extraction more precise.
Why are some chunks larger than the target size?
Code blocks are never split. If a code block exceeds the target, the entire block stays in one chunk. Headings are also hard boundaries — a short section becomes its own chunk rather than merging with the next one. Oversized paragraphs (large tables, lists, dense text) are split at structural boundaries — table rows, list items, or line breaks — while preserving table headers in each sub-chunk.
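The "never split a code block" rule can be sketched as a pre-tokenization step: fenced blocks become atomic units before size-based packing. This is a simplification of the behavior described above (the Actor also splits at headings and finer structural boundaries):

```python
def split_preserving_fences(markdown: str, target: int) -> list:
    """Pack Markdown into ~target-char chunks without splitting fenced code."""
    # Pass 1: group lines into atomic units — whole fenced blocks, or paragraphs.
    units, buf, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith("```"):
            if not in_fence and buf:      # close the paragraph before the fence
                units.append("\n".join(buf))
                buf = []
            buf.append(line)
            if in_fence:                  # closing fence: emit the whole block
                units.append("\n".join(buf))
                buf = []
            in_fence = not in_fence
        elif not in_fence and not line.strip():
            if buf:                       # blank line ends a paragraph
                units.append("\n".join(buf))
                buf = []
        else:
            buf.append(line)
    if buf:
        units.append("\n".join(buf))

    # Pass 2: greedily pack units; a unit larger than target gets its own chunk.
    chunks, cur = [], ""
    for unit in units:
        if cur and len(cur) + len(unit) > target:
            chunks.append(cur)
            cur = ""
        cur = f"{cur}\n\n{unit}" if cur else unit
    if cur:
        chunks.append(cur)
    return chunks
```

A fenced block longer than `target` simply becomes an oversized chunk of its own, which is exactly the behavior the FAQ answer describes.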
Does llms-full.txt content stay fresh?
The Actor uses `llms-full.txt` as-is, without checking Last-Modified headers, so if the file is stale the output may not reflect recent documentation changes. To disable the fast path, run without `useSitemap` and the Actor will crawl pages directly instead.
Can I run this on a schedule?
Yes. Use Apify Schedules for daily or weekly crawls, and compare `contentHash` values between runs for automated documentation change detection.
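The between-run comparison can be sketched as a diff over two `{chunk_key: contentHash}` maps (how you build those maps from the dataset is up to you; this is only the comparison step):

```python
def diff_runs(prev: dict, curr: dict) -> dict:
    """Compare {chunk_key: contentHash} maps from two crawl runs."""
    return {
        "added":   sorted(k for k in curr if k not in prev),
        "removed": sorted(k for k in prev if k not in curr),
        "changed": sorted(
            k for k in curr if k in prev and curr[k] != prev[k]
        ),
    }
```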
How do I get help?
Open an issue on this Actor's Issues tab in Apify Console. Include the Run ID and the URL that caused the problem.