Documentation Crawler for RAG
Specialized crawler for developer documentation sites. Detects frameworks (Docusaurus, GitBook, ReadTheDocs, MkDocs, Sphinx), extracts clean content, and outputs semantically chunked Markdown optimized for RAG pipelines.
Developer: Izz
Last modified: 8 days ago
What is Documentation Crawler for RAG?
This Actor crawls developer documentation websites and converts them into semantically chunked Markdown — ready for RAG pipelines, vector databases, and LLM applications. It automatically detects whether a site is built with Docusaurus, GitBook, ReadTheDocs, MkDocs, or Sphinx, and uses framework-specific selectors to extract only the documentation content — no sidebars, navigation, footers, or version badges.
Generic crawlers dump entire pages into one text blob, including all the UI noise. That degrades your RAG retrieval quality. This Actor solves that by understanding how documentation frameworks structure their HTML, and extracting only what matters.
What can this Actor do?
- Detect documentation frameworks automatically. Analyzes meta tags, CSS classes, URL patterns, and HTML structure to identify the framework. Each framework gets its own content selectors — Sphinx uses `.body`, Docusaurus uses `.theme-doc-markdown`, MkDocs uses `.md-content__inner`, and so on.
- Split content into semantic chunks. Splits at heading boundaries (H1 → H2 → H3) instead of by character count. Each chunk carries its heading path as metadata — e.g., `"section": "API > Authentication > OAuth"`. Code blocks are never split across chunks.
- Preserve documentation hierarchy. Each chunk includes breadcrumbs (page position in the docs tree) and section paths (heading hierarchy within the page). Your RAG system can filter by topic or weight results by depth.
- Use llms.txt when available. Checks for `/llms-full.txt` before crawling. If found, uses it directly — one HTTP request instead of crawling hundreds of pages. About 52% of major docs sites support this.
- Deduplicate with content hashes. SHA-256 hash per chunk. Compare hashes between crawl runs to detect changes without diffing full text.
- Work on any site. Unknown frameworks fall back to generic selectors (`article`, `main`, `[role='main']`). The Actor is not limited to the five supported frameworks.
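The heading-boundary splitting described above — chunks that carry their heading path and a SHA-256 `contentHash` — can be sketched in plain Python. This is an illustrative simplification, not the Actor's actual code (the real chunker also enforces a size target and keeps code blocks intact):

```python
import hashlib
import re

def chunk_by_headings(markdown: str) -> list:
    """Split Markdown at heading boundaries, tracking the heading path."""
    chunks, path, lines = [], [], []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({
                "section": " > ".join(path),
                "content": text,
                "contentHash": hashlib.sha256(text.encode()).hexdigest(),
            })
        lines.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                        # a new heading closes the previous chunk
            level = len(m.group(1))
            del path[level - 1:]           # drop headings at this level or deeper
            path.append(m.group(2).strip())
        lines.append(line)
    flush()
    return chunks
```

Given `"# API\n\nintro\n\n## Auth\n\ndetails"`, this yields two chunks with section paths `"API"` and `"API > Auth"`.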
What data does this Actor extract?
| Field | Example | Description |
|---|---|---|
| `url` | https://docs.python.org/3/library/json.html | Source page URL |
| `title` | json — JSON encoder and decoder | Page title (cleaned) |
| `content` | `## Basic Usage\n\njson.dump(obj, fp)...` | Clean Markdown, one chunk per item |
| `breadcrumbs` | `["Python Standard Library", "Internet Data Handling"]` | Page position in the docs tree |
| `framework` | sphinx | Detected framework |
| `section` | json > Basic Usage | Heading path within the page |
| `contentHash` | a9925077f5be3d02... | SHA-256 for deduplication |
| `chunkIndex` / `totalChunks` | 2 / 22 | Chunk position within the page |
How to crawl documentation for your RAG pipeline
- Click Try for free to open the Actor in Apify Console.
- Enter one or more documentation URLs in Start URLs — e.g., `https://docs.python.org/3/library/` or `https://react.dev/learn`.
- Set Max Pages (default: 100). The Actor discovers pages via sitemap.xml and link-following, so you only need the root URL.
- Choose Output Format: `json` for semantic chunks (best for RAG), `markdown` for full pages, or `jsonl` for streaming.
- Click Start. The Actor detects the framework, crawls, chunks, and pushes results to the dataset.
- Download from the Dataset tab as JSON, CSV, or Excel — or fetch via API.
Using the Apify API
```bash
curl -X POST "https://api.apify.com/v2/acts/liquid_bark~docs-crawler-for-rag/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{ "url": "https://docs.python.org/3/" }],
    "maxPages": 50,
    "outputFormat": "json",
    "chunkSize": 1500
  }'
```
Also available via Python SDK and JavaScript SDK.
Integrating with a vector database
The JSON output is designed for direct ingestion into Pinecone, Weaviate, Qdrant, Chroma, or pgvector:
- Run the Actor to crawl a documentation site.
- Fetch the dataset: `GET /v2/datasets/{datasetId}/items`.
- For each item: embed `content`, use `contentHash` as dedup key, and store `section` + `breadcrumbs` as metadata.
- On subsequent crawls, compare hashes to upsert only changed chunks.
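The upsert-by-hash step can be sketched with an in-memory dict standing in for the vector database. This is illustrative only: `embed` is a placeholder for your embedding call, and a real pipeline would use the database's own upsert API:

```python
def upsert_changed(store: dict, items: list, embed) -> int:
    """Upsert only chunks whose contentHash changed since the last run.

    `store` maps a chunk key (url + chunk index) to its stored record.
    Returns the number of chunks that were (re-)embedded.
    """
    changed = 0
    for item in items:
        meta = item["metadata"]
        key = f'{item["url"]}#{meta["chunkIndex"]}'
        prev = store.get(key)
        if prev and prev["contentHash"] == meta["contentHash"]:
            continue                      # unchanged: skip the embedding cost
        store[key] = {
            "contentHash": meta["contentHash"],
            "vector": embed(item["content"]),
            "section": meta["section"],
            "breadcrumbs": meta.get("breadcrumbs", []),
        }
        changed += 1
    return changed
```

On a second run with identical hashes, `upsert_changed` embeds nothing, which is what makes scheduled re-crawls cheap.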
How much does it cost to crawl documentation?
$2.00 per 1,000 output items (pay-per-event). One item = one chunk in JSON mode or one page in Markdown mode. A typical page produces 3–5 chunks.
| Scenario | Pages | Items | Event cost | Compute | Total |
|---|---|---|---|---|---|
| Single library (e.g., Python json) | 1 | ~22 | ~$0.04 | <$0.01 | ~$0.05 |
| Small docs site | 50 | ~200 | ~$0.40 | ~$0.03 | ~$0.43 |
| Medium docs site | 100 | ~400 | ~$0.80 | ~$0.05 | ~$0.85 |
| Large docs site | 500 | ~2,000 | ~$4.00 | ~$0.25 | ~$4.25 |
| Full page export (Markdown) | 100 | 100 | ~$0.20 | ~$0.05 | ~$0.25 |
Sites crawled with `renderJs: true` use more compute (headless browser).
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs to start crawling. Follows links within the same domain. Max 100 URLs. |
| `maxPages` | integer | 100 | Maximum pages to crawl. 0 = unlimited (capped at 50,000). |
| `outputFormat` | string | json | `json` (chunks + metadata), `markdown` (full pages), `jsonl` (streaming). |
| `chunkSize` | integer | 1500 | Target chunk size in characters (200–10,000). May exceed target to keep code blocks intact. |
| `framework` | string | auto | `auto`, `docusaurus`, `gitbook`, `readthedocs`, `mkdocs`, `sphinx`. Override if auto-detection fails. |
| `renderJs` | boolean | false | Headless browser for JS-heavy sites (GitBook, newer Docusaurus). Slower, more compute. |
| `useSitemap` | boolean | true | Discover pages via sitemap.xml. Disable if the sitemap is broken. |
Output
Real example from crawling Python json docs — this single page produced 22 chunks:
```json
{
  "url": "https://docs.python.org/3/library/json.html",
  "title": "json — JSON encoder and decoder",
  "content": "## Basic Usage\n\njson.dump(*obj*, *fp*, ***, *skipkeys=False*, ...)\n\nSerialize *obj* as a JSON formatted stream to *fp*...",
  "metadata": {
    "breadcrumbs": [],
    "framework": "sphinx",
    "section": "json — JSON encoder and decoder > Basic Usage",
    "contentHash": "a9925077f5be3d02e8f1c4a7b6d8e9f0...",
    "chunkIndex": 2,
    "totalChunks": 22
  }
}
```
Each chunk knows its section path, framework, and position. Your RAG pipeline can use `section` to filter by topic and `contentHash` for incremental updates.
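For example, selecting all chunks under one heading path, in page order, is a few lines of plain Python over a downloaded dataset (a sketch; `items` is assumed to be the list fetched from the dataset API):

```python
def chunks_for_topic(items: list, prefix: str) -> list:
    """Select chunks whose section path starts with `prefix`, in page order."""
    selected = [
        it for it in items
        if it["metadata"]["section"].startswith(prefix)
    ]
    # Sort by page URL, then by position within the page.
    return sorted(
        selected,
        key=lambda it: (it["url"], it["metadata"]["chunkIndex"]),
    )
```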
Supported frameworks
| Framework | Detected via | Extracts from | Example sites |
|---|---|---|---|
| Docusaurus | Meta generator, data-theme, theme CSS classes | .theme-doc-markdown | React, Crawlee, Jest |
| GitBook | URL domain, gitbook-root class, OpenGraph | .page-body, .markdown-section | Startup docs (often need `renderJs: true`) |
| ReadTheDocs | URL domain, rst-content class | .rst-content | Django, Flask, Python projects |
| MkDocs | Generator meta tag, md-content, data-md-component | .md-content__inner | Pydantic, FastAPI |
| Sphinx | Generator meta tag, sphinxsidebar, _static/ links | .body | Python stdlib, Linux kernel |
| Unknown | — | article, main, [role='main'] | Any site with semantic HTML |
For each framework, the Actor removes framework-specific noise (sidebars, TOC, pagination, edit links, version selectors) before extracting content. This is the key difference from generic crawlers: a Docusaurus sidebar and a Sphinx sidebar have completely different HTML — the Actor knows how to strip both.
If you know which framework your target site uses, you can set the `framework` parameter to skip auto-detection. This is useful for sites where detection signals are hidden behind client-side rendering.
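Detection of this kind typically boils down to ordered checks against the signals in the table above. A toy version, using simple substring matching rather than the Actor's real logic:

```python
def detect_framework(html: str, url: str = "") -> str:
    """Toy framework detector based on the signals listed above."""
    signals = [
        ("docusaurus", ["docusaurus", "theme-doc-markdown"]),
        ("gitbook", ["gitbook-root", ".gitbook.io"]),
        ("readthedocs", ["rst-content", "readthedocs.io"]),
        ("mkdocs", ["mkdocs", "md-content"]),
        ("sphinx", ["sphinx", "sphinxsidebar"]),
    ]
    haystack = (html + " " + url).lower()
    for name, markers in signals:
        if any(marker in haystack for marker in markers):
            return name
    return "unknown"      # falls back to generic selectors
```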
llms.txt support
Before crawling, the Actor checks if the site provides llms.txt files:
- `llms-full.txt` found — uses it directly as the content source. No crawling needed — one HTTP request instead of hundreds of pages. The content is chunked using the same semantic splitting as crawled pages.
- `llms.txt` found (no `llms-full.txt`) — logged, but the site is crawled normally. `llms.txt` is typically just an index.
- Neither found — crawls via sitemap + link-following as usual.
Sites with llms-full.txt include Astro, Vue, AWS, LangChain, Crawlee, and Svelte documentation.
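The fallback order above can be expressed as a small decision function. This is a sketch of the described behavior, not the Actor's code; the `exists` check is injected (e.g. an HTTP HEAD probe) so the logic stays testable without network access:

```python
from urllib.parse import urljoin

def pick_content_source(base_url: str, exists) -> str:
    """Probe for llms.txt variants before crawling (sketch)."""
    full = urljoin(base_url, "/llms-full.txt")
    index = urljoin(base_url, "/llms.txt")
    if exists(full):
        return full              # use directly: one request, same chunking
    if exists(index):
        # llms.txt alone is just an index, so we still crawl.
        print(f"note: {index} found, but it is only an index; crawling")
    return "crawl"
```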
Use cases
RAG pipelines. Crawl documentation sites, ingest chunks into a vector database (Pinecone, Weaviate, Qdrant, Chroma, pgvector), and use retrieval-augmented generation to answer questions. The `section` metadata lets you filter by topic, and `contentHash` makes incremental updates efficient.
Coding assistant context. Feed chunks into Cursor, GitHub Copilot, or Claude so they reference up-to-date documentation when generating code. The heading hierarchy helps the model understand where each piece fits.
Documentation search. Build semantic search over technical docs. `breadcrumbs` and `section` provide faceted filtering, and clean Markdown ensures results without UI noise.
Change monitoring. Run on a schedule and compare `contentHash` between runs to detect API changes, deprecations, or new features in libraries you depend on.
Documentation migration. Export from multiple sources into uniform Markdown for import into a knowledge base or wiki.
Examples
Python docs (Sphinx) — semantic chunks
```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/library/" }],
  "maxPages": 50,
  "outputFormat": "json"
}
```
Outputs ~200 chunks with section paths like "json > Basic Usage" and "os > File and Directory Access".
GitBook site — with JavaScript rendering
```json
{
  "startUrls": [{ "url": "https://docs.example.gitbook.io" }],
  "renderJs": true,
  "framework": "gitbook"
}
```
Full page export — Markdown, no chunking
```json
{
  "startUrls": [{ "url": "https://react.dev/learn" }],
  "maxPages": 100,
  "outputFormat": "markdown"
}
```
Multiple sites in one run
```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" },
    { "url": "https://docs.pydantic.dev/latest/" }
  ],
  "maxPages": 200
}
```
Auto-detects Sphinx for Python and MkDocs for Pydantic. The `framework` field in each chunk tells you which site it came from.
Tips for best results
- Start with `auto` framework detection. Only override if extraction looks wrong.
- Use `renderJs: true` only when needed. Try without it first — it is slower and costs more.
- Set `maxPages` low initially (10–20) to verify output quality before full crawls.
- `chunkSize: 1500` is a good default for RAG. Use 500–800 for precise Q&A, 3000–5000 for summarization.
FAQ
Is it legal to crawl documentation sites?
This Actor accesses only publicly available pages and respects robots.txt via Crawlee's built-in compliance. Developer documentation is published to be read and referenced. The Actor does not bypass authentication, paywalls, or access restrictions. Always check the target site's terms of service.
What if my site is not one of the supported frameworks?
The Actor still works. Unknown frameworks fall back to generic selectors (`article`, `main`, `[role='main']`) with standard noise removal. Output quality is usually good for any site with semantic HTML; framework detection just makes extraction more precise.
Why are some chunks larger than the target size?
Code blocks are never split. If a code block exceeds the target, the entire block stays in one chunk. Headings are also hard boundaries — a short section becomes its own chunk rather than merging with the next one. Oversized paragraphs (large tables, lists, dense text) are split at structural boundaries — table rows, list items, or line breaks — while preserving table headers in each sub-chunk.
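The "never split a code block" rule can be sketched as a pre-tokenization step: fenced blocks become atomic units before size-based packing. This is a simplification of the behavior described above (the Actor also splits at headings and finer structural boundaries):

```python
def split_preserving_fences(markdown: str, target: int) -> list:
    """Pack Markdown into ~target-char chunks without splitting fenced code."""
    # Pass 1: group lines into atomic units — whole fenced blocks, or paragraphs.
    units, buf, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith("```"):
            if not in_fence and buf:      # close the paragraph before the fence
                units.append("\n".join(buf))
                buf = []
            buf.append(line)
            if in_fence:                  # closing fence: emit the whole block
                units.append("\n".join(buf))
                buf = []
            in_fence = not in_fence
        elif not in_fence and not line.strip():
            if buf:                       # blank line ends a paragraph
                units.append("\n".join(buf))
                buf = []
        else:
            buf.append(line)
    if buf:
        units.append("\n".join(buf))

    # Pass 2: greedily pack units; a unit larger than target gets its own chunk.
    chunks, cur = [], ""
    for unit in units:
        if cur and len(cur) + len(unit) > target:
            chunks.append(cur)
            cur = ""
        cur = f"{cur}\n\n{unit}" if cur else unit
    if cur:
        chunks.append(cur)
    return chunks
```

A fenced block longer than `target` simply becomes an oversized chunk of its own, which is exactly the behavior the FAQ answer describes.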
Does llms-full.txt content stay fresh?
The Actor uses `llms-full.txt` as-is, without checking Last-Modified headers, so if the file is stale the output may not reflect recent documentation changes. To disable the fast path, run without `useSitemap` and the Actor will crawl pages directly instead.
Can I run this on a schedule?
Yes. Use Apify Schedules for daily or weekly crawls, and compare `contentHash` values between runs for automated documentation change detection.
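The between-run comparison can be sketched as a diff over two `{chunk_key: contentHash}` maps (how you build those maps from the dataset is up to you; this is only the comparison step):

```python
def diff_runs(prev: dict, curr: dict) -> dict:
    """Compare {chunk_key: contentHash} maps from two crawl runs."""
    return {
        "added":   sorted(k for k in curr if k not in prev),
        "removed": sorted(k for k in prev if k not in curr),
        "changed": sorted(
            k for k in curr if k in prev and curr[k] != prev[k]
        ),
    }
```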
How do I get help?
Open an issue on this Actor's Issues tab in Apify Console. Include the Run ID and the URL that caused the problem.