Docs-to-RAG Optimizer
Pricing
from $0.50 / 1,000 pages processed
Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.
Developer
Vamsi Krishna
Maintained by Community
Docs to RAG - Documentation to Markdown, JSONL & AI Chunks
Turn public developer documentation into clean, LLM-ready data for RAG pipelines.
This Apify Actor crawls docs websites, removes navigation/sidebar/footer noise, converts pages to Markdown, splits content into semantic chunks, counts tokens, detects duplicates, and exports JSONL files that are easy to load into vector databases and AI search systems.
Best For
- Building AI assistants over product or developer documentation
- Preparing docs for OpenAI vector stores, Pinecone, Supabase Vector, Weaviate, Qdrant, Chroma, LangChain, and LlamaIndex
- Converting Docusaurus, GitBook, MkDocs/Material, MDN-style, and custom docs pages into clean Markdown
- Creating stable page and chunk records with content hashes for incremental RAG ingestion
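One way to use content hashes for incremental ingestion is to compare each run's hashes against those saved from the previous run. The sketch below assumes chunk records carry a `chunkId` and a `contentHash`, as in the example outputs in this README. `planIngestion` and `previousHashes` are hypothetical names, and whether chunk IDs stay stable across runs should be checked against real output.

```javascript
// Hypothetical helper: decide which chunk records need (re-)embedding.
// previousHashes: Map of chunkId -> contentHash saved from the last ingestion.
function planIngestion(records, previousHashes) {
  const toEmbed = [];
  const unchanged = [];
  const seen = new Set();

  for (const record of records) {
    seen.add(record.chunkId);
    if (previousHashes.get(record.chunkId) === record.contentHash) {
      unchanged.push(record.chunkId); // same content as last run
    } else {
      toEmbed.push(record); // new or modified content
    }
  }

  // IDs that existed last time but are missing now can be removed from the index.
  const toDelete = [...previousHashes.keys()].filter((id) => !seen.has(id));
  return { toEmbed, unchanged, toDelete };
}

// Illustrative data only; real hashes are full SHA-256 values.
const previousHashes = new Map([
  ['chunk_a_000', 'sha256:111'],
  ['chunk_a_001', 'sha256:222'],
  ['chunk_b_000', 'sha256:333'],
]);
const currentRun = [
  { chunkId: 'chunk_a_000', contentHash: 'sha256:111' }, // unchanged
  { chunkId: 'chunk_a_001', contentHash: 'sha256:999' }, // modified
  { chunkId: 'chunk_c_000', contentHash: 'sha256:444' }, // new
];

const plan = planIngestion(currentRun, previousHashes);
console.log(plan.toEmbed.map((r) => r.chunkId), plan.unchanged, plan.toDelete);
```

Only the records in `toEmbed` need to be sent to the embedding model, which is where the hashes save cost on repeated crawls.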
What You Get
- Clean Markdown for every processed page
- Page JSON records in the `pages` dataset
- Chunk JSON records in the `chunks` dataset
- Default dataset records for easy Apify Console/API export
- Consolidated `pages.jsonl` and `chunks.jsonl` exports in the key-value store
- Token counts for pages and chunks using OpenAI-style tokenization
- Header-aware RAG chunks with heading paths and previous/next chunk IDs
- SHA-256 content hashes for pages and chunks
- Exact duplicate detection with `duplicateOf`
- RAG quality score, warnings, and `recommendedAction`
- Optional per-page Markdown files in the key-value store
Why Use This Instead of a Generic Web Scraper?
Generic website scrapers are useful when you need broad website crawling. This Actor is built specifically for documentation-to-RAG workflows:
- Docs-specific cleanup for Docusaurus, GitBook, and MkDocs/Material
- Header-aware chunks instead of fixed character splitting
- `embeddingText` on every chunk for direct vector database ingestion
- Page and chunk JSONL exports for batch pipelines
- Duplicate detection to avoid embedding the same page twice
- Quality warnings so bad extractions are visible before you embed them
- Page-based pricing at `$1.00 / 1,000 pages`, not per generated chunk
Supported Documentation Platforms
The Actor is optimized for:
- Docusaurus
- GitBook
- MkDocs / Material for MkDocs
Unknown or custom documentation sites use a Readability-based fallback extractor.
Example Use Cases
- Crawl `https://docusaurus.io/docs` and create JSONL chunks for a docs chatbot
- Convert GitBook docs into Markdown files for an internal knowledge base
- Extract MkDocs/Material documentation into chunk records for Supabase Vector
- Deduplicate repeated docs pages before embedding to reduce vector database cost
- Build an AI search index from public developer documentation
Example Input
```json
{
  "startUrls": [{ "url": "https://docusaurus.io/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "includePatterns": ["^https://docusaurus\\.io/docs"],
  "excludePatterns": ["/blog/"],
  "outputFormats": ["json", "markdown"],
  "chunkingEnabled": true,
  "chunkStrategy": "header-aware",
  "chunkSize": 800,
  "chunkOverlap": 100,
  "deduplicateContent": true,
  "respectRobotsTxt": true,
  "maxConcurrency": 5
}
```
Example Page Output
```json
{
  "recordType": "page",
  "url": "https://docusaurus.io/docs",
  "canonicalUrl": "https://docusaurus.io/docs",
  "title": "Introduction | Docusaurus",
  "metadata": {
    "docsPlatform": "docusaurus",
    "language": "en"
  },
  "tokenCount": 2189,
  "contentHash": "sha256:...",
  "duplicateOf": null,
  "qualityScore": 95,
  "recommendedAction": "use"
}
```
Page records also include cleanMarkdown, textContent, headings, codeBlocks, tables, links, qualityWarnings, and crawledAt.
Example Chunk Output
```json
{
  "recordType": "chunk",
  "chunkId": "chunk_abc123_000",
  "sourceUrl": "https://docusaurus.io/docs",
  "pageTitle": "Introduction | Docusaurus",
  "sectionTitle": "Getting started",
  "headingPath": ["Introduction", "Getting started"],
  "embeddingText": "Install Docusaurus and create your first docs site...",
  "tokenCount": 392,
  "chunkIndex": 0,
  "previousChunkId": null,
  "nextChunkId": "chunk_abc123_001",
  "contentHash": "..."
}
```
Chunk records also include chunkMarkdown, chunkText, and metadata such as docsPlatform, hasCodeBlock, hasTable, sourceLastModified, and sourceContentHash.
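The `previousChunkId` and `nextChunkId` links make it possible to widen a retrieved chunk with its neighbors before passing context to an LLM. The following is a minimal sketch of that idea, assuming all chunks for a page are already loaded in memory. `withNeighbors` is a hypothetical helper, not part of the Actor, and the sample records are trimmed to the fields it reads.

```javascript
// Hypothetical helper: return a retrieved chunk together with its immediate
// neighbors, following the previousChunkId / nextChunkId links.
function withNeighbors(chunks, chunkId) {
  const byId = new Map(chunks.map((chunk) => [chunk.chunkId, chunk]));
  const hit = byId.get(chunkId);
  if (!hit) return [];
  return [byId.get(hit.previousChunkId), hit, byId.get(hit.nextChunkId)]
    .filter(Boolean); // drop missing neighbors (null links on first/last chunk)
}

// Illustrative records only.
const sampleChunks = [
  { chunkId: 'chunk_abc123_000', previousChunkId: null, nextChunkId: 'chunk_abc123_001' },
  { chunkId: 'chunk_abc123_001', previousChunkId: 'chunk_abc123_000', nextChunkId: 'chunk_abc123_002' },
  { chunkId: 'chunk_abc123_002', previousChunkId: 'chunk_abc123_001', nextChunkId: null },
];

const windowed = withNeighbors(sampleChunks, 'chunk_abc123_001');
console.log(windowed.map((chunk) => chunk.chunkId));
```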
Copy-Paste API Example
```javascript
import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/docs-to-rag-optimizer').call({
  startUrls: [{ url: 'https://docusaurus.io/docs' }],
  maxPages: 50,
  includePatterns: ['^https://docusaurus\\.io/docs'],
  outputFormats: ['json', 'markdown'],
  chunkingEnabled: true,
});

// Read the default dataset and keep only chunk records.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item.recordType === 'chunk');
```
For embeddings, use embeddingText from chunk records and store sourceUrl, pageTitle, headingPath, contentHash, and metadata as vector metadata.
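As a rough illustration of that mapping, the sketch below turns a chunk record into a generic `{ id, text, metadata }` object. That shape is an assumption rather than any specific vector database's API, and `toVectorRecord` is a hypothetical helper. `headingPath` is joined into a string because some vector stores accept only scalar metadata values.

```javascript
// Hypothetical helper: map a chunk record to a generic vector-store record.
// The { id, text, metadata } shape is an assumption; adapt it to your client.
function toVectorRecord(chunk) {
  return {
    id: chunk.chunkId,
    text: chunk.embeddingText, // the string to send to the embedding model
    metadata: {
      sourceUrl: chunk.sourceUrl,
      pageTitle: chunk.pageTitle,
      headingPath: chunk.headingPath.join(' > '), // flattened to a scalar
      contentHash: chunk.contentHash,
      ...chunk.metadata, // e.g. docsPlatform, hasCodeBlock, hasTable
    },
  };
}

// Sample record trimmed from the chunk output example in this README.
const sampleChunk = {
  chunkId: 'chunk_abc123_000',
  sourceUrl: 'https://docusaurus.io/docs',
  pageTitle: 'Introduction | Docusaurus',
  headingPath: ['Introduction', 'Getting started'],
  embeddingText: 'Install Docusaurus and create your first docs site...',
  contentHash: '...',
  metadata: { docsPlatform: 'docusaurus', hasCodeBlock: true },
};

const vectorRecord = toVectorRecord(sampleChunk);
console.log(vectorRecord.metadata.headingPath);
```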
Output Locations
- Named dataset `pages`: one page record per successfully processed page
- Named dataset `chunks`: one chunk record per generated chunk
- Key-value store `pages.jsonl`: consolidated page export
- Key-value store `chunks.jsonl`: consolidated chunk export
- Key-value store `OUTPUT.json`: run summary with counts and export keys
- Key-value store `pages_<sha256>.md`: optional per-page Markdown when `outputFormats` includes `markdown`
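The JSONL exports hold one JSON record per line, so they can be parsed with a few lines of code once downloaded. The sketch below assumes the file content is already available as a string. `parseJsonl` is a hypothetical helper, and the sample text is illustrative.

```javascript
// Hypothetical helper: parse JSONL text (one JSON object per line) into records.
function parseJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line));
}

// Illustrative content shaped like a chunks.jsonl export.
const sampleJsonl = [
  '{"recordType":"chunk","chunkId":"chunk_abc123_000","tokenCount":392}',
  '{"recordType":"chunk","chunkId":"chunk_abc123_001","tokenCount":410}',
  '',
].join('\n');

const parsedChunks = parseJsonl(sampleJsonl);
const totalTokens = parsedChunks.reduce((sum, chunk) => sum + chunk.tokenCount, 0);
console.log(parsedChunks.length, totalTokens);
```

With `apify-client`, the text could be fetched through `client.keyValueStore(storeId).getRecord('chunks.jsonl')`. Which store ID to pass, and whether the value comes back as a string or a buffer, are assumptions to verify against a real run.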
Pricing
Pricing is based on successfully processed pages:
- Base price: `$1.00 / 1,000 pages`
- Starter discount: `$0.90 / 1,000 pages`
- Scale discount: `$0.75 / 1,000 pages`
- Business discount: `$0.50 / 1,000 pages`
The Actor charges the page-processed event only after a page has been crawled, extracted, converted, and saved (and chunked, when chunking is enabled).
It does not charge per chunk. Large pages may produce many chunks, but billing remains page-based.
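Because billing is per page, a run's page-processed charge can be estimated from the page count alone. The sketch below uses the rates listed above. `estimatePageCost` and the tier keys are hypothetical names, and the result covers only the page-processed event described in this section.

```javascript
// Rates from the pricing list above, in USD per 1,000 processed pages.
const PRICE_PER_1000_PAGES = { base: 1.0, starter: 0.9, scale: 0.75, business: 0.5 };

// Hypothetical helper: estimate the page-processed charge for a run.
function estimatePageCost(pages, tier = 'base') {
  const rate = PRICE_PER_1000_PAGES[tier];
  if (rate === undefined) throw new Error(`Unknown tier: ${tier}`);
  return (pages / 1000) * rate;
}

console.log(estimatePageCost(50)); // 50 pages at the base rate
console.log(estimatePageCost(20000, 'scale')); // 20,000 pages at the Scale rate
```

For example, the 50-page run from the example input comes to about $0.05 at the base rate, regardless of how many chunks those pages produce.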
Known Limits
- Private docs behind login are not supported in v1.
- PDF/DOCX extraction is not included in v1.
- JavaScript-heavy docs use a Playwright fallback, but static docs are faster and cheaper.
- Exact duplicate detection uses normalized text hashes; near-duplicate detection is not included yet.
Input Fields
- `startUrls`: documentation URLs to start crawling
- `sitemapUrls`: optional XML sitemap URLs
- `maxPages`: maximum successfully processed pages
- `maxDepth`: maximum crawl depth
- `includePatterns`: JavaScript regex strings for allowed URLs
- `excludePatterns`: JavaScript regex strings for blocked URLs
- `crawlOnlyDocs`: skip obvious non-doc paths such as blog, pricing, login, legal
- `outputFormats`: `json`, `markdown`, or both
- `removeSelectors`: CSS selectors to remove before extraction
- `keepSelectors`: CSS selectors to restrict extraction to specific areas
- `preserveCodeBlocks`: keep fenced code blocks
- `preserveTables`: keep GitHub-Flavored Markdown tables
- `preserveLinks`: keep links in Markdown and JSON
- `chunkingEnabled`: generate RAG chunks
- `chunkStrategy`: `header-aware`
- `chunkSize`: target chunk size in tokens
- `chunkOverlap`: approximate chunk overlap in tokens
- `deduplicateContent`: mark exact duplicate pages and skip duplicate chunking
- `respectRobotsTxt`: respect robots.txt rules
- `maxConcurrency`: maximum concurrent requests
URL Pattern Policy
`includePatterns` and `excludePatterns` are treated as JavaScript regular expression strings and compiled with `new RegExp(pattern)`.
Example:
```json
{
  "includePatterns": ["^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript"],
  "excludePatterns": ["/contributors\\.txt$", "/blog/"]
}
```
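To check patterns before starting a crawl, they can be compiled the same way locally. The sketch below is only an approximation of the Actor's filtering. It assumes that any matching exclude pattern blocks a URL and that an empty include list allows everything, neither of which is confirmed by this README. `isUrlAllowed` is a hypothetical helper.

```javascript
// Hypothetical helper: test a URL against include/exclude regex strings.
// Assumptions: excludes take precedence; an empty include list allows all URLs.
function isUrlAllowed(url, includePatterns = [], excludePatterns = []) {
  const includes = includePatterns.map((pattern) => new RegExp(pattern));
  const excludes = excludePatterns.map((pattern) => new RegExp(pattern));

  if (excludes.some((re) => re.test(url))) return false;
  return includes.length === 0 || includes.some((re) => re.test(url));
}

// Same patterns as the JSON example above (backslashes escaped for JS strings).
const include = ['^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript'];
const exclude = ['/contributors\\.txt$', '/blog/'];

const base = 'https://developer.mozilla.org/en-US/docs/Web/JavaScript';
console.log(isUrlAllowed(`${base}/Guide`, include, exclude));
console.log(isUrlAllowed(`${base}/Guide/contributors.txt`, include, exclude));
console.log(isUrlAllowed('https://developer.mozilla.org/en-US/blog/', include, exclude));
```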
Quality Signals
Each page includes:
- `qualityScore`: deterministic 0-100 score
- `qualityWarnings`: extraction/chunking warnings
- `recommendedAction`: `use`, `review`, or `skip`
These fields help identify pages that are ready for embedding versus pages that need manual review.
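A simple way to act on these signals is to bucket page records before embedding. The sketch below groups pages by `recommendedAction` and, as an assumption of this example, treats any page with a non-null `duplicateOf` as skippable. `triagePages` is a hypothetical helper, and the sample records are illustrative.

```javascript
// Hypothetical helper: group page records into use / review / skip buckets.
function triagePages(pages) {
  const buckets = { use: [], review: [], skip: [] };
  for (const page of pages) {
    if (page.duplicateOf) {
      buckets.skip.push(page); // assumption: exact duplicates are not embedded
      continue;
    }
    // Unknown or missing actions fall back to manual review.
    const bucket = buckets[page.recommendedAction] ?? buckets.review;
    bucket.push(page);
  }
  return buckets;
}

// Illustrative page records trimmed to the fields used here.
const samplePages = [
  { url: 'https://docusaurus.io/docs', recommendedAction: 'use', duplicateOf: null },
  { url: 'https://docusaurus.io/docs/next', recommendedAction: 'use', duplicateOf: 'https://docusaurus.io/docs' },
  { url: 'https://docusaurus.io/docs/search', recommendedAction: 'review', duplicateOf: null },
  { url: 'https://docusaurus.io/docs/empty', recommendedAction: 'skip', duplicateOf: null },
];

const triage = triagePages(samplePages);
console.log(triage.use.length, triage.review.length, triage.skip.length);
```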
Local Development
```shell
pnpm install
pnpm run build
pnpm start
```
Run locally with Apify CLI:
```shell
apify run --purge --input-file INPUT.example.json
```
Search Keywords
RAG, LLM, AI assistant, documentation scraper, docs scraper, Markdown scraper, JSONL export, vector database, embeddings, chunks, semantic chunking, Docusaurus scraper, GitBook scraper, MkDocs scraper, Material for MkDocs, developer docs, AI search, LangChain, LlamaIndex, OpenAI, Pinecone, Supabase Vector.