Docs-to-RAG Optimizer

Pricing

from $0.50 / 1,000 processed pages

Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.

Developer

Vamsi Krishna

Maintained by Community

Docs to RAG - Documentation to Markdown, JSONL & AI Chunks

Turn public developer documentation into clean, LLM-ready data for RAG pipelines.

This Apify Actor crawls docs websites, removes navigation/sidebar/footer noise, converts pages to Markdown, splits content into semantic chunks, counts tokens, detects duplicates, and exports JSONL files that are easy to load into vector databases and AI search systems.

Best For

  • Building AI assistants over product or developer documentation
  • Preparing docs for OpenAI vector stores, Pinecone, Supabase Vector, Weaviate, Qdrant, Chroma, LangChain, and LlamaIndex
  • Converting Docusaurus, GitBook, MkDocs/Material, MDN-style, and custom docs pages into clean Markdown
  • Creating stable page and chunk records with content hashes for incremental RAG ingestion

What You Get

  • Clean Markdown for every processed page
  • Page JSON records in the pages dataset
  • Chunk JSON records in the chunks dataset
  • Default dataset records for easy Apify Console/API export
  • Consolidated pages.jsonl and chunks.jsonl exports in the key-value store
  • Token counts for pages and chunks using OpenAI-style tokenization
  • Header-aware RAG chunks with heading paths and previous/next chunk IDs
  • SHA-256 content hashes for pages and chunks
  • Exact duplicate detection with duplicateOf
  • RAG quality score, warnings, and recommendedAction
  • Optional per-page Markdown files in key-value store

Why Use This Instead of a Generic Web Scraper?

Generic website scrapers are useful when you need broad website crawling. This Actor is built specifically for documentation-to-RAG workflows:

  • Docs-specific cleanup for Docusaurus, GitBook, and MkDocs/Material
  • Header-aware chunks instead of fixed character splitting
  • embeddingText on every chunk for direct vector database ingestion
  • Page and chunk JSONL exports for batch pipelines
  • Duplicate detection to avoid embedding the same page twice
  • Quality warnings so bad extractions are visible before you embed them
  • Page-based pricing at a base price of $1.00 / 1,000 pages, billed per processed page rather than per generated chunk

Supported Documentation Platforms

The Actor is optimized for:

  • Docusaurus
  • GitBook
  • MkDocs / Material for MkDocs

Unknown or custom documentation sites use a Readability-based fallback extractor.

Example Use Cases

  • Crawl https://docusaurus.io/docs and create JSONL chunks for a docs chatbot
  • Convert GitBook docs into Markdown files for an internal knowledge base
  • Extract MkDocs/Material documentation into chunk records for Supabase Vector
  • Deduplicate repeated docs pages before embedding to reduce vector database cost
  • Build an AI search index from public developer documentation

Example Input

{
  "startUrls": [{ "url": "https://docusaurus.io/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "includePatterns": ["^https://docusaurus\\.io/docs"],
  "excludePatterns": ["/blog/"],
  "outputFormats": ["json", "markdown"],
  "chunkingEnabled": true,
  "chunkStrategy": "header-aware",
  "chunkSize": 800,
  "chunkOverlap": 100,
  "deduplicateContent": true,
  "respectRobotsTxt": true,
  "maxConcurrency": 5
}

Example Page Output

{
  "recordType": "page",
  "url": "https://docusaurus.io/docs",
  "canonicalUrl": "https://docusaurus.io/docs",
  "title": "Introduction | Docusaurus",
  "metadata": {
    "docsPlatform": "docusaurus",
    "language": "en"
  },
  "tokenCount": 2189,
  "contentHash": "sha256:...",
  "duplicateOf": null,
  "qualityScore": 95,
  "recommendedAction": "use"
}

Page records also include cleanMarkdown, textContent, headings, codeBlocks, tables, links, qualityWarnings, and crawledAt.
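
Because page records carry contentHash and duplicateOf, an ingestion job can skip duplicates and re-embed only the pages whose hash changed since the last run. The following is a minimal sketch of that idea, not part of the Actor: planIngestion and the previousHashes map (url to contentHash, saved by your own pipeline) are hypothetical, and the example URLs and hash values are made up.

```javascript
// Hypothetical helper: decide which pages need (re-)embedding.
// `previousHashes` is an assumed Map of url -> contentHash from an earlier run.
function planIngestion(pages, previousHashes) {
  const toEmbed = [];
  const unchanged = [];
  const duplicates = [];
  for (const page of pages) {
    if (page.duplicateOf) {
      // Marked as an exact duplicate of another page: do not embed again.
      duplicates.push(page.url);
    } else if (previousHashes.get(page.url) === page.contentHash) {
      // Same hash as last time: the stored vectors are still valid.
      unchanged.push(page.url);
    } else {
      // New page or changed content.
      toEmbed.push(page.url);
    }
  }
  return { toEmbed, unchanged, duplicates };
}

// Made-up records for illustration only.
const plan = planIngestion(
  [
    { url: 'https://example.com/docs/a', contentHash: 'sha256:aaa', duplicateOf: null },
    { url: 'https://example.com/docs/b', contentHash: 'sha256:bbb2', duplicateOf: null },
    { url: 'https://example.com/docs/c', contentHash: 'sha256:aaa', duplicateOf: 'https://example.com/docs/a' },
  ],
  new Map([
    ['https://example.com/docs/a', 'sha256:aaa'],
    ['https://example.com/docs/b', 'sha256:bbb1'],
  ]),
);
console.log(plan);
```

The sketch only checks whether duplicateOf is set; the exact value it holds (a URL or a hash) is not assumed.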

Example Chunk Output

{
  "recordType": "chunk",
  "chunkId": "chunk_abc123_000",
  "sourceUrl": "https://docusaurus.io/docs",
  "pageTitle": "Introduction | Docusaurus",
  "sectionTitle": "Getting started",
  "headingPath": ["Introduction", "Getting started"],
  "embeddingText": "Install Docusaurus and create your first docs site...",
  "tokenCount": 392,
  "chunkIndex": 0,
  "previousChunkId": null,
  "nextChunkId": "chunk_abc123_001",
  "contentHash": "..."
}

Chunk records also include chunkMarkdown, chunkText, and metadata such as docsPlatform, hasCodeBlock, hasTable, sourceLastModified, and sourceContentHash.
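
The previousChunkId and nextChunkId links make it possible to widen a retrieved chunk with its neighbours before passing it to an LLM. A minimal sketch, assuming the chunk records are already loaded into an array; withNeighbours and its radius parameter are hypothetical names used for illustration.

```javascript
// Return the chunk with the given id plus up to `radius` linked chunks on each side.
function withNeighbours(chunks, chunkId, radius = 1) {
  const byId = new Map(chunks.map((c) => [c.chunkId, c]));
  const center = byId.get(chunkId);
  if (!center) return [];

  // Walk backwards through previousChunkId links.
  const before = [];
  let cursor = center;
  for (let i = 0; i < radius && cursor.previousChunkId; i += 1) {
    cursor = byId.get(cursor.previousChunkId);
    if (!cursor) break; // link points outside the loaded set
    before.unshift(cursor);
  }

  // Walk forwards through nextChunkId links.
  const after = [];
  cursor = center;
  for (let i = 0; i < radius && cursor.nextChunkId; i += 1) {
    cursor = byId.get(cursor.nextChunkId);
    if (!cursor) break;
    after.push(cursor);
  }

  return [...before, center, ...after];
}

// Made-up chunk records that follow the ID pattern of the example above.
const sampleChunks = [
  { chunkId: 'chunk_abc123_000', previousChunkId: null, nextChunkId: 'chunk_abc123_001' },
  { chunkId: 'chunk_abc123_001', previousChunkId: 'chunk_abc123_000', nextChunkId: 'chunk_abc123_002' },
  { chunkId: 'chunk_abc123_002', previousChunkId: 'chunk_abc123_001', nextChunkId: null },
];
console.log(withNeighbours(sampleChunks, 'chunk_abc123_001').map((c) => c.chunkId));
```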

Copy-Paste API Example

import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/docs-to-rag-optimizer').call({
  startUrls: [{ url: 'https://docusaurus.io/docs' }],
  maxPages: 50,
  includePatterns: ['^https://docusaurus\\.io/docs'],
  outputFormats: ['json', 'markdown'],
  chunkingEnabled: true,
});

// Read the default dataset and keep only the chunk records.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item.recordType === 'chunk');

For embeddings, use embeddingText from chunk records and store sourceUrl, pageTitle, headingPath, contentHash, and metadata as vector metadata.
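
One way to apply this mapping is sketched below. The { id, text, metadata } shape and the toVectorRecord name are assumptions chosen to stay vendor-neutral; adapt them to whatever your vector database client expects. headingPath is joined into a single string on the assumption that the target store prefers flat scalar metadata values.

```javascript
// Hypothetical mapping from a chunk record to a generic vector-store record.
function toVectorRecord(chunk) {
  return {
    id: chunk.chunkId,
    text: chunk.embeddingText, // the text to send to the embedding model
    metadata: {
      sourceUrl: chunk.sourceUrl,
      pageTitle: chunk.pageTitle,
      headingPath: chunk.headingPath.join(' > '),
      contentHash: chunk.contentHash,
      ...chunk.metadata, // e.g. docsPlatform, hasCodeBlock, hasTable
    },
  };
}

// Sample record based on the chunk output example; the metadata values are illustrative.
const vectorRecord = toVectorRecord({
  chunkId: 'chunk_abc123_000',
  sourceUrl: 'https://docusaurus.io/docs',
  pageTitle: 'Introduction | Docusaurus',
  headingPath: ['Introduction', 'Getting started'],
  embeddingText: 'Install Docusaurus and create your first docs site...',
  contentHash: '...',
  metadata: { docsPlatform: 'docusaurus', hasCodeBlock: true },
});
console.log(vectorRecord);
```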

Output Locations

  • Named dataset pages: one page record per successfully processed page
  • Named dataset chunks: one chunk record per generated chunk
  • Key-value store pages.jsonl: consolidated page export
  • Key-value store chunks.jsonl: consolidated chunk export
  • Key-value store OUTPUT.json: run summary with counts and export keys
  • Key-value store pages_<sha256>.md: optional per-page Markdown when outputFormats includes markdown
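
JSONL stores one JSON object per line, so the consolidated exports can be parsed with a few lines of code once downloaded. The sketch below shows only the parsing step on an in-memory string; parseJsonl is a hypothetical helper, and how you fetch the file (for example through the key-value store API) is left to your pipeline.

```javascript
// Parse JSONL text (one JSON object per line) into an array of records.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '') // tolerate blank and trailing lines
    .map((line) => JSON.parse(line));
}

// Made-up two-line sample standing in for the contents of chunks.jsonl.
const sampleJsonl = [
  '{"recordType":"chunk","chunkId":"chunk_abc123_000","tokenCount":392}',
  '{"recordType":"chunk","chunkId":"chunk_abc123_001","tokenCount":410}',
  '',
].join('\n');

const parsedChunks = parseJsonl(sampleJsonl);
console.log(parsedChunks.length, parsedChunks[0].chunkId);
```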

Pricing

Pricing is based on successfully processed pages:

  • Base price: $1.00 / 1,000 pages
  • Starter discount: $0.90 / 1,000 pages
  • Scale discount: $0.75 / 1,000 pages
  • Business discount: $0.50 / 1,000 pages

The Actor charges the page-processed event only after a page has been crawled, extracted, converted, and saved, and, when chunking is enabled, chunked.

It does not charge per chunk. Large pages may produce many chunks, but billing remains page-based.
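
Since billing is page-based, a rough cost estimate is pages divided by 1,000, multiplied by the tier rate. The sketch below uses the rates listed above; the tier keys and the estimateCostUsd helper are hypothetical, and the result covers only the page-processed event.

```javascript
// Rates in USD per 1,000 processed pages, taken from the pricing list above.
// The tier keys are illustrative labels, not official identifiers.
const PRICE_PER_1000_PAGES = { base: 1.0, starter: 0.9, scale: 0.75, business: 0.5 };

function estimateCostUsd(pagesProcessed, tier = 'base') {
  const rate = PRICE_PER_1000_PAGES[tier];
  if (rate === undefined) throw new Error(`Unknown tier: ${tier}`);
  return (pagesProcessed / 1000) * rate;
}

console.log(estimateCostUsd(2000, 'scale')); // 1.5
console.log(estimateCostUsd(20000, 'business')); // 10
```

The number of chunks does not appear in the formula: a 2,000-page crawl costs the same whether it yields 5,000 or 50,000 chunks.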

Known Limits

  • Private docs behind login are not supported in v1.
  • PDF/DOCX extraction is not included in v1.
  • JavaScript-heavy docs use a Playwright fallback, but static docs are faster and cheaper.
  • Exact duplicate detection uses normalized text hashes; near-duplicate detection is not included yet.

Input Fields

  • startUrls: documentation URLs to start crawling
  • sitemapUrls: optional XML sitemap URLs
  • maxPages: maximum successfully processed pages
  • maxDepth: maximum crawl depth
  • includePatterns: JavaScript regex strings for allowed URLs
  • excludePatterns: JavaScript regex strings for blocked URLs
  • crawlOnlyDocs: skip obvious non-doc paths such as blog, pricing, login, legal
  • outputFormats: json, markdown, or both
  • removeSelectors: CSS selectors to remove before extraction
  • keepSelectors: CSS selectors to restrict extraction to specific areas
  • preserveCodeBlocks: keep fenced code blocks
  • preserveTables: keep GitHub-Flavored Markdown tables
  • preserveLinks: keep links in Markdown and JSON
  • chunkingEnabled: generate RAG chunks
  • chunkStrategy: chunking strategy; header-aware is the documented value
  • chunkSize: target chunk size in tokens
  • chunkOverlap: approximate chunk overlap in tokens
  • deduplicateContent: mark exact duplicate pages and skip duplicate chunking
  • respectRobotsTxt: respect robots.txt rules
  • maxConcurrency: maximum concurrent requests
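
To illustrate what header-aware chunking means in contrast to fixed character splitting, the sketch below splits Markdown at headings and records the heading path of each section. It is a simplified illustration, not the Actor's implementation: it ignores chunkSize, chunkOverlap, and token counting, and splitByHeadings is a hypothetical name.

```javascript
// Split Markdown into sections at ATX headings, keeping the heading path.
function splitByHeadings(markdown) {
  const sections = [];
  const stack = []; // open headings, e.g. [{ level: 1, title: 'Introduction' }]
  let body = [];
  let inFence = false;

  const flush = () => {
    const text = body.join('\n').trim();
    if (text) sections.push({ headingPath: stack.map((h) => h.title), text });
    body = [];
  };

  for (const line of markdown.split('\n')) {
    // Toggle on code fences so "#" comments inside code are not read as headings.
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section under its own heading path
      const level = match[1].length;
      while (stack.length && stack[stack.length - 1].level >= level) stack.pop();
      stack.push({ level, title: match[2].trim() });
    } else {
      body.push(line);
    }
  }
  flush();
  return sections;
}

const sampleMarkdown = [
  '# Introduction',
  'Welcome.',
  '',
  '## Getting started',
  'Install it.',
  '',
  '```bash',
  '# not a heading',
  'npm install',
  '```',
  '',
  '## Next steps',
  'Read more.',
].join('\n');

console.log(splitByHeadings(sampleMarkdown).map((s) => s.headingPath.join(' > ')));
```

A production chunker would additionally split sections that exceed the target token size and carry overlap between neighbouring chunks.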

URL Pattern Policy

includePatterns and excludePatterns are treated as JavaScript regular expression strings and compiled with new RegExp(pattern).

Example:

{
  "includePatterns": ["^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript"],
  "excludePatterns": ["/contributors\\.txt$", "/blog/"]
}
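
To check patterns before a run, you can compile them the same way with new RegExp(pattern) and test a few URLs locally. The sketch below makes two assumptions that the text above does not confirm: exclude patterns take precedence over include patterns, and an empty include list allows every URL. makeUrlFilter is a hypothetical helper.

```javascript
// Build a URL predicate from include/exclude regex strings.
// Assumed policy: excludes win over includes; no includes means "allow all".
function makeUrlFilter({ includePatterns = [], excludePatterns = [] }) {
  const includes = includePatterns.map((p) => new RegExp(p));
  const excludes = excludePatterns.map((p) => new RegExp(p));
  return (url) => {
    if (excludes.some((re) => re.test(url))) return false;
    return includes.length === 0 || includes.some((re) => re.test(url));
  };
}

const isAllowed = makeUrlFilter({
  includePatterns: ['^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript'],
  excludePatterns: ['/contributors\\.txt$', '/blog/'],
});

console.log(isAllowed('https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide')); // true
console.log(isAllowed('https://developer.mozilla.org/en-US/docs/Web/JavaScript/contributors.txt')); // false
console.log(isAllowed('https://developer.mozilla.org/en-US/docs/Web/CSS')); // false
```

In JSON and in JavaScript string literals, a literal dot must be written as \\. so that the compiled expression receives \. and does not treat the dot as a wildcard.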

Quality Signals

Each page includes:

  • qualityScore: deterministic 0-100 score
  • qualityWarnings: extraction/chunking warnings
  • recommendedAction: use, review, or skip

These fields help identify pages that are ready for embedding versus pages that need manual review.
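
A minimal sketch of how these fields might gate ingestion: keep chunks only from pages marked use, and collect pages marked review for a manual check. It assumes a chunk's sourceUrl equals its page's url, as in the examples above; partitionByQuality and the minScore threshold are hypothetical.

```javascript
// Split a run's output into chunks that are ready to embed and pages that need review.
function partitionByQuality(pages, chunks, minScore = 0) {
  const usableUrls = new Set(
    pages
      .filter((p) => p.recommendedAction === 'use' && p.qualityScore >= minScore)
      .map((p) => p.url),
  );
  return {
    embed: chunks.filter((c) => usableUrls.has(c.sourceUrl)),
    review: pages.filter((p) => p.recommendedAction === 'review').map((p) => p.url),
  };
}

// Made-up records for illustration only.
const qualityResult = partitionByQuality(
  [
    { url: 'https://example.com/docs/a', qualityScore: 95, recommendedAction: 'use' },
    { url: 'https://example.com/docs/b', qualityScore: 60, recommendedAction: 'review' },
    { url: 'https://example.com/docs/c', qualityScore: 20, recommendedAction: 'skip' },
  ],
  [
    { chunkId: 'a_000', sourceUrl: 'https://example.com/docs/a' },
    { chunkId: 'b_000', sourceUrl: 'https://example.com/docs/b' },
    { chunkId: 'c_000', sourceUrl: 'https://example.com/docs/c' },
  ],
  80,
);
console.log(qualityResult);
```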

Local Development

pnpm install
pnpm run build
pnpm start

Run locally with Apify CLI:

apify run --purge --input-file INPUT.example.json

Search Keywords

RAG, LLM, AI assistant, documentation scraper, docs scraper, Markdown scraper, JSONL export, vector database, embeddings, chunks, semantic chunking, Docusaurus scraper, GitBook scraper, MkDocs scraper, Material for MkDocs, developer docs, AI search, LangChain, LlamaIndex, OpenAI, Pinecone, Supabase Vector.