Pricing

from $0.50 / 1,000 pages processed

Docs-to-RAG Optimizer

Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.

Developer

Vamsi Krishna

Last modified: 3 days ago

Docs to RAG - Documentation to Markdown, JSONL & AI Chunks

Turn public developer documentation into clean, LLM-ready data for RAG pipelines.

This Apify Actor crawls docs websites, removes navigation/sidebar/footer noise, converts pages to Markdown, splits content into semantic chunks, counts tokens, detects duplicates, and exports JSONL files that are easy to load into vector databases and AI search systems.

Best For

Building AI assistants over product or developer documentation
Preparing docs for OpenAI vector stores, Pinecone, Supabase Vector, Weaviate, Qdrant, Chroma, LangChain, and LlamaIndex
Converting Docusaurus, GitBook, MkDocs/Material, MDN-style, and custom docs pages into clean Markdown
Creating stable page and chunk records with content hashes for incremental RAG ingestion
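
As a rough illustration of incremental ingestion, the sketch below compares the content hashes from a new run against hashes saved from an earlier run and returns only the chunks that need (re-)embedding. The `planIngestion` helper and the `previousHashes` map are hypothetical, and keying by `chunkId` assumes chunk IDs stay stable between runs; only the field names come from the chunk records.

```javascript
// Hypothetical helper: decide which chunks to (re-)embed and which stored IDs to delete.
// previousHashes: Map of chunkId -> contentHash saved from an earlier run (assumed storage).
function planIngestion(previousHashes, chunks) {
  const toEmbed = [];
  const seen = new Set();
  for (const chunk of chunks) {
    seen.add(chunk.chunkId);
    // New chunk, or content changed since the last run.
    if (previousHashes.get(chunk.chunkId) !== chunk.contentHash) {
      toEmbed.push(chunk);
    }
  }
  // IDs present last time but missing now can be removed from the index.
  const toDelete = [...previousHashes.keys()].filter((id) => !seen.has(id));
  return { toEmbed, toDelete };
}

// Made-up IDs and hashes for illustration.
const previousHashes = new Map([
  ['chunk_a_000', 'sha256:aaa'],
  ['chunk_a_001', 'sha256:bbb'],
  ['chunk_b_000', 'sha256:ccc'],
]);
const currentChunks = [
  { chunkId: 'chunk_a_000', contentHash: 'sha256:aaa' }, // unchanged
  { chunkId: 'chunk_a_001', contentHash: 'sha256:changed' }, // edited
  { chunkId: 'chunk_c_000', contentHash: 'sha256:ddd' }, // new
];
const ingestionPlan = planIngestion(previousHashes, currentChunks);
console.log(ingestionPlan.toEmbed.map((c) => c.chunkId), ingestionPlan.toDelete);
```

Unchanged chunks are skipped entirely, which is where the savings on embedding calls would come from.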

What You Get

Clean Markdown for every processed page
Page JSON records in the pages dataset
Chunk JSON records in the chunks dataset
Default dataset records for easy Apify Console/API export
Consolidated pages.jsonl and chunks.jsonl exports in the key-value store
Token counts for pages and chunks using OpenAI-style tokenization
Header-aware RAG chunks with heading paths and previous/next chunk IDs
SHA-256 content hashes for pages and chunks
Exact duplicate detection with duplicateOf
RAG quality score, warnings, and recommendedAction
Optional per-page Markdown files in key-value store

Why Use This Instead of a Generic Web Scraper?

Generic website scrapers are useful when you need broad website crawling. This Actor is built specifically for documentation-to-RAG workflows:

Docs-specific cleanup for Docusaurus, GitBook, and MkDocs/Material
Header-aware chunks instead of fixed character splitting
embeddingText on every chunk for direct vector database ingestion
Page and chunk JSONL exports for batch pipelines
Duplicate detection to avoid embedding the same page twice
Quality warnings so bad extractions are visible before you embed them
Page-based pricing at $1.00 / 1,000 pages, not per generated chunk
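
To make the difference from fixed character splitting concrete, here is a minimal sketch of heading-based splitting. It is not the Actor's implementation: `splitByHeadings` is a hypothetical helper that only splits at Markdown headings and records the heading trail, and it leaves out the token-size limits and overlap that a real chunker would also apply.

```javascript
// Illustrative only: split Markdown into sections at ATX headings and record
// the heading path for each section.
function splitByHeadings(markdown) {
  const sections = [];
  const path = []; // current heading trail, e.g. ['Introduction', 'Getting started']
  let current = { headingPath: [], lines: [] };
  let inFence = false;

  const flush = () => {
    const text = current.lines.join('\n').trim();
    if (text) sections.push({ headingPath: current.headingPath, text });
  };

  for (const line of markdown.split('\n')) {
    // Toggle on code-fence lines so "#" comments inside code are not treated as headings.
    if (/^\s*(`{3,}|~{3,})/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = match[1].length;
      path.length = level - 1; // drop deeper and sibling headings
      path[level - 1] = match[2].trim();
      current = { headingPath: path.filter(Boolean), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return sections;
}

const sampleMarkdown = [
  '# Introduction',
  'Welcome.',
  '## Getting started',
  'Install Docusaurus.',
].join('\n');
console.log(splitByHeadings(sampleMarkdown));
```

Each section keeps its full heading trail, so a retrieved chunk can be shown with the context of where it sits in the page.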

Supported Documentation Platforms

The Actor is optimized for:

Docusaurus
GitBook
MkDocs / Material for MkDocs

Unknown or custom documentation sites use a Readability-based fallback extractor.

Example Use Cases

Crawl https://docusaurus.io/docs and create JSONL chunks for a docs chatbot
Convert GitBook docs into Markdown files for an internal knowledge base
Extract MkDocs/Material documentation into chunk records for Supabase Vector
Deduplicate repeated docs pages before embedding to reduce vector database cost
Build an AI search index from public developer documentation

Example Input

{
  "startUrls": [{ "url": "https://docusaurus.io/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "includePatterns": ["^https://docusaurus\\.io/docs"],
  "excludePatterns": ["/blog/"],
  "outputFormats": ["json", "markdown"],
  "chunkingEnabled": true,
  "chunkStrategy": "header-aware",
  "chunkSize": 800,
  "chunkOverlap": 100,
  "deduplicateContent": true,
  "respectRobotsTxt": true,
  "maxConcurrency": 5
}

Example Page Output

{
  "recordType": "page",
  "url": "https://docusaurus.io/docs",
  "canonicalUrl": "https://docusaurus.io/docs",
  "title": "Introduction | Docusaurus",
  "metadata": {
    "docsPlatform": "docusaurus",
    "language": "en"
  },
  "tokenCount": 2189,
  "contentHash": "sha256:...",
  "duplicateOf": null,
  "qualityScore": 95,
  "recommendedAction": "use"
}

Page records also include cleanMarkdown, textContent, headings, codeBlocks, tables, links, qualityWarnings, and crawledAt.

Example Chunk Output

{
  "recordType": "chunk",
  "chunkId": "chunk_abc123_000",
  "sourceUrl": "https://docusaurus.io/docs",
  "pageTitle": "Introduction | Docusaurus",
  "sectionTitle": "Getting started",
  "headingPath": ["Introduction", "Getting started"],
  "embeddingText": "Install Docusaurus and create your first docs site...",
  "tokenCount": 392,
  "chunkIndex": 0,
  "previousChunkId": null,
  "nextChunkId": "chunk_abc123_001",
  "contentHash": "..."
}

Chunk records also include chunkMarkdown, chunkText, and metadata such as docsPlatform, hasCodeBlock, hasTable, sourceLastModified, and sourceContentHash.
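
One possible use of previousChunkId and nextChunkId is to widen the context around a retrieved chunk. The sketch below assumes chunk records have already been loaded into a Map keyed by chunkId; `withNeighbors` is a hypothetical helper, not part of the Actor.

```javascript
// Hypothetical helper: return a chunk together with its immediate neighbours,
// following the previousChunkId / nextChunkId links.
function withNeighbors(chunksById, chunkId) {
  const chunk = chunksById.get(chunkId);
  if (!chunk) return [];
  const previous = chunk.previousChunkId ? chunksById.get(chunk.previousChunkId) : undefined;
  const next = chunk.nextChunkId ? chunksById.get(chunk.nextChunkId) : undefined;
  // Keep document order and drop neighbours that are missing.
  return [previous, chunk, next].filter(Boolean);
}

// Sample records; 'chunk_abc123_002' is made up to extend the example above.
const chunkList = [
  { chunkId: 'chunk_abc123_000', previousChunkId: null, nextChunkId: 'chunk_abc123_001' },
  { chunkId: 'chunk_abc123_001', previousChunkId: 'chunk_abc123_000', nextChunkId: 'chunk_abc123_002' },
  { chunkId: 'chunk_abc123_002', previousChunkId: 'chunk_abc123_001', nextChunkId: null },
];
const chunksById = new Map(chunkList.map((c) => [c.chunkId, c]));
console.log(withNeighbors(chunksById, 'chunk_abc123_001').map((c) => c.chunkId));
```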

Copy-Paste API Example

import { ApifyClient } from 'apify-client';

// Authenticate with an API token read from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish.
const run = await client.actor('YOUR_USERNAME/docs-to-rag-optimizer').call({
  startUrls: [{ url: 'https://docusaurus.io/docs' }],
  maxPages: 50,
  includePatterns: ['^https://docusaurus\\.io/docs'],
  outputFormats: ['json', 'markdown'],
  chunkingEnabled: true,
});

// Read the run's default dataset and keep only the chunk records.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item.recordType === 'chunk');

For embeddings, use embeddingText from chunk records and store sourceUrl, pageTitle, headingPath, contentHash, and metadata as vector metadata.
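
The mapping described above might look like the following sketch. The `{ id, text, metadata }` shape and the `toVectorRecord` helper are assumptions for illustration, since each vector database client expects its own payload format.

```javascript
// Hypothetical mapping from a chunk record to a generic vector-store payload.
// Adapt the output shape to the database client you actually use.
function toVectorRecord(chunk) {
  return {
    id: chunk.chunkId,
    text: chunk.embeddingText, // the text to send to the embedding model
    metadata: {
      sourceUrl: chunk.sourceUrl,
      pageTitle: chunk.pageTitle,
      // Some vector stores restrict metadata to flat scalar values,
      // so the heading path is joined into a single string here.
      headingPath: (chunk.headingPath ?? []).join(' > '),
      contentHash: chunk.contentHash,
      ...(chunk.metadata ?? {}),
    },
  };
}

const sampleChunk = {
  chunkId: 'chunk_abc123_000',
  sourceUrl: 'https://docusaurus.io/docs',
  pageTitle: 'Introduction | Docusaurus',
  headingPath: ['Introduction', 'Getting started'],
  embeddingText: 'Install Docusaurus and create your first docs site...',
  contentHash: '...',
  metadata: { docsPlatform: 'docusaurus', hasCodeBlock: true },
};
console.log(toVectorRecord(sampleChunk));
```

Using chunkId as the vector ID means re-running the pipeline overwrites existing vectors instead of creating duplicates, provided the IDs stay stable between runs.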

Output Locations

Default dataset: page and chunk records together, distinguished by recordType
Named dataset pages: one page record per successfully processed page
Named dataset chunks: one chunk record per generated chunk
Key-value store pages.jsonl: consolidated page export
Key-value store chunks.jsonl: consolidated chunk export
Key-value store OUTPUT.json: run summary with counts and export keys
Key-value store pages_<sha256>.md: optional per-page Markdown when outputFormats includes markdown
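
For batch pipelines that read the consolidated exports, a small JSONL parser is usually enough. The sketch below is a generic helper (`parseJsonl` is a hypothetical name) that takes the file contents as a string.

```javascript
// Parse JSONL text (one JSON object per line) into an array of records.
function parseJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line));
}

const sampleJsonl = [
  '{"recordType":"chunk","chunkId":"chunk_abc123_000","tokenCount":392}',
  '{"recordType":"chunk","chunkId":"chunk_abc123_001","tokenCount":410}',
  '',
].join('\n');
const parsedChunks = parseJsonl(sampleJsonl);
console.log(parsedChunks.length, parsedChunks[0].chunkId);
```

With apify-client, the text would typically come from `client.keyValueStore(run.defaultKeyValueStoreId).getRecord('chunks.jsonl')`. This assumes the exports live in the run's default key-value store, and the returned value may be a string or a Buffer depending on the stored content type, so convert it to a string before parsing.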

Pricing

Pricing is based on successfully processed pages:

Base price: $1.00 / 1,000 pages
Starter discount: $0.90 / 1,000 pages
Scale discount: $0.75 / 1,000 pages
Business discount: $0.50 / 1,000 pages

The Actor charges the page-processed event only after a page has been crawled, extracted, converted, and saved, and, when chunking is enabled, chunked.

It does not charge per chunk. Large pages may produce many chunks, but billing remains page-based.
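
As a quick worked example of page-based billing, the sketch below estimates the charge for a run from the rates listed above. The `estimateCostUsd` helper and the lowercase tier keys are hypothetical names chosen for illustration.

```javascript
// Per-1,000-page rates copied from the pricing list above (USD).
const RATE_PER_1000_PAGES = { base: 1.0, starter: 0.9, scale: 0.75, business: 0.5 };

// Hypothetical helper: estimated charge for a run. The number of chunks
// generated does not enter the calculation, only the pages processed.
function estimateCostUsd(pagesProcessed, tier = 'base') {
  const rate = RATE_PER_1000_PAGES[tier];
  if (rate === undefined) throw new Error(`Unknown tier: ${tier}`);
  return (pagesProcessed * rate) / 1000;
}

console.log(estimateCostUsd(50)); // 50 pages at the base rate
console.log(estimateCostUsd(10000, 'business')); // 10,000 pages at the business rate
```

At the base rate, the 50-page example input works out to 50 × $1.00 / 1,000 = $0.05.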

Known Limits

Private docs behind login are not supported in v1.
PDF/DOCX extraction is not included in v1.
JavaScript-heavy docs use a Playwright fallback, but static docs are faster and cheaper.
Exact duplicate detection uses normalized text hashes; near-duplicate detection is not included yet.

Input Fields

startUrls: documentation URLs to start crawling
sitemapUrls: optional XML sitemap URLs
maxPages: maximum number of successfully processed pages
maxDepth: maximum crawl depth
includePatterns: JavaScript regex strings for allowed URLs
excludePatterns: JavaScript regex strings for blocked URLs
crawlOnlyDocs: skip obvious non-doc paths such as blog, pricing, login, legal
outputFormats: json, markdown, or both
removeSelectors: CSS selectors to remove before extraction
keepSelectors: CSS selectors to restrict extraction to specific areas
preserveCodeBlocks: keep fenced code blocks
preserveTables: keep GitHub-Flavored Markdown tables
preserveLinks: keep links in Markdown and JSON
chunkingEnabled: generate RAG chunks
chunkStrategy: chunking strategy; header-aware is the documented value
chunkSize: target chunk size in tokens
chunkOverlap: approximate chunk overlap in tokens
deduplicateContent: mark exact duplicate pages and skip chunking them
respectRobotsTxt: respect robots.txt rules
maxConcurrency: maximum concurrent requests

URL Pattern Policy

includePatterns and excludePatterns are treated as JavaScript regular expression strings and compiled with new RegExp(pattern).

Example:

{
  "includePatterns": ["^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript"],
  "excludePatterns": ["/contributors\\.txt$", "/blog/"]
}
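
To preview which URLs a pattern set would admit, something like the following sketch can help. It compiles the patterns with new RegExp(pattern) as described above, but the precedence is an assumption: here a URL must match no exclude pattern and at least one include pattern (when any are given). The `isUrlAllowed` helper is hypothetical and may not mirror the Actor's exact rules.

```javascript
// Sketch of include/exclude filtering with patterns compiled via new RegExp(pattern).
// Assumed precedence: exclude patterns win; an empty include list allows everything.
function isUrlAllowed(url, includePatterns = [], excludePatterns = []) {
  const includes = includePatterns.map((p) => new RegExp(p));
  const excludes = excludePatterns.map((p) => new RegExp(p));
  if (excludes.some((re) => re.test(url))) return false;
  return includes.length === 0 || includes.some((re) => re.test(url));
}

// Same patterns as the JSON example above; "\\." in the string becomes "\." in the regex.
const includePatterns = ['^https://developer\\.mozilla\\.org/en-US/docs/Web/JavaScript'];
const excludePatterns = ['/contributors\\.txt$', '/blog/'];

const candidateUrls = [
  'https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide',
  'https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/contributors.txt',
  'https://developer.mozilla.org/en-US/docs/Web/CSS',
];
for (const url of candidateUrls) {
  console.log(isUrlAllowed(url, includePatterns, excludePatterns), url);
}
```

Because the patterns are unanchored unless they contain ^ or $, a pattern such as /blog/ matches anywhere in the URL.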

Quality Signals

Each page includes:

qualityScore: deterministic 0-100 score
qualityWarnings: extraction/chunking warnings
recommendedAction: use, review, or skip

These fields help identify pages that are ready for embedding versus pages that need manual review.
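
A simple way to act on these signals before embedding is to route page records into buckets. The sketch below is one possible policy, not something the Actor prescribes: `triagePages` is a hypothetical helper that skips duplicates and pages marked skip, embeds pages marked use, and sends everything else to review.

```javascript
// Hypothetical triage: route page records by duplicateOf and recommendedAction.
function triagePages(pages) {
  const result = { embed: [], review: [], skipped: [] };
  for (const page of pages) {
    if (page.duplicateOf || page.recommendedAction === 'skip') {
      result.skipped.push(page);
    } else if (page.recommendedAction === 'use') {
      result.embed.push(page);
    } else {
      // 'review' and any unexpected value go to manual review.
      result.review.push(page);
    }
  }
  return result;
}

// Made-up page records for illustration.
const samplePages = [
  { url: 'https://example.com/docs/a', qualityScore: 95, recommendedAction: 'use', duplicateOf: null },
  { url: 'https://example.com/docs/b', qualityScore: 60, recommendedAction: 'review', duplicateOf: null },
  { url: 'https://example.com/docs/c', qualityScore: 20, recommendedAction: 'skip', duplicateOf: null },
  { url: 'https://example.com/docs/a-copy', qualityScore: 95, recommendedAction: 'use', duplicateOf: 'https://example.com/docs/a' },
];
const triaged = triagePages(samplePages);
console.log(triaged.embed.length, triaged.review.length, triaged.skipped.length);
```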

Local Development

pnpm install
pnpm run build
pnpm start

Run locally with Apify CLI:

apify run --purge --input-file INPUT.example.json

Search Keywords

RAG, LLM, AI assistant, documentation scraper, docs scraper, Markdown scraper, JSONL export, vector database, embeddings, chunks, semantic chunking, Docusaurus scraper, GitBook scraper, MkDocs scraper, Material for MkDocs, developer docs, AI search, LangChain, LlamaIndex, OpenAI, Pinecone, Supabase Vector.
