Pricing

from $10.00 / 1,000 documents processed

RAG Docs Extractor - Documentation to Chunks

Developer

C. K.

Last modified

21 days ago

RAG Docs Extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata (source URL, heading path, token count). No post-processing. Pay per document processed.

What it does

Most doc scrapers give you raw HTML or a single wall of text. You then spend hours cleaning, splitting, and fixing broken context before anything is usable in a vector store. This Actor eliminates that step entirely.

Give it a documentation URL. It crawls the site, strips navigation/chrome, converts to clean markdown, and splits each page into semantically meaningful chunks that respect heading boundaries. Every chunk includes the metadata you need for retrieval: source URL, heading path (so you know where in the doc tree it came from), and token count (so you can plan your embedding budget).

The output drops straight into any vector store or RAG pipeline without cleanup.

Output format

Each chunk in the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| `content` | string | The chunk text in markdown or plain text |
| `heading_path` | string | Hierarchical path, e.g. `"Guide > Installation > Requirements"` |
| `chunk_index` | integer | Position of this chunk within its source document |
| `token_count` | integer | Token count (cl100k_base encoding) |
| `source_url` | string | The URL this chunk was extracted from |
| `document_title` | string | Page title |
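
As a rough illustration of how these fields might be consumed, the sketch below maps one dataset item to a generic `{id, text, metadata}` record of the kind many vector stores accept. The record shape, the `to_vector_record` and `total_tokens` helpers, and the sample values are all assumptions made for this example; they are not part of the Actor.

```python
import hashlib
from typing import TypedDict


class Chunk(TypedDict):
    """One dataset item, mirroring the documented output fields."""
    content: str
    heading_path: str
    chunk_index: int
    token_count: int
    source_url: str
    document_title: str


def to_vector_record(chunk: Chunk) -> dict:
    """Build a generic {id, text, metadata} record (hypothetical shape)."""
    # Derive a stable ID from URL + position so a re-crawl overwrites
    # existing records instead of duplicating them.
    raw_id = f"{chunk['source_url']}#{chunk['chunk_index']}"
    return {
        "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
        "text": chunk["content"],
        "metadata": {
            "source_url": chunk["source_url"],
            "heading_path": chunk["heading_path"].split(" > "),
            "document_title": chunk["document_title"],
            "token_count": chunk["token_count"],
        },
    }


def total_tokens(chunks: list[Chunk]) -> int:
    """Sum token counts to estimate the embedding budget for a batch."""
    return sum(c["token_count"] for c in chunks)


# Made-up sample item, for illustration only.
example: Chunk = {
    "content": "Python 3.9 or newer is required.",
    "heading_path": "Guide > Installation > Requirements",
    "chunk_index": 2,
    "token_count": 9,
    "source_url": "https://example.com/guide/installation",
    "document_title": "Installation",
}
record = to_vector_record(example)
```

Splitting `heading_path` into a list is one possible choice; it makes filtering by section straightforward in stores that support list-valued metadata.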

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrl` | string | required | Documentation URL to start crawling from |
| `maxPages` | integer | 50 | Maximum pages to crawl |
| `maxChunkTokens` | integer | 512 | Target max tokens per chunk |
| `crawlSameDomain` | boolean | true | Stay within the start URL's domain |
| `pathPrefix` | string | `""` | Only crawl paths starting with this prefix |
| `outputFormat` | string | `"markdown"` | `"markdown"` or `"plain_text"` |
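
To make the interaction between `crawlSameDomain` and `pathPrefix` concrete, here is a minimal sketch of a URL filter with the semantics the parameter descriptions suggest. The `should_crawl` function is hypothetical; the Actor's actual matching rules (for example, how it treats subdomains or trailing slashes) are not documented here and may differ.

```python
from urllib.parse import urlparse


def should_crawl(
    url: str,
    start_url: str,
    crawl_same_domain: bool = True,
    path_prefix: str = "",
) -> bool:
    """Return True if `url` passes the domain and path-prefix filters."""
    target, start = urlparse(url), urlparse(start_url)
    # Only follow web links; skip mailto:, javascript:, and similar.
    if target.scheme not in ("http", "https"):
        return False
    # Assumption: "same domain" means an exact hostname match.
    if crawl_same_domain and target.hostname != start.hostname:
        return False
    # An empty prefix matches every path.
    return target.path.startswith(path_prefix)
```

With `path_prefix="/tutorial/"` and a start URL of `https://fastapi.tiangolo.com/`, this sketch accepts `https://fastapi.tiangolo.com/tutorial/first-steps/` and rejects `https://fastapi.tiangolo.com/advanced/`.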

Example usage

Single page extraction

{
    "startUrl": "https://docs.python.org/3/library/asyncio.html",
    "maxPages": 1
}

Full docs site

{
    "startUrl": "https://fastapi.tiangolo.com/",
    "maxPages": 100,
    "pathPrefix": "/tutorial/",
    "maxChunkTokens": 256
}
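
The same inputs can also be sent programmatically. The sketch below assumes the `apify-client` Python package and an `APIFY_TOKEN` environment variable; `build_run_input` and `run_extractor` are hypothetical helper names, and the Actor ID is left as a placeholder because this page does not state it.

```python
import os


def build_run_input(
    start_url: str,
    max_pages: int = 50,
    max_chunk_tokens: int = 512,
    crawl_same_domain: bool = True,
    path_prefix: str = "",
    output_format: str = "markdown",
) -> dict:
    """Assemble an input object using the documented parameter names."""
    if output_format not in ("markdown", "plain_text"):
        raise ValueError("outputFormat must be 'markdown' or 'plain_text'")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "maxChunkTokens": max_chunk_tokens,
        "crawlSameDomain": crawl_same_domain,
        "pathPrefix": path_prefix,
        "outputFormat": output_format,
    }


def run_extractor(actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (placeholder Actor ID; substitute the real one from the store page):
# chunks = run_extractor(
#     "<username>/<actor-name>",
#     build_run_input("https://fastapi.tiangolo.com/", max_pages=100,
#                     path_prefix="/tutorial/", max_chunk_tokens=256),
# )
```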

Pricing

This Actor uses the pay-per-event model. You are charged per document (page) successfully processed and chunked. No charge for pages that are skipped (empty, non-content).
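
For budgeting, the charge can be approximated with simple arithmetic. The sketch below assumes the listed starting rate of $10.00 per 1,000 documents applies to your plan; `estimate_cost_usd` is a hypothetical helper, and the actual rate on your account may differ.

```python
def estimate_cost_usd(
    pages_crawled: int,
    pages_skipped: int = 0,
    usd_per_1000: float = 10.0,  # assumed rate, taken from the listing
) -> float:
    """Estimate the charge for a run; skipped pages are not billed."""
    if pages_skipped > pages_crawled:
        raise ValueError("pages_skipped cannot exceed pages_crawled")
    billable = pages_crawled - pages_skipped
    return round(billable * usd_per_1000 / 1000, 2)


# Upper bound for a run capped at maxPages=100 where every page has content.
worst_case = estimate_cost_usd(100)
```

Because `maxPages` caps the number of pages crawled, it also caps the per-run charge under this model.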

How the chunking works

HTML cleaning: strips navigation, sidebars, footers, cookie banners, and other non-content elements using a curated set of selectors. Falls back to <article>, <main>, or <body>.
Markdown conversion: converts the cleaned HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
Semantic splitting: splits on heading boundaries first, then paragraph boundaries, then sentence boundaries. Each chunk inherits the heading hierarchy from its position in the document.
Token counting: uses tiktoken's cl100k_base encoding (used by GPT-4 and OpenAI embedding models such as text-embedding-3-small) to compute token counts. Counts for models with a different tokenizer will differ.
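
The splitting step can be sketched roughly as follows. This is a simplified illustration, not the Actor's implementation: it splits on headings first, then packs paragraphs up to the token limit, and leaves out the sentence-level fallback, so a single oversized paragraph stays oversized. It also splits fenced code blocks at blank lines. The names `split_sections`, `chunk_markdown`, and `approx_tokens` are hypothetical, and the default counter is a crude word count rather than cl100k_base.

```python
import re
from typing import Callable

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[list[str], str]]:
    """Split on headings, returning (heading_path, body_text) pairs."""
    sections: list[tuple[list[str], str]] = []
    stack: list[tuple[int, str]] = []  # (level, title) of open headings
    path: list[str] = []
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2)
            # Close headings at the same or a deeper level, then open this one.
            stack[:] = [(lvl, t) for lvl, t in stack if lvl < level]
            stack.append((level, title))
            path = [t for _, t in stack]
        else:
            body.append(line)
    flush()
    return sections


def chunk_markdown(
    markdown: str,
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[dict]:
    """Heading-first, then paragraph-level chunking with per-chunk metadata."""
    chunks: list[dict] = []
    for path, text in split_sections(markdown):
        pieces: list[str] = []
        current = ""
        for para in re.split(r"\n\s*\n", text):
            candidate = f"{current}\n\n{para}" if current else para
            if current and count_tokens(candidate) > max_tokens:
                pieces.append(current)  # limit reached: start a new chunk
                current = para
            else:
                current = candidate
        if current:
            pieces.append(current)
        for piece in pieces:
            chunks.append({
                "content": piece,
                "heading_path": " > ".join(path),
                "chunk_index": len(chunks),
                "token_count": count_tokens(piece),
            })
    return chunks
```

To get cl100k_base counts instead of word counts, one option is to pass `count_tokens=lambda t: len(enc.encode(t))` with `enc = tiktoken.get_encoding("cl100k_base")`, assuming the `tiktoken` package is installed.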

Responsible use

This Actor respects robots.txt by default (enforced by Crawlee).
It sends a descriptive User-Agent header so site owners can recognize its traffic and block it if they wish.
Crawlee's built-in autoscaling keeps request rates reasonable and avoids overloading target servers.
You are responsible for ensuring your use complies with the target site's Terms of Service. Only crawl content you have the right to access and process.
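
To check ahead of a run whether a site's robots.txt would permit a given URL, Python's standard library offers a parser. The sketch below is independent of Crawlee's own enforcement; the `USER_AGENT` value and the sample robots.txt are made up for illustration, since the Actor's real User-Agent string is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical value; the Actor's actual User-Agent is not documented here.
USER_AGENT = "example-docs-crawler"

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


robots = build_robots(SAMPLE_ROBOTS_TXT)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the file over the network instead of parsing a string.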

Built with

Crawlee for reliable crawling (robots.txt compliant)
BeautifulSoup for HTML parsing
tiktoken for token counting
