RAG Docs Extractor - Documentation to Chunks
Pricing
from $10.00 / 1,000 documents processed
Developer
C. K.
Maintained by Community
RAG Docs Extractor
Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata (source URL, heading path, token count). No post-processing. Pay per document processed.
What it does
Most doc scrapers give you raw HTML or a single wall of text. You then spend hours cleaning, splitting, and fixing broken context before anything is usable in a vector store. This Actor eliminates that step entirely.
Give it a documentation URL. It crawls the site, strips navigation/chrome, converts to clean markdown, and splits each page into semantically meaningful chunks that respect heading boundaries. Every chunk includes the metadata you need for retrieval: source URL, heading path (so you know where in the doc tree it came from), and token count (so you can plan your embedding budget).
The output drops straight into any vector store or RAG pipeline without cleanup.
Output format
Each chunk in the dataset contains:
| Field | Type | Description |
|---|---|---|
| content | string | The chunk text in markdown or plain text |
| heading_path | string | Hierarchical path, e.g. "Guide > Installation > Requirements" |
| chunk_index | integer | Position of this chunk within its source document |
| token_count | integer | Token count (cl100k_base encoding) |
| source_url | string | The URL this chunk was extracted from |
| document_title | string | Page title |
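As a rough illustration of how these fields might be consumed, the sketch below maps one dataset item to a generic vector-store record. The record layout (`id`, `text`, `metadata`), the `to_record` helper, and the sample values are assumptions made for this example, not part of the Actor's output.

```python
import hashlib
from typing import TypedDict


class Chunk(TypedDict):
    """One dataset item, using the field names from the table above."""
    content: str
    heading_path: str
    chunk_index: int
    token_count: int
    source_url: str
    document_title: str


def to_record(chunk: Chunk) -> dict:
    """Build a vector-store record with a deterministic ID (hypothetical layout)."""
    # source_url plus chunk_index identifies a chunk within one crawl.
    key = f"{chunk['source_url']}#{chunk['chunk_index']}"
    return {
        "id": hashlib.sha256(key.encode("utf-8")).hexdigest()[:16],
        # Prefixing the heading path gives the embedded text some context.
        "text": f"{chunk['heading_path']}\n\n{chunk['content']}",
        "metadata": {
            "source_url": chunk["source_url"],
            "heading_path": chunk["heading_path"],
            "document_title": chunk["document_title"],
            "token_count": chunk["token_count"],
        },
    }


# Illustrative values only; a real item comes from the Actor's dataset.
example: Chunk = {
    "content": "Python 3.9 or newer is required.",
    "heading_path": "Guide > Installation > Requirements",
    "chunk_index": 2,
    "token_count": 8,
    "source_url": "https://example.com/docs/install",
    "document_title": "Installation",
}
record = to_record(example)
print(record["id"], record["metadata"]["heading_path"])
```

Deriving the ID from `source_url` and `chunk_index` keeps re-ingestion idempotent as long as the page splits the same way on the next crawl.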
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Documentation URL to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl |
| maxChunkTokens | integer | 512 | Target max tokens per chunk |
| crawlSameDomain | boolean | true | Stay within the start URL's domain |
| pathPrefix | string | "" | Only crawl paths starting with this prefix |
| outputFormat | string | "markdown" | "markdown" or "plain_text" |
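To make the interaction between `crawlSameDomain` and `pathPrefix` concrete, here is a minimal sketch of a scope check. The exact matching rules the Actor applies are not documented here, so the `in_scope` helper and its rules (exact hostname match, plain prefix match on the path) are assumptions.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_url: str, crawl_same_domain: bool = True,
             path_prefix: str = "") -> bool:
    """Return True if url would be crawled under this assumed reading of the options."""
    target, start = urlparse(url), urlparse(start_url)
    # Skip mailto:, javascript:, and other non-web links.
    if target.scheme not in ("http", "https"):
        return False
    # Assumption: "same domain" means an exact hostname match.
    if crawl_same_domain and target.hostname != start.hostname:
        return False
    # An empty prefix matches every path.
    return (target.path or "/").startswith(path_prefix)


start = "https://fastapi.tiangolo.com/"
print(in_scope("https://fastapi.tiangolo.com/tutorial/first-steps/", start,
               path_prefix="/tutorial/"))
print(in_scope("https://fastapi.tiangolo.com/advanced/", start,
               path_prefix="/tutorial/"))
```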
Example usage
Single page extraction
{"startUrl": "https://docs.python.org/3/library/asyncio.html", "maxPages": 1}
Full docs site
{"startUrl": "https://fastapi.tiangolo.com/", "maxPages": 100, "pathPrefix": "/tutorial/", "maxChunkTokens": 256}
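The same inputs can be sent from code. The sketch below assumes the `apify-client` Python package is installed; `build_run_input` and `run_and_fetch` are hypothetical helpers written for this example, and the Actor ID and API token are placeholders you would need to supply.

```python
from typing import Dict, List


def build_run_input(start_url: str, max_pages: int = 50,
                    max_chunk_tokens: int = 512, path_prefix: str = "",
                    output_format: str = "markdown") -> Dict:
    """Assemble the Actor input using the parameter names from the table above."""
    if output_format not in ("markdown", "plain_text"):
        raise ValueError("output_format must be 'markdown' or 'plain_text'")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "maxChunkTokens": max_chunk_tokens,
        "crawlSameDomain": True,
        "pathPrefix": path_prefix,
        "outputFormat": output_format,
    }


def run_and_fetch(token: str, actor_id: str, run_input: Dict) -> List[Dict]:
    """Start a run, wait for it to finish, and return all chunks from its dataset."""
    from apify_client import ApifyClient  # imported lazily; pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input("https://fastapi.tiangolo.com/", max_pages=100,
                            max_chunk_tokens=256, path_prefix="/tutorial/")
print(run_input)
# chunks = run_and_fetch("<APIFY_TOKEN>", "<username>/<actor-name>", run_input)
```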
Pricing
This Actor uses the pay-per-event model. You are charged per document (page) successfully processed and chunked. Pages that are skipped (empty or non-content) are not charged.
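For budgeting, a back-of-the-envelope estimate might look like the sketch below. It assumes a flat rate equal to the listed starting price of $10.00 per 1,000 documents; the rate actually billed may differ, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal


def estimate_cost(pages_processed: int,
                  usd_per_1000: Decimal = Decimal("10.00")) -> Decimal:
    """Estimate the charge in USD, counting only pages that were processed."""
    cost = Decimal(pages_processed) * usd_per_1000 / 1000
    return cost.quantize(Decimal("0.01"))


print(estimate_cost(100))  # 1.00
print(estimate_cost(50))   # 0.50
```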
How the chunking works
- HTML cleaning: strips navigation, sidebars, footers, cookie banners, and other non-content elements using a curated set of selectors. Falls back to `<article>`, `<main>`, or `<body>`.
- Markdown conversion: converts the cleaned HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
- Semantic splitting: splits on heading boundaries first, then paragraph boundaries, then sentence boundaries. Each chunk inherits the heading hierarchy from its position in the document.
- Token counting: uses `cl100k_base` (the encoding used by GPT-4 and by OpenAI embedding models such as text-embedding-3-small) for token counts. Counts will differ for models that use another tokenizer.
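The splitting and heading-path steps above can be sketched roughly as follows. This is not the Actor's implementation: it uses only the standard library, stops at paragraph boundaries (no sentence-level fallback, so one oversized paragraph becomes one oversized chunk), and counts whitespace-separated words as a stand-in for `cl100k_base` tokens. The names `chunk_markdown` and `count_tokens` are hypothetical.

```python
import re
from typing import Callable, Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def count_tokens(text: str) -> int:
    """Stand-in counter (whitespace words), not a real cl100k_base count."""
    return len(text.split())


def chunk_markdown(markdown: str, max_tokens: int = 512,
                   counter: Callable[[str], int] = count_tokens) -> List[Dict]:
    chunks: List[Dict] = []
    path: List[str] = []       # current heading stack, e.g. ["Guide", "Installation"]
    buffer: List[str] = []     # paragraphs collected for the chunk being built
    paragraph: List[str] = []  # lines of the paragraph being read
    in_fence = False           # True while inside a fenced code block

    def flush() -> None:
        """Emit the buffered paragraphs as one chunk under the current heading path."""
        if buffer:
            text = "\n\n".join(buffer)
            chunks.append({
                "content": text,
                "heading_path": " > ".join(path),
                "chunk_index": len(chunks),
                "token_count": counter(text),
            })
            buffer.clear()

    def end_paragraph() -> None:
        """Move the finished paragraph into the buffer, flushing first if it would overflow."""
        if paragraph:
            text = "\n".join(paragraph)
            if buffer and counter("\n\n".join(buffer + [text])) > max_tokens:
                flush()
            buffer.append(text)
            paragraph.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Lines inside code fences are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            end_paragraph()
            flush()  # a heading always closes the current chunk
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level and deeper
            path.append(match.group(2))
        elif not line.strip() and not in_fence:
            end_paragraph()
        else:
            paragraph.append(line)
    end_paragraph()
    flush()
    return chunks


doc = """# Guide
Intro text.

## Installation
Run the installer.

### Requirements
Python 3.9 or newer.
"""
for chunk in chunk_markdown(doc):
    print(chunk["chunk_index"], chunk["heading_path"], chunk["token_count"])
```

To report real token counts, pass a tiktoken-based function as `counter`, for example one built on `tiktoken.get_encoding("cl100k_base")`.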
Responsible use
- This Actor respects `robots.txt` by default (enforced by Crawlee).
- It identifies itself with a descriptive `User-Agent` header so site owners can identify and block it.
- Crawlee's built-in autoscaling keeps request rates reasonable and avoids overloading target servers.
- You are responsible for ensuring your use complies with the target site's Terms of Service. Only crawl content you have the right to access and process.
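If you want to check ahead of a crawl whether a site's `robots.txt` permits a given URL, Python's standard library can evaluate the rules. This is an independent pre-check, not how Crawlee enforces them; the `allowed` helper, the `example-docs-bot` User-Agent string, and the sample rules are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical name; the Actor's real User-Agent string is not stated here.
USER_AGENT = "example-docs-bot"


def allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Evaluate already-downloaded robots.txt text against a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """User-agent: *
Disallow: /private/
"""
print(allowed(rules, "https://example.com/docs/"))
print(allowed(rules, "https://example.com/private/notes"))
```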
Built with
- Crawlee for reliable crawling (robots.txt compliant)
- BeautifulSoup for HTML parsing
- tiktoken for token counting