RAG Docs Extractor - Documentation to Chunks

Pricing

from $10.00 / 1,000 documents processed

Developer

C. K.

Maintained by Community

RAG Docs Extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata (source URL, heading path, token count). No post-processing. Pay per document processed.

What it does

Most doc scrapers give you raw HTML or a single wall of text. You then spend hours cleaning, splitting, and fixing broken context before anything is usable in a vector store. This Actor eliminates that step entirely.

Give it a documentation URL. It crawls the site, strips navigation/chrome, converts to clean markdown, and splits each page into semantically meaningful chunks that respect heading boundaries. Every chunk includes the metadata you need for retrieval: source URL, heading path (so you know where in the doc tree it came from), and token count (so you can plan your embedding budget).

The output drops straight into any vector store or RAG pipeline without cleanup.

Output format

Each chunk in the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| content | string | The chunk text in markdown or plain text |
| heading_path | string | Hierarchical path, e.g. "Guide > Installation > Requirements" |
| chunk_index | integer | Position of this chunk within its source document |
| token_count | integer | Token count (cl100k_base encoding) |
| source_url | string | The URL this chunk was extracted from |
| document_title | string | Page title |
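
As a rough illustration of how these fields might be consumed, the sketch below maps one chunk to a generic vector-store record. The record layout (`id`, `text`, `metadata`), the helper name `to_vector_record`, and the sample values are assumptions made for this example; they are not part of the Actor's output or of any particular vector store's API.

```python
import hashlib


def to_vector_record(chunk: dict) -> dict:
    """Map one dataset item to a generic record (hypothetical layout)."""
    # A stable ID derived from the source URL and the chunk position,
    # so re-running the crawl overwrites records instead of duplicating them.
    raw_id = f'{chunk["source_url"]}#{chunk["chunk_index"]}'
    return {
        "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
        # Prefixing the heading path gives the embedding some document context.
        "text": f'{chunk["heading_path"]}\n\n{chunk["content"]}',
        "metadata": {
            "source_url": chunk["source_url"],
            "document_title": chunk["document_title"],
            "heading_path": chunk["heading_path"],
            "token_count": chunk["token_count"],
        },
    }


# Sample item with made-up values, shaped like the fields listed above.
sample_chunk = {
    "content": "Install the package with pip.",
    "heading_path": "Guide > Installation > Requirements",
    "chunk_index": 3,
    "token_count": 7,
    "source_url": "https://example.com/docs/install",
    "document_title": "Installation",
}
record = to_vector_record(sample_chunk)
```

Whether to prepend the heading path to the embedded text is a design choice; some pipelines keep it only as metadata and filter on it at query time.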

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | required | Documentation URL to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl |
| maxChunkTokens | integer | 512 | Target max tokens per chunk |
| crawlSameDomain | boolean | true | Stay within the start URL's domain |
| pathPrefix | string | "" | Only crawl paths starting with this prefix |
| outputFormat | string | "markdown" | "markdown" or "plain_text" |
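
If you build the input programmatically, a small typed wrapper can catch mistakes before a run starts. The sketch below mirrors the parameter names and defaults from the table; the class name `ExtractorInput` and the validation rules are assumptions for illustration, and the Actor may validate its input differently.

```python
from dataclasses import asdict, dataclass


@dataclass
class ExtractorInput:
    # Field names intentionally match the Actor's JSON keys (camelCase).
    startUrl: str
    maxPages: int = 50
    maxChunkTokens: int = 512
    crawlSameDomain: bool = True
    pathPrefix: str = ""
    outputFormat: str = "markdown"

    def __post_init__(self) -> None:
        # These checks are assumptions, not the Actor's documented rules.
        if not self.startUrl.startswith(("http://", "https://")):
            raise ValueError("startUrl must be an http(s) URL")
        if self.maxPages < 1 or self.maxChunkTokens < 1:
            raise ValueError("maxPages and maxChunkTokens must be positive")
        if self.outputFormat not in ("markdown", "plain_text"):
            raise ValueError('outputFormat must be "markdown" or "plain_text"')

    def to_run_input(self) -> dict:
        """Return a plain dict that can be serialized as the Actor's JSON input."""
        return asdict(self)


run_input = ExtractorInput(
    startUrl="https://fastapi.tiangolo.com/",
    pathPrefix="/tutorial/",
).to_run_input()
```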

Example usage

Single page extraction

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Full docs site

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "pathPrefix": "/tutorial/",
  "maxChunkTokens": 256
}
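
The same inputs can be sent from code. The sketch below assumes the `apify-client` Python package and uses its `actor(...).call(...)` and `dataset(...).iterate_items()` methods; the Actor ID `username/rag-docs-extractor` and the `APIFY_TOKEN` environment variable are placeholders, not values taken from this page.

```python
from typing import Any, Dict, Iterator


def run_extractor(client: Any, actor_id: str, run_input: Dict[str, Any]) -> Iterator[dict]:
    """Start a run, wait for it to finish, and yield the resulting chunks.

    `client` is expected to be an `apify_client.ApifyClient` instance, e.g.:

        import os
        from apify_client import ApifyClient

        client = ApifyClient(os.environ["APIFY_TOKEN"])  # placeholder variable name
        chunks = run_extractor(
            client,
            "username/rag-docs-extractor",  # hypothetical Actor ID
            {"startUrl": "https://fastapi.tiangolo.com/", "maxPages": 100},
        )
    """
    # call() blocks until the run finishes and returns the run record.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a run record")
    # Each dataset item is one chunk with the fields described under Output format.
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

Taking the client as a parameter keeps the function free of credentials and makes it easy to exercise with a stub in tests.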

Pricing

This Actor uses the pay-per-event model. You are charged for each document (page) that is successfully processed and chunked. Skipped pages (empty or non-content) are not charged.
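
For budgeting, the per-document charge reduces to simple arithmetic. The sketch below assumes the listed rate of $10.00 per 1,000 documents; the listing says "from", so treat that figure as a lower bound, and the function name `estimate_cost` is made up for this example.

```python
def estimate_cost(pages_processed: int, price_per_1000: float = 10.00) -> float:
    """Estimate the charge in USD for a run, counting only processed pages."""
    # Skipped pages are not billed, so they should not be included in the count.
    return round(pages_processed * price_per_1000 / 1000, 2)


# A run capped at maxPages=100 where every page is processed:
cost_100_pages = estimate_cost(100)  # 1.0 (USD) at the assumed rate
```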

How the chunking works

  1. HTML cleaning: strips navigation, sidebars, footers, cookie banners, and other non-content elements using a curated set of selectors. Falls back to <article>, <main>, or <body>.
  2. Markdown conversion: converts the cleaned HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
  3. Semantic splitting: splits on heading boundaries first, then paragraph boundaries, then sentence boundaries. Each chunk inherits the heading hierarchy from its position in the document.
  4. Token counting: uses cl100k_base (the encoding used by GPT-4 and OpenAI's text-embedding-3 models) for token counts. Counts will differ for models that use another tokenizer.
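
Steps 3 and 4 can be sketched roughly as follows. This is a simplified illustration, not the Actor's actual code: the function names are invented, the default token counter is a whitespace word count standing in for cl100k_base, and a single sentence longer than the budget is left oversized.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    content: str
    heading_path: str
    chunk_index: int
    token_count: int


def approx_tokens(text: str) -> int:
    # Stand-in counter (assumption): whitespace words, NOT cl100k_base tokens.
    return len(text.split())


def pack(units: List[str], max_tokens: int, count: Callable[[str], int], sep: str = "\n\n") -> List[str]:
    """Greedily join consecutive units while the joined text fits the budget."""
    pieces, current = [], ""
    for unit in units:
        candidate = f"{current}{sep}{unit}" if current else unit
        if count(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = unit  # may still exceed the budget on its own
    if current:
        pieces.append(current)
    return pieces


def split_section(text: str, max_tokens: int, count: Callable[[str], int]) -> List[str]:
    """Split on paragraphs first; fall back to sentences for oversized paragraphs."""
    units: List[str] = []
    for para in re.split(r"\n\s*\n", text):
        if count(para) <= max_tokens:
            units.append(para)
        else:
            sentences = re.split(r"(?<=[.!?])\s+", para)
            units.extend(pack(sentences, max_tokens, count, sep=" "))
    return pack(units, max_tokens, count)


def split_markdown(markdown: str, max_tokens: int = 512,
                   count: Callable[[str], int] = approx_tokens) -> List[Chunk]:
    sections: List[Tuple[str, str]] = []  # (heading_path, body text)
    stack: List[Tuple[int, str]] = []     # open headings as (level, title)
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under the old heading path
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()

    chunks: List[Chunk] = []
    for path, text in sections:
        for piece in split_section(text, max_tokens, count):
            chunks.append(Chunk(piece, path, len(chunks), count(piece)))
    return chunks
```

To count real cl100k_base tokens, one option is the `tiktoken` package: create the encoder with `tiktoken.get_encoding("cl100k_base")` and pass `count=lambda t: len(enc.encode(t))`.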

Responsible use

  • This Actor respects robots.txt by default (enforced by Crawlee).
  • It sends a descriptive User-Agent header so site owners can recognize and, if they wish, block it.
  • Crawlee's built-in autoscaling keeps request rates reasonable and avoids overloading target servers.
  • You are responsible for ensuring your use complies with the target site's Terms of Service. Only crawl content you have the right to access and process.
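
If you want to check in advance whether a path is open to crawlers, Python's standard `urllib.robotparser` can evaluate robots.txt rules. The sketch below is independent of Crawlee and of this Actor; the user-agent string `example-docs-bot` and the rules shown are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules as they might appear in a site's robots.txt (made up).
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is needed here.
# For a live site, call set_url("https://example.com/robots.txt") and read() instead.
parser.parse(robots_lines)

docs_allowed = parser.can_fetch("example-docs-bot", "https://example.com/docs/intro")
private_allowed = parser.can_fetch("example-docs-bot", "https://example.com/private/notes")
```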

Built with