RAG-Ready Documentation Scraper

Pricing

from $3.99 / 1,000 results

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Developer

Alaricus

Maintained by Community

RAG-Ready Documentation Scraper Actor

What does the RAG-Ready Documentation Scraper do?

The RAG-Ready Documentation Scraper is a high-performance web crawler and content parser designed specifically for LLM, vector database, and Retrieval-Augmented Generation (RAG) pipelines. It extracts clean, structured, framework-optimized Markdown from documentation sites and standard websites, stripping out page clutter (navigation panels, header menus, search boxes, cookie consent forms, and footer noise) so that only the main content body remains.

To make the output immediately ready for ingestion, the Actor performs semantic, paragraph-based chunking with a configurable chunk size and overlap, both measured in characters. It also parses XML sitemaps automatically, so an entire documentation tree can be crawled with no extra configuration.
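The chunking behavior described above can be sketched roughly as follows. This is a minimal illustration and not the Actor's actual implementation: the function name `chunk_markdown` is hypothetical, oversized paragraphs are split by character only (the sentence-level split is omitted), and the overlap is taken from the tail of the previous chunk, so a chunk may exceed the target size by up to the overlap length.

```python
import re


def chunk_markdown(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Pack paragraphs into chunks of roughly chunk_size characters.

    Illustrative sketch only; the Actor's real splitting rules may differ.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # A paragraph longer than chunk_size is cut into character windows.
    pieces: list[str] = []
    for para in paragraphs:
        for start in range(0, len(para), chunk_size):
            pieces.append(para[start:start + chunk_size])

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: close the current chunk and start a new one
        # seeded with the tail of the previous chunk as context overlap.
        chunks.append(current)
        tail = current[-chunk_overlap:] if chunk_overlap else ""
        current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap is added on top of the next paragraph, `chunk_size` behaves as a target rather than a hard limit in this sketch.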

Key Features

  • 🧹 Boilerplate Layout Scrubbing: Automatically detects and isolates main documentation content layouts. Eliminates menus, headers, sidebars, footer links, and cookie alerts.
  • 🧩 Semantic Chunking: Splits extracted Markdown documents cleanly on paragraph boundaries. If a single paragraph exceeds the chunk size, it is split further by sentence or by character, with a configurable overlap between chunks so that context is not lost.
  • 📄 XML Sitemap Parsing: Simply supply a sitemap.xml URL as a starting point and the scraper will auto-discover and queue all links in the sitemap.
  • 📦 Framework Adaptation: Built-in optimized container detection for popular documentation builders:
    • Docusaurus
    • GitBook
    • Sphinx
    • ReadTheDocs
    • Auto-Detect (for any generic blog, API reference, or standard page)
  • 🖼️ Image & Link Toggles: Include or strip images (![alt](url)) and hyperlinks ([text](url)) on demand depending on your RAG embedding requirements.
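For readers curious how sitemap discovery of this kind typically works, here is a small sketch using only the Python standard library. It is an assumption about the general technique, not the Actor's own code: the function name `extract_sitemap_urls` is hypothetical, and it relies on the standard sitemaps.org namespace to tell a regular sitemap (`urlset`) from a sitemap index (`sitemapindex`).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, nested_sitemap_urls) found in a sitemap document."""
    root = ET.fromstring(xml_text)
    # Only <loc> elements in the sitemap namespace are collected, so
    # extension tags such as image:loc are ignored.
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # A sitemap index lists further sitemaps rather than pages.
        return [], locs
    return locs, []
```

A crawler would queue the page URLs directly and fetch each nested sitemap in turn, applying the same function to it.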

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| Start URLs (start_urls) | Array | Required | List of documentation base URLs or XML sitemap URLs. |
| Documentation Framework (framework) | String | auto | Choose target framework (auto, docusaurus, gitbook, sphinx, readthedocs) to improve main content wrapper detection. |
| Enable Semantic Chunking (enable_chunking) | Boolean | true | When enabled, splits Markdown outputs into semantic chunks on paragraph boundaries. |
| Chunk Size (chunk_size) | Integer | 1500 | Target character size of each chunk. |
| Chunk Overlap (chunk_overlap) | Integer | 200 | Overlap character length between sequential chunks. |
| Maximum Pages to Scrape (max_pages) | Integer | 50 | Maximum number of pages the crawler will visit. |
| Include Image Links (include_images) | Boolean | true | Retain image Markdown tags in extracted text. |
| Include Hyperlinks (include_links) | Boolean | true | Retain anchor link Markdown tags in extracted text. |
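To give a feel for what the include_images and include_links toggles mean for the resulting text, the sketch below applies a comparable transformation with regular expressions. It is only an approximation under stated assumptions: the helper `apply_toggles` is hypothetical, it assumes that a stripped hyperlink keeps its visible text, and it does not handle nested constructs such as an image wrapped in a link.

```python
import re

# ![alt](url) : inline Markdown image.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
# [text](url) : inline Markdown link that is not preceded by "!".
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")


def apply_toggles(markdown: str, include_images: bool = True, include_links: bool = True) -> str:
    """Drop images and/or unwrap links, mirroring the two boolean inputs."""
    if not include_images:
        # Remove the whole image tag, including its alt text.
        markdown = IMAGE_RE.sub("", markdown)
    if not include_links:
        # Assumption: keep the link text and discard only the URL.
        markdown = LINK_RE.sub(r"\1", markdown)
    return markdown
```

Stripping links and images tends to reduce noise in embeddings, while keeping them preserves references that a downstream application may want to show.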

Input Example

{
    "start_urls": [
        {
            "url": "https://docusaurus.io/docs"
        }
    ],
    "framework": "docusaurus",
    "enable_chunking": true,
    "chunk_size": 1500,
    "chunk_overlap": 200,
    "max_pages": 100,
    "include_images": true,
    "include_links": true
}
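One way to submit an input like this programmatically is the official Apify Python client (`pip install apify-client`). The sketch below is hedged: `ACTOR_ID` is a placeholder because the Actor's exact ID is not given here, the helper names are hypothetical, and the defaults in `build_run_input` simply repeat the values from the parameter table.

```python
ACTOR_ID = "<username>/<actor-name>"  # placeholder: copy the real ID from the Actor's page

ALLOWED_FRAMEWORKS = {"auto", "docusaurus", "gitbook", "sphinx", "readthedocs"}


def build_run_input(start_url: str, framework: str = "auto", max_pages: int = 50) -> dict:
    """Assemble an input object using the field names from the parameter table."""
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework}")
    return {
        "start_urls": [{"url": start_url}],
        "framework": framework,
        "enable_chunking": True,
        "chunk_size": 1500,
        "chunk_overlap": 200,
        "max_pages": max_pages,
        "include_images": True,
        "include_links": True,
    }


def run_scraper(run_input: dict, token: str) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so the input builder works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be `run_scraper(build_run_input("https://docusaurus.io/docs", framework="docusaurus", max_pages=100), token)`, with the token read from an environment variable of your choosing.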

Output Data Structure

The results are pushed directly to your Apify dataset. Each item represents one scraped page and has the following structure:

{
    "url": "https://docs.gitbook.com/",
    "title": "Overview | GitBook Documentation",
    "markdown": "# Overview\n\nWelcome to the GitBook documentation portal...",
    "chunks": [
        "# Overview\n\nWelcome to the GitBook documentation portal...",
        "To start configuring your docs, see the Git Sync integration guide..."
    ],
    "chunk_count": 2
}
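Before loading items like this into a vector database, it is common to flatten each page into one record per chunk. The following sketch shows one possible shape. The function `to_records`, the record layout, and the fallback to the full `markdown` field when `chunks` is missing or empty are all assumptions made for illustration, not part of the Actor's output contract.

```python
import hashlib


def to_records(item: dict) -> list[dict]:
    """Turn one dataset item into a list of per-chunk records."""
    # Assumption: if chunking was disabled, fall back to the whole page.
    chunks = item.get("chunks") or [item["markdown"]]
    records = []
    for index, text in enumerate(chunks):
        # Stable ID derived from the page URL and the chunk position.
        key = f"{item['url']}#{index}"
        records.append({
            "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
            "text": text,
            "metadata": {
                "url": item["url"],
                "title": item["title"],
                "chunk_index": index,
            },
        })
    return records
```

Deriving the ID from the URL and chunk index means that re-running the scraper overwrites earlier records for the same page instead of duplicating them, provided the chunking settings stay the same.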

Pricing: Pay-Per-Event (PPE)

This Actor uses the transparent Pay-Per-Event pricing model, meaning you only pay for the pages you successfully scrape.

  • Price per 1,000 pages: $3.99
  • Price per page: $0.00399
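As a quick sanity check on these figures, the per-page price can be turned into a cost estimate for a planned crawl. The helper below is a hypothetical convenience that assumes the flat $3.99 per 1,000 pages rate quoted above and ignores any platform usage charges or plan-specific pricing.

```python
from decimal import Decimal

PRICE_PER_1000_PAGES = Decimal("3.99")  # USD, as quoted in this section


def estimate_cost(pages: int) -> Decimal:
    """Estimated cost in USD for a number of successfully scraped pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    # Decimal avoids binary floating-point rounding in currency amounts.
    return (PRICE_PER_1000_PAGES * pages / 1000).quantize(Decimal("0.00001"))
```

For example, a run capped at the default max_pages of 50 would cost at most about $0.20 under this assumption.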

Feedback & Customizations

If you encounter any issues, need to request a specific feature, or require a custom scraping solution for your business, feel free to get in touch.