# RAG-Ready Documentation Scraper (`alaricus/rag-docs-markdown-scraper`) Actor

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

- **URL**: https://apify.com/alaricus/rag-docs-markdown-scraper.md
- **Developed by:** [Alaricus](https://apify.com/alaricus) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

from $3.99 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## RAG-Ready Documentation Scraper

[![RAG-Ready Documentation Scraper Actor](https://apify.com/actor-badge?actor=alaricus/rag-docs-markdown-scraper)](https://apify.com/alaricus/rag-docs-markdown-scraper)

### What does the RAG-Ready Documentation Scraper do?

The **RAG-Ready Documentation Scraper** is a web crawler and content parser designed specifically for LLM, vector database, and Retrieval-Augmented Generation (RAG) pipelines. It extracts clean, structured, framework-optimized Markdown text from documentation sites and standard websites, stripping out clutter (navigation panels, header menus, search boxes, cookie consent forms, and footer noise) so that only the main content body remains.

To make the output immediately ready for ingestion, the Actor performs **semantic paragraph-based chunking** with a configurable chunk size (in characters) and contextual overlap. It also parses XML sitemaps automatically, so entire documentation trees can be crawled without extra configuration.

### Key Features

- 🧹 **Boilerplate Layout Scrubbing**: Automatically detects and isolates main documentation content layouts. Eliminates menus, headers, sidebars, footer links, and cookie alerts.
- 🧩 **Semantic Chunking**: Splits extracted Markdown documents cleanly on paragraph boundaries. If a single paragraph is too large, it is split by sentence or by character, with a configurable overlap between chunks to avoid losing context.
- 📄 **XML Sitemap Parsing**: Simply supply a `sitemap.xml` URL as a starting point and the scraper will auto-discover and queue all links in the sitemap.
- 📦 **Framework Adaptation**: Built-in optimized container detection for popular documentation builders:
  - **Docusaurus**
  - **GitBook**
  - **Sphinx**
  - **ReadTheDocs**
  - **Auto-Detect** (for any generic blog, API reference, or standard page)
- 🖼️ **Image & Link Toggles**: Include or strip images (`![alt](url)`) and hyperlinks (`[text](url)`) on demand depending on your RAG embedding requirements.

---

### Input Parameters


| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **Start URLs** (`start_urls`) | `Array` | *Required* | List of documentation base URLs or XML sitemap URLs. |
| **Documentation Framework** (`framework`) | `String` | `auto` | Choose target framework (`auto`, `docusaurus`, `gitbook`, `sphinx`, `readthedocs`) to improve main content wrapper detection. |
| **Enable Semantic Chunking** (`enable_chunking`) | `Boolean` | `true` | When enabled, splits Markdown outputs into semantic chunks on paragraph boundaries. |
| **Chunk Size** (`chunk_size`) | `Integer` | `1500` | Target character size of each chunk. |
| **Chunk Overlap** (`chunk_overlap`) | `Integer` | `200` | Overlap character length between sequential chunks. |
| **Maximum Pages to Scrape** (`max_pages`) | `Integer` | `50` | Maximum number of pages the crawler will visit. |
| **Include Image Links** (`include_images`) | `Boolean` | `true` | Retain image Markdown tags in extracted text. |
| **Include Hyperlinks** (`include_links`) | `Boolean` | `true` | Retain anchor link Markdown tags in extracted text. |

#### Input Example

```json
{
  "start_urls": [
    {
      "url": "https://docusaurus.io/docs"
    }
  ],
  "framework": "docusaurus",
  "enable_chunking": true,
  "chunk_size": 1500,
  "chunk_overlap": 200,
  "max_pages": 100,
  "include_images": true,
  "include_links": true
}
````

***

### Output Data Structure

The results are pushed directly to your Apify dataset. Each item represents a scraped page and has the following schema:

```json
{
  "url": "https://docs.gitbook.com/",
  "title": "Overview | GitBook Documentation",
  "markdown": "## Overview\n\nWelcome to the GitBook documentation portal...",
  "chunks": [
    "## Overview\n\nWelcome to the GitBook documentation portal...",
    "To start configuring your docs, see the Git Sync integration guide..."
  ],
  "chunk_count": 2
}
```
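For ingestion into a vector database, each page item usually needs to be flattened into one record per chunk. The following is a minimal sketch of that step, not part of the Actor itself: the function name `to_chunk_records`, the record layout (`id`, `text`, `metadata`), and the fallback to the full `markdown` field when `chunks` is missing or empty are all assumptions made for illustration.

```python
import hashlib


def to_chunk_records(items: list[dict]) -> list[dict]:
    """Flatten scraped page items into one record per chunk."""
    records = []
    for item in items:
        # Assumption: if a page has no chunks, fall back to the whole page.
        chunks = item.get("chunks") or [item.get("markdown", "")]
        for index, text in enumerate(chunks):
            if not text:
                continue
            # Stable ID derived from the page URL and the chunk position.
            key = f"{item['url']}#{index}".encode("utf-8")
            records.append({
                "id": hashlib.sha256(key).hexdigest()[:16],
                "text": text,
                "metadata": {
                    "url": item["url"],
                    "title": item.get("title", ""),
                    "chunk_index": index,
                },
            })
    return records


# Sample item copied from the schema shown above.
sample_item = {
    "url": "https://docs.gitbook.com/",
    "title": "Overview | GitBook Documentation",
    "markdown": "## Overview\n\nWelcome to the GitBook documentation portal...",
    "chunks": [
        "## Overview\n\nWelcome to the GitBook documentation portal...",
        "To start configuring your docs, see the Git Sync integration guide...",
    ],
    "chunk_count": 2,
}

records = to_chunk_records([sample_item])
print(len(records), records[0]["metadata"])
```

Deriving the ID from the URL and chunk index keeps re-runs idempotent when the records are upserted, as long as the chunking parameters stay the same.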

***

### Pricing: Pay-Per-Event (PPE)

This Actor uses the transparent **Pay-Per-Event** pricing model, meaning you only pay for the pages you successfully scrape.

- **Price per 1,000 pages**: $3.99
- **Price per page**: $0.00399
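
As a quick budgeting aid, the per-page price above can be turned into a run estimate. This is only a sketch based on the listed $3.99 per 1,000 pages; the function name `estimate_cost` is hypothetical, and the actual charge depends on how many pages are successfully scraped.

```python
from decimal import Decimal

# Listed price from this page: $3.99 per 1,000 scraped pages.
PRICE_PER_1000_PAGES = Decimal("3.99")


def estimate_cost(pages: int) -> Decimal:
    """Estimate the pay-per-event cost in USD for a number of scraped pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return PRICE_PER_1000_PAGES * pages / 1000


# Upper bound for a run with the default max_pages of 50.
print(estimate_cost(50))
```

`Decimal` is used instead of `float` so that small per-page amounts do not pick up binary rounding noise.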

***

### Feedback & Customizations

If you encounter any issues, need to request a specific feature, or require a custom scraping solution for your business, feel free to get in touch.

- **Developer**: bd.pascari@gmail.com

# Actor input Schema

## `start_urls` (type: `array`):

List of URLs of the documentation pages or XML sitemaps to start crawling.
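
To show what a sitemap start URL expands to, here is a minimal sketch of extracting page URLs from a sitemap document with Python's standard library. It illustrates the sitemaps.org format only and is not the Actor's internal code; the function name `extract_sitemap_urls` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
</urlset>"""

print(extract_sitemap_urls(SAMPLE_SITEMAP))
```

In a sitemap index, the `<loc>` entries point to further sitemaps rather than pages, so a crawler would need to fetch and parse those in turn.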

## `framework` (type: `string`):

Select the documentation framework to scrape. 'auto' will attempt to auto-detect.

## `enable_chunking` (type: `boolean`):

Split the generated Markdown pages into smaller text chunks suitable for vector databases.

## `chunk_size` (type: `integer`):

Approximate character size for each semantic chunk.

## `chunk_overlap` (type: `integer`):

Overlap between consecutive chunks to maintain local context.
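
The interaction between `chunk_size` and `chunk_overlap` can be sketched as follows. This is an illustrative approximation of paragraph-based chunking under stated assumptions, not the Actor's actual algorithm: paragraphs are assumed to be separated by blank lines, oversized paragraphs are cut character-wise only (no sentence splitting), and the helper names `split_paragraphs` and `chunk_markdown` are hypothetical.

```python
from typing import Iterator


def split_paragraphs(text: str, chunk_size: int) -> Iterator[str]:
    """Yield paragraphs, hard-splitting any that exceed chunk_size characters."""
    for para in text.split("\n\n"):
        para = para.strip()
        for start in range(0, len(para), chunk_size):
            yield para[start:start + chunk_size]


def chunk_markdown(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    chunks: list[str] = []
    current = ""
    for piece in split_paragraphs(text, chunk_size):
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The current chunk is full: emit it, then start the next chunk with
        # the tail of the emitted one so that local context carries over.
        chunks.append(current)
        tail = current[-chunk_overlap:] if chunk_overlap else ""
        current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks


demo_text = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 40
demo_chunks = chunk_markdown(demo_text, chunk_size=100, chunk_overlap=10)
print([len(chunk) for chunk in demo_chunks])
```

In this sketch `chunk_size` is a target rather than a hard limit: a chunk that starts with an overlap tail can exceed it by up to `chunk_overlap` plus the separator.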

## `max_pages` (type: `integer`):

Limit the number of pages to scrape to prevent runaway runs and excessive compute usage.

## `include_images` (type: `boolean`):

Include image Markdown syntax (e.g. `![Alt text](url)`) in the output.

## `include_links` (type: `boolean`):

Include Markdown hyperlinks (e.g. `[Link Text](url)`) in the output.
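
The effect of the two toggles can be approximated with a pair of regular expressions. This is a hedged sketch rather than the Actor's implementation: it assumes that disabling `include_images` removes image tags entirely and that disabling `include_links` keeps the link text while dropping the URL. The function name `strip_markdown` is hypothetical, and nested constructs such as linked badges are not handled.

```python
import re

# ![alt](url): an image tag, removed as a whole.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# [text](url) not preceded by "!": a hyperlink, replaced by its text.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")


def strip_markdown(markdown: str, include_images: bool = True, include_links: bool = True) -> str:
    """Optionally remove image tags and unwrap hyperlinks in Markdown text."""
    if not include_images:
        markdown = IMAGE_PATTERN.sub("", markdown)
    if not include_links:
        markdown = LINK_PATTERN.sub(r"\1", markdown)
    return markdown


sample = "See ![logo](https://example.com/logo.png) and [the guide](https://example.com/guide)."
print(strip_markdown(sample, include_images=False, include_links=False))
```

Stripping URLs before embedding can reduce noise in the vectors, while keeping them is useful when retrieved chunks need to cite their sources.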

## Actor input object example

```json
{
  "start_urls": [
    {
      "url": "https://docusaurus.io/docs"
    }
  ],
  "framework": "auto",
  "enable_chunking": true,
  "chunk_size": 1500,
  "chunk_overlap": 200,
  "max_pages": 50,
  "include_images": true,
  "include_links": true
}
```
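
When building this input programmatically, it can help to check values locally before starting a paid run. The sketch below mirrors the constraints listed in the OpenAPI specification on this page (allowed `framework` values, `chunk_size` of at least 100, `chunk_overlap` of at least 0, `max_pages` of at least 1). The function name `build_run_input` is hypothetical, and the check for a non-empty URL list is an extra assumption, not something the schema states.

```python
ALLOWED_FRAMEWORKS = {"auto", "docusaurus", "gitbook", "sphinx", "readthedocs"}


def build_run_input(
    urls: list[str],
    framework: str = "auto",
    enable_chunking: bool = True,
    chunk_size: int = 1500,
    chunk_overlap: int = 200,
    max_pages: int = 50,
    include_images: bool = True,
    include_links: bool = True,
) -> dict:
    """Build an Actor input object, validating it against the documented limits."""
    if not urls:
        # Assumption: an empty list gives the crawler nothing to start from.
        raise ValueError("at least one start URL is needed")
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"framework must be one of {sorted(ALLOWED_FRAMEWORKS)}")
    if chunk_size < 100:
        raise ValueError("chunk_size must be at least 100")
    if chunk_overlap < 0:
        raise ValueError("chunk_overlap must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {
        "start_urls": [{"url": url} for url in urls],
        "framework": framework,
        "enable_chunking": enable_chunking,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "max_pages": max_pages,
        "include_images": include_images,
        "include_links": include_links,
    }


run_input = build_run_input(["https://docusaurus.io/docs"])
print(run_input)
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the API section.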

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "start_urls": [
        {
            "url": "https://docusaurus.io/docs"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("alaricus/rag-docs-markdown-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "start_urls": [{ "url": "https://docusaurus.io/docs" }] }

# Run the Actor and wait for it to finish
run = client.actor("alaricus/rag-docs-markdown-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "start_urls": [
    {
      "url": "https://docusaurus.io/docs"
    }
  ]
}' |
apify call alaricus/rag-docs-markdown-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=alaricus/rag-docs-markdown-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "RAG-Ready Documentation Scraper",
        "description": "Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.",
        "version": "1.0",
        "x-build-id": "VKUDXVTY3Zk2Y2hmN"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/alaricus~rag-docs-markdown-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-alaricus-rag-docs-markdown-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/alaricus~rag-docs-markdown-scraper/runs": {
            "post": {
                "operationId": "runs-sync-alaricus-rag-docs-markdown-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/alaricus~rag-docs-markdown-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-alaricus-rag-docs-markdown-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "start_urls"
                ],
                "properties": {
                    "start_urls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "List of URLs of the documentation pages or XML sitemaps to start crawling.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "framework": {
                        "title": "Documentation Framework",
                        "enum": [
                            "auto",
                            "docusaurus",
                            "gitbook",
                            "sphinx",
                            "readthedocs"
                        ],
                        "type": "string",
                        "description": "Select the documentation framework to scrape. 'auto' will attempt to auto-detect.",
                        "default": "auto"
                    },
                    "enable_chunking": {
                        "title": "Enable Semantic Chunking",
                        "type": "boolean",
                        "description": "Split the generated Markdown pages into smaller text chunks suitable for vector databases.",
                        "default": true
                    },
                    "chunk_size": {
                        "title": "Chunk Size (Characters)",
                        "minimum": 100,
                        "type": "integer",
                        "description": "Approximate character size for each semantic chunk.",
                        "default": 1500
                    },
                    "chunk_overlap": {
                        "title": "Chunk Overlap (Characters)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Overlap between consecutive chunks to maintain local context.",
                        "default": 200
                    },
                    "max_pages": {
                        "title": "Maximum Pages to Scrape",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Limit the number of pages to scrape to prevent runaway runs and excessive compute usage.",
                        "default": 50
                    },
                    "include_images": {
                        "title": "Include Image Links",
                        "type": "boolean",
                        "description": "Include image markdown syntax (e.g. ![Alt text](url)) in the output.",
                        "default": true
                    },
                    "include_links": {
                        "title": "Include Hyperlinks",
                        "type": "boolean",
                        "description": "Include markdown hyperlinks (e.g. [Link Text](url)) in the output.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
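
For projects that call the REST API directly, the `run-sync-get-dataset-items` operation from the specification above can be exercised with Python's standard library alone. This is a sketch under a few assumptions: the helper names `build_sync_request` and `run_actor_sync` are hypothetical, the 300-second timeout is an arbitrary choice, and the response body is assumed to be a JSON array of dataset items, which the specification itself does not describe.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "alaricus~rag-docs-markdown-scraper"


def build_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor_sync(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the response (assumed to be a JSON array)."""
    request = build_sync_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the API; only run_actor_sync does.
example_request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"start_urls": [{"url": "https://docusaurus.io/docs"}]},
)
print(example_request.get_method(), example_request.full_url)
```

Synchronous endpoints hold the connection open until the run finishes, so for large crawls the asynchronous `/runs` operation followed by a separate dataset fetch may be the safer choice.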
