# Docs-to-RAG Crawler (`automation-lab/docs-rag-crawler`) Actor

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) and output chunked markdown ready for RAG/vector DB ingestion. Splits by heading hierarchy, strips nav/sidebar chrome.

- **URL**: https://apify.com/automation-lab/docs-rag-crawler.md
- **Developed by:** [Stas Persiianenko](https://apify.com/automation-lab) (community)
- **Categories:** Developer tools, Other
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event: you are charged a fixed price for specific events rather than for Apify platform usage.
Since this Actor supports Apify Store discounts, the price decreases with higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools that run on the Apify platform and handle all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

This section helps developers integrate Actors into their projects, whatever their stack, with integrations that are safe, well-documented, and production-ready.
The recommended ways to integrate Actors are as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Docs-to-RAG Crawler

Crawl any documentation website and export **chunked markdown** optimized for RAG pipelines, vector databases, and AI knowledge bases. Supports [ReadTheDocs](https://readthedocs.org/), [GitBook](https://gitbook.com/), [Docusaurus](https://docusaurus.io/), [Mintlify](https://mintlify.com/), and any generic HTML documentation site.

No API key required. No browser overhead. Pure HTTP crawling for maximum speed.

### What does Docs-to-RAG Crawler do?

Docs-to-RAG Crawler visits every page of a documentation site, strips navigation/sidebar/footer chrome, extracts the core content, and splits it into **heading-bounded chunks** ready for vector DB ingestion.

Each chunk includes: a stable `chunkId` (hash of URL + heading), the heading hierarchy as a breadcrumb (e.g., `"Getting Started > Installation > Requirements"`), clean markdown content, word count, and metadata (site name, crawl timestamp, detected platform).
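The README doesn't document the hashing scheme beyond "hash of URL + heading", but a stable ID of the shape shown in the examples (heading slug plus a short hex digest) can be sketched as follows. This is a hypothetical reconstruction assuming SHA-1 over the URL and heading; the Actor's real scheme may differ in details.

```python
import hashlib
import re

def make_chunk_id(url: str, heading: str) -> str:
    """Derive a stable chunk ID: heading slug + short SHA-1 of URL + heading.

    Hypothetical reconstruction; the Actor's actual scheme may differ.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
    digest = hashlib.sha1(f"{url}#{heading}".encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}"

# Same inputs always yield the same ID, so re-crawls upsert cleanly
cid = make_chunk_id("https://docs.example.com/auth", "API Keys")
```

Because the ID depends only on URL and heading, re-crawling the same docs produces the same IDs, which is what makes duplicate-free upserts possible.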

**What makes it different from a generic web crawler:**
- 🧠 Understands documentation structure — splits on H2/H3 boundaries, not arbitrary character limits
- 🏗️ Detects docs platforms automatically (Docusaurus, ReadTheDocs, GitBook, Mintlify) and applies platform-specific noise removal
- 🔗 Follows sidebar/nav links intelligently — finds the full documentation tree, not just pages linked from the start URL
- ⚡ Pure HTTP (Cheerio) — 10-50x faster than browser-based crawlers, minimal memory usage
- 📦 Outputs stable chunk IDs — safe to re-crawl and upsert into vector DBs without duplicates

### Who is Docs-to-RAG Crawler for?

**AI engineers building RAG applications**
- Index your product's own documentation into a vector DB for customer-facing chatbots
- Build a "chat with any docs" feature by crawling third-party docs at startup
- Keep your knowledge base fresh by scheduling weekly re-crawls

**Developers building internal tools**
- Turn scattered internal wikis into a searchable knowledge base
- Feed company documentation into LLM-powered search
- Build code assistants that understand your specific framework's docs

**Data teams and researchers**
- Benchmark embedding models across real documentation corpora
- Build multi-doc retrieval datasets for fine-tuning
- Compare documentation quality across competing libraries

**DevOps and platform teams**
- Automate documentation ingestion into Confluence, Notion, or Slack bots
- Trigger re-ingestion automatically on new releases via webhooks
- Monitor documentation coverage gaps by tracking chunk counts per section

### Why use Docs-to-RAG Crawler?

- ✅ **Works with modern docs sites** — Docusaurus v2/v3, ReadTheDocs, GitBook, Mintlify, plain HTML
- ✅ **Heading-aware chunking** — splits at semantic H2/H3 boundaries, not mid-paragraph
- ✅ **Stable chunk IDs** — SHA-1 based on URL + heading, safe for upserts on re-crawl
- ✅ **Platform-specific noise removal** — strips breadcrumbs, version badges, "On this page" widgets, edit buttons
- ✅ **Code block control** — keep or strip code blocks depending on your embedding model's needs
- ✅ **Fast** — pure HTTP, no browser, processes 400+ pages/minute
- ✅ **No proxy required** — documentation sites are public and rarely block scrapers
- ✅ **Exclude patterns** — skip changelog, blog, or release notes pages with glob patterns
- ✅ **Scheduling support** — set up weekly re-crawls to keep your knowledge base fresh

### What data can you extract?

Each output chunk contains:

| Field | Type | Example |
|-------|------|---------|
| `chunkId` | string | `"installation-requirements-a3f8b1c2"` |
| `title` | string | `"Installation Requirements"` |
| `url` | string | `"https://docs.example.com/getting-started/install"` |
| `headingHierarchy` | string | `"Getting Started > Installation > Requirements"` |
| `content` | string | Full markdown content of this chunk |
| `wordCount` | number | `142` |
| `metadata.siteName` | string | `"My Project Docs"` |
| `metadata.scrapedAt` | string | `"2026-04-05T10:00:00.000Z"` |
| `metadata.platform` | string | `"docusaurus"` |

**Example output chunk:**

```json
{
    "chunkId": "authentication-api-keys-d4e9f1a3",
    "title": "API Keys",
    "url": "https://docs.example.com/authentication",
    "headingHierarchy": "Authentication > API Keys",
    "content": "### API Keys\n\nTo authenticate with the API, pass your API key in the `Authorization` header:\n\n```\ncurl -H 'Authorization: Bearer YOUR_KEY' https://api.example.com/data\n```\n\nAPI keys are scoped to your account and can be revoked at any time from your dashboard.",
    "wordCount": 48,
    "metadata": {
        "siteName": "Example Docs",
        "scrapedAt": "2026-04-05T10:00:00.000Z",
        "platform": "docusaurus"
    }
}
```

### How much does it cost to crawl a documentation site?

Docs-to-RAG Crawler uses **pay-per-event (PPE) pricing**: you pay per page crawled, not per run. There is a small one-time start fee plus a per-page fee.

| Plan | Start fee | Per page fee | Example: 100 pages |
|------|-----------|--------------|-------------------|
| FREE | $0.010 | $0.003 | ~$0.31 |
| BRONZE | $0.0095 | $0.0027 | ~$0.28 |
| SILVER | $0.0085 | $0.0024 | ~$0.25 |
| GOLD | $0.0075 | $0.00195 | ~$0.20 |
| PLATINUM | $0.006 | $0.0015 | ~$0.155 |
| DIAMOND | $0.005 | $0.0012 | ~$0.125 |

**Real-world examples:**

- Small library docs (50 pages) ≈ $0.16 on FREE plan
- Medium framework docs (200 pages) ≈ $0.61 on FREE plan
- Large documentation site (500 pages) ≈ $1.51 on FREE plan
- Enterprise docs with 1000+ pages ≈ $3.01 on FREE plan

**Free plan credit:** New Apify accounts get $5 in free credits — enough to crawl **~1,600 pages** at FREE tier pricing.
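The figures above follow directly from the PPE formula (start fee plus pages times per-page fee). A quick sanity-check helper, with defaults taken from the FREE-tier row of the table:

```python
def estimate_cost_usd(pages: int, start_fee: float = 0.010,
                      per_page: float = 0.003) -> float:
    """Estimated run cost in USD: one start event plus one event per page.

    Defaults are the FREE-tier fees from the pricing table above.
    """
    return round(start_fee + pages * per_page, 4)

# 100 pages on FREE: 0.010 + 100 * 0.003 = $0.31
```

Pass the per-plan fees from the table to estimate costs on other subscription tiers.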

### How to crawl documentation for RAG

1. Go to [Docs-to-RAG Crawler](https://apify.com/automation-lab/docs-rag-crawler) on Apify Store
2. Click **Try for free**
3. In the **Documentation URL** field, enter the root URL of the docs you want to crawl (e.g., `https://docs.example.com`)
4. Set **Max pages** (start with 20-50 to preview results, then increase)
5. Choose **Chunk mode**: `heading` (recommended) splits at H2/H3 boundaries; `page` outputs one chunk per full page
6. Click **Start** and wait for the crawl to finish
7. Click **Export** → JSON or CSV to download your chunks
8. Load the JSON into your vector database (Pinecone, Weaviate, Chroma, Qdrant, etc.)

**Example inputs for common scenarios:**

Crawl a Docusaurus site, skip blog and changelog:

```json
{
    "startUrl": "https://docusaurus.io/docs",
    "maxPages": 200,
    "chunkMode": "heading",
    "maxChunkWords": 300,
    "excludePatterns": ["*/blog/*", "*/changelog/*"]
}
```

Crawl a ReadTheDocs project, text-only (no code):

```json
{
    "startUrl": "https://docs.python-requests.org/en/latest/",
    "maxPages": 50,
    "chunkMode": "heading",
    "includeCodeBlocks": false
}
```

Crawl a GitBook space, one chunk per page:

```json
{
    "startUrl": "https://docs.myapp.gitbook.io/",
    "maxPages": 100,
    "chunkMode": "page",
    "maxChunkWords": 1000
}
```

### Input parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrl` | string | — | Root URL of the documentation site to crawl (required) |
| `maxPages` | integer | 100 | Maximum number of pages to crawl |
| `includeCodeBlocks` | boolean | true | Whether to include code blocks in chunks |
| `chunkMode` | string | `"heading"` | `"heading"` splits at H2/H3; `"page"` outputs one chunk per full page |
| `maxChunkWords` | integer | 300 | Maximum words per chunk; oversized sections are split at paragraph boundaries |
| `linkSelector` | string | — | Custom CSS selector for navigation links (leave empty for auto-detection) |
| `excludePatterns` | array | \[] | URL glob patterns to skip (e.g., `"*/blog/*"`, `"*/release-notes/*"`) |
| `waitForSelector` | string | — | Reserved for future JS-rendering mode |
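How `maxChunkWords` splits oversized sections is not spelled out beyond "at paragraph boundaries"; a plausible sketch is a greedy packer over blank-line-separated paragraphs. This is an illustration of the idea, not the Actor's actual implementation:

```python
def split_at_paragraphs(text: str, max_words: int = 300) -> list[str]:
    """Greedily pack paragraphs (blank-line separated) into chunks of at
    most max_words; a single oversized paragraph becomes its own chunk.

    Illustrative sketch; the Actor's real splitter may differ.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p for p in text.split("\n\n") if p.strip()):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))  # flush the full chunk
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A greedy split at paragraph boundaries keeps each chunk semantically intact, which is why the Actor avoids mid-paragraph cuts.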

### Output examples

**Heading-mode output (chunkMode: "heading"):**

```json
[
    {
        "chunkId": "quick-start-5fd5eded",
        "title": "Quick Start",
        "url": "https://crawlee.dev/docs/quick-start",
        "headingHierarchy": "Quick Start",
        "content": "## Quick Start\n\nWith this short tutorial you can start scraping with Crawlee in a minute or two.",
        "wordCount": 22,
        "metadata": {
            "siteName": "Crawlee",
            "scrapedAt": "2026-04-05T10:00:00.000Z",
            "platform": "docusaurus"
        }
    },
    {
        "chunkId": "cheerio-crawler-b1a2c3d4",
        "title": "CheerioCrawler",
        "url": "https://crawlee.dev/docs/quick-start",
        "headingHierarchy": "Quick Start > Choose your crawler > CheerioCrawler",
        "content": "#### CheerioCrawler\n\nCheerioCrawler downloads each URL using a plain HTTP request and parses the HTML with Cheerio...",
        "wordCount": 64,
        "metadata": {
            "siteName": "Crawlee",
            "scrapedAt": "2026-04-05T10:00:00.000Z",
            "platform": "docusaurus"
        }
    }
]
```

### Tips for best results

- 🎯 **Start small** — set `maxPages: 20` first to preview chunk quality, then increase
- 🔗 **Use the docs root** — set `startUrl` to the documentation root (e.g., `/docs` or `/docs/intro`), not a specific page
- ✂️ **Tune chunk size** — 200-400 words per chunk works well for most embedding models (~270-530 tokens). Smaller chunks = more precise retrieval but more DB entries
- 🚫 **Exclude noise pages** — use `excludePatterns` to skip changelog, blog, API reference (auto-generated), and release notes pages
- 💻 **Code blocks** — keep `includeCodeBlocks: true` for code-heavy docs (frameworks, SDKs). Set to `false` for prose-heavy docs (tutorials, guides) where code snippets reduce embedding quality
- 🔄 **Schedule re-crawls** — set up a weekly cron to keep your knowledge base fresh; stable `chunkId`s let you upsert without duplicates
- 📊 **Page mode for long-form content** — use `chunkMode: "page"` for docs with very long pages and few headings (like API references with a single long page)
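The word-to-token figures in the chunk-size tip imply roughly 1.33 tokens per English word. A tiny helper for translating `maxChunkWords` into an approximate token budget (a heuristic only; real tokenizer counts vary by model and content):

```python
def approx_tokens(words: int, tokens_per_word: float = 1.33) -> int:
    """Rough token estimate for English prose, matching the ~270-530 token
    range quoted above for 200-400 word chunks. Heuristic, not a tokenizer.
    """
    return round(words * tokens_per_word)
```

Code-heavy chunks tokenize less efficiently than prose, so treat the estimate as a lower bound when `includeCodeBlocks` is on.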

### Integrations

**Docs-to-RAG Crawler → Pinecone (auto-indexed knowledge base)**

- Run actor on a schedule (weekly) → export dataset as JSON → load into Pinecone with chunk `content` as text and `chunkId` as vector ID → upsert without worrying about duplicates

**Docs-to-RAG Crawler → OpenAI Embeddings → Weaviate**

- Export chunks as JSON → batch-process through `text-embedding-3-small` → store vectors with `headingHierarchy` and `url` as metadata for rich retrieval

**Docs-to-RAG Crawler → Make/Zapier → Slack bot**

- Trigger on new dataset items → embed each chunk → surface answers in a Slack chatbot that cites source URLs

**Docs-to-RAG Crawler → Google Sheets**

- Use Apify's Google Sheets integration to export all chunks to a spreadsheet — useful for reviewing chunk quality before ingesting into a vector DB

**Docs-to-RAG Crawler → Webhook → LlamaIndex pipeline**

- Set up a webhook to trigger your LlamaIndex indexing pipeline immediately when the crawl completes — zero manual steps for fresh documentation indexing

**Scheduled re-indexing workflow**

- Create a daily/weekly schedule on Apify → actor outputs only pages that exist in the current crawl → diff against your vector DB to add new, update changed, and remove deleted chunks using the stable `chunkId`
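Because `chunkId` hashes URL + heading rather than content, the diff step above reduces to set operations: new IDs are inserted, surviving IDs are re-upserted to pick up content edits, and vanished IDs are deleted. A sketch:

```python
def plan_sync(crawled: set[str], indexed: set[str]) -> dict[str, set[str]]:
    """Diff the current crawl against the vector DB by chunkId.

    IDs encode URL + heading, not content, so existing IDs must be
    re-upserted to catch content edits; IDs missing from the crawl are
    chunks whose page or heading was removed.
    """
    return {
        "new": crawled - indexed,       # insert
        "refresh": crawled & indexed,   # upsert to pick up edits
        "delete": indexed - crawled,    # remove stale chunks
    }
```

Running this plan after each scheduled crawl keeps the vector DB an exact mirror of the live documentation.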

### Using the Apify API

#### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/docs-rag-crawler').call({
    startUrl: 'https://docs.example.com',
    maxPages: 100,
    chunkMode: 'heading',
    maxChunkWords: 300,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Collected ${items.length} chunks`);
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_APIFY_TOKEN')

run = client.actor('automation-lab/docs-rag-crawler').call(run_input={
    'startUrl': 'https://docs.example.com',
    'maxPages': 100,
    'chunkMode': 'heading',
    'maxChunkWords': 300,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(items)} chunks')
```

#### cURL

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~docs-rag-crawler/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "maxPages": 100,
    "chunkMode": "heading",
    "maxChunkWords": 300
  }'
```

### Use with AI agents via MCP

Docs-to-RAG Crawler is available as a tool for AI assistants that support the [Model Context Protocol (MCP)](https://docs.apify.com/platform/integrations/mcp).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

#### Setup for Claude Code

```bash
claude mcp add --transport http apify "https://mcp.apify.com?token=YOUR_APIFY_TOKEN"
```

#### Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
    "mcpServers": {
        "apify": {
            "type": "http",
            "url": "https://mcp.apify.com?token=YOUR_APIFY_TOKEN&tools=automation-lab/docs-rag-crawler"
        }
    }
}
```

#### Example prompts

Once connected, you can ask your AI assistant:

- *"Crawl https://docs.fastapi.tiangolo.com and output 300-word heading chunks for RAG ingestion"*
- *"Index the Stripe API docs at https://stripe.com/docs, skip the changelog pages, max 500 pages"*
- *"Crawl https://docs.langchain.com, text only (no code blocks), and export as chunks for embedding"*

### Is it legal to crawl documentation sites?

Docs-to-RAG Crawler is designed for **ethical use of publicly available documentation**.

Most documentation sites explicitly encourage crawling and indexing — that's the point of public documentation. However, always check the site's `robots.txt` and Terms of Service before crawling at scale.

**Best practices:**

- Only crawl documentation you have permission to use
- Respect `robots.txt` crawl delays and disallow rules
- Don't crawl private or authenticated documentation without authorization
- GDPR: documentation sites rarely contain personal data, but be mindful if internal wikis do

This actor crawls public pages using standard HTTP requests, the same as a web browser. It does not bypass authentication, CAPTCHA systems, or access controls.

### FAQ

**How many chunks does a typical documentation site produce?**

A small library docs site (50 pages) typically produces 200-800 chunks. A large framework like React or Next.js (300+ pages) can produce 2,000-5,000 chunks with heading-mode chunking.

**How long does a crawl take?**

Very fast — pure HTTP, no JavaScript rendering needed. A 100-page docs site typically completes in 20-30 seconds. A 500-page site in under 2 minutes.

**What is the recommended chunk size for RAG?**

200-400 words per chunk (~270-530 tokens for GPT-4 tokenizer) is the sweet spot for most retrieval scenarios. Smaller chunks improve precision but require more DB storage and API calls. Larger chunks preserve more context but may dilute retrieval accuracy.

**Why are some chunks very short or empty?**

Short chunks usually come from heading-only sections or navigation artifacts. The actor automatically skips chunks with fewer than 5 words. Very short but non-empty chunks (10-50 words) are kept because they may contain important introductory text.

**Why does the crawler only find some pages and not others?**

The crawler discovers pages by following links in the navigation sidebar and content area. If your docs site uses JavaScript to load the sidebar dynamically, Cheerio mode won't see those links. In that case, try setting `linkSelector` to a CSS selector that targets the nav links in the raw HTML (check the source with Ctrl+U in your browser). GitBook sites that render entirely via JS may need a future Playwright-mode upgrade.
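To check whether sidebar links are present in the server-rendered HTML (and therefore visible to Cheerio), you can scan the raw source for anchor tags; a stdlib sketch using `html.parser`:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect <a href> values from raw (server-rendered) HTML.

    If this finds no nav links in a page's source, the sidebar is likely
    rendered by JavaScript and a plain-HTTP crawler won't see it either.
    """
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

# Feed raw HTML, e.g. urllib.request.urlopen(url).read().decode()
collector = LinkCollector()
collector.feed('<nav><a href="/docs/intro">Intro</a><a href="/docs/api">API</a></nav>')
```

If the collected links match what you see in the rendered sidebar, auto-detection should work; if not, the site is JS-rendered and `linkSelector` alone won't help.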

**The chunks contain "Version: 3.x" or "On this page" text. How do I remove it?**

These artifacts are stripped automatically by the actor's Docusaurus-specific noise filters. If you see them, please open an issue with the URL and we'll add a filter for that platform.

**The actor crawled pages from a different subdomain — why?**

The crawler only follows links to the same hostname as the `startUrl`. If you see pages from a different subdomain, it means the docs site has a subdomain structure (e.g., `docs.example.com` linking to `api.example.com`). Use `excludePatterns` to filter those out, or open a feature request for subdomain scoping.
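The scoping described above (same hostname as `startUrl`, minus any glob excludes) can be reproduced locally to predict what a run will crawl; a sketch with `fnmatch`:

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def should_crawl(url: str, start_url: str, exclude_patterns: list[str]) -> bool:
    """Sketch of the documented scoping rules: same hostname as startUrl,
    and no match against any glob-style exclude pattern."""
    if urlparse(url).hostname != urlparse(start_url).hostname:
        return False  # different subdomain or external site
    return not any(fnmatchcase(url, pattern) for pattern in exclude_patterns)
```

Since `*` in glob patterns matches `/` here, `"*/blog/*"` excludes blog pages at any depth.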

### Other documentation and AI tools

Explore more automation-lab actors for AI and data workflows:

- [Web Scraper](https://apify.com/automation-lab) — general purpose web scraping
- [JSON Schema Generator](https://apify.com/automation-lab/json-schema-generator) — auto-generate JSON schemas from sample data
- [Color Contrast Checker](https://apify.com/automation-lab/color-contrast-checker) — WCAG 2.1 AA/AAA accessibility validation

# Actor input Schema

## `startUrl` (type: `string`):

Enter the root URL of the documentation site to crawl (e.g. https://docs.example.com). The crawler will follow internal links and collect all documentation pages.

## `maxPages` (type: `integer`):

Set the maximum number of documentation pages to crawl. Lower values = faster, cheaper runs. Start small to preview results.

## `includeCodeBlocks` (type: `boolean`):

Keep fenced code blocks in output chunks. Disable to produce text-only chunks for embedding models that perform poorly on code.

## `chunkMode` (type: `string`):

Choose how pages are split into chunks. 'heading' splits on H2/H3 boundaries (recommended for most docs). 'page' outputs one chunk per full page.

## `maxChunkWords` (type: `integer`):

Maximum number of words per chunk. Chunks longer than this are split further at paragraph boundaries. ~250 words ≈ 330 tokens (GPT-4 tokenizer).

## `linkSelector` (type: `string`):

CSS selector to find navigation links. Leave empty for auto-detection (recommended). Override only if the site uses unusual nav structure.

## `excludePatterns` (type: `array`):

List URL patterns to skip (glob-style). For example: '*/changelog/*', '*/blog/*', '*/release-notes/*'.

## `waitForSelector` (type: `string`):

If the docs site needs JavaScript rendering, enter a CSS selector to wait for (e.g. '.markdown-body'). Leave empty for pure HTML sites (uses faster Cheerio mode).

## Actor input object example

```json
{
  "startUrl": "https://crawlee.dev/docs/quick-start",
  "maxPages": 20,
  "includeCodeBlocks": true,
  "chunkMode": "heading",
  "maxChunkWords": 300,
  "linkSelector": "",
  "excludePatterns": [],
  "waitForSelector": ""
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://crawlee.dev/docs/quick-start",
    "maxPages": 20,
    "maxChunkWords": 300,
    "linkSelector": "",
    "excludePatterns": []
};

// Run the Actor and wait for it to finish
const run = await client.actor("automation-lab/docs-rag-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrl": "https://crawlee.dev/docs/quick-start",
    "maxPages": 20,
    "maxChunkWords": 300,
    "linkSelector": "",
    "excludePatterns": [],
}

# Run the Actor and wait for it to finish
run = client.actor("automation-lab/docs-rag-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://crawlee.dev/docs/quick-start",
  "maxPages": 20,
  "maxChunkWords": 300,
  "linkSelector": "",
  "excludePatterns": []
}' |
apify call automation-lab/docs-rag-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=automation-lab/docs-rag-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Docs-to-RAG Crawler",
        "description": "Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) and output chunked markdown ready for RAG/vector DB ingestion. Splits by heading hierarchy, strips nav/sidebar chrome.",
        "version": "0.1",
        "x-build-id": "4fs5EVZ619zSXfyp2"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/automation-lab~docs-rag-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-automation-lab-docs-rag-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/automation-lab~docs-rag-crawler/runs": {
            "post": {
                "operationId": "runs-sync-automation-lab-docs-rag-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/automation-lab~docs-rag-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-automation-lab-docs-rag-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "📄 Documentation URL",
                        "pattern": "^https?://",
                        "type": "string",
                        "description": "Enter the root URL of the documentation site to crawl (e.g. https://docs.example.com). The crawler will follow internal links and collect all documentation pages."
                    },
                    "maxPages": {
                        "title": "📦 Max pages",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Set the maximum number of documentation pages to crawl. Lower values = faster, cheaper runs. Start small to preview results.",
                        "default": 100
                    },
                    "includeCodeBlocks": {
                        "title": "💻 Include code blocks",
                        "type": "boolean",
                        "description": "Keep code blocks (```...```) in output chunks. Disable to produce text-only chunks for embedding models that perform poorly on code.",
                        "default": true
                    },
                    "chunkMode": {
                        "title": "✂️ Chunk mode",
                        "enum": [
                            "heading",
                            "page"
                        ],
                        "type": "string",
                        "description": "Choose how pages are split into chunks. 'heading' splits on H2/H3 boundaries (recommended for most docs). 'page' outputs one chunk per full page.",
                        "default": "heading"
                    },
                    "maxChunkWords": {
                        "title": "📏 Max chunk size (words)",
                        "minimum": 50,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Maximum number of words per chunk. Chunks longer than this are split further at paragraph boundaries. ~250 words ≈ 330 tokens (GPT-4 tokenizer).",
                        "default": 300
                    },
                    "linkSelector": {
                        "title": "🔗 Link selector (CSS)",
                        "type": "string",
                        "description": "CSS selector to find navigation links. Leave empty for auto-detection (recommended). Override only if the site uses unusual nav structure.",
                        "default": ""
                    },
                    "excludePatterns": {
                        "title": "🚫 Exclude URL patterns",
                        "type": "array",
                        "description": "List URL patterns to skip (glob-style). For example: '*/changelog/*', '*/blog/*', '*/release-notes/*'.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "waitForSelector": {
                        "title": "⏳ Wait for selector (JS sites)",
                        "type": "string",
                        "description": "If the docs site needs JavaScript rendering, enter a CSS selector to wait for (e.g. '.markdown-body'). Leave empty for pure HTML sites (uses faster Cheerio mode).",
                        "default": ""
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
````
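To call this Actor from Python, build a run input that matches the schema above and submit it with the official [Python client](https://docs.apify.com/api/client/python.md). This is a minimal sketch: the `startUrls` field is an assumption (it is not shown in this schema excerpt), while the remaining keys and their constraints come directly from the input schema; the dataset item fields are not specified here, so the commented loop just prints each item.

```python
# Example run input for automation-lab/docs-rag-crawler.
run_input = {
    "startUrls": [{"url": "https://docs.example.com"}],  # assumed field, not in the excerpt above
    "chunkMode": "heading",          # enum: "heading" | "page"
    "maxChunkWords": 300,            # integer, 50..2000
    "includeCodeBlocks": True,       # keep ```...``` blocks in chunks
    "excludePatterns": ["*/changelog/*", "*/blog/*"],
    "linkSelector": "",              # empty = auto-detect nav links
    "waitForSelector": "",           # empty = plain-HTML (Cheerio) mode
}

# Sanity-check against the documented schema constraints before submitting.
assert run_input["chunkMode"] in ("heading", "page")
assert 50 <= run_input["maxChunkWords"] <= 2000

# To start the Actor and read results with the official Python client:
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_API_TOKEN>")
# run = client.actor("automation-lab/docs-rag-crawler").call(run_input=run_input)
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item)
```

Validating the input locally is cheap insurance: the platform rejects out-of-range values anyway, but catching them before the API call saves a round trip.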

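The `chunkMode` and `maxChunkWords` options describe a two-stage split: break on H2/H3 heading boundaries, then cap any oversized section at paragraph (blank-line) boundaries. The sketch below illustrates that behavior in Python; it is not the Actor's actual implementation, just a model of the documented splitting rules.

```python
import re

def chunk_markdown(md: str, max_words: int = 300) -> list[str]:
    """Split markdown on H2/H3 headings, then cap each chunk at
    max_words by splitting further at paragraph boundaries."""
    # Split immediately before every line starting with "## " or "### ".
    sections = re.split(r"(?m)^(?=#{2,3} )", md)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section.split()) <= max_words:
            chunks.append(section)
            continue
        # Section exceeds the cap: re-accumulate whole paragraphs
        # until adding the next one would cross max_words.
        current, count = [], 0
        for para in section.split("\n\n"):
            words = len(para.split())
            if current and count + words > max_words:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than `max_words` cannot be split at a paragraph boundary, so in this model it passes through whole; how the real Actor handles that edge case is not documented in the schema.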