# RAG-Ready Markdown Converter & Chunker (`foxpink/apify-rag-markdown-chunker`) Actor

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

- **URL**: https://apify.com/foxpink/apify-rag-markdown-chunker.md
- **Developed by:** [Nguyễn Anh Duy](https://apify.com/foxpink) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 3 total users, 0 monthly users, 89.5% runs succeeded
- **User rating**: 4.67 out of 5 stars

## Pricing

from $0.01 / 1,000 results

This Actor is paid per event and usage. You are charged a fixed price for specific events, plus Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## RAG-Ready Markdown Converter & Chunker

[![Run on Apify](https://img.shields.io/badge/Run%20on%20Apify-FF7754?style=for-the-badge&logo=apify)](https://console.apify.com/actors/foxpink/apify-rag-markdown-chunker)
[![Apify Marketplace](https://img.shields.io/badge/Marketplace-FF7754?style=for-the-badge&logo=apify)](https://apify.com/foxpink/apify-rag-markdown-chunker)
[![GitHub Repo](https://img.shields.io/badge/Source%20Code-181717?style=for-the-badge&logo=github)](https://github.com/FoxPink/apify-rag-markdown-chunker)
[![Version](https://img.shields.io/badge/v1.4-blue?style=for-the-badge)]()

> **Standard mode**: HTML → clean Markdown → character-based chunks with qualityScore, contentHash, and codeBlocks.  
> **Enterprise mode** (no extra charge): token-aware chunking + OpenAI Embedding + Pinecone / Qdrant auto-upsert.  
> **NEW: PDF/DOCX parsing** via fileUrls input (binary, zero-DOM).  
> **Same $0.01/1k price.**

---

### Quick Comparison

| Feature | Standard | Enterprise |
|---------|----------|------------|
| HTML → Clean Markdown | ✅ | ✅ |
| URL Fetching (auto-download HTML) | ✅ | ✅ |
| Character-based chunking | ✅ | — |
| Semantic chunking (heading-aware) | ✅ | ✅ |
| Token-aware chunking (cl100k_base) | — | ✅ |
| Natural boundary detection | ✅ | ✅ |
| Configurable overlap | ✅ | ✅ |
| Embeddings via text-embedding-3-small | — | ✅ |
| Pinecone auto-upsert | — | ✅ |
| Qdrant auto-upsert | — | ✅ |
| Bulk processing | ✅ | ✅ |
| JSONL export (LLM-ready) | ✅ | ✅ |
| Zero DOM / no browser | ✅ | ✅ |
| **Price** | $0.01/1k | $0.01/1k |

---

### Why this exists

Hundreds of Apify crawlers output raw HTML full of nav bars, footers, scripts, and ads. Feeding that into a Vector DB or LLM wastes tokens and pollutes embeddings. This Actor takes any **already-crawled** content and delivers production-ready chunks — with or without a Vector DB pipeline.

---

### Features

- **HTML → Clean Markdown** — strips scripts, styles, nav, footer, iframes, SVG, canvas, and comment garbage; converts headers, lists, tables, blockquotes, links, images, bold, italic, code into proper Markdown syntax.
- **Smart Chunking** — splits by natural boundaries (paragraph breaks, headers) with configurable overlap to preserve context; avoids cutting words mid-stream.
- **Token-Aware Chunking** (Enterprise) — uses `js-tiktoken` (cl100k_base) to split by actual LLM tokens instead of characters. Compatible with GPT-4, GPT-3.5, text-embedding-3-small.
- **Pinecone + Qdrant Auto-Upsert** (Enterprise) — generates embeddings via OpenAI `text-embedding-3-small` and upserts vectors directly to your Pinecone index **or** Qdrant collection. No glue code needed. Auto-detects which vector DB to use from your input.
- **Bulk Processing** — accepts an array of HTML documents and processes each independently with per-record chunk settings.
- **URL Fetching** — provide an array of URLs; the Actor automatically fetches and processes each one through the full pipeline.
- **Semantic Chunking** (updated in v1.4) — heading-aware chunking that respects document structure. Splits on `#` headings, keeps related content together, preserves heading context in chunk metadata. Auto-selected for content >5000 characters.
- **JSONL Export** — download chunks as JSONL (one JSON object per line) for direct LLM fine-tuning, embedding batch jobs, or LangChain/LlamaIndex ingestion.
- **Zero DOM dependency** — pure string processing; runs on any Node.js platform without a browser or headless client.
- **MCP / AI Agent Ready** — callable via API; JSON output integrates directly with LangChain, LlamaIndex, Haystack, or custom RAG pipelines.
- **Backward Compatible** — Enterprise mode activates only when you provide API keys. Standard mode works exactly as before.
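
To make the "Smart Chunking" idea concrete, here is a minimal Python sketch of boundary-aware chunking with overlap. It is an illustration only, not the Actor's implementation (the Actor runs on Node.js). The function name `chunk_text` and the boundary preference (paragraph break, then line break, then space) are assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical sketch: prefers to end a chunk at a paragraph break,
    then a line break, then a space, so a word is not cut when a
    boundary is available inside the window.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in ("\n\n", "\n", " "):
                cut = window.rfind(sep)
                # Accept a boundary only if the next chunk still moves forward.
                if cut > overlap:
                    end = start + cut
                    break
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        # Step back by `overlap` characters so consecutive chunks share context.
        start = end - overlap
    return chunks
```

In this simplified version only the end of each chunk is snapped to a boundary; the overlapping start of the next chunk can still land inside a word.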

---

### How Enterprise Mode Works

```
HTML → Clean Markdown → Token-aware chunking (cl100k_base) → OpenAI Embedding → Pinecone or Qdrant upsert
```

Provide OpenAI API key + (Pinecone keys **or** Qdrant config) → the Actor auto-detects and runs the full pipeline. No configuration, no middleware, no extra services. You can even push to **both** Pinecone and Qdrant simultaneously by providing all keys.

---

### Use Cases

| Who | Why |
|-----|-----|
| RAG Pipeline Builders | Convert scraped pages → chunks → embeddings → Vector DB |
| LLM Fine-tuning | Clean training data by removing structural HTML garbage |
| AI Agents | Feed clean Markdown context to tool-calling LLMs |
| Content Analysts | Extract structured text from raw website dumps |

---

### Input

#### Standard Mode

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `htmlContent` | string | — | Raw HTML or text content to process |
| `urls` | array | `[]` | URLs to auto-fetch and process (alternative to htmlContent) |
| `chunkingStrategy` | string | `auto` | `auto`, `character`, or `semantic`. Semantic respects heading boundaries |
| `chunkSize` | integer | `1000` | Target chunk length in characters or tokens |
| `chunkOverlap` | integer | `200` | Overlap between consecutive chunks, in characters (tokens in Enterprise mode) |
| `mode` | string | `both` | Output mode: `both`, `markdown`, or `chunks` |
| `inputRecords` | array | `[]` | Bulk input `[{ id, html, chunkSize?, chunkOverlap? }]` |
| `deduplicate` | boolean | `false` | Skip records with identical content hash |
| `minQualityScore` | integer | `0` | Minimum quality score (0–100); skip low-quality content |
| `embeddingModel` | string | `text-embedding-3-small` | OpenAI embedding model for vector DB pipeline |
| `batchSize` | integer | `20` | Number of chunks to embed per API call (Enterprise mode) |
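
The `deduplicate` option skips records whose content hash has already been seen. The sketch below shows the general idea in Python. The hash function (SHA-256 over whitespace-normalized text) and the helper names `content_hash` and `deduplicate` are assumptions for illustration; the Actor's actual hashing scheme is not documented here.

```python
import hashlib


def content_hash(text: str) -> str:
    # Assumption: SHA-256 over whitespace-normalized text.
    # The Actor's real contentHash may be computed differently.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = content_hash(record["html"])
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```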

#### Enterprise Mode

Provide **OpenAI key + either Pinecone or Qdrant config** to activate the full pipeline.

##### Pinecone Pipeline

| Field | Type | Description |
|-------|------|-------------|
| `openaiApiKey` | string (secret) | Your OpenAI API key for token chunking + embeddings |
| `pineconeApiKey` | string (secret) | Your Pinecone API key |
| `pineconeIndex` | string | Your Pinecone index name (must be 1536-dimension) |

##### Qdrant Pipeline

| Field | Type | Description |
|-------|------|-------------|
| `openaiApiKey` | string (secret) | Same key — shared across all Enterprise features |
| `qdrantUrl` | string | Your Qdrant instance URL (e.g. `https://xxx.us-east-1-0.aws.cloud.qdrant.io`) |
| `qdrantApiKey` | string (secret) | Your Qdrant API key |
| `qdrantCollection` | string | Your Qdrant collection name (1536-dimension vectors) |

When Enterprise fields are detected, the Actor automatically:

1. Switches from character-based to **token-aware chunking** (cl100k_base)
2. Generates **1536-dimension embeddings** via `text-embedding-3-small`
3. **Upserts vectors** to Pinecone and/or Qdrant with metadata (chunkIndex, tokenCount, source text)

The Actor auto-detects which vector DB to use:
- Pinecone → requires `openaiApiKey + pineconeApiKey + pineconeIndex`
- Qdrant → requires `openaiApiKey + qdrantUrl + qdrantApiKey + qdrantCollection`
- Both → provide all keys; runs both pipelines simultaneously
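
The detection rules above can be checked client-side before starting a run. This is a small Python sketch, assuming detection depends only on whether each listed field is present and non-empty; `detect_pipelines` is a hypothetical helper, not part of the Actor or the Apify client.

```python
def detect_pipelines(actor_input: dict) -> list[str]:
    """Return the vector DB pipelines a given input would activate."""
    required = {
        "pinecone": ("openaiApiKey", "pineconeApiKey", "pineconeIndex"),
        "qdrant": ("openaiApiKey", "qdrantUrl", "qdrantApiKey", "qdrantCollection"),
    }
    return [
        name
        for name, fields in required.items()
        # A pipeline counts as configured only when every field is set.
        if all(actor_input.get(field) for field in fields)
    ]
```

An empty result means the input would run in Standard mode.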

##### Where to get the keys

| Key | How to get |
|-----|------------|
| `openaiApiKey` | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) — create a new secret key |
| `pineconeApiKey` | [app.pinecone.io](https://app.pinecone.io) → API Keys → Copy |
| `pineconeIndex` | Create a **serverless index** with dimension **1536** (matching `text-embedding-3-small`) |
| `qdrantUrl` | [cloud.qdrant.io](https://cloud.qdrant.io) → Clusters → REST API Endpoint |
| `qdrantApiKey` | [cloud.qdrant.io](https://cloud.qdrant.io) → Clusters → API Key |
| `qdrantCollection` | Create a collection with dimension **1536** and `Cosine` distance |

---

### Output

Each processed record returns:

| Field | Type | Description |
|-------|------|-------------|
| `recordId` | string | ID of the processed record |
| `status` | string | `ok` or `empty` |
| `rawMarkdown` | string | Cleaned Markdown (if mode includes `markdown` or `both`) |
| `chunks` | array | Array of `{ chunkIndex, content, characterCount, tokenCount?, headingPath?, chunkType?, contentHash?, qualityScore?, codeBlocks? }` |
| `stats` | object | `{ rawChars, cleanedChars, totalChunks, chunkSize, chunkOverlap, chunkingMode, avgQualityScore?, totalTokens? }` |

A `summary` entry is appended at the end with aggregate statistics across all records.
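
When consuming the dataset, it is often convenient to flatten records into individual chunks. The Python sketch below uses only the documented fields `recordId`, `status`, `chunks`, `chunkIndex`, and `content`. It assumes the trailing summary entry does not carry `status: "ok"`, so filtering on status also skips it; `iter_chunks` is a hypothetical helper name.

```python
def iter_chunks(items: list[dict]):
    """Yield (recordId, chunkIndex, content) for every chunk of every successful record."""
    for item in items:
        # Skip empty records and, by assumption, the trailing summary entry.
        if item.get("status") != "ok":
            continue
        for chunk in item.get("chunks", []):
            yield item["recordId"], chunk["chunkIndex"], chunk["content"]
```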

---

### Pricing

**Pay Per Event** — $0.01 per 1,000 results.

One result = one processed record (not per chunk). Processing 5 records with 200 total chunks = 5 billable results. Enterprise mode costs **the same**; you pay OpenAI and your vector DB provider (Pinecone or Qdrant) directly for their API usage.
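
As a worked example of the billing rule, the following Python sketch computes the pay-per-event charge from the number of processed records. It covers only the $0.01 per 1,000 results event price stated above; Apify platform usage and third-party API costs are not included. The function name is hypothetical.

```python
PRICE_PER_1000_RESULTS_USD = 0.01


def actor_event_cost(records_processed: int) -> float:
    """Pay-per-event charge in USD; the number of chunks does not affect it."""
    return records_processed / 1000 * PRICE_PER_1000_RESULTS_USD
```

For the 5-record example above this gives $0.00005, regardless of the 200 chunks produced.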

---

### Examples

#### Standard Mode

**Input:**
```json
{
  "htmlContent": "<html><body><h1>Hello World</h1><p>This is <strong>important</strong> content.</p></body></html>",
  "chunkSize": 500,
  "chunkOverlap": 50
}
```

**Output:**

```json
{
  "recordId": "default",
  "status": "ok",
  "rawMarkdown": "## Hello World\n\nThis is **important** content.",
  "chunks": [
    { "chunkIndex": 0, "content": "## Hello World\n\nThis is **important** content.", "characterCount": 49 }
  ],
  "stats": { "rawChars": 107, "cleanedChars": 49, "totalChunks": 1, "chunkSize": 500, "chunkOverlap": 50, "chunkingMode": "character" }
}
```

#### Enterprise Mode (Token Chunking + Pinecone)

**Input:**

```json
{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "pineconeApiKey": "pc-...",
  "pineconeIndex": "my-rag-index"
}
```

**Output includes:**

```json
{
  "stats": { "totalChunks": 12, "totalTokens": 482, "chunkingMode": "token" },
  "chunks": [
    { "chunkIndex": 0, "content": "...", "tokenCount": 42, "characterCount": 185 }
  ]
}
```

Vectors are upserted to Pinecone automatically.

#### Enterprise Mode (Token Chunking + Qdrant)

**Input:**

```json
{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "qdrantUrl": "https://xxx.us-east-1-0.aws.cloud.qdrant.io",
  "qdrantApiKey": "qdrant-...",
  "qdrantCollection": "my-rag-collection"
}
```

Vectors are upserted to Qdrant automatically via REST API (no SDK required).

---

### Usage with AI Agents / MCP

`POST https://api.apify.com/v2/acts/foxpink~apify-rag-markdown-chunker/runs?token=YOUR_API_TOKEN` with the JSON body:

```json
{
  "htmlContent": "<h1>Your HTML here</h1>",
  "chunkSize": 1000,
  "chunkOverlap": 200
}
```

Note: This tool operates on **already-crawled** content. Use with any Apify Web Scraper, Puppeteer, or browser-based Actor by piping its output into this Actor's `inputRecords`.
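
A possible way to pipe scraper output into `inputRecords` is sketched below in Python. It assumes each upstream dataset item exposes the page markup under an `html` key and the page address under a `url` key; those field names vary by scraper and are assumptions here, as is the helper name `to_input_records`.

```python
from typing import Optional


def to_input_records(scraped_items: list[dict], chunk_size: Optional[int] = None) -> list[dict]:
    """Map scraper items to the { id, html, chunkSize? } shape of inputRecords."""
    records = []
    for index, item in enumerate(scraped_items):
        html = item.get("html")  # assumed field name in the upstream dataset
        if not html:
            continue  # nothing to convert for this item
        record = {"id": item.get("url") or f"record-{index}", "html": html}
        if chunk_size is not None:
            record["chunkSize"] = chunk_size  # optional per-record override
        records.append(record)
    return records
```

The returned list can be passed as `{"inputRecords": records}` in the Actor input.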

---

### Compatibility

- 100% Node.js (18+)
- No browser, no headless, no DOM
- ESM (ECMAScript Modules)

# Actor input Schema

## `htmlContent` (type: `string`):

Paste raw HTML or text content here. For bulk processing, use the inputRecords array instead.

## `chunkSize` (type: `integer`):

Maximum target length of each chunk. Interpreted as tokens in Enterprise mode (with Vector DB), otherwise characters. Default: 1000.

## `chunkOverlap` (type: `integer`):

Overlap between consecutive chunks. Interpreted as tokens in Enterprise mode, otherwise characters. Default: 200.

## `mode` (type: `string`):

Select what output to produce. 'both' returns cleaned Markdown plus chunked segments. Default: both.

## `urls` (type: `array`):

Optional: array of URLs to fetch HTML from. Each URL is fetched and processed through the HTML-to-Markdown pipeline. Alternative to providing htmlContent, fileUrls, or inputRecords.

## `fileUrls` (type: `array`):

Optional: array of direct download URLs to PDF (.pdf) or Word (.docx) files. Files are downloaded and parsed to text, then processed through the same chunking pipeline. No headless browser needed — pure Node.js binary parsers. Alternative to providing htmlContent, urls, or inputRecords.

## `chunkingStrategy` (type: `string`):

How to split content into chunks. 'character' uses fixed character counts. 'semantic' respects markdown headings. 'auto' picks semantic for content >5000 chars, character otherwise. Default: auto.
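
The `auto` rule and heading-based splitting can be sketched in Python as follows. This is an illustration under assumptions, not the Actor's code: the names `pick_strategy` and `split_on_headings` are hypothetical, and the splitter treats every line starting with one to six `#` characters and a space as a heading, including such lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def pick_strategy(markdown: str, strategy: str = "auto") -> str:
    """Resolve 'auto' to 'semantic' for content longer than 5000 characters."""
    if strategy != "auto":
        return strategy
    return "semantic" if len(markdown) > 5000 else "character"


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [match.start() for match in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [section for section in sections if section]
```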

## `inputRecords` (type: `array`):

For bulk processing: provide an array of objects with 'id' and 'html' fields. Each record is processed independently with optional per-record chunkSize and chunkOverlap.

## `embeddingModel` (type: `string`):

OpenAI embedding model to use. text-embedding-3-small is default (fast, cheap, 1536-dim). text-embedding-3-large (3072-dim) for higher accuracy. text-embedding-ada-002 (1536-dim) legacy.
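
Because the vector size depends on the model, a mismatch between `embeddingModel` and the index or collection dimension is an easy mistake to make. The Python sketch below checks this before a run, using the dimensions listed above; `check_index_dimension` is a hypothetical helper.

```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_index_dimension(embedding_model: str, index_dimension: int) -> None:
    """Raise ValueError if the index cannot store vectors from the chosen model."""
    expected = EMBEDDING_DIMENSIONS.get(embedding_model)
    if expected is None:
        raise ValueError(f"unknown embedding model: {embedding_model}")
    if expected != index_dimension:
        raise ValueError(
            f"{embedding_model} produces {expected}-dimension vectors, "
            f"but the index is configured for {index_dimension}"
        )
```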

## `batchSize` (type: `integer`):

Number of chunks to embed per API call. Lower to avoid rate limits, higher for throughput. Default: 20.
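
To illustrate what a batch size means for the number of embedding calls, here is a minimal Python sketch that groups chunks into batches. The helper name `batched` is hypothetical; the Actor's internal batching may differ.

```python
from typing import Iterator


def batched(chunks: list[str], batch_size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]
```

With the default of 20, a document that yields 45 chunks would need three embedding calls.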

## `openaiApiKey` (type: `string`):

Required for Enterprise mode. Used for token-aware chunking and generating Vector Embeddings via text-embedding-3-small.

## `pineconeApiKey` (type: `string`):

Pinecone API key. Provide with pineconeIndex + openaiApiKey to auto-upsert vectors to Pinecone.

## `pineconeIndex` (type: `string`):

Name of the Pinecone index (must match 1536-dimension for text-embedding-3-small).

## `qdrantUrl` (type: `string`):

Qdrant cluster URL (e.g. https://xxx.cloud.qdrant.io). Provide with qdrantApiKey + qdrantCollection + openaiApiKey to auto-upsert vectors to Qdrant.

## `qdrantApiKey` (type: `string`):

API key from your Qdrant Cloud console.

## `qdrantCollection` (type: `string`):

Name of the Qdrant collection. Must have a 1536-dimension vector configured.

## Actor input object example

```json
{
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "mode": "both",
  "chunkingStrategy": "auto",
  "embeddingModel": "text-embedding-3-small",
  "batchSize": 20
}
```

# Actor output Schema

## `convertedRecords` (type: `string`):

Primary output: JSON array of processed records. Each record includes recordId, status (ok/empty), rawMarkdown (cleaned Markdown content), chunks array (with chunkIndex, content, characterCount), and stats (rawChars, cleanedChars, totalChunks, chunkSize, chunkOverlap).

## `chunksOnly` (type: `string`):

Filtered view: only the chunked segments with recordId and chunks array. Ideal for direct ingestion into Vector DBs like Pinecone, Weaviate, Qdrant, or Chroma. Each chunk includes its content and character count for token estimation.

## `summary` (type: `string`):

Aggregate statistics for the entire run: version, total records processed, total chunks produced, and per-record breakdown with status, totalChunks, and cleanedChars. Stored as the last dataset item under key 'summary'. Useful for pipeline monitoring and cost estimation.

## `jsonlExport` (type: `string`):

Download all chunks as JSONL format (one JSON object per line). Each line has 'text' and 'source' fields — ready for fine-tuning LLMs, embedding batch jobs, or direct ingestion into LangChain/LlamaIndex pipelines.
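
A downloaded JSONL export can be read back with a few lines of Python. The sketch assumes each line is a JSON object with the `text` and `source` fields described above; `parse_jsonl` is a hypothetical helper name.

```python
import json


def parse_jsonl(payload: str) -> list[dict]:
    """Parse JSONL text into a list of objects, ignoring blank lines."""
    rows = []
    # Split on "\n" only, so unusual Unicode line separators inside strings are left alone.
    for line in payload.split("\n"):
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows
```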

## `csvExport` (type: `string`):

Download all processed records as a CSV file ready for spreadsheet analysis. Columns: recordId, status, rawChars, cleanedChars, totalChunks, chunkSize, chunkOverlap.

## `jsonExport` (type: `string`):

Download all processed records as formatted JSON with indentation. Suitable for ETL pipelines, AI workflow tools, and developer handoff.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
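const input = {
    htmlContent: '<h1>Hello World</h1><p>This is <strong>important</strong> content.</p>',
    chunkSize: 500,
    chunkOverlap: 50,
};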

// Run the Actor and wait for it to finish
const run = await client.actor("foxpink/apify-rag-markdown-chunker").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
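run_input = {
    "htmlContent": "<h1>Hello World</h1><p>This is <strong>important</strong> content.</p>",
    "chunkSize": 500,
    "chunkOverlap": 50,
}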

# Run the Actor and wait for it to finish
run = client.actor("foxpink/apify-rag-markdown-chunker").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call foxpink/apify-rag-markdown-chunker --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=foxpink/apify-rag-markdown-chunker",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "RAG-Ready Markdown Converter & Chunker",
        "description": "Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.",
        "version": "1.4",
        "x-build-id": "6q3KyTKt60uMyBXKx"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/foxpink~apify-rag-markdown-chunker/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-foxpink-apify-rag-markdown-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/foxpink~apify-rag-markdown-chunker/runs": {
            "post": {
                "operationId": "runs-sync-foxpink-apify-rag-markdown-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/foxpink~apify-rag-markdown-chunker/run-sync": {
            "post": {
                "operationId": "run-sync-foxpink-apify-rag-markdown-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "htmlContent": {
                        "title": "Raw HTML / Text Content",
                        "type": "string",
                        "description": "Paste raw HTML or text content here. For bulk processing, use the inputRecords array instead."
                    },
                    "chunkSize": {
                        "title": "Chunk Size (characters or tokens)",
                        "minimum": 50,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum target length of each chunk. Interpreted as tokens in Enterprise mode (with Vector DB), otherwise characters. Default: 1000.",
                        "default": 1000
                    },
                    "chunkOverlap": {
                        "title": "Chunk Overlap (characters or tokens)",
                        "minimum": 0,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Overlap between consecutive chunks. Interpreted as tokens in Enterprise mode, otherwise characters. Default: 200.",
                        "default": 200
                    },
                    "mode": {
                        "title": "Output Mode",
                        "enum": [
                            "both",
                            "markdown",
                            "chunks"
                        ],
                        "type": "string",
                        "description": "Select what output to produce. 'both' returns cleaned Markdown plus chunked segments. Default: both.",
                        "default": "both"
                    },
                    "urls": {
                        "title": "URLs to Fetch (HTML)",
                        "type": "array",
                        "description": "Optional: array of URLs to fetch HTML from. Each URL is fetched and processed through the HTML-to-Markdown pipeline. Alternative to providing htmlContent, fileUrls, or inputRecords.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "fileUrls": {
                        "title": "File URLs to Parse (PDF/DOCX)",
                        "type": "array",
                        "description": "Optional: array of direct download URLs to PDF (.pdf) or Word (.docx) files. Files are downloaded and parsed to text, then processed through the same chunking pipeline. No headless browser needed — pure Node.js binary parsers. Alternative to providing htmlContent, urls, or inputRecords.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "chunkingStrategy": {
                        "title": "Chunking Strategy",
                        "enum": [
                            "auto",
                            "character",
                            "semantic"
                        ],
                        "type": "string",
                        "description": "How to split content into chunks. 'character' uses fixed character counts. 'semantic' respects markdown headings. 'auto' picks semantic for content >5000 chars, character otherwise. Default: auto.",
                        "default": "auto"
                    },
                    "inputRecords": {
                        "title": "Bulk Input Records",
                        "type": "array",
                        "description": "For bulk processing: provide an array of objects with 'id' and 'html' fields. Each record is processed independently with optional per-record chunkSize and chunkOverlap.",
                        "items": {
                            "type": "object",
                            "required": [
                                "id",
                                "html"
                            ],
                            "properties": {
                                "id": {
                                    "title": "Record ID",
                                    "type": "string",
                                    "description": "Unique record identifier"
                                },
                                "html": {
                                    "title": "HTML Content",
                                    "type": "string",
                                    "description": "Raw HTML content to process"
                                },
                                "chunkSize": {
                                    "title": "Chunk Size Override",
                                    "type": "integer",
                                    "minimum": 50,
                                    "maximum": 10000,
                                    "description": "Optional per-record chunk size override"
                                },
                                "chunkOverlap": {
                                    "title": "Chunk Overlap Override",
                                    "type": "integer",
                                    "minimum": 0,
                                    "maximum": 5000,
                                    "description": "Optional per-record chunk overlap override"
                                }
                            }
                        }
                    },
                    "embeddingModel": {
                        "title": "Embedding Model",
                        "enum": [
                            "text-embedding-3-small",
                            "text-embedding-3-large",
                            "text-embedding-ada-002"
                        ],
                        "type": "string",
                        "description": "OpenAI embedding model to use. text-embedding-3-small is default (fast, cheap, 1536-dim). text-embedding-3-large (3072-dim) for higher accuracy. text-embedding-ada-002 (1536-dim) legacy.",
                        "default": "text-embedding-3-small"
                    },
                    "batchSize": {
                        "title": "Embedding Batch Size",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Number of chunks to embed per API call. Lower to avoid rate limits, higher for throughput. Default: 20.",
                        "default": 20
                    },
                    "openaiApiKey": {
                        "title": "OpenAI API Key (Enterprise)",
                        "type": "string",
                        "description": "Required for Enterprise mode. Used for token-aware chunking and generating Vector Embeddings via text-embedding-3-small."
                    },
                    "pineconeApiKey": {
                        "title": "Pinecone API Key",
                        "type": "string",
                        "description": "Pinecone API key. Provide with pineconeIndex + openaiApiKey to auto-upsert vectors to Pinecone."
                    },
                    "pineconeIndex": {
                        "title": "Pinecone Index Name",
                        "type": "string",
                        "description": "Name of the Pinecone index. The index dimension must match the embedding model (1536 for text-embedding-3-small)."
                    },
                    "qdrantUrl": {
                        "title": "Qdrant Cluster URL",
                        "type": "string",
                        "description": "Qdrant cluster URL (e.g. https://xxx.cloud.qdrant.io). Provide with qdrantApiKey + qdrantCollection + openaiApiKey to auto-upsert vectors to Qdrant."
                    },
                    "qdrantApiKey": {
                        "title": "Qdrant API Key",
                        "type": "string",
                        "description": "API key from your Qdrant Cloud console."
                    },
                    "qdrantCollection": {
                        "title": "Qdrant Collection Name",
                        "type": "string",
                        "description": "Name of the Qdrant collection. The collection must be configured with 1536-dimension vectors."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
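
The embedding and vector-store fields in the input schema above interact: each vector store needs its own fields plus `openaiApiKey`, `batchSize` is bounded to 1..100, and the store descriptions mention 1536-dimension vectors while `text-embedding-3-large` produces 3072. The following is a minimal sketch of a client-side pre-flight check built only from those schema descriptions. The helper name `check_input` and its messages are hypothetical, and the Actor may well validate differently on its side.

```python
# Hypothetical pre-flight check; not part of the Actor or the Apify client.
# Limits and dimensions are copied from the input schema descriptions.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}
DOCUMENTED_STORE_DIM = 1536  # the only store dimension the schema mentions

# Fields each vector store needs, per the schema descriptions.
STORE_KEYS = {
    "Pinecone": ("pineconeApiKey", "pineconeIndex", "openaiApiKey"),
    "Qdrant": ("qdrantUrl", "qdrantApiKey", "qdrantCollection", "openaiApiKey"),
}


def check_input(run_input: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []

    model = run_input.get("embeddingModel", "text-embedding-3-small")
    if model not in EMBEDDING_DIMS:
        problems.append(f"embeddingModel {model!r} is not in the schema enum")

    batch_size = run_input.get("batchSize", 20)
    if not 1 <= batch_size <= 100:
        problems.append("batchSize must be between 1 and 100")

    store_requested = False
    for store, keys in STORE_KEYS.items():
        store_only = keys[:-1]  # everything except the shared openaiApiKey
        if any(run_input.get(k) for k in store_only):
            store_requested = True
            missing = [k for k in keys if not run_input.get(k)]
            if missing:
                problems.append(f"{store} upsert also needs: {', '.join(missing)}")

    # Assumption: a store sized as documented (1536) cannot hold larger vectors.
    dim = EMBEDDING_DIMS.get(model)
    if store_requested and dim is not None and dim != DOCUMENTED_STORE_DIM:
        problems.append(
            f"{model} produces {dim}-dim vectors; the schema documents "
            f"{DOCUMENTED_STORE_DIM}-dim stores"
        )
    return problems
```

Running `check_input(run_input)` before starting a run surfaces incomplete vector-store settings early, so a misconfigured run is caught before any charges are incurred.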

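The `runsResponseSchema` above nests everything under a top-level `data` object. As a rough illustration of reading that shape, the sketch below pulls out the fields a caller typically needs next (run status, default dataset ID, total cost). The function name `summarize_run` and the placeholder IDs are hypothetical; the other sample values are the `example` values from the schema.

```python
def summarize_run(response: dict) -> dict:
    """Extract a few commonly needed fields from a run response.

    Assumes the response follows runsResponseSchema; missing fields
    come back as None rather than raising.
    """
    data = response.get("data", {})
    options = data.get("options", {})
    return {
        "run_id": data.get("id"),
        "status": data.get("status"),
        "dataset_id": data.get("defaultDatasetId"),
        "cost_usd": data.get("usageTotalUsd"),
        "timeout_secs": options.get("timeoutSecs"),
    }


# Sample payload: IDs are placeholders, other values mirror the schema examples.
sample_response = {
    "data": {
        "id": "RUN_ID_PLACEHOLDER",
        "status": "READY",
        "defaultDatasetId": "DATASET_ID_PLACEHOLDER",
        "usageTotalUsd": 0.00005,
        "options": {"build": "latest", "timeoutSecs": 300, "memoryMbytes": 1024},
    }
}

summary = summarize_run(sample_response)
print(summary["status"], summary["dataset_id"], summary["cost_usd"])
```

The `dataset_id` value is what you would pass to the dataset endpoints or client to fetch the produced chunks once the run has finished.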