# Text Splitter & Chunker for RAG / LLMs (`zenomastro/text-splitter-for-llm`) Actor

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

- **URL**: https://apify.com/zenomastro/text-splitter-for-llm.md
- **Developed by:** [Rosario Vitale](https://apify.com/zenomastro) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $5.00 / 1,000 "text chunked" events

This Actor is paid per event and usage. You are charged a fixed price for specific events, plus the cost of Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Text Splitter & Chunker for RAG / LLMs

Split any text into clean, overlapping **chunks** that are ready for embeddings, vector databases, RAG pipelines and LLM context windows — without writing your own splitter.

Paste text (or send many documents), pick a chunk size and overlap, and get back tidy chunks with character counts and approximate token counts as JSON or CSV.

### Why

Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.

### Features

- ✂️ **Smart chunking** — packs text up to your target size while respecting paragraph/sentence boundaries.
- 🔁 **Overlap** — keeps a configurable overlap so ideas spanning a boundary aren't lost.
- 🔢 **Characters or tokens** — size and overlap in characters or approximate tokens (~4 chars/token).
- 🧹 **Cleaning** — normalizes whitespace and collapses excessive blank lines.
- 📦 **Batch** — split many documents in a single run.
- 📊 **Token estimate** — every chunk includes `charCount` and `approxTokens`.
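
To make the features above concrete, here is a minimal local sketch of cleaning text and then packing paragraphs into overlapping chunks. It is not the Actor's actual implementation: the function names `clean_text` and `pack_paragraphs`, the greedy packing rule, and the choice to carry over the tail of the previous chunk as overlap are all assumptions made for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace and collapse runs of blank lines (illustrative rules)."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)          # collapse repeated spaces/tabs
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def pack_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters.

    Each new chunk starts with the last `overlap` characters of the previous
    chunk. A single paragraph longer than chunk_size is kept whole here.
    """
    paragraphs = [p for p in clean_text(text).split("\n\n") if p]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Current chunk is full: emit it and seed the next one with its tail.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap is prepended to the next chunk, a chunk in this sketch can slightly exceed `chunk_size`; a stricter splitter would budget for the overlap before packing.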

### Input

| Field | Type | Description |
|---|---|---|
| `text` | string | A single document to split. |
| `texts` | array | Multiple documents (one per item). |
| `chunkSize` | integer | Target chunk size. Default `1000`. |
| `chunkOverlap` | integer | Overlap between chunks. Default `100`. |
| `unit` | select | `characters` or `tokens`. Default `characters`. |
| `splitBy` | select | `paragraph`, `sentence` or `character`. Default `paragraph`. |
| `clean` | boolean | Normalize whitespace. Default `true`. |

#### Example input

```json
{
    "text": "Your long document text goes here...",
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "unit": "characters",
    "splitBy": "paragraph",
    "clean": true
}
````
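
When the input is assembled in code, it can help to check it before starting a run. The sketch below builds an input object from the fields in the table above and enforces the documented minimums and allowed values. The helper name `build_input` is hypothetical, and the rule that `chunkOverlap` must be smaller than `chunkSize` is an assumption, not a documented constraint of this Actor.

```python
from typing import Any

ALLOWED_UNITS = {"characters", "tokens"}
ALLOWED_SPLIT_BY = {"paragraph", "sentence", "character"}


def build_input(
    documents: list[str],
    chunk_size: int = 1000,
    chunk_overlap: int = 100,
    unit: str = "characters",
    split_by: str = "paragraph",
    clean: bool = True,
) -> dict[str, Any]:
    """Build an Actor input dict, using `text` for one document and `texts` for many."""
    if not documents:
        raise ValueError("provide at least one document")
    if chunk_size < 1:
        raise ValueError("chunkSize must be >= 1")
    if chunk_overlap < 0:
        raise ValueError("chunkOverlap must be >= 0")
    if chunk_overlap >= chunk_size:
        # Assumption: an overlap as large as the chunk itself is treated as an error.
        raise ValueError("chunkOverlap should be smaller than chunkSize")
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {sorted(ALLOWED_UNITS)}")
    if split_by not in ALLOWED_SPLIT_BY:
        raise ValueError(f"splitBy must be one of {sorted(ALLOWED_SPLIT_BY)}")

    payload: dict[str, Any] = {
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "unit": unit,
        "splitBy": split_by,
        "clean": clean,
    }
    if len(documents) == 1:
        payload["text"] = documents[0]
    else:
        payload["texts"] = list(documents)
    return payload
```

The returned dict can be passed as `run_input` to the Python client call shown in the API section.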

### Output

One dataset item per chunk:

```json
{
    "sourceIndex": 0,
    "chunkIndex": 0,
    "totalChunks": 3,
    "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
    "charCount": 312,
    "approxTokens": 78
}
```

Export as **JSON, CSV, or Excel**, or pull via the Apify API — then send the chunks straight to your embeddings model or vector DB.
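
After a batch run, the dataset items for several documents arrive as one flat list. The sketch below regroups them per source document and restores chunk order. It assumes that `sourceIndex` identifies the document's position in the input and that `chunkIndex` orders chunks within that document, as the example item suggests; `group_chunks` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Any


def group_chunks(items: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Return {sourceIndex: [chunk texts in chunkIndex order]}."""
    grouped: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for item in items:
        grouped[item["sourceIndex"]].append(item)

    result: dict[int, list[str]] = {}
    for source_index in sorted(grouped):
        chunks = sorted(grouped[source_index], key=lambda c: c["chunkIndex"])
        expected = chunks[0]["totalChunks"]
        if len(chunks) != expected:
            # A mismatch usually means the dataset was only partially fetched.
            raise ValueError(
                f"source {source_index}: got {len(chunks)} chunks, expected {expected}"
            )
        result[source_index] = [c["text"] for c in chunks]
    return result
```

Each list of texts can then be sent to an embeddings model in order, with `sourceIndex` and the list position stored as metadata alongside the vectors.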

### Common use cases

- Prepare documents for **embeddings + vector search** (Pinecone, Qdrant, Weaviate, pgvector).
- Build **RAG** context for ChatGPT/Claude apps.
- Fit long content into **LLM context windows**.
- Pairs perfectly with **PDF to Structured Data** — extract text from PDFs, then chunk it here.

### Notes

- Token counts are an **estimate** (~4 characters per token); exact tokenization depends on the model.
- For `character` split mode the text is hard-cut at the size boundary; `paragraph`/`sentence` respect natural boundaries.
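
The two notes above can be sketched locally as follows. This is an illustration under stated assumptions, not the Actor's code: the rounding used for `approxTokens` is not documented (rounding up is assumed here), converting a token-based size to characters by multiplying by 4 is inferred from the ~4 characters per token heuristic, and `approx_tokens` and `hard_cut` are hypothetical names.

```python
import math

CHARS_PER_TOKEN = 4  # heuristic from the notes; real tokenizers vary by model


def approx_tokens(text: str) -> int:
    """Estimate the token count from the character count (rounding up is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def hard_cut(text: str, chunk_size: int, overlap: int = 0, unit: str = "characters") -> list[str]:
    """Cut text at fixed offsets, ignoring paragraph and sentence boundaries."""
    if unit == "tokens":
        # Assumed conversion: token-based sizes become character sizes.
        chunk_size *= CHARS_PER_TOKEN
        overlap *= CHARS_PER_TOKEN
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

For example, `hard_cut("abcdefghij", 4, 2)` yields `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the one before it.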

# Actor input Schema

## `text` (type: `string`):

A single document to split into chunks. Use this for one long text.

## `texts` (type: `array`):

Several documents to split in one run. Each array item is one document.

## `chunkSize` (type: `integer`):

Target size of each chunk (in characters or tokens, see Unit).

## `chunkOverlap` (type: `integer`):

How much each chunk overlaps the previous one (in characters or tokens). Helps preserve context across boundaries.

## `unit` (type: `string`):

Whether chunk size and overlap are measured in characters or approximate tokens (~4 chars/token).

## `splitBy` (type: `string`):

Preferred boundary to split on before packing chunks.

## `clean` (type: `boolean`):

Normalize whitespace and collapse excessive blank lines before splitting.

## Actor input object example

```json
{
  "text": "Retrieval-Augmented Generation (RAG) combines a language model with an external knowledge base. Instead of relying only on what the model memorized during training, RAG retrieves relevant chunks of text and feeds them to the model as context.\n\nTo build a RAG system you first split your documents into chunks, create embeddings for each chunk, and store them in a vector database. At query time you embed the user's question, find the most similar chunks, and pass them to the model alongside the prompt.\n\nChunking matters a lot. Chunks that are too large dilute relevance and waste tokens, while chunks that are too small lose context. A common starting point is around 1000 characters per chunk with a small overlap, so that ideas spanning a boundary are not lost between neighbouring chunks.",
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "unit": "characters",
  "splitBy": "paragraph",
  "clean": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "text": `Retrieval-Augmented Generation (RAG) combines a language model with an external knowledge base. Instead of relying only on what the model memorized during training, RAG retrieves relevant chunks of text and feeds them to the model as context.

To build a RAG system you first split your documents into chunks, create embeddings for each chunk, and store them in a vector database. At query time you embed the user's question, find the most similar chunks, and pass them to the model alongside the prompt.

Chunking matters a lot. Chunks that are too large dilute relevance and waste tokens, while chunks that are too small lose context. A common starting point is around 1000 characters per chunk with a small overlap, so that ideas spanning a boundary are not lost between neighbouring chunks.`
};

// Run the Actor and wait for it to finish
const run = await client.actor("zenomastro/text-splitter-for-llm").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "text": """Retrieval-Augmented Generation (RAG) combines a language model with an external knowledge base. Instead of relying only on what the model memorized during training, RAG retrieves relevant chunks of text and feeds them to the model as context.

To build a RAG system you first split your documents into chunks, create embeddings for each chunk, and store them in a vector database. At query time you embed the user's question, find the most similar chunks, and pass them to the model alongside the prompt.

Chunking matters a lot. Chunks that are too large dilute relevance and waste tokens, while chunks that are too small lose context. A common starting point is around 1000 characters per chunk with a small overlap, so that ideas spanning a boundary are not lost between neighbouring chunks.""" }

# Run the Actor and wait for it to finish
run = client.actor("zenomastro/text-splitter-for-llm").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "text": "Retrieval-Augmented Generation (RAG) combines a language model with an external knowledge base. Instead of relying only on what the model memorized during training, RAG retrieves relevant chunks of text and feeds them to the model as context.\\n\\nTo build a RAG system you first split your documents into chunks, create embeddings for each chunk, and store them in a vector database. At query time you embed the user'\''s question, find the most similar chunks, and pass them to the model alongside the prompt.\\n\\nChunking matters a lot. Chunks that are too large dilute relevance and waste tokens, while chunks that are too small lose context. A common starting point is around 1000 characters per chunk with a small overlap, so that ideas spanning a boundary are not lost between neighbouring chunks."
}' |
apify call zenomastro/text-splitter-for-llm --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=zenomastro/text-splitter-for-llm",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Text Splitter & Chunker for RAG / LLMs",
        "description": "Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.",
        "version": "0.1",
        "x-build-id": "TAmcOahZdT6Xy2ybS"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/zenomastro~text-splitter-for-llm/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-zenomastro-text-splitter-for-llm",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/zenomastro~text-splitter-for-llm/runs": {
            "post": {
                "operationId": "runs-sync-zenomastro-text-splitter-for-llm",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/zenomastro~text-splitter-for-llm/run-sync": {
            "post": {
                "operationId": "run-sync-zenomastro-text-splitter-for-llm",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "text": {
                        "title": "Text",
                        "type": "string",
                        "description": "A single document to split into chunks. Use this for one long text."
                    },
                    "texts": {
                        "title": "Texts (multiple)",
                        "type": "array",
                        "description": "Several documents to split in one run. Each array item is one document.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "chunkSize": {
                        "title": "Chunk size",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Target size of each chunk (in characters or tokens, see Unit).",
                        "default": 1000
                    },
                    "chunkOverlap": {
                        "title": "Chunk overlap",
                        "minimum": 0,
                        "type": "integer",
                        "description": "How much each chunk overlaps the previous one (in characters or tokens). Helps preserve context across boundaries.",
                        "default": 100
                    },
                    "unit": {
                        "title": "Unit",
                        "enum": [
                            "characters",
                            "tokens"
                        ],
                        "type": "string",
                        "description": "Whether chunk size and overlap are measured in characters or approximate tokens (~4 chars/token).",
                        "default": "characters"
                    },
                    "splitBy": {
                        "title": "Split by",
                        "enum": [
                            "paragraph",
                            "sentence",
                            "character"
                        ],
                        "type": "string",
                        "description": "Preferred boundary to split on before packing chunks.",
                        "default": "paragraph"
                    },
                    "clean": {
                        "title": "Clean text",
                        "type": "boolean",
                        "description": "Normalize whitespace and collapse excessive blank lines before splitting.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
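
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be used with only the Python standard library. The sketch below builds the POST request with the token as a query parameter, as the specification describes. The helper names `build_sync_request` and `run_sync` are hypothetical, and decoding the response as a JSON array of dataset items is an assumption based on the endpoint summary.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "zenomastro~text-splitter-for-llm"


def build_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the response body (assumed to be a JSON array)."""
    request = build_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_sync("<YOUR_API_TOKEN>", {"text": "..."})` would start a real, billable run, so the request builder is kept separate to allow inspection and testing without network access.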
