Rag Content Chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary -- ready for embeddings or vector DBs without extra glue code.

Apify Actor that splits text and Markdown into optimally-sized, token-counted chunks for RAG pipelines. Supports recursive, Markdown-aware, and sentence-based chunking strategies. Outputs flat chunk objects with deterministic IDs, ready for embedding and vector DB ingestion. MCP-ready for AI agent integration.

Features

  • Three chunking strategies: recursive (general), markdown (header-aware), sentence (boundary-preserving)
  • Token-aware splitting using tiktoken cl100k_base (compatible with OpenAI embeddings and GPT-4)
  • Deterministic chunk IDs (SHA-256) for incremental vector DB updates (see the sketch after this list)
  • Two input modes: direct text or dataset chaining from any crawler
  • Dot-notation support for nested dataset fields (e.g., metadata.content)
  • Configurable chunk size (64-8192 tokens) and overlap (0-2048 tokens)
  • Input validation and sanitization (size limits, control char stripping, injection prevention)
  • No external API calls, no API keys required -- pure local computation
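
The token counting and ID scheme referenced above can be reproduced locally. The sketch below is illustrative, not the actor's exact implementation: only SHA-256 hashing and determinism are documented, so the hash inputs (source URL, chunk index, chunk text) and the 16-hex-character truncation are assumptions modeled on the example output further down.

import hashlib
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding used for token counts

def chunk_id(source_url: str, chunk_index: int, chunk_text: str) -> str:
    # Hypothetical derivation: SHA-256 over source + index + text,
    # truncated to 16 hex characters.
    digest = hashlib.sha256(f"{source_url}|{chunk_index}|{chunk_text}".encode("utf-8"))
    return digest.hexdigest()[:16]

text = "This is a sample document."
print(len(enc.encode(text)), chunk_id("https://example.com/page", 0, text))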

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)

Install dependencies:

$ pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • text (string, optional) -- plain text or Markdown to chunk, max 500,000 characters
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Takes priority over text
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation (see the sketch below)
  • strategy (string, optional) -- "recursive" (default), "markdown", or "sentence"
  • chunk_size (integer, optional) -- target chunk size in tokens, 64-8192. Default: 512
  • chunk_overlap (integer, optional) -- overlapping tokens between chunks, 0-2048. Default: 64

At least one of text or dataset_id must be provided.
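
Dot-notation lookup simply walks nested keys, as in this minimal sketch (resolve_field is a hypothetical helper name, not the actor's internal function):

def resolve_field(item: dict, path: str):
    # "metadata.content" -> item["metadata"]["content"]; None if any key is missing.
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

print(resolve_field({"metadata": {"content": "body text"}}, "metadata.content"))  # body text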

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Direct Text Input

{
  "text": "# Introduction\n\nThis is a sample document with multiple sections.\n\n## Section One\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n## Section Two\n\nUt enim ad minim veniam, quis nostrud exercitation.",
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32
}

Dataset Chaining (from Website Content Crawler)

{
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}
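
The same chaining can be driven programmatically with the Apify Python client; a minimal sketch (token handling and error checking kept out for brevity):

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Run the chunker against a dataset produced by a previous crawler run.
run = client.actor("labrat011/rag-content-chunker").call(run_input={
    "dataset_id": "abc123XYZ",
    "dataset_field": "text",
    "strategy": "recursive",
    "chunk_size": 512,
    "chunk_overlap": 64,
})

# Iterate the chunk items, skipping the trailing summary item.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if not item.get("_summary"):
        print(item["chunk_id"], item["token_count"])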

Example Output

Each chunk is a separate dataset item:

{
  "chunk_id": "a1b2c3d4e5f67890",
  "chunk_index": 0,
  "text": "# Introduction\n\nThis is a sample document with multiple sections.",
  "token_count": 12,
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_chunks": 3,
  "total_tokens": 847,
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32,
  "processing_time": 0.142,
  "billing": {
    "total_chunks": 3,
    "amount": 0.0015,
    "rate_per_chunk": 0.0005
  }
}

Pipeline Position

This actor fills the chunking step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (this actor)
-> Embed (OpenAI, Cohere, etc.)
-> Store (Pinecone, Qdrant, Weaviate integrations)
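
For the Embed step, a minimal sketch using the OpenAI Python client (the model choice is an assumption; any cl100k_base-compatible embedding model pairs naturally with the default settings):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunks = [
    {"chunk_id": "a1b2c3d4e5f67890",
     "text": "# Introduction\n\nThis is a sample document with multiple sections."},
]

# Embed chunk texts in one batch; resp.data preserves the input order.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],
)
vectors = {c["chunk_id"]: d.embedding for c, d in zip(chunks, resp.data)}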

Chunking Strategy Guide

Strategy    Best For                             How It Splits
recursive   General text, mixed content          Paragraphs -> sentences -> words -> hard token cuts
markdown    Documentation, crawled web pages     Markdown headers (h1-h6), preserves section structure
sentence    Conversational content, Q&A, prose   Sentence boundaries, preserves sentence integrity

Choosing chunk_size

  • 256 tokens -- fine-grained retrieval, higher precision, more chunks
  • 512 tokens (default) -- balanced for most RAG use cases (see the sketch after this list)
  • 1024 tokens -- broader context per chunk, fewer chunks, good for summarization
  • 2048+ tokens -- large context windows, best with newer embedding models
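
How chunk_size and chunk_overlap interact is easiest to see in a deliberately naive fixed-window split (the actor's real strategies also respect paragraph, header, or sentence boundaries):

def token_windows(tokens: list[int], chunk_size: int = 512, overlap: int = 64) -> list[list[int]]:
    # Each window starts (chunk_size - overlap) tokens after the previous one,
    # so consecutive chunks share exactly `overlap` tokens.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# 1,000 tokens at the defaults -> windows starting at 0, 448, and 896.
print([len(w) for w in token_windows(list(range(1000)))])  # [512, 512, 104]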

Architecture

  • src/agent/main.py -- Actor entry point, input handling, dataset chaining, output
  • src/agent/chunker.py -- Core chunking engine with three strategies and token counting
  • src/agent/validation.py -- Input validation, sanitization, and security checks
  • src/agent/pricing.py -- PPE billing calculator ($0.0005/chunk)
  • skill.md -- Machine-readable skill contract for agent discovery

Security

  • Size limits: 500K chars max text, 10K max dataset items, bounded chunk parameters
  • Sanitization: Strips null bytes and control characters (preserves newlines/tabs for Markdown)
  • Injection prevention: Dataset IDs and field names validated against strict regex patterns (see the sketch after this list)
  • No LLM calls: Pure text processing, zero prompt injection surface
  • No secrets: Actor requires no API keys or credentials
  • No network calls: All processing is local computation
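
The dataset_id check can be approximated as follows; this is a sketch consistent with the documented 1-64 character alphanumeric/hyphen/underscore rule, while the actual patterns live in src/agent/validation.py:

import re

DATASET_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

def validate_dataset_id(dataset_id: str) -> str:
    # fullmatch rejects anything outside the allowed character set or length.
    if not DATASET_ID_RE.fullmatch(dataset_id):
        raise ValueError("Invalid dataset_id format")
    return dataset_id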

Pricing

Pay-Per-Event (PPE): $0.0005 per chunk ($0.50 per 1,000 chunks).

Content Size          Approx. Chunks   Cost
Single blog post      10-20            $0.005-$0.01
10-page website       50-100           $0.025-$0.05
100-page docs site    500-1,000        $0.25-$0.50
Large knowledge base  5,000-10,000     $2.50-$5.00

Troubleshooting

  • "No input provided": Supply either text or dataset_id
  • "Text exceeds maximum length": Split content into batches under 500K chars, or use dataset mode
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 chars
  • "chunk_overlap must be less than chunk_size": Reduce overlap or increase chunk size
  • "No chunks produced": Input text may be empty or contain only whitespace/control characters
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-content-chunker
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-content-chunker": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-content-chunker",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to split text and documents into optimally-sized chunks for RAG pipelines, prepare content for embedding, and build retrieval-ready datasets -- all as a callable MCP tool.