Rag Content Chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary -- ready for embeddings or vector DBs without extra glue code.
Apify Actor that splits text and Markdown into optimally-sized, token-counted chunks for RAG pipelines. Supports recursive, Markdown-aware, and sentence-based chunking strategies. Outputs flat chunk objects with deterministic IDs, ready for embedding and vector DB ingestion. MCP-ready for AI agent integration.
Features
- Three chunking strategies: recursive (general), markdown (header-aware), sentence (boundary-preserving)
- Token-aware splitting using tiktoken cl100k_base (compatible with OpenAI embeddings and GPT-4)
- Deterministic chunk IDs (SHA-256) for incremental vector DB updates (see the sketch after this list)
- Two input modes: direct text or dataset chaining from any crawler
- Dot-notation support for nested dataset fields (e.g., metadata.content)
- Configurable chunk size (64-8192 tokens) and overlap (0-2048 tokens)
- Input validation and sanitization (size limits, control char stripping, injection prevention)
- No external API calls, no API keys required -- pure local computation
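A deterministic ID of the kind listed above can be derived by hashing stable chunk properties. A minimal sketch in Python; the exact fields hashed and the 16-character truncation are assumptions based on the example output shown later, not the actor's actual implementation:

```python
import hashlib

def chunk_id(source_url: str, chunk_index: int, text: str) -> str:
    """Derive a stable ID so re-chunking unchanged content yields the same IDs,
    which lets a vector DB upsert overwrite old chunks instead of duplicating them."""
    payload = f"{source_url}|{chunk_index}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]  # 16-char IDs, as in the example output
```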
Requirements
- Python 3.11+
- Apify platform account (for running as Actor)
Install dependencies:
$ pip install -r requirements.txt
Configuration
Actor Inputs
Defined in .actor/INPUT_SCHEMA.json:
- text (string, optional) -- plain text or Markdown to chunk, max 500,000 characters
- dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Takes priority over text
- dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation
- strategy (string, optional) -- "recursive" (default), "markdown", or "sentence"
- chunk_size (integer, optional) -- target chunk size in tokens, 64-8192. Default: 512
- chunk_overlap (integer, optional) -- overlapping tokens between chunks, 0-2048. Default: 64
At least one of text or dataset_id must be provided.
Usage
Local (CLI)
$ APIFY_TOKEN=your-token apify run
Direct Text Input
{"text": "# Introduction\n\nThis is a sample document with multiple sections.\n\n## Section One\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n## Section Two\n\nUt enim ad minim veniam, quis nostrud exercitation.","strategy": "markdown","chunk_size": 256,"chunk_overlap": 32}
Dataset Chaining (from Website Content Crawler)
{"dataset_id": "abc123XYZ","dataset_field": "text","strategy": "recursive","chunk_size": 512,"chunk_overlap": 64}
Example Output
Each chunk is a separate dataset item:
{"chunk_id": "a1b2c3d4e5f67890","chunk_index": 0,"text": "# Introduction\n\nThis is a sample document with multiple sections.","token_count": 12,"source_url": "https://example.com/page","page_title": "Example Page","section_heading": "Introduction"}
A summary item is appended at the end:
{"_summary": true,"total_chunks": 3,"total_tokens": 847,"strategy": "markdown","chunk_size": 256,"chunk_overlap": 32,"processing_time": 0.142,"billing": {"total_chunks": 3,"amount": 0.0015,"rate_per_chunk": 0.0005}}
Pipeline Position
This actor fills the chunking step in a standard RAG pipeline:
Crawl (Website Content Crawler, 101K+ users) -> Clean (optional preprocessing) -> Chunk (this actor) -> Embed (OpenAI, Cohere, etc.) -> Store (Pinecone, Qdrant, Weaviate integrations)
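The Embed step could look like the following sketch, assuming the OpenAI Python SDK and a generic record layout; swap in whichever embedding provider and vector store integration you actually use:

```python
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_chunks(chunks: list[dict]) -> list[dict]:
    """Turn chunker output items into records a vector DB can upsert."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[chunk["text"] for chunk in chunks],
    )
    return [
        {
            "id": chunk["chunk_id"],  # deterministic, so re-runs overwrite in place
            "values": data.embedding,
            "metadata": {
                "source_url": chunk.get("source_url"),
                "page_title": chunk.get("page_title"),
                "section_heading": chunk.get("section_heading"),
                "text": chunk["text"],
            },
        }
        for chunk, data in zip(chunks, response.data)
    ]
```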
Chunking Strategy Guide
| Strategy | Best For | How It Splits |
|---|---|---|
| recursive | General text, mixed content | Paragraphs -> sentences -> words -> hard token cuts |
| markdown | Documentation, crawled web pages | Markdown headers (h1-h6), preserves section structure |
| sentence | Conversational content, Q&A, prose | Sentence boundaries, preserves sentence integrity |
Choosing chunk_size
- 256 tokens -- fine-grained retrieval, higher precision, more chunks
- 512 tokens (default) -- balanced for most RAG use cases
- 1024 tokens -- broader context per chunk, fewer chunks, good for summarization
- 2048+ tokens -- large context windows, best with newer embedding models
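Because sizes are measured in cl100k_base tokens, you can estimate how many chunks a document will produce before running the actor. A rough sketch using tiktoken; the ceiling-division estimate is an approximation that ignores strategy-specific boundaries:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding the actor counts with

def estimate_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> int:
    """Rough estimate of chunk count; actual numbers depend on the chosen strategy."""
    total_tokens = len(enc.encode(text))
    step = chunk_size - chunk_overlap
    return max(1, -(-total_tokens // step))  # ceiling division

print(estimate_chunks("your document text here", chunk_size=512, chunk_overlap=64))
```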
Architecture
- src/agent/main.py -- Actor entry point, input handling, dataset chaining, output
- src/agent/chunker.py -- Core chunking engine with three strategies and token counting
- src/agent/validation.py -- Input validation, sanitization, and security checks
- src/agent/pricing.py -- PPE billing calculator ($0.0005/chunk)
- skill.md -- Machine-readable skill contract for agent discovery
Security
- Size limits: 500K chars max text, 10K max dataset items, bounded chunk parameters
- Sanitization: Strips null bytes and control characters (preserves newlines/tabs for Markdown)
- Injection prevention: Dataset IDs and field names validated against strict regex patterns
- No LLM calls: Pure text processing, zero prompt injection surface
- No secrets: Actor requires no API keys or credentials
- No network calls: All processing is local computation
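These rules can be approximated as follows. An illustrative sketch only, not the actor's actual implementation:

```python
import re

DATASET_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")
MAX_TEXT_CHARS = 500_000

def sanitize_text(text: str) -> str:
    """Drop null bytes and control characters while keeping newlines and tabs,
    so Markdown structure survives sanitization."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("Text exceeds maximum length")
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not (ord(ch) < 32 or ord(ch) == 127)
    )

def validate_dataset_id(dataset_id: str) -> str:
    """Reject anything that is not 1-64 alphanumeric/hyphen/underscore characters."""
    if not DATASET_ID_RE.match(dataset_id):
        raise ValueError("Invalid dataset_id format")
    return dataset_id
```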
Pricing
Pay-Per-Event (PPE): $0.0005 per chunk ($0.50 per 1,000 chunks).
| Content Size | Approx. Chunks | Cost |
|---|---|---|
| Single blog post | 10-20 | $0.005-$0.01 |
| 10-page website | 50-100 | $0.025-$0.05 |
| 100-page docs site | 500-1,000 | $0.25-$0.50 |
| Large knowledge base | 5,000-10,000 | $2.50-$5.00 |
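To budget a run, multiply the expected chunk count by the per-chunk rate. A quick sketch using the estimate from the chunk_size section above:

```python
RATE_PER_CHUNK = 0.0005  # $0.50 per 1,000 chunks

def estimate_cost(expected_chunks: int) -> float:
    return expected_chunks * RATE_PER_CHUNK

# A 100-page docs site at roughly 500-1,000 chunks:
print(f"${estimate_cost(500):.2f} - ${estimate_cost(1000):.2f}")  # $0.25 - $0.50
```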
Troubleshooting
- "No input provided": Supply either
textordataset_id - "Text exceeds maximum length": Split content into batches under 500K chars, or use dataset mode
- "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 chars
- "chunk_overlap must be less than chunk_size": Reduce overlap or increase chunk size
- "No chunks produced": Input text may be empty or contain only whitespace/control characters
- Dataset errors: Verify the dataset ID exists and the actor has access to it
License
See LICENSE file for details.
MCP Integration
This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.
- Endpoint:
https://mcp.apify.com?tools=labrat011/rag-content-chunker - Auth:
Authorization: Bearer <APIFY_TOKEN> - Transport: Streamable HTTP
- Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI
Example MCP config (Claude Desktop / Cursor):
{"mcpServers": {"rag-content-chunker": {"url": "https://mcp.apify.com?tools=labrat011/rag-content-chunker","headers": {"Authorization": "Bearer <APIFY_TOKEN>"}}}}
AI agents can use this actor to split text and documents into optimally-sized chunks for RAG pipelines, prepare content for embedding, and build retrieval-ready datasets -- all as a callable MCP tool.