Rag Content Chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary -- ready for embeddings or vector DBs without extra glue code.

Apify Actor that splits text and Markdown into optimally-sized, token-counted chunks for RAG pipelines. Supports recursive, Markdown-aware, and sentence-based chunking strategies. Outputs flat chunk objects with deterministic IDs, ready for embedding and vector DB ingestion. MCP-ready for AI agent integration.

Features

  • Three chunking strategies: recursive (general), markdown (header-aware), sentence (boundary-preserving)
  • Token-aware splitting using tiktoken cl100k_base (compatible with OpenAI embeddings and GPT-4)
  • Deterministic chunk IDs (SHA-256) for incremental vector DB updates (see the sketch after this list)
  • Two input modes: direct text or dataset chaining from any crawler
  • Dot-notation support for nested dataset fields (e.g., metadata.content)
  • Configurable chunk size (64-8192 tokens) and overlap (0-2048 tokens)
  • Input validation and sanitization (size limits, control char stripping, injection prevention)
  • No external API calls, no API keys required -- pure local computation
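
The token counting and ID scheme referenced above can be reproduced locally. The sketch below is illustrative, not the actor's exact implementation: only SHA-256 hashing and determinism are documented, so the hash inputs (source URL, chunk index, chunk text) and the 16-hex-character truncation are assumptions modeled on the example output further down.

import hashlib
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding used for token counts

def chunk_id(source_url: str, chunk_index: int, chunk_text: str) -> str:
    # Hypothetical derivation: SHA-256 over source + index + text,
    # truncated to 16 hex characters.
    digest = hashlib.sha256(f"{source_url}|{chunk_index}|{chunk_text}".encode("utf-8"))
    return digest.hexdigest()[:16]

text = "This is a sample document."
print(len(enc.encode(text)), chunk_id("https://example.com/page", 0, text))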

Requirements

  • Python 3.11+
  • Apify platform account (for running as Actor)

Install dependencies:

$ pip install -r requirements.txt

Configuration

Actor Inputs

Defined in .actor/INPUT_SCHEMA.json:

  • text (string, optional) -- plain text or Markdown to chunk, max 500,000 characters
  • dataset_id (string, optional) -- Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Takes priority over text
  • dataset_field (string, optional) -- field to read from each dataset item. Default: "text". Supports dot notation (see the sketch below)
  • strategy (string, optional) -- "recursive" (default), "markdown", or "sentence"
  • chunk_size (integer, optional) -- target chunk size in tokens, 64-8192. Default: 512
  • chunk_overlap (integer, optional) -- overlapping tokens between chunks, 0-2048. Default: 64

At least one of text or dataset_id must be provided.
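
Dot-notation lookup simply walks nested keys, as in this minimal sketch (resolve_field is a hypothetical helper name, not the actor's internal function):

def resolve_field(item: dict, path: str):
    # "metadata.content" -> item["metadata"]["content"]; None if any key is missing.
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

print(resolve_field({"metadata": {"content": "body text"}}, "metadata.content"))  # body text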

Usage

Local (CLI)

$ APIFY_TOKEN=your-token apify run

Direct Text Input

{
  "text": "# Introduction\n\nThis is a sample document with multiple sections.\n\n## Section One\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n## Section Two\n\nUt enim ad minim veniam, quis nostrud exercitation.",
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32
}

Dataset Chaining (from Website Content Crawler)

{
  "dataset_id": "abc123XYZ",
  "dataset_field": "text",
  "strategy": "recursive",
  "chunk_size": 512,
  "chunk_overlap": 64
}
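
The same chaining can be driven programmatically with the Apify Python client; a minimal sketch (token handling and error checking kept out for brevity):

from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Run the chunker against a dataset produced by a previous crawler run.
run = client.actor("labrat011/rag-content-chunker").call(run_input={
    "dataset_id": "abc123XYZ",
    "dataset_field": "text",
    "strategy": "recursive",
    "chunk_size": 512,
    "chunk_overlap": 64,
})

# Iterate the chunk items, skipping the trailing summary item.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if not item.get("_summary"):
        print(item["chunk_id"], item["token_count"])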

Example Output

Each chunk is a separate dataset item:

{
  "chunk_id": "a1b2c3d4e5f67890",
  "chunk_index": 0,
  "text": "# Introduction\n\nThis is a sample document with multiple sections.",
  "token_count": 12,
  "source_url": "https://example.com/page",
  "page_title": "Example Page",
  "section_heading": "Introduction"
}

A summary item is appended at the end:

{
  "_summary": true,
  "total_chunks": 3,
  "total_tokens": 847,
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32,
  "processing_time": 0.142,
  "billing": {
    "total_chunks": 3,
    "amount": 0.0015,
    "rate_per_chunk": 0.0005
  }
}

Pipeline Position

This actor fills the chunking step in a standard RAG pipeline:

Crawl (Website Content Crawler, 101K+ users)
-> Clean (optional preprocessing)
-> Chunk (this actor)
-> Embed (OpenAI, Cohere, etc.)
-> Store (Pinecone, Qdrant, Weaviate integrations)
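
For the Embed step, a minimal sketch using the OpenAI Python client (the model choice is an assumption; any cl100k_base-compatible embedding model pairs naturally with the default settings):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunks = [
    {"chunk_id": "a1b2c3d4e5f67890",
     "text": "# Introduction\n\nThis is a sample document with multiple sections."},
]

# Embed chunk texts in one batch; resp.data preserves the input order.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],
)
vectors = {c["chunk_id"]: d.embedding for c, d in zip(chunks, resp.data)}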

Chunking Strategy Guide

Strategy    Best For                             How It Splits
recursive   General text, mixed content          Paragraphs -> sentences -> words -> hard token cuts
markdown    Documentation, crawled web pages     Markdown headers (h1-h6), preserves section structure
sentence    Conversational content, Q&A, prose   Sentence boundaries, preserves sentence integrity

Choosing chunk_size

  • 256 tokens -- fine-grained retrieval, higher precision, more chunks
  • 512 tokens (default) -- balanced for most RAG use cases (see the sketch after this list)
  • 1024 tokens -- broader context per chunk, fewer chunks, good for summarization
  • 2048+ tokens -- large context windows, best with newer embedding models
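
How chunk_size and chunk_overlap interact is easiest to see in a deliberately naive fixed-window split (the actor's real strategies also respect paragraph, header, or sentence boundaries):

def token_windows(tokens: list[int], chunk_size: int = 512, overlap: int = 64) -> list[list[int]]:
    # Each window starts (chunk_size - overlap) tokens after the previous one,
    # so consecutive chunks share exactly `overlap` tokens.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# 1,000 tokens at the defaults -> windows starting at 0, 448, and 896.
print([len(w) for w in token_windows(list(range(1000)))])  # [512, 512, 104]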

Architecture

  • src/agent/main.py -- Actor entry point, input handling, dataset chaining, output
  • src/agent/chunker.py -- Core chunking engine with three strategies and token counting
  • src/agent/validation.py -- Input validation, sanitization, and security checks
  • src/agent/pricing.py -- PPE billing calculator ($0.0005/chunk)
  • skill.md -- Machine-readable skill contract for agent discovery

Security

  • Size limits: 500K chars max text, 10K max dataset items, bounded chunk parameters
  • Sanitization: Strips null bytes and control characters (preserves newlines/tabs for Markdown)
  • Injection prevention: Dataset IDs and field names validated against strict regex patterns (see the sketch after this list)
  • No LLM calls: Pure text processing, zero prompt injection surface
  • No secrets: Actor requires no API keys or credentials
  • No network calls: All processing is local computation
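
The dataset_id check can be approximated as follows; this is a sketch consistent with the documented 1-64 character alphanumeric/hyphen/underscore rule, while the actual patterns live in src/agent/validation.py:

import re

DATASET_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")

def validate_dataset_id(dataset_id: str) -> str:
    # fullmatch rejects anything outside the allowed character set or length.
    if not DATASET_ID_RE.fullmatch(dataset_id):
        raise ValueError("Invalid dataset_id format")
    return dataset_id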

Pricing

Pay-Per-Event (PPE): $0.0005 per chunk ($0.50 per 1,000 chunks).

Content Size          Approx. Chunks   Cost
Single blog post      10-20            $0.005-$0.01
10-page website       50-100           $0.025-$0.05
100-page docs site    500-1,000        $0.25-$0.50
Large knowledge base  5,000-10,000     $2.50-$5.00

Troubleshooting

  • "No input provided": Supply either text or dataset_id
  • "Text exceeds maximum length": Split content into batches under 500K chars, or use dataset mode
  • "Invalid dataset_id format": Must be alphanumeric with hyphens/underscores, 1-64 chars
  • "chunk_overlap must be less than chunk_size": Reduce overlap or increase chunk size
  • "No chunks produced": Input text may be empty or contain only whitespace/control characters
  • Dataset errors: Verify the dataset ID exists and the actor has access to it

License

See LICENSE file for details.


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/rag-content-chunker
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "rag-content-chunker": {
      "url": "https://mcp.apify.com?tools=labrat011/rag-content-chunker",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

AI agents can use this actor to split text and documents into optimally-sized chunks for RAG pipelines, prepare content for embedding, and build retrieval-ready datasets -- all as a callable MCP tool.