Pricing

from $0.20 / 1,000 results

RAG Text Chunker — heading & sentence aware, Japanese ready

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Rating

0.0

(0)

Developer

Shinobu Otani

Last modified

2 months ago

RAG Text Chunker

Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready — deterministic, no LLM cost.

Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
Packs whole sentences up to max_chars; oversized sentences are hard-split as a last resort
Optional overlap between consecutive chunks for retrieval continuity
Japanese-aware boundaries: 。！？ with closing-quote handling alongside Latin .!? (decimals like 3.14 stay intact)
Heading breadcrumbs: every chunk carries heading_path for citation
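
The behaviour listed above can be approximated in a few dozen lines. The following is a minimal sketch, not the actor's actual implementation: the function names (`split_sections`, `split_sentences`, `pack`, `chunk_documents`) are hypothetical, the overlap is assumed to be carried as whole trailing sentences, and sentences are joined with a single space because the sample output on this page shows one.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
# Closing quotes/brackets that may trail a sentence terminator.
_CLOSERS = r"""[」』）"')\]]*"""
# Japanese terminators always end a sentence; Latin ones only when followed
# (after optional closers) by whitespace or end of text, so 3.14 stays intact.
_SENT_END = re.compile(r"(?:[。！？]|[.!?](?=" + _CLOSERS + r"(?:\s|$)))" + _CLOSERS)


def split_sections(markdown):
    """Return (heading_path, body) pairs; '#' lines inside fences are body text."""
    sections, path, body, in_fence = [], [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        m = None if in_fence else re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            if any(s.strip() for s in body):
                sections.append((list(path), "\n".join(body)))
            level = len(m.group(1))
            path = path[:level - 1] + [m.group(2).strip()]
            body = []
        else:
            body.append(line)
    if any(s.strip() for s in body):
        sections.append((list(path), "\n".join(body)))
    return sections


def split_sentences(text):
    """Split on Japanese and Latin sentence terminators."""
    sentences, start = [], 0
    for m in _SENT_END.finditer(text):
        piece = text[start:m.end()].strip()
        if piece:
            sentences.append(piece)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def pack(sentences, max_chars, overlap):
    """Pack whole sentences into chunks of at most max_chars characters."""
    chunks, current = [], []
    for sentence in sentences:
        # Last resort: hard-split a sentence that is longer than max_chars.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and len(" ".join(current + [piece])) > max_chars:
                chunks.append(" ".join(current))
                # Carry trailing sentences that fit into the overlap budget.
                carried, used = [], 0
                for prev in reversed(current):
                    if used + len(prev) > overlap:
                        break
                    carried.insert(0, prev)
                    used += len(prev) + 1
                # Drop the overlap if it would push the next chunk over the limit.
                current = carried if len(" ".join(carried + [piece])) <= max_chars else []
            current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_documents(documents, max_chars=1500, overlap=200):
    """Produce one dict per chunk, shaped like the output items on this page."""
    items = []
    for doc_index, doc in enumerate(documents):
        for heading_path, body in split_sections(doc):
            for text in pack(split_sentences(body), max_chars, overlap):
                items.append({"id": len(items), "document_index": doc_index,
                              "heading_path": heading_path, "text": text,
                              "char_count": len(text)})
    return items
```

Because every step is rule-based, running the sketch twice on the same input yields the same boundaries, which is the property the listing calls deterministic. How the real actor counts characters or measures overlap may differ from this sketch.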

Input

{"documents": ["# 概要\n\n検証は三段階で行う。まず再現する。"], "max_chars": 1500, "overlap": 200}
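
An input object like the one above can be assembled and sanity-checked before a run. This is a small sketch with a hypothetical helper, `build_input`; the defaults of 1500 and 200 are simply the values from the sample, and the rule that `overlap` must be smaller than `max_chars` is an assumption rather than a documented constraint.

```python
import json


def build_input(documents, max_chars=1500, overlap=200):
    """Build the run input dict; the validation rules here are assumptions."""
    if not documents or not all(isinstance(d, str) for d in documents):
        raise ValueError("documents must be a non-empty list of strings")
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    return {"documents": list(documents), "max_chars": max_chars, "overlap": overlap}


payload = build_input(["# 概要\n\n検証は三段階で行う。まず再現する。"])
# ensure_ascii=False keeps Japanese text readable instead of \u escapes.
print(json.dumps(payload, ensure_ascii=False))
```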

Output (one dataset item per chunk)

{"id": 0, "document_index": 0, "heading_path": ["概要"], "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}
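
Since each dataset item carries `document_index` and `heading_path`, downstream code can build citation strings and regroup chunks per document before embedding. The sketch below assumes the items have already been loaded as Python dicts with the fields shown above; `to_embedding_record` and `group_by_document` are hypothetical helper names, and the record layout is only one possible choice.

```python
from collections import defaultdict


def to_embedding_record(item, sep=" > "):
    """Turn one chunk item into a record with a stable key and a citation."""
    breadcrumb = sep.join(item["heading_path"])
    return {
        "key": f"{item['document_index']}:{item['id']}",
        "text": item["text"],
        "citation": breadcrumb or "(no heading)",
    }


def group_by_document(items):
    """Map document_index to its chunk texts, ordered by chunk id."""
    grouped = defaultdict(list)
    for item in sorted(items, key=lambda it: it["id"]):
        grouped[item["document_index"]].append(item["text"])
    return dict(grouped)


sample = {"id": 0, "document_index": 0, "heading_path": ["概要"],
          "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}
record = to_embedding_record(sample)
```

The `citation` field can be stored as vector metadata so a retrieved chunk can be traced back to its section.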

Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.

Text Chunker: Split Text & Documents into Chunks for RAG

raional/text-chunker

Split long text or documents into properly sized, sentence-aware chunks with overlap for embeddings, vector databases, and RAG pipelines. Choose recursive, sentence-boundary, or fixed-token chunking. Fetch from URLs or paste text directly. Powered by Chonkie.

Raion Al

Text Splitter & Chunker for RAG / LLMs

zenomastro/text-splitter-for-llm

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Rosario Vitale

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Nguyễn Anh Duy

4.7

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Manas Mantri

Tender RAG Chunker — Text Chunks for LLM

adobeflex/tender-rag-chunker

Chunk tender text for RAG/embeddings with stable ids and metadata.

Yahor

RAG Post Processor - Text Cleaner & Chunker for LLM Pipelines

jalicia/rag-post-processor

Clean and chunk scraped text for RAG and LLM pipelines. Strips HTML, collapses whitespace, splits into overlapping chunks ready for embedding. Works standalone or chained after any scraper. Per-row billing.

Jordan Wagner

RAG Docs Extractor - Documentation to Chunks

ambitious_door/ragdocs-extractor

Turn any documentation site into clean, RAG-ready chunks in a single call. Semantic boundaries, preserved structure, per-chunk metadata.

C. K.

AI / RAG Web Crawler

groupoject/ai-rag-web-crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Group Oject

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary—ready for embeddings or vector DBs without extra glue code.