RAG Text Chunker — heading & sentence aware, Japanese ready

Pricing

Pay per usage

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Rating

0.0

(0)

Developer

Shinobu Otani

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

RAG Text Chunker

Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready — deterministic, no LLM cost.

  • Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
  • Packs whole sentences up to max_chars; oversized sentences are hard-split as a last resort
  • Optional overlap between consecutive chunks for retrieval continuity
  • Japanese-aware boundaries: splits on 。！？ (with closing-quote handling) as well as Latin .!?; decimals like 3.14 stay intact
  • Heading breadcrumbs: every chunk carries heading_path for citation
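The behaviour listed above can be approximated in a few dozen lines of Python. This is a minimal sketch, not the Actor's actual source: the function names (split_sections, split_sentences, pack, chunk_document), the regular expressions, the character-based overlap, and the single-space sentence join are all assumptions made for illustration. Only the field names and the max_chars / overlap parameters come from the listing.

```python
import re

# ATX headings only ("# Title" .. "###### Title"); an assumption of this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
CLOSERS = "」』）)\"'”’"  # closing quotes/brackets that stay with the sentence
# Japanese 。！？ always end a sentence. Latin .!? only count when followed by
# whitespace, end of text, or a closer, so decimals such as 3.14 stay intact.
BOUNDARY = re.compile(
    r"(?:[。！？]|[.!?](?=\s|$|[{c}]))[{c}]*".format(c=re.escape(CLOSERS))
)

def split_sections(markdown):
    """Return (heading_path, body) pairs; '#' lines inside ``` fences are body text."""
    sections, path, body, in_fence = [], [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append((list(path), "\n".join(body)))
            level = len(match.group(1))
            path = path[: level - 1] + [match.group(2)]
            body = []
        else:
            body.append(line)
    sections.append((list(path), "\n".join(body)))
    return [(p, b.strip()) for p, b in sections if b.strip()]

def split_sentences(text):
    """Cut after every sentence boundary and keep any trailing remainder."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def pack(sentences, max_chars, overlap=0):
    """Greedily pack whole sentences; hard-split a sentence longer than max_chars."""
    chunks, current = [], ""
    for sentence in sentences:
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            chunks.append(current)
            # Assumed overlap rule: carry the last `overlap` characters forward,
            # but drop them if the next piece would no longer fit.
            tail = current[-overlap:].strip() if overlap > 0 else ""
            with_tail = f"{tail} {piece}" if tail else piece
            current = with_tail if len(with_tail) <= max_chars else piece
    if current:
        chunks.append(current)
    return chunks

def chunk_document(markdown, max_chars=1500, overlap=200, document_index=0):
    """Produce one dict per chunk, mirroring the field names in the listing."""
    items = []
    for path, body in split_sections(markdown):
        for text in pack(split_sentences(body), max_chars, overlap):
            items.append({
                "id": len(items),
                "document_index": document_index,
                "heading_path": path,
                "text": text,
                "char_count": len(text),
            })
    return items
```

Because every step is a pure function of the input text and the two parameters, the same document always yields the same chunk boundaries. The sketch only recognises backtick fences and ATX headings; the Actor may handle more Markdown variants.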

Input

{"documents": ["# 概要\n\n検証は三段階で行う。まず再現する。"], "max_chars": 1500, "overlap": 200}

Output (one dataset item per chunk)

{"id": 0, "document_index": 0, "heading_path": ["概要"], "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}

Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.