Markdown RAG Chunker

Pricing

$20.00/month + usage

Chunk any document for RAG — PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.


Developer

CodePoetry

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 8
  • Monthly active users: 0
  • Last modified: 2 days ago


Markdown RAG Chunker turns PDFs, web pages, Word, Excel, PowerPoint, and Markdown files into clean, header-aware chunks ready for embeddings and vector databases. Built for AI engineers shipping RAG pipelines who want predictable splits, stable IDs, and token estimates without writing chunking code. Run the Actor in Apify Console.

What does Markdown RAG Chunker do?

  • Multi-format input: convert PDF, HTML, DOCX, XLSX, PPTX, CSV, JSON, XML, EPUB, and plain text to Markdown before chunking
  • Header-aware splitting: split by # through ###### so each chunk keeps its parent section in metadata
  • Token-aware sizing: optional max_chunk_chars re-splits only oversized sections, so retrieval stays focused
  • Deterministic chunk IDs: every chunk has a stable chunk_id for idempotent upserts and deduplication in Pinecone, Qdrant, Weaviate, pgvector, and others
  • Token count estimates: token_count per chunk for embedding budget planning
  • Run telemetry: metrics payload with input_file_type, chunk_count, and elapsed_ms for monitoring
  • Two input modes: paste Markdown directly, or pass an HTTPS URL or kvs://KEY for any supported file
  • Pay only when you use it: pay-per-event pricing — no monthly rental

Behind the scenes the Actor uses Microsoft's markitdown for format conversion and LangChain's text splitters (MarkdownHeaderTextSplitter plus RecursiveCharacterTextSplitter) for chunking. You get production-grade defaults without managing the dependencies yourself.
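The header-aware behavior can be illustrated with a small standalone sketch. This is a simplified reimplementation in plain Python, not the Actor's actual code (which uses LangChain's MarkdownHeaderTextSplitter), but it shows how each chunk ends up carrying its parent header chain in metadata:

```python
import re

def split_by_headers(markdown: str, levels=("#", "##", "###")):
    """Split Markdown at the given heading levels, carrying the parent
    header chain in each chunk's metadata. Simplified sketch of what
    MarkdownHeaderTextSplitter does."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks, current_meta, buffer = [], {}, []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(current_meta)})
        buffer.clear()

    for line in markdown.splitlines():
        m = header_re.match(line)
        if m and m.group(1) in levels:
            flush()
            depth = len(m.group(1))
            # A new section at this depth closes any deeper sections.
            current_meta = {k: v for k, v in current_meta.items()
                            if int(k.split()[-1]) < depth}
            current_meta[f"Header {depth}"] = m.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro.\n## Install\nRun pip install.\n## Usage\nCall it."
for c in split_by_headers(doc):
    print(c["metadata"], "->", c["content"])
```

The second chunk here comes out as content "Run pip install." with metadata {"Header 1": "Guide", "Header 2": "Install"}, mirroring the Actor's output shape.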

Supported input formats

Category            Formats
Markdown and text   .md, .txt
Web                 HTML pages, single web page URLs
PDF                 .pdf
Office              .docx, .xlsx, .xls, .pptx
Data                .csv, .json, .xml
Books               .epub

Provide any of these as an HTTPS URL or as an Apify Key-Value Store record (kvs://KEY). MIME type is detected automatically from the response and reported back in metrics.input_file_type.

How to use Markdown RAG Chunker

  1. Open the Actor and choose input_mode. Pick text to paste Markdown directly, or file to load any supported document from a URL or kvs://KEY.
  2. Set headers_to_split_on. Most RAG pipelines work best with ["#", "##", "###"] — that gives you section-level chunks while preserving page-level context in metadata.
  3. Optionally set max_chunk_chars (for example, 1800 characters or about 450 tokens) to cap oversized sections. Only chunks above the cap are re-split, so well-sized sections are preserved untouched.
  4. Run the Actor and read results from the dataset. Pipe each item's content into your embedding model and store the chunk_id and metadata alongside the vector.
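The steps above can be scripted against the Apify API with nothing but the standard library. The input field names (input_mode, markdown_file, headers_to_split_on, max_chunk_chars) come from this page; the actor ID below is a placeholder, so copy the real one from the API tab before running:

```python
import json
import urllib.request

def build_input(file_url: str,
                headers=("#", "##", "###"),
                max_chunk_chars: int = 1800) -> dict:
    """Assemble the Actor input described in the steps above."""
    return {
        "input_mode": "file",
        "markdown_file": file_url,
        "headers_to_split_on": list(headers),
        "max_chunk_chars": max_chunk_chars,
    }

def run_chunker(token: str, file_url: str,
                actor_id: str = "codepoetry~markdown-rag-chunker"):
    """Run the Actor synchronously and return its dataset items.
    NOTE: actor_id is a placeholder -- use the real ID from the
    Actor page's API tab."""
    url = (f"https://api.apify.com/v2/acts/{actor_id}"
           f"/run-sync-get-dataset-items?token={token}")
    req = urllib.request.Request(
        url,
        data=json.dumps(build_input(file_url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

From there, embed each item's content and store chunk_id and metadata alongside the vector, as in step 4.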

The full input form is documented under the Input tab. Run output schema and field types live under the Output tab.

Output format

{
  "chunks": [
    {
      "content": "Install the SDK with pip install ...",
      "metadata": { "Header 1": "Guide", "Header 2": "Install" },
      "chunk_id": "f8b6be2adf7f6dbf",
      "char_count": 124,
      "token_count": 31
    }
  ],
  "metrics": {
    "input_mode": "file",
    "input_file_type": "application/pdf",
    "input_chars": 4281,
    "chunk_count": 12,
    "elapsed_ms": 184
  }
}
Field                     Description
content                   Chunk text to send to your embedding model
metadata                  Header hierarchy (Header 1, Header 2, ...) for context-aware retrieval
chunk_id                  Stable 16-char ID for idempotent upserts and deduplication
char_count                Character length of content
token_count               Approximate token count (~1 token per 4 characters)
metrics.input_file_type   Detected source MIME type (for example, application/pdf)
metrics.chunk_count       Total chunks produced in the run
metrics.elapsed_ms        End-to-end processing time
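To make the chunk_id and token_count fields concrete, here is a sketch of how such values could be derived. The Actor's exact ID scheme is not documented on this page, so the sha256-based derivation below is an assumption that illustrates the general idea: the same input always yields the same ID, so re-running a pipeline upserts rather than duplicates.

```python
import hashlib

def stable_chunk_id(content: str, metadata: dict) -> str:
    """Derive a 16-hex-char ID from chunk text plus header path.
    ASSUMPTION: this is not the Actor's actual scheme, just an
    illustration of deterministic, content-derived IDs."""
    header_path = "|".join(f"{k}={v}" for k, v in sorted(metadata.items()))
    digest = hashlib.sha256(f"{header_path}\n{content}".encode()).hexdigest()
    return digest[:16]

def approx_tokens(text: str) -> int:
    """Rough estimate matching the ~1 token per 4 characters rule."""
    return max(1, len(text) // 4)

chunk = {"content": "Install the SDK with pip install ...",
         "metadata": {"Header 1": "Guide", "Header 2": "Install"}}
cid = stable_chunk_id(chunk["content"], chunk["metadata"])
```

Note that a 124-character chunk estimates to 31 tokens under this rule, matching the example output above.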

How much does document chunking cost?

Markdown RAG Chunker uses pay-per-event pricing: you pay a small fixed amount only when a file is processed, with no monthly rental. Direct text input on the free tier is ideal for trying the chunker before wiring it into a pipeline. Detailed unit prices are listed under the Pricing tab.

Use cases

  • RAG over documentation: split long product docs and changelogs into retrievable sections
  • Knowledge bases: ingest internal PDFs and Word docs into a vector store with stable IDs
  • Customer support search: chunk help center articles for semantic search
  • AI agents: feed large reference documents to agents in budget-friendly slices
  • Crawler post-processing: chain after Website Content Crawler to convert crawled pages into RAG-ready chunks

FAQ

How do I chunk a PDF for a vector database?

Set input_mode to file, paste the PDF URL into markdown_file, and run the Actor. The PDF is converted to Markdown via markitdown, split by header hierarchy, and each chunk gets a deterministic chunk_id you can use as the upsert key in Pinecone, Qdrant, Weaviate, or pgvector.

What is the difference between header-aware chunking and fixed-size chunking?

Fixed-size chunking cuts text every N characters or tokens, often slicing through paragraphs and losing structural context. Header-aware chunking splits on Markdown headings, so each chunk maps to a logical section and the parent header chain is kept in metadata. This gives retrieval models real context — a chunk about "Authentication" still knows it lives under the "API Reference" section.

Can I chain Markdown RAG Chunker with a web crawler?

Yes. A common pipeline is Website Content Crawler → Markdown RAG Chunker → embeddings → vector DB. The crawler produces clean Markdown for each page, and this Actor splits that Markdown into RAG-ready chunks with stable IDs.

Does it work with LangChain or LlamaIndex?

Yes. The chunking is built on LangChain text splitters, and the output (content + metadata) maps cleanly onto a LangChain Document or LlamaIndex Node. You can also use Apify's LangChain and LlamaIndex integrations directly inside both frameworks.
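Because the output fields map one-to-one, the conversion is a few lines. The sketch below uses plain dicts shaped like the keyword arguments of langchain_core.documents.Document, so it runs without LangChain installed; with LangChain available, pass each entry as Document(**kw):

```python
def to_document_kwargs(items):
    """Map Actor dataset items onto the (page_content, metadata) shape
    that langchain_core.documents.Document expects. With LangChain
    installed, build Document(**kw) for each entry."""
    kwargs = []
    for item in items:
        for chunk in item.get("chunks", []):
            meta = dict(chunk["metadata"])
            meta["chunk_id"] = chunk["chunk_id"]  # keep the upsert key
            kwargs.append({"page_content": chunk["content"],
                           "metadata": meta})
    return kwargs
```

Keeping chunk_id inside metadata means the same key survives the round trip into a vector store and back.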

How accurate is token_count?

token_count is a fast estimate (~1 token per 4 characters) intended for budget planning and guardrails. For exact token counts, run your model's tokenizer over content after retrieval.

Where do I find API examples?

Use the API, Python, JavaScript, CLI, OpenAPI, and MCP tabs on this Actor's page — they include ready-to-paste code with the correct Actor ID and input shape for every supported client.