Markdown RAG Chunker
Chunk any document for RAG — PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.
Pricing
$20.00/month + usage
Developer: CodePoetry
Last modified: 2 days ago
Markdown RAG Chunker turns PDFs, web pages, Word, Excel, PowerPoint, and Markdown files into clean, header-aware chunks ready for embeddings and vector databases. Built for AI engineers shipping RAG pipelines who want predictable splits, stable IDs, and token estimates without writing chunking code. Run the Actor in Apify Console.
What does Markdown RAG Chunker do?
- Multi-format input: convert PDF, HTML, DOCX, XLSX, PPTX, CSV, JSON, XML, EPUB, and plain text to Markdown before chunking
- Header-aware splitting: split by `#` through `######` so each chunk keeps its parent section in metadata
- Token-aware sizing: optional `max_chunk_chars` re-splits only oversized sections, so retrieval stays focused
- Deterministic chunk IDs: every chunk has a stable `chunk_id` for idempotent upserts and deduplication in Pinecone, Qdrant, Weaviate, pgvector, and others
- Token count estimates: `token_count` per chunk for embedding budget planning
- Run telemetry: `metrics` payload with `input_file_type`, `chunk_count`, and `elapsed_ms` for monitoring
- Two input modes: paste Markdown directly, or pass an HTTPS URL or `kvs://KEY` for any supported file
- Pay only when you use it: pay-per-event pricing with no monthly rental
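The page doesn't document how `chunk_id` is derived, but the property that matters for pipelines is determinism: the same input always yields the same ID. A minimal sketch of one plausible scheme (the hash construction here is an assumption for illustration, not the Actor's documented implementation):

```python
import hashlib

def make_chunk_id(content: str, headers: dict) -> str:
    # Hash the chunk text together with its header path so the ID is
    # stable across runs: same input -> same 16-char ID.
    # NOTE: illustrative scheme only; the Actor's real derivation is not documented.
    key = "\x1f".join([*[f"{k}={v}" for k, v in sorted(headers.items())], content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

# Re-ingesting the same document yields identical IDs, so vector-DB upserts
# keyed on chunk_id overwrite instead of duplicating.
a = make_chunk_id("Install the SDK", {"Header 1": "Guide"})
b = make_chunk_id("Install the SDK", {"Header 1": "Guide"})
```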
Behind the scenes the Actor uses Microsoft's markitdown for format conversion and LangChain's text splitters (MarkdownHeaderTextSplitter plus RecursiveCharacterTextSplitter) for chunking. You get production-grade defaults without managing the dependencies yourself.
Supported input formats
| Category | Formats |
|---|---|
| Markdown and text | .md, .txt |
| Web | HTML pages, single web page URLs |
| PDF | .pdf |
| Office | .docx, .xlsx, .xls, .pptx |
| Data | .csv, .json, .xml |
| Books | .epub |
Provide any of these as an HTTPS URL or as an Apify Key-Value Store record (kvs://KEY). MIME type is detected automatically from the response and reported back in metrics.input_file_type.
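Putting the input fields together, a file-mode run input might look like the following (the `markdown_file` field name is taken from the FAQ below; check the Input tab for the authoritative schema):

```json
{
  "input_mode": "file",
  "markdown_file": "https://example.com/handbook.pdf",
  "headers_to_split_on": ["#", "##", "###"],
  "max_chunk_chars": 1800
}
```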
How to use Markdown RAG Chunker
- Open the Actor and choose `input_mode`. Pick `text` to paste Markdown directly, or `file` to load any supported document from a URL or `kvs://KEY`.
- Set `headers_to_split_on`. Most RAG pipelines work best with `["#", "##", "###"]` — that gives you section-level chunks while preserving page-level context in metadata.
- Optionally set `max_chunk_chars` (for example, `1800` characters or about 450 tokens) to cap oversized sections. Only chunks above the cap are re-split, so well-sized sections are preserved untouched.
- Run the Actor and read results from the dataset. Pipe each item's `content` into your embedding model and store the `chunk_id` and `metadata` alongside the vector.
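To illustrate the last step, here is a minimal sketch of an idempotent upsert keyed on `chunk_id`, with a stubbed embedding function and an in-memory dict standing in for a real vector database (the dataset item is hard-coded in the shape shown under Output format; swap in your own client calls):

```python
def embed(text: str) -> list[float]:
    # Stub embedding: replace with your real model call.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def upsert(store: dict, items: list[dict]) -> None:
    # Keying on the stable chunk_id makes re-ingestion idempotent:
    # re-running the same document overwrites rather than duplicates.
    for item in items:
        store[item["chunk_id"]] = {
            "vector": embed(item["content"]),
            "metadata": item["metadata"],
        }

items = [{
    "content": "Install the SDK with pip install ...",
    "metadata": {"Header 1": "Guide", "Header 2": "Install"},
    "chunk_id": "f8b6be2adf7f6dbf",
}]
store: dict = {}
upsert(store, items)
upsert(store, items)  # second pass does not create duplicates
```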
The full input form is documented under the Input tab. Run output schema and field types live under the Output tab.
Output format
{"chunks": [{"content": "Install the SDK with pip install ...","metadata": { "Header 1": "Guide", "Header 2": "Install" },"chunk_id": "f8b6be2adf7f6dbf","char_count": 124,"token_count": 31}],"metrics": {"input_mode": "file","input_file_type": "application/pdf","input_chars": 4281,"chunk_count": 12,"elapsed_ms": 184}}
| Field | Description |
|---|---|
| `content` | Chunk text to send to your embedding model |
| `metadata` | Header hierarchy (`Header 1`, `Header 2`, ...) for context-aware retrieval |
| `chunk_id` | Stable 16-char ID for idempotent upserts and deduplication |
| `char_count` | Character length of `content` |
| `token_count` | Approximate token count (~1 token per 4 characters) |
| `metrics.input_file_type` | Detected source MIME type (for example, `application/pdf`) |
| `metrics.chunk_count` | Total chunks produced in the run |
| `metrics.elapsed_ms` | End-to-end processing time |
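Because the `token_count` estimate is purely character-based, you can reproduce it client-side before embedding, for example as a budget guardrail:

```python
def estimate_tokens(text: str) -> int:
    # Same heuristic as the Actor's token_count: ~1 token per 4 characters.
    return len(text) // 4

tokens = estimate_tokens("x" * 124)  # a 124-char chunk -> 31 estimated tokens
```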
How much does document chunking cost?
Markdown RAG Chunker uses pay-per-event pricing: you pay a small fixed amount only when a file is processed, with no monthly rental. Direct text input on the free tier is ideal for trying the chunker before wiring it into a pipeline. Detailed unit prices are listed under the Pricing tab.
Use cases
- RAG over documentation: split long product docs and changelogs into retrievable sections
- Knowledge bases: ingest internal PDFs and Word docs into a vector store with stable IDs
- Customer support search: chunk help center articles for semantic search
- AI agents: feed large reference documents to agents in budget-friendly slices
- Crawler post-processing: chain after Website Content Crawler to convert crawled pages into RAG-ready chunks
FAQ
How do I chunk a PDF for a vector database?
Set `input_mode` to `file`, paste the PDF URL into `markdown_file`, and run the Actor. The PDF is converted to Markdown via markitdown, split by header hierarchy, and each chunk gets a deterministic `chunk_id` you can use as the upsert key in Pinecone, Qdrant, Weaviate, or pgvector.
What is the difference between header-aware chunking and fixed-size chunking?
Fixed-size chunking cuts text every N characters or tokens, often slicing through paragraphs and losing structural context. Header-aware chunking splits on Markdown headings, so each chunk maps to a logical section and the parent header chain is kept in metadata. This gives retrieval models real context — a chunk about "Authentication" still knows it lives under the "API Reference" section.
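A toy comparison of the two strategies (a simplified sketch, not the Actor's actual splitter, which uses LangChain's MarkdownHeaderTextSplitter and tracks the full header chain):

```python
import re

def fixed_size_split(text: str, n: int) -> list[str]:
    # Cuts every n characters, blind to structure.
    return [text[i:i + n] for i in range(0, len(text), n)]

def header_split(markdown: str) -> list[dict]:
    # Splits on Markdown headings and records the heading as metadata,
    # so each chunk keeps its structural context.
    chunks, header, lines = [], None, []
    for line in markdown.splitlines():
        m = re.match(r"(#{1,6})\s+(.*)", line)
        if m:
            if lines:
                chunks.append({"metadata": {"header": header},
                               "content": "\n".join(lines).strip()})
            header, lines = m.group(2), []
        else:
            lines.append(line)
    if lines:
        chunks.append({"metadata": {"header": header},
                       "content": "\n".join(lines).strip()})
    return chunks

doc = "# API Reference\nIntro text.\n## Authentication\nUse a bearer token."
```

With this document, `fixed_size_split(doc, 20)` slices mid-sentence, while `header_split(doc)` returns one chunk per section, each tagged with its heading.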
Can I chain Markdown RAG Chunker with a web crawler?
Yes. A common pipeline is Website Content Crawler → Markdown RAG Chunker → embeddings → vector DB. The crawler produces clean Markdown for each page, and this Actor splits that Markdown into RAG-ready chunks with stable IDs.
Does it work with LangChain or LlamaIndex?
Yes. The chunking is built on LangChain text splitters, and the output (content + metadata) maps cleanly onto a LangChain Document or LlamaIndex Node. You can also use Apify's LangChain and LlamaIndex integrations directly inside both frameworks.
How accurate is `token_count`?
`token_count` is a fast estimate (~1 token per 4 characters) intended for budget planning and guardrails. For exact token counts, run your model's tokenizer over `content` after retrieval.
Where do I find API examples?
Use the API, Python, JavaScript, CLI, OpenAPI, and MCP tabs on this Actor's page — they include ready-to-paste code with the correct Actor ID and input shape for every supported client.
