RAG-Ready Markdown Converter & Chunker

Pricing

from $0.01 / 1,000 results

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Rating: 4.7 (3)

Developer: Nguyễn Anh Duy (Maintained by Community)

Standard mode: HTML → clean Markdown → character-based chunks with qualityScore, contentHash, and codeBlocks.
Enterprise mode (no extra charge): token-aware chunking + OpenAI Embedding + Pinecone / Qdrant auto-upsert.
NEW: PDF/DOCX parsing via fileUrls input (binary, zero-DOM).
Same $0.01/1k price.


Quick Comparison

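| Feature | Standard | Enterprise |
| --- | --- | --- |
| HTML → Clean Markdown | Yes | Yes |
| URL Fetching (auto-download HTML) | Yes | Yes |
| Character-based chunking | Yes | Replaced by token-aware chunking |
| Semantic chunking (heading-aware) | Yes | Yes |
| Token-aware chunking (cl100k_base) | No | Yes |
| Natural boundary detection | Yes | Yes |
| Configurable overlap | Yes | Yes |
| Embeddings via text-embedding-3-small | No | Yes |
| Pinecone auto-upsert | No | Yes |
| Qdrant auto-upsert | No | Yes |
| Bulk processing | Yes | Yes |
| JSONL export (LLM-ready) | Yes | Yes |
| Zero DOM / no browser | Yes | Yes |
| Price | $0.01/1k | $0.01/1k |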

Why this exists

Hundreds of Apify crawlers output raw HTML full of nav bars, footers, scripts, and ads. Feeding that into a Vector DB or LLM wastes tokens and pollutes embeddings. This Actor takes any already-crawled content and delivers production-ready chunks — with or without a Vector DB pipeline.


Features

  • HTML → Clean Markdown — strips scripts, styles, nav, footer, iframes, SVG, canvas, and comment garbage; converts headers, lists, tables, blockquotes, links, images, bold, italic, code into proper Markdown syntax.
  • Smart Chunking — splits by natural boundaries (paragraph breaks, headers) with configurable overlap to preserve context; avoids cutting words mid-stream.
  • Token-Aware Chunking (Enterprise) — uses js-tiktoken (cl100k_base) to split by actual LLM tokens instead of characters. Compatible with GPT-4, GPT-3.5, text-embedding-3-small.
  • Pinecone + Qdrant Auto-Upsert (Enterprise) — generates embeddings via OpenAI text-embedding-3-small and upserts vectors directly to your Pinecone index or Qdrant collection. No glue code needed. Auto-detects which vector DB to use from your input.
  • Bulk Processing — accepts an array of HTML documents and processes each independently with per-record chunk settings.
  • URL Fetching — provide an array of URLs; the Actor automatically fetches and processes each one through the full pipeline.
  • Semantic Chunking (updated in v1.4) — heading-aware chunking that respects document structure. Splits on # headings, keeps related content together, preserves heading context in chunk metadata. Auto-selected for content >5000 characters.
  • JSONL Export — download chunks as JSONL (one JSON object per line) for direct LLM fine-tuning, embedding batch jobs, or LangChain/LlamaIndex ingestion.
  • Zero DOM dependency — pure string processing; runs on any Node.js platform without a browser or headless client.
  • MCP / AI Agent Ready — callable via API; JSON output integrates directly with LangChain, LlamaIndex, Haystack, or custom RAG pipelines.
  • Backward Compatible — Enterprise mode activates only when you provide API keys. Standard mode works exactly as before.
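The Smart Chunking behavior described above (natural boundaries plus configurable overlap) can be sketched in a few lines. This is an illustrative Python sketch, not the Actor's Node.js implementation: the function name `chunk_text`, the separator priority, and the rule that a boundary must fall in the later half of the window are all assumptions made for the example.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical sketch: prefers paragraph breaks, then line breaks, then
    spaces, and starts each overlap at a word boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Only accept a boundary in the later part of the window, so a chunk is
    # never just the overlap carried over from the previous chunk.
    min_cut = max(overlap + 1, chunk_size // 2)
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for separator in ("\n\n", "\n", " "):
                cut = window.rfind(separator)
                if cut >= min_cut:
                    end = start + cut
                    break
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        # Step back by the overlap, then advance to the next word start so
        # the following chunk does not begin mid-word.
        start = end - overlap
        while start < end and not text[start - 1].isspace():
            start += 1
    return chunks


sample = (
    "Para one has some words.\n\n"
    "Para two has some more words.\n\n"
    "Para three ends here."
)
for chunk in chunk_text(sample, chunk_size=40, overlap=10):
    print(repr(chunk))
```

If no separator is found in the accepted range, the sketch falls back to a hard cut at `chunk_size`, which can still split a very long word.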

How Enterprise Mode Works

HTML → Clean Markdown → Token-aware chunking (cl100k_base) → OpenAI Embedding → Pinecone or Qdrant upsert

Provide your OpenAI API key plus either the Pinecone fields (pineconeApiKey, pineconeIndex) or the Qdrant fields (qdrantUrl, qdrantApiKey, qdrantCollection), and the Actor detects the target and runs the full pipeline. No middleware or extra services are needed. To push to both Pinecone and Qdrant in the same run, provide all of the keys.


Use Cases

| Who | Why |
| --- | --- |
| RAG Pipeline Builders | Convert scraped pages → chunks → embeddings → Vector DB |
| LLM Fine-tuning | Clean training data by removing structural HTML garbage |
| AI Agents | Feed clean Markdown context to tool-calling LLMs |
| Content Analysts | Extract structured text from raw website dumps |

Input

Standard Mode

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| htmlContent | string | | Raw HTML or text content to process |
| urls | array | [] | URLs to auto-fetch and process (alternative to htmlContent) |
| chunkingStrategy | string | auto | auto, character, or semantic. Semantic respects heading boundaries |
| chunkSize | integer | 1000 | Target chunk length in characters or tokens |
| chunkOverlap | integer | 200 | Overlap between consecutive chunks (character mode only) |
| mode | string | both | Output mode: both, markdown, or chunks |
| inputRecords | array | [] | Bulk input [{ id, html, chunkSize?, chunkOverlap? }] |
| deduplicate | boolean | false | Skip records with identical content hash |
| minQualityScore | integer | 0 | Minimum quality score (0–100); skip low-quality content |
| embeddingModel | string | text-embedding-3-small | OpenAI embedding model for vector DB pipeline |
| batchSize | integer | 50 | Max records per embedding batch (Enterprise mode) |
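To make the `semantic` option of chunkingStrategy more concrete, here is a rough Python sketch of heading-aware splitting. It is not the Actor's code: the function name `split_by_headings` is hypothetical, and representing headingPath as a list of heading titles is an assumption, since the listing does not specify its shape.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, tracking the heading trail."""
    sections: list[dict] = []
    path: list[str] = []  # current trail, e.g. ["Guide", "Install"]
    body: list[str] = []  # lines collected for the current section

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"headingPath": list(path), "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own trail
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API."
for section in split_by_headings(doc):
    print(section["headingPath"], "->", repr(section["content"]))
```

The sketch treats any line that starts with `#` as a heading, so a comment line inside a fenced code block would be split incorrectly. A production splitter would need to track code fences as well.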

Enterprise Mode

Provide OpenAI key + either Pinecone or Qdrant config to activate the full pipeline.

Pinecone Pipeline

| Field | Type | Description |
| --- | --- | --- |
| openaiApiKey | string (secret) | Your OpenAI API key for token chunking + embeddings |
| pineconeApiKey | string (secret) | Your Pinecone API key |
| pineconeIndex | string | Your Pinecone index name (must be 1536-dimension) |

Qdrant Pipeline

| Field | Type | Description |
| --- | --- | --- |
| openaiApiKey | string (secret) | Same key, shared across all Enterprise features |
| qdrantUrl | string | Your Qdrant instance URL (e.g. https://xxx.us-east-1-0.aws.cloud.qdrant.io) |
| qdrantApiKey | string (secret) | Your Qdrant API key |
| qdrantCollection | string | Your Qdrant collection name (1536-dimension vectors) |

When Enterprise fields are detected, the Actor automatically:

  1. Switches from character-based to token-aware chunking (cl100k_base)
  2. Generates 1536-dimension embeddings via text-embedding-3-small
  3. Upserts vectors to Pinecone and/or Qdrant with metadata (chunkIndex, tokenCount, source text)

The Actor auto-detects which vector DB to use:

  • Pinecone → requires openaiApiKey + pineconeApiKey + pineconeIndex
  • Qdrant → requires openaiApiKey + qdrantUrl + qdrantApiKey + qdrantCollection
  • Both → provide all keys; runs both pipelines simultaneously
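The detection rules above amount to a small check on which input fields are present. The following Python sketch mirrors those three rules. The function name `detect_targets` and the returned labels are hypothetical, and how the Actor treats partially filled configurations is not documented here.

```python
def detect_targets(actor_input: dict) -> list[str]:
    """Return the vector DBs that a given input would activate.

    An empty list means Standard mode (no Enterprise pipeline).
    """
    def has(*keys: str) -> bool:
        # A field counts as provided only if it is present and non-empty.
        return all(actor_input.get(key) for key in keys)

    if not has("openaiApiKey"):
        return []

    targets = []
    if has("pineconeApiKey", "pineconeIndex"):
        targets.append("pinecone")
    if has("qdrantUrl", "qdrantApiKey", "qdrantCollection"):
        targets.append("qdrant")
    return targets


print(detect_targets({"htmlContent": "<h1>Hi</h1>"}))
print(detect_targets({
    "openaiApiKey": "sk-...",
    "pineconeApiKey": "pc-...",
    "pineconeIndex": "my-rag-index",
}))
```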

Where to get the keys

| Key | How to get |
| --- | --- |
| openaiApiKey | platform.openai.com/api-keys: create a new secret key |
| pineconeApiKey | app.pinecone.io → API Keys → Copy |
| pineconeIndex | Create a serverless index with dimension 1536 (matching text-embedding-3-small) |
| qdrantUrl | cloud.qdrant.io → Clusters → REST API Endpoint |
| qdrantApiKey | cloud.qdrant.io → Clusters → API Key |
| qdrantCollection | Create a collection with dimension 1536 and Cosine distance |
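If you prefer to create the Qdrant collection from a script rather than the dashboard, the sketch below builds the request for Qdrant's REST endpoint for creating a collection (`PUT /collections/{name}`) with the dimension and distance listed above. The helper name `build_create_collection_request` and the example URL are placeholders, and nothing is sent until you pass the request to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_create_collection_request(
    qdrant_url: str, api_key: str, collection: str, dimension: int = 1536
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a Cosine-distance collection."""
    body = {"vectors": {"size": dimension, "distance": "Cosine"}}
    return urllib.request.Request(
        url=f"{qdrant_url.rstrip('/')}/collections/{collection}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


request = build_create_collection_request(
    "https://example.invalid",  # placeholder; use your REST API endpoint
    "YOUR_QDRANT_API_KEY",
    "my-rag-collection",
)
print(request.get_method(), request.full_url)
# To actually create the collection:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```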

Output

Each processed record returns:

| Field | Type | Description |
| --- | --- | --- |
| recordId | string | ID of the processed record |
| status | string | ok or empty |
| rawMarkdown | string | Cleaned Markdown (if mode includes markdown or both) |
| chunks | array | Array of { chunkIndex, content, characterCount, tokenCount?, headingPath?, chunkType?, contentHash?, qualityScore?, codeBlocks? } |
| stats | object | { rawChars, cleanedChars, totalChunks, chunkSize, chunkOverlap, chunkingMode, avgQualityScore?, totalTokens? } |

A summary entry is appended at the end with aggregate statistics across all records.
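When you read the dataset over the API instead of downloading the JSONL export, you may want to flatten the records into one line per chunk yourself. The Python sketch below does that for records shaped like the table above. It assumes the trailing summary entry can be recognized by the absence of a `chunks` array, which the listing does not confirm, and the output key `text` is a hypothetical choice.

```python
import json


def records_to_jsonl(records: list[dict]) -> str:
    """Flatten Actor output records into JSONL, one line per chunk."""
    lines = []
    for record in records:
        chunks = record.get("chunks")
        # Skip empty records and anything without a chunks array
        # (assumed to cover the trailing summary entry).
        if record.get("status") != "ok" or not isinstance(chunks, list):
            continue
        for chunk in chunks:
            lines.append(json.dumps({
                "recordId": record.get("recordId"),
                "chunkIndex": chunk["chunkIndex"],
                "text": chunk["content"],
            }, ensure_ascii=False))
    return "\n".join(lines)


example_records = [
    {
        "recordId": "default",
        "status": "ok",
        "chunks": [
            {"chunkIndex": 0, "content": "# Hello World"},
            {"chunkIndex": 1, "content": "This is **important** content."},
        ],
    },
    {"recordId": "blank-page", "status": "empty", "chunks": []},
    {"totalRecords": 2},  # stand-in for the summary entry; real shape unknown
]
print(records_to_jsonl(example_records))
```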


Pricing

Pay Per Event — $0.01 per 1,000 results.

One result = one processed record (not per chunk). Processing 5 records with 200 total chunks = 5 billable results. Enterprise mode costs the same; you pay OpenAI and your vector DB provider (Pinecone or Qdrant) directly for their API usage.
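As a quick sanity check on the billing rule, the arithmetic can be written out as below. This sketch assumes the listed $0.01 per 1,000 results is a flat rate; the "from" in the listed price suggests it may vary by plan, and OpenAI and vector DB charges are not included.

```python
PRICE_PER_1000_RESULTS_USD = 0.01  # listed rate; assumed flat for this sketch


def estimate_actor_cost(record_count: int) -> float:
    """Estimate the Actor charge in USD: one billable result per processed record."""
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    return record_count / 1000 * PRICE_PER_1000_RESULTS_USD


# The chunk count does not matter: 5 records with 200 chunks bill as 5 results.
print(f"${estimate_actor_cost(5):.5f}")
print(f"${estimate_actor_cost(250_000):.2f}")
```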


Examples

Standard Mode

Input:

{
  "htmlContent": "<html><body><h1>Hello World</h1><p>This is <strong>important</strong> content.</p></body></html>",
  "chunkSize": 500,
  "chunkOverlap": 50
}

Output:

{
  "recordId": "default",
  "status": "ok",
  "rawMarkdown": "# Hello World\n\nThis is **important** content.",
  "chunks": [
    { "chunkIndex": 0, "content": "# Hello World\n\nThis is **important** content.", "characterCount": 49 }
  ],
  "stats": { "rawChars": 107, "cleanedChars": 49, "totalChunks": 1, "chunkSize": 500, "chunkOverlap": 50, "chunkingMode": "character" }
}

Enterprise Mode (Token Chunking + Pinecone)

Input:

{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "pineconeApiKey": "pc-...",
  "pineconeIndex": "my-rag-index"
}

Output includes:

{
  "stats": { "totalChunks": 12, "totalTokens": 482, "chunkingMode": "token" },
  "chunks": [
    { "chunkIndex": 0, "content": "...", "tokenCount": 42, "characterCount": 185 }
  ]
}

Vectors are upserted to Pinecone automatically.

Enterprise Mode (Token Chunking + Qdrant)

Input:

{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "qdrantUrl": "https://xxx.us-east-1-0.aws.cloud.qdrant.io",
  "qdrantApiKey": "qdrant-...",
  "qdrantCollection": "my-rag-collection"
}

Vectors are upserted to Qdrant automatically via REST API (no SDK required).
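For readers who want to see what a REST-only upsert can look like, here is a Python sketch that builds a request for Qdrant's points upsert endpoint (`PUT /collections/{name}/points`). It is not the Actor's code. The payload key names, the UUID-based point IDs, and the helper name `build_upsert_request` are assumptions; the listing only says that chunkIndex, tokenCount, and the source text are stored as metadata.

```python
import json
import urllib.request
import uuid


def build_upsert_request(
    qdrant_url: str,
    api_key: str,
    collection: str,
    record_id: str,
    chunks: list[dict],
    vectors: list[list[float]],
) -> urllib.request.Request:
    """Build (but do not send) a Qdrant upsert request, one point per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding vector")

    points = []
    for chunk, vector in zip(chunks, vectors):
        # Qdrant point IDs must be unsigned integers or UUIDs. A name-based
        # UUID keeps the ID stable, so re-running overwrites the same point.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{record_id}#{chunk['chunkIndex']}"))
        points.append({
            "id": point_id,
            "vector": vector,
            "payload": {
                "recordId": record_id,
                "chunkIndex": chunk["chunkIndex"],
                "tokenCount": chunk.get("tokenCount"),
                "text": chunk["content"],
            },
        })

    return urllib.request.Request(
        url=f"{qdrant_url.rstrip('/')}/collections/{collection}/points?wait=true",
        data=json.dumps({"points": points}).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


demo_chunks = [{"chunkIndex": 0, "content": "Enterprise RAG Pipeline", "tokenCount": 4}]
demo_vectors = [[0.0] * 1536]  # placeholder; real vectors come from the embedding model
demo_request = build_upsert_request(
    "https://example.invalid", "YOUR_QDRANT_API_KEY", "my-rag-collection",
    "default", demo_chunks, demo_vectors,
)
print(demo_request.get_method(), demo_request.full_url)
```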


Usage with AI Agents / MCP

// POST https://api.apify.com/v2/acts/foxpink~apify-rag-markdown-chunker/runs?token=YOUR_API_TOKEN
{
  "htmlContent": "<h1>Your HTML here</h1>",
  "chunkSize": 1000,
  "chunkOverlap": 200
}

Note: This tool operates on already-crawled content. Use with any Apify Web Scraper, Puppeteer, or browser-based Actor by piping its output into this Actor's inputRecords.
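The piping step in the note can be sketched as follows: map items from a crawler's dataset into inputRecords, then build the POST request shown above. The item keys `url` and `html` are assumptions about the upstream crawler's output, and the helper names are hypothetical. The endpoint and the `{ id, html, chunkSize?, chunkOverlap? }` record shape come from this page.

```python
import json
import urllib.parse
import urllib.request

ACTOR_ID = "foxpink~apify-rag-markdown-chunker"  # taken from the endpoint above


def to_input_records(items: list[dict], chunk_size: int = 1000, chunk_overlap: int = 200) -> list[dict]:
    """Map crawler items to inputRecords, skipping items without HTML.

    The 'url' and 'html' keys are assumed; adjust them to your crawler's output.
    """
    records = []
    for index, item in enumerate(items):
        if not item.get("html"):
            continue
        records.append({
            "id": str(item.get("url", index)),
            "html": item["html"],
            "chunkSize": chunk_size,
            "chunkOverlap": chunk_overlap,
        })
    return records


def build_run_request(token: str, items: list[dict]) -> urllib.request.Request:
    """Build (but do not send) the request that starts an Actor run."""
    query = urllib.parse.urlencode({"token": token})
    body = {"inputRecords": to_input_records(items)}
    return urllib.request.Request(
        url=f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


crawled = [
    {"url": "https://example.com/a", "html": "<h1>A</h1><p>First page.</p>"},
    {"url": "https://example.com/b", "html": ""},  # skipped: nothing to process
]
run_request = build_run_request("YOUR_API_TOKEN", crawled)
print(run_request.get_method(), run_request.full_url)
# To start the run:
# with urllib.request.urlopen(run_request) as response:
#     print(json.load(response))
```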


Compatibility

  • 100% Node.js (18+)
  • No browser, no headless, no DOM
  • ESM (ECMAScript Modules)