RAG-Ready Markdown Converter & Chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Pricing

from $0.01 / 1,000 results

Rating

4.7

(3)

Developer

Nguyễn Anh Duy

Last modified

a month ago

Quick Comparison

Feature	Standard	Enterprise
HTML → Clean Markdown	✅	✅
URL Fetching (auto-download HTML)	✅	✅
Character-based chunking	✅	—
Semantic chunking (heading-aware)	✅	✅
Token-aware chunking (cl100k_base)	—	✅
Natural boundary detection	✅	✅
Configurable overlap	✅	✅
Embeddings via text-embedding-3-small	—	✅
Pinecone auto-upsert	—	✅
Qdrant auto-upsert	—	✅
Bulk processing	✅	✅
JSONL export (LLM-ready)	✅	✅
Zero DOM / no browser	✅	✅
Price	$0.01/1k	$0.01/1k

Why this exists

Hundreds of Apify crawlers output raw HTML full of nav bars, footers, scripts, and ads. Feeding that into a Vector DB or LLM wastes tokens and pollutes embeddings. This Actor takes any already-crawled content and returns clean, ready-to-ingest chunks, with or without a Vector DB pipeline.

Features

HTML → Clean Markdown — strips scripts, styles, nav, footer, iframes, SVG, canvas, and comment garbage; converts headers, lists, tables, blockquotes, links, images, bold, italic, code into proper Markdown syntax.
Smart Chunking — splits by natural boundaries (paragraph breaks, headers) with configurable overlap to preserve context; avoids cutting words mid-stream.
Token-Aware Chunking (Enterprise) — uses js-tiktoken (cl100k_base) to split by actual LLM tokens instead of characters. Compatible with GPT-4, GPT-3.5, text-embedding-3-small.
Pinecone + Qdrant Auto-Upsert (Enterprise) — generates embeddings via OpenAI text-embedding-3-small and upserts vectors directly to your Pinecone index or Qdrant collection. No glue code needed. Auto-detects which vector DB to use from your input.
Bulk Processing — accepts an array of HTML documents and processes each independently with per-record chunk settings.
URL Fetching — provide an array of URLs; the Actor automatically fetches and processes each one through the full pipeline.
Semantic Chunking (updated in v1.4) — heading-aware chunking that respects document structure. Splits on # headings, keeps related content together, preserves heading context in chunk metadata. Auto-selected for content >5000 characters.
JSONL Export — download chunks as JSONL (one JSON object per line) for direct LLM fine-tuning, embedding batch jobs, or LangChain/LlamaIndex ingestion.
Zero DOM dependency — pure string processing; runs on any Node.js platform without a browser or headless client.
MCP / AI Agent Ready — callable via API; JSON output integrates directly with LangChain, LlamaIndex, Haystack, or custom RAG pipelines.
Backward Compatible — Enterprise mode activates only when you provide API keys. Standard mode works exactly as before.
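
To make the Smart Chunking idea concrete, here is a minimal Python sketch of character-based chunking with overlap that prefers natural boundaries. It is an illustration only, not the Actor's actual code (the Actor runs on Node.js); the function name `chunk_text` and its boundary rules are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each cut prefers a paragraph break, then a line break, then a space,
    so a word is only split when the window contains no boundary at all.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last natural boundary inside the window.
            for sep in ("\n\n", "\n", " "):
                cut = text.rfind(sep, start, end)
                if cut > start + overlap:
                    end = cut
                    break
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        # Step back by `overlap` characters, then move forward to the next
        # word start so the following chunk does not begin mid-word.
        nxt = end - overlap
        while nxt < end and not text[nxt - 1].isspace():
            nxt += 1
        start = max(nxt, start + 1)
    return chunks
```

With `chunk_size=20` and `overlap=6`, the text "alpha beta gamma delta epsilon zeta eta theta" yields three chunks, and the word "gamma" is repeated at the start of the second chunk as carried-over context.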

How Enterprise Mode Works

HTML → Clean Markdown → Token-aware chunking (cl100k_base) → OpenAI Embedding → Pinecone or Qdrant upsert

Provide an OpenAI API key plus either your Pinecone keys or your Qdrant config, and the Actor auto-detects the target and runs the full pipeline. No extra configuration, middleware, or services are needed. To push to both Pinecone and Qdrant in the same run, provide all keys.

Use Cases

Who	Why
RAG Pipeline Builders	Convert scraped pages → chunks → embeddings → Vector DB
LLM Fine-tuning	Clean training data by removing structural HTML garbage
AI Agents	Feed clean Markdown context to tool-calling LLMs
Content Analysts	Extract structured text from raw website dumps

Input

Standard Mode

Field	Type	Default	Description
`htmlContent`	string	—	Raw HTML or text content to process
`urls`	array	`[]`	URLs to auto-fetch and process (alternative to htmlContent)
`chunkingStrategy`	string	`auto`	`auto`, `character`, or `semantic`. Semantic respects heading boundaries
`chunkSize`	integer	`1000`	Target chunk length: characters in Standard mode, tokens in Enterprise mode
`chunkOverlap`	integer	`200`	Overlap between consecutive chunks (character mode only)
`mode`	string	`both`	Output mode: `both`, `markdown`, or `chunks`
`inputRecords`	array	`[]`	Bulk input `[{ id, html, chunkSize?, chunkOverlap? }]`
`deduplicate`	boolean	`false`	Skip records with identical content hash
`minQualityScore`	integer	`0`	Minimum quality score (0–100); skip low-quality content
`embeddingModel`	string	`text-embedding-3-small`	OpenAI embedding model for vector DB pipeline
`batchSize`	integer	`50`	Max records per embedding batch (Enterprise mode)
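
The `semantic` strategy is described as heading-aware. The Python sketch below shows one way such a splitter could work: it cuts Markdown at `#` headings and records the heading trail for each section, similar in spirit to the `headingPath` field in the output. The function name and the exact rules are assumptions; the Actor's real implementation may differ (for example, this sketch does not special-case `#` lines inside fenced code).

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one section per heading, tracking the heading path."""
    sections = []
    path = []  # stack of (level, title) for the section being collected
    body = []  # lines of the section being collected

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({
                "headingPath": [title for _, title in path],
                "content": content,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level, title = len(match.group(1)), match.group(2).strip()
            # Drop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        body.append(line)
    flush()
    return sections
```

Sections that are still longer than `chunkSize` could then be passed through a size-based splitter, keeping the same heading path on every resulting chunk.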

Enterprise Mode

Provide an OpenAI key plus either a Pinecone or a Qdrant config to activate the full pipeline.

Pinecone Pipeline

Field	Type	Description
`openaiApiKey`	string (secret)	Your OpenAI API key for token chunking + embeddings
`pineconeApiKey`	string (secret)	Your Pinecone API key
`pineconeIndex`	string	Your Pinecone index name (must be 1536-dimension)

Qdrant Pipeline

Field	Type	Description
`openaiApiKey`	string (secret)	Same key — shared across all Enterprise features
`qdrantUrl`	string	Your Qdrant instance URL (e.g. `https://xxx.us-east-1-0.aws.cloud.qdrant.io`)
`qdrantApiKey`	string (secret)	Your Qdrant API key
`qdrantCollection`	string	Your Qdrant collection name (1536-dimension vectors)

When Enterprise fields are detected, the Actor automatically:

Switches from character-based to token-aware chunking (cl100k_base)
Generates 1536-dimension embeddings via text-embedding-3-small
Upserts vectors to Pinecone and/or Qdrant with metadata (chunkIndex, tokenCount, source text)

The Actor auto-detects which vector DB to use:

Pinecone → requires openaiApiKey + pineconeApiKey + pineconeIndex
Qdrant → requires openaiApiKey + qdrantUrl + qdrantApiKey + qdrantCollection
Both → provide all keys; runs both pipelines simultaneously
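
The detection rules above can be expressed as a small check on the input object. This Python sketch is illustrative only; the function name `detect_pipelines` is hypothetical, and it assumes that a missing or empty field counts as "not provided".

```python
def detect_pipelines(actor_input: dict) -> list[str]:
    """Return the vector DB targets implied by the input fields.

    An empty list means Standard mode (no embeddings, no upsert).
    """
    def has(*keys: str) -> bool:
        return all(actor_input.get(key) for key in keys)

    # Every Enterprise pipeline needs the OpenAI key for embeddings.
    if not has("openaiApiKey"):
        return []

    targets = []
    if has("pineconeApiKey", "pineconeIndex"):
        targets.append("pinecone")
    if has("qdrantUrl", "qdrantApiKey", "qdrantCollection"):
        targets.append("qdrant")
    return targets
```

Running a check like this on your own input before starting a run is a cheap way to confirm which mode you are about to trigger.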

Where to get the keys

Key	How to get
`openaiApiKey`	platform.openai.com/api-keys — create a new secret key
`pineconeApiKey`	app.pinecone.io → API Keys → Copy
`pineconeIndex`	Create a serverless index with dimension 1536 (matching `text-embedding-3-small`)
`qdrantUrl`	cloud.qdrant.io → Clusters → REST API Endpoint
`qdrantApiKey`	cloud.qdrant.io → Clusters → API Key
`qdrantCollection`	Create a collection with dimension 1536 and `Cosine` distance
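
If you prefer to create the Qdrant collection from code rather than the dashboard, the request looks roughly like the Python sketch below. It only builds the request for Qdrant's create-collection REST endpoint and does not send it; the helper name and the example host are placeholders.

```python
import json
import urllib.request


def build_create_collection_request(qdrant_url: str, api_key: str,
                                    collection: str, size: int = 1536) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates a Qdrant collection."""
    # 1536 matches the output dimension of text-embedding-3-small.
    body = {"vectors": {"size": size, "distance": "Cosine"}}
    return urllib.request.Request(
        url=f"{qdrant_url.rstrip('/')}/collections/{collection}",
        data=json.dumps(body).encode("utf-8"),
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="PUT",
    )


# Example with placeholder values; pass the result to
# urllib.request.urlopen(request) to actually create the collection.
request = build_create_collection_request(
    "https://qdrant.example", "YOUR_QDRANT_API_KEY", "my-rag-collection"
)
```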

Output

Each processed record returns:

Field	Type	Description
`recordId`	string	ID of the processed record
`status`	string	`ok` or `empty`
`rawMarkdown`	string	Cleaned Markdown (if mode includes `markdown` or `both`)
`chunks`	array	Array of `{ chunkIndex, content, characterCount, tokenCount?, headingPath?, chunkType?, contentHash?, qualityScore?, codeBlocks? }`
`stats`	object	`{ rawChars, cleanedChars, totalChunks, chunkSize, chunkOverlap, chunkingMode, avgQualityScore?, totalTokens? }`

A summary entry is appended at the end with aggregate statistics across all records.
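
If you download the dataset as JSON and want one line per chunk for a downstream job, a conversion along these lines works. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the trailing summary entry carries no `chunks` array, so it is skipped.

```python
import json


def chunks_to_jsonl(records: list[dict]) -> str:
    """Flatten record-level output into JSONL, one chunk per line."""
    lines = []
    for record in records:
        # Entries without chunks (empty records, the summary entry) add nothing.
        for chunk in record.get("chunks") or []:
            lines.append(json.dumps({
                "recordId": record.get("recordId"),
                "chunkIndex": chunk["chunkIndex"],
                "content": chunk["content"],
            }, ensure_ascii=False))
    return "\n".join(lines)
```

Because `json.dumps` escapes newlines inside `content`, every chunk stays on a single line.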

Pricing

Pay Per Event — $0.01 per 1,000 results.

One result = one processed record (not per chunk). Processing 5 records with 200 total chunks = 5 billable results. Enterprise mode costs the same on Apify; you pay OpenAI and your vector DB provider (Pinecone or Qdrant) directly for their API usage.
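
As a quick worked example of the billing rule, the Python sketch below computes the Actor fee from the number of processed records at the listed rate of $0.01 per 1,000 results. It covers only the Apify-side fee; OpenAI and vector DB charges are separate, and the function name is just for illustration.

```python
from decimal import Decimal

PRICE_PER_1000_RESULTS = Decimal("0.01")  # USD, as listed above


def actor_cost_usd(records: int) -> Decimal:
    """Fee for a run: billed per processed record, not per chunk."""
    return PRICE_PER_1000_RESULTS * records / 1000


print(actor_cost_usd(5))          # 5 records (however many chunks) cost $0.00005
print(actor_cost_usd(1_000_000))  # one million records cost $10
```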

Examples

Standard Mode

Input:

{
  "htmlContent": "<html><body><h1>Hello World</h1><p>This is <strong>important</strong> content.</p></body></html>",
  "chunkSize": 500,
  "chunkOverlap": 50
}

Output:

{
  "recordId": "default",
  "status": "ok",
  "rawMarkdown": "# Hello World\n\nThis is **important** content.",
  "chunks": [
    { "chunkIndex": 0, "content": "# Hello World\n\nThis is **important** content.", "characterCount": 45 }
  ],
  "stats": { "rawChars": 96, "cleanedChars": 45, "totalChunks": 1, "chunkSize": 500, "chunkOverlap": 50, "chunkingMode": "character" }
}

Enterprise Mode (Token Chunking + Pinecone)

Input:

{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "pineconeApiKey": "pc-...",
  "pineconeIndex": "my-rag-index"
}

Output includes:

{
  "stats": { "totalChunks": 12, "totalTokens": 482, "chunkingMode": "token" },
  "chunks": [
    { "chunkIndex": 0, "content": "...", "tokenCount": 42, "characterCount": 185 }
  ]
}

Vectors are upserted to Pinecone automatically.

Enterprise Mode (Token Chunking + Qdrant)

Input:

{
  "htmlContent": "<h1>Enterprise RAG Pipeline</h1><p>Your HTML content here...</p>",
  "chunkSize": 500,
  "chunkOverlap": 100,
  "openaiApiKey": "sk-...",
  "qdrantUrl": "https://xxx.us-east-1-0.aws.cloud.qdrant.io",
  "qdrantApiKey": "qdrant-...",
  "qdrantCollection": "my-rag-collection"
}

Vectors are upserted to Qdrant automatically via REST API (no SDK required).
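
For readers curious what a REST upsert involves, here is a rough Python sketch of a request to Qdrant's points endpoint. It is not the Actor's code: the helper name, the payload key names (`recordId`, `text`), and the UUID-based point IDs are assumptions chosen for illustration. The request is built but not sent.

```python
import json
import uuid
import urllib.request


def build_upsert_request(qdrant_url: str, api_key: str, collection: str,
                         record_id: str, chunks: list[dict],
                         vectors: list[list[float]]) -> urllib.request.Request:
    """Build (but do not send) a PUT request that upserts one point per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")

    points = []
    for chunk, vector in zip(chunks, vectors):
        # Qdrant point IDs must be unsigned integers or UUIDs; a name-based
        # UUID keeps the ID stable, so re-running overwrites instead of duplicating.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{record_id}#{chunk['chunkIndex']}"))
        points.append({
            "id": point_id,
            "vector": vector,
            "payload": {
                "recordId": record_id,
                "chunkIndex": chunk["chunkIndex"],
                "tokenCount": chunk.get("tokenCount"),
                "text": chunk["content"],
            },
        })

    return urllib.request.Request(
        url=f"{qdrant_url.rstrip('/')}/collections/{collection}/points?wait=true",
        data=json.dumps({"points": points}).encode("utf-8"),
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="PUT",
    )
```

Each vector must have the same dimension as the collection (1536 for `text-embedding-3-small`), otherwise Qdrant rejects the request.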

Usage with AI Agents / MCP

// POST https://api.apify.com/v2/acts/foxpink~apify-rag-markdown-chunker/runs?token=YOUR_API_TOKEN
{
  "htmlContent": "<h1>Your HTML here</h1>",
  "chunkSize": 1000,
  "chunkOverlap": 200
}

Note: This tool is designed for content you have already crawled (it can also fetch pages itself via `urls`). To use it with Apify Web Scraper, Puppeteer, or any browser-based Actor, pass that Actor's output into this Actor's `inputRecords`.
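
The Python sketch below shows one way to turn another Actor's dataset items into the `inputRecords` shape `[{ id, html, chunkSize?, chunkOverlap? }]`. The upstream field names `url` and `html` are assumptions; adjust them to whatever your scraper actually outputs. The helper name is hypothetical.

```python
from typing import Optional


def to_input_records(items: list[dict],
                     chunk_size: Optional[int] = None,
                     chunk_overlap: Optional[int] = None) -> dict:
    """Map scraper output items to this Actor's bulk input payload."""
    records = []
    for index, item in enumerate(items):
        html = item.get("html")  # assumed upstream field name
        if not html:
            continue  # nothing to convert for this item
        # Use the page URL as the record ID when available (assumed field).
        record = {"id": item.get("url") or f"record-{index}", "html": html}
        if chunk_size is not None:
            record["chunkSize"] = chunk_size
        if chunk_overlap is not None:
            record["chunkOverlap"] = chunk_overlap
        records.append(record)
    return {"inputRecords": records}
```

The returned dictionary can be sent as the JSON body of the run request shown above.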

Compatibility

100% Node.js (18+)
No browser, no headless, no DOM
ESM (ECMAScript Modules)
