Docs Markdown RAG-Ready Crawler

An Apify Actor that crawls documentation websites and converts them into clean markdown with RAG-ready chunks for embeddings. Includes internal link graphs and content hashes for change detection.

Features

  • Markdown Conversion - Converts HTML content to clean, well-formatted markdown
  • RAG-Ready Chunks - Automatically splits content into chunks optimized for embedding models
  • Dual Crawler Support - Playwright for JavaScript SPAs, Cheerio for static HTML (faster)
  • Link Graph - Extracts internal link relationships for building knowledge graphs
  • Content Hashing - SHA-256 hashes for detecting content changes
  • Smart Content Extraction - Automatically identifies main content and removes navigation/noise
  • URL Normalization - Handles query params, trailing slashes, and tracking parameters

Output Datasets

The crawler generates multiple dataset types (identified by _datasetType):

Pages (_datasetType: 'pages')

Full page data including:

  • url, normalizedUrl, canonicalUrl
  • title, h1, language
  • text - Plain text content
  • markdown - Converted markdown
  • excerpt - First 300 characters
  • depth - Crawl depth from start URL
  • referrers - URLs that linked to this page
  • outgoingInternalLinks, outgoingExternalLinks
  • contentHash - SHA-256 hash of markdown content
  • fetchedAt - ISO timestamp
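
For illustration, an abridged pages item might look like the following (field values are invented for the example; real items contain the full field set above):

{
  "_datasetType": "pages",
  "url": "https://docs.example.com/getting-started/",
  "normalizedUrl": "https://docs.example.com/getting-started",
  "canonicalUrl": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "h1": "Getting Started",
  "language": "en",
  "markdown": "# Getting Started\n\nInstall the package...",
  "excerpt": "Install the package...",
  "depth": 1,
  "referrers": ["https://docs.example.com"],
  "outgoingInternalLinks": ["https://docs.example.com/installation"],
  "outgoingExternalLinks": [],
  "contentHash": "3f8a1c0d…",
  "fetchedAt": "2025-01-01T12:00:00.000Z"
}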

Chunks (_datasetType: 'chunks')

RAG-ready content chunks:

  • chunkId - Stable unique identifier
  • url, normalizedUrl
  • chunkIndex - Position in document
  • headingPath - Array of parent headings (e.g., ["Getting Started", "Installation"])
  • markdown, text - Chunk content
  • charStart, charEnd - Character positions in original document
  • chunkHash - Hash of chunk content
  • pageContentHash - Hash of parent page
  • tokenEstimate - Approximate token count
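
A chunks item, again with purely illustrative values:

{
  "_datasetType": "chunks",
  "chunkId": "a41b2c9e…",
  "url": "https://docs.example.com/getting-started/",
  "normalizedUrl": "https://docs.example.com/getting-started",
  "chunkIndex": 2,
  "headingPath": ["Getting Started", "Installation"],
  "markdown": "## Installation\n\nRun the install command...",
  "text": "Installation Run the install command...",
  "charStart": 1850,
  "charEnd": 4310,
  "chunkHash": "9c0d4e7b…",
  "pageContentHash": "3f8a1c0d…",
  "tokenEstimate": 610
}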

Edges (_datasetType: 'edges')

Internal link graph:

  • from - Source URL (normalized)
  • to - Target URL (normalized)
  • type - Link type (a[href])
  • anchorText - Link text
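
An illustrative edges item:

{
  "_datasetType": "edges",
  "from": "https://docs.example.com/getting-started",
  "to": "https://docs.example.com/installation",
  "type": "a[href]",
  "anchorText": "Installation guide"
}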

Issues (_datasetType: 'issues')

Crawl errors and warnings:

  • type - Error type
  • url - Affected URL
  • message - Error message
  • severity - Error severity level
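
An illustrative issues item (the type and severity values shown here are assumptions, not an exhaustive list):

{
  "_datasetType": "issues",
  "type": "requestFailed",
  "url": "https://docs.example.com/broken-page",
  "message": "Request timed out",
  "severity": "warning"
}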

Input Configuration

  • domain (string, required) - Domain to crawl (e.g., https://docs.example.com)
  • startUrls (array, default []) - Override start URLs (optional)
  • maxPages (integer, default 200) - Maximum pages to crawl (1-10,000)
  • maxDepth (integer, default 4) - Maximum crawl depth (1-10)
  • makeRagReady (boolean, default true) - Generate RAG-ready chunks
  • mode (string, default "docs") - Extraction mode: docs, article, generic
  • output (string, default "all") - Output: all, pagesOnly, chunksOnly, edgesOnly
  • crawlerType (string, default "playwright") - Engine: playwright (for SPAs) or cheerio (for static HTML)
  • includeSubdomains (boolean, default false) - Also crawl subdomains
  • respectRobotsTxt (boolean, default true) - Follow robots.txt rules
  • removeSelectors (array, default ["nav", "aside", ...]) - CSS selectors to remove
  • allowPatterns (array, default []) - Regex patterns for URLs to include
  • denyPatterns (array, default [".*utm_.*", ...]) - Regex patterns for URLs to exclude
  • stripQueryParams (boolean, default true) - Remove query parameters from URLs
  • chunkTargetChars (integer, default 2500) - Target chunk size (500-10,000)
  • chunkMaxChars (integer, default 4500) - Maximum chunk size (1,000-20,000)
  • minChunkChars (integer, default 400) - Minimum chunk size (100-2,000)
  • proxyConfiguration (object) - Apify proxy settings

Example Input

{
  "domain": "https://docs.convex.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "makeRagReady": true,
  "mode": "docs",
  "output": "all",
  "crawlerType": "playwright",
  "chunkTargetChars": 2500,
  "chunkMaxChars": 4500
}
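
To run the Actor programmatically, a minimal sketch using the apify-client package might look like the following (the Actor ID placeholder and token environment variable are assumptions; substitute your own values):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor run and wait for it to finish.
const run = await client.actor('<ACTOR_ID>').call({
  domain: 'https://docs.convex.dev',
  maxPages: 500,
  makeRagReady: true,
  output: 'all',
});

// Read the results from the run's default dataset and keep only the RAG chunks.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item._datasetType === 'chunks');
console.log(`Fetched ${chunks.length} RAG-ready chunks`);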

Crawler Types

Playwright (default)

  • Best for: JavaScript SPAs, React/Vue/Next.js documentation sites
  • Waits for networkidle to ensure all content is loaded
  • Slower but handles dynamic content
  • Timeout: 120 seconds per page

Cheerio

  • Best for: Static HTML sites, traditional documentation
  • Much faster (no browser required)
  • Lower resource usage
  • Timeout: 30 seconds per page
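
For a mostly static documentation site, a leaner input using the Cheerio engine might look like this (the domain is only an example):

{
  "domain": "https://docs.example.com",
  "maxPages": 200,
  "crawlerType": "cheerio",
  "makeRagReady": true,
  "output": "all"
}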

Content Extraction

The crawler uses smart selectors to find main content:

Docs mode tries (in order):

  1. main, article, [role="main"]
  2. .content, .markdown, .prose
  3. .theme-doc-markdown, .md-content, .docs-content
  4. Falls back to body

Automatically removes noise elements:

  • nav, aside, header, footer
  • .toc, .sidebar, .navigation, .menu
  • Any custom selectors you specify
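
The Actor's internal extraction code isn't reproduced here, but a rough sketch of this selector-priority approach using Cheerio (with abridged selector lists) could look like:

import * as cheerio from 'cheerio';

// Candidate containers for docs mode, tried in priority order (abridged list).
const CONTENT_SELECTORS = ['main', 'article', '[role="main"]', '.content', '.markdown', '.prose'];
// Noise removed before extraction (abridged list).
const NOISE_SELECTORS = ['nav', 'aside', 'header', 'footer', '.toc', '.sidebar'];

function extractMainContent(html: string): string {
  const $ = cheerio.load(html);
  // Strip navigation and other noise before picking the main container.
  NOISE_SELECTORS.forEach((sel) => $(sel).remove());
  // Return the first candidate that actually contains text.
  for (const sel of CONTENT_SELECTORS) {
    const el = $(sel).first();
    if (el.length && el.text().trim().length > 0) {
      return el.html() ?? '';
    }
  }
  // Fall back to the whole body.
  return $('body').html() ?? '';
}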

Chunking Strategy

Content is split into chunks based on:

  1. Heading boundaries - New chunks at #, ##, ###, #### headings
  2. Target size - Aims for ~2,500 characters per chunk
  3. Max size - Hard limit at 4,500 characters
  4. Min size - Avoids tiny chunks under 400 characters
  5. Paragraph preservation - Splits at paragraph boundaries when possible
  6. Sentence preservation - Falls back to sentence/word boundaries for very long paragraphs

Each chunk includes its headingPath for context, making it ideal for RAG systems.
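
The Actor's exact chunker isn't published in this README; a rough sketch of heading-aware chunking along these lines (constants mirror the defaults above) might be:

// Rough sketch only; not the Actor's actual implementation.
const TARGET_CHARS = 2500;
const MAX_CHARS = 4500;
const MIN_CHARS = 400;

interface Chunk { headingPath: string[]; markdown: string; }

function chunkMarkdown(markdown: string): Chunk[] {
  // 1. Split into sections at # to #### headings, tracking the heading path for context.
  const sections: Chunk[] = [];
  let path: string[] = [];
  let buffer: string[] = [];
  const flush = () => {
    const text = buffer.join('\n').trim();
    if (text) sections.push({ headingPath: [...path], markdown: text });
    buffer = [];
  };
  for (const line of markdown.split('\n')) {
    const m = /^(#{1,4})\s+(.*)$/.exec(line);
    if (m) {
      flush();
      path = [...path.slice(0, m[1].length - 1), m[2].trim()];
    }
    buffer.push(line);
  }
  flush();

  // 2. Merge small neighboring sections toward the target size, never exceeding the max.
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const last = chunks[chunks.length - 1];
    const merged = last ? last.markdown.length + section.markdown.length : Infinity;
    if (last && merged <= MAX_CHARS && (last.markdown.length < MIN_CHARS || merged <= TARGET_CHARS)) {
      last.markdown += '\n\n' + section.markdown;
    } else {
      chunks.push({ ...section });
    }
  }
  return chunks;
}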

Local Development

# Install dependencies
npm install
# Run locally
apify run
# Run with input
apify run --input='{"domain": "https://docs.example.com"}'
# Deploy to Apify
apify push

Technical Notes

  • 9MB Limit: Apify dataset items have a ~9MB limit. Pages exceeding this are automatically truncated and flagged with truncated: true.
  • URL Normalization: URLs are normalized (HTTPS, no trailing slashes, tracking params stripped) for deduplication.
  • Content Hashes: Use contentHash and chunkHash fields to detect content changes between crawls.
  • Stable Chunk IDs: chunkId is deterministic, derived from URL, position, and content, so identical content always produces the same ID.
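
For example, pages whose content changed between two crawls can be found by comparing contentHash values keyed by normalizedUrl. A minimal sketch, assuming the pages items from both crawls are already loaded into arrays:

interface PageItem { normalizedUrl: string; contentHash: string; }

// Returns normalized URLs that are new or whose markdown changed since the previous crawl.
function changedPages(previous: PageItem[], current: PageItem[]): string[] {
  const prevHashes = new Map(previous.map((p) => [p.normalizedUrl, p.contentHash] as const));
  return current
    .filter((p) => prevHashes.get(p.normalizedUrl) !== p.contentHash)
    .map((p) => p.normalizedUrl);
}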

Dependencies

License

ISC