rag-docs-scraper

Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Hastin S. (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 3 days ago

AI Documentation & RAG Scraper 🤖📄

The AI Documentation & RAG Scraper is a high-performance tool designed to transform messy technical documentation into clean, structured Markdown. It is specifically optimized for RAG (Retrieval-Augmented Generation) pipelines, LLM fine-tuning, and AI agents.

Stop feeding your AI noisy HTML. Get the clean text you need, instantly.


✨ Key Features

  • Markdown Optimized: Automatically converts HTML to clean Markdown while preserving headers, code blocks, and tables.
  • Noise Removal: Smartly identifies and strips out navbars, footers, sidebars, and cookie banners to focus only on the content.
  • Modern Web Support: Powered by Playwright, it easily handles JavaScript-heavy documentation sites (React, Docusaurus, GitBook, Next.js).
  • Recursive Crawling: Provide a homepage, and the scraper will automatically follow internal links to map out the entire documentation set.
  • AI-Agent Ready: Output is structured perfectly for Vector Databases (Pinecone, Weaviate) or direct upload to ChatGPT/Claude.
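The noise-removal idea behind these features can be sketched with Python's standard library alone. This is only an illustration of the approach, not the Actor's implementation (which uses Playwright and a fuller HTML-to-Markdown conversion); the tag list and heading mapping below are assumptions for the sketch.

```python
from html.parser import HTMLParser

# Illustrative choices: which elements count as "noise" and how
# headings map to Markdown prefixes.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
HEADING_TAGS = {"h1": "#", "h2": "##", "h3": "###"}

class DocCleaner(HTMLParser):
    """Strips navigation/footer noise and emits Markdown-ish text."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0   # > 0 while inside a noise element
        self.prefix = ""      # heading prefix for the current text run

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS:
            self.prefix = HEADING_TAGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)

def to_markdown(html: str) -> str:
    cleaner = DocCleaner()
    cleaner.feed(html)
    return "\n\n".join(cleaner.out)

html = "<nav>Home | Docs</nav><h1>Quick Start</h1><p>Install it.</p><footer>© 2026</footer>"
print(to_markdown(html))  # -> "# Quick Start\n\nInstall it."
```

Everything inside `<nav>`, `<footer>`, and similar elements is dropped, while headings and body text survive as Markdown lines.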

🚀 How to Use

  1. Input URLs: Enter the starting URL of the documentation you want to scrape (e.g., https://docs.apify.com/).
  2. Set Page Limit: Define how many pages you want to crawl to stay within your budget.
  3. Run & Download: Start the Actor and download your results in JSON, CSV, or Excel.
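The steps above can also be driven programmatically. A minimal sketch, assuming the usual Apify input conventions: the Actor id placeholder and the field names (`startUrls`, `maxPages`, `proxyConfiguration`) are illustrative here, so check the Actor's actual input schema before running.

```python
def build_run_input(start_urls, max_pages=50, use_proxy=True):
    """Assemble a run input matching the steps above (field names assumed)."""
    run_input = {
        "startUrls": [{"url": u} for u in start_urls],
        "maxPages": max_pages,
    }
    if use_proxy:
        run_input["proxyConfiguration"] = {"useApifyProxy": True}
    return run_input

# With the official apify-client package (network call, shown for context only):
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<username>/rag-docs-scraper").call(
#     run_input=build_run_input(["https://docs.apify.com/"], max_pages=20)
# )
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["url"], item["title"])

print(build_run_input(["https://docs.apify.com/"], max_pages=20))
```

The `.call()` / `iterate_items()` pattern blocks until the run finishes and then streams the dataset, which is convenient for small crawls.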

🛠️ Input Configuration

| Field | Type | Description |
| --- | --- | --- |
| Start URLs | Array | The entry points for the crawl. Supports multiple URLs. |
| Max Pages | Integer | The maximum number of pages to crawl (default: 50). |
| Proxy | Object | Uses Apify Proxy to ensure high success rates and avoid rate limits. |

📊 Sample Output

````json
{
  "url": "https://crawlee.dev/docs/quick-start",
  "title": "Quick Start | Crawlee",
  "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n```bash\nnpm install crawlee playwright\n```",
  "scrapedAt": "2026-05-07T12:00:00Z"
}
````
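Records in this shape slot directly into a RAG pipeline. A minimal sketch of splitting the `markdown` field into chunks for a vector database, keeping the source URL and title as metadata; the heading-based split and the chunk size are illustrative choices, not part of the Actor's output.

```python
import re

def chunk_markdown(record, max_chars=1000):
    """Split scraped Markdown at headings, then by size, keeping metadata."""
    # Break at lines that start a Markdown heading (e.g. "## Section").
    sections = re.split(r"\n(?=#{1,6} )", record["markdown"])
    chunks = []
    for section in sections:
        # Further split any overlong section into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append({
                "text": section[start:start + max_chars],
                "url": record["url"],
                "title": record["title"],
            })
    return chunks

record = {
    "url": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start | Crawlee",
    "markdown": "# Quick Start\n\nInstall Crawlee using npm...",
}
print(len(chunk_markdown(record)))  # -> 1 (short record, single chunk)
```

Each chunk carries its `url`, so retrieved passages can always be cited back to the original documentation page.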