# RAG Web Scraper: The Ultimate HTML-to-Markdown Converter for LLMs
Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines.

Most web scrapers return raw HTML soup or noisy text — LLMs don't need that. This project acts as a specialized filter that extracts only the meaningful content, removes boilerplate, and outputs LLM-ready Markdown plus structured JSON you can plug directly into your AI workflows (LangChain, LlamaIndex, Pinecone, etc.).
## ⚡ Key Features
| Feature | Description |
|---|---|
| 🧼 Clean Markdown | Removes navs, footers, ads, and cookie banners automatically. |
| 🧠 RAG Chunking | Splits content into token-sized chunks (default: 600) for Vector DBs. |
| 🐢/⚡ Hybrid Mode | Starts fast (Cheerio). Auto-switches to Playwright if it detects a React/Next.js SPA. |
| 💡 Q&A Optimized | Preserves context on StackOverflow/Discourse style pages (Question + Answer). |
| 💰 Cost Efficient | Hybrid engine keeps compute units low; you pay the usage fee only for successful results. |
## 🏆 Why RAG Web Scraper?
| Feature | RAG Web Scraper | Standard Scraper | Full Browser Scraper |
|---|---|---|---|
| Cost | 💰 Low (Hybrid) | 💰 Low | 💸 High |
| JS Support | ✅ Auto-detect | ❌ No | ✅ Yes |
| Output | 🧼 Clean Markdown | 🗑️ Raw HTML | 📄 Text/HTML |
| RAG Ready | ✅ Chunked JSON | ❌ No | ❌ No |
📉 The "Before & After" Test
Don't feed garbage to your AI. See the difference:
### 🔴 Standard Crawl (Raw HTML)

Contains ~50% noise: menus, scripts, footers.

```html
<nav>Home > Docs > API</nav>
<div class="cookie-banner">We use cookies! [Accept]</div>
<main>
  <h1>React Hooks Guide</h1>
  <div class="sidebar">Join our Discord!</div>
  <p>Hooks are a new addition in React 16.8.</p>
  <div class="ad-container">BUY COFFEE NOW</div>
</main>
<footer>© 2026 Meta Platforms, Inc.</footer>
```

Result: High token costs, potential hallucinations.
### 🟢 RAG Web Scraper (Markdown)

Contains 100% signal: the same page, reduced to its content.

```markdown
# React Hooks Guide

Hooks are a new addition in React 16.8.
```
Result: Cheap embedding, accurate answers.
## 💰 Pricing
$2.00 usage fee per 1,000 pages
We use a smart hybrid engine (Cheerio first) to keep compute costs aggressively low.
- Efficiency First: We attempt fast static extraction first.
- Power When Needed: We only launch a full browser (Playwright) if absolutely necessary.
- Fair Usage: You pay a small usage fee + standard compute units.
Why this model? It ensures you get the lowest possible price for simple sites, while guaranteeing capability for complex SPAs.
## 🚀 Usage

### 1. Simple Run

Perfect for testing or small docs.

```json
{
  "startUrl": "https://docs.python.org/3/",
  "maxPages": 20
}
```
### 2. Advanced Run (RAG Pipeline)

Optimized for Vector Databases.

```json
{
  "startUrl": "https://react.dev",
  "maxPages": 100,
  "includePaths": ["/learn/*"],
  "excludePaths": ["/community/*"],
  "chunkSize": 500,
  "outputFormat": "json",
  "enableChunking": true
}
```
## 🧠 Apify Run Options (Memory)
If you plan to scrape more than 20 pages in a single run, it's recommended to increase memory in the Apify Run options (e.g., 2–4 GB) to avoid timeouts and ensure stable crawling.
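To run the actor programmatically, here is a minimal sketch using the official Apify Python client; the actor ID and token are placeholders, so substitute your own:

```python
# Minimal sketch, assuming the official Apify Python client (pip install apify-client).
# "<username>/web-scraper-rag-ready" is a placeholder actor ID.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run = client.actor("<username>/web-scraper-rag-ready").call(
    run_input={
        "startUrl": "https://react.dev",
        "maxPages": 100,
        "includePaths": ["/learn/*"],
        "chunkSize": 500,
        "outputFormat": "json",
        "enableChunking": True,
    },
    memory_mbytes=4096,  # per the memory recommendation above for larger crawls
)
print("Dataset ID:", run["defaultDatasetId"])
```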
## ⚙️ Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `startUrl` | String | (Required) | The URL to start crawling from. |
| `maxPages` | Integer | 20 | Maximum number of pages to crawl. |
| `maxDepth` | Integer | 2 | How deep to follow links (0 = start page only). |
| `outputFormat` | String | json | `json`: structured RAG chunks + metadata. `markdown`: plain .md files. `both`: returns both formats. |
| `chunkSize` | Integer | 600 | Target chunk size in tokens. Ideal for embeddings. |
| `includePaths` | Array | [] | Only crawl URLs matching these patterns (e.g. /docs/*). |
| `excludePaths` | Array | [] | Skip URLs matching these patterns. |
| `enableChunking` | Boolean | true | Enable smart chunking. Disable for full-page Markdown only. |
| `stripReferences` | Boolean | true | Removes academic references/bibliography sections. |
| `usePlaywright` | Boolean | false | Force browser rendering (auto-detected by default). |
## 🛠️ Technical Details

### Smart Hybrid Crawling
We don't waste resources. The scraper starts in Fast Mode (Cheerio). If it detects a Single Page Application (React, Vue, Next.js), it automatically upgrades to Browser Mode (Playwright) to render the content correctly. You get the best of both worlds: speed when possible, power when needed.
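As a rough illustration only (not the actor's actual implementation), such SPA detection might check whether the static HTML carries almost no visible text while shipping framework bundles:

```python
# Hypothetical heuristic sketch of SPA detection; the thresholds and patterns
# are illustrative assumptions, not the actor's real logic.
import re
import urllib.request

def looks_like_spa(html: str) -> bool:
    # Estimate visible text by stripping scripts/styles and remaining tags.
    body = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    visible_words = len(re.sub(r"<[^>]+>", " ", body).split())
    # Look for bundle names typical of React/Vue/Next.js builds.
    has_bundles = bool(re.search(r'<script[^>]+src="[^"]*(chunk|bundle|_next)[^"]*\.js', html))
    return visible_words < 50 and has_bundles

html = urllib.request.urlopen("https://example.com").read().decode("utf-8", "replace")
engine = "playwright" if looks_like_spa(html) else "cheerio"
print("Selected engine:", engine)
```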
Q&A Intelligence
Most scrapers flatten forums into a wall of text. We detect Q&A structures (StackOverflow, Discourse) and preserve the relationship between the Question and the Accepted Answer, ensuring your RAG system understands the context.
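As a toy sketch of the idea (a hypothetical helper, not the actor's code), a Q&A page might be chunked like this:

```python
# Hypothetical sketch: emit the question and its accepted answer as a single
# chunk so retrieval never separates them.
def build_qa_chunk(question_title: str, question_body: str, accepted_answer: str) -> str:
    return (
        f"## Question: {question_title}\n\n"
        f"{question_body}\n\n"
        f"### Accepted Answer\n\n"
        f"{accepted_answer}"
    )
```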
### Noise Removal
We aggressively strip:
- Navigation bars & Mega-menus
- Footers & Legal disclaimers
- Cookie consent banners & Popups
- "Related Posts" widgets
- Academic References/Bibliographies
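A rough sketch of this kind of stripping, assuming BeautifulSoup and an illustrative selector list (the actor's actual rules are internal):

```python
# Hypothetical sketch of boilerplate stripping with BeautifulSoup
# (pip install beautifulsoup4); the selector list is illustrative only.
from bs4 import BeautifulSoup

NOISE_SELECTORS = [
    "nav", "footer", "aside",
    "[class*='cookie']", "[class*='banner']", "[class*='popup']",
    "[class*='related']", "[class*='ad-']",
]

def strip_noise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in NOISE_SELECTORS:
        for node in soup.select(selector):
            if not node.decomposed:  # skip nodes already removed with a parent
                node.decompose()
    return str(soup)
```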
## 📤 Output Formats

### JSON (Recommended for RAG)

Returns an array of objects with metadata and chunks.

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "markdown": "# Page Title\n\nContent...",
  "chunks": [
    { "content": "Chunk 1...", "tokens": 450 },
    { "content": "Chunk 2...", "tokens": 300 }
  ]
}
```
### Markdown Files (outputFormat: 'markdown' or 'both')

Returns a single Markdown file per page (or combined), perfect for archiving or direct LLM context. When `outputFormat` is set to `markdown` or `both`, the full Markdown files are stored in the Apify Key-Value Store.

Note: The default Apify dataset only contains the JSON results. To get the actual `.md` files:

- Go to the Key-Value Store tab in your Apify run.
- Look for keys like `OUTPUT` or page-specific keys.
- If using the API, target the Key-Value Store endpoint to download these raw files directly.
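Via the API, that looks roughly like this with the Apify Python client (store ID and token are placeholders; key names depend on the run):

```python
# Minimal sketch: list and download records from the run's Key-Value Store.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")
store = client.key_value_store("<key-value-store-id>")

for key_info in store.list_keys()["items"]:
    record = store.get_record(key_info["key"])
    if record is not None:
        print(key_info["key"], "->", len(str(record["value"])), "chars")
```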
## 🙋 FAQ

Q: Does it work on sites behind login?
A: No; it's currently designed for public documentation and content sites.

Q: How do you count pages?
A: Only successfully scraped pages count. If a page fails or is skipped, you aren't charged.

Q: Can I use this with LangChain?
A: Yes! The JSON output is designed to be directly loaded into LangChain's ApifyDatasetLoader.
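For reference, a minimal loading sketch, assuming `langchain-community` and `apify-client` are installed, `APIFY_API_TOKEN` is set in the environment, and the dataset ID placeholder is replaced:

```python
# Minimal sketch: map each dataset item to a LangChain Document.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="<dataset-id>",
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"],
        metadata={"source": item["url"], "title": item["title"]},
    ),
)
docs = loader.load()  # ready for splitting, embedding, or retrieval
```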