
RAG Web Scraper: The Ultimate HTML-to-Markdown Converter for LLMs

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines.

Clean Markdown vs Raw HTML

Most web scrapers return raw HTML soup or noisy text — LLMs don't need that. This project acts as a specialized filter that extracts only the meaningful content, removes boilerplate, and outputs LLM-ready Markdown plus structured JSON you can plug directly into your AI workflows (LangChain, LlamaIndex, Pinecone, etc.).


⚡ Key Features

| Feature | Description |
| --- | --- |
| 🧼 Clean Markdown | Removes navs, footers, ads, and cookie banners automatically. |
| 🧠 RAG Chunking | Splits content into token-sized chunks (default: 600) for vector DBs. |
| 🐢/⚡ Hybrid Mode | Starts fast (Cheerio). Auto-switches to Playwright if it detects a React/Next.js SPA. |
| 💡 Q&A Optimized | Preserves context on StackOverflow/Discourse-style pages (Question + Answer). |
| 💰 Efficient Cost | Hybrid engine keeps compute units low. Pay the usage fee only for results. |

🏆 Why RAG Web Scraper?

| Feature | RAG Web Scraper | Standard Scraper | Full Browser Scraper |
| --- | --- | --- | --- |
| Cost | 💰 Low (Hybrid) | 💰 Low | 💸 High |
| JS Support | ✅ Auto-detect | ❌ No | ✅ Yes |
| Output | 🧼 Clean Markdown | 🗑️ Raw HTML | 📄 Text/HTML |
| RAG Ready | ✅ Chunked JSON | ❌ No | ❌ No |

📉 The "Before & After" Test

Don't feed garbage to your AI. See the difference:

🔴 Standard Crawl (Raw HTML)

Contains ~50% noise: menus, scripts, footers.

```html
<nav>Home > Docs > API</nav>
<div class="cookie-banner">We use cookies! [Accept]</div>
<main>
  <h1>React Hooks Guide</h1>
  <div class="sidebar">Join our Discord!</div>
  <p>Hooks are a new addition in React 16.8.</p>
  <div class="ad-container">BUY COFFEE NOW</div>
</main>
<footer>© 2026 Meta Platforms, Inc.</footer>
```

Result: High token costs, potential hallucinations.

🟢 RAG Web Scraper (Markdown)

Contains 100% signal.

```markdown
# React Hooks Guide

Hooks are a new addition in React 16.8.
```

Result: Cheap embeddings, accurate answers.


💰 Pricing

$2.00 usage fee per 1,000 pages

We use a smart hybrid engine (Cheerio first) to keep compute costs aggressively low.

  • Efficiency First: We attempt fast static extraction first.
  • Power When Needed: We only launch a full browser (Playwright) if absolutely necessary.
  • Fair Usage: You pay a small usage fee + standard compute units.

Why this model? It ensures you get the lowest possible price for simple sites, while guaranteeing capability for complex SPAs.


🚀 Usage

1. Simple Run

Perfect for testing or small docs.

```json
{
  "startUrl": "https://docs.python.org/3/",
  "maxPages": 20
}
```

2. Advanced Run (RAG Pipeline)

Optimized for Vector Databases.

```json
{
  "startUrl": "https://react.dev",
  "maxPages": 100,
  "includePaths": ["/learn/*"],
  "excludePaths": ["/community/*"],
  "chunkSize": 500,
  "outputFormat": "json",
  "enableChunking": true
}
```

🧠 Apify Run Options (Memory)

If you plan to scrape more than 20 pages in a single run, it's recommended to increase memory in the Apify Run options (e.g., 2–4 GB) to avoid timeouts and ensure stable crawling.
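
Prefer to launch runs from code? Here is a minimal sketch using the official apify-client Python package. The token and actor ID are placeholders (use the ID shown on this actor's Store page); the input mirrors the advanced example above, and memory_mbytes applies the memory recommendation.

```python
# pip install apify-client
# Minimal sketch: start a run from Python. <YOUR_APIFY_TOKEN> and <ACTOR_ID>
# are placeholders -- use the actor ID shown on this actor's Store page.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("<ACTOR_ID>").call(
    run_input={
        "startUrl": "https://react.dev",
        "maxPages": 100,
        "includePaths": ["/learn/*"],
        "excludePaths": ["/community/*"],
        "chunkSize": 500,
        "outputFormat": "json",
        "enableChunking": True,
    },
    memory_mbytes=4096,  # 4 GB, per the recommendation above for larger crawls
)

# Each dataset item is one scraped page (see Output Formats below).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("chunks", [])))
```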

⚙️ Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | String | (Required) | The URL to start crawling from. |
| maxPages | Integer | 20 | Maximum number of pages to crawl. |
| maxDepth | Integer | 2 | How deep to follow links (0 = start page only). |
| outputFormat | String | json | json: structured RAG chunks + metadata. markdown: plain .md files. both: returns both formats. |
| chunkSize | Integer | 600 | Target size for chunks in tokens (illustrated below). Ideal for embeddings. |
| includePaths | Array | [] | Only crawl URLs matching these patterns (e.g. /docs/*). |
| excludePaths | Array | [] | Skip URLs matching these patterns. |
| enableChunking | Boolean | true | Enable smart chunking. Disable for full-page Markdown only. |
| stripReferences | Boolean | true | Removes academic references/bibliography sections. |
| usePlaywright | Boolean | false | Force browser rendering (auto-detected by default). |
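
For illustration only, here is what token-based splitting looks like. This sketch assumes tiktoken's cl100k_base encoding; the actor's smart chunking may choose different boundaries.

```python
# pip install tiktoken
# Illustration of token-based splitting. Assumes the cl100k_base encoding;
# the actor's smart chunking may choose different boundaries.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, chunk_size: int = 600) -> list[str]:
    """Split text into chunks of at most chunk_size tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

chunks = chunk_by_tokens("# Page Title\n\n" + "Content... " * 1000)
print([len(enc.encode(c)) for c in chunks])  # e.g. [600, 600, ...]
```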

🛠️ Technical Details

Smart Hybrid Crawling

We don't waste resources. The scraper starts in Fast Mode (Cheerio). If it detects a Single Page Application (React, Vue, Next.js), it automatically upgrades to Browser Mode (Playwright) to render the content correctly. You get the best of both worlds: speed when possible, power when needed.
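
For illustration, here is a simplified sketch of the hybrid idea. It is not the actor's actual detection code, and the SPA heuristics below are assumptions.

```python
# pip install httpx beautifulsoup4 playwright  (then: playwright install chromium)
# Sketch of the hybrid idea: cheap static fetch first, full browser only when
# the static HTML looks like an unrendered SPA shell. Heuristics are assumptions.
import httpx
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def looks_like_spa(html: str) -> bool:
    soup = BeautifulSoup(html, "html.parser")
    body_text = soup.body.get_text(strip=True) if soup.body else ""
    # Nearly empty <body> plus framework markers = probably client-rendered.
    markers = ("__NEXT_DATA__" in html, 'id="root"' in html, 'id="app"' in html)
    return len(body_text) < 200 and any(markers)

def fetch(url: str) -> str:
    html = httpx.get(url, follow_redirects=True).text  # fast path (static)
    if not looks_like_spa(html):
        return html
    with sync_playwright() as p:  # slow path (full rendering)
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```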

Q&A Intelligence

Most scrapers flatten forums into a wall of text. We detect Q&A structures (StackOverflow, Discourse) and preserve the relationship between the Question and the Accepted Answer, ensuring your RAG system understands the context.
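
Conceptually, that means emitting the question and its accepted answer as one unit rather than as unrelated text blocks. A simplified sketch (the CSS selectors are illustrative only; real Q&A sites use different markup):

```python
# Sketch: keep the question and its accepted answer together as one unit.
# The CSS selectors are illustrative; real Q&A sites use different markup.
from bs4 import BeautifulSoup

def extract_qa(html: str) -> dict[str, str]:
    soup = BeautifulSoup(html, "html.parser")
    question = soup.select_one(".question")
    # Prefer the accepted answer; fall back to the first answer.
    answer = soup.select_one(".answer.accepted-answer") or soup.select_one(".answer")
    return {
        "question": question.get_text(" ", strip=True) if question else "",
        "answer": answer.get_text(" ", strip=True) if answer else "",
    }

# One chunk per Q&A pair keeps the retrieval context intact.
```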

Noise Removal

We aggressively strip the following (see the sketch after this list):

  • Navigation bars & Mega-menus
  • Footers & Legal disclaimers
  • Cookie consent banners & Popups
  • "Related Posts" widgets
  • Academic References/Bibliographies
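
For illustration, the stripping step boils down to something like this. The selectors are simplified examples, not the actor's full heuristics.

```python
# Sketch of boilerplate stripping with CSS selectors. The actual actor uses
# broader heuristics; these selectors are illustrative.
from bs4 import BeautifulSoup

NOISE_SELECTORS = [
    "nav", "footer", "aside",
    ".cookie-banner", ".ad-container", ".sidebar",
    "[class*='related-posts']",
]

def strip_noise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in NOISE_SELECTORS:
        for tag in soup.select(selector):
            tag.decompose()  # remove the element and its contents
    return str(soup)
```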

📤 Output Formats

Returns an array of objects with metadata and chunks.

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "markdown": "# Page Title\n\nContent...",
  "chunks": [
    { "content": "Chunk 1...", "tokens": 450 },
    { "content": "Chunk 2...", "tokens": 300 }
  ]
}
```

Markdown Files (outputFormat: 'markdown' or 'both')

When outputFormat is set to markdown or both, the full Markdown files are stored in the Apify Key-Value Store, one file per page (or combined), perfect for archiving or direct LLM context.

Note: The default Apify dataset only contains the JSON results. To get the actual .md files:

  1. Go to the Key-Value Store tab in your Apify run.
  2. Look for keys like OUTPUT or page-specific keys.
  3. If using the API, target the Key-Value Store endpoint to download these raw files directly (see the sketch below).
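
For example, with the Python client. Key names are whatever the actor wrote (e.g. OUTPUT or per-page keys), so we list them rather than guess; treating string records as Markdown is an assumption.

```python
# pip install apify-client
# Sketch: download raw Markdown records from the run's default Key-Value Store.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.actor("<ACTOR_ID>").last_run().get()  # or reuse the dict from .call()
store = client.key_value_store(run["defaultKeyValueStoreId"])

for key_info in store.list_keys()["items"]:
    key = key_info["key"]
    record = store.get_record(key)
    if record and isinstance(record["value"], str):  # assume text records are Markdown
        with open(f"{key.replace('/', '_')}.md", "w", encoding="utf-8") as f:
            f.write(record["value"])
```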



🙋 FAQ

Q: Does it work on sites behind a login? A: It's currently designed for public documentation and content sites.

Q: How do you count pages? A: Only successfully scraped pages count. If a page fails or is skipped, you aren't charged.

Q: Can I use this with LangChain? A: Yes! The JSON output is designed to be directly loaded into LangChain's ApifyDatasetLoader.
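
For example, a minimal sketch (assuming apify-client is installed and the APIFY_API_TOKEN environment variable is set; the field names match the JSON output above):

```python
# pip install langchain-community apify-client
# Sketch: map each dataset item from a finished run to a LangChain Document.
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

loader = ApifyDatasetLoader(
    dataset_id="<YOUR_RUN_DATASET_ID>",  # placeholder: the run's default dataset ID
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"],
        metadata={"source": item["url"], "title": item["title"]},
    ),
)
docs = loader.load()
print(len(docs), "documents ready for embedding")
```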