rag-docs-scraper

Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.

Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Hastin S. (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 3 days ago

AI Documentation & RAG Scraper 🤖📄

The AI Documentation & RAG Scraper is a high-performance tool designed to transform messy technical documentation into clean, structured Markdown. It is specifically optimized for RAG (Retrieval-Augmented Generation) pipelines, LLM fine-tuning, and AI agents.

Stop feeding your AI noisy HTML. Get the clean text you need, instantly.


✨ Key Features

  • Markdown Optimized: Automatically converts HTML to clean Markdown while preserving headers, code blocks, and tables.
  • Noise Removal: Smartly identifies and strips out navbars, footers, sidebars, and cookie banners to focus only on the content.
  • Modern Web Support: Powered by Playwright, it easily handles JavaScript-heavy documentation sites (React, Docusaurus, GitBook, Next.js).
  • Recursive Crawling: Provide a homepage, and the scraper will automatically follow internal links to map out the entire documentation set.
  • AI-Agent Ready: Output is structured perfectly for Vector Databases (Pinecone, Weaviate) or direct upload to ChatGPT/Claude.
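The noise-removal idea behind these features can be sketched with Python's standard library alone. This is only an illustration of the approach, not the Actor's implementation (which uses Playwright and a fuller HTML-to-Markdown conversion); the tag list and heading mapping below are assumptions for the sketch.

```python
from html.parser import HTMLParser

# Illustrative choices: which elements count as "noise" and how
# headings map to Markdown prefixes.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
HEADING_TAGS = {"h1": "#", "h2": "##", "h3": "###"}

class DocCleaner(HTMLParser):
    """Strips navigation/footer noise and emits Markdown-ish text."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0   # > 0 while inside a noise element
        self.prefix = ""      # heading prefix for the current text run

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS:
            self.prefix = HEADING_TAGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)

def to_markdown(html: str) -> str:
    cleaner = DocCleaner()
    cleaner.feed(html)
    return "\n\n".join(cleaner.out)

html = "<nav>Home | Docs</nav><h1>Quick Start</h1><p>Install it.</p><footer>© 2026</footer>"
print(to_markdown(html))  # -> "# Quick Start\n\nInstall it."
```

Everything inside `<nav>`, `<footer>`, and similar elements is dropped, while headings and body text survive as Markdown lines.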

🚀 How to Use

  1. Input URLs: Enter the starting URL of the documentation you want to scrape (e.g., https://docs.apify.com/).
  2. Set Page Limit: Define how many pages you want to crawl to stay within your budget.
  3. Run & Download: Start the Actor and download your results in JSON, CSV, or Excel.
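The steps above can also be driven programmatically. A minimal sketch, assuming the usual Apify input conventions: the Actor id placeholder and the field names (`startUrls`, `maxPages`, `proxyConfiguration`) are illustrative here, so check the Actor's actual input schema before running.

```python
def build_run_input(start_urls, max_pages=50, use_proxy=True):
    """Assemble a run input matching the steps above (field names assumed)."""
    run_input = {
        "startUrls": [{"url": u} for u in start_urls],
        "maxPages": max_pages,
    }
    if use_proxy:
        run_input["proxyConfiguration"] = {"useApifyProxy": True}
    return run_input

# With the official apify-client package (network call, shown for context only):
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<username>/rag-docs-scraper").call(
#     run_input=build_run_input(["https://docs.apify.com/"], max_pages=20)
# )
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["url"], item["title"])

print(build_run_input(["https://docs.apify.com/"], max_pages=20))
```

The `.call()` / `iterate_items()` pattern blocks until the run finishes and then streams the dataset, which is convenient for small crawls.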

🛠️ Input Configuration

| Field | Type | Description |
| --- | --- | --- |
| Start URLs | Array | The entry points for the crawl. Supports multiple URLs. |
| Max Pages | Integer | The maximum number of pages to crawl (default: 50). |
| Proxy | Object | Uses Apify Proxy to ensure high success rates and avoid rate limits. |

📊 Sample Output

````json
{
  "url": "https://crawlee.dev/docs/quick-start",
  "title": "Quick Start | Crawlee",
  "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n```bash\nnpm install crawlee playwright\n```",
  "scrapedAt": "2026-05-07T12:00:00Z"
}
````
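Records in this shape slot directly into a RAG pipeline. A minimal sketch of splitting the `markdown` field into chunks for a vector database, keeping the source URL and title as metadata; the heading-based split and the chunk size are illustrative choices, not part of the Actor's output.

```python
import re

def chunk_markdown(record, max_chars=1000):
    """Split scraped Markdown at headings, then by size, keeping metadata."""
    # Break at lines that start a Markdown heading (e.g. "## Section").
    sections = re.split(r"\n(?=#{1,6} )", record["markdown"])
    chunks = []
    for section in sections:
        # Further split any overlong section into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append({
                "text": section[start:start + max_chars],
                "url": record["url"],
                "title": record["title"],
            })
    return chunks

record = {
    "url": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start | Crawlee",
    "markdown": "# Quick Start\n\nInstall Crawlee using npm...",
}
print(len(chunk_markdown(record)))  # -> 1 (short record, single chunk)
```

Each chunk carries its `url`, so retrieved passages can always be cited back to the original documentation page.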