rag-docs-scraper
Extract clean, RAG-optimized Markdown from any technical documentation. Built for LLMs and AI agents. No noise, just high-fidelity data.
Pricing: Pay per usage
Rating: 0.0 (0)
Developer: Hastin S. (Maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 3 days ago
AI Documentation & RAG Scraper 🤖📄
The AI Documentation & RAG Scraper is a high-performance tool designed to transform messy technical documentation into clean, structured Markdown. It is specifically optimized for RAG (Retrieval-Augmented Generation) pipelines, LLM fine-tuning, and AI agents.
Stop feeding your AI noisy HTML. Get the clean text you need, instantly.
✨ Key Features
- Markdown Optimized: Automatically converts HTML to clean Markdown while preserving headers, code blocks, and tables.
- Noise Removal: Identifies and strips navbars, footers, sidebars, and cookie banners so that only the page content remains.
- Modern Web Support: Powered by Playwright, it renders JavaScript-heavy documentation sites (React, Docusaurus, GitBook, Next.js) before extracting content.
- Recursive Crawling: Provide a homepage, and the scraper follows internal links to map out the documentation set, up to the configured page limit.
- AI-Agent Ready: Each record carries the page URL, title, Markdown body, and scrape timestamp, ready to load into a vector database (Pinecone, Weaviate) or upload directly to ChatGPT/Claude.
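To make the recursive crawling idea concrete, here is a minimal sketch of a breadth-first walk that stays on the starting host and stops at a page limit. It is an illustration only, not the Actor's actual implementation: `crawl_order`, `is_internal`, the `get_links` callback, and the `docs.example.com` pages are all hypothetical names invented for this example.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def is_internal(link: str, start_url: str) -> bool:
    """Return True when the link is on the same host as the start URL."""
    return urlparse(link).netloc == urlparse(start_url).netloc


def crawl_order(start_url, get_links, max_pages=50):
    """Breadth-first walk over internal links, capped at max_pages.

    get_links is a hypothetical callback that returns the hrefs found on a page.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so each page is queued once.
            absolute, _ = urldefrag(urljoin(url, href))
            if is_internal(absolute, start_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" standing in for real pages.
SITE = {
    "https://docs.example.com/": ["/intro", "/api#top", "https://other.example.org/"],
    "https://docs.example.com/intro": ["/api", "/"],
    "https://docs.example.com/api": [],
}


def site_links(url):
    return SITE.get(url, [])


pages = crawl_order("https://docs.example.com/", site_links, max_pages=2)
print(pages)
```

With `max_pages=2` the walk stops after the homepage and the first internal link; the external link is never queued.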
🚀 How to Use
- Input URLs: Enter the starting URL of the documentation you want to scrape (e.g., https://docs.apify.com/).
- Set Page Limit: Define how many pages you want to crawl to stay within your budget.
- Run & Download: Start the Actor and download your results in JSON, CSV, or Excel.
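The same steps can also be driven programmatically. The sketch below builds (but does not send) a request to the Apify API's `run-sync-get-dataset-items` endpoint, which runs an Actor and returns its dataset items in the requested format (`json`, `csv`, or `xlsx` for Excel). The Actor ID `YOUR_USERNAME~rag-docs-scraper`, the token, and the input keys `startUrls` and `maxPages` are placeholders and assumptions; check the Actor's input schema for the real field names.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input, fmt="json"):
    """Build a POST request that runs an Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"format": fmt})
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder Actor ID and token; the input keys are assumed, not confirmed.
request = build_run_request(
    actor_id="YOUR_USERNAME~rag-docs-scraper",
    token="YOUR_APIFY_TOKEN",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}], "maxPages": 50},
    fmt="csv",
)
print(request.get_method(), request.full_url)

# To actually run it (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```

A synchronous run suits small page limits; for larger crawls, starting the run asynchronously and fetching the dataset afterwards is likely the safer choice.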
🛠️ Input Configuration
| Field | Type | Description |
|---|---|---|
| Start URLs | Array | The entry points for the crawl. Supports multiple URLs. |
| Max Pages | Integer | The maximum number of pages to crawl (default: 50). |
| Proxy | Object | Routes requests through Apify Proxy to improve success rates and reduce the chance of hitting rate limits. |
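As a rough illustration of how the three fields might fit together, the sketch below assembles and validates an input object. The table gives only display names, so the JSON keys `startUrls`, `maxPages`, and `proxyConfiguration` are assumptions based on common Apify conventions, and `build_input` is a hypothetical helper; only the default of 50 pages comes from the table.

```python
from urllib.parse import urlparse


def build_input(start_urls, max_pages=50, use_apify_proxy=True):
    """Assemble an Actor input dict; key names are assumed, not confirmed."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    for url in start_urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if max_pages < 1:
        raise ValueError("max_pages must be a positive integer")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


actor_input = build_input(["https://docs.apify.com/"])
print(actor_input)
```

Validating before the run starts avoids paying for a crawl that would fail on a malformed URL.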
📊 Sample Output
{
  "url": "https://crawlee.dev/docs/quick-start",
  "title": "Quick Start | Crawlee",
  "markdown": "# Quick Start\n\nInstall Crawlee using npm...\n\n```bash\nnpm install crawlee playwright\n```",
  "scrapedAt": "2026-05-07T12:00:00Z"
}
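One way to prepare a record like this for a vector database is to split its Markdown into heading-delimited chunks that keep the source URL as metadata. The sketch below is an assumption about downstream use, not part of the Actor: `chunk_record` is a hypothetical helper, and the record extends the sample above with an invented "Next steps" section and a shell comment to show that `#` lines inside code fences are not treated as headings.

```python
import re

FENCE = "`" * 3  # built this way so the example contains no literal code fence
HEADING = re.compile(r"#{1,6}\s")


def chunk_record(record):
    """Split one record's Markdown into chunks, one per heading."""
    chunks = []
    heading, lines, in_code = record["title"], [], False

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"url": record["url"], "heading": heading, "text": text})

    for line in record["markdown"].split("\n"):
        if line.startswith(FENCE):
            in_code = not in_code  # entering or leaving a fenced code block
        if HEADING.match(line) and not in_code:
            flush()  # close the previous chunk before starting a new one
            heading, lines = line.lstrip("#").strip(), []
        lines.append(line)
    flush()
    return chunks


record = {
    "url": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start | Crawlee",
    "markdown": (
        "# Quick Start\n\nInstall Crawlee using npm...\n\n"
        + FENCE + "bash\n# install both packages\nnpm install crawlee playwright\n" + FENCE
        + "\n\n## Next steps\n\nRead the guides."  # invented section for illustration
    ),
    "scrapedAt": "2026-05-07T12:00:00Z",
}

chunks = chunk_record(record)
for chunk in chunks:
    print(chunk["heading"], len(chunk["text"]))
```

Each chunk can then be embedded and stored with its `url` and `heading`, so retrieved passages can be traced back to the page they came from.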