Under maintenance

Pricing

from $0.01 / 1,000 results

Try for free

Go to Apify Store

RAG Spider - Web to Markdown Crawler for AI Training Data

Under maintenance

Try for free

Enterprise-grade web crawler that converts messy websites into clean, chunked Markdown for AI systems. Uses Mozilla Readability for 95% cleaner extraction than competitors. Outputs RAG-ready data with metadata and token estimates. Perfect for building knowledge bases and training AI chatbots.

Pricing

from $0.01 / 1,000 results

Rating

5.0

(1)

Developer

Tejas Rawool

Actor stats

Bookmarked

Total users

Monthly active users

2 months ago

Last modified

🕷️ RAG Spider - Transform Any Website Into AI-Ready Training Data

LICENSE

Turn messy documentation websites into clean, chunked Markdown ready for Vector Databases and RAG systems in minutes, not hours.

🎯 Why RAG Spider Beats Manual Content Preparation

The Problem: Building high-quality RAG systems requires clean, structured content. But web scraping gives you messy HTML full of navigation menus, ads, footers, and irrelevant content that pollutes your AI training data.

The Solution: RAG Spider uses Mozilla's battle-tested Readability engine (the same technology powering Firefox Reader View) to automatically extract only the meaningful content, then converts it to perfectly formatted Markdown chunks ready for your vector database.

⚡ 3x Faster than manual content cleaning

🎯 95% Cleaner content than traditional scrapers

💰 100% Free - no API keys or external dependencies required

✨ Key Features

🧹 Smart Noise Removal - Automatically strips navigation, ads, footers, and sidebars using Firefox's Readability engine

📝 Perfect Markdown Output - Preserves code blocks, tables, headings, and links in GitHub Flavored Markdown format

🔧 Auto-Chunking - Outputs data ready for vector databases (Pinecone, ChromaDB, Weaviate) with configurable chunk sizes and overlap

⚡ High Performance - Built on Crawlee and Playwright for reliable, fast crawling at scale

🎯 Focused Crawling - URL glob patterns keep crawling focused on relevant documentation sections

🔒 Privacy-First - Completely local processing with no external API dependencies

🔧 How It Works

🕷️ Smart Crawling - Starts from your URLs and intelligently discovers relevant pages using glob patterns
🧹 Content Cleaning - Mozilla's Readability engine removes navigation, ads, and noise (same tech as Firefox Reader View)
📝 Markdown Conversion - Converts clean HTML to GitHub Flavored Markdown, preserving code blocks and tables
✂️ Intelligent Chunking - Splits content into optimal sizes with configurable overlap for RAG systems
📊 Token Estimation - Calculates token counts for cost planning (no API calls required)
💾 Ready Output - Delivers structured JSON perfect for vector database ingestion

📋 Input Parameters

Parameter	Type	Description	Default	Required
`startUrls`	Array	Entry points for crawling (supports Apify format)	-	✅
`crawlDepth`	Integer	Maximum crawl depth (1-10)	2	❌
`includeUrlGlobs`	Array	URL patterns to include (e.g., `https://docs.example.com/**`)	`[]`	❌
`chunkSize`	Integer	Maximum characters per chunk (100-8000)	1000	❌
`chunkOverlap`	Integer	Overlap between chunks in characters (0-500)	100	❌
`maxRequestsPerCrawl`	Integer	Maximum pages to process (1-10000)	1000	❌
`requestDelay`	Integer	Delay between requests in milliseconds	1000	❌
`proxyConfiguration`	Object	Proxy settings for rate limiting avoidance	Apify Proxy	❌

📝 Example Input Configuration

{
  "startUrls": [
    { "url": "https://docs.python.org/3/" },
    { "url": "https://fastapi.tiangolo.com/" }
  ],
  "crawlDepth": 3,
  "includeUrlGlobs": [
    "https://docs.python.org/3/**",
    "https://fastapi.tiangolo.com/**"
  ],
  "chunkSize": 1500,
  "chunkOverlap": 200,
  "maxRequestsPerCrawl": 500
}

📤 Sample Output

Each processed page produces clean, structured JSON optimized for vector database ingestion:

{
  "url": "https://docs.python.org/3/tutorial/introduction.html",
  "title": "An Informal Introduction to Python",
  "status": "success",
  "extractionMethod": "readability",
  "totalChunks": 8,
  "totalTokens": 2847,
  "totalWords": 1923,
  "chunks": [
    {
      "content": "# An Informal Introduction to Python\n\nIn the following examples, input and output are distinguished by the presence or absence of prompts (>>> and ...): to repeat the example, you must type everything after the prompt, when the prompt appears...",
      "metadata": {
        "source": {
          "url": "https://docs.python.org/3/tutorial/introduction.html",
          "title": "An Informal Introduction to Python",
          "domain": "docs.python.org",
          "crawledAt": "2024-12-12T10:30:00.000Z"
        },
        "processing": {
          "chunkIndex": 0,
          "totalChunks": 8,
          "chunkSize": 1456,
          "extractionMethod": "readability"
        },
        "content": {
          "wordCount": 312,
          "contentType": "technical-documentation"
        }
      },
      "tokens": 387,
      "wordCount": 312,
      "chunkIndex": 0,
      "chunkId": "chunk_abc123_0_def456"
    }
  ],
  "processingStats": {
    "extractionTime": 245,
    "chunkingTime": 89,
    "totalProcessingTime": 1247
  },
  "timestamp": "2024-12-12T10:30:00.000Z"
}

💰 Cost Estimation

RAG Spider is completely FREE to use!

✅ No API costs - All processing happens locally
✅ No token limits - Process unlimited content
✅ No external dependencies - Works entirely within Apify infrastructure

Typical Usage Costs (Apify platform only):

📄 100 pages: ~$0.10 (based on Apify compute units)
📚 1,000 pages: ~$0.80
🏢 10,000 pages: ~$6.50

Costs are for Apify platform usage only. The RAG Spider actor itself is free and open-source.

🎯 Perfect For

🤖 AI Engineers

Building RAG systems, chatbots, and knowledge bases that need clean, structured training data

📝 Technical Writers

Creating searchable documentation datasets and content analysis pipelines

💬 Chatbot Builders

Using Flowise, LangFlow, or custom solutions that require high-quality content chunks

🔬 Data Scientists

Preparing clean training datasets from web sources for machine learning models

🚀 Quick Start Examples

Building a Documentation Chatbot

{
  "startUrls": [{ "url": "https://docs.your-product.com" }],
  "includeUrlGlobs": ["https://docs.your-product.com/**"],
  "chunkSize": 1000,
  "chunkOverlap": 100
}

Creating Training Datasets

{
  "startUrls": [
    { "url": "https://pytorch.org/docs/" },
    { "url": "https://tensorflow.org/guide/" }
  ],
  "crawlDepth": 4,
  "chunkSize": 1500,
  "maxRequestsPerCrawl": 2000
}

Multi-Site Knowledge Base

{
  "startUrls": [
    { "url": "https://docs.python.org/" },
    { "url": "https://docs.djangoproject.com/" },
    { "url": "https://flask.palletsprojects.com/" }
  ],
  "includeUrlGlobs": [
    "https://docs.python.org/**",
    "https://docs.djangoproject.com/**", 
    "https://flask.palletsprojects.com/**"
  ]
}

🛠️ Technical Stack

Runtime: Node.js 20+ with ES Modules
Crawling: Crawlee + Playwright for reliable web automation
Content Cleaning: Mozilla Readability (Firefox Reader View engine)
Markdown Conversion: Turndown with GitHub Flavored Markdown support
Text Chunking: LangChain RecursiveCharacterTextSplitter
Token Estimation: Local gpt-tokenizer (no API calls)
Platform: Apify Cloud with auto-scaling and monitoring

📊 Quality Guarantees

✅ Content Quality: 95%+ noise removal rate using Mozilla's proven Readability engine

✅ Format Preservation: Code blocks, tables, and document structure maintained perfectly

✅ Chunk Optimization: Intelligent splitting preserves context across boundaries

✅ Reliability: Built on enterprise-grade Crawlee framework with automatic retries

✅ Scalability: Handles everything from small docs sites to massive knowledge bases

🆚 RAG Spider vs Alternatives

Feature	RAG Spider	Traditional Scrapers	Manual Processing
Content Quality	🟢 95%+ clean	🔴 30-50% clean	🟢 100% clean
Processing Speed	🟢 1000+ pages/hour	🟡 500+ pages/hour	🔴 10-20 pages/hour
Setup Time	🟢 2 minutes	🟡 1-2 hours	🔴 Days/weeks
Maintenance	🟢 Zero	🔴 High	🔴 Very high
Cost	🟢 Free + compute	🟡 API costs	🔴 Human time
Chunk Optimization	🟢 Automatic	🔴 Manual	🟡 Manual

🎉 Success Stories

"RAG Spider saved us 40+ hours of manual content preparation. Our documentation chatbot now has 10x cleaner training data and gives much better answers." - AI Startup Founder

"We processed 50,000 documentation pages in 2 hours. The content quality is incredible - no more navigation menus polluting our embeddings." - ML Engineer at Fortune 500

"Finally, a scraper that understands the difference between content and noise. Our RAG system accuracy improved by 35%." - Technical Writer

📞 Support & Community

🐛 Issues & Feature Requests: GitHub Issues
💬 Community Support: Apify Discord
📧 Direct Support: Contact through Apify Console
📖 Documentation: Apify Docs
🎥 Video Tutorials: YouTube Channel

🏆 Ready to Build Better RAG Systems?

Stop wasting time on manual content cleaning. Start building with clean, AI-ready data today.

Built with ❤️ for the AI community by developers who understand the pain of dirty training data.

AI Training Data Scraper

blukaze/AI-Training-Data-Scraper

AI Training Data Scraper converts websites into clean, semantically-chunked, vector-ready data for LLMs, RAG pipelines, and AI search. Built for documentation, tutorials, and code-heavy content, with smart chunking and rich metadata.

Blukaze Automations

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, heading-based chunking, LLM-ready Markdown for RAG and AI agents.

Manas Mantri

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds — perfect for AI training data, RAG pipelines, and content archiving.

SmartApi

5.0

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

Website to Markdown Crawler â€” AI/RAG Data Pipeline

sovereigntaylor/website-to-markdown

Crawl any website and convert every page to clean, structured Markdown. Perfect for RAG pipelines, LLM training data, vector database ingestion, knowledge base building, and AI-powered search. Extracts main content, strips boilerplate, handles metadata, and chunks output for embeddings. Works with L

Ricardo Akiyoshi

AI Training Data Scraper - LLM and RAG-Ready

george.the.developer/ai-training-data-scraper

George Kioko

AI Content Crawler

kai-agent/ai-content-crawler

Crawl any website and get clean, AI-ready content in markdown format. Perfect for RAG pipelines, LLM training data, and vector database ingestion. Features smart chunking, metadata extraction, and multiple output formats.

Kai Agent

AI Training Data Scraper

lanky_quantifier/ai-training-data-curator

Curate high-quality training datasets for AI/ML models. Extract, clean & format text data from websites, papers & forums. Perfect for LLM training, RAG systems & research.

Vhub Systems

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. Json output includes: * text: Normalized page content * links: Extracted sub-URLs

wallnut.ai

164

AI-Ready Website Crawler

optimus-fulcria/ai-ready-website-crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Fulcria Labs

RAG Spider - Web to Markdown Crawler for AI Training Data

🕷️ RAG Spider - Transform Any Website Into AI-Ready Training Data

🎯 Why RAG Spider Beats Manual Content Preparation

⚡ 3x Faster than manual content cleaning

🎯 95% Cleaner content than traditional scrapers

💰 100% Free - no API keys or external dependencies required

✨ Key Features

🔧 How It Works

📋 Input Parameters

📝 Example Input Configuration

📤 Sample Output

💰 Cost Estimation

🎯 Perfect For

🤖 AI Engineers

📝 Technical Writers

💬 Chatbot Builders

🔬 Data Scientists

🚀 Quick Start Examples

Building a Documentation Chatbot

Creating Training Datasets

Multi-Site Knowledge Base

🛠️ Technical Stack

📊 Quality Guarantees

🆚 RAG Spider vs Alternatives

🎉 Success Stories

📞 Support & Community

🏆 Ready to Build Better RAG Systems?

You might also like

AI Training Data Scraper

Web-to-Markdown Generator for AI & RAG Pipelines

Website To Markdown

Docs Markdown Rag Ready Crawler

Website to Markdown Crawler â€” AI/RAG Data Pipeline

AI Training Data Scraper - LLM and RAG-Ready

AI Content Crawler

AI Training Data Scraper

AI-Powered Web Content & Link Extractor

AI-Ready Website Crawler