Pricing

from $0.20 / 1,000 pages scraped

Docs-to-RAG AI Crawler

Stop wasting space on website headers, footers, cookie banners, and navigation menus. Extract clean body text, chunk it for RAG, and detect page changes across runs when crawling public docs, blogs, and knowledge bases.

Developer

charitable_jeopardy

Last modified

a month ago

AI & RAG Documentation Ingester (Pre-Chunked Web Crawler)

Stop wasting LLM tokens and vector DB space on website headers, footers, cookie banners, and navigation menus.

This Actor crawls public documentation sites, blogs, and knowledge bases, extracts only the core body content, and outputs clean, pre-chunked text records mapped to their nearest headings. Incremental change detection flags what changed between runs, so you can keep your vector database in sync without re-ingesting everything.

🎯 Best For

RAG & LLM Developers looking to ingest clean documentation, guides, or manuals into vector databases (Pinecone, Qdrant, PGVector, etc.).
AI Product Teams building custom customer support agents or search engines over vertical/niche websites.
Knowledge Engineers who need to monitor specific websites and ingest only new or updated pages.

Why this is better than a generic crawler

Zero Noise: Automatically strips out navigation links, scripts, CSS, sidebars, newsletter boxes, and cookie overlays before parsing.
Context-Aware Chunking: Instead of naive character splitting, it generates overlapping text blocks and attaches the relevant heading hierarchy (h1–h6) to every single chunk.
Stateful Incremental Ingestion: Uses a persistent Key-Value Store across runs to compare page content hashes. It flags pages as new, changed, or unchanged so you only update changed chunks in your database.
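
The context-aware chunking described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual implementation: it assumes a fixed-size sliding window over the cleaned body text, and it assumes headings are available as (offset, level, text) tuples. The names `Chunk`, `heading_context`, and `chunk_text` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    index: int
    text: str
    char_start: int
    char_end: int
    headings: list[tuple[int, str]] = field(default_factory=list)


def heading_context(headings, offset):
    """Return the active heading path [(level, text), ...] at a character offset.

    `headings` is a list of (offset, level, text) tuples sorted by offset
    (an assumed representation, chosen for this sketch).
    """
    path = []
    for pos, level, text in headings:
        if pos > offset:
            break
        # A new heading closes every open heading at the same or a deeper level.
        path = [h for h in path if h[0] < level]
        path.append((level, text))
    return path


def chunk_text(body, headings, size=1000, overlap=150):
    """Split `body` into overlapping windows and tag each with its heading path."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(body):
        end = min(start + size, len(body))
        chunks.append(
            Chunk(len(chunks), body[start:end], start, end,
                  heading_context(headings, start))
        )
        if end == len(body):
            break
        start += step
    return chunks


if __name__ == "__main__":
    body = "".join(chr(97 + i % 26) for i in range(2000))
    headings = [(0, 1, "Getting Started"), (900, 2, "Installation")]
    for c in chunk_text(body, headings):
        print(c.index, c.char_start, c.char_end, c.headings)
```

In this sketch a chunk inherits the heading path that is active where the chunk starts. A chunk that spans a heading boundary therefore keeps the earlier heading, which is one of several reasonable choices.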

💡 Example Workflow: Ingesting a Blog to Pinecone

Configure Target: Input the seed URL or sitemap (e.g., https://example.com/sitemap.xml).
Filter blog posts: Add https://example.com/blog/** to Include patterns, and add patterns for tag and author pages to Exclude patterns.
Enable Chunking & Change Detection: Set chunkText: true and detectChanges: true.
Configure Output: Set outputFormat to chunks or pagesAndChunks.
Sync: Run the Actor, retrieve only the new or changed chunks from the dataset, and upsert them to your vector database.
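
The final sync step might look something like the sketch below on the consumer side. It assumes you keep your own map from chunkId to the last contentHash you upserted, and it relies only on the chunkId, contentHash, and recordType fields shown in the example output. The function name `plan_sync` is hypothetical, and the actual vector database upsert is left out.

```python
def plan_sync(records, known_hashes):
    """Pick the chunk records that need an upsert.

    `records` is an iterable of dataset items (dicts).
    `known_hashes` maps chunkId -> contentHash from the previous sync
    (state kept by the caller; an assumption of this sketch).
    Returns (records_to_upsert, updated_hash_map).
    """
    to_upsert = []
    new_state = dict(known_hashes)
    for rec in records:
        if rec.get("recordType") != "chunk":
            continue  # skip page-level records when both kinds are present
        chunk_id, content_hash = rec["chunkId"], rec["contentHash"]
        if known_hashes.get(chunk_id) != content_hash:
            to_upsert.append(rec)  # new chunk, or content changed since last sync
        new_state[chunk_id] = content_hash
    return to_upsert, new_state


if __name__ == "__main__":
    previous = {"a": "h1", "b": "old"}
    items = [
        {"recordType": "chunk", "chunkId": "a", "contentHash": "h1"},
        {"recordType": "chunk", "chunkId": "b", "contentHash": "h2"},
        {"recordType": "chunk", "chunkId": "c", "contentHash": "h3"},
    ]
    changed, state = plan_sync(items, previous)
    print([r["chunkId"] for r in changed])
```

Only the records in the first return value need to be embedded and upserted. Persist the second return value for the next run.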

📄 Example Output: Chunk Record

Each chunk is a self-contained record ready for embedding generation:

{
  "recordType": "chunk",
  "chunkId": "a8f9c118bc28a192c73d9059f0f9bde0",
  "pageUrl": "https://example.com/docs/getting-started",
  "canonicalUrl": "https://example.com/docs/getting-started",
  "site": "example.com",
  "title": "Getting Started Guide | Documentation",
  "chunkIndex": 0,
  "chunkText": "To install the library, run 'npm install @sdk/core'. Make sure you have Node.js version 20 or higher installed in your environment before initiating setup...",
  "chunkCharStart": 0,
  "chunkCharEnd": 150,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "headingsContext": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "language": "en",
  "contentHash": "8f3c9e...",
  "timestamp": "2026-06-06T12:00:00.000Z"
}
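
One way to use such a record, sketched under assumptions: prepend the heading path to the chunk text before embedding, and keep a few fields as vector metadata. The helper names `embedding_text` and `vector_metadata` are hypothetical, and the choice of metadata keys is only a suggestion. The field names come from the record above.

```python
def embedding_text(record):
    """Build the string to embed: heading breadcrumb, blank line, chunk text."""
    breadcrumb = " > ".join(h["text"] for h in record.get("headingsContext", []))
    if breadcrumb:
        return f"{breadcrumb}\n\n{record['chunkText']}"
    return record["chunkText"]


def vector_metadata(record):
    """Select fields worth storing next to the vector (an illustrative subset)."""
    keys = ("pageUrl", "title", "chunkIndex", "language", "contentHash")
    return {k: record[k] for k in keys if k in record}


if __name__ == "__main__":
    record = {
        "chunkId": "a8f9c118bc28a192c73d9059f0f9bde0",
        "pageUrl": "https://example.com/docs/getting-started",
        "title": "Getting Started Guide | Documentation",
        "chunkIndex": 0,
        "chunkText": "To install the library, run 'npm install @sdk/core'.",
        "headingsContext": [
            {"level": 1, "text": "Getting Started"},
            {"level": 2, "text": "Installation"},
        ],
        "language": "en",
    }
    print(embedding_text(record))
    print(vector_metadata(record))
```

Embedding the breadcrumb together with the text gives short chunks some topical context. Whether that improves retrieval depends on your embedding model and corpus.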

⚙️ Quick Start

Start URLs / Sitemap URLs: Provide at least one URL. The default input uses https://example.com/ so the Actor produces a small dataset item without setup.
Use Browser Rendering: Toggle on if the page relies heavily on client-side JavaScript (React, Vue, etc.) to render body text.
Max Pages Per Site: A hard cap on the number of pages crawled per site (default 1), which keeps the prefilled run fast and prevents uncontrolled resource use.
Chunk Size & Overlap: Set these to fit your LLM's context window guidelines (e.g., size 1000 chars, overlap 150 chars).
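
To get a feel for how chunk size and overlap affect the number of records per page, here is a rough estimate. It assumes a plain sliding window that advances by size minus overlap characters. The Actor's real splitter may behave differently, so treat the result as an approximation. The function name `estimate_chunks` is hypothetical.

```python
import math


def estimate_chunks(text_length, size=1000, overlap=150):
    """Estimate chunks for a page, assuming a fixed-stride sliding window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if text_length <= 0:
        return 0
    if text_length <= size:
        return 1
    # After the first full window, each extra chunk covers (size - overlap) new chars.
    return 1 + math.ceil((text_length - size) / (size - overlap))


if __name__ == "__main__":
    for length in (800, 2000, 10000):
        print(length, estimate_chunks(length))
```

A larger overlap lowers the stride, so the same page yields more chunks and more embedding calls.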

Example Input

{
  "startUrls": [{ "url": "https://example.com/" }],
  "sitemapUrls": [],
  "maxPagesPerSite": 1,
  "includePatterns": [],
  "excludePatterns": [],
  "crawlDepth": 0,
  "maxCrawlRetries": 1,
  "useBrowserRendering": false,
  "languageDetection": true,
  "chunkText": false,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "outputFormat": "pages",
  "detectChanges": false,
  "storeRawHtml": false,
  "storeCleanText": true
}
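
As an illustration, the input above can be adapted for the blog workflow by overriding a few fields. The sketch below uses the documented field names, but the specific values for crawlDepth and maxPagesPerSite are assumptions chosen for the example, and the check that chunkOverlap is smaller than chunkSize is a sanity check of this sketch, not a documented rule. The helper `build_input` is hypothetical.

```python
import json

# Defaults copied from the example input above.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com/"}],
    "sitemapUrls": [],
    "maxPagesPerSite": 1,
    "includePatterns": [],
    "excludePatterns": [],
    "crawlDepth": 0,
    "maxCrawlRetries": 1,
    "useBrowserRendering": False,
    "languageDetection": True,
    "chunkText": False,
    "chunkSize": 1000,
    "chunkOverlap": 150,
    "outputFormat": "pages",
    "detectChanges": False,
    "storeRawHtml": False,
    "storeCleanText": True,
}


def build_input(**overrides):
    """Merge overrides into the defaults, rejecting unknown field names."""
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise KeyError(f"unknown input fields: {sorted(unknown)}")
    run_input = {**DEFAULT_INPUT, **overrides}
    if run_input["chunkOverlap"] >= run_input["chunkSize"]:
        raise ValueError("chunkOverlap should be smaller than chunkSize")
    return run_input


blog_input = build_input(
    startUrls=[{"url": "https://example.com/blog/"}],
    includePatterns=["https://example.com/blog/**"],
    crawlDepth=2,          # assumed value for illustration
    maxPagesPerSite=200,   # assumed value for illustration
    chunkText=True,
    detectChanges=True,
    outputFormat="chunks",
)

if __name__ == "__main__":
    print(json.dumps(blog_input, indent=2))
```

The resulting dictionary can be pasted into the Actor's JSON input editor, or passed as `run_input` when starting the Actor through the `apify-client` Python package.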
