Docs-to-RAG AI Crawler
Pricing
from $0.20 / 1,000 pages scraped
Stop wasting space on website headers, footers, cookie banners, and navigation menus. Extract clean body text, chunk it for RAG, and detect page changes across runs when crawling public docs, blogs, and knowledge bases.
Developer
charitable_jeopardy
Maintained by Community
AI & RAG Documentation Ingester (Pre-Chunked Web Crawler)
Stop wasting LLM tokens and vector DB space on website headers, footers, cookie banners, and navigation menus.
This Actor crawls public documentation sites, blogs, and knowledge bases, extracts only the core body content, and outputs clean, pre-chunked text records mapped to their nearest headings. Incremental change detection lets you keep your vector database in sync without re-ingesting unchanged pages.
🎯 Best For
- RAG & LLM Developers looking to ingest clean documentation, guides, or manuals into vector databases (Pinecone, Qdrant, PGVector, etc.).
- AI Product Teams building custom customer support agents or search engines over vertical/niche websites.
- Knowledge Engineers who need to monitor specific websites and ingest only new or updated pages.
Why this is better than a generic crawler
- Zero Noise: Automatically strips out navigation links, scripts, CSS, sidebars, newsletter boxes, and cookie overlays before parsing.
- Context-Aware Chunking: Instead of naive character splitting, it generates overlapping text blocks and attaches the relevant heading hierarchy (h1–h6) to every single chunk.
- Stateful Incremental Ingestion: Uses a persistent Key-Value Store across runs to compare page content hashes. It flags pages as new, changed, or unchanged so you only update changed chunks in your database.
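To make the chunking idea concrete, here is a minimal sketch of overlapping character windows that each carry their heading trail. It is an illustration under stated assumptions, not the Actor's actual implementation: the `Section` type and `chunk_section` function are hypothetical names, and only the output field names are borrowed from the example chunk record.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical container: a heading trail plus the body text under it."""
    headings: list[tuple[int, str]]  # e.g. [(1, "Getting Started"), (2, "Installation")]
    text: str


def chunk_section(section: Section, size: int = 1000, overlap: int = 150) -> list[dict]:
    """Split section text into overlapping windows, each tagged with its headings."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts this many characters after the previous one
    chunks = []
    start = 0
    while start < len(section.text):
        end = min(start + size, len(section.text))
        chunks.append({
            "chunkText": section.text[start:end],
            "chunkCharStart": start,
            "chunkCharEnd": end,
            "headingsContext": [{"level": lvl, "text": txt} for lvl, txt in section.headings],
        })
        if end == len(section.text):
            break  # the last window reached the end of the text
        start += step
    return chunks
```

With size 1000 and overlap 150, consecutive windows start 850 characters apart, so the last 150 characters of one chunk reappear at the start of the next.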
💡 Example Workflow: Ingesting a Blog to Pinecone
- Configure Target: Input the seed URL or sitemap (e.g., https://example.com/sitemap.xml).
- Filter blog posts: Add https://example.com/blog/** to Include patterns and exclude tags/authors.
- Enable Chunking & Change Detection: Set chunkText: true and detectChanges: true.
- Configure Output: Set format to chunks or pagesAndChunks.
- Sync: Run the Actor, retrieve only the new or changed chunks from the dataset, and upsert them to your vector database.
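The change detection that the Sync step relies on can be sketched as a hash comparison against state kept between runs. This is a rough illustration only: a plain dict stands in for the persistent Key-Value Store, SHA-256 is an assumed hash function, and `content_hash` and `classify_page` are hypothetical helper names.

```python
import hashlib


def content_hash(clean_text: str) -> str:
    """Hash the cleaned body text (SHA-256 is an assumption, not the Actor's confirmed choice)."""
    return hashlib.sha256(clean_text.encode("utf-8")).hexdigest()


def classify_page(url: str, clean_text: str, store: dict[str, str]) -> str:
    """Return 'new', 'changed', or 'unchanged', then record the latest hash in the store."""
    digest = content_hash(clean_text)
    previous = store.get(url)  # hash saved by an earlier run, if any
    store[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "changed"
```

Only pages classified as new or changed would need their chunks re-embedded and upserted; unchanged pages can be skipped.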
📄 Example Output: Chunk Record
Each chunk is a self-contained record ready for embedding generation:
{
  "recordType": "chunk",
  "chunkId": "a8f9c118bc28a192c73d9059f0f9bde0",
  "pageUrl": "https://example.com/docs/getting-started",
  "canonicalUrl": "https://example.com/docs/getting-started",
  "site": "example.com",
  "title": "Getting Started Guide | Documentation",
  "chunkIndex": 0,
  "chunkText": "To install the library, run 'npm install @sdk/core'. Make sure you have Node.js version 20 or higher installed in your environment before initiating setup...",
  "chunkCharStart": 0,
  "chunkCharEnd": 150,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "headingsContext": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "language": "en",
  "contentHash": "8f3c9e...",
  "timestamp": "2026-06-06T12:00:00.000Z"
}
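One possible way to use such a record is to prepend the heading trail to the chunk text before embedding, so the vector carries its section context. The sketch below assumes only the field names shown in the record above; `to_embedding_item` and the shape of the returned dict are hypothetical and do not correspond to any particular vector database API.

```python
def to_embedding_item(chunk: dict) -> dict:
    """Turn a chunk record into an id, a text to embed, and a small metadata dict."""
    # Join headings in the order given, e.g. "Getting Started > Installation".
    trail = " > ".join(h["text"] for h in chunk.get("headingsContext", []))
    body = chunk["chunkText"]
    text = f"{trail}\n\n{body}" if trail else body
    return {
        "id": chunk["chunkId"],
        "text": text,
        "metadata": {
            "pageUrl": chunk["pageUrl"],
            "title": chunk["title"],
            "chunkIndex": chunk["chunkIndex"],
        },
    }
```

Using chunkId as the vector id means a later run that re-emits the same chunk would overwrite the existing vector rather than create a duplicate, assuming chunk ids are stable across runs.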
⚙️ Quick Start
- Start URLs / Sitemap URLs: Provide at least one URL. The default input uses https://example.com/ so the Actor produces a small dataset item without setup.
- Use Browser Rendering: Toggle on if the page relies heavily on client-side JavaScript (React, Vue, etc.) to render body text.
- Max Pages Per Site: Bounded limit (default 1) to keep the prefilled run fast and prevent uncontrolled resource use.
- Chunk Size & Overlap: Match this to your LLM's context window guidelines (e.g., size 1000 chars, overlap 150 chars).
Example Input
{
  "startUrls": [{ "url": "https://example.com/" }],
  "sitemapUrls": [],
  "maxPagesPerSite": 1,
  "includePatterns": [],
  "excludePatterns": [],
  "crawlDepth": 0,
  "maxCrawlRetries": 1,
  "useBrowserRendering": false,
  "languageDetection": true,
  "chunkText": false,
  "chunkSize": 1000,
  "chunkOverlap": 150,
  "outputFormat": "pages",
  "detectChanges": false,
  "storeRawHtml": false,
  "storeCleanText": true
}
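As a rough sketch, the default input above could be adapted for the blog workflow by overriding a few fields. The field names come from the example input; everything else is an assumption to check against the Actor's input schema: that patterns are plain glob strings, that a crawlDepth above 0 is needed to follow links from the seed page, and the specific page limit and exclude paths. `blog_ingest_input` is a hypothetical helper.

```python
import json

# Default input, copied from the example above.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com/"}],
    "sitemapUrls": [],
    "maxPagesPerSite": 1,
    "includePatterns": [],
    "excludePatterns": [],
    "crawlDepth": 0,
    "maxCrawlRetries": 1,
    "useBrowserRendering": False,
    "languageDetection": True,
    "chunkText": False,
    "chunkSize": 1000,
    "chunkOverlap": 150,
    "outputFormat": "pages",
    "detectChanges": False,
    "storeRawHtml": False,
    "storeCleanText": True,
}


def blog_ingest_input(blog_url: str, max_pages: int = 50) -> dict:
    """Build a run input for chunked, incremental blog ingestion (illustrative values)."""
    base = blog_url.rstrip("/")
    run_input = {
        **DEFAULT_INPUT,
        "startUrls": [{"url": base + "/"}],
        "includePatterns": [base + "/**"],                       # assumed glob-string format
        "excludePatterns": [base + "/tag/**", base + "/author/**"],  # hypothetical paths
        "maxPagesPerSite": max_pages,                            # illustrative limit
        "crawlDepth": 2,                                         # assumed: follow links from the seed
        "chunkText": True,
        "detectChanges": True,
        "outputFormat": "chunks",
    }
    if run_input["chunkOverlap"] >= run_input["chunkSize"]:
        raise ValueError("chunkOverlap should be smaller than chunkSize")
    return run_input


if __name__ == "__main__":
    print(json.dumps(blog_ingest_input("https://example.com/blog"), indent=2))
```

The printed JSON can be pasted into the Actor's input editor after adjusting the assumed values.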