RAG Data Ingestion: Website to AI Knowledge Base
Pricing
from $1.00 / 1,000 premium scraped pages
Under maintenance
Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.
Developer
tekk
Universal AI Knowledge Scraper: Premium RAG Ingestion Engine
The high-fidelity bridge between the complex web and your LLM. Convert any website or documentation portal into cleaned, chunked, and token-accurate Markdown optimized for RAG pipelines.
Build production-grade RAG (Retrieval-Augmented Generation) datasets with a single Actor run. Feed the output directly into Pinecone, Weaviate, Qdrant, ChromaDB, or any vector store.
🛡️ Why Use This Actor?
Many scrapers return empty strings on modern documentation sites that render their content inside Shadow DOM and web components. This Actor was built to solve that "Invisible Web" problem.
| Feature | Standard Scrapers | This Actor |
|---|---|---|
| Vanilla HTML | ✅ | ✅ |
| Shadow DOM / Web Components | ❌ (Empty Output) | ✅ (Full Flattening) |
| Token Tracking | ❌ (Manual Regex) | ✅ (Native Tiktoken) |
| Modern Code Blocks | ❌ (Garbled) | ✅ (Clean GFM) |
- Built-in Token Counting for Budget Management: Every record includes a `usage` object with exact token counts, encoding type, and chunk parameters. Enterprise teams can calculate embedding costs before hitting the OpenAI API.
- Shadow DOM Extraction: Successfully captures content from Shadow DOM-heavy sites (like Shoelace Web Components) where standard crawlers see nothing.
- Zero-Config Extraction: No CSS selectors to maintain. The density-based Readability algorithm adapts to any site layout automatically.
- Antifragile Stealth: Bézier-curve mouse simulation and fingerprint rotation make this Actor invisible to Cloudflare, Akamai, and behavioral detection systems.
- CU-Optimized: Resource interception blocks images, fonts, and media. You get lower memory usage and higher concurrency at the same price.
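As a rough illustration of the budget calculation, the sketch below sums `usage.total_tokens` across records and multiplies by a per-million-token price. The function name `estimate_embedding_cost` and the `usd_per_million_tokens` parameter are hypothetical, and the price is something you supply from your embedding provider's current rate card. It also assumes `total_tokens` reflects the tokens you will actually send for embedding.

```python
from typing import Iterable, Mapping, Tuple


def estimate_embedding_cost(
    records: Iterable[Mapping],
    usd_per_million_tokens: float,
) -> Tuple[int, float]:
    """Return (total tokens, estimated cost in USD) for a set of records.

    Assumption: each record carries a `usage` object with a `total_tokens`
    field, as in the output format described in this README.
    """
    total_tokens = sum(record["usage"]["total_tokens"] for record in records)
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, cost_usd


if __name__ == "__main__":
    # Two illustrative records; the price below is a placeholder, not a quote.
    sample = [
        {"usage": {"total_tokens": 1010}},
        {"usage": {"total_tokens": 990}},
    ]
    tokens, cost = estimate_embedding_cost(sample, usd_per_million_tokens=0.02)
    print(f"{tokens} tokens, approx. ${cost:.6f}")
```

Running this over a dataset export before embedding gives a quick upper-level estimate without calling any external API.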
Key Features
- Hybrid Discovery: Priority parsing of `sitemap.xml` with fallback to recursive `<a>` tag extraction.
- Universal Extraction: Powered by Mozilla's Readability algorithm with recursive Shadow DOM flattening.
- Clean Markdown Output: Converts HTML to Markdown via Turndown with GFM support (tables, code blocks).
- Token-Aware Chunking: Splits content using `tiktoken` (GPT-4o / o1 encodings) into configurable chunk sizes with overlap.
- Bloom Filter Dedup: O(1) URL deduplication prevents infinite loops and duplicate scraping.
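To make the chunking parameters concrete, here is a minimal sketch of a sliding-window split over a list of token IDs. It is not the Actor's actual implementation: the function `chunk_tokens` is hypothetical, and it works on plain integers so it runs without dependencies. In practice the token list could come from `tiktoken.get_encoding("o200k_base").encode(text)`.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int],
    chunk_size: int = 512,
    chunk_overlap: int = 50,
) -> List[List[int]]:
    """Split a token sequence into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens, so context that
    straddles a boundary appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input,
        # otherwise a trailing chunk made only of overlap would be emitted.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


if __name__ == "__main__":
    fake_tokens = list(range(960))  # stand-in for real token IDs
    for i, chunk in enumerate(chunk_tokens(fake_tokens), start=1):
        print(i, len(chunk))
```

A larger overlap improves recall for passages that cross chunk boundaries, at the cost of embedding some tokens twice.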
📦 Output Format
Each record is a standardized JSON object ready for vector database ingestion:
    {
      "metadata": {
        "source_url": "https://docs.example.com/api/auth",
        "title": "Authentication - API Docs",
        "crawled_at": "2026-04-30T13:00:00Z",
        "site_name": "Example Docs",
        "lang": "en"
      },
      "usage": {
        "total_tokens": 1010,
        "total_chunks": 2,
        "encoding": "o200k_base",
        "chunk_size": 512,
        "chunk_overlap": 50
      },
      "content": [
        {
          "chunk_id": 1,
          "token_count": 512,
          "text": "### Authentication\n\nAll API requests require a Bearer token..."
        }
      ],
      "raw_markdown": "### Authentication\n\nAll API requests require a Bearer token..."
    }
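One way to prepare such a record for a vector store is to flatten it into one item per chunk, each carrying its text and a small metadata payload. The sketch below is an assumption-laden illustration: `record_to_items` and the `id` scheme (`source_url` plus `#chunk-N`) are hypothetical, and the resulting dictionaries would still need to be embedded and passed to your store's own client.

```python
from typing import Any, Dict, List


def record_to_items(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one Actor record into per-chunk items.

    Each item gets a stable ID derived from the source URL and chunk ID
    (a hypothetical convention), the chunk text, and selected metadata.
    """
    meta = record["metadata"]
    items = []
    for chunk in record["content"]:
        items.append(
            {
                "id": f'{meta["source_url"]}#chunk-{chunk["chunk_id"]}',
                "text": chunk["text"],
                "metadata": {
                    "source_url": meta["source_url"],
                    "title": meta["title"],
                    "crawled_at": meta["crawled_at"],
                    "token_count": chunk["token_count"],
                },
            }
        )
    return items


if __name__ == "__main__":
    sample_record = {
        "metadata": {
            "source_url": "https://docs.example.com/api/auth",
            "title": "Authentication - API Docs",
            "crawled_at": "2026-04-30T13:00:00Z",
        },
        "content": [
            {"chunk_id": 1, "token_count": 512, "text": "### Authentication"},
        ],
    }
    for item in record_to_items(sample_record):
        print(item["id"])
```

Deriving the ID from the URL and chunk number means re-crawling the same page overwrites its earlier chunks instead of duplicating them, provided the store upserts by ID.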