
RAG Data Ingestion: Website to AI Knowledge Base

Under maintenance

Pricing

from $1.00 / 1,000 premium scraped pages


Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.


Rating: 0.0 (0 reviews)

Developer: tekk (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user

Last modified: 2 days ago

Universal AI Knowledge Scraper — Premium RAG Ingestion Engine

The high-fidelity bridge between the complex web and your LLM. Convert any website or documentation portal into cleaned, chunked, and token-accurate Markdown optimized for RAG pipelines.

Build production-grade RAG (Retrieval-Augmented Generation) datasets with a single Actor run. Feed the output directly into Pinecone, Weaviate, Qdrant, ChromaDB, or any vector store.


πŸ›‘οΈ Why Use This Actor?

Most scrapers return empty strings on modern documentation sites. This Actor was built to solve the "Invisible Web" problem.

| Feature | Standard Scrapers | This Actor |
| --- | --- | --- |
| Vanilla HTML | ✅ | ✅ |
| Shadow DOM / Web Components | ❌ (Empty Output) | ✅ (Full Flattening) |
| Token Tracking | ❌ (Manual Regex) | ✅ (Native Tiktoken) |
| Modern Code Blocks | ❌ (Garbled) | ✅ (Clean GFM) |
  • Built-in Token Counting for Budget Management — Every record includes a usage object with exact token counts, encoding type, and chunk parameters. Enterprise teams can calculate embedding costs before hitting the OpenAI API.
  • Shadow DOM Extraction — Successfully captures content from Shadow DOM-heavy sites (like Shoelace Web Components) where standard crawlers see nothing.
  • Zero-Config Extraction — No CSS selectors to maintain. The density-based Readability algorithm adapts to any site layout automatically.
  • Antifragile Stealth — Bézier-curve mouse simulation and fingerprint rotation help the Actor evade Cloudflare, Akamai, and behavioral detection systems.
  • CU-Optimized — Resource interception blocks images, fonts, and media. You get lower memory usage and higher concurrency at the same price.
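As a sketch of the budget math the usage object enables, you can total token counts across records and price them before calling an embeddings API. The per-token rate below is an assumption for illustration (e.g. OpenAI's text-embedding-3-small at $0.02 per 1M tokens); verify your provider's current pricing.

```python
# Illustrative sketch: sum token counts from the Actor's dataset records
# to estimate embedding spend before any API call is made.
PRICE_PER_MILLION = 0.02  # USD per 1M tokens -- assumed rate, check your provider

def estimate_embedding_cost(records):
    """Return (total_tokens, estimated_cost_usd) across scraped records."""
    total = sum(r["usage"]["total_tokens"] for r in records)
    return total, total / 1_000_000 * PRICE_PER_MILLION

records = [
    {"usage": {"total_tokens": 1010}},
    {"usage": {"total_tokens": 4210}},
]
tokens, cost = estimate_embedding_cost(records)
print(f"{tokens} tokens ~= ${cost:.6f}")
```

Because the counts are precomputed at scrape time, this runs over the dataset export alone, with no tokenizer or network dependency.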

🚀 Key Features

  • Hybrid Discovery — Priority parsing of sitemap.xml with fallback to recursive <a> tag extraction.
  • Universal Extraction — Powered by Mozilla's Readability algorithm with recursive Shadow DOM flattening.
  • Clean Markdown Output — Converts HTML to Markdown via Turndown with GFM support (tables, code blocks).
  • Token-Aware Chunking — Splits content using tiktoken (GPT-4o / o1 encodings) into configurable chunk sizes with overlap.
  • Bloom Filter Dedup — O(1) URL deduplication prevents infinite loops and duplicate scraping.
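The token-aware chunking above amounts to a sliding window over a token sequence with a configurable overlap. A minimal sketch of that logic (illustrative, not the Actor's own code; with real text you would first encode via `tiktoken.get_encoding("o200k_base").encode(text)`):

```python
# Sliding-window chunker over a token sequence, mirroring the Actor's
# chunk_size / chunk_overlap parameters (illustrative sketch).
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 1200 dummy tokens -> three chunks, each sharing 50 tokens with its neighbor
chunks = chunk_tokens(list(range(1200)))
print([len(c) for c in chunks])
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks, at the cost of slightly more embedded tokens.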

📦 Output Format

Each record is a standardized JSON object ready for vector database ingestion:

```json
{
  "metadata": {
    "source_url": "https://docs.example.com/api/auth",
    "title": "Authentication — API Docs",
    "crawled_at": "2026-04-30T13:00:00Z",
    "site_name": "Example Docs",
    "lang": "en"
  },
  "usage": {
    "total_tokens": 1010,
    "total_chunks": 2,
    "encoding": "o200k_base",
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "content": [
    {
      "chunk_id": 1,
      "token_count": 512,
      "text": "### Authentication\n\nAll API requests require a Bearer token..."
    }
  ],
  "raw_markdown": "### Authentication\n\nAll API requests require a Bearer token..."
}
```
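Records in this shape map directly onto the `(id, text, metadata)` rows that vector stores expect at upsert time. A minimal sketch of that flattening step (field names taken from the output above; the embedding and upsert calls are left to your client library of choice):

```python
# Flatten one Actor record into rows ready for embedding + upsert
# into Pinecone, Weaviate, Qdrant, etc. (illustrative sketch).
def to_upsert_rows(record):
    meta = record["metadata"]
    rows = []
    for chunk in record["content"]:
        rows.append({
            # URL + chunk id gives a stable, re-crawl-safe identifier
            "id": f'{meta["source_url"]}#chunk-{chunk["chunk_id"]}',
            "text": chunk["text"],
            "metadata": {
                "source_url": meta["source_url"],
                "title": meta["title"],
                "token_count": chunk["token_count"],
            },
        })
    return rows

record = {
    "metadata": {"source_url": "https://docs.example.com/api/auth",
                 "title": "Authentication — API Docs"},
    "content": [{"chunk_id": 1, "token_count": 512,
                 "text": "### Authentication\n\nAll API requests require a Bearer token..."}],
}
rows = to_upsert_rows(record)
print(rows[0]["id"])
```

Carrying `source_url` and `title` into each chunk's metadata lets your RAG pipeline cite the originating page in generated answers.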