
RAG Data Ingestion: Website to AI Knowledge Base

Under maintenance

Pricing

from $1.00 / 1,000 premium scraped pages


Master complex documentation with a premium scraper that flattens Shadow DOM and handles modern web components. Delivers clean, token-accurate Markdown pre-chunked for immediate RAG ingestion into Pinecone, Weaviate, or LangChain. Optimized for high-fidelity LLM training data.


Rating: 0.0 (0 reviews)

Developer: tekk (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user

Last modified: 2 days ago

Universal AI Knowledge Scraper — Premium RAG Ingestion Engine

The high-fidelity bridge between the complex web and your LLM. Convert any website or documentation portal into cleaned, chunked, and token-accurate Markdown optimized for RAG pipelines.

Build production-grade RAG (Retrieval-Augmented Generation) datasets with a single Actor run. Feed the output directly into Pinecone, Weaviate, Qdrant, ChromaDB, or any vector store.


πŸ›‘οΈ Why Use This Actor?

Most scrapers return empty strings on modern documentation sites. This Actor was built to solve the "Invisible Web" problem.

| Feature | Standard Scrapers | This Actor |
| --- | --- | --- |
| Vanilla HTML | ✅ | ✅ |
| Shadow DOM / Web Components | ❌ (Empty Output) | ✅ (Full Flattening) |
| Token Tracking | ❌ (Manual Regex) | ✅ (Native Tiktoken) |
| Modern Code Blocks | ❌ (Garbled) | ✅ (Clean GFM) |
  • Built-in Token Counting for Budget Management — Every record includes a usage object with exact token counts, encoding type, and chunk parameters. Enterprise teams can calculate embedding costs before hitting the OpenAI API.
  • Shadow DOM Extraction — Successfully captures content from Shadow DOM-heavy sites (like Shoelace Web Components) where standard crawlers see nothing.
  • Zero-Config Extraction — No CSS selectors to maintain. The density-based Readability algorithm adapts to any site layout automatically.
  • Antifragile Stealth — Bézier-curve mouse simulation and fingerprint rotation help the Actor evade Cloudflare, Akamai, and behavioral detection systems.
  • CU-Optimized — Resource interception blocks images, fonts, and media. You get lower memory usage and higher concurrency at the same price.
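As a sketch of the budget math the usage object enables, you can total token counts across records and price them before calling an embeddings API. The per-token rate below is an assumption for illustration (e.g. OpenAI's text-embedding-3-small at $0.02 per 1M tokens); verify your provider's current pricing.

```python
# Illustrative sketch: sum token counts from the Actor's dataset records
# to estimate embedding spend before any API call is made.
PRICE_PER_MILLION = 0.02  # USD per 1M tokens -- assumed rate, check your provider

def estimate_embedding_cost(records):
    """Return (total_tokens, estimated_cost_usd) across scraped records."""
    total = sum(r["usage"]["total_tokens"] for r in records)
    return total, total / 1_000_000 * PRICE_PER_MILLION

records = [
    {"usage": {"total_tokens": 1010}},
    {"usage": {"total_tokens": 4210}},
]
tokens, cost = estimate_embedding_cost(records)
print(f"{tokens} tokens ~= ${cost:.6f}")
```

Because the counts are precomputed at scrape time, this runs over the dataset export alone, with no tokenizer or network dependency.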

🚀 Key Features

  • Hybrid Discovery — Priority parsing of sitemap.xml with fallback to recursive <a> tag extraction.
  • Universal Extraction — Powered by Mozilla's Readability algorithm with recursive Shadow DOM flattening.
  • Clean Markdown Output — Converts HTML to Markdown via Turndown with GFM support (tables, code blocks).
  • Token-Aware Chunking — Splits content using tiktoken (GPT-4o / o1 encodings) into configurable chunk sizes with overlap.
  • Bloom Filter Dedup — O(1) URL deduplication prevents infinite loops and duplicate scraping.
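The token-aware chunking above amounts to a sliding window over a token sequence with a configurable overlap. A minimal sketch of that logic (illustrative, not the Actor's own code; with real text you would first encode via `tiktoken.get_encoding("o200k_base").encode(text)`):

```python
# Sliding-window chunker over a token sequence, mirroring the Actor's
# chunk_size / chunk_overlap parameters (illustrative sketch).
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 1200 dummy tokens -> three chunks, each sharing 50 tokens with its neighbor
chunks = chunk_tokens(list(range(1200)))
print([len(c) for c in chunks])
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks, at the cost of slightly more embedded tokens.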

📦 Output Format

Each record is a standardized JSON object ready for vector database ingestion:

```json
{
  "metadata": {
    "source_url": "https://docs.example.com/api/auth",
    "title": "Authentication — API Docs",
    "crawled_at": "2026-04-30T13:00:00Z",
    "site_name": "Example Docs",
    "lang": "en"
  },
  "usage": {
    "total_tokens": 1010,
    "total_chunks": 2,
    "encoding": "o200k_base",
    "chunk_size": 512,
    "chunk_overlap": 50
  },
  "content": [
    {
      "chunk_id": 1,
      "token_count": 512,
      "text": "### Authentication\n\nAll API requests require a Bearer token..."
    }
  ],
  "raw_markdown": "### Authentication\n\nAll API requests require a Bearer token..."
}
```
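Records in this shape map directly onto the `(id, text, metadata)` rows that vector stores expect at upsert time. A minimal sketch of that flattening step (field names taken from the output above; the embedding and upsert calls are left to your client library of choice):

```python
# Flatten one Actor record into rows ready for embedding + upsert
# into Pinecone, Weaviate, Qdrant, etc. (illustrative sketch).
def to_upsert_rows(record):
    meta = record["metadata"]
    rows = []
    for chunk in record["content"]:
        rows.append({
            # URL + chunk id gives a stable, re-crawl-safe identifier
            "id": f'{meta["source_url"]}#chunk-{chunk["chunk_id"]}',
            "text": chunk["text"],
            "metadata": {
                "source_url": meta["source_url"],
                "title": meta["title"],
                "token_count": chunk["token_count"],
            },
        })
    return rows

record = {
    "metadata": {"source_url": "https://docs.example.com/api/auth",
                 "title": "Authentication — API Docs"},
    "content": [{"chunk_id": 1, "token_count": 512,
                 "text": "### Authentication\n\nAll API requests require a Bearer token..."}],
}
rows = to_upsert_rows(record)
print(rows[0]["id"])
```

Carrying `source_url` and `title` into each chunk's metadata lets your RAG pipeline cite the originating page in generated answers.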