Under maintenance

Pricing

$1.00 / 1,000 results

RAG Knowledge Loader

Scrapes documentation sites (GitBook, ReadTheDocs, public Notion pages) and converts them into vector-ready JSON for RAG applications.

Rating

0.0

(0)

Developer

BotFlowTech

Last modified

6 days ago

Features

Crawls entire documentation sites recursively
Extracts clean, structured content
Removes navigation, headers, footers automatically
Outputs vector-ready JSON format
Supports GitBook, ReadTheDocs, Notion, and custom doc sites

Use Cases

Build "Chat with Docs" chatbots
Feed LLMs with up-to-date documentation
Create knowledge bases for RAG pipelines
Automate documentation updates for vector databases

Input Parameters

Required

Start URLs (required): Array of documentation site URLs to scrape
- Example: https://docs.apify.com/, https://your-gitbook-site.com

Optional Configuration

Max pages to crawl (default: 1000): Maximum number of pages to scrape
- Minimum: 1
Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
- Example: ["**/api/**", "**/guides/**"]
Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for main content area
Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to strip before extraction, such as navigation, headers, and footers
Output Format (default: "vector-ready"):
- "vector-ready": Flat structure optimized for embeddings
- "hierarchical": Nested structure with full metadata
Crawler Type (default: "cheerio"):
- "cheerio": Fast HTTP crawler for static sites
- "playwright": Browser-based crawler for JavaScript-heavy sites
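
How the include and exclude patterns interact can be sketched in a few lines of Python. This is only an approximation: it uses the standard library's fnmatch, where * also matches "/", so it may not match the Actor's real glob engine in every edge case. The function name should_crawl and the rule that excludes win over includes are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Default exclude patterns as listed in the parameter description above.
DEFAULT_EXCLUDES = ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]


def should_crawl(url, include_globs=(), exclude_globs=DEFAULT_EXCLUDES):
    """Approximate the include/exclude decision for a single URL.

    Assumption: an exclude match always wins, and an empty include
    list means "no restriction".
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if include_globs:
        return any(fnmatchcase(url, pattern) for pattern in include_globs)
    return True


if __name__ == "__main__":
    for candidate in [
        "https://docs.example.com/api/users",
        "https://docs.example.com/files/manual.pdf",
        "https://docs.example.com/guides/intro",
    ]:
        print(candidate, should_crawl(candidate, include_globs=["**/api/**"]))
```

Checking a handful of known URLs this way before a large crawl can help catch a pattern that is too broad or too narrow.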

Example Input JSON

{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}

Minimal Input Example

{
  "startUrls": [
    { "url": "https://docs.example.com/" }
  ]
}
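
If the input is assembled programmatically, a small helper can catch obvious mistakes before a run is started. The sketch below reuses the field names from the examples above; the helper name build_run_input and its validation rules are assumptions based on the parameter descriptions, not the Actor's actual input schema.

```python
import json

# Allowed values as described under Output Format and Crawler Type.
ALLOWED_OUTPUT_FORMATS = {"vector-ready", "hierarchical"}
ALLOWED_CRAWLER_TYPES = {"cheerio", "playwright"}


def build_run_input(start_urls, max_pages=1000,
                    output_format="vector-ready", crawler_type="cheerio"):
    """Return an input dict using the field names from the examples above.

    Hypothetical helper: the checks mirror the documented constraints
    (required start URLs, minimum of 1 page, two formats, two crawlers).
    """
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unknown outputFormat: {output_format!r}")
    if crawler_type not in ALLOWED_CRAWLER_TYPES:
        raise ValueError(f"unknown crawlerType: {crawler_type!r}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "outputFormat": output_format,
        "crawlerType": crawler_type,
    }


if __name__ == "__main__":
    run_input = build_run_input(["https://docs.example.com/"], max_pages=500)
    print(json.dumps(run_input, indent=2))
```

The resulting dict can be serialized with json.dumps and pasted into the Actor's JSON input, or passed to whichever client is used to start runs.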

Output Format

Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
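
Each document in this format holds the text of a whole page, which may be longer than an embedding model accepts. One possible pre-processing step is to split each page into overlapping word windows and carry the page metadata along. The helper names, the "#index" ID suffix, the chunkIndex field, and the window sizes below are all illustrative assumptions; the Actor itself is not stated to do any chunking.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def to_chunk_records(output, size=200, overlap=20):
    """Turn vector-ready output into one record per chunk.

    Assumes the structure shown above: output["documents"] is a list of
    objects with "id", "text" and "metadata".
    """
    records = []
    for doc in output["documents"]:
        for index, chunk in enumerate(chunk_words(doc["text"], size, overlap)):
            if not chunk:  # skip pages with no extracted text
                continue
            records.append({
                "id": f"{doc['id']}#{index}",
                "text": chunk,
                "metadata": {**doc["metadata"], "chunkIndex": index},
            })
    return records
```

Keeping the original source and url in every chunk's metadata lets a chatbot cite the page an answer came from.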

Hierarchical Format

Includes full document structure with headings and metadata:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
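
The two formats carry largely the same information under different keys ("content" versus "text", top-level "url" and "title" versus metadata fields). A run saved in the hierarchical format can therefore be reshaped into the flat one afterwards. The function below is a sketch that assumes exactly the field names shown in the two examples; the function name is hypothetical, and extra hierarchical fields such as headings are simply dropped.

```python
def hierarchical_to_vector_ready(output):
    """Reshape hierarchical output into the flat vector-ready structure.

    Field mapping is inferred from the two example payloads above and is
    an assumption, not a documented guarantee.
    """
    documents = []
    for doc in output["documents"]:
        meta = doc.get("metadata", {})
        documents.append({
            "id": doc["id"],
            "text": doc["content"],
            "metadata": {
                "source": doc["url"],
                "title": doc["title"],
                "url": doc["url"],
                "scrapedAt": meta.get("scrapedAt"),
                "wordCount": meta.get("wordCount"),
            },
        })
    return {
        "metadata": {**output["metadata"], "readyForEmbedding": True},
        "documents": documents,
    }
```

Going the other way is not possible without re-crawling, since the flat format does not keep headings, description, or keywords.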

Integration with Vector Databases

The output can be loaded into popular RAG frameworks and vector databases:

LangChain: Use JSONLoader to load documents
LlamaIndex: Import as Document objects
Pinecone/Weaviate: Batch upsert with metadata
Chroma: Add to collection with embeddings
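
For the batch upsert case, the documents array usually has to be sent in fixed-size batches rather than in one request. The sketch below shows only the batching logic with the standard library. The upsert argument is a hypothetical callable standing in for whatever write method the chosen vector database client provides, and the batch size of 100 is an arbitrary assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(documents, upsert, batch_size=100):
    """Send documents to `upsert` batch by batch and return how many were sent.

    `upsert` is a placeholder: any callable that accepts a list of documents.
    """
    total = 0
    for batch in batched(documents, batch_size):
        upsert(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    docs = [{"id": f"doc-{i}", "text": "..."} for i in range(250)]
    sent = upsert_in_batches(docs, upsert=lambda batch: print(len(batch), "documents"))
    print("total:", sent)
```

In a real pipeline the callable would also compute or attach embeddings before writing, since the Actor's output contains text and metadata but no vectors.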

Tech Docs to LLM-Ready Markdown

hedelka/tech-docs-scraper

Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

Dmitry Goncharov

Docs To Rag

gabrielaxy/docs-to-rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Gabriel Antony Xaviour

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

Universal Knowledge Base Scraper (RAG Ready)

actums/universal-rag-scraper

Turn any Help Center into LLM-ready Markdown. Supports Zendesk, Intercom, Docusaurus, and generic sites. Perfect for RAG and AI Agents.

Actums

Rag Embedding Generator

labrat011/rag-embedding-generator

Generate vector embeddings from text or chunked datasets using OpenAI or Cohere. Chains with RAG Content Chunker for end-to-end RAG pipelines. Outputs raw vectors ready for any vector database.

Mick

Web Scraper RAG Ready

traorealexy/Web-Sraper-RAG-Ready

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Alexy Traore

PDF to Markdown RAG-Ready

hedelka/pdf-to-markdown-rag

Premium PDF scraper that preserves tables and structure. Optimized for RAG.

Dmitry Goncharov

AI Training Data Scraper

blukaze/AI-Training-Data-Scraper

AI Training Data Scraper converts websites into clean, semantically-chunked, vector-ready data for LLMs, RAG pipelines, and AI search. Built for documentation, tutorials, and code-heavy content, with smart chunking and rich metadata.

Blukaze Automations

Universal RAG Web Scraper

express_kingfisher/rag-web-scraper

Turn any website into clean, LLM-ready Markdown. Automatically strips ads, navigation, and noise using Mozilla Readability. Perfect for feeding data to ChatGPT, Claude, or Vector Databases (RAG).