RAG Knowledge Loader
Under maintenance

Pricing

$1.00 / 1,000 results

Developer

BotFlowTech

Maintained by Community

4 days ago

Last modified


Scrapes documentation sites (GitBook, ReadTheDocs, public Notion pages) and converts them into vector-ready JSON for RAG applications.

Features

  • Crawls entire documentation sites recursively
  • Extracts clean, structured content
  • Removes navigation, headers, footers automatically
  • Outputs vector-ready JSON format
  • Supports GitBook, ReadTheDocs, Notion, and custom doc sites

Use Cases

  • Build "Chat with Docs" chatbots
  • Feed LLMs with up-to-date documentation
  • Create knowledge bases for RAG pipelines
  • Automated documentation updates for vector databases

Input Parameters

Required

  • Start URLs (required): Array of documentation site URLs to scrape
    • Example: https://docs.apify.com/, https://your-gitbook-site.com

Optional Configuration

  • Max pages to crawl (default: 1000): Maximum number of pages to scrape
    • Minimum: 1
  • Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
    • Example: ["**/api/**", "**/guides/**"]
  • Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
  • Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for the main content area
  • Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to remove, such as navigation and headers
  • Output Format (default: "vector-ready"):
    • "vector-ready": Flat structure optimized for embeddings
    • "hierarchical": Nested structure with full metadata
  • Crawler Type (default: "cheerio"):
    • "cheerio": Fast HTTP crawler for static sites
    • "playwright": Browser-based crawler for JavaScript-heavy sites
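
To get a feel for how the include and exclude patterns interact, here is a minimal sketch in Python. It rests on two assumptions not stated above: that exclude patterns take precedence over include patterns, and that the standard-library `fnmatchcase` (where `*` also matches `/`) is a close enough stand-in for the glob engine the actor actually uses. `should_crawl` is a hypothetical helper, not part of the actor.

```python
from fnmatch import fnmatchcase

# Default exclude patterns as listed in the parameter description above.
DEFAULT_EXCLUDES = ("**/*.pdf", "**/*.zip", "**/login**", "**/signup**")


def should_crawl(url, include_globs=(), exclude_globs=DEFAULT_EXCLUDES):
    """Approximate the include/exclude decision for a single URL.

    Assumption: excludes win over includes, and an empty include list
    means "crawl everything that is not excluded".
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if not include_globs:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_globs)


if __name__ == "__main__":
    for candidate in (
        "https://docs.example.com/api/users",
        "https://docs.example.com/files/manual.pdf",
        "https://docs.example.com/blog/post",
    ):
        print(candidate, should_crawl(candidate, include_globs=["**/api/**"]))
```

Checking a handful of known URLs this way before a large crawl can catch a pattern that is too broad or too narrow, although the actor's real matching rules may differ in edge cases.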

Example Input JSON

{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}

Minimal Input Example

{ "startUrls": [ { "url": "https://docs.example.com/" } ] }
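
The input can also be assembled in code. The following is a minimal sketch, assuming the field names shown in the example JSON (`startUrls`, `maxPages`, `outputFormat`, `crawlerType`). `build_run_input` is a hypothetical helper, not something the actor ships. It leaves out any optional field the caller does not set, so the actor's own defaults apply.

```python
import json

# Allowed values taken from the parameter list above.
ALLOWED_OUTPUT_FORMATS = {"vector-ready", "hierarchical"}
ALLOWED_CRAWLER_TYPES = {"cheerio", "playwright"}


def build_run_input(start_urls, max_pages=None, output_format=None, crawler_type=None):
    """Build an input dict in the shape of the example JSON (hypothetical helper)."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    run_input = {"startUrls": [{"url": url} for url in start_urls]}

    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("maxPages must be at least 1")
        run_input["maxPages"] = max_pages

    if output_format is not None:
        if output_format not in ALLOWED_OUTPUT_FORMATS:
            raise ValueError(f"unknown outputFormat: {output_format!r}")
        run_input["outputFormat"] = output_format

    if crawler_type is not None:
        if crawler_type not in ALLOWED_CRAWLER_TYPES:
            raise ValueError(f"unknown crawlerType: {crawler_type!r}")
        run_input["crawlerType"] = crawler_type

    return run_input


if __name__ == "__main__":
    example = build_run_input(["https://docs.example.com/"], max_pages=500)
    print(json.dumps(example, indent=2))
```

Validating the values locally gives an early error for a typo such as "playwrite" instead of a failed run.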

Output Format

Vector-Ready Format (Default)

Optimized for direct ingestion into vector databases:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
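
As a rough illustration of consuming this format, the sketch below reads a vector-ready result and yields one (id, text, metadata) record per document. It assumes the result has been saved to a local JSON file with exactly the shape shown above. The function names and the file path are hypothetical.

```python
import json


def iter_embedding_records(output):
    """Yield (id, text, metadata) for every document that has text to embed."""
    for doc in output.get("documents", []):
        text = (doc.get("text") or "").strip()
        if not text:
            continue  # skip empty pages, there is nothing to embed
        yield doc["id"], text, doc.get("metadata", {})


def load_records(path):
    """Read a saved vector-ready result (hypothetical local file) into a list."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_embedding_records(json.load(handle)))


if __name__ == "__main__":
    sample = {
        "metadata": {"totalPages": 1, "readyForEmbedding": True},
        "documents": [
            {
                "id": "unique-doc-id-123",
                "text": "Full page content...",
                "metadata": {"url": "https://docs.example.com/page", "wordCount": 3},
            }
        ],
    }
    for doc_id, text, meta in iter_embedding_records(sample):
        print(doc_id, len(text), meta["url"])
```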

Hierarchical Format

Includes full document structure with headings and metadata:

{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
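
The two formats carry overlapping information, so a hierarchical document can be reduced to the flat shape when needed. The sketch below assumes the field correspondence suggested by the two examples (`content` becomes `text`, and `url` fills both `source` and `url`). That mapping is an inference, not documented behavior. Both helpers are hypothetical.

```python
def flatten_document(doc):
    """Map one hierarchical document onto the flat vector-ready shape.

    Assumption: the field mapping is inferred from the two example outputs.
    """
    meta = doc.get("metadata", {})
    return {
        "id": doc["id"],
        "text": doc["content"],
        "metadata": {
            "source": doc["url"],
            "title": doc["title"],
            "url": doc["url"],
            "scrapedAt": meta.get("scrapedAt"),
            "wordCount": meta.get("wordCount"),
        },
    }


def render_outline(headings):
    """Render the headings list as an indented outline, two spaces per level."""
    return "\n".join(
        "  " * (heading["level"] - 1) + heading["text"] for heading in headings
    )


if __name__ == "__main__":
    headings = [
        {"level": 1, "text": "Introduction"},
        {"level": 2, "text": "Getting Started"},
    ]
    print(render_outline(headings))
```

The outline helper shows one use of the extra structure: the headings can be stored or displayed alongside the text as a lightweight table of contents.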

Integration with Vector Databases

The output is ready to use with popular RAG frameworks and vector databases:

  • LangChain: Use JSONLoader to load documents
  • LlamaIndex: Import as Document objects
  • Pinecone/Weaviate: Batch upsert with metadata
  • Chroma: Add to collection with embeddings
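
As one possible illustration of the batch-upsert idea, the sketch below groups vector-ready documents into keyword-argument batches shaped for Chroma's `collection.add(ids=..., documents=..., metadatas=...)`. The helper name and the batch size of 100 are assumptions. The Chroma calls appear only in comments, since they need the `chromadb` package and a configured collection.

```python
def to_chroma_batches(documents, batch_size=100):
    """Yield dicts of ids/documents/metadatas, one per batch (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        yield {
            "ids": [doc["id"] for doc in chunk],
            "documents": [doc["text"] for doc in chunk],
            "metadatas": [doc["metadata"] for doc in chunk],
        }


if __name__ == "__main__":
    docs = [
        {"id": f"doc-{n}", "text": f"Page {n}", "metadata": {"wordCount": 2}}
        for n in range(5)
    ]
    for batch in to_chroma_batches(docs, batch_size=2):
        print(batch["ids"])

    # With chromadb installed, each batch could be passed on like this
    # (the collection name "docs" is a placeholder):
    #   import chromadb
    #   collection = chromadb.Client().create_collection("docs")
    #   for batch in to_chroma_batches(docs):
    #       collection.add(**batch)
```

When `add` is called without precomputed embeddings, Chroma falls back to the collection's embedding function. Supplying your own vectors would mean adding an `embeddings` list to each batch.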