RAG Knowledge Loader
Pricing
$1.00 / 1,000 results
Developer

BotFlowTech
Scrapes documentation sites (GitBook, ReadTheDocs, Notion public pages) and converts them into vector-ready JSON format for RAG applications.
Features
- Crawls entire documentation sites recursively
- Extracts clean, structured content
- Removes navigation, headers, footers automatically
- Outputs vector-ready JSON format
- Supports GitBook, ReadTheDocs, Notion, and custom doc sites
Use Cases
- Build "Chat with Docs" chatbots
- Feed LLMs with up-to-date documentation
- Create knowledge bases for RAG pipelines
- Automated documentation updates for vector databases
Input Parameters
Required
- Start URLs (required): Array of documentation site URLs to scrape
  - Example: https://docs.apify.com/,https://your-gitbook-site.com
Optional Configuration
- Max pages to crawl (default: 1000): Maximum number of pages to scrape
  - Minimum: 1
- Include URL patterns (globs) (default: []): Only crawl URLs matching these patterns
  - Example: ["**/api/**", "**/guides/**"]
- Exclude URL patterns (globs) (default: ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"]): Skip URLs matching these patterns
- Content CSS Selectors (default: "article, main, .content, .markdown-body, #content, [role='main']"): Comma-separated CSS selectors for the main content area
- Remove CSS Selectors (default: "nav, header, footer, .sidebar, #sidebar, .navigation, .cookie-banner, script, style, iframe"): Comma-separated CSS selectors for elements to remove, such as navigation and headers
- Output Format (default: "vector-ready"):
  - "vector-ready": Flat structure optimized for embeddings
  - "hierarchical": Nested structure with full metadata
- Crawler Type (default: "cheerio"):
  - "cheerio": Fast HTTP crawler for static sites
  - "playwright": Browser-based crawler for JavaScript-heavy sites
Example Input JSON
{
  "startUrls": [
    { "url": "https://docs.example.com/" },
    { "url": "https://your-gitbook.com/docs" }
  ],
  "maxPages": 500,
  "excludeUrlGlobs": ["**/*.pdf", "**/login**", "**/signup**"],
  "includeUrlGlobs": ["**/docs/**"],
  "contentSelectors": "article, main, .markdown-body",
  "removeSelectors": "nav, footer, .sidebar",
  "outputFormat": "vector-ready",
  "crawlerType": "cheerio"
}
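To make the include and exclude patterns more concrete, here is a rough Python sketch of how glob filtering of this kind could work. It is an approximation for illustration only: the actor's real matching rules are not documented here, and the helper names `glob_to_regex` and `should_crawl` are hypothetical. The sketch assumes `**` matches any characters (including `/`), `*` matches any characters except `/`, and that exclude patterns take precedence over include patterns.

```python
import re


def glob_to_regex(pattern):
    """Translate a simple glob into a compiled regex (assumed semantics).

    '**' matches anything, including '/'. A single '*' matches anything
    except '/'. Every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_crawl(url, include_globs, exclude_globs):
    """Return True if the URL passes the filters (hypothetical helper)."""
    # Assumption: an exclude match always wins.
    if any(glob_to_regex(g).fullmatch(url) for g in exclude_globs):
        return False
    # An empty include list means "no restriction".
    if not include_globs:
        return True
    return any(glob_to_regex(g).fullmatch(url) for g in include_globs)


if __name__ == "__main__":
    exclude = ["**/*.pdf", "**/login**"]
    include = ["**/api/**"]
    for candidate in (
        "https://docs.example.com/api/users",
        "https://docs.example.com/files/manual.pdf",
        "https://docs.example.com/blog/post",
    ):
        print(candidate, should_crawl(candidate, include, exclude))
```

Under these assumptions, only the first URL in the usage loop would be crawled: the second is excluded as a PDF and the third matches no include pattern.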
Minimal Input Example
{ "startUrls": [ { "url": "https://docs.example.com/" } ] }
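The parameters and defaults listed under Input Parameters can be assembled into a run input programmatically. The sketch below is one possible way to do that in Python. The JSON key names are taken from the example input above; the `build_input` helper and its validation rules are assumptions for illustration, not part of the actor.

```python
import copy
import json

# Defaults copied from the "Optional Configuration" list in this README.
DEFAULTS = {
    "maxPages": 1000,
    "includeUrlGlobs": [],
    "excludeUrlGlobs": ["**/*.pdf", "**/*.zip", "**/login**", "**/signup**"],
    "contentSelectors": "article, main, .content, .markdown-body, #content, [role='main']",
    "removeSelectors": (
        "nav, header, footer, .sidebar, #sidebar, .navigation, "
        ".cookie-banner, script, style, iframe"
    ),
    "outputFormat": "vector-ready",
    "crawlerType": "cheerio",
}


def build_input(start_urls, **overrides):
    """Build an input payload (hypothetical helper, not an actor API)."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    # Deep-copy so callers cannot mutate the shared default lists.
    payload = copy.deepcopy(DEFAULTS)
    payload.update(overrides)
    payload["startUrls"] = [{"url": url} for url in start_urls]

    if payload["maxPages"] < 1:
        raise ValueError("maxPages must be at least 1")
    if payload["outputFormat"] not in ("vector-ready", "hierarchical"):
        raise ValueError("outputFormat must be 'vector-ready' or 'hierarchical'")
    if payload["crawlerType"] not in ("cheerio", "playwright"):
        raise ValueError("crawlerType must be 'cheerio' or 'playwright'")
    return payload


if __name__ == "__main__":
    run_input = build_input(["https://docs.example.com/"], maxPages=500)
    print(json.dumps(run_input, indent=2))
```

Validating locally like this catches typos in option names before a run is started; the resulting dictionary can then be serialized as the actor's input JSON.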
Output Format
Vector-Ready Format (Default)
Optimized for direct ingestion into vector databases:
{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"],
    "readyForEmbedding": true
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "text": "Full page content with all text extracted and cleaned...",
      "metadata": {
        "source": "https://docs.example.com/page",
        "title": "Page Title",
        "url": "https://docs.example.com/page",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "wordCount": 1234
      }
    }
  ]
}
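Full pages are often too long to embed as a single vector, so a common next step is to split each document's text into smaller chunks. The following is a minimal sketch of that step, assuming the output has the shape shown above. The `chunk_documents` helper, the `<id>-<n>` chunk ID scheme, and the `chunkIndex` metadata field are illustrative choices, not something the actor produces.

```python
def chunk_documents(output, max_words=200, overlap=20):
    """Yield word-based chunks for each document (hypothetical helper).

    Each chunk keeps the parent document's metadata and adds a
    'chunkIndex' field (an assumed name) so it can be traced back.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    step = max_words - overlap

    for doc in output["documents"]:
        words = doc["text"].split()
        start = 0
        index = 0
        while start < len(words):
            piece = words[start:start + max_words]
            yield {
                "id": f"{doc['id']}-{index}",  # assumed ID scheme
                "text": " ".join(piece),
                "metadata": {**doc["metadata"], "chunkIndex": index},
            }
            # Stop once this chunk has reached the end of the text.
            if start + max_words >= len(words):
                break
            start += step
            index += 1


if __name__ == "__main__":
    sample = {
        "documents": [
            {
                "id": "unique-doc-id-123",
                "text": "Full page content with all text extracted and cleaned...",
                "metadata": {"source": "https://docs.example.com/page"},
            }
        ]
    }
    for chunk in chunk_documents(sample, max_words=5, overlap=1):
        print(chunk["id"], "->", chunk["text"])
```

Splitting on whitespace is the simplest option; a real pipeline might prefer token-based or heading-aware splitting, depending on the embedding model.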
Hierarchical Format
Includes full document structure with headings and metadata:
{
  "metadata": {
    "crawledAt": "2025-12-06T08:11:00.000Z",
    "totalPages": 150,
    "startUrls": ["https://docs.example.com/"]
  },
  "documents": [
    {
      "id": "unique-doc-id-123",
      "url": "https://docs.example.com/page",
      "title": "Page Title",
      "content": "Full page content...",
      "metadata": {
        "description": "Page meta description",
        "keywords": "api, documentation",
        "scrapedAt": "2025-12-06T08:11:00.000Z",
        "headings": [
          { "level": 1, "text": "Introduction" },
          { "level": 2, "text": "Getting Started" }
        ],
        "wordCount": 1234,
        "characterCount": 5678
      }
    }
  ]
}
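If a run was made in hierarchical format, the same data can still be reshaped for embedding. The sketch below assumes the two document shapes shown in this README and maps one onto the other; it also renders the `headings` list as an indented outline. Both helper names are hypothetical, and setting `source` equal to `url` simply mirrors the vector-ready example above.

```python
def to_vector_ready(doc):
    """Map one hierarchical document to the flat shape (assumed mapping)."""
    meta = doc["metadata"]
    return {
        "id": doc["id"],
        "text": doc["content"],
        "metadata": {
            "source": doc["url"],  # the flat example uses the same value as "url"
            "title": doc["title"],
            "url": doc["url"],
            "scrapedAt": meta["scrapedAt"],
            "wordCount": meta["wordCount"],
        },
    }


def heading_outline(doc, indent="  "):
    """Render the headings list as an indented text outline."""
    lines = []
    for heading in doc["metadata"].get("headings", []):
        # Level 1 gets no indentation, level 2 one step, and so on.
        depth = max(heading["level"] - 1, 0)
        lines.append(indent * depth + heading["text"])
    return "\n".join(lines)


if __name__ == "__main__":
    hierarchical_doc = {
        "id": "unique-doc-id-123",
        "url": "https://docs.example.com/page",
        "title": "Page Title",
        "content": "Full page content...",
        "metadata": {
            "scrapedAt": "2025-12-06T08:11:00.000Z",
            "headings": [
                {"level": 1, "text": "Introduction"},
                {"level": 2, "text": "Getting Started"},
            ],
            "wordCount": 1234,
        },
    }
    print(heading_outline(hierarchical_doc))
    print(to_vector_ready(hierarchical_doc)["metadata"]["source"])
```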
Integration with Vector Databases
The output is ready to use with popular RAG frameworks:
- LangChain: Use JSONLoader to load documents
- LlamaIndex: Import as Document objects
- Pinecone/Weaviate: Batch upsert with metadata
- Chroma: Add to collection with embeddings
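Whichever database is used, ingestion usually comes down to sending records in batches. The sketch below shows that pattern using only the Python standard library. The `upsert` callable is a placeholder that the caller supplies (for example, a thin wrapper around a vector database client); the helper names are hypothetical, and computing embeddings is left to that callable.

```python
from itertools import islice


def iter_records(output):
    """Yield (id, text, metadata) tuples from vector-ready output."""
    for doc in output["documents"]:
        yield doc["id"], doc["text"], doc["metadata"]


def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def upsert_in_batches(output, upsert, batch_size=100):
    """Call the caller-supplied `upsert(batch)` once per batch.

    Returns the total number of records sent.
    """
    total = 0
    for batch in batched(iter_records(output), batch_size):
        upsert(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    sample = {
        "documents": [
            {"id": f"doc-{i}", "text": f"Page {i}", "metadata": {"wordCount": 2}}
            for i in range(5)
        ]
    }
    # Stand-in for a real client call: just report each batch size.
    sent = upsert_in_batches(sample, lambda b: print("batch of", len(b)), batch_size=2)
    print("total records:", sent)
```

Passing the upsert step in as a function keeps the batching logic independent of any particular client library.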