RAG Knowledge Graph Builder
Pricing
from $0.01 / 1,000 results
Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.
Rating: 5.0 (2)
Developer: csp
Bookmarked: 3 · Total users: 3 · Monthly active users: 2
Last modified: 6 days ago
RAG-Ready Knowledge Graph Builder
🧠 Transform any website into a semantic dataset optimized for Retrieval-Augmented Generation (RAG)
The Problem
Traditional web scrapers produce giant walls of text (like llms-full.txt). For large sites, this approach has critical limitations:
- Exceeds context windows of most LLMs
- Models get "lost in the middle" of long documents
- Raw text provides no semantic structure for retrieval
- Poor retrieval accuracy in RAG pipelines
The Solution
This Actor creates a pre-indexed semantic dataset that AI agents can ingest instantly with high accuracy:
- Intelligent Crawling - Crawls websites following same-domain links
- Semantic Chunking - Uses recursive character splitting to create logical segments (500-1000 tokens)
- Hypothetical Question Generation - For every chunk, generates potential user questions using LLM
- RAG-Ready Output - Structured JSON where each object contains chunk text, source URL, and hypothetical questions
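Conceptually, the four steps above form one pipeline. A minimal self-contained sketch (the `crawl`, `chunk_text`, and `generate_questions` stubs stand in for the Actor's real crawler, splitter, and LLM call, and are illustrative only):

```python
import hashlib

def crawl(start_urls):
    """Stub crawler: the real Actor fetches pages and follows
    same-domain links. Here it just yields one canned page per URL."""
    for url in start_urls:
        yield url, "Apify is a platform for web scraping and automation. " * 40, "Demo"

def chunk_text(text, size=750):
    """Stub chunker: a naive fixed-size split standing in for the
    recursive character splitter."""
    words = text.split()
    step = size // 5  # rough words-per-chunk stand-in for token sizing
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

def generate_questions(chunk, n):
    """Stub for the LLM call that writes hypothetical questions."""
    return [f"Question {i + 1} about this chunk?" for i in range(n)]

def build_rag_dataset(start_urls, questions_per_chunk=3):
    records = []
    for url, text, title in crawl(start_urls):          # 1. crawl
        for i, chunk in enumerate(chunk_text(text)):    # 2. chunk
            records.append({
                "chunkId": f"{hashlib.md5(url.encode()).hexdigest()[:6]}_{i}",
                "chunkText": chunk,
                "sourceUrl": url,
                "pageTitle": title,
                "hypotheticalQuestions": generate_questions(chunk, questions_per_chunk),  # 3. questions
            })
    return records                                      # 4. RAG-ready output
```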
Why It's Better
Instead of raw text, you get pre-indexed data that substantially improves retrieval accuracy:
```json
{
  "chunkId": "abc123_0",
  "chunkText": "Apify is a platform for web scraping and automation...",
  "sourceUrl": "https://docs.apify.com/platform",
  "hypotheticalQuestions": [
    "What is Apify used for?",
    "How does Apify help with web scraping?",
    "What automation capabilities does Apify provide?"
  ],
  "tokenCount": 487,
  "metadata": {
    "pageTitle": "Apify Platform Overview",
    "crawledAt": "2024-01-15T10:30:00Z"
  }
}
```
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | URLs to start crawling from |
| maxCrawlPages | integer | 50 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlDepth | integer | 3 | Maximum link depth from start URLs |
| chunkSize | integer | 750 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlapping tokens between chunks |
| questionsPerChunk | integer | 3 | Hypothetical questions per chunk |
| llmProvider | string | "openai" | LLM provider (openai/anthropic) |
| llmModel | string | "gpt-4o-mini" | Model for question generation |
| openaiApiKey | string | - | OpenAI API key (required for OpenAI) |
| anthropicApiKey | string | - | Anthropic API key (required for Anthropic) |
| excludeSelectors | array | [...] | CSS selectors to exclude |
| urlPatterns | array | [] | URL patterns to include |
| excludeUrlPatterns | array | [...] | URL patterns to exclude |
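A typical run input combining these parameters might look like the following (all values are illustrative, including the URL and key placeholder; pass the dict as the Actor's JSON input or via an Apify client):

```python
# Illustrative run input for the Actor; values are examples, not defaults you must use
run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 50,
    "maxCrawlDepth": 3,
    "chunkSize": 750,
    "chunkOverlap": 100,
    "questionsPerChunk": 3,
    "llmProvider": "openai",
    "llmModel": "gpt-4o-mini",
    "openaiApiKey": "sk-...",          # required when llmProvider is "openai"
    "excludeSelectors": ["nav", "footer"],
    "urlPatterns": [],                 # empty = include all same-domain pages
    "excludeUrlPatterns": ["/blog/*"],
}
```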
Output
Dataset (per chunk)
```json
{
  "chunkId": "unique_chunk_identifier",
  "chunkIndex": 0,
  "chunkText": "The actual text content...",
  "tokenCount": 523,
  "sourceUrl": "https://example.com/page",
  "pageTitle": "Page Title",
  "pageDescription": "Meta description",
  "hypotheticalQuestions": ["Question 1?", "Question 2?", "Question 3?"],
  "questionsCount": 3,
  "metadata": {
    "crawledAt": "2024-01-15T10:30:00Z",
    "chunkStart": 0,
    "chunkEnd": 2100,
    "totalChunksInPage": 5
  }
}
```
Key-Value Store
- OUTPUT - Processing summary with statistics
- rag-dataset.json - Complete dataset as single JSON file
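Because rag-dataset.json is plain JSON, it needs no special tooling to consume. A quick sanity check with only the standard library (the two embedded records are an illustrative excerpt matching the dataset schema, not real output):

```python
import json

# A two-record excerpt shaped like rag-dataset.json
raw = """[
  {"chunkId": "abc123_0", "chunkText": "...", "sourceUrl": "https://example.com/a",
   "hypotheticalQuestions": ["Q1?", "Q2?", "Q3?"], "tokenCount": 487},
  {"chunkId": "abc123_1", "chunkText": "...", "sourceUrl": "https://example.com/a",
   "hypotheticalQuestions": ["Q4?", "Q5?", "Q6?"], "tokenCount": 512}
]"""
chunks = json.loads(raw)

# Group chunks by source page to verify the crawl covered what you expected
by_page = {}
for chunk in chunks:
    by_page.setdefault(chunk["sourceUrl"], []).append(chunk)

for url, page_chunks in by_page.items():
    print(url, len(page_chunks), "chunks")
```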
Use Cases
1. Build a Documentation Chatbot
Crawl your docs site and create a knowledge base for a customer support bot.
2. Create a Research Assistant
Index academic papers or research sites for semantic search.
3. Power a Content Discovery Engine
Build a recommendation system based on semantic similarity.
4. Train Custom Embeddings
Use the chunks and questions to fine-tune embedding models.
LLM Cost Estimation
Using GPT-4o-mini (~$0.15/1M input tokens, ~$0.60/1M output tokens):
- 100 pages × 5 chunks/page × 3 questions = ~$0.10-0.20
Using Claude 3 Haiku (~$0.25/1M input tokens, ~$1.25/1M output tokens):
- 100 pages × 5 chunks/page × 3 questions = ~$0.15-0.30
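These estimates follow from simple per-request arithmetic. A rough calculator (the token counts per request are assumptions for illustration, not measured values):

```python
def estimate_cost(pages, chunks_per_page, questions_per_chunk,
                  input_rate_per_m, output_rate_per_m,
                  input_tokens_per_request=900,
                  tokens_per_question=20):
    """One LLM request per chunk: the chunk plus prompt goes in,
    the generated questions come out. Rates are $ per 1M tokens."""
    requests = pages * chunks_per_page
    input_cost = requests * input_tokens_per_request * input_rate_per_m / 1e6
    output_cost = (requests * questions_per_chunk * tokens_per_question
                   * output_rate_per_m / 1e6)
    return input_cost + output_cost

# The 100-page, 5-chunks/page, 3-questions scenarios above
gpt4o_mini = estimate_cost(100, 5, 3, 0.15, 0.60)
claude_haiku = estimate_cost(100, 5, 3, 0.25, 1.25)
```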
Integration Examples
With LangChain
```python
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

# Load the RAG dataset (load_apify_dataset is your own helper that
# fetches the Actor's dataset items)
chunks = load_apify_dataset("your-run-id")

# Create documents with questions as metadata
documents = []
for chunk in chunks:
    doc = Document(
        page_content=chunk["chunkText"],
        metadata={
            "source": chunk["sourceUrl"],
            "questions": chunk["hypotheticalQuestions"],
        },
    )
    documents.append(doc)

# Create vector store
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())
```
With LlamaIndex
```python
from llama_index import Document, VectorStoreIndex

# Create documents from chunks
documents = [
    Document(
        text=chunk["chunkText"],
        metadata={
            "url": chunk["sourceUrl"],
            "questions": chunk["hypotheticalQuestions"],
        },
    )
    for chunk in chunks
]

# Build index
index = VectorStoreIndex.from_documents(documents)
```
Technical Details
Chunking Strategy
- Recursive Character Splitter - Splits on semantic boundaries (paragraphs → sentences → words)
- Token-based sizing - Uses tiktoken for accurate GPT-4 token counting
- Overlap handling - Maintains context between chunks
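The recursive splitting idea can be sketched in pure Python. This is a simplification: it approximates tokens as words (the Actor uses tiktoken), omits overlap, and uses a fixed separator hierarchy:

```python
def recursive_split(text, max_tokens=750, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator whose pieces fit the limit,
    recursing with finer separators when a piece is still too large.
    Token count is approximated as the word count."""
    if len(text.split()) <= max_tokens:
        return [text]
    sep = separators[0]
    parts = text.split(sep)
    if len(parts) == 1 and len(separators) > 1:
        # This separator did not divide the text; try a finer one
        return recursive_split(text, max_tokens, separators[1:])
    # Greedily merge parts into chunks no larger than max_tokens
    chunks, current = [], ""
    for part in parts:
        candidate = (current + sep + part) if current else part
        if len(candidate.split()) > max_tokens and current:
            chunks.append(current)
            current = part
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Recurse into any chunk that is still over the limit
    result = []
    for c in chunks:
        if len(c.split()) > max_tokens and len(separators) > 1:
            result.extend(recursive_split(c, max_tokens, separators[1:]))
        else:
            result.append(c)
    return result
```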
Question Generation
- Uses system prompts optimized for retrieval-focused questions
- Generates diverse question types (what, how, why, when, etc.)
- Questions are self-contained and specific to chunk content
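The exact prompt is internal to the Actor, but a plausible shape for the request, reconstructed from the three properties above and using the OpenAI chat-message format, is:

```python
def build_question_messages(chunk_text, n_questions=3):
    """Assemble a chat request asking the LLM for retrieval-focused,
    self-contained questions answerable from the chunk alone.
    (Illustrative reconstruction, not the Actor's actual prompt.)"""
    system = (
        "You write hypothetical user questions for a retrieval system. "
        "Each question must be self-contained, specific to the passage, "
        "and answerable from it alone. Vary the question type "
        "(what, how, why, when). Return one question per line."
    )
    user = f"Passage:\n{chunk_text}\n\nWrite {n_questions} questions."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_question_messages("Apify is a platform for web scraping.", 3)
```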
License
ISC
Support
For issues or feature requests, please open an issue on the repository.