Rag Knowledge Graph Builder

Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.

Pricing: from $0.01 / 1,000 results
Rating: 5.0 (2)
Developer: csp (Maintained by Community)

Actor stats: 3 bookmarks · 3 total users · 2 monthly active users · last modified 6 days ago

RAG-Ready Knowledge Graph Builder

🧠 Transform any website into a semantic dataset optimized for Retrieval-Augmented Generation (RAG)

The Problem

Traditional web scrapers produce giant walls of text (like llms-full.txt). For large sites, this approach has critical limitations:

  • Exceeds context windows of most LLMs
  • Models get "lost in the middle" of long documents
  • Raw text provides no semantic structure for retrieval
  • Poor retrieval accuracy in RAG pipelines

The Solution

This Actor creates a pre-indexed semantic dataset that AI agents can ingest instantly with high accuracy:

  1. Intelligent Crawling - Crawls websites following same-domain links
  2. Semantic Chunking - Uses recursive character splitting to create logical segments (500-1000 tokens)
  3. Hypothetical Question Generation - For every chunk, generates potential user questions using an LLM
  4. RAG-Ready Output - Structured JSON where each object contains chunk text, source URL, and hypothetical questions

Why It's Better

Instead of raw text, you get pre-indexed data that dramatically improves retrieval accuracy:

{
  "chunkId": "abc123_0",
  "chunkText": "Apify is a platform for web scraping and automation...",
  "sourceUrl": "https://docs.apify.com/platform",
  "hypotheticalQuestions": [
    "What is Apify used for?",
    "How does Apify help with web scraping?",
    "What automation capabilities does Apify provide?"
  ],
  "tokenCount": 487,
  "metadata": {
    "pageTitle": "Apify Platform Overview",
    "crawledAt": "2024-01-15T10:30:00Z"
  }
}
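The hypothetical questions act as extra retrieval keys: a user query is matched against the questions rather than only the raw chunk text. A toy sketch of the idea (the sample chunks are invented, and word-overlap scoring stands in for embedding similarity):

```python
# Sample chunks (invented for illustration) with their generated questions.
chunks = [
    {"chunkId": "abc123_0",
     "chunkText": "Apify is a platform for web scraping and automation...",
     "hypotheticalQuestions": ["What is Apify used for?",
                               "How does Apify help with web scraping?"]},
    {"chunkId": "def456_0",
     "chunkText": "The Actor costs from $0.01 per 1,000 results...",
     "hypotheticalQuestions": ["How much does the Actor cost?"]},
]

# Index every question alongside the chunk it came from.
index = [(q, c) for c in chunks for q in c["hypotheticalQuestions"]]

def retrieve(query):
    # Word-overlap scoring stands in for embedding similarity.
    query_words = set(query.lower().split())
    def score(question):
        return len(query_words & set(question.lower().split()))
    _, best_chunk = max(index, key=lambda pair: score(pair[0]))
    return best_chunk["chunkId"]
```

In a real pipeline you would embed the questions and query with the same embedding model; the indexing shape (question, parent chunk) stays the same.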

Input Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| startUrls | array | required | URLs to start crawling from |
| maxCrawlPages | integer | 50 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlDepth | integer | 3 | Maximum link depth from start URLs |
| chunkSize | integer | 750 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlapping tokens between chunks |
| questionsPerChunk | integer | 3 | Hypothetical questions per chunk |
| llmProvider | string | "openai" | LLM provider (openai/anthropic) |
| llmModel | string | "gpt-4o-mini" | Model for question generation |
| openaiApiKey | string | - | OpenAI API key (required for OpenAI) |
| anthropicApiKey | string | - | Anthropic API key (required for Anthropic) |
| excludeSelectors | array | [...] | CSS selectors to exclude |
| urlPatterns | array | [] | URL patterns to include |
| excludeUrlPatterns | array | [...] | URL patterns to exclude |
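Putting these together, a typical input could look like the following (values are illustrative, and startUrls is assumed to use the standard Apify request-list shape):

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "chunkSize": 750,
  "chunkOverlap": 100,
  "questionsPerChunk": 3,
  "llmProvider": "openai",
  "llmModel": "gpt-4o-mini",
  "openaiApiKey": "sk-..."
}
```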

Output

Dataset (per chunk)

{
  "chunkId": "unique_chunk_identifier",
  "chunkIndex": 0,
  "chunkText": "The actual text content...",
  "tokenCount": 523,
  "sourceUrl": "https://example.com/page",
  "pageTitle": "Page Title",
  "pageDescription": "Meta description",
  "hypotheticalQuestions": [
    "Question 1?",
    "Question 2?",
    "Question 3?"
  ],
  "questionsCount": 3,
  "metadata": {
    "crawledAt": "2024-01-15T10:30:00Z",
    "chunkStart": 0,
    "chunkEnd": 2100,
    "totalChunksInPage": 5
  }
}

Key-Value Store

  • OUTPUT - Processing summary with statistics
  • rag-dataset.json - Complete dataset as single JSON file

Use Cases

1. Build a Documentation Chatbot

Crawl your docs site and create a knowledge base for a customer support bot.

2. Create a Research Assistant

Index academic papers or research sites for semantic search.

3. Power a Content Discovery Engine

Build a recommendation system based on semantic similarity.

4. Train Custom Embeddings

Use the chunks and questions to fine-tune embedding models.

LLM Cost Estimation

Using GPT-4o-mini (~$0.15/1M input tokens, ~$0.60/1M output tokens):

  • 100 pages × 5 chunks/page × 3 questions = ~$0.10-0.20

Using Claude 3 Haiku (~$0.25/1M input tokens, ~$1.25/1M output tokens):

  • 100 pages × 5 chunks/page × 3 questions = ~$0.15-0.30
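These ranges can be sanity-checked with a quick back-of-the-envelope calculation. The per-chunk token counts below are assumptions (chunk text plus prompt in, a few short questions out), not measured values:

```python
def estimate_llm_cost(pages, chunks_per_page,
                      input_price_per_m, output_price_per_m,
                      input_tokens_per_chunk=800, output_tokens_per_chunk=100):
    """Rough cost of question generation for a crawl."""
    total_chunks = pages * chunks_per_page
    input_cost = total_chunks * input_tokens_per_chunk / 1e6 * input_price_per_m
    output_cost = total_chunks * output_tokens_per_chunk / 1e6 * output_price_per_m
    return input_cost + output_cost

# 100 pages x 5 chunks/page at GPT-4o-mini and Claude 3 Haiku prices
gpt4o_mini = estimate_llm_cost(100, 5, 0.15, 0.60)   # ~$0.09
haiku = estimate_llm_cost(100, 5, 0.25, 1.25)        # ~$0.16
```

Both land at the low end of the quoted ranges; longer chunks or more questions per chunk push the cost toward the upper end.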

Integration Examples

With LangChain

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document

# Load the RAG dataset (load_apify_dataset is a placeholder for your own loader)
chunks = load_apify_dataset("your-run-id")

# Create documents with questions as metadata
documents = []
for chunk in chunks:
    doc = Document(
        page_content=chunk["chunkText"],
        metadata={
            "source": chunk["sourceUrl"],
            # Chroma metadata values must be scalars, so join the list
            "questions": " | ".join(chunk["hypotheticalQuestions"]),
        },
    )
    documents.append(doc)

# Create vector store
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

With LlamaIndex

from llama_index import Document, VectorStoreIndex

# Create documents from chunks
documents = [
    Document(
        text=chunk["chunkText"],
        metadata={
            "url": chunk["sourceUrl"],
            "questions": chunk["hypotheticalQuestions"],
        },
    )
    for chunk in chunks
]

# Build index
index = VectorStoreIndex.from_documents(documents)

Technical Details

Chunking Strategy

  • Recursive Character Splitter - Splits on semantic boundaries (paragraphs → sentences → words)
  • Token-based sizing - Uses tiktoken for accurate GPT-4 token counting
  • Overlap handling - Maintains context between chunks
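The strategy above can be sketched in a few lines. This is a simplified stand-in, not the Actor's implementation: whitespace words approximate tokens instead of tiktoken, separators are consumed at chunk boundaries, and chunk overlap is omitted:

```python
def recursive_split(text, max_tokens=750,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text on progressively finer boundaries until chunks fit."""
    def n_tokens(s):
        # Whitespace words as a stand-in for tiktoken token counts.
        return len(s.split())

    if n_tokens(text) <= max_tokens:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if n_tokens(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse in case a single part still exceeds the limit.
            return [piece for chunk in chunks
                    for piece in recursive_split(chunk, max_tokens, separators)]
    # No separator helped: hard-split by words.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Paragraph breaks are tried first, so chunks tend to align with semantic units; only oversized paragraphs fall through to sentence and word splits.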

Question Generation

  • Uses system prompts optimized for retrieval-focused questions
  • Generates diverse question types (what, how, why, when, etc.)
  • Questions are self-contained and specific to chunk content
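The Actor's actual prompts are not published; a plausible shape for such a retrieval-focused prompt (all wording here is an assumption) might be:

```python
# Hypothetical prompt construction for question generation; the system
# and user messages below are illustrative, not the Actor's real prompts.
SYSTEM_PROMPT = (
    "You generate questions a user might ask that the given text answers. "
    "Each question must be self-contained and specific to the text."
)

def build_messages(chunk_text, n_questions=3):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Generate {n_questions} diverse questions "
                    f"(what/how/why/when) for this text:\n\n{chunk_text}"},
    ]
```

The resulting message list can be passed to any chat-completion API; the chunk text is embedded directly in the user message so each request stands alone.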

License

ISC

Support

For issues or feature requests, please open an issue on the repository.