Rag Knowledge Graph Builder

Transform websites into RAG-ready datasets. Crawls pages, chunks content into semantic segments (500-1000 tokens), and generates hypothetical questions for each chunk. No API key needed with native mode. Output: pre-indexed JSON optimized for AI retrieval with 3x better accuracy than raw text.

Pricing: from $0.01 / 1,000 results
Rating: 5.0 (2)
Developer: csp (Maintained by Community)

Actor stats: 3 bookmarks · 3 total users · 2 monthly active users · last modified 6 days ago

RAG-Ready Knowledge Graph Builder

🧠 Transform any website into a semantic dataset optimized for Retrieval-Augmented Generation (RAG)

The Problem

Traditional web scrapers produce giant walls of text (like llms-full.txt). For large sites, this approach has critical limitations:

  • Exceeds context windows of most LLMs
  • Models get "lost in the middle" of long documents
  • Raw text provides no semantic structure for retrieval
  • Poor retrieval accuracy in RAG pipelines

The Solution

This Actor creates a pre-indexed semantic dataset that AI agents can ingest instantly with high accuracy:

  1. Intelligent Crawling - Crawls websites following same-domain links
  2. Semantic Chunking - Uses recursive character splitting to create logical segments (500-1000 tokens)
  3. Hypothetical Question Generation - For every chunk, generates potential user questions using an LLM
  4. RAG-Ready Output - Structured JSON where each object contains chunk text, source URL, and hypothetical questions

Why It's Better

Instead of raw text, you get pre-indexed data that dramatically improves retrieval accuracy:

{
  "chunkId": "abc123_0",
  "chunkText": "Apify is a platform for web scraping and automation...",
  "sourceUrl": "https://docs.apify.com/platform",
  "hypotheticalQuestions": [
    "What is Apify used for?",
    "How does Apify help with web scraping?",
    "What automation capabilities does Apify provide?"
  ],
  "tokenCount": 487,
  "metadata": {
    "pageTitle": "Apify Platform Overview",
    "crawledAt": "2024-01-15T10:30:00Z"
  }
}
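The hypothetical questions act as extra retrieval keys: a user query is matched against the questions rather than only the raw chunk text. A toy sketch of the idea (the sample chunks are invented, and word-overlap scoring stands in for embedding similarity):

```python
# Sample chunks (invented for illustration) with their generated questions.
chunks = [
    {"chunkId": "abc123_0",
     "chunkText": "Apify is a platform for web scraping and automation...",
     "hypotheticalQuestions": ["What is Apify used for?",
                               "How does Apify help with web scraping?"]},
    {"chunkId": "def456_0",
     "chunkText": "The Actor costs from $0.01 per 1,000 results...",
     "hypotheticalQuestions": ["How much does the Actor cost?"]},
]

# Index every question alongside the chunk it came from.
index = [(q, c) for c in chunks for q in c["hypotheticalQuestions"]]

def retrieve(query):
    # Word-overlap scoring stands in for embedding similarity.
    query_words = set(query.lower().split())
    def score(question):
        return len(query_words & set(question.lower().split()))
    _, best_chunk = max(index, key=lambda pair: score(pair[0]))
    return best_chunk["chunkId"]
```

In a real pipeline you would embed the questions and query with the same embedding model; the indexing shape (question, parent chunk) stays the same.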

Input Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| startUrls | array | required | URLs to start crawling from |
| maxCrawlPages | integer | 50 | Maximum pages to crawl (0 = unlimited) |
| maxCrawlDepth | integer | 3 | Maximum link depth from start URLs |
| chunkSize | integer | 750 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlapping tokens between chunks |
| questionsPerChunk | integer | 3 | Hypothetical questions per chunk |
| llmProvider | string | "openai" | LLM provider (openai/anthropic) |
| llmModel | string | "gpt-4o-mini" | Model for question generation |
| openaiApiKey | string | - | OpenAI API key (required for OpenAI) |
| anthropicApiKey | string | - | Anthropic API key (required for Anthropic) |
| excludeSelectors | array | [...] | CSS selectors to exclude |
| urlPatterns | array | [] | URL patterns to include |
| excludeUrlPatterns | array | [...] | URL patterns to exclude |
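Putting these together, a typical input could look like the following (values are illustrative, and startUrls is assumed to use the standard Apify request-list shape):

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "chunkSize": 750,
  "chunkOverlap": 100,
  "questionsPerChunk": 3,
  "llmProvider": "openai",
  "llmModel": "gpt-4o-mini",
  "openaiApiKey": "sk-..."
}
```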

Output

Dataset (per chunk)

{
  "chunkId": "unique_chunk_identifier",
  "chunkIndex": 0,
  "chunkText": "The actual text content...",
  "tokenCount": 523,
  "sourceUrl": "https://example.com/page",
  "pageTitle": "Page Title",
  "pageDescription": "Meta description",
  "hypotheticalQuestions": [
    "Question 1?",
    "Question 2?",
    "Question 3?"
  ],
  "questionsCount": 3,
  "metadata": {
    "crawledAt": "2024-01-15T10:30:00Z",
    "chunkStart": 0,
    "chunkEnd": 2100,
    "totalChunksInPage": 5
  }
}

Key-Value Store

  • OUTPUT - Processing summary with statistics
  • rag-dataset.json - Complete dataset as single JSON file

Use Cases

1. Build a Documentation Chatbot

Crawl your docs site and create a knowledge base for a customer support bot.

2. Create a Research Assistant

Index academic papers or research sites for semantic search.

3. Power a Content Discovery Engine

Build a recommendation system based on semantic similarity.

4. Train Custom Embeddings

Use the chunks and questions to fine-tune embedding models.

LLM Cost Estimation

Using GPT-4o-mini (~$0.15/1M input tokens, ~$0.60/1M output tokens):

  • 100 pages × 5 chunks/page × 3 questions = ~$0.10-0.20

Using Claude 3 Haiku (~$0.25/1M input tokens, ~$1.25/1M output tokens):

  • 100 pages × 5 chunks/page × 3 questions = ~$0.15-0.30
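These ranges can be sanity-checked with a quick back-of-the-envelope calculation. The per-chunk token counts below are assumptions (chunk text plus prompt in, a few short questions out), not measured values:

```python
def estimate_llm_cost(pages, chunks_per_page,
                      input_price_per_m, output_price_per_m,
                      input_tokens_per_chunk=800, output_tokens_per_chunk=100):
    """Rough cost of question generation for a crawl."""
    total_chunks = pages * chunks_per_page
    input_cost = total_chunks * input_tokens_per_chunk / 1e6 * input_price_per_m
    output_cost = total_chunks * output_tokens_per_chunk / 1e6 * output_price_per_m
    return input_cost + output_cost

# 100 pages x 5 chunks/page at GPT-4o-mini and Claude 3 Haiku prices
gpt4o_mini = estimate_llm_cost(100, 5, 0.15, 0.60)   # ~$0.09
haiku = estimate_llm_cost(100, 5, 0.25, 1.25)        # ~$0.16
```

Both land at the low end of the quoted ranges; longer chunks or more questions per chunk push the cost toward the upper end.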

Integration Examples

With LangChain

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.docstore.document import Document

# Load the RAG dataset (load_apify_dataset is a placeholder for your own loader)
chunks = load_apify_dataset("your-run-id")

# Create documents with questions as metadata
documents = []
for chunk in chunks:
    doc = Document(
        page_content=chunk["chunkText"],
        metadata={
            "source": chunk["sourceUrl"],
            # Chroma metadata values must be scalars, so join the list
            "questions": " | ".join(chunk["hypotheticalQuestions"]),
        },
    )
    documents.append(doc)

# Create vector store
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

With LlamaIndex

from llama_index import Document, VectorStoreIndex

# Create documents from chunks
documents = [
    Document(
        text=chunk["chunkText"],
        metadata={
            "url": chunk["sourceUrl"],
            "questions": chunk["hypotheticalQuestions"],
        },
    )
    for chunk in chunks
]

# Build index
index = VectorStoreIndex.from_documents(documents)

Technical Details

Chunking Strategy

  • Recursive Character Splitter - Splits on semantic boundaries (paragraphs → sentences → words)
  • Token-based sizing - Uses tiktoken for accurate GPT-4 token counting
  • Overlap handling - Maintains context between chunks
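The strategy above can be sketched in a few lines. This is a simplified stand-in, not the Actor's implementation: whitespace words approximate tokens instead of tiktoken, separators are consumed at chunk boundaries, and chunk overlap is omitted:

```python
def recursive_split(text, max_tokens=750,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text on progressively finer boundaries until chunks fit."""
    def n_tokens(s):
        # Whitespace words as a stand-in for tiktoken token counts.
        return len(s.split())

    if n_tokens(text) <= max_tokens:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if n_tokens(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse in case a single part still exceeds the limit.
            return [piece for chunk in chunks
                    for piece in recursive_split(chunk, max_tokens, separators)]
    # No separator helped: hard-split by words.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Paragraph breaks are tried first, so chunks tend to align with semantic units; only oversized paragraphs fall through to sentence and word splits.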

Question Generation

  • Uses system prompts optimized for retrieval-focused questions
  • Generates diverse question types (what, how, why, when, etc.)
  • Questions are self-contained and specific to chunk content
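The Actor's actual prompts are not published; a plausible shape for such a retrieval-focused prompt (all wording here is an assumption) might be:

```python
# Hypothetical prompt construction for question generation; the system
# and user messages below are illustrative, not the Actor's real prompts.
SYSTEM_PROMPT = (
    "You generate questions a user might ask that the given text answers. "
    "Each question must be self-contained and specific to the text."
)

def build_messages(chunk_text, n_questions=3):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Generate {n_questions} diverse questions "
                    f"(what/how/why/when) for this text:\n\n{chunk_text}"},
    ]
```

The resulting message list can be passed to any chat-completion API; the chunk text is embedded directly in the user message so each request stands alone.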

License

ISC

Support

For issues or feature requests, please open an issue on the repository.