Rag Architect

Pricing

from $5.00 / 1,000 knowledge chunks

Transform any website into vector-store-ready knowledge chunks for Pinecone, Weaviate, LangChain, LlamaIndex, Supabase, n8n & more. AI-generated Q&A pairs, smart chunking, PII scrubbing. Build hallucination-free RAG chatbots in minutes.

Developer

Jason Pellerin

Maintained by Community

RAG-Architect: Automated Knowledge Engineering Factory

Transform raw web content into high-fidelity, vector-store-ready knowledge chunks with AI-generated Q&A pairs, structure-aware chunking, and PII scrubbing. The cleanroom for AI data.

Why RAG-Architect?

Most AI projects fail not because the LLM is "dumb," but because the Knowledge Base is garbage.

The Problems:

  • Table Shredder: Fixed-token chunking shreds tables mid-row, confusing your AI
  • Context Blindness: Chunks lose their parent context ("The fee is $50" without knowing "For Florida Residents Only")
  • Metadata Rot: Old and new policies on the same site confuse the AI
  • Synthetic Hallucinations: LLM-generated Q&A without grounding checks invents questions the source text cannot answer

RAG-Architect solves all of this.

Features

Structure-Aware Chunking

  • Splits by Markdown headers (#, ##, ###) - not fixed tokens
  • Table Guard: Keeps tables whole, never splits mid-row
  • Code Guard: Preserves code blocks as atomic units
  • Configurable min/max chunk size with overlap
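A minimal sketch of the header-splitting idea, not the Actor's actual implementation. The function name `split_on_headers` and its return shape are assumptions made for illustration; the min/max size limits and overlap are left out to keep it short.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence


def split_on_headers(markdown, split_on=("##", "###")):
    """Split Markdown into (heading, body) chunks at the given header levels.

    Lines inside fenced code blocks are never treated as headers, so a code
    block stays in one chunk. Table rows never start with '#', so this
    splitter cannot cut a table mid-row either.
    """
    chunks = []
    heading, body = None, []
    in_code = False

    def flush():
        # Emit the current chunk unless it is an empty, heading-less preamble.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code
        match = None if in_code else HEADER_RE.match(line)
        if match and match.group(1) in split_on:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each returned tuple pairs a section heading (or `None` for text before the first matching header) with the section body, so a table or code block under a heading always travels with it.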

Parent-Child Context Injection

Every chunk gets a context header:

[Source: example.com | Page: Pricing Plans | Section: Enterprise Tier | Updated: 2025-01-15]
The Enterprise plan includes unlimited API calls, dedicated support...

Your AI never loses its place.
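The header format above can be reproduced with a few lines of string formatting. This is an illustrative sketch; `inject_context` and its parameters are hypothetical names, not part of the Actor's API.

```python
def inject_context(text, source, page, section, updated=None):
    """Prefix a chunk with a bracketed context header like the one shown above."""
    fields = [f"Source: {source}", f"Page: {page}", f"Section: {section}"]
    if updated:
        # The Updated field is optional; it is omitted when no date is known.
        fields.append(f"Updated: {updated}")
    return "[" + " | ".join(fields) + "]\n" + text


chunk = inject_context(
    "The Enterprise plan includes unlimited API calls, dedicated support...",
    source="example.com",
    page="Pricing Plans",
    section="Enterprise Tier",
    updated="2025-01-15",
)
```

Because the header is part of the chunk text itself, it is embedded and retrieved together with the content, which is what lets the model see the parent context.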

Ground Truth Q&A Generator

For every chunk, generates "battle-tested" Q&A pairs using GPT-4o-mini (3 per chunk by default, configurable from 1 to 10 via questionsPerChunk):

  1. Generate candidate questions based ONLY on the chunk text
  2. Self-Reflection Audit: "Can this question be answered 100% by this chunk?"
  3. Filter out low-confidence Q&A (threshold: 0.8)
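The final filtering step might look like the sketch below. It assumes each candidate carries the `q`, `a`, and `confidence` keys seen in the sample output, and that a score equal to the threshold is kept; the Actor's exact comparison is not documented here.

```python
def filter_grounded(candidates, threshold=0.8):
    """Keep only Q&A pairs whose self-audit confidence meets the threshold."""
    return [qa for qa in candidates if qa.get("confidence", 0.0) >= threshold]


candidates = [
    {"q": "What is included in the Enterprise plan?",
     "a": "Unlimited API calls and dedicated support",
     "confidence": 0.95},
    # Hypothetical low-confidence pair: the chunk does not state a price.
    {"q": "How much does the Enterprise plan cost?",
     "a": "Not stated in this chunk",
     "confidence": 0.40},
]
kept = filter_grounded(candidates)
```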

PII Scrubbing

Automatically detects and redacts:

  • Email addresses
  • Phone numbers
  • Social Security Numbers
  • Credit card numbers
  • Custom patterns (regex)
  • Whitelist support: values matching a whitelisted pattern (e.g., *@mycompany.com) are preserved, not redacted
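As a rough illustration of regex-based scrubbing with a whitelist, the sketch below covers emails, SSNs, and custom patterns only. The regular expressions, the `[EMAIL]`-style placeholder tokens, and the `scrub` function are assumptions for this example, not the Actor's real patterns.

```python
import re
from fnmatch import fnmatchcase

# Deliberately simple patterns; production-grade PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub(text, whitelist=(), custom_patterns=()):
    """Redact emails, SSNs, and custom regex matches, honoring a whitelist."""

    def redact_email(match):
        email = match.group(0)
        # Whitelist entries are glob patterns such as "*@mycompany.com".
        if any(fnmatchcase(email.lower(), pat.lower()) for pat in whitelist):
            return email
        return "[EMAIL]"

    text = EMAIL_RE.sub(redact_email, text)
    text = SSN_RE.sub("[SSN]", text)
    for pattern in custom_patterns:
        text = re.sub(pattern, "[REDACTED]", text)
    return text


clean = scrub(
    "Mail jo@mycompany.com or bob@gmail.com, SSN 123-45-6789.",
    whitelist=["*@mycompany.com"],
)
```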

12 Output Formats

Drop directly into your stack of choice:

Universal Formats

| Format | Description | Best For |
|---|---|---|
| raw | Universal JSON with full metadata | Any custom integration |
| csv | Spreadsheet format with Q&A columns | Google Sheets, Excel, Airtable |
| markdown | Human-readable knowledge base docs | Documentation, wikis |

RAG Framework Formats

| Format | Description | Best For |
|---|---|---|
| langchain | LangChain Document format | Python LangChain pipelines |
| llamaindex | TextNode with relationships | LlamaIndex node graphs |

Vector Database Formats

| Format | Description | Best For |
|---|---|---|
| n8n | Vector Store Node compatible JSON | n8n workflow automation |
| pinecone | Vectors with rich metadata | Managed serverless vector search |
| weaviate | Class objects with properties | GraphQL-powered semantic search |
| supabase | pgvector rows with JSONB metadata | Postgres + vector search |
| chroma | Documents with embeddings-ready format | Local/embedded vector DB |
| qdrant | Points with payload | High-performance vector search |
| milvus | Entities for collection insert | Enterprise-scale vector DB |

Quick Start

Input

{
  "urls": [
    "https://example.com/pricing",
    "https://example.com/features"
  ],
  "openaiApiKey": "sk-...",
  "outputFormat": "n8n",
  "generateQA": true,
  "questionsPerChunk": 5,
  "chunkingConfig": {
    "splitOn": ["##", "###"],
    "maxChunkSize": 2000,
    "preserveTables": true,
    "preserveCodeBlocks": true
  },
  "piiConfig": {
    "enabled": true,
    "redactEmails": true,
    "redactPhones": true,
    "whitelist": ["*@mycompany.com"]
  }
}

Output (n8n format)

{
  "documents": [
    {
      "id": "chunk_abc123",
      "text": "[Source: example.com | Page: Pricing | Section: Enterprise]\n\nThe Enterprise plan includes...",
      "metadata": {
        "source_url": "https://example.com/pricing",
        "title": "Pricing Plans",
        "section": "Enterprise",
        "parent_path": "Pricing > Enterprise",
        "word_count": 156,
        "chunk_index": 3,
        "total_chunks": 12
      },
      "questions": [
        {
          "q": "What is included in the Enterprise plan?",
          "a": "Unlimited API calls and dedicated support",
          "confidence": 0.95
        }
      ]
    }
  ],
  "summary": {
    "total_documents": 12,
    "total_questions": 48,
    "pii_redacted_count": 3,
    "processing_time_ms": 4521
  }
}
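One way to consume this output downstream is to flatten the nested Q&A pairs into rows, for example to build a retrieval evaluation set. The sketch below works on a trimmed copy of the sample above; `qa_rows` and the row keys are illustrative names, not part of the Actor.

```python
import json

# Trimmed copy of the sample n8n-format output shown above.
SAMPLE = json.loads("""
{
  "documents": [
    {
      "id": "chunk_abc123",
      "text": "The Enterprise plan includes...",
      "metadata": {
        "source_url": "https://example.com/pricing",
        "section": "Enterprise"
      },
      "questions": [
        {
          "q": "What is included in the Enterprise plan?",
          "a": "Unlimited API calls and dedicated support",
          "confidence": 0.95
        }
      ]
    }
  ]
}
""")


def qa_rows(output):
    """Flatten each document's Q&A pairs into one flat row per question."""
    rows = []
    for doc in output["documents"]:
        # "questions" may be absent when the run used generateQA: false.
        for qa in doc.get("questions", []):
            rows.append({
                "chunk_id": doc["id"],
                "source_url": doc["metadata"]["source_url"],
                "question": qa["q"],
                "answer": qa["a"],
                "confidence": qa["confidence"],
            })
    return rows


rows = qa_rows(SAMPLE)
```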

Vector Database Output Examples

Pinecone

{
  "vectors": [
    {
      "id": "chunk_abc123",
      "metadata": {
        "text": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "section": "Enterprise"
      }
    }
  ]
}

Qdrant

{
  "points": [
    {
      "id": "chunk_abc123",
      "payload": {
        "content": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "questions": [...]
      }
    }
  ]
}

Chroma

{
  "documents": ["The Enterprise plan includes..."],
  "metadatas": [{"source_url": "...", "section": "..."}],
  "ids": ["chunk_abc123"]
}

Milvus

{
  "entities": [
    {
      "id": "chunk_abc123",
      "content": "The Enterprise plan includes...",
      "metadata": {...}
    }
  ]
}
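To show how these shapes relate to one another, the sketch below reshapes a list of chunk documents (with `id`, `text`, and `metadata` keys, as in the n8n sample) into the Pinecone-style and Chroma-style layouts shown above. The helper names are hypothetical, and the field mapping is inferred from the samples rather than taken from the Actor's source.

```python
def to_chroma(documents):
    """Reshape chunk documents into the columnar Chroma-style layout."""
    return {
        "documents": [d["text"] for d in documents],
        "metadatas": [d["metadata"] for d in documents],
        "ids": [d["id"] for d in documents],
    }


def to_pinecone(documents):
    """Reshape chunk documents into Pinecone-style records.

    As in the sample above, the records carry no embedding values; those
    would have to be added before the records can be upserted.
    """
    return {
        "vectors": [
            {"id": d["id"], "metadata": {"text": d["text"], **d["metadata"]}}
            for d in documents
        ]
    }


docs = [{
    "id": "chunk_abc123",
    "text": "The Enterprise plan includes...",
    "metadata": {"source_url": "https://example.com/pricing",
                 "section": "Enterprise"},
}]
chroma_payload = to_chroma(docs)
pinecone_payload = to_pinecone(docs)
```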

Configuration Options

Chunking Config

| Option | Default | Description |
|---|---|---|
| splitOn | ["##", "###"] | Markdown header levels to split on |
| minChunkSize | 100 | Minimum characters per chunk |
| maxChunkSize | 2000 | Maximum characters per chunk |
| overlapSize | 50 | Characters to overlap between chunks |
| preserveTables | true | Keep tables as atomic units |
| preserveCodeBlocks | true | Keep code blocks as atomic units |

PII Config

| Option | Default | Description |
|---|---|---|
| enabled | true | Enable PII scrubbing |
| redactEmails | true | Redact email addresses |
| redactPhones | true | Redact phone numbers |
| redactSSN | true | Redact Social Security Numbers |
| redactCreditCards | true | Redact credit card numbers |
| whitelist | [] | Patterns to preserve (e.g., *@company.com) |
| customPatterns | [] | Custom regex patterns to redact |

Other Options

| Option | Default | Description |
|---|---|---|
| outputFormat | raw | Output format (12 options - see above) |
| generateQA | true | Generate Q&A pairs for each chunk |
| questionsPerChunk | 3 | Number of Q&A pairs per chunk (1-10) |
| stealthLevel | 2 | Anti-bot protection (1-3) |
| waitForTimeout | 30000 | Page load timeout in ms |

Note: OpenAI API key is only required when generateQA: true. Set generateQA: false for faster, cheaper runs without Q&A generation.
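Before starting a run, it can help to check the input against the documented ranges and defaults. The following is a hypothetical client-side check, assuming the defaults listed in the tables above; the Actor may validate its input differently.

```python
ALLOWED_FORMATS = {
    "raw", "csv", "markdown", "langchain", "llamaindex", "n8n",
    "pinecone", "weaviate", "supabase", "chroma", "qdrant", "milvus",
}


def validate_input(run_input):
    """Return a list of human-readable problems found in an Actor input dict."""
    errors = []

    fmt = run_input.get("outputFormat", "raw")
    if fmt not in ALLOWED_FORMATS:
        errors.append(f"outputFormat '{fmt}' is not one of the 12 formats")

    # The OpenAI key is only needed when Q&A generation is on (the default).
    if run_input.get("generateQA", True) and not run_input.get("openaiApiKey"):
        errors.append("openaiApiKey is required when generateQA is true")

    if not 1 <= run_input.get("questionsPerChunk", 3) <= 10:
        errors.append("questionsPerChunk must be between 1 and 10")

    if not 1 <= run_input.get("stealthLevel", 2) <= 3:
        errors.append("stealthLevel must be between 1 and 3")

    chunking = run_input.get("chunkingConfig", {})
    if chunking.get("minChunkSize", 100) > chunking.get("maxChunkSize", 2000):
        errors.append("minChunkSize must not exceed maxChunkSize")

    return errors
```

An empty list means the input passed every check; otherwise each entry names one option to fix.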

n8n Integration

RAG-Architect output drops directly into the n8n Vector Store Node:

[RAG-Architect Actor] → [HTTP Request] → [Vector Store Node] → [Pinecone/Weaviate/Supabase]

Example n8n Workflow

  1. HTTP Request Node: Call RAG-Architect Actor
  2. Split In Batches: Process documents in batches
  3. OpenAI Embeddings: Generate embeddings
  4. Vector Store Insert: Store in your database
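Outside n8n, the batching step of this workflow can be approximated in a few lines. This is a generic sketch; `in_batches` is an illustrative helper, and the embedding and insert calls are left as a comment because they depend on your provider and database.

```python
from itertools import islice


def in_batches(items, size):
    """Yield successive lists of at most `size` items, like Split In Batches."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


documents = [{"id": f"chunk_{i}"} for i in range(5)]
for batch in in_batches(documents, 2):
    # Embed and insert each batch here (steps 3 and 4 of the workflow).
    pass
```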

Pricing

Pay-per-use on the Apify platform (compute costs only)

| Mode | Avg Processing Time | Est. Cost |
|---|---|---|
| With Q&A (generateQA: true) | ~30s per URL | ~$0.02-0.05 per URL |
| Without Q&A (generateQA: false) | ~8s per URL | ~$0.01 per URL |
| OpenAI API (your key) | N/A | ~$0.002 per chunk |

Example: 100 URLs with Q&A, assuming about 10 chunks per URL → ~$5 Apify (100 × $0.05, the upper estimate) + ~$2 OpenAI (1,000 chunks × $0.002) = ~$7 total
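The arithmetic behind this example can be written as a small estimator. The rates come from the table above; the figure of 10 chunks per URL is an assumption implied by the ~$2 OpenAI total, and real pages will vary.

```python
def estimate_cost(urls, chunks_per_url=10, apify_per_url=0.05,
                  openai_per_chunk=0.002, generate_qa=True):
    """Return (apify, openai, total) cost estimates in USD.

    Defaults use the upper Apify estimate for runs with Q&A. For runs
    without Q&A, pass generate_qa=False and apify_per_url=0.01.
    """
    apify = urls * apify_per_url
    openai = urls * chunks_per_url * openai_per_chunk if generate_qa else 0.0
    return round(apify, 2), round(openai, 2), round(apify + openai, 2)


apify_cost, openai_cost, total = estimate_cost(100)
```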

vs. Website Content Crawler

| Feature | Website Content Crawler | RAG-Architect |
|---|---|---|
| Chunking | Fixed token count | Structure-aware (headers) |
| Tables | May split mid-row | Preserved whole |
| Context | Lost between chunks | Injected header |
| Q&A | None | AI-generated with audit |
| PII | None | Auto-scrubbed |
| Output | Raw text | Vector-store-ready JSON |

Technical Architecture

URL Input
                   ↓
┌─────────────────────────────────────┐
│ Playwright Crawler                  │
│ (Stealth Mode + Anti-Bot Evasion)   │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ Content Extraction                  │
│ (Readability.js + Metadata)         │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ Structure-Aware Chunking            │
│ • Header Splitter                   │
│ • Table Guard                       │
│ • Code Guard                        │
│ • Context Injector                  │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ Enrichment Layer                    │
│ • Q&A Generator (GPT-4o-mini)       │
│ • Self-Reflection Audit             │
│ • PII Scrubber                      │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│ Output Formatter (12)               │
│ raw | csv | markdown | langchain    │
│ llamaindex | n8n | pinecone         │
│ weaviate | supabase | chroma        │
│ qdrant | milvus                     │
└─────────────────────────────────────┘
                   ↓
Ready for AI

Use Cases

  1. AI Chatbot Knowledge Bases: Build hallucination-free chatbots
  2. Enterprise RAG Systems: Clean, compliant knowledge bases
  3. Competitive Intelligence: Extract structured intel from competitor sites
  4. Documentation Processing: Convert docs to searchable knowledge
  5. Legal/Medical Compliance: PII-scrubbed, audit-ready data

Requirements

  • Apify account
  • OpenAI API key (for Q&A generation)
  • Vector database (optional)

Support

  • Author: Jason Pellerin (AI Solutionist)
  • Issues: Report on Apify Actor page
  • Website: jasonpellerin.com

License

MIT License - Use freely for commercial and personal projects.


Built for the "Nerd" (Agency Owner or Dev) who's drowning in "Data Debt." RAG-Architect: The cleanroom for AI data.