Rag Architect
Pricing
from $5.00 / 1,000 knowledge chunks
Transform any website into vector-store-ready knowledge chunks for Pinecone, Weaviate, LangChain, LlamaIndex, Supabase, n8n & more. AI-generated Q&A pairs, smart chunking, PII scrubbing. Build hallucination-free RAG chatbots in minutes.
Developer

Jason Pellerin
RAG-Architect: Automated Knowledge Engineering Factory
Transform raw web content into high-fidelity, vector-store-ready knowledge chunks with AI-generated Q&A pairs, structure-aware chunking, and PII scrubbing. The cleanroom for AI data.
Why RAG-Architect?
Most AI projects fail not because the LLM is "dumb," but because the Knowledge Base is garbage.
The Problems:
- Table Shredder: Fixed-token chunking shreds tables mid-row, confusing your AI
- Context Blindness: Chunks lose their parent context ("The fee is $50" without knowing "For Florida Residents Only")
- Metadata Rot: Old and new policies on the same site confuse the AI
- Synthetic Hallucinations: LLM-generated Q&A without grounding checks produces questions the source text cannot answer
RAG-Architect solves all of this.
Features
Structure-Aware Chunking
- Splits by Markdown headers (#, ##, ###), not fixed tokens
- Table Guard: Keeps tables whole, never splits mid-row
- Code Guard: Preserves code blocks as atomic units
- Configurable min/max chunk size with overlap
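The header-based splitting described above can be sketched as follows. This is a minimal illustration that assumes Markdown input; the name `split_by_headers` and its return shape are hypothetical and not taken from the Actor's code. Because it cuts only at the chosen header levels and ignores `#` lines inside fenced code, tables and code blocks stay whole. It does not enforce min/max sizes or overlap.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_by_headers(markdown, split_on=("##", "###")):
    """Return a list of (heading, body) sections cut at the given header levels.

    Hypothetical helper: lines inside fenced code blocks are never treated
    as headers, so a code block always stays within one section.
    """
    sections = []
    heading, body = None, []
    in_fence = False

    def flush():
        # Emit the current section unless it is an empty preamble.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match and match.group(1) in split_on:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Text before the first matching header is returned with a heading of `None`, so nothing is silently dropped.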
Parent-Child Context Injection
Every chunk gets a context header:
[Source: example.com | Page: Pricing Plans | Section: Enterprise Tier | Updated: 2025-01-15]
The Enterprise plan includes unlimited API calls, dedicated support...
Your AI never loses its place.
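As a rough sketch, a context header in the format shown above could be built like this. The function name `inject_context` and its parameters are assumptions for illustration; the blank line between header and text follows the n8n output sample in the Quick Start.

```python
def inject_context(chunk_text, source, page, section, updated=None):
    """Prefix a chunk with a bracketed context header (hypothetical helper)."""
    fields = [f"Source: {source}", f"Page: {page}", f"Section: {section}"]
    if updated:
        # The Updated field is optional; omit it when no date is known.
        fields.append(f"Updated: {updated}")
    return f"[{' | '.join(fields)}]\n\n{chunk_text}"
```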
Ground Truth Q&A Generator
For every chunk, generates 3-5 "battle-tested" questions using GPT-4o-mini:
- Generate candidate questions based ONLY on the chunk text
- Self-Reflection Audit: "Can this question be answered 100% by this chunk?"
- Filter out low-confidence Q&A (threshold: 0.8)
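The final filtering step can be sketched as below. It assumes each Q&A pair is a dict with the `q`, `a`, and `confidence` keys seen in the sample output, and that the 0.8 threshold is inclusive; both the helper name and the inclusive comparison are assumptions, and the LLM calls that produce the scores are out of scope here.

```python
def filter_grounded(qa_pairs, threshold=0.8):
    """Keep only Q&A pairs whose audit confidence meets the threshold.

    Hypothetical helper: pairs without a confidence score are treated as
    ungrounded and dropped.
    """
    return [qa for qa in qa_pairs if qa.get("confidence", 0.0) >= threshold]
```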
PII Scrubbing
Automatically detects and redacts:
- Email addresses
- Phone numbers
- Social Security Numbers
- Credit card numbers
- Custom patterns (regex)
- Whitelist support for domains to preserve
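A simplified sketch of regex-based redaction with a whitelist is shown below. The patterns, the `[EMAIL]` and `[SSN]` placeholder tokens, and the `scrub` function are illustrative assumptions, not the Actor's actual rules; real PII detection usually needs broader patterns than these two.

```python
import re
from fnmatch import fnmatch

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub(text, whitelist=()):
    """Redact emails and SSNs; return (scrubbed_text, redaction_count).

    Hypothetical helper: emails matching a whitelist glob such as
    "*@mycompany.com" are preserved.
    """
    redacted = 0

    def replace_email(match):
        nonlocal redacted
        email = match.group(0)
        if any(fnmatch(email.lower(), pattern.lower()) for pattern in whitelist):
            return email  # whitelisted, keep as-is
        redacted += 1
        return "[EMAIL]"

    text = EMAIL_RE.sub(replace_email, text)
    text, ssn_count = SSN_RE.subn("[SSN]", text)
    return text, redacted + ssn_count
```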
12 Output Formats
Drop directly into your stack of choice:
Universal Formats
| Format | Description | Best For |
|---|---|---|
| raw | Universal JSON with full metadata | Any custom integration |
| csv | Spreadsheet format with Q&A columns | Google Sheets, Excel, Airtable |
| markdown | Human-readable knowledge base docs | Documentation, wikis |
RAG Framework Formats
| Format | Description | Best For |
|---|---|---|
| langchain | LangChain Document format | Python LangChain pipelines |
| llamaindex | TextNode with relationships | LlamaIndex node graphs |
Vector Database Formats
| Format | Description | Best For |
|---|---|---|
| n8n | Vector Store Node compatible JSON | n8n workflow automation |
| pinecone | Vectors with rich metadata | Managed serverless vector search |
| weaviate | Class objects with properties | GraphQL-powered semantic search |
| supabase | pgvector rows with JSONB metadata | Postgres + vector search |
| chroma | Documents with embeddings-ready format | Local/embedded vector DB |
| qdrant | Points with payload | High-performance vector search |
| milvus | Entities for collection insert | Enterprise-scale vector DB |
Quick Start
Input
{
  "urls": [
    "https://example.com/pricing",
    "https://example.com/features"
  ],
  "openaiApiKey": "sk-...",
  "outputFormat": "n8n",
  "generateQA": true,
  "questionsPerChunk": 5,
  "chunkingConfig": {
    "splitOn": ["##", "###"],
    "maxChunkSize": 2000,
    "preserveTables": true,
    "preserveCodeBlocks": true
  },
  "piiConfig": {
    "enabled": true,
    "redactEmails": true,
    "redactPhones": true,
    "whitelist": ["*@mycompany.com"]
  }
}
Output (n8n format)
{
  "documents": [
    {
      "id": "chunk_abc123",
      "text": "[Source: example.com | Page: Pricing | Section: Enterprise]\n\nThe Enterprise plan includes...",
      "metadata": {
        "source_url": "https://example.com/pricing",
        "title": "Pricing Plans",
        "section": "Enterprise",
        "parent_path": "Pricing > Enterprise",
        "word_count": 156,
        "chunk_index": 3,
        "total_chunks": 12
      },
      "questions": [
        {
          "q": "What is included in the Enterprise plan?",
          "a": "Unlimited API calls and dedicated support",
          "confidence": 0.95
        }
      ]
    }
  ],
  "summary": {
    "total_documents": 12,
    "total_questions": 48,
    "pii_redacted_count": 3,
    "processing_time_ms": 4521
  }
}
Vector Database Formats
Pinecone
{
  "vectors": [
    {
      "id": "chunk_abc123",
      "metadata": {
        "text": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "section": "Enterprise"
      }
    }
  ]
}
Qdrant
{
  "points": [
    {
      "id": "chunk_abc123",
      "payload": {
        "content": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "questions": [...]
      }
    }
  ]
}
Chroma
{
  "documents": ["The Enterprise plan includes..."],
  "metadatas": [{"source_url": "...", "section": "..."}],
  "ids": ["chunk_abc123"]
}
Milvus
{
  "entities": [
    {
      "id": "chunk_abc123",
      "content": "The Enterprise plan includes...",
      "metadata": {...}
    }
  ]
}
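To show how these shapes relate, here is a small sketch that reshapes a list of documents (with the `id`, `text`, and `metadata` keys from the n8n sample) into the parallel-list layout of the Chroma sample. The function `to_chroma` is hypothetical, and whether the Actor keeps or strips the context header in the Chroma text is not stated, so this sketch passes the text through unchanged.

```python
def to_chroma(documents):
    """Reshape per-document dicts into Chroma-style parallel lists.

    Hypothetical helper: the three lists share the same order, so index i
    in each list refers to the same chunk.
    """
    return {
        "documents": [doc["text"] for doc in documents],
        "metadatas": [doc.get("metadata", {}) for doc in documents],
        "ids": [doc["id"] for doc in documents],
    }
```

The resulting keys line up with the `documents`, `metadatas`, and `ids` keyword arguments of Chroma's `collection.add`, so the dict could likely be unpacked into that call directly.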
Configuration Options
Chunking Config
| Option | Default | Description |
|---|---|---|
| splitOn | ["##", "###"] | Markdown header levels to split on |
| minChunkSize | 100 | Minimum characters per chunk |
| maxChunkSize | 2000 | Maximum characters per chunk |
| overlapSize | 50 | Characters to overlap between chunks |
| preserveTables | true | Keep tables as atomic units |
| preserveCodeBlocks | true | Keep code blocks as atomic units |
PII Config
| Option | Default | Description |
|---|---|---|
| enabled | true | Enable PII scrubbing |
| redactEmails | true | Redact email addresses |
| redactPhones | true | Redact phone numbers |
| redactSSN | true | Redact Social Security Numbers |
| redactCreditCards | true | Redact credit card numbers |
| whitelist | [] | Patterns to preserve (e.g., *@company.com) |
| customPatterns | [] | Custom regex patterns to redact |
Other Options
| Option | Default | Description |
|---|---|---|
| outputFormat | raw | Output format (12 options - see above) |
| generateQA | true | Generate Q&A pairs for each chunk |
| questionsPerChunk | 3 | Number of Q&A pairs per chunk (1-10) |
| stealthLevel | 2 | Anti-bot protection (1-3) |
| waitForTimeout | 30000 | Page load timeout in ms |
Note: OpenAI API key is only required when generateQA: true. Set generateQA: false for faster, cheaper runs without Q&A generation.
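The options above can be assembled and sanity-checked before a run. The sketch below uses the documented option names, defaults, and ranges; the `build_input` helper and its error messages are hypothetical, and the Actor may validate input differently.

```python
DEFAULTS = {
    "outputFormat": "raw",
    "generateQA": True,
    "questionsPerChunk": 3,
    "stealthLevel": 2,
    "waitForTimeout": 30000,
}
FORMATS = {
    "raw", "csv", "markdown", "langchain", "llamaindex", "n8n",
    "pinecone", "weaviate", "supabase", "chroma", "qdrant", "milvus",
}


def build_input(urls, **overrides):
    """Merge overrides onto the documented defaults and check basic rules."""
    run_input = {"urls": list(urls), **DEFAULTS, **overrides}
    if run_input["outputFormat"] not in FORMATS:
        raise ValueError(f"unknown outputFormat: {run_input['outputFormat']}")
    if not 1 <= run_input["questionsPerChunk"] <= 10:
        raise ValueError("questionsPerChunk must be between 1 and 10")
    if not 1 <= run_input["stealthLevel"] <= 3:
        raise ValueError("stealthLevel must be between 1 and 3")
    # Per the note above, the key is only needed when Q&A generation is on.
    if run_input["generateQA"] and not run_input.get("openaiApiKey"):
        raise ValueError("openaiApiKey is required when generateQA is true")
    return run_input
```

For example, `build_input(["https://example.com/pricing"], generateQA=False, outputFormat="n8n")` yields a valid input without an OpenAI key.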
n8n Integration
RAG-Architect output drops directly into the n8n Vector Store Node:
[RAG-Architect Actor] → [HTTP Request] → [Vector Store Node] → [Pinecone/Weaviate/Supabase]
Example n8n Workflow
- HTTP Request Node: Call RAG-Architect Actor
- Split In Batches: Process documents in batches
- OpenAI Embeddings: Generate embeddings
- Vector Store Insert: Store in your database
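Outside n8n, the batching step of this workflow can be approximated in a few lines. This sketch only covers splitting documents into fixed-size batches; the helper name `batched` and the batch size are assumptions, and the embedding and insert steps depend on your chosen services.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```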
Pricing
Pay-per-use on the Apify platform (compute costs only)
| Mode | Avg Processing Time | Est. Cost |
|---|---|---|
| With Q&A (generateQA: true) | ~30s per URL | ~$0.02-0.05 per URL |
| Without Q&A (generateQA: false) | ~8s per URL | ~$0.01 per URL |
| OpenAI API (your key) | N/A | ~$0.002 per chunk |
Example: 100 URLs with Q&A → ~$5 Apify + ~$2 OpenAI = ~$7 total
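The arithmetic behind that example can be reproduced with a short estimator. It assumes the upper-bound rate of $0.05 per URL from the table and about 10 chunks per URL, which is the figure implied by ~$2 of OpenAI spend at ~$0.002 per chunk across 100 URLs; the function name and the per-URL chunk count are assumptions, and actual costs will vary with page size.

```python
def estimate_cost(urls, chunks_per_url, apify_per_url=0.05, openai_per_chunk=0.002):
    """Return (apify_usd, openai_usd, total_usd), each rounded to cents."""
    apify = urls * apify_per_url
    openai = urls * chunks_per_url * openai_per_chunk
    return round(apify, 2), round(openai, 2), round(apify + openai, 2)


apify_usd, openai_usd, total_usd = estimate_cost(urls=100, chunks_per_url=10)
print(f"Apify ~${apify_usd}, OpenAI ~${openai_usd}, total ~${total_usd}")
```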
vs. Website Content Crawler
| Feature | Website Content Crawler | RAG-Architect |
|---|---|---|
| Chunking | Fixed token count | Structure-aware (headers) |
| Tables | May split mid-row | Preserved whole |
| Context | Lost between chunks | Injected header |
| Q&A | None | AI-generated with audit |
| PII | None | Auto-scrubbed |
| Output | Raw text | Vector-store-ready JSON |
Technical Architecture
               URL Input
                   ↓
┌─────────────────────────────────────┐
│         Playwright Crawler          │
│  (Stealth Mode + Anti-Bot Evasion)  │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│         Content Extraction          │
│     (Readability.js + Metadata)     │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│      Structure-Aware Chunking       │
│  • Header Splitter                  │
│  • Table Guard                      │
│  • Code Guard                       │
│  • Context Injector                 │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│          Enrichment Layer           │
│  • Q&A Generator (GPT-4o-mini)      │
│  • Self-Reflection Audit            │
│  • PII Scrubber                     │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│        Output Formatter (12)        │
│  raw | csv | markdown | langchain   │
│  llamaindex | n8n | pinecone        │
│  weaviate | supabase | chroma       │
│  qdrant | milvus                    │
└─────────────────────────────────────┘
                   ↓
              Ready for AI
Use Cases
- AI Chatbot Knowledge Bases: Build hallucination-free chatbots
- Enterprise RAG Systems: Clean, compliant knowledge bases
- Competitive Intelligence: Extract structured intel from competitor sites
- Documentation Processing: Convert docs to searchable knowledge
- Legal/Medical Compliance: PII-scrubbed, audit-ready data
Requirements
- Apify account
- OpenAI API key (for Q&A generation)
- Vector database (optional)
Support
- Author: Jason Pellerin (AI Solutionist)
- Issues: Report on Apify Actor page
- Website: jasonpellerin.com
License
MIT License - Use freely for commercial and personal projects.
Built for the "Nerd" (Agency Owner or Dev) who's drowning in "Data Debt." RAG-Architect: The cleanroom for AI data.