Pricing

from $5.00 / 1,000 knowledge chunks

Rag Architect

Transform any website into vector-store-ready knowledge chunks for Pinecone, Weaviate, LangChain, LlamaIndex, Supabase, n8n & more. AI-generated Q&A pairs, smart chunking, PII scrubbing. Build hallucination-free RAG chatbots in minutes.

Developer

Jason Pellerin

RAG-Architect: Automated Knowledge Engineering Factory

Transform raw web content into high-fidelity, vector-store-ready knowledge chunks with AI-generated Q&A pairs, structure-aware chunking, and PII scrubbing. The cleanroom for AI data.

Why RAG-Architect?

Most AI projects fail not because the LLM is "dumb," but because the Knowledge Base is garbage.

The Problems:

Table Shredder: Fixed-token chunking shreds tables mid-row, confusing your AI
Context Blindness: Chunks lose their parent context ("The fee is $50" without knowing "For Florida Residents Only")
Metadata Rot: Old and new policies on the same site confuse the AI
Synthetic Hallucinations: LLM-generated Q&A without grounding checks produces questions the source text cannot answer

RAG-Architect solves all of this.

Features

Structure-Aware Chunking

Splits on Markdown headers (#, ##, ###) rather than at fixed token counts; the default splitOn setting uses ## and ###
Table Guard: Keeps tables whole, never splits mid-row
Code Guard: Preserves code blocks as atomic units
Configurable min/max chunk size with overlap
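
As a rough illustration of header-based splitting with a code guard, here is a minimal Python sketch. It is not the Actor's implementation: the function name `split_by_headers`, the returned dict shape, and the fence-tracking approach are assumptions made for this example, and min/max chunk size and overlap are left out.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def split_by_headers(markdown, split_on=("##", "###")):
    """Split Markdown into sections at the given header levels.

    Lines inside fenced code blocks are never treated as headers, so a
    code block always stays within one section. Tables are kept whole
    because the text is only ever cut at header lines.
    """
    levels = {len(marker) for marker in split_on}
    sections = []
    title, body = None, []
    in_fence = False

    def flush():
        # Emit the current section unless it is completely empty.
        if title is not None or any(line.strip() for line in body):
            sections.append({"section": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match and len(match.group(1)) in levels:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Content that appears before the first matching header is returned as a section whose `section` value is `None`.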

Parent-Child Context Injection

Every chunk gets a context header:

[Source: example.com | Page: Pricing Plans | Section: Enterprise Tier | Updated: 2025-01-15]

The Enterprise plan includes unlimited API calls, dedicated support...

Your AI never loses its place.
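
The header format shown above can be reproduced with a few lines of Python. This is only a sketch: the helper names `context_header` and `inject_context` are hypothetical, and the choice to omit the Updated field when no date is known is an assumption based on the Quick Start sample output.

```python
def context_header(source, page, section, updated=None):
    """Build a '[Source: ... | Page: ... | Section: ...]' header line."""
    parts = [f"Source: {source}", f"Page: {page}", f"Section: {section}"]
    if updated:
        # Assumption: the date is simply left out when it is unknown.
        parts.append(f"Updated: {updated}")
    return "[" + " | ".join(parts) + "]"


def inject_context(chunk_text, source, page, section, updated=None):
    """Prefix a chunk with its context header, separated by a blank line."""
    return context_header(source, page, section, updated) + "\n\n" + chunk_text
```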

Ground Truth Q&A Generator

For every chunk, generates "battle-tested" questions using GPT-4o-mini (typically 3-5; the questionsPerChunk option defaults to 3 and accepts 1-10):

Generate candidate questions based ONLY on the chunk text
Self-Reflection Audit: "Can this question be answered 100% by this chunk?"
Filter out low-confidence Q&A (threshold: 0.8)
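
The final filtering step can be pictured as follows. This sketch assumes each Q&A pair carries a numeric `confidence` field, as in the sample output, and that the 0.8 threshold is inclusive; the function name `filter_grounded` is hypothetical, and the generation and audit calls to the model are not shown.

```python
def filter_grounded(qa_pairs, threshold=0.8):
    """Keep only Q&A pairs whose audit confidence meets the threshold.

    Pairs without a confidence score are treated as ungrounded and dropped.
    """
    return [qa for qa in qa_pairs if qa.get("confidence", 0.0) >= threshold]
```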

PII Scrubbing

Automatically detects and redacts:

Email addresses
Phone numbers
Social Security Numbers
Credit card numbers
Custom patterns (regex)
Whitelist support: addresses matching a whitelisted pattern (e.g., *@mycompany.com) are preserved instead of redacted
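
A simplified picture of regex-based scrubbing with an email whitelist is sketched below. The patterns, the placeholder tokens (`[EMAIL]`, `[PHONE]`, `[SSN]`), and the function name `scrub_pii` are assumptions for illustration and are deliberately simpler than what a production scrubber needs; credit card detection and custom patterns are omitted.

```python
import re
from fnmatch import fnmatchcase

# Deliberately simple patterns; real-world PII detection needs more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def scrub_pii(text, whitelist=()):
    """Redact emails, SSNs, and US-style phone numbers from text.

    Emails matching a whitelist glob such as '*@mycompany.com' are kept.
    """
    def redact_email(match):
        email = match.group(0)
        if any(fnmatchcase(email.lower(), pattern.lower()) for pattern in whitelist):
            return email
        return "[EMAIL]"

    text = EMAIL_RE.sub(redact_email, text)
    text = SSN_RE.sub("[SSN]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```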

12 Output Formats

Drop directly into your stack of choice:

Universal Formats

Format	Description	Best For
raw	Universal JSON with full metadata	Any custom integration
csv	Spreadsheet format with Q&A columns	Google Sheets, Excel, Airtable
markdown	Human-readable knowledge base docs	Documentation, wikis

RAG Framework Formats

Format	Description	Best For
langchain	LangChain Document format	Python LangChain pipelines
llamaindex	TextNode with relationships	LlamaIndex node graphs

Vector Database Formats

Format	Description	Best For
n8n	Vector Store Node compatible JSON	n8n workflow automation
pinecone	Vectors with rich metadata	Managed serverless vector search
weaviate	Class objects with properties	GraphQL-powered semantic search
supabase	pgvector rows with JSONB metadata	Postgres + vector search
chroma	Documents with embeddings-ready format	Local/embedded vector DB
qdrant	Points with payload	High-performance vector search
milvus	Entities for collection insert	Enterprise-scale vector DB

Quick Start

Input

{
  "urls": [
    "https://example.com/pricing",
    "https://example.com/features"
  ],
  "openaiApiKey": "sk-...",
  "outputFormat": "n8n",
  "generateQA": true,
  "questionsPerChunk": 5,
  "chunkingConfig": {
    "splitOn": ["##", "###"],
    "maxChunkSize": 2000,
    "preserveTables": true,
    "preserveCodeBlocks": true
  },
  "piiConfig": {
    "enabled": true,
    "redactEmails": true,
    "redactPhones": true,
    "whitelist": ["*@mycompany.com"]
  }
}

Output (n8n format)

{
  "documents": [
    {
      "id": "chunk_abc123",
      "text": "[Source: example.com | Page: Pricing | Section: Enterprise]\n\nThe Enterprise plan includes...",
      "metadata": {
        "source_url": "https://example.com/pricing",
        "title": "Pricing Plans",
        "section": "Enterprise",
        "parent_path": "Pricing > Enterprise",
        "word_count": 156,
        "chunk_index": 3,
        "total_chunks": 12
      },
      "questions": [
        {
          "q": "What is included in the Enterprise plan?",
          "a": "Unlimited API calls and dedicated support",
          "confidence": 0.95
        }
      ]
    }
  ],
  "summary": {
    "total_documents": 12,
    "total_questions": 48,
    "pii_redacted_count": 3,
    "processing_time_ms": 4521
  }
}
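
One way to consume this output, for example to build an evaluation set, is to flatten the nested Q&A pairs into rows. The sketch below relies only on the field names shown in the sample above; the function name `iter_qa_rows` and the row layout are hypothetical.

```python
def iter_qa_rows(output):
    """Yield one flat row per Q&A pair in an n8n-format result."""
    for doc in output.get("documents", []):
        for qa in doc.get("questions", []):
            yield {
                "chunk_id": doc["id"],
                "source_url": doc["metadata"]["source_url"],
                "question": qa["q"],
                "answer": qa["a"],
                "confidence": qa["confidence"],
            }
```

Documents without a `questions` list (for instance when generateQA is false) simply produce no rows.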

Vector Database Formats

Pinecone

{
  "vectors": [
    {
      "id": "chunk_abc123",
      "metadata": {
        "text": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "section": "Enterprise"
      }
    }
  ]
}

Qdrant

{
  "points": [
    {
      "id": "chunk_abc123",
      "payload": {
        "content": "The Enterprise plan includes...",
        "source_url": "https://example.com/pricing",
        "questions": [...]
      }
    }
  ]
}

Chroma

{
  "documents": ["The Enterprise plan includes..."],
  "metadatas": [{"source_url": "...", "section": "..."}],
  "ids": ["chunk_abc123"]
}

Milvus

{
  "entities": [
    {
      "id": "chunk_abc123",
      "content": "The Enterprise plan includes...",
      "metadata": {...}
    }
  ]
}
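
To show how these shapes relate to one another, the sketch below converts documents in the n8n-style layout from the Quick Start into the Chroma-style parallel lists. The field mapping is inferred from the samples in this README and is not the Actor's own formatter; the function name `to_chroma` is hypothetical.

```python
def to_chroma(documents):
    """Convert n8n-style documents into Chroma-style parallel lists."""
    return {
        "documents": [doc["text"] for doc in documents],
        "metadatas": [
            {
                "source_url": doc["metadata"]["source_url"],
                "section": doc["metadata"]["section"],
            }
            for doc in documents
        ],
        "ids": [doc["id"] for doc in documents],
    }
```

The three lists stay index-aligned, so entry i of `documents`, `metadatas`, and `ids` always describes the same chunk.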

Configuration Options

Chunking Config

Option	Default	Description
`splitOn`	`["##", "###"]`	Markdown header levels to split on
`minChunkSize`	`100`	Minimum characters per chunk
`maxChunkSize`	`2000`	Maximum characters per chunk
`overlapSize`	`50`	Characters to overlap between chunks
`preserveTables`	`true`	Keep tables as atomic units
`preserveCodeBlocks`	`true`	Keep code blocks as atomic units

PII Config

Option	Default	Description
`enabled`	`true`	Enable PII scrubbing
`redactEmails`	`true`	Redact email addresses
`redactPhones`	`true`	Redact phone numbers
`redactSSN`	`true`	Redact Social Security Numbers
`redactCreditCards`	`true`	Redact credit card numbers
`whitelist`	`[]`	Patterns to preserve (e.g., `*@company.com`)
`customPatterns`	`[]`	Custom regex patterns to redact

Other Options

Option	Default	Description
`outputFormat`	`raw`	Output format (12 options - see above)
`generateQA`	`true`	Generate Q&A pairs for each chunk
`questionsPerChunk`	`3`	Number of Q&A pairs per chunk (1-10)
`stealthLevel`	`2`	Anti-bot protection (1-3)
`waitForTimeout`	`30000`	Page load timeout in ms

Note: An OpenAI API key is required only when generateQA is true. Set generateQA to false for faster, cheaper runs without Q&A generation.
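
The defaults in the tables above can be collected into a small helper that assembles a run input and enforces the API key rule from the note. This is a sketch under assumptions: the helper name `build_run_input` and its merge behaviour are hypothetical, and whether the Actor itself fills in missing options the same way is not documented here.

```python
import copy

# Defaults copied from the configuration tables in this README.
DEFAULTS = {
    "outputFormat": "raw",
    "generateQA": True,
    "questionsPerChunk": 3,
    "stealthLevel": 2,
    "waitForTimeout": 30000,
    "chunkingConfig": {
        "splitOn": ["##", "###"],
        "minChunkSize": 100,
        "maxChunkSize": 2000,
        "overlapSize": 50,
        "preserveTables": True,
        "preserveCodeBlocks": True,
    },
    "piiConfig": {
        "enabled": True,
        "redactEmails": True,
        "redactPhones": True,
        "redactSSN": True,
        "redactCreditCards": True,
        "whitelist": [],
        "customPatterns": [],
    },
}


def build_run_input(urls, openai_api_key=None, **overrides):
    """Merge overrides into the documented defaults and validate them."""
    run_input = copy.deepcopy(DEFAULTS)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(run_input.get(key), dict):
            run_input[key].update(value)  # shallow merge of nested configs
        else:
            run_input[key] = value
    run_input["urls"] = list(urls)
    if not 1 <= run_input["questionsPerChunk"] <= 10:
        raise ValueError("questionsPerChunk must be between 1 and 10")
    if run_input["generateQA"]:
        if not openai_api_key:
            raise ValueError("openaiApiKey is required when generateQA is true")
        run_input["openaiApiKey"] = openai_api_key
    return run_input
```

The resulting dict could then be passed as the run input when starting the Actor, for example through the `run_input` argument of the Apify Python client's Actor `call` method; the Actor ID to use is not given in this README.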

n8n Integration

RAG-Architect output drops directly into the n8n Vector Store Node:

[RAG-Architect Actor] → [HTTP Request] → [Vector Store Node] → [Pinecone/Weaviate/Supabase]

Example n8n Workflow

HTTP Request Node: Call RAG-Architect Actor
Split In Batches: Process documents in batches
OpenAI Embeddings: Generate embeddings
Vector Store Insert: Store in your database
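
For readers who prefer code to workflow nodes, the batching, embedding, and insert steps can be approximated in Python as below. The `embed` and `store` callables are hypothetical stand-ins for an embeddings API and a vector store client, and the record layout (`id`, `values`, `metadata`) is an assumption modelled on the Pinecone sample earlier in this README.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest(documents, embed, store, batch_size=100):
    """Embed documents batch by batch and hand each batch to the store.

    embed: callable taking a list of texts, returning one vector per text.
    store: callable taking a list of records to insert.
    Returns the number of documents stored.
    """
    stored = 0
    for batch in batched(documents, batch_size):
        vectors = embed([doc["text"] for doc in batch])
        store([
            {"id": doc["id"], "values": vector, "metadata": doc["metadata"]}
            for doc, vector in zip(batch, vectors)
        ])
        stored += len(batch)
    return stored
```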

Pricing

Pay-per-use on Apify platform (compute costs only)

Mode	Avg Processing Time	Est. Cost
With Q&A (generateQA: true)	~30s per URL	~$0.02-0.05 per URL
Without Q&A (generateQA: false)	~8s per URL	~$0.01 per URL
OpenAI API (your key)	N/A	~$0.002 per chunk

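Example: 100 URLs with Q&A → ~$5 Apify (at the upper estimate of ~$0.05 per URL) + ~$2 OpenAI (about 1,000 chunks at ~$0.002 each, i.e. roughly 10 chunks per URL) = ~$7 total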
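
The arithmetic behind that example can be checked with a short calculation. The per-URL and per-chunk rates come from the table above; the figure of 10 chunks per URL is an assumption inferred from the ~$2 OpenAI estimate, and the function name `estimate_cost` is hypothetical.

```python
def estimate_cost(urls, chunks_per_url, apify_per_url=0.05,
                  openai_per_chunk=0.002, generate_qa=True):
    """Return (apify, openai, total) cost estimates in USD.

    Defaults use the upper Apify estimate for runs with Q&A enabled.
    """
    apify = urls * apify_per_url
    openai = urls * chunks_per_url * openai_per_chunk if generate_qa else 0.0
    return round(apify, 2), round(openai, 2), round(apify + openai, 2)


# 100 URLs, assuming about 10 chunks per URL, with Q&A enabled
print(estimate_cost(100, 10))
```

For a run without Q&A, pass `generate_qa=False` together with the lower per-URL rate, for example `estimate_cost(100, 10, apify_per_url=0.01, generate_qa=False)`.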

vs. Website Content Crawler

Feature	Website Content Crawler	RAG-Architect
Chunking	Fixed token count	Structure-aware (headers)
Tables	May split mid-row	Preserved whole
Context	Lost between chunks	Injected header
Q&A	None	AI-generated with audit
PII	None	Auto-scrubbed
Output	Raw text	Vector-store-ready JSON

Technical Architecture

URL Input
    ↓
┌─────────────────────────────────────┐
│        Playwright Crawler           │
│  (Stealth Mode + Anti-Bot Evasion)  │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│        Content Extraction           │
│  (Readability.js + Metadata)        │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│      Structure-Aware Chunking       │
│  • Header Splitter                  │
│  • Table Guard                      │
│  • Code Guard                       │
│  • Context Injector                 │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│          Enrichment Layer           │
│  • Q&A Generator (GPT-4o-mini)      │
│  • Self-Reflection Audit            │
│  • PII Scrubber                     │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│        Output Formatter (12)        │
│  raw | csv | markdown | langchain   │
│  llamaindex | n8n | pinecone        │
│  weaviate | supabase | chroma       │
│  qdrant | milvus                    │
└─────────────────────────────────────┘
    ↓
Ready for AI

Use Cases

AI Chatbot Knowledge Bases: Build hallucination-free chatbots
Enterprise RAG Systems: Clean, compliant knowledge bases
Competitive Intelligence: Extract structured intel from competitor sites
Documentation Processing: Convert docs to searchable knowledge
Legal/Medical Compliance: PII-scrubbed, audit-ready data

Requirements

Apify account
OpenAI API key (for Q&A generation)
Vector database (optional)

Support

Author: Jason Pellerin (AI Solutionist)
Issues: Report on Apify Actor page
Website: jasonpellerin.com

License

MIT License - Use freely for commercial and personal projects.

Built for the "Nerd" (Agency Owner or Dev) who's drowning in "Data Debt." RAG-Architect: The cleanroom for AI data.
