RAG Doctor: Audit & Repair Your AI Knowledge Base
Pricing
Pay per usage
Audit and repair the content you feed your AI. Finds contradictions, stale facts, duplicates, dead links, and broken chunks that quietly poison RAG, agents, and custom GPTs. Returns a scored report, a prioritized fix list, and a cleaned, ready-to-index knowledge base.
Developer
Sanya Kumari
Maintained by Community
RAG Doctor — Knowledge Base Health Check & Repair for AI
Your AI is only as good as the content you feed it. RAG Doctor audits that content the way a linter audits code: it finds the contradictions, stale facts, duplicates, and broken chunks that quietly poison RAG pipelines, agents, and custom GPTs, then hands you a prioritized fix list and an optional cleaned-up version.
Most tools build a knowledge base for you. RAG Doctor fixes the one you already have.
Why this exists
Garbage in, confident garbage out. When two pages disagree, the retriever returns whichever one happens to rank higher for the query, and the model states it as fact. When a chunk reads "as shown above," retrieval pulls it in alone and the model fills the gap by guessing. These defects stay invisible until a user gets a wrong answer. RAG Doctor surfaces them before your users do.
What it checks
| Check | What it catches | Needs API key |
|---|---|---|
| Contradictions | Two pages stating facts that can't both be true (the #1 silent RAG killer) | Yes |
| Stale facts | Pages whose newest referenced date is past your freshness threshold | No |
| Duplicates | Near-identical pages that crowd out distinct facts at retrieval time | No |
| Chunk health | Chunks that lose meaning when retrieved alone (dangling references, orphan pronouns, too short) | No |
| Dead links | Cited URLs that 404 or time out | No |
| AI extractability | robots.txt blocking AI crawlers, missing sitemap, JavaScript-only content | No |
| Coverage gaps | Real user questions the knowledge base cannot answer | Yes |
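To make the chunk health row concrete, here is a minimal sketch of how dangling references, orphan pronouns, and too-short chunks could be flagged. The regular expressions, the `MIN_WORDS` cutoff, and the issue names are all assumptions chosen for illustration; they are not the Actor's actual rules.

```python
import re

# Hypothetical heuristics for illustration only; the Actor's real rules may differ.
DANGLING_REFERENCE = re.compile(
    r"\b(as (shown|described|mentioned|noted) (above|below|earlier)"
    r"|see (above|below)"
    r"|the (previous|following) section)\b",
    re.IGNORECASE,
)
ORPHAN_PRONOUN_START = re.compile(r"^(it|this|that|these|those|they)\b", re.IGNORECASE)
MIN_WORDS = 20  # assumed cutoff for "too short"


def chunk_issues(chunk):
    """Return the names of the health problems found in a single chunk."""
    text = chunk.strip()
    issues = []
    if DANGLING_REFERENCE.search(text):
        issues.append("dangling_reference")  # points at content outside the chunk
    if ORPHAN_PRONOUN_START.match(text):
        issues.append("orphan_pronoun")  # opens with a pronoun that has no antecedent
    if len(text.split()) < MIN_WORDS:
        issues.append("too_short")
    return issues


print(chunk_issues("As shown above, set the flag to true."))
```

A chunk that names its subject and stands on its own returns an empty list, which is the property retrieval needs.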
Input
Point it at a site to crawl, or hand it a dataset you already extracted.
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "maxCrawlDepth": 2,
  "mode": "audit",
  "checks": ["staleness", "duplicates", "chunkHealth", "deadLinks", "extractability", "contradictions"],
  "stalenessThresholdDays": 540,
  "similarityThreshold": 0.85,
  "userQuestions": ["How do I rotate my API token?"],
  "anthropicApiKey": "sk-ant-...",
  "llmModel": "claude-haiku-4-5-20251001"
}
Crawling and link checks run over the Apify datacenter proxy automatically; there is no proxy option to configure.
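The two numeric thresholds in the input are easier to tune with a picture of what they might measure. The sketch below assumes that staleness compares the newest ISO-formatted date on a page against `stalenessThresholdDays`, and that `similarityThreshold` is a Jaccard score over three-word shingles. Both are assumptions for illustration; the Actor may parse other date formats and use a different similarity measure.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")  # ISO dates only, for brevity


def newest_date(text):
    """Return the most recent ISO date mentioned in the text, or None."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2024-13-40
    return max(found, default=None)


def is_stale(text, today, threshold_days=540):
    """A page is stale when its newest date is older than the threshold."""
    newest = newest_date(text)
    return newest is not None and (today - newest).days > threshold_days


def shingles(text, size=3):
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Shared shingles divided by all shingles, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.85):
    return jaccard(a, b) >= threshold
```

Under these assumptions, lowering `similarityThreshold` would flag looser paraphrases as duplicates, and lowering `stalenessThresholdDays` would flag more pages as stale.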
Audit content you already crawled (composes with apify/website-content-crawler):
{ "datasetId": "YOUR_DATASET_ID", "mode": "both" }
The LLM-backed checks (contradictions, coverage gaps) need an Anthropic API key. Without it, those two checks are skipped and every other check still runs.
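The skip behavior can be pictured as a simple filter over the requested checks. In this sketch the helper name `runnable_checks` and the identifier `coverageGaps` are hypothetical; only `contradictions` appears as a check name in the example input.

```python
# Checks that call the LLM. "coverageGaps" is an assumed identifier.
LLM_CHECKS = {"contradictions", "coverageGaps"}


def runnable_checks(requested, api_key):
    """Drop the LLM-backed checks when no API key is supplied."""
    if api_key:
        return list(requested)
    return [check for check in requested if check not in LLM_CHECKS]


requested = ["staleness", "contradictions", "deadLinks"]
print(runnable_checks(requested, api_key=None))  # only the key-free checks remain
```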
Output
- Dataset: one row per finding (severity, check, issue, detail, suggested fix, URL), sorted most-severe first.
- Key-value store:
  - REPORT: a shareable HTML report with the AI-readiness score and full fix list.
  - SUMMARY / OUTPUT: the score, grade, and severity counts as JSON.
- repaired-knowledge-base dataset (repair / both modes): duplicates collapsed, thin pages dropped, stale pages flagged, content pre-chunked and ready for a vector DB or llms.txt.
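Once the finding rows are downloaded, they can be post-processed with a few lines of code, for example to re-sort after merging several runs or to recompute severity counts. This sketch assumes each row is a dict with a `severity` key and that the labels are `critical`, `high`, `medium`, and `low`; the actual key names and labels should be checked against a real dataset row.

```python
from collections import Counter

# Assumed severity labels and row keys; verify against a real dataset row.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def sort_most_severe_first(findings):
    """Order findings by severity, keeping unknown labels at the end."""
    rank = {level: i for i, level in enumerate(SEVERITY_ORDER)}
    return sorted(findings, key=lambda f: rank.get(f["severity"], len(rank)))


def severity_counts(findings):
    """Count findings per severity level, including levels with zero hits."""
    counts = Counter(f["severity"] for f in findings)
    return {level: counts[level] for level in SEVERITY_ORDER}


findings = [
    {"severity": "low", "check": "chunkHealth", "url": "https://docs.example.com/a"},
    {"severity": "critical", "check": "contradictions", "url": "https://docs.example.com/b"},
    {"severity": "medium", "check": "deadLinks", "url": "https://docs.example.com/c"},
]

for row in sort_most_severe_first(findings):
    print(row["severity"], row["check"], row["url"])
```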
The AI-readiness score (0-100) is based on defect density (defects relative to the size of the knowledge base) rather than the raw defect count, so a large knowledge base isn't penalized just for having more pages.
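The exact scoring formula is not documented here, so the following is only an illustration of why a density-based score is size-independent. The severity weights and the linear formula are assumptions, not the Actor's real calculation.

```python
# Assumed weights: how much one finding of each severity counts as a "defect".
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.6, "medium": 0.3, "low": 0.1}


def readiness_score(severity_counts, page_count):
    """Map weighted defects per page onto a 0-100 score (hypothetical formula)."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    weighted = sum(SEVERITY_WEIGHTS[level] * n for level, n in severity_counts.items())
    density = weighted / page_count
    return round(100 * max(0.0, 1.0 - density))


small = readiness_score({"high": 10}, page_count=100)
large = readiness_score({"high": 100}, page_count=1000)
print(small, large)  # same density, so both knowledge bases get the same score
```

Ten times the pages with ten times the findings yields the same density, which is the behavior the score description promises.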
Modes
- audit: report and fix list only.
- repair: also emit the cleaned corpus.
- both: everything.
Local development
npm install
npm run build
apify run  # or: npm start
Roadmap
- Expose as an MCP server tool (audit_knowledge_base) so an agent can call it mid-workflow before answering.
- Embedding-based duplicate and contradiction candidate selection for higher recall.
- Incremental re-audits that only re-check what changed.