# RAG Doctor: Audit & Repair Your AI Knowledge Base (`sanya_kumari/rag-doctor`) Actor

Audit and repair the content you feed your AI. Finds contradictions, stale facts, duplicates, dead links, and broken chunks that quietly poison RAG, agents, and custom GPTs. Returns a scored report, a prioritized fix list, and a cleaned, ready-to-index knowledge base.

- **URL**: https://apify.com/sanya_kumari/rag-doctor.md
- **Developed by:** [Sanya Kumari](https://apify.com/sanya_kumari) (community)
- **Categories:** AI, Agents
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for the Apify platform usage it consumes, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## RAG Doctor — Knowledge Base Health Check & Repair for AI

Your AI is only as good as the content you feed it. RAG Doctor audits that content the way a linter audits code: it finds the contradictions, stale facts, duplicates, and broken chunks that quietly poison RAG pipelines, agents, and custom GPTs, then hands you a prioritized fix list and an optional cleaned-up version.

Most tools **build** a knowledge base for you. RAG Doctor **fixes** the one you already have.

### Why this exists

Garbage in, confident garbage out. When two pages disagree, a RAG system retrieves one at random and the model states it as fact. When a chunk reads "as shown above," retrieval pulls it alone and the model fills the gap by guessing. These defects are invisible until a user gets a wrong answer. RAG Doctor surfaces them before your users do.

### What it checks

| Check | What it catches | Needs API key |
|-------|-----------------|:---:|
| **Contradictions** | Two pages stating facts that can't both be true (the #1 silent RAG killer) | Yes |
| **Stale facts** | Pages whose newest referenced date is past your freshness threshold | No |
| **Duplicates** | Near-identical pages that crowd out distinct facts at retrieval time | No |
| **Chunk health** | Chunks that lose meaning when retrieved alone (dangling references, orphan pronouns, too short) | No |
| **Dead links** | Cited URLs that 404 or time out | No |
| **AI extractability** | robots.txt blocking AI crawlers, missing sitemap, JavaScript-only content | No |
| **Coverage gaps** | Real user questions the knowledge base cannot answer | Yes |
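
To make the chunk-health row concrete, here is a minimal Python sketch of the kind of heuristics it describes. The phrase list, the pronoun list, the `MIN_WORDS` cutoff, and the `chunk_issues` function are all illustrative assumptions; the Actor's actual rules are not documented here and may differ.

```python
import re

# Hypothetical heuristics for illustration only.
DANGLING_PHRASES = ("as shown above", "as mentioned earlier", "see below")
ORPHAN_OPENERS = ("it", "this", "that", "these", "those", "they")
MIN_WORDS = 20  # assumed cutoff for "too short"


def chunk_issues(chunk: str) -> list[str]:
    """Return the names of the heuristics that a chunk trips."""
    issues = []
    lowered = chunk.lower()
    words = re.findall(r"[a-z0-9']+", lowered)
    if len(words) < MIN_WORDS:
        issues.append("too_short")
    # A reference to surrounding text that will not be retrieved with the chunk.
    if any(phrase in lowered for phrase in DANGLING_PHRASES):
        issues.append("dangling_reference")
    # A chunk that opens with a pronoun whose antecedent lives elsewhere.
    if words and words[0] in ORPHAN_OPENERS:
        issues.append("orphan_pronoun")
    return issues
```

A chunk such as "As shown above, set the flag." trips both the length and the dangling-reference heuristics, which is the failure mode described under "Why this exists".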

### Input

Point it at a site to crawl, or hand it a dataset you already extracted.

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "maxCrawlDepth": 2,
  "mode": "audit",
  "checks": ["staleness", "duplicates", "chunkHealth", "deadLinks", "extractability", "contradictions"],
  "stalenessThresholdDays": 540,
  "similarityThreshold": 0.85,
  "userQuestions": ["How do I rotate my API token?"],
  "anthropicApiKey": "sk-ant-...",
  "llmModel": "claude-haiku-4-5-20251001"
}
````

Crawling and link checks run over the Apify datacenter proxy automatically; there is no proxy option to configure.

Audit content you already crawled (composes with `apify/website-content-crawler`):

```json
{ "datasetId": "YOUR_DATASET_ID", "mode": "both" }
```

The LLM-backed checks (contradictions, coverage gaps) need an Anthropic API key. Without it, those two checks are skipped and every other check still runs.

### Output

- **Dataset** — one row per finding (severity, check, issue, detail, suggested fix, URL). Sorted most-severe first.
- **Key-value store**
  - `REPORT` — a shareable HTML report with the AI-readiness score and full fix list.
  - `SUMMARY` / `OUTPUT` — the score, grade, and severity counts as JSON.
- **`repaired-knowledge-base` dataset** (repair / both modes) — duplicates collapsed, thin pages dropped, stale pages flagged, content pre-chunked and ready for a vector DB or `llms.txt`.

The **AI-readiness score (0-100)** is defect density, not raw count, so a large knowledge base isn't penalized just for having more pages.
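
The exact scoring formula is not published, so the following Python sketch is only one plausible way to turn defect density into a 0-100 score. The severity names, the weights, and the `MAX_DENSITY` saturation point are assumptions made for illustration.

```python
# Assumed weights and saturation point; the Actor's real values are not documented.
SEVERITY_WEIGHTS = {"high": 5.0, "medium": 2.0, "low": 1.0}
MAX_DENSITY = 5.0  # weighted defects per page at which the score bottoms out


def readiness_score(severity_counts: dict[str, int], page_count: int) -> int:
    """Score a knowledge base by weighted defects per page, not by raw defect count."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    weighted = sum(
        SEVERITY_WEIGHTS.get(severity, 1.0) * count
        for severity, count in severity_counts.items()
    )
    density = weighted / page_count
    return round(100 * max(0.0, 1.0 - density / MAX_DENSITY))
```

Under this sketch, 2 low-severity findings across 10 pages and 20 across 100 pages produce the same score, which is the property the density-based design is meant to give.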

### Modes

- `audit` — report and fix list only.
- `repair` — also emit the cleaned corpus.
- `both` — everything.

### Local development

```bash
npm install
npm run build
apify run   # or: npm start
```

### Roadmap

- Expose as an MCP server tool (`audit_knowledge_base`) so an agent can call it mid-workflow before answering.
- Embedding-based duplicate and contradiction candidate selection for higher recall.
- Incremental re-audits that only re-check what changed.

# Actor input Schema

## `startUrls` (type: `array`):

Documentation, help center, or any pages you feed your AI. RAG Doctor crawls these and audits the content. Leave empty if you pass an existing dataset instead.

## `datasetId` (type: `string`):

Instead of crawling, audit content you already extracted. Expects items with a `url` and a `text` (or `markdown`) field. Composes with apify/website-content-crawler.
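
Before passing a dataset, it can help to confirm that its items have the expected shape. The Python sketch below checks for a `url` plus a non-empty `text` or `markdown` field; `usable_items` is a hypothetical helper, and how the Actor itself treats malformed items is not documented here.

```python
def usable_items(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split dataset items into those with a URL and body text, and the rest."""
    accepted, rejected = [], []
    for item in items:
        url = item.get("url")
        body = item.get("text") or item.get("markdown")
        has_url = isinstance(url, str) and bool(url.strip())
        has_body = isinstance(body, str) and bool(body.strip())
        (accepted if has_url and has_body else rejected).append(item)
    return accepted, rejected
```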

## `maxPages` (type: `integer`):

Upper bound on pages crawled when using Start URLs.

## `maxCrawlDepth` (type: `integer`):

How many link hops from the Start URLs to follow. 0 audits only the given URLs.
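
The interaction between `maxCrawlDepth` and `maxPages` can be illustrated with a breadth-first traversal over an in-memory link graph. This Python sketch is an assumption about how such limits typically behave, not the Actor's crawler; `crawl_order` and the `links` mapping are hypothetical.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], start_urls: list[str],
                max_depth: int, max_pages: int) -> list[str]:
    """Visit pages breadth-first, stopping at max_depth hops and max_pages pages."""
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With a depth of 0, only the Start URLs themselves are returned, matching the description above.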

## `mode` (type: `string`):

`audit` returns a scored report and fix list. `repair` also outputs a cleaned, deduplicated knowledge set. `both` does everything.

## `checks` (type: `array`):

Which health checks to run. The `contradictions` and `coverageGaps` checks require an Anthropic API key.

## `stalenessThresholdDays` (type: `integer`):

Flag content whose newest detected date is older than this.

## `similarityThreshold` (type: `number`):

Jaccard similarity (0-1) above which two pages are treated as near-duplicates.
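
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The Python sketch below applies it to three-word shingles; the shingle size and tokenization are assumptions, since the description does not say which features the Actor compares.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Break text into overlapping word n-grams (assumed feature set)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Return |A intersect B| / |A union B| over the two pages' shingle sets."""
    set_a, set_b = shingles(a), shingles(b)
    union = set_a | set_b
    if not union:
        return 0.0  # nothing to compare
    return len(set_a & set_b) / len(union)


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two pages as near-duplicates when similarity exceeds the threshold."""
    return jaccard(a, b) > threshold
```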

## `userQuestions` (type: `array`):

Optional. Questions your users actually ask. Enables coverage-gap detection: which questions your knowledge base cannot answer. Requires an Anthropic API key.

## `anthropicApiKey` (type: `string`):

Required for contradiction detection and coverage gaps. Without it those checks are skipped and the rest still run.

## `llmModel` (type: `string`):

Anthropic model for the LLM-backed checks.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://docs.apify.com/platform"
    }
  ],
  "maxPages": 100,
  "maxCrawlDepth": 2,
  "mode": "audit",
  "checks": [
    "staleness",
    "duplicates",
    "chunkHealth",
    "deadLinks",
    "extractability",
    "contradictions"
  ],
  "stalenessThresholdDays": 540,
  "similarityThreshold": 0.85,
  "llmModel": "claude-haiku-4-5-20251001"
}
```
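
To catch mistakes before starting a run, an input object can be checked locally against the limits published in the OpenAPI input schema. In this Python sketch the bounds, modes, and check names are copied from that schema, while `validate_input` and the rule that either `startUrls` or `datasetId` must be present are assumptions inferred from the field descriptions.

```python
# Limits, modes, and check names copied from the OpenAPI input schema.
BOUNDS = {
    "maxPages": (1, 5000),
    "maxCrawlDepth": (0, 10),
    "stalenessThresholdDays": (30, 3650),
    "similarityThreshold": (0.5, 1),
}
MODES = {"audit", "repair", "both"}
CHECKS = {"staleness", "duplicates", "chunkHealth", "deadLinks",
          "extractability", "contradictions", "coverageGaps"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    errors = []
    # Inferred rule: the Actor either crawls Start URLs or reads a dataset.
    if not actor_input.get("startUrls") and not actor_input.get("datasetId"):
        errors.append("provide startUrls or datasetId")
    for field, (low, high) in BOUNDS.items():
        value = actor_input.get(field)
        if value is not None and not low <= value <= high:
            errors.append(f"{field} must be between {low} and {high}")
    mode = actor_input.get("mode", "audit")
    if mode not in MODES:
        errors.append(f"unknown mode: {mode}")
    unknown = sorted(set(actor_input.get("checks", [])) - CHECKS)
    if unknown:
        errors.append(f"unknown checks: {', '.join(unknown)}")
    return errors
```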

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://docs.apify.com/platform"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("sanya_kumari/rag-doctor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://docs.apify.com/platform" }] }

# Run the Actor and wait for it to finish
run = client.actor("sanya_kumari/rag-doctor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://docs.apify.com/platform"
    }
  ]
}' |
apify call sanya_kumari/rag-doctor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=sanya_kumari/rag-doctor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "RAG Doctor: Audit & Repair Your AI Knowledge Base",
        "description": "Audit and repair the content you feed your AI. Finds contradictions, stale facts, duplicates, dead links, and broken chunks that quietly poison RAG, agents, and custom GPTs. Returns a scored report, a prioritized fix list, and a cleaned, ready-to-index knowledge base.",
        "version": "0.1",
        "x-build-id": "2MaGjaihex2sNiGDV"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/sanya_kumari~rag-doctor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-sanya_kumari-rag-doctor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/sanya_kumari~rag-doctor/runs": {
            "post": {
                "operationId": "runs-sync-sanya_kumari-rag-doctor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/sanya_kumari~rag-doctor/run-sync": {
            "post": {
                "operationId": "run-sync-sanya_kumari-rag-doctor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Documentation, help center, or any pages you feed your AI. RAG Doctor crawls these and audits the content. Leave empty if you pass an existing dataset instead.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "datasetId": {
                        "title": "Existing dataset ID",
                        "type": "string",
                        "description": "Instead of crawling, audit content you already extracted. Expects items with a `url` and a `text` (or `markdown`) field. Composes with apify/website-content-crawler."
                    },
                    "maxPages": {
                        "title": "Max pages to crawl",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Upper bound on pages crawled when using Start URLs.",
                        "default": 100
                    },
                    "maxCrawlDepth": {
                        "title": "Max crawl depth",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How many link hops from the Start URLs to follow. 0 audits only the given URLs.",
                        "default": 2
                    },
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "audit",
                            "repair",
                            "both"
                        ],
                        "type": "string",
                        "description": "audit returns a scored report + fix list. repair also outputs a cleaned, deduplicated knowledge set. both does everything.",
                        "default": "audit"
                    },
                    "checks": {
                        "title": "Checks to run",
                        "type": "array",
                        "description": "Which health checks to run. Contradictions require an Anthropic API key.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "staleness",
                                "duplicates",
                                "chunkHealth",
                                "deadLinks",
                                "extractability",
                                "contradictions",
                                "coverageGaps"
                            ],
                            "enumTitles": [
                                "Stale facts",
                                "Duplicates",
                                "Chunk health",
                                "Dead links",
                                "AI extractability",
                                "Contradictions (LLM)",
                                "Coverage gaps (LLM)"
                            ]
                        },
                        "default": [
                            "staleness",
                            "duplicates",
                            "chunkHealth",
                            "deadLinks",
                            "extractability",
                            "contradictions"
                        ]
                    },
                    "stalenessThresholdDays": {
                        "title": "Staleness threshold (days)",
                        "minimum": 30,
                        "maximum": 3650,
                        "type": "integer",
                        "description": "Flag content whose newest detected date is older than this.",
                        "default": 540
                    },
                    "similarityThreshold": {
                        "title": "Duplicate similarity threshold",
                        "minimum": 0.5,
                        "maximum": 1,
                        "type": "number",
                        "description": "Jaccard similarity (0-1) above which two pages are treated as near-duplicates.",
                        "default": 0.85
                    },
                    "userQuestions": {
                        "title": "Real user questions (coverage gaps)",
                        "type": "array",
                        "description": "Optional. Questions your users actually ask. Enables coverage-gap detection: which questions your knowledge base cannot answer. Requires an Anthropic API key.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "anthropicApiKey": {
                        "title": "Anthropic API key",
                        "type": "string",
                        "description": "Required for contradiction detection and coverage gaps. Without it those checks are skipped and the rest still run."
                    },
                    "llmModel": {
                        "title": "LLM model",
                        "enum": [
                            "claude-haiku-4-5-20251001",
                            "claude-sonnet-4-6"
                        ],
                        "type": "string",
                        "description": "Anthropic model for the LLM-backed checks.",
                        "default": "claude-haiku-4-5-20251001"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
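
If you prefer not to install a client library, the `run-sync-get-dataset-items` path in the specification above can be called with Python's standard library. This sketch only builds the request, using the server URL, path, and `token` query parameter from the spec; `build_sync_request` is a hypothetical helper, and the token value is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "sanya_kumari~rag-doctor"


def build_sync_request(actor_input: dict, token: str) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Because the call blocks until the run finishes, a generous timeout is advisable for larger crawls.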
