# AI Training Data Quality MCP Server (`ryanclinton/ai-training-data-quality-mcp`) Actor

AI training data quality assessment, bias detection, and governance scoring for AI agents via the Model Context Protocol.

- **URL**: https://apify.com/ryanclinton/ai-training-data-quality-mcp.md
- **Developed by:** [ryan clinton](https://apify.com/ryanclinton) (community)
- **Categories:** AI, Developer tools
- **Stats:** 1 total user, 0 monthly users, 0.0% of runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage: the Actor itself is free to use, and you pay only for the Apify platform resources it consumes, which get cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Training Data Quality MCP Server

AI training data quality assessment, bias detection, and governance scoring — delivered to any MCP-compatible AI agent through a single always-on server. This server orchestrates 7 specialized data sources (dataset registries, GitHub, ArXiv, Semantic Scholar, Hacker News, Wikipedia, and Data.gov) to produce per-dataset quality scores, bias indicator reports, provenance chains, governance grades, trend rankings, and model-data fit assessments. The result is a complete intelligence layer for AI teams that need to understand, audit, and defend their training data choices.

Every tool call queries multiple sources in parallel, builds a cross-referenced data network linking datasets to their associated papers, repositories, and community discussions, and runs weighted scoring algorithms to surface the best data for your model. No API keys, no configuration — connect and query.

### ⬇️ What data can you access?

| Data Point | Source | Example |
|---|---|---|
| 📦 AI training datasets, metadata, and documentation | AI Training Data Curator | "Common Voice 17.0 (CC0, 114 languages)" |
| 💻 Open-source dataset repos and data tools | GitHub Repo Search | "huggingface/datasets — 19,400 stars" |
| 📄 AI/ML research papers referencing datasets | ArXiv Preprints | "Data-Juicer: A One-Stop Data Processing System for LLM Training" |
| 🔬 Academic papers with citation counts | Semantic Scholar | "ImageNet Large Scale Visual Recognition Challenge — 65,000+ citations" |
| 💬 Community discussions on data quality issues | Hacker News Search | "Ask HN: What training data sources do you trust?" |
| 📖 Encyclopedic context for well-known datasets | Wikipedia Search | "LAION-5B — documented controversies and retractions" |
| 🏛️ US government open data registries | Data.gov | "CDC National Health Interview Survey (Public Domain)" |

### ❓ Why use an AI training data quality MCP server?

Choosing the wrong training data is expensive. A model trained on biased, poorly licensed, or undocumented data can fail audits, produce discriminatory outputs, or expose your organization to legal liability. Manually evaluating datasets across registries, papers, and repositories takes days per domain — and still misses the cross-source context that reveals whether a dataset is genuinely trusted by the research community.

This server automates that evaluation. It queries 7 sources simultaneously, links datasets to their academic references and code implementations, applies a weighted scoring model across 5 quality dimensions, and flags bias indicators and governance gaps before you commit to a dataset. What would take a data scientist two days takes a tool call.

- **Scheduling** — Run recurring dataset quality audits on a weekly cadence to catch newly deprecated or flagged datasets
- **API access** — Integrate quality checks directly into ML pipelines via the Apify API or MCP protocol
- **Parallel source queries** — All 7 data sources are queried simultaneously, not sequentially, for faster results
- **Monitoring** — Get Slack or email alerts when governance scores drop or new bias indicators appear
- **Integrations** — Connect results to Google Sheets, Notion, or compliance documentation via Zapier or Make

### Features

- **8 specialized tools** covering the full data evaluation lifecycle: landscape mapping, quality scoring, bias detection, provenance tracing, governance grading, trend tracking, model-data fit assessment, and comprehensive reporting
- **7-source parallel querying** — simultaneously searches AI Training Data Curator, GitHub, ArXiv, Semantic Scholar, Hacker News, Wikipedia, and Data.gov with configurable result limits per source (1–100)
- **Weighted composite quality scoring** — 5-dimension model: completeness (25%), documentation (25%), license openness (20%), recency (15%), community engagement (15%)
- **7-type bias detection** — identifies geographic, demographic, temporal, linguistic, domain, sampling, and labeling biases using keyword analysis across dataset descriptions, paper abstracts, and community discussions
- **15+ bias keyword patterns** — detects specific signals including "english only", "web crawl", "reddit", "crowdsourced", "stereotype", "hate speech", "deprecated", and more, each mapped to severity levels (low/medium/high/critical)
- **License scoring matrix** — 20+ license types scored for AI training openness: CC0/public domain (100), MIT/Apache (90–95), CC-BY (90), GPL (60), CC-BY-NC (50), proprietary (10–15)
- **Cross-reference network building** — links datasets to papers, repositories, and discussions via keyword overlap detection (3+ significant word overlap threshold), inferring relationship types: `trains_on`, `evaluates`, `references`, `derived_from`, `describes`, `discusses`
- **11 model type profiles** — dedicated data requirement profiles for LLM, vision, image classification, object detection, speech recognition, translation, recommendation, reinforcement learning, multimodal, diffusion, and graph neural network
- **5-dimension governance scoring** — license compliance (25%), privacy protection (25%), documentation quality (20%), access control (15%), auditability (15%) with compliance status: `compliant` / `partial` / `non_compliant` / `unknown`
- **Provenance chain tracing** — reconstructs 4-stage data lineage: origin → research validation → implementation evidence → licensing, with integrity scores and identified gaps
- **11 data modality classifiers** — text/NLP, image/vision, audio/speech, video, tabular, multimodal, code, graph/network, geospatial, medical/health, scientific
- **Severity escalation logic** — bias severity upgrades automatically when the same indicator appears across 5 or more sources
- **Spending limit enforcement** — every tool call checks `Actor.charge()` and halts gracefully if a per-run spending cap is reached
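
The weighted composite from the feature list can be sketched in Python. This is an illustrative reimplementation rather than the server's own code; the dimension names and weights come from the list above:

```python
# Weights for the 5-dimension composite quality score, per the feature list.
QUALITY_WEIGHTS = {
    "completeness": 0.25,
    "documentation": 0.25,
    "licenseOpenness": 0.20,
    "recency": 0.15,
    "communityEngagement": 0.15,
}

def composite_quality(scores: dict) -> int:
    """Combine per-dimension scores (0-100) into a weighted overall score."""
    total = sum(scores[dim] * weight for dim, weight in QUALITY_WEIGHTS.items())
    return round(total)
```

As a sanity check, plugging in the NIH Chest X-Ray sub-scores from the output example (completeness 90, documentation 88, license openness 85, recency 55, engagement 95) yields 84, matching that entry's `overall` score.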

### Use cases for AI training data quality assessment

#### Pre-training data audit for ML teams

Data scientists and ML engineers need to evaluate candidate datasets before committing to a training run that could cost thousands of dollars of compute. Running `assess_dataset_quality` and `detect_bias_indicators` before training surfaces documentation gaps, restrictive licenses, and demographic imbalances that would otherwise only surface during model evaluation — far too late. A 2-hour manual review becomes a 30-second tool call.

#### EU AI Act compliance preparation

AI governance teams preparing for EU AI Act Article 10 compliance need documented evidence that training data for high-risk systems was selected with due diligence. `score_data_governance` produces per-dataset compliance assessments across license, privacy, documentation, access, and auditability dimensions. `generate_data_quality_report` wraps all analyses into an executive summary suitable for regulatory documentation.

#### Dataset discovery and landscape mapping

Research teams entering a new domain often do not know which datasets exist, which are trusted by the community, or how they relate to each other. `map_data_landscape` builds a cross-referenced inventory from 7 sources, ranks datasets by quality, and reveals relationships between datasets and the papers that use them. Discovering that a dataset is cited in 500+ papers — or mentioned in Hacker News threads about data quality issues — is context that no single registry provides.

#### Responsible AI documentation

AI teams presenting training data decisions to boards, ethics committees, or enterprise procurement require structured documentation. `generate_data_quality_report` produces an executive summary, quality distribution, bias risk score, governance grade, and trend context in a single structured JSON response that feeds directly into reporting workflows.

#### Research data due diligence

Legal and compliance teams vetting third-party or open-source datasets for commercial use need to verify licensing chains and understand whether a dataset has been flagged by the research community. `analyze_data_provenance` traces each dataset's origin, cross-references it with academic papers and GitHub repositories, and identifies licensing gaps — producing integrity scores for each dataset in the provenance chain.

#### Emerging dataset monitoring for AI investment

Investors, product teams, and research leads tracking the data landscape for strategic decisions need to know which datasets are gaining traction before they become widely known. `track_dataset_trends` combines mention signals from research papers, repositories, and community discussions to rank datasets by trend score (mentions × 15 + source diversity × 10) and identify emerging data modalities.
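
The trend formula mentioned above is simple enough to sketch directly (an illustrative function; the server computes this internally):

```python
def trend_score(mentions: int, distinct_sources: int) -> int:
    """Trend score as described above: mentions x 15 + source diversity x 10."""
    return mentions * 15 + distinct_sources * 10

# A dataset mentioned 6 times across 3 distinct sources scores 6*15 + 3*10 = 120.
```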

### How to assess AI training data quality

1. **Connect your MCP client** — Add the server URL `https://ai-training-data-quality-mcp.apify.actor/mcp` to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client. No API keys required.
2. **Pick your starting tool** — For a quick quality check on a specific domain, start with `assess_dataset_quality`. For a full audit, use `generate_data_quality_report`. For bias-specific concerns, go directly to `detect_bias_indicators`.
3. **Run a query** — Provide a topic or domain (e.g., "medical imaging datasets", "LLM training data", "face recognition"). The server queries relevant sources and returns results in 30–120 seconds depending on source count and result limits.
4. **Act on recommendations** — Each tool returns a structured JSON response with per-dataset scores, strengths, weaknesses, and a recommendation tier (`highly_recommended`, `recommended`, `use_with_caution`, `not_recommended`). Use these to prioritize datasets for your training pipeline.

### MCP tools

| Tool | Price | Description |
|---|---|---|
| `map_data_landscape` | $0.045 | Map training data available for a topic across 7 sources. Returns quality-ranked inventory with cross-references. Default: 4 sources, 25 results each. |
| `assess_dataset_quality` | $0.045 | Score datasets on 5 weighted dimensions. Returns per-dataset breakdowns with recommendation tiers. Default: 3 sources, 30 results each. |
| `detect_bias_indicators` | $0.045 | Detect 7 bias types in dataset metadata and descriptions. Returns severity ratings and mitigation suggestions. Default: 4 sources, 30 results each. |
| `analyze_data_provenance` | $0.045 | Trace 4-stage provenance chains for datasets. Returns integrity scores and identified gaps. Default: 5 sources, 25 results each. |
| `score_data_governance` | $0.045 | Score governance across 5 compliance dimensions. Returns compliance status per dataset. Default: 3 sources, 30 results each. |
| `track_dataset_trends` | $0.045 | Track trending datasets and emerging modalities with configurable timeframe context. Default: 4 sources, 30 results each. |
| `assess_model_data_fit` | $0.045 | Assess dataset fit for 11 supported model types. Returns fit scores with gap analysis and alternatives. Default: 3 sources, 25 results each. |
| `generate_data_quality_report` | $0.045 | Comprehensive report combining all analyses. Returns executive summary, quality overview, bias assessment, governance summary, trends, and recommendations. Default: all 7 sources, 20 results each. |

#### Tool input parameters

All tools accept the following parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | Yes | — | Topic, domain, or dataset name to analyze (e.g., "medical imaging", "CommonCrawl", "sentiment analysis") |
| `sources` | array | No | Varies by tool | Which sources to query: `training_data`, `github`, `arxiv`, `semantic_scholar`, `hackernews`, `wikipedia`, `data_gov` |
| `max_per_source` | number | No | 20–30 | Results to fetch per source (1–100). Lower = faster and cheaper; higher = more comprehensive |
| `model_type` | string | Yes (`assess_model_data_fit` only) | — | Model architecture: "LLM", "vision", "speech recognition", "multimodal", "diffusion", etc. |
| `timeframe` | string | No (`track_dataset_trends` only) | `"recent"` | Timeframe context string, e.g., "2024", "last 6 months", "recent" |

#### Example tool calls

**Quick bias check for a specific dataset type:**
```json
{
  "tool": "detect_bias_indicators",
  "arguments": {
    "query": "face recognition dataset",
    "sources": ["training_data", "arxiv", "semantic_scholar"],
    "max_per_source": 20
  }
}
````

**Full governance audit for a domain:**

```json
{
  "tool": "score_data_governance",
  "arguments": {
    "query": "healthcare NLP training data",
    "sources": ["training_data", "github", "data_gov"],
    "max_per_source": 30
  }
}
```

**Model-data fit assessment for an LLM:**

```json
{
  "tool": "assess_model_data_fit",
  "arguments": {
    "model_type": "LLM",
    "query": "text corpus multilingual",
    "sources": ["training_data", "github", "arxiv"],
    "max_per_source": 25
  }
}
```

**Comprehensive report for executive review:**

```json
{
  "tool": "generate_data_quality_report",
  "arguments": {
    "query": "autonomous vehicle perception datasets",
    "sources": ["training_data", "github", "arxiv", "semantic_scholar", "hackernews", "wikipedia", "data_gov"],
    "max_per_source": 15
  }
}
```

### ⬆️ Output example

Response from `assess_dataset_quality` for query "medical imaging":

```json
{
  "query": "medical imaging",
  "datasetsAssessed": 47,
  "averageQuality": 62,
  "qualityDistribution": {
    "excellent": 8,
    "good": 19,
    "fair": 14,
    "poor": 6
  },
  "datasets": [
    {
      "name": "NIH Chest X-Ray Dataset",
      "source": "training_data",
      "url": "https://nihcc.app.box.com/v/ChestXray-NIHCC",
      "quality": {
        "overall": 84,
        "completeness": 90,
        "recency": 55,
        "documentation": 88,
        "licenseOpenness": 85,
        "communityEngagement": 95
      },
      "strengths": [
        "Well-documented metadata",
        "Good documentation",
        "Open and permissive license",
        "Strong community engagement"
      ],
      "weaknesses": [
        "Outdated - not updated recently"
      ],
      "recommendation": "highly_recommended"
    },
    {
      "name": "MIMIC-CXR",
      "source": "training_data",
      "url": "https://physionet.org/content/mimic-cxr/",
      "quality": {
        "overall": 71,
        "completeness": 85,
        "recency": 70,
        "documentation": 80,
        "licenseOpenness": 45,
        "communityEngagement": 75
      },
      "strengths": [
        "Well-documented metadata",
        "Good documentation",
        "Recently updated",
        "Strong community engagement"
      ],
      "weaknesses": [
        "Restrictive or unclear license"
      ],
      "recommendation": "recommended"
    },
    {
      "name": "CheXpert",
      "source": "github",
      "url": "https://github.com/stanfordmlgroup/CheXpert",
      "quality": {
        "overall": 58,
        "completeness": 70,
        "recency": 35,
        "documentation": 65,
        "licenseOpenness": 50,
        "communityEngagement": 80
      },
      "strengths": [
        "Strong community engagement"
      ],
      "weaknesses": [
        "Outdated - not updated recently",
        "Restrictive or unclear license"
      ],
      "recommendation": "use_with_caution"
    }
  ]
}
```
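
For pipelines that consume this response, a minimal Python sketch that shortlists datasets by recommendation tier (field names are taken from the example above; the embedded JSON is a trimmed excerpt):

```python
import json

# Trimmed excerpt of an assess_dataset_quality response (see the example above).
raw = '''{"datasets": [
  {"name": "NIH Chest X-Ray Dataset", "quality": {"overall": 84}, "recommendation": "highly_recommended"},
  {"name": "MIMIC-CXR", "quality": {"overall": 71}, "recommendation": "recommended"},
  {"name": "CheXpert", "quality": {"overall": 58}, "recommendation": "use_with_caution"}
]}'''
response = json.loads(raw)

# Keep only datasets in the top two recommendation tiers, best first.
APPROVED = {"highly_recommended", "recommended"}
shortlist = sorted(
    (d for d in response["datasets"] if d["recommendation"] in APPROVED),
    key=lambda d: d["quality"]["overall"],
    reverse=True,
)
print([d["name"] for d in shortlist])
# → ['NIH Chest X-Ray Dataset', 'MIMIC-CXR']
```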

### Output fields

#### assess\_dataset\_quality

| Field | Type | Description |
|---|---|---|
| `query` | string | The input query |
| `datasetsAssessed` | number | Total datasets evaluated |
| `averageQuality` | number | Mean quality score (0–100) across all datasets |
| `qualityDistribution.excellent` | number | Datasets scoring 75–100 |
| `qualityDistribution.good` | number | Datasets scoring 55–74 |
| `qualityDistribution.fair` | number | Datasets scoring 35–54 |
| `qualityDistribution.poor` | number | Datasets scoring 0–34 |
| `datasets[].name` | string | Dataset or resource name |
| `datasets[].source` | string | Source actor that returned this result |
| `datasets[].url` | string | Direct URL to dataset |
| `datasets[].quality.overall` | number | Weighted composite score (0–100) |
| `datasets[].quality.completeness` | number | Field population and metadata completeness (0–100) |
| `datasets[].quality.recency` | number | Last update date score (0–100) |
| `datasets[].quality.documentation` | number | README, description, and tagging quality (0–100) |
| `datasets[].quality.licenseOpenness` | number | License permissiveness for AI training (0–100) |
| `datasets[].quality.communityEngagement` | number | Stars, forks, and citations (0–100) |
| `datasets[].strengths` | array | List of positive quality signals |
| `datasets[].weaknesses` | array | List of quality concerns |
| `datasets[].recommendation` | string | `highly_recommended` / `recommended` / `use_with_caution` / `not_recommended` |
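
The `qualityDistribution` buckets map to the score ranges listed above; a tiny illustrative classifier (the function name is ours, the thresholds are from the table):

```python
def quality_tier(score: int) -> str:
    """Map a 0-100 quality score to its qualityDistribution bucket."""
    if score >= 75:
        return "excellent"
    if score >= 55:
        return "good"
    if score >= 35:
        return "fair"
    return "poor"
```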

#### detect\_bias\_indicators

| Field | Type | Description |
|---|---|---|
| `biasIndicators[].type` | string | Bias category: `geographic`, `demographic`, `temporal`, `linguistic`, `domain`, `sampling`, `labeling` |
| `biasIndicators[].severity` | string | `low` / `medium` / `high` / `critical` |
| `biasIndicators[].description` | string | Human-readable description of the bias |
| `biasIndicators[].evidence` | array | Source-tagged evidence strings (e.g., "\[arxiv] RedditBias Dataset") |
| `biasIndicators[].mitigationSuggestions` | array | Actionable steps to address the bias |
| `overallBiasRisk` | string | `low` / `medium` / `high` / `critical` |
| `biasRiskScore` | number | Weighted composite bias risk (0–100) |

#### score\_data\_governance

| Field | Type | Description |
|---|---|---|
| `datasets[].governance.overall` | number | Composite governance score (0–100) |
| `datasets[].governance.licenseCompliance` | number | License clarity and training compatibility (0–100) |
| `datasets[].governance.privacyProtection` | number | PII handling, anonymization, consent signals (0–100) |
| `datasets[].governance.documentationQuality` | number | Datasheets, model cards, data cards (0–100) |
| `datasets[].governance.accessControl` | number | Authentication and versioning controls (0–100) |
| `datasets[].governance.auditability` | number | Change logs and provenance trail (0–100) |
| `datasets[].complianceStatus` | string | `compliant` / `partial` / `non_compliant` / `unknown` |
| `datasets[].risks` | array | Identified governance risk strings |

#### generate\_data\_quality\_report

| Field | Type | Description |
|---|---|---|
| `executiveSummary` | string | Narrative summary covering quality, bias risk, governance, and cross-references |
| `landscape.topDatasets` | array | Top 10 datasets ranked by quality score |
| `qualityOverview.averageQuality` | number | Mean quality across all assessed datasets |
| `biasAssessment.overallRisk` | string | Aggregated bias risk rating |
| `biasAssessment.riskScore` | number | Bias risk score (0–100) |
| `biasAssessment.topIndicators` | array | Top 5 bias indicators by severity |
| `governanceSummary.averageScore` | number | Mean governance score across datasets |
| `trends.emergingModalities` | array | Top 5 modalities by mention count |
| `trends.trendingDatasets` | array | Top 10 datasets by trend score |
| `recommendations` | array | Up to 10 prioritized, deduplicated action items |
| `sourcesConsulted` | array | Which source actors contributed to the report |

### How much does it cost to assess AI training data quality?

This MCP server uses **pay-per-event pricing** — you pay **$0.045 per tool call**. Platform compute costs are included. The Apify Free plan includes $5 of monthly credits — enough for **111 tool calls** at no cost.

| Scenario | Tool calls | Cost per call | Total cost |
|---|---|---|---|
| Single bias check | 1 | $0.045 | $0.045 |
| Domain evaluation (5 tools) | 5 | $0.045 | $0.225 |
| Full 8-tool assessment | 8 | $0.045 | $0.36 |
| Weekly audit (8 tools × 4 weeks) | 32 | $0.045 | $1.44 |
| Monthly compliance review (10 domains) | 80 | $0.045 | $3.60 |

You can set a **maximum spending limit** per run to control costs. The server checks the limit before each tool call and halts gracefully if the cap is reached.

Compare this to enterprise data governance platforms like Collibra or Alation at $50,000–$200,000/year. For most AI teams, this server covers data quality due diligence for $2–5/month with no subscription commitment.
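
Budgeting a run is a one-liner (an illustrative helper using the $0.045-per-call price from the table above):

```python
def run_cost(tool_calls: int, price_per_call: float = 0.045) -> float:
    """Estimated cost in USD for a given number of tool calls."""
    return round(tool_calls * price_per_call, 3)

# A full 8-tool assessment: run_cost(8) -> 0.36
# A weekly audit over 4 weeks: run_cost(32) -> 1.44
```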

### How to connect this MCP server

#### Claude Desktop

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "ai-training-data-quality": {
      "url": "https://ai-training-data-quality-mcp.apify.actor/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}
```

#### Cursor / Windsurf / Cline

Add the MCP server URL `https://ai-training-data-quality-mcp.apify.actor/mcp` in your editor's MCP settings panel. Use your Apify API token as the Bearer token.

#### Programmatic (HTTP / cURL)

```bash
# Call the detect_bias_indicators tool directly
curl -X POST "https://ai-training-data-quality-mcp.apify.actor/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "detect_bias_indicators",
      "arguments": {
        "query": "face recognition dataset",
        "sources": ["training_data", "arxiv", "semantic_scholar", "hackernews"],
        "max_per_source": 25
      }
    },
    "id": 1
  }'
```

#### Python (via Apify Actor API)

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/ai-training-data-quality-mcp").call(run_input={})

print(f"MCP server running. Endpoint: https://ai-training-data-quality-mcp.apify.actor/mcp")
print(f"Actor run ID: {run['id']}")
```

#### JavaScript (via Apify Actor API)

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/ai-training-data-quality-mcp").call({});

console.log(`MCP server running. Endpoint: https://ai-training-data-quality-mcp.apify.actor/mcp`);
console.log(`Actor run ID: ${run.id}`);
```
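
The two client snippets above start the Actor; to call a tool over HTTP from Python without the client library, the same JSON-RPC payload as the cURL example can be posted with the standard library (a sketch; the token and arguments are placeholders):

```python
import json
import urllib.request

# Same JSON-RPC shape as the cURL example above.
payload = {
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
        "name": "assess_dataset_quality",
        "arguments": {"query": "medical imaging", "max_per_source": 15},
    },
    "id": 1,
}

req = urllib.request.Request(
    "https://ai-training-data-quality-mcp.apify.actor/mcp",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_APIFY_TOKEN",
    },
)

# Uncomment to send the request (requires a valid Apify token):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```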

### How the AI Training Data Quality MCP Server works

#### Phase 1: Parallel source querying

When a tool is called, the server invokes up to 7 Apify actor wrappers in parallel using `Promise.all()`. Each actor handles its own source: `ryanclinton/ai-training-data-curator` for dataset registries, `ryanclinton/github-repo-search` for code repositories, `ryanclinton/arxiv-paper-search` sorted by relevance, `ryanclinton/semantic-scholar-search` for citation-rich academic results, `ryanclinton/hackernews-search` for community discussion signals, `ryanclinton/wikipedia-article-search` for encyclopedic context, and `ryanclinton/datagov-dataset-search` for government open data. Each actor runs with a 180-second timeout and up to 500 items per dataset. Results from actors that return error messages are filtered out before network construction.

#### Phase 2: Data network construction

Results from all sources are assembled into a typed data network. Each item becomes a `DataNode` with inferred type (`dataset`, `repo`, `paper`, `discussion`, `article`, `gov_dataset`) and a normalized metadata object extracting name, description, license, stars, forks, citations, topics, and timestamps from source-specific field names. Nodes are deduplicated by a normalized ID (`source:name_slug`). Cross-reference edges are built by comparing every pair of nodes from different sources: if 3 or more significant words (length > 4) overlap between their combined name and description text, an edge is created with a relationship type inferred from node type pairs (`trains_on`, `evaluates`, `references`, `derived_from`, `describes`, `discusses`).
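
The overlap rule can be sketched as follows (illustrative names; the thresholds are from the description above: 3 or more shared words longer than 4 characters, between nodes from different sources):

```python
import re

def significant_words(text: str) -> set:
    """Lowercased words longer than 4 characters (illustrative tokenizer)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 4}

def should_link(a: dict, b: dict, threshold: int = 3) -> bool:
    """Create an edge when two nodes from different sources share enough words."""
    if a["source"] == b["source"]:
        return False
    overlap = (significant_words(a["name"] + " " + a["description"])
               & significant_words(b["name"] + " " + b["description"]))
    return len(overlap) >= threshold

paper = {"source": "arxiv",
         "name": "ImageNet Large Scale Visual Recognition Challenge",
         "description": "benchmark paper for image classification"}
dataset = {"source": "training_data",
           "name": "ImageNet",
           "description": "large scale visual recognition dataset for image classification"}
```

Here `should_link(paper, dataset)` is true (seven shared significant words), while two nodes from the same source are never linked.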

#### Phase 3: Quality and analysis scoring

Analysis functions operate on the completed network. **Quality scoring** computes 5 sub-scores per node: completeness (field population, 0–100), recency (date-based decay from 100 for <30 days to 20 for >2 years), documentation (description length tiers), license openness (20+ license keys mapped to explicit scores), and community engagement (logarithmic tiers for stars, forks, and citations). The weighted composite uses completeness×0.25 + documentation×0.25 + licenseOpenness×0.20 + recency×0.15 + communityEngagement×0.15. **Bias detection** scans combined node text against 15+ keyword patterns, groups matches by indicator type, and escalates severity when the same indicator appears across 5+ sources. **Bias risk score** uses weighted severity sums: critical×25, high×15, medium×8, low×3, capped at 100. **Governance scoring** uses description length, license string matching, and metadata presence to produce 5 sub-scores combined with the same dimensional weighting. **Model-data fit** matches node text against modality keyword lists (11 modalities defined) and model-specific feature requirement lists (11 model profiles), producing a fit score from base 20 + modality match (30) + feature matches (10 each, max 30) + quality contribution (quality×0.2).
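
The bias risk formula described above (critical x 25, high x 15, medium x 8, low x 3, capped at 100) as an illustrative sketch:

```python
# Severity weights for the bias risk score, per the description above.
SEVERITY_WEIGHTS = {"critical": 25, "high": 15, "medium": 8, "low": 3}

def bias_risk_score(severities: list) -> int:
    """Weighted sum over detected indicator severities, capped at 100."""
    return min(100, sum(SEVERITY_WEIGHTS[s] for s in severities))

# One high + one medium + one low indicator: 15 + 8 + 3 = 26.
# Five critical indicators would exceed the cap and return 100.
```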

#### Phase 4: Response assembly

Results are sorted by their primary score (quality, bias severity, governance, trend score, or fit score), truncated to reasonable limits (top 30 for quality, top 25 for trends, top 30 for governance), and serialized as structured JSON. The `generate_data_quality_report` tool invokes all 5 analysis functions and merges their outputs into a single executive report with a narrative summary string assembled from the aggregated statistics.

### Tips for best results

1. **Use fewer sources for speed.** The default source sets are tuned for quality-to-speed balance. For a fast bias check, use `["training_data", "arxiv", "semantic_scholar"]` and set `max_per_source: 15`. For maximum coverage, use all 7 sources with `max_per_source: 20`.

2. **Use `generate_data_quality_report` for new domains.** When evaluating a domain you have not assessed before, start with the comprehensive report to get a full picture in one call before drilling into specific dimensions.

3. **Target specific bias types with focused queries.** Instead of querying "NLP dataset", query "English-only NLP dataset" or "Reddit NLP corpus" to get bias detection results that reflect the specific risk vectors you are concerned about.

4. **Include `hackernews` for real-world sentiment.** Hacker News discussions often surface practical quality issues (data leakage, benchmark contamination, legal challenges) that do not appear in academic papers or dataset metadata.

5. **Include `data_gov` for regulated industries.** For healthcare, finance, or government AI applications, `data_gov` surfaces public-domain datasets with strong governance scores and generally lower HIPAA and GDPR exposure.

6. **Use `analyze_data_provenance` before licensing reviews.** Run provenance analysis before a legal team reviews data licensing. The integrity scores and chain gaps give the legal team a specific list of questions rather than requiring them to research from scratch.

7. **Combine `track_dataset_trends` with `assess_model_data_fit`.** Run trends first to identify which datasets are currently popular for your domain, then pass those dataset names into `assess_model_data_fit` as the query to get specific fit scores.

8. **Set `max_per_source: 10` for test runs.** Before committing to a comprehensive analysis, run a small test with 10 results per source to verify the query returns relevant results for your domain.

### Combine with other Apify actors

| Actor | How to combine |
|---|---|
| [AI Training Data Curator](https://apify.com/ryanclinton/ai-training-data-curator) | Run the curator directly for bulk dataset discovery, then pass dataset names into this MCP for quality scoring |
| [Company Deep Research](https://apify.com/ryanclinton/company-deep-research) | Research data provider companies before procurement — pair governance scores from this server with corporate due diligence |
| [Website Content to Markdown](https://apify.com/ryanclinton/website-content-to-markdown) | Convert dataset documentation pages to markdown for LLM-readable quality summaries |
| [WHOIS Domain Lookup](https://apify.com/ryanclinton/whois-domain-lookup) | Verify ownership and registration details for dataset hosting domains during provenance analysis |
| [Trustpilot Review Analyzer](https://apify.com/ryanclinton/trustpilot-review-analyzer) | Assess reputation of commercial data vendors alongside governance scores from this server |
| [SEC EDGAR Filing Analyzer](https://apify.com/ryanclinton/sec-edgar-filing-analyzer) | For public data companies, cross-reference SEC disclosures with governance assessments |
| [Website Tech Stack Detector](https://apify.com/ryanclinton/website-tech-stack-detector) | Detect infrastructure and security posture of dataset hosting platforms |

### Limitations

- **Metadata analysis only.** This server analyzes dataset documentation, descriptions, papers, and community discussions. It does not download or inspect actual dataset content. Pixel-level bias, statistical distributional analysis, and data poisoning detection require direct access to dataset files.
- **English-language sources.** All 7 data sources return primarily English-language content. Non-English dataset registries, Chinese AI research platforms, and regional government data portals are not queried.
- **Bias detection by keyword heuristics.** Bias indicators are identified by keyword matching against dataset metadata, not by statistical analysis of the underlying data distribution. A dataset description that does not mention "English only" or "US-centric" will not trigger those indicators even if the actual data is geographically concentrated.
- **License detection by string matching.** License scores are based on normalized string matching against a known license list. Non-standard or custom license terms may receive a generic fallback score of 40 rather than an accurate assessment.
- **No real-time data.** The Apify Actor wrappers fetch current data, but the freshness of the underlying sources depends on each Actor's data pipeline. ArXiv and Semantic Scholar results typically reflect papers indexed within the past few days; Data.gov and some dataset registries may lag by weeks.
- **Government data limited to US.** The `data_gov` source queries Data.gov, which covers US federal datasets only. EU Open Data Portal, UK government data, and other national registries are not included.
- **Actor execution timeouts.** Each underlying Actor call has a 180-second timeout. For very broad queries on large sources, some Actors may time out and return empty results. The server handles this gracefully by returning results from the sources that succeeded.
- **Cross-reference edge quality depends on query specificity.** The keyword overlap algorithm for building edges between datasets, papers, and repos works best with specific, distinctive dataset names. Generic queries like "text data" may produce low-signal cross-reference networks.
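
The license string matching described above can be sketched as follows. The license table and most scores are illustrative; only CC0 = 100, proprietary = 10, and the fallback score of 40 come from this document:

```python
# Illustrative license scorer: normalized string matching against a
# known-license table, with a generic fallback for custom terms.
KNOWN_LICENSE_SCORES = {
    "cc0": 100,          # fully open (public domain dedication)
    "mit": 90,
    "apache-2.0": 90,
    "cc-by-4.0": 80,
    "cc-by-nc-4.0": 40,  # non-commercial restriction
    "proprietary": 10,
}
FALLBACK_SCORE = 40      # non-standard or unrecognized license terms

def normalize(license_str):
    return license_str.strip().lower().replace(" ", "-")

def license_openness_score(license_str):
    if not license_str:
        return FALLBACK_SCORE
    return KNOWN_LICENSE_SCORES.get(normalize(license_str), FALLBACK_SCORE)

print(license_openness_score("CC0"))              # 100
print(license_openness_score("Some Custom EULA")) # 40 (fallback)
```

This is why non-standard license wording receives the generic 40 rather than an accurate assessment: anything that does not normalize to a known key falls through to the fallback.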

### Integrations

- [Zapier](https://apify.com/integrations/zapier) — Trigger weekly dataset quality reports for your ML team and post results to Slack or email
- [Make](https://apify.com/integrations/make) — Build automated compliance workflows that check governance scores before data procurement approvals
- [Google Sheets](https://apify.com/integrations/google-sheets) — Export quality scores and governance assessments to a tracking spreadsheet for your data catalog
- [Apify API](https://docs.apify.com/api/v2) — Integrate quality checks directly into ML training pipelines as a pre-training gate
- [Webhooks](https://docs.apify.com/platform/integrations/webhooks) — Alert your team when a dataset governance score drops below a defined threshold
- [LangChain / LlamaIndex](https://docs.apify.com/platform/integrations) — Feed structured quality reports into RAG pipelines or agent workflows for automated data selection
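
The pre-training gate pattern from the list above can be sketched as a pure decision function. The item fields `quality_score`, `governance_score`, and `compliance_status` are hypothetical names; inspect your actual run output for the real item shape:

```python
# Hypothetical pre-training gate over quality-report items. Field names
# are illustrative, not the server's documented output schema.
def passes_pretraining_gate(item, min_quality=60, min_governance=70):
    if item.get("compliance_status") == "non_compliant":
        return False  # hard block regardless of scores
    return (item.get("quality_score", 0) >= min_quality
            and item.get("governance_score", 0) >= min_governance)

report_items = [
    {"name": "dataset-a", "quality_score": 85, "governance_score": 75,
     "compliance_status": "compliant"},
    {"name": "dataset-b", "quality_score": 90, "governance_score": 40,
     "compliance_status": "unknown"},
]
approved = [i["name"] for i in report_items if passes_pretraining_gate(i)]
print(approved)  # ['dataset-a']
```

Wiring this function behind a webhook or an API call in your training pipeline turns a governance score into an automated procurement check.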

### Troubleshooting

- **Tool returns empty datasets array.** The query may be too specific or a source may have returned an error. Try broadening the query, reducing `max_per_source` to 10, or removing sources that are less relevant to your domain. Check that your Apify token has sufficient credits.

- **Bias indicators seem generic or irrelevant.** Bias detection is keyword-driven and sensitive to query wording. A query like "NLP dataset" returns many results including web-crawled corpora, which trigger sampling bias indicators. Use a more specific query that names a particular dataset or data type to get targeted results.

- **Governance scores are uniformly "unknown" compliance status.** This occurs when dataset items from the queried sources lack license metadata. Try adding `data_gov` to your sources — government datasets consistently carry explicit license information. Adding `github` also helps, as GitHub repos typically display their license in metadata.

- **`generate_data_quality_report` is slow.** This tool queries all 7 sources (or whichever you specify). Reduce `max_per_source` to 10–15 for faster results. For most domains, 10 results per source is sufficient for a representative quality picture.

- **Tool call returns spending limit error.** The per-run spending limit set in your Apify account has been reached. Increase the run budget in your Apify console, or split your analysis across multiple targeted tool calls using tools like `assess_dataset_quality` with fewer sources instead of `generate_data_quality_report`.

### Responsible use

- This server queries publicly available dataset metadata, academic papers, code repositories, community discussions, and government open data registries.
- Quality, bias, and governance scores are algorithmic assessments based on available metadata — not authoritative certifications. Do not rely solely on these scores for high-stakes compliance decisions without human review.
- The bias detection system identifies signals in documentation. The absence of a bias indicator does not mean a dataset is free of that bias type.
- Comply with the terms of service of each underlying data source when using retrieved information for commercial purposes.
- For guidance on data scraping legality, see [Apify's guide](https://blog.apify.com/is-web-scraping-legal/).

### ❓ FAQ

**How many datasets can the AI training data quality MCP server assess in one tool call?**
Each source returns up to 100 results (`max_per_source` maximum). With all 7 sources at the maximum, a single call can assess up to 700 data points. In practice, defaults produce 60–210 assessed items. The `generate_data_quality_report` tool defaults to 20 per source across 7 sources for 140 data points total.

**How does AI training data quality scoring work?**
Quality is a weighted composite of 5 dimensions: completeness (25%), documentation (25%), license openness (20%), recency (15%), and community engagement (15%). Each dimension is scored 0–100 based on metadata signals. Completeness checks field population. Documentation scores description length in tiers. License openness maps 20+ license types to explicit scores (CC0 = 100, proprietary = 10). Recency applies a date-decay curve. Community engagement applies logarithmic tiers to stars, forks, and citations.
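
The weighted composite can be sketched directly from the weights above; the per-dimension input scores below are made up for illustration:

```python
# Quality score as a weighted composite of the 5 dimensions (weights
# from the description above; inputs are illustrative 0-100 scores).
WEIGHTS = {
    "completeness": 0.25,
    "documentation": 0.25,
    "license_openness": 0.20,
    "recency": 0.15,
    "community_engagement": 0.15,
}

def quality_score(dimensions):
    return round(sum(w * dimensions.get(k, 0) for k, w in WEIGHTS.items()), 1)

example = {
    "completeness": 80,
    "documentation": 90,
    "license_openness": 100,  # e.g. CC0
    "recency": 60,
    "community_engagement": 40,
}
print(quality_score(example))  # 77.5
```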

**What types of bias can this server detect in AI training data?**
7 bias types: geographic (US/Western over-representation), demographic (gender, race, platform skew), temporal (outdated or deprecated data), linguistic (English-only), domain (narrow domain coverage), sampling (crowdsourcing, web-scraping methodology biases), and labeling (annotation quality and cultural dependency). Detection is keyword-based on dataset descriptions, paper abstracts, and community discussions — not content-level analysis.
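
A minimal sketch of the keyword-based detection follows. The keyword lists here are invented for illustration; the server's actual indicator vocabularies are not documented:

```python
# Keyword-heuristic bias screen over dataset metadata text. The keyword
# lists below are illustrative placeholders, not the server's real lists.
BIAS_KEYWORDS = {
    "geographic": ["us-centric", "western-focused", "united states only"],
    "linguistic": ["english only", "english-language"],
    "sampling": ["web-crawled", "crowdsourced", "scraped from"],
    "temporal": ["deprecated", "outdated", "no longer maintained"],
}

def detect_bias_indicators(description):
    text = description.lower()
    return [bias for bias, keywords in BIAS_KEYWORDS.items()
            if any(kw in text for kw in keywords)]

desc = "A web-crawled, English only corpus of US news articles."
print(detect_bias_indicators(desc))  # ['linguistic', 'sampling']
```

Note the limitation called out earlier: a description that never mentions its skew produces no indicators at all.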

**Does this server detect bias in the actual training data content?**
No. The server analyzes metadata, documentation, and descriptions about datasets, not the dataset files themselves. For pixel-level fairness analysis, distributional bias in text corpora, or statistical representation gaps, you need specialized ML evaluation tools that process the actual data.

**How is AI training data governance scoring different from quality scoring?**
Quality scoring evaluates whether a dataset is well-documented, recent, and trusted by the community. Governance scoring evaluates regulatory and compliance fitness: license compatibility with AI training, privacy and PII handling, documentation quality for auditability, access controls, and audit trail completeness. A dataset can score high on quality but low on governance (e.g., high-quality but CC-BY-NC licensed, which restricts commercial training).

**Is it legal to use this server's data for AI training decisions?**
The server queries publicly available data sources. Using quality assessments and metadata to inform dataset selection decisions is generally lawful. However, the datasets themselves each carry their own license terms — this server helps you identify those terms and flag restrictive licenses. Always verify dataset licensing independently before training. See [Apify's guide on web scraping legality](https://blog.apify.com/is-web-scraping-legal/).

**How accurate is the bias detection compared to academic bias auditing tools?**
Bias detection in this server is a fast metadata-level screen, not a substitute for rigorous bias auditing. It catches documented and self-described biases in dataset descriptions and associated literature. Studies like the Gender Shades audit or Datasheets for Datasets methodology require direct data access and human expertise. Use this server as a first-pass filter to prioritize which datasets need deeper human audit.

**How is this different from Hugging Face's dataset quality assessments?**
Hugging Face dataset cards provide self-reported quality information from dataset authors. This server cross-references that data with independent signals: academic paper citations (Semantic Scholar, ArXiv), community reception (Hacker News), code adoption (GitHub), and government data standards (Data.gov). The cross-source network analysis surfaces datasets that are trusted across multiple independent communities, not just well-documented by their creators.
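
A toy version of the cross-source keyword matching can illustrate how edges between items from different sources might be drawn. The tokenization and the 0.2 threshold here are assumptions, not the server's actual algorithm:

```python
# Jaccard-style keyword overlap between two item titles; an edge is
# drawn when the overlap clears a (hypothetical) threshold.
def keywords(title):
    return {w for w in title.lower().split() if len(w) > 3}

def overlap(a, b):
    ka, kb = keywords(a), keywords(b)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

paper = "ImageNet large scale visual recognition challenge"
repo = "imagenet visual recognition benchmark scripts"
print(overlap(paper, repo))          # 0.375 (3 shared / 8 total keywords)
print(overlap(paper, repo) >= 0.2)   # True: draw a cross-reference edge
```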

**Can I schedule recurring AI training data quality audits?**
Yes. Use Apify Schedules to run the server on a weekly or monthly cadence. You can also call the underlying Apify Actors directly via the API for integration into CI/CD pipelines or ML training workflows.

**How long does a `generate_data_quality_report` tool call take?**
With default settings (all 7 sources, 20 results each), expect 60–120 seconds. The 7 source actors run in parallel, so total time is approximately the slowest single actor's response time plus network overhead. Reducing `max_per_source` to 10 typically cuts runtime to 30–60 seconds.
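
The parallel fan-out timing can be illustrated with a small asyncio sketch; the source names and latencies are stand-ins for real Actor calls:

```python
import asyncio
import time

# Fan out to several "sources" concurrently: wall time tracks the
# slowest call, not the sum of all calls.
async def query_source(name, latency):
    await asyncio.sleep(latency)   # stand-in for an Actor call
    return name

async def run_report():
    sources = [("arxiv", 0.03), ("github", 0.01), ("data_gov", 0.02)]
    start = time.monotonic()
    results = await asyncio.gather(*(query_source(n, s) for n, s in sources))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(run_report())
print(results)   # ['arxiv', 'github', 'data_gov']
print(elapsed)   # roughly 0.03 (the slowest source), not the 0.06 sum
```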

**What model types does `assess_model_data_fit` support?**
11 model profiles: LLM (large language model), vision, image classification, object detection, speech recognition, translation, recommendation, reinforcement learning, multimodal, diffusion, and graph neural network. Each profile defines preferred data modalities, minimum scale expectations, and key feature requirements used to compute fit scores.
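
A sketch of how profile-based fit scoring could work follows. Both the profiles and the 60/40 scoring rule are invented for illustration; the server's 11 real profiles are not reproduced here:

```python
# Illustrative model-data fit: each profile lists preferred modalities
# and a minimum scale; the 60/40 point split is a made-up rule.
MODEL_PROFILES = {
    "llm": {"modalities": {"text"}, "min_examples": 1_000_000},
    "image_classification": {"modalities": {"image"}, "min_examples": 10_000},
    "multimodal": {"modalities": {"text", "image"}, "min_examples": 100_000},
}

def fit_score(model_type, modality, num_examples):
    profile = MODEL_PROFILES[model_type]
    score = 0
    if modality in profile["modalities"]:
        score += 60   # modality match dominates
    if num_examples >= profile["min_examples"]:
        score += 40   # meets the profile's scale expectation
    return score

print(fit_score("llm", "text", 5_000_000))   # 100: right modality, enough scale
print(fit_score("llm", "image", 5_000_000))  # 40: wrong modality
```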

**Can I use this MCP server with agents built on LangChain or LlamaIndex?**
Yes. Any framework that supports the Model Context Protocol can connect to this server. LangChain's MCP integration and LlamaIndex's tool calling both work with the `/mcp` endpoint. The structured JSON output is well-suited for agent reasoning steps that decide which datasets to use or exclude.

### Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

1. Go to [Account Settings > Privacy](https://console.apify.com/account/privacy)
2. Enable **Share runs with public Actor creators**

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is visible only to the Actor developer, not publicly.

### Support

Found a bug or have a feature request? Open an issue in the [Issues tab](https://console.apify.com/actors/ai-training-data-quality-mcp/issues) on this Actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.

# Actor input Schema

## Actor input object example

```json
{}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and the CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("ryanclinton/ai-training-data-quality-mcp").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("ryanclinton/ai-training-data-quality-mcp").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call ryanclinton/ai-training-data-quality-mcp --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ryanclinton/ai-training-data-quality-mcp",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Training Data Quality MCP Server",
        "description": "AI training data quality assessment, bias detection, and governance scoring for AI agents via the Model Context Protocol.",
        "version": "1.0",
        "x-build-id": "fFowCIQq3DL1bM1Qi"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ryanclinton~ai-training-data-quality-mcp/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ryanclinton-ai-training-data-quality-mcp",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ryanclinton~ai-training-data-quality-mcp/runs": {
            "post": {
                "operationId": "runs-sync-ryanclinton-ai-training-data-quality-mcp",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ryanclinton~ai-training-data-quality-mcp/run-sync": {
            "post": {
                "operationId": "run-sync-ryanclinton-ai-training-data-quality-mcp",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {}
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
