
AI Training Data Collector (MCP-Compatible)

Extract clean, LLM-ready content from websites for AI training, RAG pipelines, and vector databases. Features auto-categorization, smart content extraction, token counting, and seamless integration with Claude/GPT via Model Context Protocol.

🌟 Features

Intelligent Content Extraction

  • Smart content detection - Automatically identifies main content, removes navigation/ads
  • Multi-site crawling - Follow links across domains with configurable depth
  • Topic filtering - Only collect content matching your keywords
  • Minimum quality thresholds - Skip thin content automatically

AI-Ready Output Formats

  • Markdown files - Individual .md files per page + combined training file
  • JSON dataset - Structured data with metadata and token counts
  • Plain text - Simple text extraction
  • Download ready - All files available in Key-Value Store

Automatic Categorization

  • NLP-powered tagging - Automatically extract topics and keywords
  • Metadata extraction - Title, author, publication date, description
  • Token counting - Precise cost calculation using tiktoken (GPT-4 tokenizer)
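
To sanity-check the reported token counts locally, you can run the same tokenizer yourself. A minimal sketch, assuming the GPT-4 tokenizer (cl100k_base) matches what the actor uses:

import tiktoken

# Assumption: the actor counts tokens with the GPT-4 tokenizer (cl100k_base).
encoder = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(encoder.encode(text))

print(count_tokens("Machine learning is a subset of artificial intelligence..."))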

Vector Database Integration

  • Embedding-ready format - Pre-formatted for ChromaDB, Pinecone, Weaviate
  • Unique IDs - Auto-generated document identifiers
  • Metadata preservation - Source URLs, tags, titles

MCP Compatibility

  • Claude integration - Use as a tool in Claude Desktop
  • GPT integration - Compatible with GPT agent frameworks
  • Real-time data access - AI agents can autonomously collect fresh training data

🚀 Use Cases

1. RAG Pipeline Data Collection

Build knowledge bases for retrieval-augmented generation:

Input: Company documentation sites, technical blogs
Output: Clean markdown chunks with metadata
Use: Feed into LangChain, LlamaIndex, or custom RAG systems

2. LLM Fine-Tuning Datasets

Collect domain-specific training data:

Input: Industry-specific websites, research papers, forums
Output: High-quality text with auto-tagging
Use: Fine-tune GPT, Claude, or open-source models

3. AI Agent Knowledge Bases

Real-time data for AI agents:

Input: News sites, product pages, documentation
Output: MCP-compatible format
Use: Claude/GPT agents access fresh data on demand

4. Research & Analysis

Automated content aggregation:

Input: Multiple sources on a topic
Output: Categorized, tagged content
Use: Market research, competitive intelligence, trend analysis

📋 Input Configuration

Required Fields

Start URLs

  • URLs to begin crawling
  • Example: ["https://blog.example.com", "https://docs.example.com"]

Optional Configuration

| Parameter | Default | Description |
|---|---|---|
| crawlStrategy | same-domain | How to follow links (same-domain, same-hostname, all) |
| maxCrawlDepth | 3 | Maximum link depth from start URLs |
| maxPagesPerDomain | 100 | Page limit per domain (controls cost) |
| topicKeywords | [] | Filter by topics (empty = collect all) |
| contentSelectors | auto | CSS selectors for main content |
| excludeSelectors | nav, footer, etc. | Elements to remove |
| outputFormat | markdown | Output format (markdown / plain-text / json) |
| includeMetadata | true | Include title, author, dates, tags |
| autoTagging | true | NLP-powered keyword extraction |
| minContentLength | 100 words | Minimum quality threshold |
| tokenLimit | unlimited | Stop after N tokens (cost control) |
| embeddings | false | Generate embedding-ready format |
| mcpCompatible | true | Model Context Protocol format |
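
As a sketch of how these parameters fit together, here is one way to start a run and read the results with the Apify Python client. The actor ID below is a placeholder; use the ID from this actor's Store page:

from apify_client import ApifyClient

client = ApifyClient("your_apify_token_here")

# Placeholder actor ID -- replace with the real ID from the Store page.
run = client.actor("username/ai-training-data-collector").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "crawlStrategy": "same-domain",
    "maxCrawlDepth": 3,
    "maxPagesPerDomain": 100,
    "outputFormat": "markdown",
    "autoTagging": True,
})

# Iterate over the collected pages in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["tokenCount"])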

🎯 Example Inputs

Basic Documentation Scraping

{
  "startUrls": ["https://docs.python.org/3/"],
  "maxPagesPerDomain": 200,
  "topicKeywords": ["python", "programming", "tutorial"],
  "outputFormat": "markdown"
}

AI Training Data Collection

{
  "startUrls": [
    "https://machinelearningmastery.com",
    "https://towardsdatascience.com"
  ],
  "topicKeywords": ["machine learning", "deep learning", "neural networks"],
  "maxPagesPerDomain": 500,
  "autoTagging": true,
  "embeddings": true,
  "minContentLength": 300
}

RAG Pipeline for Customer Support

{
  "startUrls": ["https://support.yourcompany.com"],
  "crawlStrategy": "same-domain",
  "outputFormat": "markdown",
  "embeddings": true,
  "mcpCompatible": true,
  "contentSelectors": [".article-content", ".help-content"]
}

Cost-Controlled Crawl

{
  "startUrls": ["https://example.com"],
  "maxPagesPerDomain": 50,
  "tokenLimit": 100000,
  "minContentLength": 200
}

📊 Output Format

Standard Output

{
  "url": "https://example.com/article",
  "title": "Understanding Machine Learning",
  "content": "# Understanding Machine Learning\n\nMachine learning is...",
  "wordCount": 1543,
  "tokenCount": 2104,
  "metadata": {
    "description": "A comprehensive guide to ML",
    "author": "John Doe",
    "publishDate": "2024-01-15",
    "tags": ["machine learning", "ai", "data science"],
    "crawledAt": "2025-01-15T10:30:00.000Z"
  }
}

Embedding-Ready Format

{
  "embeddingReady": {
    "id": "aHR0cHM6Ly9leGFtcGxlLmNvbQ",
    "text": "Understanding Machine Learning\n\nMachine learning is...",
    "metadata": {
      "source": "https://example.com/article",
      "title": "Understanding Machine Learning",
      "tags": ["machine learning", "ai"]
    }
  }
}
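
The id above is the unpadded URL-safe Base64 encoding of the source URL. Assuming IDs are generated that way in general, you can reproduce them like this:

import base64

# Assumption inferred from the example above: IDs are unpadded Base64 of the URL.
def document_id(url: str) -> str:
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode().rstrip("=")

print(document_id("https://example.com"))  # aHR0cHM6Ly9leGFtcGxlLmNvbQ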

MCP Format

{
  "mcp": {
    "type": "document",
    "source": "https://example.com/article",
    "title": "Understanding Machine Learning",
    "content": "# Understanding Machine Learning...",
    "tokens": 2104
  }
}

📥 Markdown File Downloads

In addition to the JSON dataset, the actor automatically generates downloadable Markdown files:

Individual Page Files

Each collected page is saved as a separate .md file in the Key-Value Store:

  • Naming: page-0001.md, page-0002.md, etc.
  • Format: Clean markdown with metadata header
  • Use case: Cherry-pick specific pages for training

Example file structure:

# Understanding Machine Learning
**Source:** https://example.com/article
**Collected:** 2025-01-15T10:30:00.000Z
---
# Understanding Machine Learning
Machine learning is a subset of artificial intelligence...

Combined Training File

All collected content merged into a single TRAINING_DATA.md file:

  • Location: Key-Value Store → TRAINING_DATA.md
  • Format: All pages concatenated with clear separators
  • Use case: Bulk LLM training, RAG ingestion, fine-tuning datasets

Example combined file:

# AI Training Data Collection
**Collection Date:** 2025-01-15T10:30:00.000Z
**Total Pages:** 26
**Total Tokens:** 93,986
**Total Words:** 62,421
---
<!-- Page 1 of 26 -->
# Understanding Machine Learning
**Source:** https://example.com/article
**Words:** 1,543 | **Tokens:** 2,104
---
Machine learning is a subset of artificial intelligence...
================================================================================
<!-- Page 2 of 26 -->
# Deep Learning Basics
...

How to Download

  1. After the run completes → go to the Storage tab
  2. Click "Key-Value Store"
  3. Download options:
    • Individual files: Click any page-XXXX.md
    • Combined file: Click TRAINING_DATA.md
    • All files: Use "Download all" button
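
The files can also be fetched programmatically. A sketch with the Apify Python client, assuming you take the store ID from the run's details:

from apify_client import ApifyClient

client = ApifyClient("your_apify_token_here")
store = client.key_value_store("your_store_id")

# Save the combined training file locally.
record = store.get_record("TRAINING_DATA.md")
with open("TRAINING_DATA.md", "w", encoding="utf-8") as f:
    f.write(record["value"])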

💰 Pricing Model

Pay-per-token pricing: $0.70 per 100,000 tokens

Cost Examples

  • Small doc site (50 pages, 500K tokens): $3.50
  • Medium blog (200 pages, 2M tokens): $14.00
  • Large knowledge base (1000 pages, 10M tokens): $70.00

Cost Control Features

  • Set tokenLimit to cap spending
  • Use minContentLength to skip thin content
  • Filter with topicKeywords to collect only relevant pages
  • Monitor real-time token count in logs
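
One way to estimate spend before a large crawl is to total the tokenCount fields from a small test run. A sketch, assuming the per-token rate quoted above:

def estimate_cost(items) -> float:
    """Sum tokenCount across dataset items and convert to dollars."""
    rate_per_100k_tokens = 0.70  # assumption: the rate quoted above
    total_tokens = sum(item["tokenCount"] for item in items)
    return total_tokens / 100_000 * rate_per_100k_tokens

# e.g. a small doc site totalling 500,000 tokens:
print(estimate_cost([{"tokenCount": 500_000}]))  # -> 3.5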

🔧 MCP Integration

Setup for Claude Desktop

  1. Install the MCP server:

$ npm install -g @apify/mcp-server-apify

  2. Add to the Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "ai-data-collector": {
      "command": "npx",
      "args": ["-y", "@apify/mcp-server-apify", "ai-training-data-collector"],
      "env": {
        "APIFY_API_TOKEN": "your_apify_token_here"
      }
    }
  }
}
  3. Restart Claude Desktop.

  4. Use it in Claude:

"Collect training data about React.js from reactjs.org documentation"

Claude will automatically call this actor and retrieve the data!

🛠️ Advanced Features

Custom Content Selectors

For sites with specific structures:

{
  "contentSelectors": [
    "article.post-content",
    ".documentation-body",
    "#main-content"
  ],
  "excludeSelectors": [
    ".comments-section",
    ".related-posts",
    ".advertisement"
  ]
}

Multi-Domain Crawling

{
  "startUrls": [
    "https://blog.company.com",
    "https://docs.company.com",
    "https://support.company.com"
  ],
  "crawlStrategy": "all",
  "maxPagesPerDomain": 200
}

Topic-Focused Collection

{
  "startUrls": ["https://news.ycombinator.com"],
  "topicKeywords": [
    "artificial intelligence",
    "machine learning",
    "llm",
    "gpt",
    "claude"
  ],
  "maxPagesPerDomain": 1000
}

📈 Performance

  • Crawl speed: ~10 pages per second
  • Content cleaning: Automatic removal of boilerplate
  • Token accuracy: Uses official tiktoken encoder
  • Memory efficient: Streams data to dataset

🔒 Best Practices

1. Respect robots.txt

The crawler automatically respects robots.txt directives.

2. Set Reasonable Limits

{
  "maxPagesPerDomain": 500,
  "maxCrawlDepth": 3,
  "tokenLimit": 1000000
}

3. Use Topic Filtering

Reduce costs by collecting only relevant content:

{
  "topicKeywords": ["your", "specific", "topics"]
}

4. Monitor Token Usage

Check logs for real-time token counts and cost estimates.

๐Ÿ› Troubleshooting

No Content Extracted

  • Check contentSelectors - may need site-specific selectors
  • Verify site allows crawling (check robots.txt)
  • Try different crawlStrategy

Too Many Pages Skipped

  • Lower minContentLength threshold
  • Broaden topicKeywords or remove filtering
  • Check excludeSelectors aren't removing main content

High Costs

  • Set tokenLimit to cap spending
  • Reduce maxPagesPerDomain
  • Use topicKeywords for targeted collection
  • Increase minContentLength to skip thin pages

LangChain Integration

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id="your_dataset_id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata=item["metadata"],
    ),
)
docs = loader.load()

LlamaIndex Integration

from llama_index import download_loader
ApifyLoader = download_loader("ApifyDataset")
loader = ApifyLoader("your_dataset_id")
documents = loader.load_data()
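
From there the documents can go straight into an index. A sketch, assuming a llama_index version that exposes VectorStoreIndex:

from llama_index import VectorStoreIndex

# Build a queryable index over the collected documents.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What is machine learning?"))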

ChromaDB Integration

import chromadb
from apify_client import ApifyClient

apify_client = ApifyClient("your_apify_token_here")

# Set up a local ChromaDB collection.
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("training-data")

# Load the dataset from Apify.
dataset = apify_client.dataset("your_dataset_id").list_items().items

# Add embedding-ready records to ChromaDB.
for item in dataset:
    if "embeddingReady" in item:
        collection.add(
            documents=[item["embeddingReady"]["text"]],
            metadatas=[item["embeddingReady"]["metadata"]],
            ids=[item["embeddingReady"]["id"]],
        )
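
A quick similarity query then verifies the collection works end to end:

# Retrieve the three most similar documents for a test query.
results = collection.query(query_texts=["what is machine learning"], n_results=3)
print(results["ids"])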

🎓 Examples & Tutorials

Coming soon:

  • Building a RAG chatbot with collected data
  • Fine-tuning GPT on custom datasets
  • Creating domain-specific knowledge bases
  • MCP integration patterns

📄 License

Apache-2.0


Collect high-quality training data. Build better AI systems.