
AI Training Data Collector (MCP-Compatible)

Extract clean, LLM-ready content from websites for AI training, RAG pipelines, and vector databases. Features auto-categorization, smart content extraction, token counting, and seamless integration with Claude/GPT via Model Context Protocol.

🌟 Features

Intelligent Content Extraction

  • Smart content detection - Automatically identifies main content, removes navigation/ads
  • Multi-site crawling - Follow links across domains with configurable depth
  • Topic filtering - Only collect content matching your keywords
  • Minimum quality thresholds - Skip thin content automatically

AI-Ready Output Formats

  • Markdown files - Individual .md files per page + combined training file
  • JSON dataset - Structured data with metadata and token counts
  • Plain text - Simple text extraction
  • Download ready - All files available in Key-Value Store

Automatic Categorization

  • NLP-powered tagging - Automatically extract topics and keywords
  • Metadata extraction - Title, author, publication date, description
  • Token counting - Precise cost calculation using tiktoken (GPT-4 tokenizer)
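
To sanity-check the reported token counts locally, you can run the same tokenizer yourself. A minimal sketch, assuming the GPT-4 tokenizer (cl100k_base) matches what the actor uses:

import tiktoken

# Assumption: the actor counts tokens with the GPT-4 tokenizer (cl100k_base).
encoder = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(encoder.encode(text))

print(count_tokens("Machine learning is a subset of artificial intelligence..."))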

Vector Database Integration

  • Embedding-ready format - Pre-formatted for ChromaDB, Pinecone, Weaviate
  • Unique IDs - Auto-generated document identifiers
  • Metadata preservation - Source URLs, tags, titles

MCP Compatibility

  • Claude integration - Use as a tool in Claude Desktop
  • GPT integration - Compatible with GPT agent frameworks
  • Real-time data access - AI agents can autonomously collect fresh training data

🚀 Use Cases

1. RAG Pipeline Data Collection

Build knowledge bases for retrieval-augmented generation:

Input: Company documentation sites, technical blogs
Output: Clean markdown chunks with metadata
Use: Feed into LangChain, LlamaIndex, or custom RAG systems

2. LLM Fine-Tuning Datasets

Collect domain-specific training data:

Input: Industry-specific websites, research papers, forums
Output: High-quality text with auto-tagging
Use: Fine-tune GPT, Claude, or open-source models

3. AI Agent Knowledge Bases

Real-time data for AI agents:

Input: News sites, product pages, documentation
Output: MCP-compatible format
Use: Claude/GPT agents access fresh data on demand

4. Research & Analysis

Automated content aggregation:

Input: Multiple sources on a topic
Output: Categorized, tagged content
Use: Market research, competitive intelligence, trend analysis

📋 Input Configuration

Required Fields

Start URLs

  • URLs to begin crawling
  • Example: ["https://blog.example.com", "https://docs.example.com"]

Optional Configuration

| Parameter | Default | Description |
|---|---|---|
| crawlStrategy | same-domain | How to follow links (same-domain, same-hostname, all) |
| maxCrawlDepth | 3 | Maximum link depth from start URLs |
| maxPagesPerDomain | 100 | Page limit per domain (controls cost) |
| topicKeywords | [] | Filter by topics (empty = collect all) |
| contentSelectors | auto | CSS selectors for main content |
| excludeSelectors | nav, footer, etc. | Elements to remove |
| outputFormat | markdown | Output format (markdown / plain-text / json) |
| includeMetadata | true | Include title, author, dates, tags |
| autoTagging | true | NLP-powered keyword extraction |
| minContentLength | 100 words | Minimum quality threshold |
| tokenLimit | unlimited | Stop after N tokens (cost control) |
| embeddings | false | Generate embedding-ready format |
| mcpCompatible | true | Model Context Protocol format |
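
As a sketch of how these parameters fit together, here is one way to start a run and read the results with the Apify Python client. The actor ID below is a placeholder; use the ID from this actor's Store page:

from apify_client import ApifyClient

client = ApifyClient("your_apify_token_here")

# Placeholder actor ID -- replace with the real ID from the Store page.
run = client.actor("username/ai-training-data-collector").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "crawlStrategy": "same-domain",
    "maxCrawlDepth": 3,
    "maxPagesPerDomain": 100,
    "outputFormat": "markdown",
    "autoTagging": True,
})

# Iterate over the collected pages in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["tokenCount"])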

🎯 Example Inputs

Basic Documentation Scraping

{
  "startUrls": ["https://docs.python.org/3/"],
  "maxPagesPerDomain": 200,
  "topicKeywords": ["python", "programming", "tutorial"],
  "outputFormat": "markdown"
}

AI Training Data Collection

{
  "startUrls": [
    "https://machinelearningmastery.com",
    "https://towardsdatascience.com"
  ],
  "topicKeywords": ["machine learning", "deep learning", "neural networks"],
  "maxPagesPerDomain": 500,
  "autoTagging": true,
  "embeddings": true,
  "minContentLength": 300
}

RAG Pipeline for Customer Support

{
  "startUrls": ["https://support.yourcompany.com"],
  "crawlStrategy": "same-domain",
  "outputFormat": "markdown",
  "embeddings": true,
  "mcpCompatible": true,
  "contentSelectors": [".article-content", ".help-content"]
}

Cost-Controlled Crawl

{
  "startUrls": ["https://example.com"],
  "maxPagesPerDomain": 50,
  "tokenLimit": 100000,
  "minContentLength": 200
}

📊 Output Format

Standard Output

{
  "url": "https://example.com/article",
  "title": "Understanding Machine Learning",
  "content": "# Understanding Machine Learning\n\nMachine learning is...",
  "wordCount": 1543,
  "tokenCount": 2104,
  "metadata": {
    "description": "A comprehensive guide to ML",
    "author": "John Doe",
    "publishDate": "2024-01-15",
    "tags": ["machine learning", "ai", "data science"],
    "crawledAt": "2025-01-15T10:30:00.000Z"
  }
}

Embedding-Ready Format

{
  "embeddingReady": {
    "id": "aHR0cHM6Ly9leGFtcGxlLmNvbQ",
    "text": "Understanding Machine Learning\n\nMachine learning is...",
    "metadata": {
      "source": "https://example.com/article",
      "title": "Understanding Machine Learning",
      "tags": ["machine learning", "ai"]
    }
  }
}
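
The id above is the unpadded URL-safe Base64 encoding of the source URL. Assuming IDs are generated that way in general, you can reproduce them like this:

import base64

# Assumption inferred from the example above: IDs are unpadded Base64 of the URL.
def document_id(url: str) -> str:
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode().rstrip("=")

print(document_id("https://example.com"))  # aHR0cHM6Ly9leGFtcGxlLmNvbQ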

MCP Format

{
  "mcp": {
    "type": "document",
    "source": "https://example.com/article",
    "title": "Understanding Machine Learning",
    "content": "# Understanding Machine Learning...",
    "tokens": 2104
  }
}

📥 Markdown File Downloads

In addition to the JSON dataset, the actor automatically generates downloadable Markdown files:

Individual Page Files

Each collected page is saved as a separate .md file in the Key-Value Store:

  • Naming: page-0001.md, page-0002.md, etc.
  • Format: Clean markdown with metadata header
  • Use case: Cherry-pick specific pages for training

Example file structure:

# Understanding Machine Learning
**Source:** https://example.com/article
**Collected:** 2025-01-15T10:30:00.000Z
---
# Understanding Machine Learning
Machine learning is a subset of artificial intelligence...

Combined Training File

All collected content merged into a single TRAINING_DATA.md file:

  • Location: Key-Value Store → TRAINING_DATA.md
  • Format: All pages concatenated with clear separators
  • Use case: Bulk LLM training, RAG ingestion, fine-tuning datasets

Example combined file:

# AI Training Data Collection
**Collection Date:** 2025-01-15T10:30:00.000Z
**Total Pages:** 26
**Total Tokens:** 93,986
**Total Words:** 62,421
---
<!-- Page 1 of 26 -->
# Understanding Machine Learning
**Source:** https://example.com/article
**Words:** 1,543 | **Tokens:** 2,104
---
Machine learning is a subset of artificial intelligence...
================================================================================
<!-- Page 2 of 26 -->
# Deep Learning Basics
...

How to Download

  1. After the run completes → go to the Storage tab
  2. Click "Key-Value Store"
  3. Download options:
    • Individual files: Click any page-XXXX.md
    • Combined file: Click TRAINING_DATA.md
    • All files: Use "Download all" button
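
The files can also be fetched programmatically. A sketch with the Apify Python client, assuming you take the store ID from the run's details:

from apify_client import ApifyClient

client = ApifyClient("your_apify_token_here")
store = client.key_value_store("your_store_id")

# Save the combined training file locally.
record = store.get_record("TRAINING_DATA.md")
with open("TRAINING_DATA.md", "w", encoding="utf-8") as f:
    f.write(record["value"])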

💰 Pricing Model

Pay-per-token pricing: $0.70 per 100,000 tokens

Cost Examples

  • Small doc site (50 pages, 500K tokens): $3.50
  • Medium blog (200 pages, 2M tokens): $14.00
  • Large knowledge base (1000 pages, 10M tokens): $70.00

Cost Control Features

  • Set tokenLimit to cap spending
  • Use minContentLength to skip thin content
  • Filter with topicKeywords to collect only relevant pages
  • Monitor real-time token count in logs
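
One way to estimate spend before a large crawl is to total the tokenCount fields from a small test run. A sketch, assuming the per-token rate quoted above:

def estimate_cost(items) -> float:
    """Sum tokenCount across dataset items and convert to dollars."""
    rate_per_100k_tokens = 0.70  # assumption: the rate quoted above
    total_tokens = sum(item["tokenCount"] for item in items)
    return total_tokens / 100_000 * rate_per_100k_tokens

# e.g. a small doc site totalling 500,000 tokens:
print(estimate_cost([{"tokenCount": 500_000}]))  # -> 3.5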

🔧 MCP Integration

Setup for Claude Desktop

  1. Install the MCP server:

$ npm install -g @apify/mcp-server-apify

  2. Add to the Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
  "mcpServers": {
    "ai-data-collector": {
      "command": "npx",
      "args": ["-y", "@apify/mcp-server-apify", "ai-training-data-collector"],
      "env": {
        "APIFY_API_TOKEN": "your_apify_token_here"
      }
    }
  }
}
  3. Restart Claude Desktop.

  4. Use it in Claude:

"Collect training data about React.js from reactjs.org documentation"

Claude will automatically call this actor and retrieve the data!

🛠️ Advanced Features

Custom Content Selectors

For sites with specific structures:

{
  "contentSelectors": [
    "article.post-content",
    ".documentation-body",
    "#main-content"
  ],
  "excludeSelectors": [
    ".comments-section",
    ".related-posts",
    ".advertisement"
  ]
}

Multi-Domain Crawling

{
  "startUrls": [
    "https://blog.company.com",
    "https://docs.company.com",
    "https://support.company.com"
  ],
  "crawlStrategy": "all",
  "maxPagesPerDomain": 200
}

Topic-Focused Collection

{
  "startUrls": ["https://news.ycombinator.com"],
  "topicKeywords": [
    "artificial intelligence",
    "machine learning",
    "llm",
    "gpt",
    "claude"
  ],
  "maxPagesPerDomain": 1000
}

📈 Performance

  • Crawl speed: ~10 pages per second
  • Content cleaning: Automatic removal of boilerplate
  • Token accuracy: Uses official tiktoken encoder
  • Memory efficient: Streams data to dataset

🔒 Best Practices

1. Respect robots.txt

The crawler automatically respects robots.txt directives.

2. Set Reasonable Limits

{
  "maxPagesPerDomain": 500,
  "maxCrawlDepth": 3,
  "tokenLimit": 1000000
}

3. Use Topic Filtering

Reduce costs by collecting only relevant content:

{
  "topicKeywords": ["your", "specific", "topics"]
}

4. Monitor Token Usage

Check logs for real-time token counts and cost estimates.

๐Ÿ› Troubleshooting

No Content Extracted

  • Check contentSelectors - may need site-specific selectors
  • Verify site allows crawling (check robots.txt)
  • Try different crawlStrategy

Too Many Pages Skipped

  • Lower minContentLength threshold
  • Broaden topicKeywords or remove filtering
  • Check excludeSelectors aren't removing main content

High Costs

  • Set tokenLimit to cap spending
  • Reduce maxPagesPerDomain
  • Use topicKeywords for targeted collection
  • Increase minContentLength to skip thin pages

LangChain Integration

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

loader = ApifyDatasetLoader(
    dataset_id="your_dataset_id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata=item["metadata"],
    ),
)
docs = loader.load()

LlamaIndex Integration

from llama_index import download_loader
ApifyLoader = download_loader("ApifyDataset")
loader = ApifyLoader("your_dataset_id")
documents = loader.load_data()
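
From there the documents can go straight into an index. A sketch, assuming a llama_index version that exposes VectorStoreIndex:

from llama_index import VectorStoreIndex

# Build a queryable index over the collected documents.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What is machine learning?"))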

ChromaDB Integration

import chromadb
from apify_client import ApifyClient

apify_client = ApifyClient("your_apify_token_here")

# Set up a local ChromaDB collection.
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("training-data")

# Load the dataset from Apify.
dataset = apify_client.dataset("your_dataset_id").list_items().items

# Add embedding-ready records to ChromaDB.
for item in dataset:
    if "embeddingReady" in item:
        collection.add(
            documents=[item["embeddingReady"]["text"]],
            metadatas=[item["embeddingReady"]["metadata"]],
            ids=[item["embeddingReady"]["id"]],
        )
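
A quick similarity query then verifies the collection works end to end:

# Retrieve the three most similar documents for a test query.
results = collection.query(query_texts=["what is machine learning"], n_results=3)
print(results["ids"])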

🎓 Examples & Tutorials

Coming soon:

  • Building a RAG chatbot with collected data
  • Fine-tuning GPT on custom datasets
  • Creating domain-specific knowledge bases
  • MCP integration patterns

📄 License

Apache-2.0


Collect high-quality training data. Build better AI systems.