AI Training Data Scraper

AI Training Data Scraper converts websites into clean, semantically chunked, vector-ready data for LLMs, RAG pipelines, and AI search. Built for documentation, tutorials, and code-heavy content, with smart chunking and rich metadata.

Pricing

from $10.00 / 1,000 pages processed

Developer

Blukaze Automations (Maintained by Community)

Apify Actor · Python 3.11 · License: MIT

Extract clean, semantically chunked content from websites, optimized for LLMs, RAG pipelines, and vector databases.

Transform raw web content into AI-ready training data with intelligent chunking, rich metadata extraction, and vector database optimization.


✨ Key Features

  • 🎯 4 Smart Chunking Strategies: Fixed token, sentence-based, semantic, and markdown section
  • 🧹 Intelligent Content Cleaning: Removes navigation, ads, and boilerplate automatically
  • 📊 Rich Metadata Extraction: Author, dates, keywords, language detection, content type
  • 🔗 Deep Recursive Crawling: Crawl entire documentation sites with configurable depth
  • ⚡ Vector Database Ready: Output formatted for Pinecone, Qdrant, Weaviate, ChromaDB
  • 🦜 LangChain & LlamaIndex Compatible: Direct integration with popular AI frameworks
  • 🎨 Multiple Output Formats: Markdown, plain text, JSON structured, vector-ready
  • 🚀 Dual Crawler Support: Fast HTTP for static sites, Playwright for JS-heavy sites
  • 🔒 Respectful Crawling: Respects robots.txt, configurable rate limiting

📋 Use Cases

| Use Case | Description |
|----------|-------------|
| RAG Applications | Build accurate retrieval-augmented generation systems with clean documentation |
| AI Chatbots | Train domain-specific chatbots on your knowledge base |
| Code Assistants | Extract technical documentation for programming assistants |
| LLM Fine-tuning | Collect high-quality training data for domain-specific models |
| Semantic Search | Populate vector databases for intelligent search systems |
| Knowledge Management | Structure and organize documentation for AI consumption |

🚀 Quick Start

1. Basic Usage

{
  "startUrls": [{"url": "https://docs.python.org/3/"}],
  "crawlerType": "cheerio",
  "maxCrawlPages": 100,
  "chunkingStrategy": "semantic",
  "outputFormat": "vector_ready"
}

2. Advanced Configuration

{
  "startUrls": [
    {"url": "https://docs.example.com/"},
    {"url": "https://api.example.com/docs"}
  ],
  "crawlerType": "playwright",
  "maxCrawlPages": 500,
  "maxCrawlDepth": 10,
  "chunkingStrategy": "semantic",
  "chunkSize": 512,
  "chunkOverlap": 100,
  "outputFormat": "vector_ready",
  "removeElements": ["nav", "header", "footer", ".sidebar", ".ads"],
  "includeMetadata": true,
  "proxyConfiguration": {"useApifyProxy": true}
}
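
To run the Actor with an input like the one above directly from Python, you can use the Apify API client. A minimal sketch (the Actor ID shown is a placeholder; use the ID from your Apify Console):

from apify_client import ApifyClient

# Placeholder token and Actor ID - replace with your own values
client = ApifyClient("your_api_token")

run_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],
    "crawlerType": "cheerio",
    "maxCrawlPages": 100,
    "chunkingStrategy": "semantic",
    "outputFormat": "vector_ready",
}

# Start the Actor run and wait for it to finish
run = client.actor("blukaze/ai-training-data-scraper").call(run_input=run_input)

# Iterate the resulting dataset items (one item per crawled page)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], len(item.get("chunks", [])))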

📦 Chunking Strategies

Choose the right chunking strategy for your use case:

1. Fixed Token (fixed_token)

  • Best for: Consistent chunk sizes for embedding models with token limits
  • How it works: Splits content into fixed-size token chunks (default: 512 tokens)
  • Use when: You need precise control over chunk sizes for OpenAI/Anthropic embeddings

2. Sentence-Based (sentence_based)

  • Best for: Preserving natural language boundaries
  • How it works: Groups complete sentences until reaching target size
  • Use when: You want readable chunks that never cut mid-sentence

3. Semantic (semantic)

  • Best for: Optimal RAG performance
  • How it works: Uses NLP to detect topic boundaries and group related content
  • Use when: Building RAG systems where context preservation is critical

4. Markdown Section (markdown_section)

  • Best for: Documentation and structured content
  • How it works: Splits by heading hierarchy (## Section, ### Subsection)
  • Use when: Scraping markdown-based documentation or wikis
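
To make the token math concrete, here is a standalone sketch of fixed-token chunking with overlap (illustrative only, not the Actor's internal code). It uses tiktoken, which the comparison table below notes the Actor relies on for token-aware chunking:

import tiktoken

def fixed_token_chunks(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    # Split text into chunks of at most chunk_size tokens, overlapping by `overlap` tokens
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 512-token chunks with 100-token overlap (~20%), matching the defaults above
chunks = fixed_token_chunks("Some long documentation text... " * 500)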

📊 Output Schema

Each extracted page produces structured output:

{
  "url": "https://docs.example.com/guide",
  "title": "Getting Started Guide",
  "content_format": "vector_ready",
  "full_content": "Complete page text...",
  "chunks": [
    {
      "id": "a1b2c3d4_chunk_0",
      "text": "Introduction to the framework...",
      "metadata": {
        "source_url": "https://docs.example.com/guide",
        "page_title": "Getting Started Guide",
        "chunk_index": 0,
        "token_count": 487,
        "has_code": true,
        "section_title": "Introduction",
        "language": "en",
        "content_type": "documentation"
      }
    }
  ],
  "metadata": {
    "author": "John Doe",
    "published_date": "2025-01-15T00:00:00Z",
    "language": "en",
    "keywords": ["python", "tutorial", "getting-started"],
    "word_count": 2500,
    "estimated_reading_time": 10,
    "content_type": "documentation",
    "has_code_blocks": true
  },
  "embedding_info": {
    "chunk_count": 8,
    "total_tokens": 3200,
    "ready_for_embedding": true,
    "recommended_model": "text-embedding-3-small"
  },
  "crawl_info": {
    "crawled_at": "2026-02-02T12:00:00Z",
    "crawl_depth": 2
  }
}
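
Because every chunk carries its own metadata, you can filter the dataset before embedding. A minimal sketch using only fields from the schema above (here `dataset` is assumed to be the list of items fetched with the Apify client, as in the integration examples below):

# Keep only documentation chunks that contain code and fit within typical embedding limits
code_chunks = [
    chunk
    for item in dataset
    for chunk in item.get("chunks", [])
    if chunk["metadata"].get("content_type") == "documentation"
    and chunk["metadata"].get("has_code")
    and chunk["metadata"].get("token_count", 0) <= 8192
]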

🔗 Integration Examples

LangChain

from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

def transform_dataset_item(item):
    # Turn each pre-chunked page into a list of LangChain Documents
    documents = []
    for chunk in item.get("chunks", []):
        documents.append(Document(
            page_content=chunk["text"],
            metadata=chunk["metadata"]
        ))
    return documents

loader = ApifyDatasetLoader(
    dataset_id="your_dataset_id",
    dataset_mapping_function=transform_dataset_item
)

# The mapping function returns a list per page, so flatten the loader output into one list of chunk documents
documents = [doc for page_docs in loader.load() for doc in page_docs]

LlamaIndex

from llama_index import Document
from apify_client import ApifyClient

client = ApifyClient("your_api_token")
dataset = client.dataset("your_dataset_id").list_items().items

# One LlamaIndex Document per pre-made chunk, carrying the chunk metadata
documents = []
for item in dataset:
    for chunk in item.get("chunks", []):
        documents.append(Document(
            text=chunk["text"],
            metadata=chunk["metadata"]
        ))

Pinecone

import pinecone
from openai import OpenAI

# Initialize (legacy pinecone-client < 3.0 style; see the note after this snippet for the current client)
pinecone.init(api_key="your-api-key")
index = pinecone.Index("your-index")
openai = OpenAI()

# `dataset` holds the Apify dataset items (see the LlamaIndex example above)
# Upsert chunks
for item in dataset:
    for chunk in item["chunks"]:
        # Generate embedding
        response = openai.embeddings.create(
            input=chunk["text"],
            model="text-embedding-3-small"
        )
        embedding = response.data[0].embedding
        # Upsert to Pinecone
        index.upsert([(
            chunk["id"],
            embedding,
            chunk["metadata"]
        )])
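
If you are on the current pinecone package (v3 and later), pinecone.init no longer exists. A minimal equivalent of the upsert above, assuming embedding and chunk come from the same loop:

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("your-index")

# Same upsert as above, expressed with the v3+ client
index.upsert(vectors=[{
    "id": chunk["id"],
    "values": embedding,
    "metadata": chunk["metadata"],
}])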

Qdrant

import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient("localhost", port=6333)

# `dataset` holds the Apify dataset items; `embed_model` is any embedding model
# with an .encode() method (e.g. a sentence-transformers model)
points = []
for item in dataset:
    for chunk in item["chunks"]:
        embedding = embed_model.encode(chunk["text"])
        points.append(PointStruct(
            # Qdrant point IDs must be unsigned integers or UUIDs, so derive a UUID from the chunk ID
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk["id"])),
            vector=embedding.tolist(),
            payload=chunk["metadata"]
        ))

client.upsert(collection_name="docs", points=points)

โš™๏ธ Configuration Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| startUrls | array | required | URLs to begin crawling |
| crawlerType | string | "cheerio" | "cheerio" (fast) or "playwright" (JS support) |
| maxCrawlPages | integer | 100 | Maximum pages to crawl |
| maxCrawlDepth | integer | 20 | Maximum crawl depth from start URLs |
| chunkingStrategy | string | "semantic" | Chunking algorithm to use |
| chunkSize | integer | 512 | Target chunk size (tokens/words) |
| chunkOverlap | integer | 100 | Overlap between chunks |
| outputFormat | string | "vector_ready" | Output format |
| removeElements | array | see below | CSS selectors to remove |
| includeMetadata | boolean | true | Extract page metadata |
| extractLinks | boolean | false | Include discovered hyperlinks in the output dataset (does not affect crawling) |
| excludeUrlPatterns | array | see below | URL patterns to skip |

Default Remove Elements

["nav", "header", "footer", ".advertisement", "#cookie-banner", ".sidebar"]

Default Exclude Patterns

["**/login**", "**/signup**", "**/register**", "**/cart**", "**/checkout**"]

💡 Pro Tips

For Best RAG Performance

  • Use semantic chunking for intelligent topic grouping
  • Set chunk overlap to 15-20% of chunk size (e.g., 100 for 512-token chunks)
  • Enable metadata extraction for better filtering during retrieval
  • Target 400-600 tokens per chunk for most embedding models

For Large Documentation Sites

  • Start with maxCrawlPages = 50 to test configuration
  • Use Cheerio crawler (10x faster) unless site requires JavaScript
  • Set exclude patterns for login, user profiles, and dynamic pages
  • Enable Apify Proxy to avoid rate limiting

For Code-Heavy Content

  • Markdown section chunking preserves code block structure
  • Extracted metadata includes code languages detected
  • Code blocks are never split mid-block
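
As an illustration of the general idea behind code-aware splitting (a standalone sketch, not the Actor's internal implementation): treat fenced code blocks as indivisible units and only split the prose between them.

import re

def split_keeping_code_blocks(markdown: str) -> list[str]:
    # Capture ``` fenced blocks as single segments; everything between them is prose
    pattern = re.compile(r"(```.*?```)", re.DOTALL)
    segments = []
    for part in pattern.split(markdown):
        part = part.strip()
        if not part:
            continue
        if part.startswith("```"):
            # A whole code block - never split further
            segments.append(part)
        else:
            # Prose: split on blank lines into paragraphs
            segments.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return segments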

🔧 Troubleshooting

Getting blocked by website

Solution: Enable Apify Proxy in configuration. For aggressive blocking, use residential proxies.

Missing content on JavaScript sites / Only 1 page crawled

Solution: Switch to "crawlerType": "playwright" for full JavaScript rendering. If your logs say "1 page crawled" and you are using Cheerio on a React/Vue/SPA app, the crawler is seeing an empty shell.

Chunks too large for embeddings

Solution: Reduce chunkSize to 384-512 tokens. Most models have 8192-token limits.

Empty chunks generated

Solution: Check removeElements - you may be removing content containers. Reduce selectors.

Slow crawling speed

Solution: Increase maxConcurrency (carefully) or use Cheerio crawler for static sites.


📈 Performance

| Metric | Cheerio Crawler | Playwright Crawler |
|--------|-----------------|--------------------|
| Speed | ~10 pages/sec | ~1 page/sec |
| JavaScript Support | ❌ No | ✅ Yes |
| Memory Usage | Low | Medium |
| Best For | Documentation, Blogs | SPAs, Dynamic Sites |

๐Ÿ† Why This Actor?

Compared to generic web scrapers, AI Training Data Scraper offers:

| Feature | Generic Scrapers | This Actor |
|---------|------------------|------------|
| Token-Aware Chunking | ❌ | ✅ Uses tiktoken |
| Semantic Chunking | ❌ | ✅ NLP-based |
| Vector DB Ready | ❌ | ✅ Pre-formatted |
| Code Block Handling | ❌ | ✅ Never splits |
| Metadata Extraction | Basic | 15+ fields |
| RAG Optimization | ❌ | ✅ Purpose-built |

📞 Support

  • Issues: Report bugs via Apify Console
  • Feature Requests: Submit through Apify feedback
  • Documentation: Full API Reference

📄 License

MIT License - See LICENSE file for details.


Built for developers, optimized for AI. ⚡

Transform the web into training data.

Created by Blukaze Automations