Pricing

from $10.00 / 1,000 pages processed

AI Training Data Scraper

AI Training Data Scraper converts websites into clean, semantically-chunked, vector-ready data for LLMs, RAG pipelines, and AI search. Built for documentation, tutorials, and code-heavy content, with smart chunking and rich metadata.

Rating

0.0

(0)

Developer

Blukaze Automations

Actor stats

Last modified

5 months ago

[1.0.0] - 2026-02-02

Added

Initial release of AI Training Data Scraper
4 chunking strategies: fixed_token, sentence_based, semantic, markdown_section
Dual crawler support: BeautifulSoup (fast) and Playwright (JS rendering)
Rich metadata extraction (15+ fields)
4 output formats: markdown, plain_text, json_structured, vector_ready
Vector database optimization for Pinecone, Qdrant, Weaviate, ChromaDB
LangChain and LlamaIndex integration support
Configurable content cleaning with CSS selectors
URL pattern inclusion/exclusion filters
Comprehensive error handling and logging
Full documentation with integration examples
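
As an illustration of the markdown_section strategy listed above, here is a minimal sketch of heading-based chunking using only the Python standard library. It is not the actor's implementation: the function name, the record fields (chunk_index, url, heading, heading_level, text), and the example URL are all assumptions made for this example.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example fence-safe


def chunk_by_markdown_section(markdown: str, url: str) -> List[Dict[str, object]]:
    """Split Markdown into one chunk per heading-delimited section.

    Hypothetical helper: field names are illustrative, not the actor's schema.
    """
    # Start with a heading-less section for any text before the first heading.
    sections = [{"heading": "", "level": 0, "lines": []}]
    in_fence = False

    for line in markdown.splitlines():
        # Track fenced code blocks so "# comment" lines inside code are not headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append(
                {"heading": match.group(2).strip(), "level": len(match.group(1)), "lines": []}
            )
        else:
            sections[-1]["lines"].append(line)

    chunks: List[Dict[str, object]] = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if not text and not section["heading"]:
            continue  # skip an empty preamble
        chunks.append(
            {
                "chunk_index": len(chunks),
                "url": url,
                "heading": section["heading"],
                "heading_level": section["level"],
                "text": text,
            }
        )
    return chunks
```

Each returned record keeps the section heading next to its text, which is one way a chunk could carry enough context to be embedded on its own.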

Technical Details

Built with Apify SDK 1.7+ and Crawlee 0.3+
Uses tiktoken for accurate GPT-4 token counting
Sentence transformers for semantic chunking
NLTK for natural language processing
Readability algorithm for main content extraction
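
To make the token-counting point concrete, the sketch below shows one way a fixed_token strategy with overlap could work. It is an assumption-laden illustration, not the actor's code: the function and parameter names (chunk_fixed_tokens, max_tokens, overlap) are hypothetical, and a whitespace tokenizer stands in for tiktoken so the example runs without extra dependencies. With tiktoken installed, the encode and decode methods of tiktoken.encoding_for_model("gpt-4") could be passed in instead.

```python
from typing import Callable, List, Sequence


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    """Inverse of whitespace_encode (up to whitespace normalization)."""
    return " ".join(tokens)


def chunk_fixed_tokens(
    text: str,
    max_tokens: int = 512,
    overlap: int = 50,
    encode: Callable[[str], list] = whitespace_encode,
    decode: Callable[[list], str] = whitespace_decode,
) -> List[str]:
    """Split text into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so that context spanning a
    boundary appears in both chunks. All names here are hypothetical.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = encode(text)
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Keeping the tokenizer pluggable means the chunk boundaries follow whatever token count the downstream model actually uses, rather than a word count.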

Future Roadmap

[1.1.0] - Planned

Built-in embedding generation (OpenAI, Cohere, Voyage)
Direct vector database push (Pinecone, Qdrant)
PDF and document file support
Custom extraction rules (XPath, JSON-LD)

[1.2.0] - Planned

Multi-language optimization
Table extraction and formatting
Image alt-text extraction
Sitemap-based crawling

AI Content Crawler

kai-agent/ai-content-crawler

Crawl any website and get clean, AI-ready content in markdown format. Perfect for RAG pipelines, LLM training data, and vector database ingestion. Features smart chunking, metadata extraction, and multiple output formats.

Kai Agent

AI Training Data Curator

ryanclinton/ai-training-data-curator

Crawl any website and extract clean, structured text data ready for LLM fine-tuning, RAG pipelines, and AI model training.

Ryan Clinton

RAG Spider - Web to Markdown Crawler for AI Training Data

lenient_grove/RAG-Spider

Enterprise-grade web crawler that converts messy websites into clean, chunked Markdown for AI systems. Uses Mozilla Readability for 95% cleaner extraction than competitors. Outputs RAG-ready data with metadata and token estimates. Perfect for building knowledge bases and training AI chatbots.

Tejas Rawool

5.0

AI Web Reader (RAG Ready)

viinaysonii/ai-web-reader-rag-ready

Convert any webpage into clean, structured, AI-ready Markdown. Removes ads, images, and UI noise, normalizes content, and outputs data optimized for LLMs, RAG pipelines, and AI agents. Fast, scalable, and built for real-world AI workflows.

Ai Content Scraper Cleaner

dashjeevanthedev/ai-content-scraper-cleaner

AI Content Scraper & Cleaner — Scrapes structured content (documentation, articles, FAQs, blog posts) and converts it into clean, normalized JSON datasets for LLM training. Extracts text, detects content types, estimates tokens, and removes boilerplate to produce ready-to-use training data.

JEEVAN JYOTI DASH

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Manas Mantri

RAG-Ready Documentation Scraper

alaricus/rag-docs-markdown-scraper

Scrape documentation to framework-optimized Markdown. Features semantic chunking for LLM, vector database, and RAG pipelines. Parse XML sitemaps easily.

Alaricus

AI Training Dataset Builder: Articles, Blogs & Web Pages

turboextract/ai-training-dataset-builder

Turn any list of URLs into clean, structured training data for AI models, RAG systems, and LLM fine-tuning. Built for ML engineers and AI teams.

Moses Ndambuki

YouTube Transcript Extractor — AI-Ready Subtitles

wsgcjj/youtube-transcript

Extracts subtitles/transcripts from YouTube videos. Input a video URL or ID, get clean text output with metadata. Ideal for AI training data collection, content analysis, and LLM training pipelines.

陈俊杰

AI Training Data Collector — Clean Web Datasets for LLMs

avinashchby/ai-training-data-collector

Crawl websites and extract structured, clean text datasets perfect for fine-tuning LLMs and RAG pipelines. Removes boilerplate, deduplicates, and scores content quality.

Avinash
