Pricing

Pay per event

AI Training Data Collector — Clean Web Datasets for LLMs

Crawl websites and extract structured, clean text datasets perfect for fine-tuning LLMs and RAG pipelines. Removes boilerplate, deduplicates, and scores content quality.

Developer

Avinash

Last modified

9 days ago

Features

Smart content extraction: Removes navigation, ads, footers, and boilerplate
Multi-format output: Markdown, plain text, or JSON-Lines
Quality scoring: Each page scored 0-100 for training suitability
Deduplication: Content hash-based deduplication across pages
Configurable depth: 0-3 levels of crawl depth from start URLs
Pattern exclusion: Skip unwanted URL patterns (tags, categories, etc.)
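The deduplication feature above can be illustrated with a minimal sketch. The actor's real implementation is not documented here, so the following is only one plausible approach: it assumes pages are dicts with `url` and `text` keys, that whitespace and letter case are normalized before hashing, and that SHA-256 is the hash. All of these choices, and the function names, are hypothetical.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash text after collapsing whitespace and lowercasing it.

    The normalization step is an assumption; it keeps trivial
    formatting differences from defeating deduplication.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[dict] = []
    for page in pages:
        digest = content_hash(page["text"])
        if digest in seen:
            continue  # same content already collected from another URL
        seen.add(digest)
        unique.append(page)
    return unique


# Illustrative pages (made-up URLs and text).
pages = [
    {"url": "https://example.com/a", "text": "Hello   world"},
    {"url": "https://example.com/b", "text": "hello world\n"},
    {"url": "https://example.com/c", "text": "Something else"},
]
unique_pages = deduplicate(pages)
print([p["url"] for p in unique_pages])
# prints ['https://example.com/a', 'https://example.com/c']
```

Exact hashing only catches pages whose normalized text is identical; near-duplicates with small wording differences would need a fuzzy technique instead.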

Input Parameters

Parameter	Type	Default	Description
`urls`	array	Wikipedia AI page	Start URLs to crawl
`crawlDepth`	integer	`1`	Link follow depth (0-3)
`maxPages`	integer	`5`	Maximum number of pages to process
`outputFormat`	string	`markdown`	Output content format (Markdown, plain text, or JSON-Lines)
`excludePatterns`	array	tags, categories	URL patterns to skip
`minWordCount`	integer	`100`	Minimum word count; pages with fewer words are skipped
`proxyConfiguration`	object	Apify Proxy	Proxy for reliable scraping
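As a rough illustration of the parameters above, here is a sketch of an input object with a small sanity check. The parameter names and defaults come from the table; the start URL, the shape of the `urls` entries, the exclusion pattern syntax, and the `useApifyProxy` field are assumptions, and `validate_input` is a hypothetical helper, not part of the actor.

```python
import json

# Input mirroring the documented defaults. Values marked "assumed"
# are guesses at details the table does not spell out.
run_input = {
    "urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"],  # assumed URL and shape
    "crawlDepth": 1,
    "maxPages": 5,
    "outputFormat": "markdown",
    "excludePatterns": ["/tag/", "/category/"],  # assumed pattern syntax
    "minWordCount": 100,
    "proxyConfiguration": {"useApifyProxy": True},  # assumed proxy settings
}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks sane."""
    problems = []
    if not run_input.get("urls"):
        problems.append("urls must contain at least one start URL")
    if not 0 <= run_input.get("crawlDepth", 1) <= 3:
        problems.append("crawlDepth must be between 0 and 3")
    # The lower bounds below are assumptions, not documented limits.
    if run_input.get("maxPages", 5) < 1:
        problems.append("maxPages must be at least 1")
    if run_input.get("minWordCount", 100) < 0:
        problems.append("minWordCount must not be negative")
    return problems


problems = validate_input(run_input)
print(problems or json.dumps(run_input, indent=2))
```

Only the 0-3 range for `crawlDepth` is stated in the table; the other checks are defensive guesses and can be dropped if the actor accepts wider values.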

Use Cases

LLM fine-tuning: Build custom training datasets from any website
RAG pipelines: Create knowledge base documents for retrieval-augmented generation
Research datasets: Collect structured content for academic research
Competitive analysis: Extract and analyze competitor website content

Cost Estimate

5 pages: ~$0.25
100 pages: ~$2.00
1000 pages: ~$15.00
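The estimates above imply a per-page cost that falls as runs get larger. A short sketch of that arithmetic, using only the three listed figures (the variable names are illustrative, and the results are approximate because the listed totals are themselves approximate):

```python
# Listed estimates: number of pages -> approximate total cost in USD.
estimates = {5: 0.25, 100: 2.00, 1000: 15.00}

# Effective cost per page for each listed run size.
per_page = {pages: round(cost / pages, 4) for pages, cost in estimates.items()}

for pages, cost in per_page.items():
    print(f"{pages} pages: ~${cost:.3f} per page")
```

This works out to roughly $0.05, $0.02, and $0.015 per page for the 5, 100, and 1000 page runs respectively.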
