Pricing

from $2.10 / 1,000 saved training pages

AI Training Data Curator

Turn any public website into a clean LLM training dataset. Crawl docs, blogs, and help centers, extract readable text, filter by language, remove duplicates, and export JSON, JSONL, or CSV for fine-tuning, RAG, and AI workflows. No coding required.

Developer

Vamsi Krishna

Last modified: 13 days ago

What does AI Training Data Curator do?

AI Training Data Curator is a website crawler for LLM training data. It turns public web pages into structured datasets — without writing scrapers or cleaning HTML yourself.

You provide URLs or a sitemap. The Actor:

- Crawls your target website or documentation site
- Extracts clean article-style text (not raw HTML)
- Filters short pages, wrong languages, and navigation boilerplate
- Removes near-duplicate pages
- Exports one dataset row per saved page

Use it to scrape websites for AI training, build RAG knowledge bases, or collect text corpora from docs, blogs, and help centers.
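
To make "near-duplicate" concrete, here is a minimal sketch of one common approach: compare pages by the overlap of their word 5-grams (Jaccard similarity) and keep only the first page of each similar group. This is an illustration, not the Actor's actual algorithm, which is not documented here. The `text` key, the shingle size, and the 0.8 threshold are all assumptions.

```python
import re


def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(pages: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep the first page of each near-duplicate group.

    Pairwise comparison is O(n^2), which is fine for small sets only.
    The "text" key is an assumed field name.
    """
    kept: list[tuple[dict, set]] = []
    for page in pages:
        signature = shingles(page["text"])
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((page, signature))
    return [page for page, _ in kept]
```

For large crawls, a pairwise comparison like this becomes slow; techniques such as MinHash are the usual replacement.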

Why scrape websites for LLM training data?

| Goal | What you get |
| --- | --- |
| LLM fine-tuning | Clean text + metadata per page, ready for JSONL export |
| RAG / chatbot knowledge base | One document per URL to chunk and embed in a vector DB |
| Documentation dataset | Crawl a docs site or sitemap at scale |
| Deduplicated corpus | Near-duplicate pages removed automatically |
| English-only data | Optional language filter (e.g. `en`) |

Best for documentation, blogs, help centers, news, and marketing content.

Run on a schedule, scale to 100,000 pages, and integrate via the Apify API — no local server needed.

What data does this web scraper extract?

Each saved page includes:

| Field | Description |
| --- | --- |
| URL | Page link |
| Title | Page title |
| Text | Clean extracted body (the main field for training) |
| Language | Detected language (e.g. English) |
| Word count | Words in the extracted text |
| Author | When available on the page |
| Published date | When available on the page |

Example:

| URL | Title | Text (shortened) | Words |
| --- | --- | --- | --- |
| docs.example.com/guide | Getting Started | This guide walks you through setup… | 842 |

Exports include the full text — not shortened like the table preview.
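
As a rough illustration of working with a JSONL export, the sketch below parses one JSON object per line and builds a shortened preview like the table above. The field names `url`, `title`, and `text` are assumptions; check a real export for the actual keys.

```python
import json


def preview_rows(jsonl_text: str, max_chars: int = 40) -> list[dict]:
    """Parse a JSONL export and return rows with the text field shortened.

    Assumes each line is a JSON object with "url", "title", and "text" keys.
    """
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get("text", "")
        if len(text) > max_chars:
            short = text[:max_chars].rstrip() + "…"
        else:
            short = text
        rows.append({
            "url": record.get("url"),
            "title": record.get("title"),
            "text": short,
            "words": len(text.split()),  # whitespace-separated word count
        })
    return rows
```

The full text stays in the export file; only the preview is truncated.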

How to build an LLM training dataset from a website

1. Click Start on this Actor page.
2. Add Start URLs: paste links to the pages you want, or a Sitemap URL for bulk crawling.
3. Set Max pages (default: 10; up to 100,000 for large corpora).
4. Choose Crawl strategy:
   - Seeds only: crawl only your listed URLs (fast, predictable)
   - Recurse: follow internal links (keep Stay within seed domains on)
5. Optional: set Language filter (`en`), Minimum text length, and Remove duplicates (recommended).
6. Click Run → open Dataset → Export (JSON, JSONL, CSV, or Excel).

First run? Use the default prefill and click Run to see a sample output in under a minute.

Input settings (see Input tab for all options)

| Setting | Recommendation |
| --- | --- |
| Start URLs | Required unless you use a sitemap |
| Max pages | Default 10; raise for larger datasets |
| Crawl strategy | Seeds only for listed URLs · Recurse to follow site links |
| Stay within seed domains | Keep on for single-site crawls |
| Language filter | `en` for English-only LLM training data |
| Remove duplicates | Keep on |
| Proxy | Enable if the site blocks datacenter IPs |

Docs site tip: set Crawl strategy to Recurse, keep Stay within seed domains on, set Language filter to `en`, and set Max pages to 100–500.
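
For runs started through the API, the docs site tip might translate into a JSON input along these lines. Every key name and the `"recurse"` value below are hypothetical placeholders; the real names are listed on the Input tab and may differ.

```python
import json


def docs_site_input(start_url: str, max_pages: int = 300) -> dict:
    """Build a run input for a single docs site (all key names are hypothetical)."""
    if not 1 <= max_pages <= 100_000:  # 100,000 is the documented upper limit
        raise ValueError("max_pages must be between 1 and 100,000")
    return {
        "startUrls": [{"url": start_url}],  # hypothetical key
        "maxPages": max_pages,              # hypothetical key
        "crawlStrategy": "recurse",         # hypothetical key and value
        "stayWithinSeedDomains": True,      # hypothetical key
        "languageFilter": "en",             # hypothetical key
        "removeDuplicates": True,           # hypothetical key
    }


# Print the input as JSON, ready to paste into a JSON input editor.
print(json.dumps(docs_site_input("https://docs.example.com/"), indent=2))
```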

Output — download your web dataset

Export formats: JSON, JSONL, CSV, Excel.

Use outputs with OpenAI fine-tuning, Hugging Face, LangChain, LlamaIndex, Pinecone, Weaviate, Chroma, or any ML pipeline.

Run statistics are saved under SUMMARY in the key-value store when available.

Frequently asked questions

How do I create an LLM training dataset from a website?

Add your URLs (or sitemap), set max pages, click Run, then export the dataset from the Dataset tab. No code required.

Can I use this for RAG and vector databases?

Yes. Each row is clean text per URL — chunk it and embed into Pinecone, Weaviate, Chroma, or similar tools.
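
As a sketch of the chunking step, the function below splits one page's text into overlapping word windows before embedding. The chunk size and overlap are illustrative defaults, not recommendations from the Actor, and the embedding call itself is left to your vector database client.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of some duplicated text in the index.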

Does this work for documentation and blog scraping?

Yes. It is built for article-style HTML: docs, blogs, help centers, and news sites.

Do I need to code?

No. Configure everything in Apify Console. Use the API tab when you want automation.
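
For automation, a run can be started with a plain HTTP request. The sketch below builds (but does not send) a POST to the Apify API v2 run endpoint using only the standard library. The Actor ID `USERNAME~ACTOR-NAME` and the token are placeholders, and the input keys are whatever the Input tab defines; copy the exact endpoint and ID from the API tab rather than relying on this example.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),  # run input goes in the JSON body
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values: replace with the Actor ID and token shown in Apify Console.
request = build_run_request("USERNAME~ACTOR-NAME", "YOUR_API_TOKEN", {})
```

Passing `request` to `urllib.request.urlopen` would send it; the response describes the new run, including the dataset that will hold its results.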

Why are saved pages fewer than max pages?

Pages that fail the length, language, or quality filters, or that duplicate a page already saved, are dropped. A saved count below Max pages is therefore expected.
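
The effect can be pictured with a small sketch that applies similar filters to a list of pages and tallies why each one was dropped. It illustrates the idea only: the Actor's real filters and thresholds are not documented here, the `text` and `language` keys are assumptions, and duplicates are detected by exact match on normalized text rather than near-duplicate matching.

```python
import hashlib
from collections import Counter


def filter_pages(pages: list[dict], min_words: int = 50, language: str = "en"):
    """Return (kept_pages, drop_reason_counts) for a list of page dicts."""
    kept, dropped, seen = [], Counter(), set()
    for page in pages:
        text = page.get("text", "")
        # Normalize case and whitespace so trivially different copies match.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if len(text.split()) < min_words:
            dropped["too_short"] += 1
        elif page.get("language") != language:
            dropped["wrong_language"] += 1
        elif digest in seen:
            dropped["duplicate"] += 1
        else:
            seen.add(digest)
            kept.append(page)
    return kept, dropped
```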

What is the best export format for fine-tuning?

JSONL is common for LLM pipelines; CSV works for spreadsheets and quick review.

Where is my data stored?

Each run stores its results in an Apify Dataset in your account. Download it anytime from the Console or via the API.

Disclaimer and support

Only crawl sites you are permitted to access. Follow terms of service, copyright, and robots.txt.

Our Actors collect only publicly visible content on pages you choose. Do not scrape personal data without a lawful reason. Consult your legal advisor if unsure.

Help: Issues tab on this Store page · API tab for programmatic runs · Apify documentation
