Pricing

Pay per usage

LLM Training Data Extractor

Extract clean training data from websites for LLMs. Output raw text, Q&A pairs, or instruction-response format.

Rating

0.0

(0)

Developer

Donny Nguyen

Actor stats

Last modified

3 days ago

Features

Extracts structured data using fast HTML parsing (Cheerio)
Configurable input parameters with sensible defaults
Proxy support for reliable access
Automatic retries on failure
Results saved to Apify Dataset in JSON, CSV, or Excel format

Input Parameters

| Field | Type | Description | Default |
|---|---|---|---|
| `startUrls` | array | URLs to begin crawling for training data extraction | `[{"url":"https://example.com"}]` |
| `maxPages` | integer | Maximum number of pages to crawl | `100` |
| `outputFormat` | string | Format for extracted training data | `"raw"` |
| `minTextLength` | integer | Minimum character length for extracted content | `100` |
| `excludeSelectors` | array | CSS selectors for elements to exclude (in addition to default nav, footer, ads) | `[]` |
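
As a rough illustration of how these fields combine, the sketch below merges caller overrides into the defaults from the table and checks that each known field keeps its documented type. The helper name `build_input` and the example URL and selectors are hypothetical; only the field names and default values come from the table.

```python
from copy import deepcopy

# Defaults copied from the Input Parameters table.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 100,
    "outputFormat": "raw",
    "minTextLength": 100,
    "excludeSelectors": [],
}


def build_input(**overrides):
    """Return a run input: table defaults with the given fields overridden.

    Hypothetical helper. Fields that are not in the table are passed
    through unchanged, without any type check.
    """
    run_input = deepcopy(DEFAULT_INPUT)
    for name, value in overrides.items():
        if name in DEFAULT_INPUT:
            expected = type(DEFAULT_INPUT[name])
            if not isinstance(value, expected):
                raise TypeError(f"{name} must be of type {expected.__name__}")
        run_input[name] = value
    return run_input


# Example values below are illustrative only.
example = build_input(
    startUrls=[{"url": "https://example.com/docs"}],
    maxPages=25,
    excludeSelectors=[".sidebar", "#cookie-banner"],
)
```

Fields left out of the call, such as `outputFormat` and `minTextLength` here, keep the defaults from the table.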

Usage

Via Apify Console: Set the input parameters in the UI and click "Start".
Via API: Send a POST request to the Actor's run endpoint with your input JSON as the request body.
Via Apify SDK: Call `Actor.call('tropical_quince/llm-training-data-extractor', input)`.
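
The API route can be sketched with the Python standard library alone. This is a minimal sketch, assuming the standard Apify API v2 run endpoint (`/v2/acts/{actorId}/runs`, with the `/` in the Actor name written as `~`) and bearer-token authentication; the token value is a placeholder and `build_run_request` is a hypothetical helper name. The request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# In API paths the "/" between user name and Actor name is written as "~".
ACTOR_ID = "tropical_quince~llm-training-data-extractor"


def build_run_request(token, run_input):
    """Build (but do not send) the POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{ACTOR_ID}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    token="<YOUR_APIFY_TOKEN>",  # placeholder, replace with a real token
    run_input={"startUrls": [{"url": "https://example.com"}], "maxPages": 10},
)

# To actually start the run, send the request:
#     with urllib.request.urlopen(request) as response:
#         run = json.load(response)["data"]
```

Keeping request construction separate from sending makes the input easy to inspect before any run is started and charged.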

Output

Results are stored in the default dataset. You can download them in JSON, CSV, or Excel format from the Storage tab in the Apify Console.
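
Outside the Console, the same exports are available through the dataset items endpoint of the Apify API. The sketch below only builds the download URL; it assumes the standard `/v2/datasets/{datasetId}/items` endpoint with its `format` query parameter, where `xlsx` is the Excel format. The dataset ID is a placeholder (the run object returned by the API carries it as `defaultDatasetId`), and `dataset_items_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"
# The three export formats named above; "xlsx" is the Excel format.
EXPORT_FORMATS = ("json", "csv", "xlsx")


def dataset_items_url(dataset_id, fmt="json"):
    """Return the URL that downloads a dataset's items in the given format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"format must be one of {EXPORT_FORMATS}, got {fmt!r}")
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{query}"


# "abc123" is a placeholder dataset ID, not a real one.
csv_url = dataset_items_url("abc123", "csv")
```

A private dataset still needs the API token on the download request, for example as an `Authorization: Bearer` header.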

Pricing

This Actor uses Pay-Per-Event pricing: you are charged for each result scraped. Check the Pricing tab for current rates.

Proxy

The Actor supports both datacenter and residential proxies. For sites with aggressive anti-bot protection, enable residential proxies by setting the `useResidentialProxy` input parameter, which is not listed in the Input Parameters table.
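
A minimal sketch of switching the flag on in a run input follows. It assumes `useResidentialProxy` is a boolean, which the listing does not state, and `with_residential_proxy` is a hypothetical helper name.

```python
def with_residential_proxy(run_input, enabled=True):
    """Return a copy of run_input with the residential-proxy flag set.

    Assumption: useResidentialProxy takes a boolean; its type is not
    documented in the listing.
    """
    updated = dict(run_input)
    updated["useResidentialProxy"] = bool(enabled)
    return updated


base_input = {"startUrls": [{"url": "https://example.com"}], "maxPages": 10}
proxied_input = with_residential_proxy(base_input)
```

The original input is left unmodified, so the same base input can be reused for a datacenter-proxy run.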

Ai Training Data Curator

lanky_quantifier/ai-training-data-curator

Curate high-quality training datasets for AI/ML models. Extract, clean & format text data from websites, papers & forums. Perfect for LLM training, RAG systems & research.

Vhub Systems

AI Training Data Curator

ryanclinton/ai-training-data-curator

Crawl websites and extract clean training data for LLMs. Quality scoring, deduplication, PII detection, markdown output. Built for fine-tuning and RAG pipelines.

ryan clinton

Ai Training Data Enricher

fiery_dream/ai-training-data-enricher

Production-grade data enrichment and validation for LLM training datasets. Automatically clean, enrich, deduplicate, and validate your AI training data before fine-tuning.

Cody Churchwell

Ai Training Data Curator

mea/ai-training-data-curator

Crawl websites and curate high-quality training data for LLM fine-tuning. Automatic deduplication, quality scoring, and language detection. Export to JSONL, Parquet, or CSV formats ready for OpenAI, Claude, or Llama training.

Eliud Munyala

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

235

5.0

R3 | DE

gunmetal/r3-de

R³-DE is a multi-layer NLP pipeline that transforms raw text or scraped content into structured records with speakers, entities, actions, states, intent, sentiment, causality, confidence scores, and prompt–completion pairs, ideal for training chatbots, instruction-tuned LLMs, and safety-critical AI.

GUN | METAL

5.0

AI Training Data Scraper

blukaze/AI-Training-Data-Scraper

AI Training Data Scraper converts websites into clean, semantically-chunked, vector-ready data for LLMs, RAG pipelines, and AI search. Built for documentation, tutorials, and code-heavy content, with smart chunking and rich metadata.

Blukaze Automations

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. JSON output includes `text` (normalized page content) and `links` (extracted sub-URLs).

wallnut.ai

155

Quick Website Content Scraper ( Extract Text for RAG & LLMs )

automateitplease/ai-web-content-scraper-extract-text-for-rag-llms

Extract clean text from any website for AI/LLM applications. Supports both static and JavaScript-rendered sites (React, Vue, Angular). Perfect for RAG systems, chatbot training, and content analysis.