LLM Training Data Extractor

Extract clean training data from websites for LLMs. Output raw text, Q&A pairs, or instruction-response format.

Pricing

Pay per usage

Developer

Donny Nguyen

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 2
  • Last modified: 3 days ago

Features

  • Extracts structured data using fast HTML parsing (Cheerio)
  • Configurable input parameters with sensible defaults
  • Proxy support for reliable access
  • Automatic retries on failure
  • Results saved to Apify Dataset in JSON, CSV, or Excel format

Input Parameters

| Field | Type | Description | Default |
| startUrls | array | URLs to begin crawling for training data extraction | [{"url":"https://example.com"}] |
| maxPages | integer | Maximum number of pages to crawl | 100 |
| outputFormat | string | Format for extracted training data | "raw" |
| minTextLength | integer | Minimum character length for extracted content | 100 |
| excludeSelectors | array | CSS selectors for elements to exclude (in addition to default nav, footer, ads) | [] |
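
As an illustration of how the fields above fit together, here is a minimal Python sketch that starts from the documented defaults and overrides a few of them. The helper name build_input and the range checks are assumptions made for this example. They are not part of the Actor, which may validate its input differently.

```python
import copy
import json

# Defaults copied from the input parameter table above.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 100,
    "outputFormat": "raw",
    "minTextLength": 100,
    "excludeSelectors": [],
}


def build_input(**overrides):
    """Return a copy of the documented defaults with some fields overridden.

    Hypothetical helper: the checks below are local sanity checks only,
    not the Actor's own validation rules.
    """
    merged = copy.deepcopy(DEFAULT_INPUT)
    merged.update(overrides)
    if not merged["startUrls"]:
        raise ValueError("startUrls must contain at least one URL")
    if merged["maxPages"] < 1:
        raise ValueError("maxPages must be at least 1")
    if merged["minTextLength"] < 0:
        raise ValueError("minTextLength must not be negative")
    return merged


if __name__ == "__main__":
    # The selectors here are made-up examples of elements to skip.
    actor_input = build_input(
        maxPages=25,
        excludeSelectors=[".sidebar", "#cookie-banner"],
    )
    print(json.dumps(actor_input, indent=2))
```

Fields that are not overridden keep the defaults from the table, so the printed JSON can be pasted into the Console input editor or sent through the API.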

Usage

  1. Via Apify Console: Set input parameters in the UI and click "Start"
  2. Via API: Send a POST request to the Actor's run endpoint with your input JSON
  3. Via Apify SDK: Use Actor.call('tropical_quince/llm-training-data-extractor', input)
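
The API option can be sketched in Python using only the standard library. This example assumes the Apify API v2 run endpoint, where the "/" in the Actor name is written as "~", and it assumes your API token is stored in an environment variable named APIFY_TOKEN. The helper name build_run_request is made up for this sketch, and the response fields read at the end should be checked against the current API reference.

```python
import json
import os
import urllib.request

# Actor name from the Usage list, with "/" written as "~" for use in API paths.
ACTOR_ID = "tropical_quince~llm-training-data-extractor"


def build_run_request(token, actor_input):
    """Build (but do not send) the POST request that starts an Actor run."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # APIFY_TOKEN is an assumed variable name; nothing is sent without it.
    token = os.environ.get("APIFY_TOKEN")
    if token:
        request = build_run_request(
            token,
            {"startUrls": [{"url": "https://example.com"}], "maxPages": 10},
        )
        with urllib.request.urlopen(request) as response:
            run = json.load(response)["data"]
        # The run record is expected to carry the run ID and its dataset ID.
        print(run["id"], run["defaultDatasetId"])
    else:
        print("Set APIFY_TOKEN to start a run.")
```

Building the request separately from sending it keeps the payload easy to inspect before any paid run is started.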

Output

Results are stored in the default dataset. You can download them in JSON, CSV, or Excel format from the Storage tab in the Apify Console.
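
Besides the Console, dataset items can also be fetched over the Apify API. The Python sketch below builds the download URL for each of the three formats mentioned above. It assumes the v2 dataset items endpoint and its format query parameter. The helper name dataset_items_url and the dataset ID abc123 are placeholders, and a private dataset also needs an API token.

```python
from urllib.parse import urlencode

# Maps the format names used in this README to the API's format values.
EXPORT_FORMATS = {"json": "json", "csv": "csv", "excel": "xlsx"}


def dataset_items_url(dataset_id, fmt="json"):
    """Return the download URL for a dataset's items in the given format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    query = urlencode({"format": EXPORT_FORMATS[fmt]})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


if __name__ == "__main__":
    # "abc123" is a placeholder; use the default dataset ID of your own run.
    for fmt in EXPORT_FORMATS:
        print(fmt, dataset_items_url("abc123", fmt))
```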

Pricing

This Actor uses pay-per-event pricing: you are charged for each result scraped. Check the Pricing tab for current rates.

Proxy

The Actor supports both datacenter and residential proxies. For sites with aggressive anti-bot protection, enable residential proxies by setting the useResidentialProxy input parameter.