LLM Training Data Extractor
Pricing
Pay per usage
Extract clean training data from websites for LLMs. Output raw text, Q&A pairs, or instruction-response format.
Developer: Donny Nguyen
Features
- Extracts structured data using fast HTML parsing (Cheerio)
- Configurable input parameters with sensible defaults
- Proxy support for reliable access
- Automatic retries on failure
- Results saved to Apify Dataset in JSON, CSV, or Excel format
Input Parameters
| Field | Type | Description | Default |
|---|---|---|---|
| `startUrls` | array | URLs to begin crawling for training data extraction | `[{"url":"https://example.com"}]` |
| `maxPages` | integer | Maximum number of pages to crawl | `100` |
| `outputFormat` | string | Format for extracted training data | `"raw"` |
| `minTextLength` | integer | Minimum character length for extracted content | `100` |
| `excludeSelectors` | array | CSS selectors for elements to exclude (in addition to default nav, footer, ads) | `[]` |
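
As a rough illustration of how these fields combine, the sketch below builds an input object from the defaults in the table and overrides a few of them. The helper name `build_input` and the example URL and selectors are hypothetical, chosen only for this example; the field names and default values come from the table.

```python
import copy
import json

# Defaults copied from the Input Parameters table.
DEFAULT_INPUT = {
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 100,
    "outputFormat": "raw",
    "minTextLength": 100,
    "excludeSelectors": [],
}


def build_input(**overrides):
    """Return a fresh copy of the defaults with the given fields overridden.

    Hypothetical helper for illustration; it is not part of the Actor.
    """
    run_input = copy.deepcopy(DEFAULT_INPUT)  # never mutate the shared defaults
    run_input.update(overrides)
    return run_input


if __name__ == "__main__":
    example = build_input(
        startUrls=[{"url": "https://example.com/docs"}],  # placeholder URL
        maxPages=25,
        excludeSelectors=[".sidebar", "#cookie-banner"],  # placeholder selectors
    )
    # This JSON is what you would paste into the Console or send to the API.
    print(json.dumps(example, indent=2))
```

Fields left out of the call keep their default values, so a small crawl only needs `startUrls` and `maxPages`.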
Usage
- Via Apify Console: Set input parameters in the UI and click "Start"
- Via API: Send a POST request to the Actor's run endpoint with your input JSON
- Via Apify SDK: Use `Actor.call('tropical_quince/llm-training-data-extractor', input)`
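
The API route can be sketched with the Python standard library alone. This is a minimal example, assuming the public Apify API v2 run endpoint (`POST /v2/acts/{actorId}/runs`) and bearer-token authentication; the function names and the `token` argument are illustrative, and the token must be your own Apify API token.

```python
import json
import urllib.request

ACTOR_ID = "tropical_quince/llm-training-data-extractor"


def build_run_request(token, run_input, actor_id=ACTOR_ID):
    """Build (but do not send) the POST request that starts an Actor run."""
    # In API URLs, "username/actor-name" is written as "username~actor-name".
    url = f"https://api.apify.com/v2/acts/{actor_id.replace('/', '~')}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_run(token, run_input):
    """Send the request and return the run object from the response's "data" field."""
    with urllib.request.urlopen(build_run_request(token, run_input)) as resp:
        return json.load(resp)["data"]
```

Building the request separately from sending it makes it easy to inspect the URL and body before spending any credits. The call returns as soon as the run is created, not when the crawl has finished.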
Output
Results are stored in the default dataset. You can download them in JSON, CSV, or Excel format from the Storage tab in the Apify Console.
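
Results can also be fetched programmatically. The sketch below assumes the Apify API v2 dataset items endpoint (`GET /v2/datasets/{datasetId}/items`) and its `format` query parameter; the function names are illustrative. The dataset ID is the `defaultDatasetId` field of the run object. The shape of each item depends on the chosen `outputFormat` and is not assumed here.

```python
import json
import urllib.parse
import urllib.request


def dataset_items_url(dataset_id, fmt="json"):
    """Return the URL that exports a dataset's items in the given format."""
    # "xlsx" is the API's name for the Excel export.
    if fmt not in {"json", "csv", "xlsx"}:
        raise ValueError(f"Unsupported format: {fmt}")
    query = urllib.parse.urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


def fetch_items(token, dataset_id):
    """Download all items of a dataset as a list of dicts (JSON export)."""
    req = urllib.request.Request(
        dataset_items_url(dataset_id, "json"),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For large crawls, the same endpoint accepts `limit` and `offset` query parameters, which allow the items to be paged rather than downloaded in one response.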
Pricing
This Actor uses pay-per-event pricing: you are charged for each result scraped. Check the Pricing tab for current rates.
Proxy
The Actor supports both datacenter and residential proxies. For sites with aggressive anti-bot protection, enable residential proxies with the `useResidentialProxy` input parameter.