AI Training Data Scraper - LLM and RAG-Ready
Pricing: Pay per usage
Developer: George Kioko
AI Training Data Scraper - LLM & RAG-Ready Web Content Extractor
Turn any website into clean, chunked, token-counted training data for OpenAI, Claude, and any LLM pipeline -- in one click.
Why This Actor?
Building AI applications is hard enough without spending hours cleaning scraped web data. Every RAG pipeline, fine-tuning job, and knowledge base starts with the same painful step: getting web content into a format your model can actually use.
Generic web scrapers give you raw HTML soup. You then spend hours writing custom parsers, chunking logic, and format converters. The free Website Content Crawler on Apify is great for basic scraping -- but it was not built for AI workflows. It does not chunk text, does not count tokens, does not score content quality, and does not output in LLM-ready formats.
This actor solves that entire pipeline in one step. Point it at any URL, and it delivers perfectly chunked, token-counted, quality-scored content in the exact format your AI stack expects.
URL --> Crawl --> Extract --> Clean --> Chunk --> Format --> Output

- Crawl: Puppeteer browser rendering
- Extract: remove boilerplate (nav, ads, scripts, footers)
- Clean: normalize unicode, whitespace, control characters
- Chunk: smart paragraph + sentence boundary splitting
- Format: OpenAI JSONL, Claude JSONL, Markdown, or raw text
- Output: dataset items ready to use
Feature Comparison
| Feature | Free Website Content Crawler | AI Training Data Scraper |
|---|---|---|
| Basic web scraping | Yes | Yes |
| JavaScript rendering | Yes | Yes (Puppeteer) |
| LLM-ready output formats | No | OpenAI JSONL, Claude JSONL, Markdown, Raw |
| Intelligent text chunking | No | Paragraph + sentence-aware splitting |
| Configurable chunk size & overlap | No | Yes (token-based) |
| Token counting per chunk | No | Yes (BPE estimate) |
| Content quality scoring | No | Yes (0-100 scale) |
| Metadata extraction | Basic | Title, author, date, language, description |
| Boilerplate removal | Basic | Configurable CSS selector exclusion |
| Multi-page crawling | Yes | Yes (with depth control) |
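The paragraph + sentence-aware splitting with token-based overlap described above can be sketched roughly as follows. This is an illustrative Python reimplementation, not the actor's actual source; it uses the same words x 1.3 token estimate the actor documents.

```python
import re


def estimate_tokens(text: str) -> int:
    # The actor approximates BPE tokens as word count x 1.3.
    return int(len(text.split()) * 1.3)


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split on paragraph boundaries first, falling back to sentence
    boundaries inside over-long paragraphs, keeping each chunk near
    chunk_size tokens with roughly `overlap` tokens carried over."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units = []
    for p in paragraphs:
        if estimate_tokens(p) > chunk_size:
            units.extend(re.split(r"(?<=[.!?])\s+", p))  # sentence-level fallback
        else:
            units.append(p)

    chunks, current, current_tokens = [], [], 0
    for unit in units:
        t = estimate_tokens(unit)
        if current and current_tokens + t > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing units forward until we have ~overlap tokens.
            carry, carry_tokens = [], 0
            for u in reversed(current):
                carry_tokens += estimate_tokens(u)
                carry.insert(0, u)
                if carry_tokens >= overlap:
                    break
            current, current_tokens = carry, carry_tokens
        current.append(unit)
        current_tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because overlap is built from whole paragraphs or sentences, the carried-over text can exceed the overlap target slightly; that keeps chunk boundaries on natural breaks instead of mid-sentence.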
What Data Does It Extract?
Each output item (one per chunk) contains:
| Field | Description |
|---|---|
| `url` | Source page URL |
| `chunkIndex` | Index of this chunk (0-based) |
| `totalChunks` | Total chunks from this page |
| `tokenCount` | Estimated token count (words x 1.3) |
| `wordCount` | Exact word count |
| `title` | Page title from `<title>` or Open Graph |
| `author` | Author from meta tags (if available) |
| `date` | Publication date from meta tags (if available) |
| `lang` | Page language (defaults to "en") |
| `description` | Meta description |
| `qualityScore` | Content quality 0-100 (text density, paragraph richness, sentence quality) |
| `scrapedAt` | ISO timestamp of extraction |
| `messages` / `prompt` / `text` | The actual content in your chosen format |
5 Use Cases
1. RAG Pipeline Ingestion
Feed chunks directly into your vector database (Pinecone, Weaviate, Chroma). Each chunk is pre-sized for embedding models, with overlap to preserve context across boundaries.
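The ingestion step can be sketched as below. This is an illustrative helper, assuming `raw_text` output (content in the `text` field) and the dataset fields from the output table; the `(id, text, metadata)` record shape is the kind that vector-DB upsert APIs typically accept, not something taken from the actor itself.

```python
def to_vector_records(items, min_quality=30):
    """Convert scraper output items into (id, text, metadata) records
    ready for embedding and upsert into a vector database. `items` are
    dataset rows shaped like the output table above; low-quality chunks
    are dropped."""
    records = []
    for item in items:
        if item.get("qualityScore", 0) < min_quality:
            continue  # skip navigation-heavy / boilerplate pages
        record_id = f'{item["url"]}#chunk-{item["chunkIndex"]}'
        metadata = {
            "url": item["url"],
            "title": item.get("title", ""),
            "tokenCount": item["tokenCount"],
            "chunk": f'{item["chunkIndex"] + 1}/{item["totalChunks"]}',
        }
        records.append((record_id, item["text"], metadata))
    return records
```

Embed each record's text with your model of choice, then hand the records to your store's upsert call (e.g. Pinecone's `index.upsert` or Chroma's `collection.add`; check those libraries' docs for the exact argument shapes).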
2. LLM Fine-Tuning Datasets
Output in OpenAI JSONL format, ready for `openai api fine_tuning.jobs.create`. Each chunk becomes a training example with a proper system/user/assistant message structure.
3. Knowledge Base Construction
Build internal knowledge bases from documentation sites, wikis, and help centers. Quality scoring automatically filters out low-value pages.
4. Content Analysis & Research
Extract and normalize content from multiple sources for comparative analysis. Metadata extraction captures authorship, dates, and language for structured research datasets.
5. Competitive Intelligence
Monitor competitor blogs, documentation, and product pages. Clean structured output makes it easy to track changes and analyze content strategies over time.
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | (required) | URLs to scrape |
| `maxPages` | integer | 10 | Maximum pages to crawl in total |
| `maxDepth` | integer | 1 | Link-following depth (0 = start URLs only) |
| `chunkSize` | integer | 1000 | Target chunk size in tokens |
| `chunkOverlap` | integer | 100 | Overlap tokens between consecutive chunks |
| `outputFormat` | enum | `jsonl_openai` | One of: `jsonl_openai`, `jsonl_claude`, `markdown`, `raw_text` |
| `includeMetadata` | boolean | true | Include extracted metadata per chunk |
| `minContentLength` | integer | 100 | Skip pages with fewer characters than this |
| `excludeSelectors` | string | `nav, footer, header, .sidebar, .ads, .cookie-banner, script, style` | CSS selectors to remove |
| `maxConcurrency` | integer | 5 | Parallel page limit |
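For reference, a typical run configuration for a documentation site might look like this. The values are illustrative, and the `startUrls` object shape follows the usual Apify convention; check the actor's input schema for the exact format.

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "maxPages": 50,
  "maxDepth": 2,
  "chunkSize": 800,
  "chunkOverlap": 100,
  "outputFormat": "jsonl_openai",
  "includeMetadata": true,
  "minContentLength": 100,
  "excludeSelectors": "nav, footer, header, .sidebar, .ads, .cookie-banner, script, style",
  "maxConcurrency": 5
}
```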
Output Examples
OpenAI JSONL Format (jsonl_openai)
```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "wordCount": 651,
  "title": "Understanding Transformers",
  "qualityScore": 82,
  "scrapedAt": "2026-03-08T12:00:00.000Z",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant. The following content was extracted from a web page for training purposes."},
    {"role": "user", "content": "Source: https://example.com/article | Title: Understanding Transformers"},
    {"role": "assistant", "content": "Transformers are a type of neural network architecture..."}
  ]
}
```
Claude JSONL Format (jsonl_claude)
```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "prompt": "\n\nHuman: The following is extracted content from https://example.com/article (Understanding Transformers). Please process this information.\n\nAssistant:",
  "completion": " Transformers are a type of neural network architecture..."
}
```
Markdown Format (markdown)
```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "---\nurl: \"https://example.com/article\"\ntitle: \"Understanding Transformers\"\nlanguage: \"en\"\nchunk: 1/3\ntokens: 847\nwords: 651\n---\n\nTransformers are a type of neural network architecture..."
}
```
Raw Text Format (raw_text)
```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "Transformers are a type of neural network architecture..."
}
```
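When consuming the exported dataset, a small helper can normalize across all four formats. This is an illustrative sketch; the field names follow the examples above.

```python
import json


def extract_content(line: str) -> str:
    """Pull the chunk text out of one exported JSONL row, whichever
    outputFormat produced it (fields per the format examples above)."""
    item = json.loads(line)
    if "messages" in item:                      # jsonl_openai
        return item["messages"][-1]["content"]  # assistant message holds the content
    if "completion" in item:                    # jsonl_claude
        return item["completion"].strip()
    return item["text"]                         # markdown / raw_text
```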
Pricing
This actor uses the Pay Per Event (PPE) pricing model on Apify.
| Event | Price |
|---|---|
| Actor start | $0.005 |
| Per page scraped | $0.004 |
Example cost: 1,000 pages = $4.005 (Tier 1 pricing)
This is significantly cheaper than building and maintaining your own scraping infrastructure, and you get LLM-ready output without any post-processing.
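The cost model is easy to reproduce. A tiny sketch using `Decimal` to avoid float rounding, with prices from the table above:

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_PAGE = Decimal("0.004")     # charged per page scraped


def run_cost(pages: int) -> Decimal:
    """Total run cost in USD: one start event plus one event per page."""
    return ACTOR_START + PER_PAGE * pages
```

For the example above: 1,000 pages costs $0.005 + 1,000 x $0.004 = $4.005.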
FAQ
Q: How accurate is the token count?
A: The actor uses a words x 1.3 heuristic which closely approximates BPE tokenizer output for English text. For precise counts, run the output through tiktoken or your model's native tokenizer.
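The heuristic itself is one line; the exact alternative with `tiktoken` is shown only as a comment because it needs an extra dependency.

```python
def approx_tokens(text: str) -> int:
    # words x 1.3: a cheap proxy for BPE token counts on English prose
    return int(len(text.split()) * 1.3)

# Exact alternative (requires `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4")
#   exact = len(enc.encode(text))
```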
Q: Can I scrape JavaScript-heavy (SPA) sites?
A: Yes. The actor uses Puppeteer with a full Chromium browser, so it renders JavaScript before extracting content.
Q: What happens with pages behind login walls?
A: Pages requiring authentication will be skipped. For authenticated scraping, consider using Apify's proxy and cookie injection features.
Q: How does chunking handle code blocks and tables?
A: Code blocks and tables are treated as text blocks. They will be included in chunks but may be split if they exceed the target chunk size. For code-heavy pages, consider increasing chunkSize.
Q: Can I use this for non-English content?
A: Yes. Text extraction and chunking work with any language. Token estimates may be less accurate for non-Latin scripts (CJK text typically has a higher token-per-word ratio).
Q: What is the quality score based on?
A: The quality score (0-100) combines three signals: text-to-HTML density ratio (how much of the page is actual content), paragraph count (content-rich pages have more paragraphs), and average sentence length (well-written content tends toward 10-25 word sentences).
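An illustrative reconstruction of a score built from those three signals; the 40/30/30 weighting and the cutoffs are assumptions, not the actor's actual formula.

```python
import re


def quality_score(text: str, html_length: int) -> int:
    """Illustrative 0-100 content quality score. Weights (40/30/30)
    are assumptions, not the actor's real implementation."""
    # Signal 1: text-to-HTML density (how much of the page is content)
    density = min(len(text) / max(html_length, 1), 1.0)

    # Signal 2: paragraph richness (capped at 10 paragraphs)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    richness = min(len(paragraphs) / 10, 1.0)

    # Signal 3: average sentence length, ideal around 10-25 words
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.split()]
    if sentences:
        avg = sum(len(s.split()) for s in sentences) / len(sentences)
        sentence_quality = 1.0 if 10 <= avg <= 25 else 0.5
    else:
        sentence_quality = 0.0

    return round(40 * density + 30 * richness + 30 * sentence_quality)
```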
Q: How do I use the output for OpenAI fine-tuning?
A: Export the dataset as JSONL from Apify. Each row is already in the correct {"messages": [...]} format. Upload directly to the OpenAI fine-tuning API.
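Before uploading, a quick sanity check of the exported file catches malformed rows early. This is an illustrative helper, not part of the actor.

```python
import json


def validate_finetune_jsonl(path: str) -> int:
    """Check an exported JSONL file before fine-tuning upload: every
    row needs a non-empty `messages` list of role/content dicts.
    Returns the number of valid rows."""
    valid_roles = {"system", "user", "assistant"}
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            row = json.loads(line)
            messages = row.get("messages")
            assert isinstance(messages, list) and messages, f"line {lineno}: no messages"
            for m in messages:
                assert m.get("role") in valid_roles, f"line {lineno}: bad role"
                assert isinstance(m.get("content"), str), f"line {lineno}: bad content"
            count += 1
    return count
```

With the official `openai` Python client, the upload is then `client.files.create(file=open(path, "rb"), purpose="fine-tune")`; verify the call against the current SDK docs.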
Tips for Best Results
- Start with a small run (5-10 pages) to verify the output format meets your needs before scaling up.
- Tune `excludeSelectors` for your target site -- inspect the page and add site-specific selectors for sidebars, related articles, or other boilerplate.
- Set `chunkSize` based on your model -- GPT-4 handles up to 128K tokens, but embedding models like `text-embedding-3-small` work best with 500-1000 token chunks.
- Use a `chunkOverlap` of 50-200 tokens for RAG to ensure no information is lost at chunk boundaries.
- Monitor `qualityScore` -- pages scoring below 30 are likely navigation-heavy or boilerplate. Consider filtering them out in post-processing.
If this actor saved you time, a review helps us keep improving! Your feedback directly shapes future features and updates.