AI Training Data Scraper - LLM & RAG-Ready Web Content Extractor

Turn any website into clean, chunked, token-counted training data for OpenAI, Claude, and any LLM pipeline -- in one click.


Why This Actor?

Building AI applications is hard enough without spending hours cleaning scraped web data. Every RAG pipeline, fine-tuning job, and knowledge base starts with the same painful step: getting web content into a format your model can actually use.

Generic web scrapers give you raw HTML soup. You then spend hours writing custom parsers, chunking logic, and format converters. The free Website Content Crawler on Apify is great for basic scraping -- but it was not built for AI workflows. It does not chunk text, does not count tokens, does not score content quality, and does not output in LLM-ready formats.

This actor solves that entire pipeline in one step. Point it at any URL, and it delivers perfectly chunked, token-counted, quality-scored content in the exact format your AI stack expects.

URL --> Crawl --> Extract --> Clean --> Chunk --> Format --> Output
- Crawl: Puppeteer browser rendering
- Extract: remove boilerplate (navigation, ads, scripts, footers)
- Clean: normalize unicode, whitespace, and control characters
- Chunk: smart paragraph + sentence boundary splitting
- Format: OpenAI JSONL, Claude JSONL, Markdown, or raw text
- Output: dataset items ready to use

Feature Comparison

| Feature | Free Website Content Crawler | AI Training Data Scraper |
|---|---|---|
| Basic web scraping | Yes | Yes |
| JavaScript rendering | Yes | Yes (Puppeteer) |
| LLM-ready output formats | No | OpenAI JSONL, Claude JSONL, Markdown, Raw |
| Intelligent text chunking | No | Paragraph + sentence-aware splitting |
| Configurable chunk size & overlap | No | Yes (token-based) |
| Token counting per chunk | No | Yes (BPE estimate) |
| Content quality scoring | No | Yes (0-100 scale) |
| Metadata extraction | Basic | Title, author, date, language, description |
| Boilerplate removal | Basic | Configurable CSS selector exclusion |
| Multi-page crawling | Yes | Yes (with depth control) |

What Data Does It Extract?

Each output item (one per chunk) contains:

| Field | Description |
|---|---|
| url | Source page URL |
| chunkIndex | Index of this chunk (0-based) |
| totalChunks | Total chunks from this page |
| tokenCount | Estimated token count (words × 1.3) |
| wordCount | Exact word count |
| title | Page title from `<title>` or Open Graph |
| author | Author from meta tags (if available) |
| date | Publication date from meta tags (if available) |
| lang | Page language (defaults to "en") |
| description | Meta description |
| qualityScore | Content quality 0-100 (text density, paragraph richness, sentence quality) |
| scrapedAt | ISO timestamp of extraction |
| messages / prompt / text | The actual content in your chosen format |

5 Use Cases

1. RAG Pipeline Ingestion

Feed chunks directly into your vector database (Pinecone, Weaviate, Chroma). Each chunk is pre-sized for embedding models, with overlap to preserve context across boundaries.
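
For example, here is a minimal ingestion sketch using Chroma; it assumes you exported the dataset to a local dataset.json file and ran the actor with outputFormat set to raw_text or markdown, so each item carries a text field. Collection name and ID scheme are illustrative placeholders.

```python
import json
import chromadb

# Chunks exported from the Apify dataset (raw_text format assumed).
with open("dataset.json") as f:
    items = json.load(f)

client = chromadb.Client()  # in-memory; swap for PersistentClient in production
collection = client.get_or_create_collection("web_docs")

# One record per chunk; Chroma embeds documents with its default embedding model.
collection.add(
    ids=[f'{item["url"]}#{item["chunkIndex"]}' for item in items],
    documents=[item["text"] for item in items],
    metadatas=[{"url": item["url"], "title": item.get("title", "")} for item in items],
)
```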

2. LLM Fine-Tuning Datasets

Output in OpenAI JSONL format ready for openai api fine_tuning.jobs.create. Each chunk becomes a training example with proper system/user/assistant message structure.

3. Knowledge Base Construction

Build internal knowledge bases from documentation sites, wikis, and help centers. Quality scoring automatically filters out low-value pages.

4. Content Analysis & Research

Extract and normalize content from multiple sources for comparative analysis. Metadata extraction captures authorship, dates, and language for structured research datasets.

5. Competitive Intelligence

Monitor competitor blogs, documentation, and product pages. Clean structured output makes it easy to track changes and analyze content strategies over time.


Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl total |
| maxDepth | integer | 1 | Link-following depth (0 = start URLs only) |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap tokens between consecutive chunks |
| outputFormat | enum | jsonl_openai | One of: jsonl_openai, jsonl_claude, markdown, raw_text |
| includeMetadata | boolean | true | Include extracted metadata per chunk |
| minContentLength | integer | 100 | Skip pages with fewer characters |
| excludeSelectors | string | nav, footer, header, .sidebar, .ads, .cookie-banner, script, style | CSS selectors to remove |
| maxConcurrency | integer | 5 | Parallel page limit |
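
As an illustration, a run can be started programmatically with the Apify Python client. This is a sketch: the API token and actor ID are placeholders, and the exact startUrls shape should follow the actor's input schema in the Apify Console.

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")  # your Apify API token

run_input = {
    "startUrls": [{"url": "https://example.com/docs"}],  # common Apify convention
    "maxPages": 10,
    "maxDepth": 1,
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "outputFormat": "jsonl_openai",
}

# "<ACTOR_ID>" stands in for this actor's ID or "username/actor-name".
run = client.actor("<ACTOR_ID>").call(run_input=run_input)

# Each dataset item is one chunk, ready for downstream use.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["chunkIndex"], item["tokenCount"], item["qualityScore"])
```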

Output Examples

OpenAI JSONL Format (jsonl_openai)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "wordCount": 651,
  "title": "Understanding Transformers",
  "qualityScore": 82,
  "scrapedAt": "2026-03-08T12:00:00.000Z",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. The following content was extracted from a web page for training purposes."
    },
    {
      "role": "user",
      "content": "Source: https://example.com/article | Title: Understanding Transformers"
    },
    {
      "role": "assistant",
      "content": "Transformers are a type of neural network architecture..."
    }
  ]
}
```

Claude JSONL Format (jsonl_claude)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "prompt": "\n\nHuman: The following is extracted content from https://example.com/article (Understanding Transformers). Please process this information.\n\nAssistant:",
  "completion": " Transformers are a type of neural network architecture..."
}
```

Markdown Format (markdown)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "---\nurl: \"https://example.com/article\"\ntitle: \"Understanding Transformers\"\nlanguage: \"en\"\nchunk: 1/3\ntokens: 847\nwords: 651\n---\n\nTransformers are a type of neural network architecture..."
}
```

Raw Text Format (raw_text)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "Transformers are a type of neural network architecture..."
}
```

Pricing

This actor uses the Pay Per Event (PPE) pricing model on Apify.

| Event | Price |
|---|---|
| Actor start | $0.005 |
| Per page scraped | $0.004 |

Example cost: 1,000 pages = $4.005 (Tier 1 pricing)

This is significantly cheaper than building and maintaining your own scraping infrastructure, and you get LLM-ready output without any post-processing.


FAQ

Q: How accurate is the token count? A: The actor uses a words x 1.3 heuristic which closely approximates BPE tokenizer output for English text. For precise counts, run the output through tiktoken or your model's native tokenizer.
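
If you need exact counts, a minimal sketch with tiktoken looks like this; it assumes raw_text or markdown output so each item has a text field (for jsonl_openai items you would count the assistant message content instead).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the GPT-3.5/GPT-4 family

def exact_token_count(item: dict) -> int:
    # Replaces the words × 1.3 estimate with a real BPE count.
    return len(enc.encode(item["text"]))
```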

Q: Can I scrape JavaScript-heavy (SPA) sites? A: Yes. The actor uses Puppeteer with a full Chromium browser, so it renders JavaScript before extracting content.

Q: What happens with pages behind login walls? A: Pages requiring authentication will be skipped. For authenticated scraping, consider using Apify's proxy and cookie injection features.

Q: How does chunking handle code blocks and tables? A: Code blocks and tables are treated as text blocks. They will be included in chunks but may be split if they exceed the target chunk size. For code-heavy pages, consider increasing chunkSize.

Q: Can I use this for non-English content? A: Yes. Text extraction and chunking work with any language. Token estimates may be less accurate for non-Latin scripts (CJK text typically has a higher token-per-word ratio).

Q: What is the quality score based on? A: The quality score (0-100) combines three signals: text-to-HTML density ratio (how much of the page is actual content), paragraph count (content-rich pages have more paragraphs), and average sentence length (well-written content tends toward 10-25 word sentences).

Q: How do I use the output for OpenAI fine-tuning? A: Export the dataset as JSONL from Apify. Each row is already in the correct {"messages": [...]} format. Upload directly to the OpenAI fine-tuning API.
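
As a sketch with the official openai Python SDK (file names and model are placeholders), the upload might look like the following; it keeps only the messages field from each exported item, which is the shape the chat fine-tuning endpoint expects.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Strip each exported dataset item down to its "messages" key.
with open("dataset.jsonl") as src, open("training.jsonl", "w") as dst:
    for line in src:
        item = json.loads(line)
        dst.write(json.dumps({"messages": item["messages"]}) + "\n")

training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; use any fine-tunable chat model
)
print(job.id)
```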


Tips for Best Results

  1. Start with a small run (5-10 pages) to verify the output format meets your needs before scaling up.
  2. Tune excludeSelectors for your target site -- inspect the page and add site-specific selectors for sidebars, related articles, or other boilerplate.
  3. Set chunkSize based on your model -- GPT-4 handles up to 128K tokens, but embedding models like text-embedding-3-small work best with 500-1000 token chunks.
  4. Use chunkOverlap of 50-200 tokens for RAG to ensure no information is lost at chunk boundaries.
  5. Monitor qualityScore -- pages scoring below 30 are likely navigation-heavy or boilerplate. Consider filtering them in post-processing.
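
For example, a small post-processing filter along the lines of tip 5 (the 30-point cutoff is a rule of thumb, not a hard requirement, and dataset.json is a placeholder for your exported dataset):

```python
import json

with open("dataset.json") as f:
    items = json.load(f)

# Drop navigation-heavy or boilerplate chunks before they reach your pipeline.
clean_items = [item for item in items if item.get("qualityScore", 0) >= 30]
print(f"Kept {len(clean_items)} of {len(items)} chunks")
```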

If this actor saved you time, a review helps us keep improving! Your feedback directly shapes future features and updates.