AI Training Data Scraper - LLM & RAG-Ready Web Content Extractor

Turn any website into clean, chunked, token-counted training data for OpenAI, Claude, and any LLM pipeline -- in one click.


Why This Actor?

Building AI applications is hard enough without spending hours cleaning scraped web data. Every RAG pipeline, fine-tuning job, and knowledge base starts with the same painful step: getting web content into a format your model can actually use.

Generic web scrapers give you raw HTML soup. You then spend hours writing custom parsers, chunking logic, and format converters. The free Website Content Crawler on Apify is great for basic scraping -- but it was not built for AI workflows. It does not chunk text, does not count tokens, does not score content quality, and does not output in LLM-ready formats.

This actor solves that entire pipeline in one step. Point it at any URL, and it delivers perfectly chunked, token-counted, quality-scored content in the exact format your AI stack expects.

URL --> Crawl --> Extract --> Clean --> Chunk --> Format --> Output
- Crawl: Puppeteer browser rendering
- Extract: remove boilerplate (navigation, ads, scripts, footers)
- Clean: normalize unicode, whitespace, and control characters
- Chunk: smart paragraph + sentence boundary splitting
- Format: OpenAI JSONL, Claude JSONL, Markdown, or raw text
- Output: dataset items ready to use

Feature Comparison

| Feature | Free Website Content Crawler | AI Training Data Scraper |
|---|---|---|
| Basic web scraping | Yes | Yes |
| JavaScript rendering | Yes | Yes (Puppeteer) |
| LLM-ready output formats | No | OpenAI JSONL, Claude JSONL, Markdown, Raw |
| Intelligent text chunking | No | Paragraph + sentence-aware splitting |
| Configurable chunk size & overlap | No | Yes (token-based) |
| Token counting per chunk | No | Yes (BPE estimate) |
| Content quality scoring | No | Yes (0-100 scale) |
| Metadata extraction | Basic | Title, author, date, language, description |
| Boilerplate removal | Basic | Configurable CSS selector exclusion |
| Multi-page crawling | Yes | Yes (with depth control) |

What Data Does It Extract?

Each output item (one per chunk) contains:

| Field | Description |
|---|---|
| url | Source page URL |
| chunkIndex | Index of this chunk (0-based) |
| totalChunks | Total chunks from this page |
| tokenCount | Estimated token count (words × 1.3) |
| wordCount | Exact word count |
| title | Page title from `<title>` or Open Graph |
| author | Author from meta tags (if available) |
| date | Publication date from meta tags (if available) |
| lang | Page language (defaults to "en") |
| description | Meta description |
| qualityScore | Content quality 0-100 (text density, paragraph richness, sentence quality) |
| scrapedAt | ISO timestamp of extraction |
| messages / prompt / text | The actual content in your chosen format |

5 Use Cases

1. RAG Pipeline Ingestion

Feed chunks directly into your vector database (Pinecone, Weaviate, Chroma). Each chunk is pre-sized for embedding models, with overlap to preserve context across boundaries.
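
For example, here is a minimal ingestion sketch using Chroma; it assumes you exported the dataset to a local dataset.json file and ran the actor with outputFormat set to raw_text or markdown, so each item carries a text field. Collection name and ID scheme are illustrative placeholders.

```python
import json
import chromadb

# Chunks exported from the Apify dataset (raw_text format assumed).
with open("dataset.json") as f:
    items = json.load(f)

client = chromadb.Client()  # in-memory; swap for PersistentClient in production
collection = client.get_or_create_collection("web_docs")

# One record per chunk; Chroma embeds documents with its default embedding model.
collection.add(
    ids=[f'{item["url"]}#{item["chunkIndex"]}' for item in items],
    documents=[item["text"] for item in items],
    metadatas=[{"url": item["url"], "title": item.get("title", "")} for item in items],
)
```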

2. LLM Fine-Tuning Datasets

Output in OpenAI JSONL format ready for openai api fine_tuning.jobs.create. Each chunk becomes a training example with proper system/user/assistant message structure.

3. Knowledge Base Construction

Build internal knowledge bases from documentation sites, wikis, and help centers. Quality scoring automatically filters out low-value pages.

4. Content Analysis & Research

Extract and normalize content from multiple sources for comparative analysis. Metadata extraction captures authorship, dates, and language for structured research datasets.

5. Competitive Intelligence

Monitor competitor blogs, documentation, and product pages. Clean structured output makes it easy to track changes and analyze content strategies over time.


Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | (required) | URLs to scrape |
| maxPages | integer | 10 | Maximum pages to crawl total |
| maxDepth | integer | 1 | Link-following depth (0 = start URLs only) |
| chunkSize | integer | 1000 | Target chunk size in tokens |
| chunkOverlap | integer | 100 | Overlap tokens between consecutive chunks |
| outputFormat | enum | jsonl_openai | One of: jsonl_openai, jsonl_claude, markdown, raw_text |
| includeMetadata | boolean | true | Include extracted metadata per chunk |
| minContentLength | integer | 100 | Skip pages with fewer characters |
| excludeSelectors | string | nav, footer, header, .sidebar, .ads, .cookie-banner, script, style | CSS selectors to remove |
| maxConcurrency | integer | 5 | Parallel page limit |
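
As an illustration, a run can be started programmatically with the Apify Python client. This is a sketch: the API token and actor ID are placeholders, and the exact startUrls shape should follow the actor's input schema in the Apify Console.

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")  # your Apify API token

run_input = {
    "startUrls": [{"url": "https://example.com/docs"}],  # common Apify convention
    "maxPages": 10,
    "maxDepth": 1,
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "outputFormat": "jsonl_openai",
}

# "<ACTOR_ID>" stands in for this actor's ID or "username/actor-name".
run = client.actor("<ACTOR_ID>").call(run_input=run_input)

# Each dataset item is one chunk, ready for downstream use.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["chunkIndex"], item["tokenCount"], item["qualityScore"])
```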

Output Examples

OpenAI JSONL Format (jsonl_openai)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "wordCount": 651,
  "title": "Understanding Transformers",
  "qualityScore": 82,
  "scrapedAt": "2026-03-08T12:00:00.000Z",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. The following content was extracted from a web page for training purposes."
    },
    {
      "role": "user",
      "content": "Source: https://example.com/article | Title: Understanding Transformers"
    },
    {
      "role": "assistant",
      "content": "Transformers are a type of neural network architecture..."
    }
  ]
}
```

Claude JSONL Format (jsonl_claude)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "prompt": "\n\nHuman: The following is extracted content from https://example.com/article (Understanding Transformers). Please process this information.\n\nAssistant:",
  "completion": " Transformers are a type of neural network architecture..."
}
```

Markdown Format (markdown)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "---\nurl: \"https://example.com/article\"\ntitle: \"Understanding Transformers\"\nlanguage: \"en\"\nchunk: 1/3\ntokens: 847\nwords: 651\n---\n\nTransformers are a type of neural network architecture..."
}
```

Raw Text Format (raw_text)

```json
{
  "url": "https://example.com/article",
  "chunkIndex": 0,
  "totalChunks": 3,
  "tokenCount": 847,
  "text": "Transformers are a type of neural network architecture..."
}
```

Pricing

This actor uses the Pay Per Event (PPE) pricing model on Apify.

| Event | Price |
|---|---|
| Actor start | $0.005 |
| Per page scraped | $0.004 |

Example cost: 1,000 pages = $4.005 (Tier 1 pricing)

This is significantly cheaper than building and maintaining your own scraping infrastructure, and you get LLM-ready output without any post-processing.


FAQ

Q: How accurate is the token count? A: The actor uses a words x 1.3 heuristic which closely approximates BPE tokenizer output for English text. For precise counts, run the output through tiktoken or your model's native tokenizer.
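
If you need exact counts, a minimal sketch with tiktoken looks like this; it assumes raw_text or markdown output so each item has a text field (for jsonl_openai items you would count the assistant message content instead).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the GPT-3.5/GPT-4 family

def exact_token_count(item: dict) -> int:
    # Replaces the words × 1.3 estimate with a real BPE count.
    return len(enc.encode(item["text"]))
```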

Q: Can I scrape JavaScript-heavy (SPA) sites? A: Yes. The actor uses Puppeteer with a full Chromium browser, so it renders JavaScript before extracting content.

Q: What happens with pages behind login walls? A: Pages requiring authentication will be skipped. For authenticated scraping, consider using Apify's proxy and cookie injection features.

Q: How does chunking handle code blocks and tables? A: Code blocks and tables are treated as text blocks. They will be included in chunks but may be split if they exceed the target chunk size. For code-heavy pages, consider increasing chunkSize.

Q: Can I use this for non-English content? A: Yes. Text extraction and chunking work with any language. Token estimates may be less accurate for non-Latin scripts (CJK text typically has a higher token-per-word ratio).

Q: What is the quality score based on? A: The quality score (0-100) combines three signals: text-to-HTML density ratio (how much of the page is actual content), paragraph count (content-rich pages have more paragraphs), and average sentence length (well-written content tends toward 10-25 word sentences).

Q: How do I use the output for OpenAI fine-tuning? A: Export the dataset as JSONL from Apify. Each row is already in the correct {"messages": [...]} format. Upload directly to the OpenAI fine-tuning API.
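
As a sketch with the official openai Python SDK (file names and model are placeholders), the upload might look like the following; it keeps only the messages field from each exported item, which is the shape the chat fine-tuning endpoint expects.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Strip each exported dataset item down to its "messages" key.
with open("dataset.jsonl") as src, open("training.jsonl", "w") as dst:
    for line in src:
        item = json.loads(line)
        dst.write(json.dumps({"messages": item["messages"]}) + "\n")

training_file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; use any fine-tunable chat model
)
print(job.id)
```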


Tips for Best Results

  1. Start with a small run (5-10 pages) to verify the output format meets your needs before scaling up.
  2. Tune excludeSelectors for your target site -- inspect the page and add site-specific selectors for sidebars, related articles, or other boilerplate.
  3. Set chunkSize based on your model -- GPT-4 handles up to 128K tokens, but embedding models like text-embedding-3-small work best with 500-1000 token chunks.
  4. Use chunkOverlap of 50-200 tokens for RAG to ensure no information is lost at chunk boundaries.
  5. Monitor qualityScore -- pages scoring below 30 are likely navigation-heavy or boilerplate. Consider filtering them in post-processing.
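
For example, a small post-processing filter along the lines of tip 5 (the 30-point cutoff is a rule of thumb, not a hard requirement, and dataset.json is a placeholder for your exported dataset):

```python
import json

with open("dataset.json") as f:
    items = json.load(f)

# Drop navigation-heavy or boilerplate chunks before they reach your pipeline.
clean_items = [item for item in items if item.get("qualityScore", 0) >= 30]
print(f"Kept {len(clean_items)} of {len(items)} chunks")
```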

If this actor saved you time, a review helps us keep improving! Your feedback directly shapes future features and updates.