AI Training Data Collector — Clean Web Datasets for LLMs
Crawl websites and extract structured, clean text datasets perfect for fine-tuning LLMs and RAG pipelines. Removes boilerplate, deduplicates content, and scores quality automatically.
What it does
- Cleans HTML content — strips nav, headers, footers, ads, scripts, cookies, comments, and sidebars automatically
- Finds the main content: intelligently targets `article`, `main`, `[role="main"]`, `.content`, `.post-content`, `.entry-content`, `#content` before falling back to `body`
- Converts to structured formats: markdown, plain text, or JSON output
- Scores content quality — 0-100 score based on length, word diversity, sentence structure, and document formatting
- Deduplicates pages — skips duplicate content using MD5 hashing of the first 2,000 characters
- Crawls to configurable depth — follows same-origin internal links up to 3 levels deep
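The deduplication step can be sketched in a few lines of Python. This is a minimal illustration, not the actor's source code: the function names `content_hash` and `deduplicate` and the sample URLs are assumptions, and only the "MD5 of the first 2,000 characters" rule comes from the list above.

```python
import hashlib


def content_hash(text: str, window: int = 2000) -> str:
    """MD5 hex digest of the first `window` characters of the text."""
    return hashlib.md5(text[:window].encode("utf-8")).hexdigest()


def deduplicate(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each hash and skip later matches."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = content_hash(page["cleanText"])
        if digest in seen:
            continue  # same first 2,000 characters as an earlier page
        seen.add(digest)
        unique.append(page)
    return unique


# Hypothetical sample pages: the first two share a 2,000-character intro.
intro = "x" * 2000
pages = [
    {"url": "https://example.com/a", "cleanText": intro + " body A"},
    {"url": "https://example.com/b", "cleanText": intro + " body B"},
    {"url": "https://example.com/c", "cleanText": "a different page"},
]
kept = [page["url"] for page in deduplicate(pages)]
print(kept)
```

Because only the first 2,000 characters are hashed, the second sample page is skipped even though its body differs. The output example further down shows a 12-character `contentHash`, so the actor may truncate the digest; this sketch keeps the full 32 hex characters.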
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | array | [Wikipedia AI] | Starting URLs to crawl |
| `crawlDepth` | integer | 1 | Link depth to follow (0-3). 0 = only start URLs |
| `maxPages` | integer | 5 | Maximum pages to process per run (1-1000) |
| `outputFormat` | string | `markdown` | `markdown`, `plainText`, or `json` |
| `excludePatterns` | array | [/tag/, /category/] | URL path patterns to skip |
| `minWordCount` | integer | 100 | Skip pages below this word count threshold |
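As a rough illustration of the input fields, the sketch below builds a run input and rejects values outside the documented ranges. The helper name `build_input` is hypothetical, and passing `urls` as plain strings is an assumption; the field names, defaults, and ranges are taken from the table.

```python
import json

ALLOWED_FORMATS = {"markdown", "plainText", "json"}


def build_input(urls, crawl_depth=1, max_pages=5, output_format="markdown",
                exclude_patterns=("/tag/", "/category/"), min_word_count=100):
    """Build a run-input dict, rejecting values outside the documented ranges."""
    if not 0 <= crawl_depth <= 3:
        raise ValueError("crawlDepth must be between 0 and 3")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "urls": list(urls),
        "crawlDepth": crawl_depth,
        "maxPages": max_pages,
        "outputFormat": output_format,
        "excludePatterns": list(exclude_patterns),
        "minWordCount": min_word_count,
    }


run_input = build_input(
    ["https://en.wikipedia.org/wiki/Artificial_intelligence"],
    crawl_depth=2,
    max_pages=50,
)
print(json.dumps(run_input, indent=2))
```

Validating locally is optional, but it catches an out-of-range `crawlDepth` or `maxPages` before a run is started.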
Output
Dataset Schema
| Field | Type | Description |
|---|---|---|
| `url` | string | Source URL |
| `title` | string | Page title or first H1 |
| `cleanText` | string | Extracted clean content (markdown/plain text) |
| `structuredContent` | object | `{title, body}`, only when `outputFormat: json` |
| `wordCount` | integer | Total words extracted |
| `qualityScore` | integer | 0-100 quality score |
| `sourceDomain` | string | Domain name (www stripped) |
| `language` | string | `en` or `unknown` |
| `crawlDepth` | integer | Depth level where page was found |
| `headingCount` | integer | Number of H1-H6 tags |
| `paragraphCount` | integer | Number of `<p>` tags |
| `linkCount` | integer | Number of `<a>` tags |
| `images` | integer | Number of `<img>` tags (count only, not extracted) |
| `contentHash` | string | MD5 hash of first 2,000 chars for deduplication |
| `extractionMethod` | string | Always `cheerio-html2text` |
| `scrapedAt` | string | ISO timestamp |
Output Example

    {
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "title": "Artificial intelligence - Wikipedia",
      "wordCount": 34968,
      "qualityScore": 100,
      "sourceDomain": "en.wikipedia.org",
      "language": "en",
      "headingCount": 71,
      "paragraphCount": 180,
      "linkCount": 5084,
      "images": 39,
      "contentHash": "603675ced43c",
      "extractionMethod": "cheerio-html2text",
      "cleanText": "# Artificial intelligence\n\nArtificial intelligence (AI) is...",
      "scrapedAt": "2026-05-19T04:40:16.841Z"
    }
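
One way to consume records shaped like the example above is to keep only higher-scoring pages and write them as JSON Lines for a fine-tuning or RAG pipeline. This is a sketch under assumptions: the threshold of 70, the output keys `text` and `source`, and the second sample record are illustrative and not part of the actor.

```python
import json


def to_training_lines(records, min_quality=70):
    """Yield one JSON line per record whose qualityScore meets the threshold."""
    for rec in records:
        if rec.get("qualityScore", 0) >= min_quality:
            yield json.dumps(
                {"text": rec["cleanText"], "source": rec["url"]},
                ensure_ascii=False,
            )


# Hypothetical records using the documented output fields.
records = [
    {
        "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
        "qualityScore": 100,
        "cleanText": "# Artificial intelligence\n\nArtificial intelligence (AI) is...",
    },
    {
        "url": "https://example.com/thin-page",
        "qualityScore": 40,
        "cleanText": "Short page.",
    },
]
lines = list(to_training_lines(records))
print("\n".join(lines))
```

Keeping the source URL next to each text makes it easier to trace a training example back to the page it came from.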
How Quality Scoring Works
The qualityScore (0-100) is computed from four dimensions:
| Dimension | Weight | How it's calculated |
|---|---|---|
| Length | 0-35 | min(35, wordCount / 25) |
| Vocabulary diversity | 0-25 | min(25, uniqueWords / totalWords * 100) |
| Sentence structure | 0-20 | min(20, sentenceCount * 1.5) |
| Document structure | 0-20 | min(20, headingCount * 4 + paragraphCount * 0.5) |
Example scores:
- Wikipedia AI article (34,968 words, 71 headings): 100/100
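- A 500-word blog post with 5 headings and 20 paragraphs: 20 points for length and the full 20 for document structure, so at most 85 once vocabulary and sentence points are added
- A 200-word page with no headings: only 8 points for length and 0.5 per paragraph for document structure, so most of its score has to come from vocabulary and sentence points (it is skipped only if `minWordCount` is raised above 200)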
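The four dimensions in the scoring table can be combined as in the sketch below. It follows the formulas in the table, but the tokenization (lowercased whitespace split), the sentence splitting (on `.`, `!`, `?`), and the final rounding are assumptions, so results may differ from the actor's own scores.

```python
import re


def quality_score(text: str, heading_count: int, paragraph_count: int) -> int:
    """Sum the four capped dimensions from the scoring table."""
    words = text.lower().split()  # assumption: whitespace tokenization
    if not words:
        return 0
    # Assumption: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    length = min(35, len(words) / 25)
    diversity = min(25, len(set(words)) / len(words) * 100)
    sentence = min(20, len(sentences) * 1.5)
    structure = min(20, heading_count * 4 + paragraph_count * 0.5)
    return round(length + diversity + sentence + structure)


# Hypothetical input: one 8-word sentence repeated 50 times (400 words).
sample = " ".join(["The crawler extracts clean text from each page."] * 50)
score = quality_score(sample, heading_count=2, paragraph_count=10)
print(score)
```

The repetitive sample earns the full sentence points but almost nothing for vocabulary diversity, which shows how the caps keep any single dimension from dominating the total.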
Use Cases
- LLM fine-tuning dataset: Crawl 100 medical research articles to create a specialized healthcare training corpus
- RAG knowledge base: Extract clean text from your company docs and blog posts for retrieval-augmented generation
- Content analysis: Build a dataset of competitor blog posts with quality scores for content strategy
- Academic research: Collect and deduplicate article text from journal websites
Battle-Tested Results
| Test Site | Words | Quality Score | Headings | Paragraphs | Links | Images |
|---|---|---|---|---|---|---|
| Wikipedia — Artificial Intelligence | 34,968 | 100 | 71 | 180 | 5,084 | 39 |
- Deduplication tested across 16 pages — correctly skipped 2 duplicate articles
- Low-quality filtering tested at `minWordCount: 500` correctly skipped navigation-heavy index pages
Limits & Architecture Constraints
Hard Limits
| Limit | Value | Impact |
|---|---|---|
| Crawler engine | Cheerio (no browser) | Cannot execute JavaScript or scrape SPAs |
| Max pages | 1,000 | Hard ceiling per run |
| Max crawl depth | 3 levels | Deep pagination truncated |
| Same-origin only | Yes | External links are not followed |
| Deduplication window | First 2,000 chars | Pages with identical intros but different bodies flagged as duplicates |
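
The depth, same-origin, and exclude-pattern limits together decide which links are followed. The sketch below is one plausible reading, not the actor's implementation: the function name `should_follow` is hypothetical, "same origin" is taken to mean the same scheme and host, and `excludePatterns` are treated as plain substrings of the URL path.

```python
from urllib.parse import urljoin, urlsplit

MAX_DEPTH = 3  # documented ceiling for crawlDepth


def should_follow(start_url, page_url, href, depth, crawl_depth=1,
                  exclude_patterns=("/tag/", "/category/")):
    """Return the absolute URL to enqueue, or None if the link is skipped."""
    if depth >= min(crawl_depth, MAX_DEPTH):
        return None  # the current page is already at the deepest level
    target = urljoin(page_url, href)  # resolve relative links
    start, link = urlsplit(start_url), urlsplit(target)
    if (start.scheme, start.netloc) != (link.scheme, link.netloc):
        return None  # external link
    if any(pattern in link.path for pattern in exclude_patterns):
        return None  # assumption: patterns are substring matches
    return target


start = "https://example.com/"
page = "https://example.com/docs/"
print(should_follow(start, page, "intro", depth=0))
print(should_follow(start, page, "https://other.example.org/x", depth=0))
print(should_follow(start, page, "/tag/python", depth=0))
print(should_follow(start, page, "intro", depth=1))
```

With the default `crawlDepth` of 1, links found on a start URL (depth 0) are followed, while links found on those pages (depth 1) are not.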
Content Extraction Weaknesses
- Unusual DOM structures: If a site doesn't use semantic HTML (`<article>`, `<main>`, `.content`), the actor falls back to `body` and may include more noise
- JavaScript-rendered content: No Playwright = no JS execution. Content loaded via XHR/fetch is invisible
- Paywalls & login gates: Cheerio sees raw HTML — paywall blurbs or login prompts may get extracted as "content"
- Dynamic lazy loading: Images and content loaded on scroll are missed
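Before crawling a site, it can help to check whether its raw HTML contains enough text without JavaScript. The sketch below is a rough pre-flight heuristic, not part of the actor: the names `VisibleText`, `static_word_count`, and `looks_js_rendered` are hypothetical, the skipped tag list is an assumption, and the threshold of 100 mirrors the `minWordCount` default.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are treated as non-content.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class VisibleText(HTMLParser):
    """Collect text that appears outside the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_word_count(html: str) -> int:
    """Count words present in the raw HTML without running any scripts."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def looks_js_rendered(html: str, min_word_count: int = 100) -> bool:
    """True when the static HTML has fewer words than the threshold."""
    return static_word_count(html) < min_word_count


spa_shell = '<html><body><div id="root"></div><script>var t = "app code";</script></body></html>'
static_page = "<html><body><article><p>" + "word " * 150 + "</p></article></body></html>"
print(looks_js_rendered(spa_shell), looks_js_rendered(static_page))
```

A page that fails this check is likely to produce little or no `cleanText`, since content added by scripts never appears in the HTML that Cheerio parses.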
When It Works Best
- ✅ Static blogs and documentation sites
- ✅ Wikipedia and wiki-style pages
- ✅ News articles with semantic HTML
- ✅ Corporate knowledge bases and help centers
- ✅ Content-rich pages with clear `article` or `main` tags
When It Struggles
- ❌ JavaScript-heavy SPAs (React, Vue, Angular without SSR)
- ❌ Sites with aggressive anti-bot (Cloudflare challenges, CAPTCHA)
- ❌ Pages where main content loads dynamically after page load
- ❌ Heavily paginated tag/category pages (use `excludePatterns` to skip these)
Pricing
- Free tier: 5 pages per run
- Pay-per-result: $0.005 per page processed
- Subscription: $59/month for unlimited runs
Support
Found a bug or need a custom feature? Open an issue or email support.