AI Training Data Collector — Clean Web Datasets for LLMs

Crawl websites and extract structured, clean text datasets perfect for fine-tuning LLMs and RAG pipelines. Removes boilerplate, deduplicates, and scores content quality.

  • Pricing: Pay per event
  • Rating: 0.0 (0)
  • Developer: Avinash (maintained by Community)
  • Actor stats: 0 bookmarks, 2 total users, 1 monthly active user
  • Last modified: 2 hours ago


What it does

  1. Cleans HTML content — strips nav, headers, footers, ads, scripts, cookies, comments, and sidebars automatically
  2. Finds the main content — intelligently targets article, main, [role="main"], .content, .post-content, .entry-content, #content before falling back to body
  3. Converts to structured formats — markdown, plain text, or JSON output
  4. Scores content quality — 0-100 score based on length, word diversity, sentence structure, and document formatting
  5. Deduplicates pages — skips duplicate content using MD5 hashing of the first 2,000 characters
  6. Crawls to configurable depth — follows same-origin internal links up to 3 levels deep
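
The crawl rule in step 6 can be sketched as a small link filter. This is an illustration only, not the actor's actual code: the function names `same_origin` and `next_links` are hypothetical, and the sketch assumes that "same-origin" means matching scheme, host, and port and that URL fragments are ignored.

```python
from urllib.parse import urljoin, urlsplit

MAX_DEPTH = 3  # documented ceiling for crawlDepth


def same_origin(a, b):
    """True when both URLs share scheme, host, and port (assumed definition)."""
    ua, ub = urlsplit(a), urlsplit(b)
    return (ua.scheme, ua.hostname, ua.port) == (ub.scheme, ub.hostname, ub.port)


def next_links(page_url, hrefs, depth, crawl_depth):
    """Return (url, depth + 1) pairs worth queueing from one page."""
    limit = min(crawl_depth, MAX_DEPTH)
    if depth >= limit:
        return []  # already at the deepest allowed level
    queued, seen = [], set()
    for href in hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if absolute in seen or not same_origin(page_url, absolute):
            continue
        seen.add(absolute)
        queued.append((absolute, depth + 1))
    return queued
```

With crawlDepth set to 0 the function returns nothing for the start URLs, which matches the "0 = only start URLs" behavior described in the Input section.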

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array | [Wikipedia AI] | Starting URLs to crawl |
| crawlDepth | integer | 1 | Link depth to follow (0-3). 0 = only start URLs |
| maxPages | integer | 5 | Maximum pages to process per run (1-1000) |
| outputFormat | string | markdown | markdown, plainText, or json |
| excludePatterns | array | [/tag/, /category/] | URL path patterns to skip |
| minWordCount | integer | 100 | Skip pages below this word count threshold |
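
As a rough illustration of the fields above, the sketch below builds an input object with the documented defaults and checks it against the documented ranges. The helpers `validate_input` and `is_excluded` are hypothetical, the start URL is taken from the output example, and the sketch assumes an exclude pattern matches when it is a substring of the URL path; the actor may match patterns differently.

```python
from urllib.parse import urlsplit

# Documented defaults; the start URL is borrowed from the output example.
run_input = {
    "urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"],
    "crawlDepth": 1,
    "maxPages": 5,
    "outputFormat": "markdown",
    "excludePatterns": ["/tag/", "/category/"],
    "minWordCount": 100,
}


def validate_input(cfg):
    """Return a list of problems; an empty list means the input fits the documented ranges."""
    problems = []
    if not cfg.get("urls"):
        problems.append("urls must contain at least one start URL")
    if not 0 <= cfg.get("crawlDepth", 1) <= 3:
        problems.append("crawlDepth must be between 0 and 3")
    if not 1 <= cfg.get("maxPages", 5) <= 1000:
        problems.append("maxPages must be between 1 and 1000")
    if cfg.get("outputFormat", "markdown") not in {"markdown", "plainText", "json"}:
        problems.append("outputFormat must be markdown, plainText, or json")
    return problems


def is_excluded(url, patterns):
    """Assumption: a pattern matches when it is a substring of the URL path."""
    path = urlsplit(url).path
    return any(pattern in path for pattern in patterns)
```

Checking the input locally before a run is a cheap way to catch an out-of-range crawlDepth or maxPages.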

Output

Dataset Schema

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source URL |
| title | string | Page title or first H1 |
| cleanText | string | Extracted clean content (markdown/plain text) |
| structuredContent | object | {title, body}; present only when outputFormat is json |
| wordCount | integer | Total words extracted |
| qualityScore | integer | 0-100 quality score |
| sourceDomain | string | Domain name (www stripped) |
| language | string | en or unknown |
| crawlDepth | integer | Depth level where page was found |
| headingCount | integer | Number of H1-H6 tags |
| paragraphCount | integer | Number of <p> tags |
| linkCount | integer | Number of <a> tags |
| images | integer | Number of <img> tags (count only, not extracted) |
| contentHash | string | MD5 hash of first 2,000 chars for deduplication |
| extractionMethod | string | Always cheerio-html2text |
| scrapedAt | string | ISO timestamp |

Output Example

{
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "wordCount": 34968,
  "qualityScore": 100,
  "sourceDomain": "en.wikipedia.org",
  "language": "en",
  "headingCount": 71,
  "paragraphCount": 180,
  "linkCount": 5084,
  "images": 39,
  "contentHash": "603675ced43c",
  "extractionMethod": "cheerio-html2text",
  "cleanText": "# Artificial intelligence\n\nArtificial intelligence (AI) is...",
  "scrapedAt": "2026-05-19T04:40:16.841Z"
}
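
Records shaped like the example above can be post-processed into a training file. The following is a minimal sketch, assuming the dataset has already been exported and loaded as a list of dictionaries; the function name `to_jsonl`, the thresholds, and the output keys `text` and `source` are illustrative choices, not part of the actor.

```python
import json


def to_jsonl(records, min_quality=60, min_words=100):
    """Keep records that pass both thresholds and emit one JSON object per line."""
    lines = []
    for rec in records:
        if rec.get("qualityScore", 0) < min_quality:
            continue  # drop low-scoring pages
        if rec.get("wordCount", 0) < min_words:
            continue  # drop very short pages
        lines.append(json.dumps(
            {"text": rec["cleanText"], "source": rec["url"]},
            ensure_ascii=False,  # keep non-ASCII text readable
        ))
    return "\n".join(lines)
```

Newlines inside cleanText are escaped by `json.dumps`, so each record stays on a single line, which is what most JSONL loaders expect.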

How Quality Scoring Works

The qualityScore (0-100) is computed from four dimensions:

| Dimension | Weight | How it's calculated |
| --- | --- | --- |
| Length | 0-35 | min(35, wordCount / 25) |
| Vocabulary diversity | 0-25 | min(25, uniqueWords / totalWords * 100) |
| Sentence structure | 0-20 | min(20, sentenceCount * 1.5) |
| Document structure | 0-20 | min(20, headingCount * 4 + paragraphCount * 0.5) |

Example scores:

  • Wikipedia AI article (34,968 words, 71 headings): 100/100
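  • A 500-word blog post with 5 headings and 20 paragraphs: 20 points for length and 20 for document structure by the formulas above, so 40 before vocabulary and sentence points and at most 85 overall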
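  • A 200-word page with no headings: 8 points for length and none for headings by the formulas above; with the default minWordCount of 100 it is kept, and it is skipped only when minWordCount is set above 200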
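
The four dimensions in the scoring table can be combined as in the sketch below. It follows the table's formulas literally, but the tokenizer, the sentence splitter, and the final rounding are assumptions, so it may not reproduce the actor's scores exactly. In particular, the unique-word ratio falls as documents get longer, so very long pages may score lower here than in the listed examples.

```python
import re


def quality_score(text, heading_count, paragraph_count):
    """Score 0-100 from the four documented dimensions (tokenization is assumed)."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not words:
        return 0
    # Assumed sentence splitter: break on runs of ., !, or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    length = min(35, len(words) / 25)
    diversity = min(25, len(set(words)) / len(words) * 100)
    sentence = min(20, len(sentences) * 1.5)
    structure = min(20, heading_count * 4 + paragraph_count * 0.5)
    return round(length + diversity + sentence + structure)
```

Because every term is capped, the sum cannot exceed 35 + 25 + 20 + 20 = 100.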

Use Cases

  • LLM fine-tuning dataset: Crawl 100 medical research articles to create a specialized healthcare training corpus
  • RAG knowledge base: Extract clean text from your company docs and blog posts for retrieval-augmented generation
  • Content analysis: Build a dataset of competitor blog posts with quality scores for content strategy
  • Academic research: Collect and deduplicate article text from journal websites

Battle-Tested Results

| Test Site | Words | Quality Score | Headings | Paragraphs | Links | Images |
| --- | --- | --- | --- | --- | --- | --- |
| Wikipedia: Artificial Intelligence | 34,968 | 100 | 71 | 180 | 5,084 | 39 |

  • Deduplication tested across 16 pages — correctly skipped 2 duplicate articles
  • Low-quality filtering tested at minWordCount: 500 — correctly skipped navigation-heavy index pages

Limits & Architecture Constraints

Hard Limits

| Limit | Value | Impact |
| --- | --- | --- |
| Crawler engine | Cheerio (no browser) | Cannot execute JavaScript or scrape SPAs |
| Max pages | 1,000 | Hard ceiling per run |
| Max crawl depth | 3 levels | Deep pagination truncated |
| Same-origin only | Yes | External links are not followed |
| Deduplication window | First 2,000 chars | Pages with identical intros but different bodies flagged as duplicates |
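
The deduplication window and its side effect can be illustrated with a short sketch. It assumes the hash is taken over the raw first 2,000 characters with no normalization and, based on the 12-character contentHash in the output example, that the MD5 hex digest is truncated to 12 characters; both are guesses about the actor's behavior, and the names `content_hash` and `dedupe` are hypothetical.

```python
import hashlib


def content_hash(text, window=2000, digest_chars=12):
    """MD5 of the first `window` characters, truncated to `digest_chars` hex digits."""
    prefix = text[:window]
    return hashlib.md5(prefix.encode("utf-8")).hexdigest()[:digest_chars]


def dedupe(pages):
    """Keep the first URL seen for each content hash; `pages` is (url, text) pairs."""
    seen, kept = set(), []
    for url, text in pages:
        digest = content_hash(text)
        if digest in seen:
            continue  # same first 2,000 characters as an earlier page
        seen.add(digest)
        kept.append(url)
    return kept
```

Two pages that share a 2,000-character intro hash identically even when their bodies differ, so the second one is dropped. This is the false positive described in the table above.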

Content Extraction Weaknesses

  • Unusual DOM structures: If a site doesn't use semantic HTML (<article>, <main>, .content), the actor falls back to body and may include more noise
  • JavaScript-rendered content: The actor runs no browser (no Playwright), so JavaScript never executes and content loaded via XHR/fetch is invisible
  • Paywalls & login gates: Cheerio sees only the raw HTML, so paywall blurbs or login prompts may be extracted as "content"
  • Dynamic lazy loading: Images and content loaded on scroll are missed

When It Works Best

  • ✅ Static blogs and documentation sites
  • ✅ Wikipedia and wiki-style pages
  • ✅ News articles with semantic HTML
  • ✅ Corporate knowledge bases and help centers
  • ✅ Content-rich pages with clear article or main tags

When It Struggles

  • ❌ JavaScript-heavy SPAs (React, Vue, Angular without SSR)
  • ❌ Sites with aggressive anti-bot (Cloudflare challenges, CAPTCHA)
  • ❌ Pages where main content loads dynamically after page load
  • ❌ Heavily paginated tag/category pages (use excludePatterns to skip these)

Pricing

  • Free tier: 5 pages per run
  • Pay-per-result: $0.005 per page processed
  • Subscription: $59/month for unlimited runs

Support

Found a bug or need a custom feature? Open an issue or email support.