AI Content Scraper & Cleaner

An Apify Actor that scrapes structured content (documentation, articles, FAQs, blog posts) and automatically converts it into clean, normalized JSON datasets suitable for LLM training and fine-tuning.

🚀 Features

  • Intelligent Content Extraction: Automatically extracts main content using configurable CSS selectors
  • Content Type Detection: Automatically detects content types (FAQ, article, guide, documentation, blog)
  • Text Cleaning: Removes HTML tags, scripts, styles, and normalizes whitespace
  • Token Estimation: Estimates token counts for LLM training (useful for dataset planning)
  • Language Detection: Optional language filtering support
  • Respectful Crawling: Honors robots.txt and implements rate limiting
  • Proxy Support: Built-in Apify Proxy integration for reliable scraping
  • Structured Output: Clean JSON dataset items with metadata

📋 Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string | - | Comma-separated list of URLs to start crawling from (required) |
| maxRequestsPerCrawl | string | "50" | Maximum number of requests allowed for this run |
| contentSelectors | string | "article, .doc-content, .post-content" | Comma-separated CSS selectors for main content extraction |
| titleSelectors | string | "h1, .post-title" | Comma-separated CSS selectors for title extraction |
| minimumTextLength | string | "300" | Ignore content shorter than this many characters |
| contentType | string | "auto" | Content type override (auto, faq, article, guide, documentation, blog, other) |
| maxDepth | string | "2" | Maximum link-following depth from the start URLs |
| respectRobotsTxt | string | "true" | Whether to honor robots.txt rules |
| useProxy | string | "true" | Rotate proxies via Apify Proxy when available |
| language | string | "" | Optional language code filter (e.g., "en") |

📤 Output

The Actor outputs structured JSON dataset items with the following fields:

  • url: Source URL of the scraped content
  • title: Extracted page title
  • content: Cleaned text content (HTML removed, normalized)
  • contentType: Detected or specified content type
  • tokensEstimate: Estimated token count for LLM training
  • language: Detected or specified language code
  • extractedAt: ISO timestamp of extraction
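
For example, a single dataset item might look like this (the values shown are illustrative, not real output):

    {
      "url": "https://example.com/docs/getting-started",
      "title": "Getting Started",
      "content": "This guide walks you through installation and configuration...",
      "contentType": "guide",
      "tokensEstimate": 412,
      "language": "en",
      "extractedAt": "2025-01-15T10:30:00.000Z"
    }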

🛠️ Installation & Usage

Prerequisites

  • Node.js and npm
  • Apify CLI (npm install -g apify-cli)

Local Development

  1. Clone or navigate to the Actor directory:

    $ cd AI-Ready-Dataset

  2. Install dependencies:

    $ npm install

  3. Configure input: edit input.json with your target URLs:

    {
      "startUrls": "https://example.com/docs, https://example.com/blog",
      "maxRequestsPerCrawl": "100",
      "minimumTextLength": "300"
    }

  4. Run locally:

    $ apify run

  5. View results: check storage/datasets/default/ for the scraped data (see the snippet below for reading it programmatically)
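
When run locally, the Apify CLI saves each dataset item as a numbered JSON file under storage/datasets/default/. A minimal sketch for loading all items into memory, assuming that default local storage layout:

    // read-results.js: load locally stored dataset items
    const fs = require('fs');
    const path = require('path');

    const datasetDir = path.join('storage', 'datasets', 'default');

    // Each dataset item is stored as its own JSON file (e.g. 000000001.json).
    const items = fs
      .readdirSync(datasetDir)
      .filter((file) => file.endsWith('.json'))
      .map((file) => JSON.parse(fs.readFileSync(path.join(datasetDir, file), 'utf8')));

    console.log(`Loaded ${items.length} items`);
    console.log(items[0]);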

Deploy to Apify Cloud

  1. Authenticate:

    $ apify login

  2. Deploy:

    $ apify push

  3. Run on Apify:

    • Use the Apify Console UI, or
    • Use the CLI: apify call <actor-id>
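
You can also start runs programmatically with the official apify-client package. A minimal sketch; the Actor ID and input values below are placeholders:

    // run-actor.js: start the Actor on the Apify platform and fetch its results
    const { ApifyClient } = require('apify-client');

    const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

    (async () => {
      // Replace <actor-id> with your Actor's ID, e.g. "username/ai-content-scraper-cleaner".
      const run = await client.actor('<actor-id>').call({
        startUrls: 'https://example.com/docs',
        maxRequestsPerCrawl: '50',
        minimumTextLength: '300',
      });

      // Read the items the run pushed to its default dataset.
      const { items } = await client.dataset(run.defaultDatasetId).listItems();
      console.log(`Scraped ${items.length} pages`);
    })();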

📝 Example Input

    {
      "startUrls": "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors",
      "maxRequestsPerCrawl": "50",
      "contentSelectors": "article, .doc-content, .post-content",
      "titleSelectors": "h1, .post-title",
      "minimumTextLength": "300",
      "contentType": "auto",
      "maxDepth": "2",
      "respectRobotsTxt": "true",
      "useProxy": "true",
      "language": "en"
    }

🎯 Use Cases

  • LLM Training Data Collection: Scrape documentation and articles for fine-tuning language models
  • Knowledge Base Building: Extract structured content from documentation sites
  • Content Analysis: Collect and analyze text content from multiple sources
  • Dataset Creation: Build custom datasets for machine learning projects
  • Content Migration: Extract content from websites for migration or archival

🔧 How It Works

  1. URL Discovery: Starts from provided URLs and follows links up to the specified depth
  2. Content Extraction: Uses CSS selectors to extract main content and titles
  3. Text Cleaning: Removes HTML, scripts, styles, and normalizes whitespace
  4. Content Classification: Automatically detects content type using heuristics
  5. Token Estimation: Calculates approximate token counts for LLM training
  6. Data Storage: Saves cleaned, structured data to Apify Dataset
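
A rough sketch of the cleaning and token-estimation steps (the Actor's internal implementation may differ; the ~4 characters-per-token ratio is a common approximation for English text, not an exact count):

    const cheerio = require('cheerio');

    // Extract the main content from a page, strip markup, and estimate tokens.
    function cleanAndEstimate(html, contentSelector = 'article') {
      const $ = cheerio.load(html);

      // Drop non-content elements before extracting text.
      $('script, style, noscript, nav, footer').remove();

      // Take the text of the main content element and normalize whitespace.
      const content = $(contentSelector).text().replace(/\s+/g, ' ').trim();

      // Rough token estimate: ~4 characters per token.
      const tokensEstimate = Math.ceil(content.length / 4);

      return { content, tokensEstimate };
    }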

📊 Content Type Detection

The Actor automatically detects content types using heuristics:

  • FAQ: Contains "faq" or "frequently asked" keywords
  • Guide: Contains "how to", "step", or "guide" keywords
  • Documentation: Contains "documentation" or "api reference" keywords
  • Article: Long-form content (>1000 words) or default fallback
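
A minimal sketch of these heuristics as a classifier (the keyword sets follow the list above; the Actor's actual logic may differ in detail):

    // Classify cleaned text into a content type using simple keyword checks.
    function detectContentType(text) {
      const lower = text.toLowerCase();
      if (lower.includes('faq') || lower.includes('frequently asked')) return 'faq';
      if (lower.includes('how to') || lower.includes('step') || lower.includes('guide')) return 'guide';
      if (lower.includes('documentation') || lower.includes('api reference')) return 'documentation';
      return 'article'; // long-form content or default fallback
    }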

⚙️ Configuration Tips

For Documentation Sites

    {
      "contentSelectors": "article, .doc-content, .documentation-content, main",
      "titleSelectors": "h1, .doc-title, .page-title"
    }

For Blog Sites

    {
      "contentSelectors": "article, .post-content, .entry-content, .blog-post",
      "titleSelectors": "h1, .post-title, .entry-title"
    }

For FAQ Pages

    {
      "contentSelectors": ".faq, .faq-item, .question-answer, article",
      "minimumTextLength": "100"
    }

🚨 Important Notes

  • Respect robots.txt: The Actor respects robots.txt by default. Disable only if you have permission
  • Rate Limiting: Built-in delays prevent overloading target servers
  • Content Filtering: Use minimumTextLength to filter out navigation and boilerplate
  • Proxy Usage: Apify Proxy helps avoid IP blocking and rate limits

🤝 Contributing

This Actor follows Apify Actor best practices:

  • Uses CheerioCrawler for fast static HTML scraping
  • Implements proper error handling and retry logic
  • Respects website terms and robots.txt
  • Provides clean, structured output

📄 License

ISC

💡 Tips for Best Results

  1. Start Small: Test with a few URLs first to verify selectors work
  2. Adjust Selectors: Different sites need different CSS selectors; customize as needed
  3. Set Depth Carefully: Higher depth = more pages but longer runtime
  4. Filter by Length: Use minimumTextLength to avoid capturing navigation/headers
  5. Monitor Progress: Check the Apify Console for real-time crawling progress

Built with ❤️ using Apify SDK and Crawlee