AI Content Scraper & Cleaner
An Apify Actor that scrapes structured content (documentation, articles, FAQs, blog posts) and automatically converts it into clean, normalized JSON datasets suitable for LLM training and fine-tuning.
🚀 Features
- Intelligent Content Extraction: Automatically extracts main content using configurable CSS selectors
- Content Type Detection: Automatically detects content types (FAQ, article, guide, documentation, blog)
- Text Cleaning: Removes HTML tags, scripts, styles, and normalizes whitespace
- Token Estimation: Estimates token counts for LLM training (useful for dataset planning)
- Language Detection: Optional language filtering support
- Respectful Crawling: Honors robots.txt and implements rate limiting
- Proxy Support: Built-in Apify Proxy integration for reliable scraping
- Structured Output: Clean JSON dataset items with metadata
📋 Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | string | – | Comma-separated list of URLs to start crawling from (required) |
| `maxRequestsPerCrawl` | string | `"50"` | Maximum number of requests allowed for this run |
| `contentSelectors` | string | `"article, .doc-content, .post-content"` | Comma-separated CSS selectors for main content extraction |
| `titleSelectors` | string | `"h1, .post-title"` | Comma-separated CSS selectors for title extraction |
| `minimumTextLength` | string | `"300"` | Ignore content shorter than this many characters |
| `contentType` | string | `"auto"` | Content type override (`auto`, `faq`, `article`, `guide`, `documentation`, `blog`, `other`) |
| `maxDepth` | string | `"2"` | Maximum link-following depth from the start URLs |
| `respectRobotsTxt` | string | `"true"` | Whether to honor robots.txt rules |
| `useProxy` | string | `"true"` | Rotate proxies via Apify Proxy when available |
| `language` | string | `""` | Optional language code filter (e.g., `"en"`) |
📤 Output
The Actor outputs structured JSON dataset items with the following fields:
- url: Source URL of the scraped content
- title: Extracted page title
- content: Cleaned text content (HTML removed, normalized)
- contentType: Detected or specified content type
- tokensEstimate: Estimated token count for LLM training
- language: Detected or specified language code
- extractedAt: ISO timestamp of extraction
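For reference, a single dataset item might look like the following (all values below are illustrative, not real output):

```json
{
  "url": "https://example.com/docs/getting-started",
  "title": "Getting Started",
  "content": "This guide walks you through installation and basic configuration...",
  "contentType": "guide",
  "tokensEstimate": 412,
  "language": "en",
  "extractedAt": "2025-01-15T10:30:00.000Z"
}
```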
🛠️ Installation & Usage
Prerequisites
- Node.js 20+ installed
- Apify CLI installed (Installation Guide)
Local Development
1. Clone or navigate to the Actor directory:

   ```bash
   cd AI-Ready-Dataset
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Configure input: edit `input.json` with your target URLs:

   ```json
   {
     "startUrls": "https://example.com/docs, https://example.com/blog",
     "maxRequestsPerCrawl": "100",
     "minimumTextLength": "300"
   }
   ```

4. Run locally:

   ```bash
   apify run
   ```

5. View results: check `storage/datasets/default/` for the scraped data.
Deploy to Apify Cloud
1. Authenticate:

   ```bash
   apify login
   ```

2. Deploy:

   ```bash
   apify push
   ```

3. Run on Apify:
   - Use the Apify Console UI, or
   - Use the CLI: `apify call <actor-id>`
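You can also trigger runs programmatically with the official `apify-client` npm package. A minimal sketch (the token, Actor ID, and input values below are placeholders):

```js
// Minimal sketch using the official apify-client package (npm install apify-client).
// The token and Actor ID are placeholders; replace them with your own values.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start a run with the desired input and wait for it to finish.
const run = await client.actor('<actor-id>').call({
    startUrls: 'https://example.com/docs',
    maxRequestsPerCrawl: '50',
});

// Fetch the scraped items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} items`);
```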
📝 Example Input
{"startUrls": "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors","maxRequestsPerCrawl": "50","contentSelectors": "article, .doc-content, .post-content","titleSelectors": "h1, .post-title","minimumTextLength": "300","contentType": "auto","maxDepth": "2","respectRobotsTxt": "true","useProxy": "true","language": "en"}
🎯 Use Cases
- LLM Training Data Collection: Scrape documentation and articles for fine-tuning language models
- Knowledge Base Building: Extract structured content from documentation sites
- Content Analysis: Collect and analyze text content from multiple sources
- Dataset Creation: Build custom datasets for machine learning projects
- Content Migration: Extract content from websites for migration or archival
🔧 How It Works
- URL Discovery: Starts from provided URLs and follows links up to the specified depth
- Content Extraction: Uses CSS selectors to extract main content and titles
- Text Cleaning: Removes HTML, scripts, styles, and normalizes whitespace
- Content Classification: Automatically detects content type using heuristics
- Token Estimation: Calculates approximate token counts for LLM training
- Data Storage: Saves cleaned, structured data to Apify Dataset
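As a rough illustration of this pipeline (not the Actor's actual source), a Crawlee-based implementation could look like the sketch below; the selectors, the 300-character threshold, and the chars/4 token heuristic are simplified assumptions:

```js
// Illustrative sketch of the scrape -> clean -> store pipeline (not the Actor's
// actual source). Selectors, thresholds, and the token heuristic are assumptions.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, $, enqueueLinks }) {
        // Strip boilerplate elements before extracting text.
        $('script, style, nav, footer').remove();

        const title = $('h1, .post-title').first().text().trim();
        const content = $('article, .doc-content, .post-content')
            .first()
            .text()
            .replace(/\s+/g, ' ') // normalize whitespace
            .trim();

        // Skip pages that are too short to be useful training data.
        if (content.length >= 300) {
            await Dataset.pushData({
                url: request.loadedUrl,
                title,
                content,
                tokensEstimate: Math.ceil(content.length / 4), // rough ~4 chars/token heuristic
                extractedAt: new Date().toISOString(),
            });
        }

        // Follow same-domain links (depth tracking omitted for brevity).
        await enqueueLinks({ strategy: 'same-domain' });
    },
});

await crawler.run(['https://example.com/docs']);
```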
📊 Content Type Detection
The Actor automatically detects content types using heuristics:
- FAQ: Contains "faq" or "frequently asked" keywords
- Guide: Contains "how to", "step", or "guide" keywords
- Documentation: Contains "documentation" or "api reference" keywords
- Article: Long-form content (>1000 words) or default fallback
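A simplified version of such a heuristic might look like this (illustrative only; the Actor's exact rules may differ):

```js
// Illustrative content-type heuristic; the Actor's exact rules may differ.
function detectContentType(title, content) {
    const text = `${title} ${content}`.toLowerCase();
    if (text.includes('faq') || text.includes('frequently asked')) return 'faq';
    if (text.includes('how to') || text.includes('step') || text.includes('guide')) return 'guide';
    if (text.includes('documentation') || text.includes('api reference')) return 'documentation';
    // Long-form content (>1000 words) and everything else fall back to "article".
    return 'article';
}
```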
⚙️ Configuration Tips
For Documentation Sites
{"contentSelectors": "article, .doc-content, .documentation-content, main","titleSelectors": "h1, .doc-title, .page-title"}
For Blog Sites
{"contentSelectors": "article, .post-content, .entry-content, .blog-post","titleSelectors": "h1, .post-title, .entry-title"}
For FAQ Pages
{"contentSelectors": ".faq, .faq-item, .question-answer, article","minimumTextLength": "100"}
🚨 Important Notes
- Respect robots.txt: The Actor respects robots.txt by default. Disable only if you have permission
- Rate Limiting: Built-in delays prevent overloading target servers
- Content Filtering: Use `minimumTextLength` to filter out navigation and boilerplate
- Proxy Usage: Apify Proxy helps avoid IP blocking and rate limits
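In Crawlee, throttling of this kind is typically expressed through crawler options; a sketch under assumed values (the Actor's actual settings are not documented here):

```js
// Example Crawlee throttling options; the values shown are assumptions,
// not the Actor's documented settings.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 5,        // cap parallel requests against the target site
    maxRequestsPerMinute: 60, // built-in rate limiting between requests
    async requestHandler() { /* ... */ },
});
```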
🤝 Contributing
This Actor follows Apify Actor best practices:
- Uses CheerioCrawler for fast static HTML scraping
- Implements proper error handling and retry logic
- Respects website terms and robots.txt
- Provides clean, structured output
📄 License
ISC
🔗 Links
- Actor on Apify: View on Apify Platform
- Apify CLI: Installation Guide
💡 Tips for Best Results
- Start Small: Test with a few URLs first to verify selectors work
- Adjust Selectors: Different sites need different CSS selectors; customize as needed
- Set Depth Carefully: Higher depth = more pages but longer runtime
- Filter by Length: Use `minimumTextLength` to avoid capturing navigation/headers
- Monitor Progress: Check the Apify Console for real-time crawling progress
Built with ❤️ using Apify SDK and Crawlee