AI Content Scraper & Cleaner

An Apify Actor that scrapes structured content (documentation, articles, FAQs, blog posts) and automatically converts it into clean, normalized JSON datasets suitable for LLM training and fine-tuning.

🚀 Features

  • Intelligent Content Extraction: Automatically extracts main content using configurable CSS selectors
  • Content Type Detection: Automatically detects content types (FAQ, article, guide, documentation, blog)
  • Text Cleaning: Removes HTML tags, scripts, styles, and normalizes whitespace
  • Token Estimation: Estimates token counts for LLM training (useful for dataset planning)
  • Language Detection: Optional language filtering support
  • Respectful Crawling: Honors robots.txt and implements rate limiting
  • Proxy Support: Built-in Apify Proxy integration for reliable scraping
  • Structured Output: Clean JSON dataset items with metadata

📋 Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string | - | Comma-separated list of URLs to start crawling from (required) |
| maxRequestsPerCrawl | string | "50" | Maximum number of requests allowed for this run |
| contentSelectors | string | "article, .doc-content, .post-content" | Comma-separated CSS selectors for main content extraction |
| titleSelectors | string | "h1, .post-title" | Comma-separated CSS selectors for title extraction |
| minimumTextLength | string | "300" | Ignore content shorter than this many characters |
| contentType | string | "auto" | Content type override (auto, faq, article, guide, documentation, blog, other) |
| maxDepth | string | "2" | Maximum link-following depth from the start URLs |
| respectRobotsTxt | string | "true" | Whether to honor robots.txt rules |
| useProxy | string | "true" | Rotate proxies via Apify Proxy when available |
| language | string | "" | Optional language code filter (e.g., "en") |

📤 Output

The Actor outputs structured JSON dataset items with the following fields:

  • url: Source URL of the scraped content
  • title: Extracted page title
  • content: Cleaned text content (HTML removed, normalized)
  • contentType: Detected or specified content type
  • tokensEstimate: Estimated token count for LLM training
  • language: Detected or specified language code
  • extractedAt: ISO timestamp of extraction
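
For example, a single dataset item might look like this (the values shown are illustrative, not real output):

    {
      "url": "https://example.com/docs/getting-started",
      "title": "Getting Started",
      "content": "This guide walks you through installation and configuration...",
      "contentType": "guide",
      "tokensEstimate": 412,
      "language": "en",
      "extractedAt": "2025-01-15T10:30:00.000Z"
    }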

🛠️ Installation & Usage

Prerequisites

  • Node.js and npm
  • Apify CLI (npm install -g apify-cli)

Local Development

  1. Clone or navigate to the Actor directory:

    $ cd AI-Ready-Dataset

  2. Install dependencies:

    $ npm install

  3. Configure input: edit input.json with your target URLs:

    {
      "startUrls": "https://example.com/docs, https://example.com/blog",
      "maxRequestsPerCrawl": "100",
      "minimumTextLength": "300"
    }

  4. Run locally:

    $ apify run

  5. View results: check storage/datasets/default/ for the scraped data (see the snippet below for reading it programmatically)
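
When run locally, the Apify CLI saves each dataset item as a numbered JSON file under storage/datasets/default/. A minimal sketch for loading all items into memory, assuming that default local storage layout:

    // read-results.js: load locally stored dataset items
    const fs = require('fs');
    const path = require('path');

    const datasetDir = path.join('storage', 'datasets', 'default');

    // Each dataset item is stored as its own JSON file (e.g. 000000001.json).
    const items = fs
      .readdirSync(datasetDir)
      .filter((file) => file.endsWith('.json'))
      .map((file) => JSON.parse(fs.readFileSync(path.join(datasetDir, file), 'utf8')));

    console.log(`Loaded ${items.length} items`);
    console.log(items[0]);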

Deploy to Apify Cloud

  1. Authenticate:

    $ apify login

  2. Deploy:

    $ apify push

  3. Run on Apify:

    • Use the Apify Console UI, or
    • Use the CLI: apify call <actor-id>
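
You can also start runs programmatically with the official apify-client package. A minimal sketch; the Actor ID and input values below are placeholders:

    // run-actor.js: start the Actor on the Apify platform and fetch its results
    const { ApifyClient } = require('apify-client');

    const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

    (async () => {
      // Replace <actor-id> with your Actor's ID, e.g. "username/ai-content-scraper-cleaner".
      const run = await client.actor('<actor-id>').call({
        startUrls: 'https://example.com/docs',
        maxRequestsPerCrawl: '50',
        minimumTextLength: '300',
      });

      // Read the items the run pushed to its default dataset.
      const { items } = await client.dataset(run.defaultDatasetId).listItems();
      console.log(`Scraped ${items.length} pages`);
    })();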

📝 Example Input

    {
      "startUrls": "https://crawlee.dev/docs/introduction, https://docs.apify.com/platform/actors",
      "maxRequestsPerCrawl": "50",
      "contentSelectors": "article, .doc-content, .post-content",
      "titleSelectors": "h1, .post-title",
      "minimumTextLength": "300",
      "contentType": "auto",
      "maxDepth": "2",
      "respectRobotsTxt": "true",
      "useProxy": "true",
      "language": "en"
    }

🎯 Use Cases

  • LLM Training Data Collection: Scrape documentation and articles for fine-tuning language models
  • Knowledge Base Building: Extract structured content from documentation sites
  • Content Analysis: Collect and analyze text content from multiple sources
  • Dataset Creation: Build custom datasets for machine learning projects
  • Content Migration: Extract content from websites for migration or archival

🔧 How It Works

  1. URL Discovery: Starts from provided URLs and follows links up to the specified depth
  2. Content Extraction: Uses CSS selectors to extract main content and titles
  3. Text Cleaning: Removes HTML, scripts, styles, and normalizes whitespace
  4. Content Classification: Automatically detects content type using heuristics
  5. Token Estimation: Calculates approximate token counts for LLM training
  6. Data Storage: Saves cleaned, structured data to Apify Dataset
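
A rough sketch of the cleaning and token-estimation steps (the Actor's internal implementation may differ; the ~4 characters-per-token ratio is a common approximation for English text, not an exact count):

    const cheerio = require('cheerio');

    // Extract the main content from a page, strip markup, and estimate tokens.
    function cleanAndEstimate(html, contentSelector = 'article') {
      const $ = cheerio.load(html);

      // Drop non-content elements before extracting text.
      $('script, style, noscript, nav, footer').remove();

      // Take the text of the main content element and normalize whitespace.
      const content = $(contentSelector).text().replace(/\s+/g, ' ').trim();

      // Rough token estimate: ~4 characters per token.
      const tokensEstimate = Math.ceil(content.length / 4);

      return { content, tokensEstimate };
    }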

📊 Content Type Detection

The Actor automatically detects content types using heuristics:

  • FAQ: Contains "faq" or "frequently asked" keywords
  • Guide: Contains "how to", "step", or "guide" keywords
  • Documentation: Contains "documentation" or "api reference" keywords
  • Article: Long-form content (>1000 words) or default fallback
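
A minimal sketch of these heuristics as a classifier (the keyword sets follow the list above; the Actor's actual logic may differ in detail):

    // Classify cleaned text into a content type using simple keyword checks.
    function detectContentType(text) {
      const lower = text.toLowerCase();
      if (lower.includes('faq') || lower.includes('frequently asked')) return 'faq';
      if (lower.includes('how to') || lower.includes('step') || lower.includes('guide')) return 'guide';
      if (lower.includes('documentation') || lower.includes('api reference')) return 'documentation';
      return 'article'; // long-form content or default fallback
    }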

⚙️ Configuration Tips

For Documentation Sites

    {
      "contentSelectors": "article, .doc-content, .documentation-content, main",
      "titleSelectors": "h1, .doc-title, .page-title"
    }

For Blog Sites

    {
      "contentSelectors": "article, .post-content, .entry-content, .blog-post",
      "titleSelectors": "h1, .post-title, .entry-title"
    }

For FAQ Pages

    {
      "contentSelectors": ".faq, .faq-item, .question-answer, article",
      "minimumTextLength": "100"
    }

🚨 Important Notes

  • Respect robots.txt: The Actor respects robots.txt by default. Disable only if you have permission
  • Rate Limiting: Built-in delays prevent overloading target servers
  • Content Filtering: Use minimumTextLength to filter out navigation and boilerplate
  • Proxy Usage: Apify Proxy helps avoid IP blocking and rate limits

🤝 Contributing

This Actor follows Apify Actor best practices:

  • Uses CheerioCrawler for fast static HTML scraping
  • Implements proper error handling and retry logic
  • Respects website terms and robots.txt
  • Provides clean, structured output

📄 License

ISC

💡 Tips for Best Results

  1. Start Small: Test with a few URLs first to verify selectors work
  2. Adjust Selectors: Different sites need different CSS selectors; customize as needed
  3. Set Depth Carefully: Higher depth = more pages but longer runtime
  4. Filter by Length: Use minimumTextLength to avoid capturing navigation/headers
  5. Monitor Progress: Check the Apify Console for real-time crawling progress

Built with ❤️ using Apify SDK and Crawlee