Quick Website Content Scraper (Extract Text for RAG & LLMs)

Extract clean text from any website for AI/LLM applications. Supports both static and JavaScript-rendered sites (React, Vue, Angular). Perfect for RAG systems, chatbot training, and content analysis.

Pricing: Pay per usage

AI Web Content Scraper

Extract clean, structured text from any website - perfect for feeding into AI models, LLMs, and RAG systems.

🚀 Features

  • Universal Compatibility: Works with both static HTML and JavaScript-rendered websites (React, Vue, Angular, Next.js)
  • AI-Optimized Output: Clean text with line breaks, ready for LLM consumption
  • Smart Detection: Automatically detects and switches to browser mode for JS-heavy sites
  • Blazing Fast: Uses HTTP for static sites, only uses browser when needed
  • Batch Processing: Scrape multiple URLs in one run
  • Zero Configuration: Just provide URLs and go

💡 Use Cases

  • RAG Systems: Feed website content into vector databases for AI retrieval
  • LLM Training: Collect clean text data for fine-tuning language models
  • Content Analysis: Extract text for sentiment analysis, summarization, or classification
  • Knowledge Bases: Build AI-powered chatbots with website content
  • Research: Gather structured data from multiple sources

📋 Input

```json
{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://another-site.com" }
  ],
  "maxPages": 100
}
```

Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| startUrls | array | Yes | - | List of URLs to scrape |
| maxPages | integer | No | 100 | Maximum number of pages to process |
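As a minimal sketch of how this input could be validated before scraping (the helper name `validate_input` and the exact error handling are illustrative, not the Actor's actual code):

```python
def validate_input(actor_input: dict) -> tuple[list[str], int]:
    """Extract and sanity-check the startUrls/maxPages fields described above."""
    start_urls = actor_input.get("startUrls") or []
    if not start_urls:
        raise ValueError("startUrls is required and must be a non-empty array")
    # Each entry follows the { "url": "..." } shape shown in the input example.
    urls = [entry["url"] for entry in start_urls]
    max_pages = int(actor_input.get("maxPages", 100))
    return urls[:max_pages], max_pages
```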

📤 Output

Each scraped page produces:

```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "text": "All extracted text content...",
  "wordCount": 1250,
  "scrapedAt": "2026-01-19T21:18:43Z"
}
```

Output Fields

  • url: Original URL scraped
  • title: Page title from <title> tag
  • text: Complete text content with line breaks preserved
  • wordCount: Total number of words extracted
  • scrapedAt: ISO timestamp of when the page was scraped
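A record in this shape can be assembled from the extracted text in a few lines. This is a sketch of the documented output contract, not the Actor's internal code; `build_record` is a hypothetical helper:

```python
from datetime import datetime, timezone

def build_record(url: str, title: str, text: str) -> dict:
    """Package one scraped page into the output shape documented above."""
    return {
        "url": url,
        "title": title,
        "text": text,
        # wordCount is a whitespace-delimited word total of the extracted text.
        "wordCount": len(text.split()),
        # scrapedAt is an ISO 8601 UTC timestamp, e.g. 2026-01-19T21:18:43Z.
        "scrapedAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```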

🎯 How It Works

  1. Fetch: Makes HTTP request to each URL
  2. Detect: Analyzes if the page is JavaScript-rendered
  3. Extract: Uses fast HTTP mode for static sites, or switches to Playwright browser for JS-rendered sites
  4. Clean: Removes scripts, styles, navigation, and returns only the main content
  5. Store: Saves structured data to dataset
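The "Detect" step can be approximated with a simple heuristic: if the fetched HTML yields almost no visible text outside scripts and chrome, the page is likely rendered client-side and needs a browser. The sketch below uses only the standard library's `html.parser` (the real Actor uses Beautiful Soup and Playwright, and its actual detection logic may differ; the `min_words` threshold is an assumption):

```python
from html.parser import HTMLParser

# Tags whose contents the "Clean" step discards.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}

class TextExtractor(HTMLParser):
    """Collect visible text, skipping anything inside SKIP_TAGS."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def looks_js_rendered(html: str, min_words: int = 25) -> bool:
    """Heuristic: a near-empty body suggests the content is built by JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    visible_words = sum(len(chunk.split()) for chunk in parser.chunks)
    return visible_words < min_words
```

When this returns `True`, the pipeline would fall back to a Playwright-driven browser fetch; otherwise the already-fetched HTML is parsed directly, which is what keeps static sites fast.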

🔧 Performance

  • Static Sites: ~0.5-2 seconds per page
  • JS-Rendered Sites: ~3-5 seconds per page (includes browser rendering)
  • Throughput: 100+ pages per run, configurable via maxPages

💻 Technology

  • Python 3.14
  • Apify SDK: Actor framework and storage
  • Playwright: Browser automation for JS-rendered sites
  • Beautiful Soup: HTML parsing and text extraction
  • HTTPX: Fast async HTTP client

📚 Examples

Example 1: RAG System Data Collection

```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" },
    { "url": "https://docs.apify.com/" },
    { "url": "https://playwright.dev/" }
  ],
  "maxPages": 50
}
```

Example 2: Single Page Extraction

```json
{
  "startUrls": [
    { "url": "https://blog.example.com/article" }
  ],
  "maxPages": 1
}
```

🔒 Privacy & Compliance

  • Respects standard web scraping practices
  • No personal data collection
  • Works only with publicly accessible content
  • Users are responsible for compliance with each site's Terms of Service

🆘 Support

For issues or questions:

  • Check the Apify documentation
  • Open an issue in the Actor's GitHub repository
  • Contact support through Apify Console

📄 License

This Actor is available for use on the Apify platform.


Made with ❤️ for the AI community