Quick Website Content Scraper ( Extract Text for RAG & LLMs )
Pricing: Pay per usage

Extract clean text from any website for AI/LLM applications. Supports both static and JavaScript-rendered sites (React, Vue, Angular). Perfect for RAG systems, chatbot training, and content analysis.
Developer: AutomateItPlease Workflow And Automaton Ops (Maintained by Community)
AI Web Content Scraper
Extract clean, structured text from any website - perfect for feeding into AI models, LLMs, and RAG systems.
🚀 Features
- Universal Compatibility: Works with both static HTML and JavaScript-rendered websites (React, Vue, Angular, Next.js)
- AI-Optimized Output: Clean text with line breaks, ready for LLM consumption
- Smart Detection: Automatically detects and switches to browser mode for JS-heavy sites
- Blazing Fast: Uses HTTP for static sites, only uses browser when needed
- Batch Processing: Scrape multiple URLs in one run
- Zero Configuration: Just provide URLs and go
💡 Use Cases
- RAG Systems: Feed website content into vector databases for AI retrieval
- LLM Training: Collect clean text data for fine-tuning language models
- Content Analysis: Extract text for sentiment analysis, summarization, or classification
- Knowledge Bases: Build AI-powered chatbots with website content
- Research: Gather structured data from multiple sources
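For the RAG use case, the usual next step is splitting the scraped text into overlapping chunks before embedding. A minimal stdlib-only sketch; the chunk size, overlap, and function name are illustrative choices, not part of this Actor:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    # step forward by chunk_size minus overlap, so adjacent chunks share context
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```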
📋 Input
```json
{
  "startUrls": [
    { "url": "https://example.com" },
    { "url": "https://another-site.com" }
  ],
  "maxPages": 100
}
```
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrls | array | Yes | - | List of URLs to scrape |
| maxPages | integer | No | 100 | Maximum number of pages to process |
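To show how these two parameters interact, here is a small sketch of input normalization that applies the documented default for maxPages. The function name and return shape are hypothetical illustrations, not the Actor's published code:

```python
def normalize_input(actor_input: dict) -> dict:
    """Validate the Actor input and apply documented defaults."""
    start_urls = actor_input.get("startUrls") or []
    if not start_urls:
        raise ValueError("startUrls is required")
    urls = [item["url"] for item in start_urls]
    # maxPages defaults to 100 per the parameter table above
    max_pages = int(actor_input.get("maxPages", 100))
    return {"urls": urls[:max_pages], "maxPages": max_pages}
```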
📤 Output
Each scraped page produces:
```json
{
  "url": "https://example.com",
  "title": "Page Title",
  "text": "All extracted text content...",
  "wordCount": 1250,
  "scrapedAt": "2026-01-19T21:18:43Z"
}
```
Output Fields
- url: Original URL scraped
- title: Page title from the <title> tag
- text: Complete text content with line breaks preserved
- wordCount: Total number of words extracted
- scrapedAt: ISO timestamp of when the page was scraped
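The derived fields above are straightforward to reproduce. This sketch shows how wordCount and scrapedAt would typically be computed for a dataset item (assumed logic based on the documented schema, not the Actor's published source):

```python
from datetime import datetime, timezone

def build_record(url: str, title: str, text: str) -> dict:
    """Assemble a dataset item matching the documented output schema."""
    return {
        "url": url,
        "title": title,
        "text": text,
        # wordCount: whitespace-delimited tokens in the extracted text
        "wordCount": len(text.split()),
        # scrapedAt: UTC timestamp in the ISO "Z" form shown above
        "scrapedAt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```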
🎯 How It Works
- Fetch: Makes an HTTP request to each URL
- Detect: Analyzes if the page is JavaScript-rendered
- Extract: Uses fast HTTP mode for static sites, or switches to Playwright browser for JS-rendered sites
- Clean: Removes scripts, styles, navigation, and returns only the main content
- Store: Saves structured data to dataset
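The Actor's exact detection logic for step 2 is not published; a plausible stdlib-only heuristic is to flag pages that load scripts but render almost no server-side text (as with an empty React root div). The threshold below is illustrative:

```python
from html.parser import HTMLParser

class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self._skip = 0          # depth inside script/style tags
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.visible_chars += len(data.strip())

def looks_js_rendered(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: a page that loads scripts but contains almost no
    server-rendered text probably needs a real browser to show content."""
    parser = _TextCounter()
    parser.feed(html)
    return parser.script_tags > 0 and parser.visible_chars < min_visible_chars
```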
🔧 Performance
- Static Sites: ~0.5-2 seconds per page
- JS-Rendered Sites: ~3-5 seconds per page (includes browser rendering)
- Throughput: 100+ pages per run (limit configurable via maxPages)
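The per-page figures above translate into rough sequential runtime bounds. A small helper, purely illustrative (real runs vary with network conditions and concurrency):

```python
def estimate_run_seconds(static_pages: int, js_pages: int) -> tuple[float, float]:
    """Lower/upper runtime bounds using the per-page ranges quoted above:
    0.5-2 s for static pages, 3-5 s for JS-rendered pages."""
    low = static_pages * 0.5 + js_pages * 3.0
    high = static_pages * 2.0 + js_pages * 5.0
    return low, high
```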
💻 Technology
- Python 3.14
- Apify SDK: Actor framework and storage
- Playwright: Browser automation for JS-rendered sites
- Beautiful Soup: HTML parsing and text extraction
- HTTPX: Fast async HTTP client
📚 Examples
Example 1: RAG System Data Collection
```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" },
    { "url": "https://docs.apify.com/" },
    { "url": "https://playwright.dev/" }
  ],
  "maxPages": 50
}
```
Example 2: Single Page Extraction
```json
{
  "startUrls": [
    { "url": "https://blog.example.com/article" }
  ],
  "maxPages": 1
}
```
🔒 Privacy & Compliance
- Respects standard web scraping practices
- No personal data collection
- Works only with publicly accessible content
- Users responsible for compliance with site ToS
🆘 Support
For issues or questions:
- Check the Apify documentation
- Open an issue in the Actor's GitHub repository
- Contact support through Apify Console
📄 License
This Actor is available for use on the Apify platform.
Made with ❤️ for the AI community