LLM-Ready Web Scraper
Pricing
$2.50/month + usage
Convert web pages to clean, LLM-friendly text. Perfect for RAG pipelines, AI chatbot training, and fine-tuning datasets. Removes ads, menus, and clutter automatically.
Developer: batuhan senavci (maintained by community)
Converts web pages to clean, LLM-friendly formats. Perfect for building AI applications.
Use Cases
- RAG Pipelines: Get chunked content ready for vector databases
- Fine-tuning Datasets: Export as JSONL for LLM training
- Knowledge Bases: Build AI chatbot training data
- Content Extraction: Clean text without ads, menus, or clutter
Features
- Automatic content extraction (removes ads, navigation, footers)
- Multiple output formats: Markdown, JSON, JSONL
- Optional chunking with overlap for RAG
- Batch URL processing
- Metadata extraction (title, description, domain)
Output Formats
Markdown
    ---
    title: "Page Title"
    url: https://example.com/page
    domain: example.com
    scraped_at: 2024-01-15T10:30:00Z
    ---
    Clean page content here...
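If you need the metadata back as a dictionary, the front matter shown above can be split off with a few lines of Python. This is a minimal sketch that assumes the output always uses simple `key: value` lines between two `---` delimiters, as in the example; `parse_front_matter` is a hypothetical helper, not part of the actor.

```python
def parse_front_matter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body) using '---' delimiter lines.

    Assumes flat `key: value` pairs only (no nested YAML).
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no front matter present

    metadata: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the page content.
            body = "\n".join(lines[index + 1:]).strip()
            return metadata, body
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip().strip('"')

    return {}, markdown  # closing delimiter never found


sample = """---
title: "Page Title"
url: https://example.com/page
domain: example.com
scraped_at: 2024-01-15T10:30:00Z
---
Clean page content here..."""

meta, body = parse_front_matter(sample)
print(meta["title"], meta["domain"])
print(body)
```

For front matter with nested or multi-line values, a real YAML parser would be the safer choice.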
JSON
    {
      "url": "https://example.com",
      "success": true,
      "content": "Clean text content...",
      "metadata": {
        "title": "Page Title",
        "description": "Meta description"
      },
      "word_count": 1500
    }
JSONL (Fine-tuning)
    {"prompt": "Content from Page Title:", "completion": "Clean text content..."}
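If you already have results in the JSON format, records of the same shape as the JSONL example can be produced locally. The sketch below assumes the prompt is always `Content from <title>:`, which is inferred from the single example above; `to_jsonl` is a hypothetical helper name.

```python
import json


def to_jsonl(results: list[dict]) -> str:
    """Turn JSON-format results into prompt/completion lines, one per page."""
    lines = []
    for result in results:
        if not result.get("success"):
            continue  # skip pages that failed to scrape
        title = result.get("metadata", {}).get("title", "")
        record = {
            "prompt": f"Content from {title}:",  # prompt template is an assumption
            "completion": result["content"],
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


results = [
    {
        "url": "https://example.com",
        "success": True,
        "content": "Clean text content...",
        "metadata": {"title": "Page Title", "description": "Meta description"},
        "word_count": 1500,
    },
    {"url": "https://example.com/broken", "success": False},
]

jsonl = to_jsonl(results)
print(jsonl)
```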
With Chunks (RAG-ready)
    {
      "chunks": [
        {"chunk_id": 0, "text": "First chunk...", "word_count": 500},
        {"chunk_id": 1, "text": "Second chunk...", "word_count": 500}
      ],
      "chunk_count": 5
    }
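To make the chunk fields concrete, here is one way word-based chunking with overlap could work. This is a sketch under stated assumptions, not the actor's actual implementation: it assumes whitespace-separated words, an overlap measured in words, and a window that advances by `chunk_size - chunk_overlap` each step.

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[dict]:
    """Split text into overlapping word windows shaped like the chunk output."""
    if chunk_size < 1 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= chunk_overlap < chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks: list[dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "chunk_id": len(chunks),
            "text": " ".join(window),
            "word_count": len(window),
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(text)
print({"chunk_count": len(chunks), "sizes": [c["word_count"] for c in chunks]})
```

With the defaults, consecutive chunks share their boundary 50 words, which helps a retriever keep context that would otherwise be cut at a chunk edge.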
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | Single URL to scrape |
| urls | array | - | Multiple URLs for batch processing |
| outputFormat | string | markdown | Output format: markdown, json, jsonl |
| includeChunks | boolean | false | Split into RAG-ready chunks |
| chunkSize | integer | 500 | Words per chunk |
| chunkOverlap | integer | 50 | Overlap between chunks |
| maxConcurrency | integer | 5 | Parallel scraping limit |
Example Input
    {
      "urls": [
        "https://docs.python.org/3/tutorial/",
        "https://docs.python.org/3/library/"
      ],
      "outputFormat": "json",
      "includeChunks": true,
      "chunkSize": 500
    }
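The example input can also be sent from Python with the `apify-client` package. This is a sketch, not official usage: `ACTOR_ID` is a placeholder you must replace with the actor's real ID from its store page, and reading results from the run's default dataset is an assumption based on how Apify actors commonly store output.

```python
ACTOR_ID = "<username>/<actor-name>"  # placeholder, not the real actor ID

RUN_INPUT = {
    "urls": [
        "https://docs.python.org/3/tutorial/",
        "https://docs.python.org/3/library/",
    ],
    "outputFormat": "json",
    "includeChunks": True,
    "chunkSize": 500,
}


def run_scraper(token: str, actor_id: str = ACTOR_ID) -> list[dict]:
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so the module loads without the dependency installed
    # (pip install apify-client).
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=RUN_INPUT)
    # Assumption: the actor writes one item per URL to the default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage, with your API token stored in the APIFY_TOKEN environment variable:
#   import os
#   items = run_scraper(os.environ["APIFY_TOKEN"])
#   print(f"Fetched {len(items)} items")
```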
Pricing
Usage is billed on top of the $2.50/month subscription, so beyond that you pay only for what you scrape. Typical cost: $0.01 to $0.05 per URL, depending on page size.
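For budgeting, the figures on this page can be turned into a rough monthly range. The sketch below assumes the $2.50/month fee listed at the top is charged once per month and that the per-URL range holds for your pages; actual billing may differ, and `estimate_monthly_cost` is a hypothetical helper.

```python
def estimate_monthly_cost(
    url_count: int,
    monthly_fee: float = 2.50,
    per_url_low: float = 0.01,
    per_url_high: float = 0.05,
) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for one month of scraping."""
    low = monthly_fee + url_count * per_url_low
    high = monthly_fee + url_count * per_url_high
    return round(low, 2), round(high, 2)


low, high = estimate_monthly_cost(1000)
print(f"1000 URLs: ${low:.2f} to ${high:.2f}")
```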