LLM-Ready Web Scraper

Pricing

$2.50/month + usage

Convert web pages to clean, LLM-friendly text. Perfect for RAG pipelines, AI chatbot training, and fine-tuning datasets. Removes ads, menus, and clutter automatically.

Rating: 0.0 (0)

Developer: batuhan senavci (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 0 monthly active users

Last modified: 3 days ago

Converts web pages to clean, LLM-friendly formats. Perfect for building AI applications.

Use Cases

  • RAG Pipelines: Get chunked content ready for vector databases
  • Fine-tuning Datasets: Export as JSONL for LLM training
  • Knowledge Bases: Build AI chatbot training data
  • Content Extraction: Clean text without ads, menus, or clutter

Features

  • Automatic content extraction (removes ads, navigation, footers)
  • Multiple output formats: Markdown, JSON, JSONL
  • Optional chunking with overlap for RAG
  • Batch URL processing
  • Metadata extraction (title, description, domain)

Output Formats

Markdown

---
title: "Page Title"
url: https://example.com/page
domain: example.com
scraped_at: 2024-01-15T10:30:00Z
---
Clean page content here...
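
The Markdown output above starts with a header block delimited by `---` lines. A minimal sketch of how a downstream script might split that header from the body, assuming the header only ever contains flat `key: value` lines as in the sample (the function name `parse_front_matter` is hypothetical, not part of the actor):

```python
def parse_front_matter(document: str) -> tuple[dict[str, str], str]:
    """Split a '---' delimited header from the body of a scraped page.

    Assumes flat `key: value` lines only, as in the sample output.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no header present, return the text unchanged

    # Find the line that closes the header block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, document  # unterminated header, treat everything as body

    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip().strip('"')

    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body
```

For headers with nested or multi-line values, a full YAML parser would be a safer choice than this line-based split.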

JSON

{
  "url": "https://example.com",
  "success": true,
  "content": "Clean text content...",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description"
  },
  "word_count": 1500
}

JSONL (Fine-tuning)

{"prompt": "Content from Page Title:", "completion": "Clean text content..."}
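
JSONL stores one JSON object per line. As a rough sketch of how records in the JSON format shown earlier could be turned into prompt/completion lines of this shape (the helper `to_jsonl` is hypothetical, and the actor's own conversion may differ):

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Build prompt/completion lines from JSON-format results.

    Assumes each record has the `success`, `content`, and `metadata.title`
    fields shown in the JSON sample; failed pages are skipped.
    """
    lines = []
    for record in records:
        if not record.get("success"):
            continue
        title = record.get("metadata", {}).get("title", "")
        pair = {
            "prompt": f"Content from {title}:",
            "completion": record["content"],
        }
        # json.dumps escapes newlines in the text, so each record stays on one line.
        lines.append(json.dumps(pair, ensure_ascii=False))
    return "\n".join(lines)
```

Writing the returned string to a `.jsonl` file gives one training example per line.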

With Chunks (RAG-ready)

{
  "chunks": [
    {"chunk_id": 0, "text": "First chunk...", "word_count": 500},
    {"chunk_id": 1, "text": "Second chunk...", "word_count": 500}
  ],
  "chunk_count": 5
}
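
To illustrate what chunking with overlap means, here is a minimal word-based sketch that produces the same fields as the sample above. It assumes the overlap is counted in words and that chunks are cut on whitespace; the actor's actual splitting logic is not documented here, and `chunk_words` is a hypothetical name:

```python
def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> dict:
    """Split text into word chunks where consecutive chunks share some words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        chunks.append({
            "chunk_id": len(chunks),
            "text": " ".join(piece),
            "word_count": len(piece),
        })
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return {"chunks": chunks, "chunk_count": len(chunks)}
```

With the defaults, each chunk after the first repeats the final 50 words of the previous one, which helps a retriever keep context that would otherwise be cut at a chunk boundary.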

Input Parameters

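| Parameter      | Type    | Default  | Description                          |
|----------------|---------|----------|--------------------------------------|
| url            | string  | -        | Single URL to scrape                 |
| urls           | array   | -        | Multiple URLs for batch processing   |
| outputFormat   | string  | markdown | Output format: markdown, json, jsonl |
| includeChunks  | boolean | false    | Split into RAG-ready chunks          |
| chunkSize      | integer | 500      | Words per chunk                      |
| chunkOverlap   | integer | 50       | Overlap between chunks               |
| maxConcurrency | integer | 5        | Parallel scraping limit              |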

Example Input

{
  "urls": [
    "https://docs.python.org/3/tutorial/",
    "https://docs.python.org/3/library/"
  ],
  "outputFormat": "json",
  "includeChunks": true,
  "chunkSize": 500
}
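
An input like the one above could be assembled and submitted from Python roughly as follows. This is a sketch under several assumptions: it uses the third-party `apify-client` package, the actor ID is a placeholder because this page does not state it, and results are assumed to land in the run's default dataset. The helper names `build_run_input` and `run_scraper` are hypothetical.

```python
from typing import Any

# Placeholder: the real actor ID is not given on this page.
ACTOR_ID = "<username>/<actor-name>"

ALLOWED_FORMATS = {"markdown", "json", "jsonl"}


def build_run_input(
    urls: list[str],
    output_format: str = "markdown",
    include_chunks: bool = False,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
    max_concurrency: int = 5,
) -> dict[str, Any]:
    """Assemble an input object using the parameter names from the table."""
    if not urls:
        raise ValueError("at least one URL is required")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "includeChunks": include_chunks,
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "maxConcurrency": max_concurrency,
    }


def run_scraper(token: str, run_input: dict[str, Any]) -> list[dict]:
    """Start the actor and collect its items (assumes default-dataset output)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not finish successfully")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Validating the input locally before starting a run avoids paying for a run that fails on a mistyped format name.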

Pricing

Pay only for what you use. Typical cost: $0.01-0.05 per URL depending on page size.
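
For budgeting, the figures on this page can be combined into a rough monthly range. The sketch below assumes the $2.50 monthly fee and the per-URL usage cost simply add up, and that the $0.01-0.05 range holds for every page; actual billing may differ.

```python
MONTHLY_FEE = 2.50        # flat fee listed on this page, in USD
COST_PER_URL_LOW = 0.01   # lower end of the typical per-URL cost
COST_PER_URL_HIGH = 0.05  # upper end of the typical per-URL cost


def estimate_monthly_cost(url_count: int) -> tuple[float, float]:
    """Return a (low, high) estimate in USD for scraping url_count pages in a month."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    low = MONTHLY_FEE + url_count * COST_PER_URL_LOW
    high = MONTHLY_FEE + url_count * COST_PER_URL_HIGH
    return round(low, 2), round(high, 2)


print(estimate_monthly_cost(1000))  # (12.5, 52.5)
```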