LLM-Ready Web Scraper – RAG & Vertical Data Extraction
Under maintenance
Pricing: from $5.00 / 1,000 URLs crawled
Scrapes any URL and returns clean LLM-ready content. Strips ads, nav, and boilerplate. Returns markdown, chunked text, token estimates, and metadata. Vertical modes for Legal, Medical, Property, E-commerce, Research, and News. Firecrawl alternative at $0.005 per URL.
Developer: joseph fadero
Maintained by Community
Actor stats: 1 bookmarked · 2 total users · 1 monthly active user · Last modified 15 hours ago
LLM-Ready Web Scraper – RAG Data Extraction with Vertical Processing
The affordable Firecrawl alternative. $0.005 per URL. No subscription.
Scrapes any public URL and returns clean, structured content optimised for LLMs and RAG pipelines — stripped of navigation, ads, cookie banners, and HTML boilerplate.
What makes it different
- Vertical processing modes — Legal, Medical, Property, E-commerce, Research, and News modes apply domain-specific extraction rules for better content quality
- RAG-ready chunking — splits content into configurable token-sized chunks ready for embedding
- Token estimation — every result includes estimated token count so you know your LLM context usage upfront
- Pay per URL — $0.005/URL, no subscription
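The chunking and token-estimation features above can be sketched in a few lines of Python. This is an illustrative approximation using the common ~4-characters-per-token heuristic, not the actor's actual implementation; the function names and heuristic are assumptions.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def chunk_by_tokens(text: str, chunk_token_size: int = 512) -> list[dict]:
    """Split text on word boundaries into chunks near a target token size.

    Each chunk mirrors the actor's documented shape:
    { index, content, tokenEstimate }.
    """
    chunks: list[dict] = []
    current: list[str] = []
    for word in text.split():
        candidate = current + [word]
        # Close the current chunk if adding this word would exceed the target.
        if current and estimate_tokens(" ".join(candidate)) > chunk_token_size:
            content = " ".join(current)
            chunks.append({"index": len(chunks), "content": content,
                           "tokenEstimate": estimate_tokens(content)})
            current = [word]
        else:
            current = candidate
    if current:  # flush the final partial chunk
        content = " ".join(current)
        chunks.append({"index": len(chunks), "content": content,
                       "tokenEstimate": estimate_tokens(content)})
    return chunks
```

For real pipelines a proper tokenizer (e.g. tiktoken for OpenAI models) gives more accurate counts than the character heuristic.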
Use cases
- Feed RAG pipelines with fresh web content for Claude, GPT-4, or LlamaIndex
- Build AI agents that need live web data
- n8n/Make: scrape URLs from a spreadsheet → get clean markdown → send to your LLM
- Research aggregation: scrape multiple sources → chunk → embed → search
- Legal research: extract clean text from case law and statutes
- Property analysis: extract listing descriptions for AI comparison
Pricing
| Event | Price |
|---|---|
| Run started | $0.05 |
| URL crawled (no chunks) | $0.005 |
| URL crawled (with chunking) | $0.008 |
| URL failed | $0.001 |
100 URLs without chunking = $0.05 run start + 100 × $0.005 = $0.55 total. For comparison, Firecrawl's Hobby plan is $19/month for 500 URLs.
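As a sanity check of the arithmetic, the per-event prices in the table above can be folded into a small cost estimator. The function name and shape are mine, for illustration only:

```python
def estimate_run_cost(urls_ok: int, urls_failed: int = 0,
                      chunking: bool = False) -> float:
    """Estimate the total USD cost of one run, per the pricing table."""
    RUN_STARTED = 0.05                      # flat fee per run
    PER_URL = 0.008 if chunking else 0.005  # crawled URL, with/without chunking
    PER_FAILED = 0.001                      # failed URL
    return round(RUN_STARTED + urls_ok * PER_URL + urls_failed * PER_FAILED, 4)
```

For example, `estimate_run_cost(100)` reproduces the $0.55 figure quoted above.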
Input
| Field | Default | Description |
|---|---|---|
| urls | required | Array of URLs to scrape |
| outputFormat | markdown | markdown / plaintext / json |
| vertical | general | general / legal / medical / property / ecommerce / research / news |
| chunkContent | false | Split into RAG-sized chunks |
| chunkTokenSize | 512 | Target tokens per chunk (128–4096) |
| includeMetadata | true | Include title, author, dates, word/token count |
| removeElements | [] | Extra CSS selectors to strip |
| followLinks | false | Follow internal links from starting URLs |
| maxDepth | 1 | Link follow depth (1–3) |
| maxPagesPerUrl | 10 | Max pages per starting URL |
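The input fields above map directly onto a run-input object. A minimal sketch in Python, using only the documented fields (the actor ID in the comment is a placeholder, not the real one; check the actor page):

```python
# Run input mirroring the documented fields; values here are illustrative.
run_input = {
    "urls": ["https://example.com/article"],    # required
    "outputFormat": "markdown",                 # markdown / plaintext / json
    "vertical": "legal",                        # domain-specific extraction rules
    "chunkContent": True,                       # split into RAG-sized chunks
    "chunkTokenSize": 512,                      # target tokens per chunk (128-4096)
    "includeMetadata": True,                    # title, author, dates, counts
    "removeElements": [".sidebar", "#comments"],  # extra CSS selectors to strip
    "followLinks": False,                       # stay on the starting URLs
}

# With the apify-client package, a run would look roughly like this
# (placeholder token and actor ID):
#
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<username>/llm-ready-web-scraper").call(run_input=run_input)
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["url"], item["estimatedTokens"])
```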
Output fields
- url, sourceUrl, crawledAt
- title, description, author, publishDate, language
- wordCount, estimatedTokens
- content — clean text in chosen format
- vertical — which extraction mode was applied
- chunks — array of { index, content, tokenEstimate } when chunking enabled
- status — success / failed / partial
- chargedEvent
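Given a result item shaped like the fields above, pulling chunk texts out for an embedding pipeline is straightforward. A minimal sketch; the mock item below is hand-made to mirror the documented shape, not real actor output:

```python
def chunks_for_embedding(item: dict) -> list[tuple[str, str]]:
    """Return (chunk_id, text) pairs from one scraped item.

    Falls back to the whole content when chunking was disabled;
    skips items that did not succeed.
    """
    if item.get("status") != "success":
        return []
    if item.get("chunks"):
        return [(f'{item["url"]}#chunk-{c["index"]}', c["content"])
                for c in item["chunks"]]
    return [(item["url"], item["content"])]


# Mock item mirroring the documented output fields:
mock = {
    "url": "https://example.com/a",
    "status": "success",
    "content": "full text",
    "chunks": [
        {"index": 0, "content": "first part", "tokenEstimate": 3},
        {"index": 1, "content": "second part", "tokenEstimate": 3},
    ],
}
```

Pairs like these can be passed straight to an embedding model and a vector store.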
Example n8n workflow
Apify node → this actor → Claude AI node → Google Sheets