RAG Web Extractor
Extract clean markdown from websites for RAG pipelines. Strip nav, ads, boilerplate. Preserve headings, links, images. Recursive crawling with depth control. Chunked output for embedding pipelines. Build AI knowledge bases.
Pricing
from $2.00 / 1,000 pages extracted
Developer: junipr
RAG Web Content Extractor
Introduction
RAG Web Content Extractor is a production-grade web scraping actor that extracts clean, structured content from any web page and outputs it in formats optimized for LLM ingestion and RAG (Retrieval-Augmented Generation) pipelines. It handles JavaScript-rendered pages (SPAs, Next.js, Nuxt), infinite scroll, pagination, and complex DOM structures out of the box.
Primary use cases:
- Feeding web content into vector databases (Pinecone, Weaviate, Qdrant, Chroma)
- Building RAG pipelines for LLM applications
- Structured content analysis and competitive intelligence
- LLM fine-tuning data collection at scale
Key differentiators: Built-in configurable content chunking with overlap control, multi-format output (markdown, plain text, structured JSON) in a single run, full JavaScript rendering via Playwright, schema.org extraction, and content deduplication — all with zero-config defaults.
Why Use This Actor
| Feature | RAG Web Extractor | Firecrawl | web-content-crawler (Apify) | Website Content Crawler |
|---|---|---|---|---|
| JS rendering | Full (Playwright) | Full | Partial | Partial |
| Markdown output | Native | Native | Plugin | No |
| Content chunking | Built-in w/ overlap | API only | No | No |
| RAG-optimized JSON | Native | Partial | No | No |
| Infinite scroll | Full support | Limited | Buggy | No |
| schema.org extraction | Full | Partial | No | No |
| PPE pricing | $3.50/1K | $38/1K equiv | $4.90/1K | Free (low quality) |
| Zero-config | Yes | Requires API key | Mostly | Yes |
| Content deduplication | Built-in | No | No | No |
Cost comparison: At 10,000 pages/month, this actor costs $35 vs Firecrawl's ~$380 equivalent — a 90% cost reduction with more features included.
How to Use
Zero-Config Quick Start
Just provide URLs and run. Everything else has sensible defaults:
```json
{
  "startUrls": [{ "url": "https://example.com/blog" }]
}
```
That's it. The actor will extract the page content as clean markdown with full metadata. No API keys, no complex configuration.
Step-by-Step
- Go to the actor's page on Apify Console
- Add one or more URLs to the Start URLs field
- (Optional) Select additional output formats, enable chunking, or adjust other settings
- Click Start to run the actor
- When complete, download results from the Dataset tab
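The same run can also be assembled programmatically. A minimal sketch in pure Python (no network calls) that builds a run input matching the parameters documented in this README; the helper name `build_run_input` is illustrative, not part of the actor:

```python
def build_run_input(urls, chunking=False, chunk_size=1000, chunk_overlap=200,
                    max_depth=0, output_formats=None):
    """Assemble a run-input dict for the actor. Only startUrls is required;
    the other keys mirror the Input Configuration table below."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunkOverlap must be smaller than chunkSize")
    run_input = {
        "startUrls": [{"url": u} for u in urls],
        "outputFormats": output_formats or ["markdown"],
        "maxDepth": max_depth,
    }
    if chunking:
        run_input.update({
            "enableChunking": True,
            "chunkSize": chunk_size,
            "chunkOverlap": chunk_overlap,
        })
    return run_input

run_input = build_run_input(["https://example.com/blog"], chunking=True)
```

The resulting dict can be passed as the run input to the Apify API or to the `apify-client` package when starting the actor.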
Common Configuration Recipes
RAG Pipeline Basic — Markdown output with chunking for vector databases:
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "chunkStrategy": "semantic",
  "maxDepth": 2
}
```
Full Site Crawl — Crawl an entire domain for comprehensive content extraction:
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxPages": 5000,
  "maxDepth": 5,
  "outputFormats": ["markdown", "structuredJson"],
  "extractTables": true,
  "renderJs": false
}
```
JS-Heavy SPA — Extract content from React/Next.js/Vue apps:
```json
{
  "startUrls": [{ "url": "https://app.example.com/docs" }],
  "renderJs": true,
  "waitForSelector": "#main-content",
  "waitForTimeout": 10000,
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true
}
```
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | URLs to scrape |
| `maxPages` | integer | `100` | Max pages per run (1-100,000) |
| `maxDepth` | integer | `0` | Link-following depth (0 = start URLs only) |
| `outputFormats` | array | `["markdown"]` | Output formats: `markdown`, `plainText`, `structuredJson`, `html` |
| `enableChunking` | boolean | `false` | Split content into RAG-ready chunks |
| `chunkSize` | integer | `1000` | Target chunk size in characters (100-10,000) |
| `chunkOverlap` | integer | `200` | Overlap between chunks, in characters |
| `chunkStrategy` | string | `"semantic"` | Chunking strategy: `semantic`, `fixed`, `sentence` |
| `renderJs` | boolean | `true` | Use Playwright for JS rendering |
| `waitForSelector` | string | `null` | CSS selector to wait for before extraction |
| `handleInfiniteScroll` | boolean | `false` | Scroll to load lazy content |
| `handlePagination` | boolean | `false` | Follow pagination automatically |
| `removeNavigation` | boolean | `true` | Auto-remove nav/header/footer |
| `removeAds` | boolean | `true` | Auto-remove ad elements |
| `extractMetadata` | boolean | `true` | Extract OG tags, meta, JSON-LD |
| `extractTables` | boolean | `false` | Extract HTML tables as structured data |
| `deduplicateContent` | boolean | `true` | Skip duplicate pages |
See the Input Schema tab for the complete list of parameters with detailed descriptions.
Output Format
Each scraped page produces a result object with the following structure:
Markdown Output
```json
{
  "url": "https://example.com/blog/post-1",
  "statusCode": 200,
  "metadata": {
    "title": "How to Build a RAG Pipeline",
    "author": "Jane Doe",
    "wordCount": 2500,
    "readingTimeMinutes": 10.5
  },
  "content": {
    "markdown": "# How to Build a RAG Pipeline\n\nRAG (Retrieval-Augmented Generation) is..."
  }
}
```
Chunk Output
```json
{
  "chunks": [
    {
      "chunkIndex": 0,
      "totalChunks": 5,
      "text": "RAG (Retrieval-Augmented Generation) is a technique...",
      "charCount": 980,
      "tokenEstimate": 245,
      "headingContext": "Introduction",
      "metadata": {
        "sourceUrl": "https://example.com/blog/post-1",
        "chunkStrategy": "semantic",
        "chunkSize": 1000,
        "overlap": 200
      }
    }
  ]
}
```
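Chunk records in this shape can be flattened into rows for a vector-database upsert. A minimal sketch assuming the chunk record structure shown above; the helper name `to_embedding_rows` is illustrative:

```python
def to_embedding_rows(record):
    """Flatten one dataset record's chunks into (id, text, metadata) rows
    ready to embed and upsert into a vector database."""
    rows = []
    for chunk in record.get("chunks", []):
        meta = dict(chunk.get("metadata", {}))
        meta["headingContext"] = chunk.get("headingContext", "")
        # Derive a stable per-chunk id from the source URL and chunk index
        chunk_id = f'{meta.get("sourceUrl", "")}#chunk-{chunk["chunkIndex"]}'
        rows.append((chunk_id, chunk["text"], meta))
    return rows

record = {
    "chunks": [{
        "chunkIndex": 0,
        "totalChunks": 5,
        "text": "RAG (Retrieval-Augmented Generation) is a technique...",
        "headingContext": "Introduction",
        "metadata": {"sourceUrl": "https://example.com/blog/post-1"},
    }]
}
rows = to_embedding_rows(record)
```

Each row can then be passed to an embedding model and upserted with its metadata, so retrieval results link back to the source URL and heading.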
Integration with Vector Databases
LangChain (Python):
```python
from langchain.docstore.document import Document
from langchain.document_loaders import ApifyDatasetLoader

loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={"source": item["url"], "title": item["metadata"]["title"]},
    ),
)
docs = loader.load()
```
LlamaIndex (Python):
```python
from llama_index import Document, download_loader

ApifyActor = download_loader("ApifyActor")
reader = ApifyActor()
documents = reader.load_data(
    actor_id="junipr/rag-web-extractor",
    run_input={"startUrls": [{"url": "https://example.com"}]},
    dataset_mapping_function=lambda item: Document(
        text=item["content"]["markdown"],
        extra_info={"url": item["url"]},
    ),
)
```
Tips and Advanced Usage
Performance Optimization
- Set `renderJs: false` for static sites; it's 10x faster and uses less compute
- Use `includeSelectors` to target specific content areas instead of processing the entire page
- For large crawls, start with `maxPages: 10` to verify output quality before scaling up
- Set `maxDepth: 0` if you only need the start URLs (no link following)
Proxy Configuration
- Default: Apify datacenter proxies (fastest, cheapest)
- For sites that block datacenter IPs, switch to residential proxies via the proxy settings
- You can also provide your own proxy URLs
Chunking Strategy Guide
- Semantic (default): Best for most RAG use cases. Splits on paragraph/heading boundaries, preserving context. Each chunk is self-contained.
- Fixed: Best for uniform embedding sizes. Splits at exact character counts regardless of content structure.
- Sentence: Best for Q&A and chat applications. Preserves complete sentences within each chunk.
- Chunk size tip for OpenAI: use 500-1000 characters (125-250 tokens) for `text-embedding-ada-002`; use 1000-2000 for `text-embedding-3-large`.
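For intuition, the fixed strategy can be sketched in a few lines of Python. This is a simplified illustration of character-count splitting with overlap, not the actor's internal implementation:

```python
def fixed_chunks(text, chunk_size=1000, overlap=200):
    """Split text at exact character counts (the 'fixed' strategy),
    carrying `overlap` trailing characters into the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = fixed_chunks(text, chunk_size=1000, overlap=200)
```

With 2,500 characters, a 1,000-character chunk size, and 200 characters of overlap, this yields three chunks whose boundaries share 200 characters of context.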
Custom Selectors
For complex layouts, use `includeSelectors` to extract only the main content:

```json
{
  "includeSelectors": ["article.post-content", "div.documentation-body"],
  "removeSelectors": [".comments", ".related-posts", ".social-share"]
}
```
Pricing
This actor uses Pay-Per-Event (PPE) pricing at $3.50 per 1,000 extracted pages.
A billable event occurs when the actor successfully loads a URL, extracts content, and pushes the result to the dataset. You are NOT charged for failed requests, CAPTCHAs, paywalls, filtered pages, or duplicates.
Cost Examples
| Scenario | Pages | Cost |
|---|---|---|
| Blog extraction (50 posts) | 50 | $0.18 |
| Documentation site (500 pages) | 500 | $1.75 |
| News site daily scrape (200 articles) | 200 | $0.70 |
| Full site crawl (10,000 pages) | 10,000 | $35.00 |
| Enterprise RAG pipeline (100K pages/mo) | 100,000 | $350.00 |
Plus standard Apify platform compute costs based on memory and runtime.
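The extraction costs above follow directly from the per-page rate. A quick sanity check in Python (platform compute costs excluded):

```python
def extraction_cost(pages, price_per_1k=3.50):
    """Return the PPE extraction cost in dollars, rounded to cents.
    Only successfully extracted pages are billed."""
    cents = round(pages * price_per_1k * 100 / 1000)
    return cents / 100

cost = extraction_cost(10_000)  # full site crawl from the table above
```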
FAQ
How does this compare to Firecrawl?
This actor is 75-90% cheaper than Firecrawl at scale ($3.50/1K vs ~$38/1K equivalent) with no monthly subscription. It includes built-in chunking with configurable strategies, content deduplication, and runs on Apify infrastructure so there's no API key management. Firecrawl requires separate API calls for chunking and charges monthly fees on top of per-page costs.
Can it handle JavaScript-rendered pages?
Yes. When renderJs is enabled (the default), the actor uses a full Playwright browser to render pages. This handles React, Next.js, Vue, Angular, and any other SPA framework. You can also use waitForSelector to wait for specific elements to load before extraction.
What chunk size should I use for OpenAI embeddings?
For text-embedding-ada-002, use 500-1000 characters (roughly 125-250 tokens). For text-embedding-3-large, you can go up to 2000 characters. Set chunkOverlap to 100-200 characters (10-20% of chunk size) to maintain context across chunk boundaries.
Does it respect robots.txt?
Yes. The respectRobotsTxt option is enabled by default. Pages blocked by robots.txt will be skipped with a ROBOTS_BLOCKED error code. You can disable this if needed, but please be responsible.
How do I scrape pages behind a login?
Use the cookies input parameter to provide session cookies, or use httpHeaders to pass authentication tokens. For complex auth flows, consider using a pre-login actor to establish a session first.
What's the maximum number of pages per run?
Up to 100,000 pages per run. For very large crawls, increase the actor memory to 8192 MB and set an appropriate timeout (up to 24 hours).
Can I use my own proxies?
Yes. In the proxyConfiguration input, you can provide your own proxy URLs instead of using Apify's built-in proxies.
How is a "result" defined for pricing?
A result is one successfully extracted page that produces at least one non-empty output format and is pushed to the dataset. Failed requests, CAPTCHAs, paywalls, filtered pages (below minContentLength), and deduplicated pages are not charged.
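The billing rule can be expressed as a simple predicate. An illustrative sketch only, with assumptions: it treats any non-200 status as a failure and uses a content hash for deduplication, which mirrors the rules described above rather than the actor's internals:

```python
def is_billable(status_code, content, min_content_length=0, seen_hashes=None):
    """Return True if a page counts as a billable result: successful load,
    non-empty content above the length filter, and not a duplicate."""
    if seen_hashes is None:
        seen_hashes = set()
    if status_code != 200 or not content:
        return False  # failed request or empty extraction: not charged
    if len(content) < min_content_length:
        return False  # filtered below minContentLength: not charged
    digest = hash(content)
    if digest in seen_hashes:
        return False  # deduplicated page: not charged
    seen_hashes.add(digest)
    return True

seen = set()
```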