RAG Web Extractor

Pricing

from $2.00 / 1,000 pages extracted

Extract clean markdown from websites for RAG pipelines. Strip nav, ads, boilerplate. Preserve headings, links, images. Recursive crawling with depth control. Chunked output for embedding pipelines. Build AI knowledge bases.

Rating: 0.0 (0)

Developer: junipr (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 15 hours ago

RAG Web Content Extractor

Introduction

RAG Web Content Extractor is a production-grade web scraping actor that extracts clean, structured content from any web page and outputs it in formats optimized for LLM ingestion and RAG (Retrieval-Augmented Generation) pipelines. It handles JavaScript-rendered pages (SPAs, Next.js, Nuxt), infinite scroll, pagination, and complex DOM structures out of the box.

Primary use cases:

  • Feeding web content into vector databases (Pinecone, Weaviate, Qdrant, Chroma)
  • Building RAG pipelines for LLM applications
  • Structured content analysis and competitive intelligence
  • LLM fine-tuning data collection at scale

Key differentiators: Built-in configurable content chunking with overlap control, multi-format output (markdown, plain text, structured JSON) in a single run, full JavaScript rendering via Playwright, schema.org extraction, and content deduplication — all with zero-config defaults.

Why Use This Actor

| Feature | RAG Web Extractor | Firecrawl | web-content-crawler (Apify) | Website Content Crawler |
| --- | --- | --- | --- | --- |
| JS rendering | Full (Playwright) | Full | Partial | Partial |
| Markdown output | Native | Native | Plugin | No |
| Content chunking | Built-in w/ overlap | API only | No | No |
| RAG-optimized JSON | Native | Partial | No | No |
| Infinite scroll | Full support | Limited | Buggy | No |
| schema.org extraction | Full | Partial | No | No |
| PPE pricing | $3.50/1K | $38/1K equiv | $4.90/1K | Free (low quality) |
| Zero-config | Yes | Requires API key | Mostly | Yes |
| Content deduplication | Built-in | No | No | No |

Cost comparison: At 10,000 pages/month, this actor costs $35 vs Firecrawl's ~$380 equivalent — a 90% cost reduction with more features included.
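That comparison is simple per-page arithmetic; a quick sketch using the rates from the table above:

```python
def monthly_cost(pages: int, rate_per_1k: float) -> float:
    """Cost in USD for a given page volume at a per-1,000-pages rate."""
    return pages / 1000 * rate_per_1k

actor = monthly_cost(10_000, 3.50)      # $35.00
firecrawl = monthly_cost(10_000, 38.0)  # ~$380.00 equivalent
savings = 1 - actor / firecrawl         # ~0.908, i.e. roughly a 90% reduction
```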

How to Use

Zero-Config Quick Start

Just provide URLs and run. Everything else has sensible defaults:

{
  "startUrls": [
    { "url": "https://example.com/blog" }
  ]
}

That's it. The actor will extract the page content as clean markdown with full metadata. No API keys, no complex configuration.

Step-by-Step

  1. Go to the actor's page on Apify Console
  2. Add one or more URLs to the Start URLs field
  3. (Optional) Select additional output formats, enable chunking, or adjust other settings
  4. Click Start to run the actor
  5. When complete, download results from the Dataset tab

Common Configuration Recipes

RAG Pipeline Basic — Markdown output with chunking for vector databases:

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "chunkStrategy": "semantic",
  "maxDepth": 2
}

Full Site Crawl — Crawl an entire domain for comprehensive content extraction:

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxPages": 5000,
  "maxDepth": 5,
  "outputFormats": ["markdown", "structuredJson"],
  "extractTables": true,
  "renderJs": false
}

JS-Heavy SPA — Extract content from React/Next.js/Vue apps:

{
  "startUrls": [{ "url": "https://app.example.com/docs" }],
  "renderJs": true,
  "waitForSelector": "#main-content",
  "waitForTimeout": 10000,
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true
}

Input Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | URLs to scrape |
| maxPages | integer | 100 | Max pages per run (1-100,000) |
| maxDepth | integer | 0 | Link-following depth (0 = start URLs only) |
| outputFormats | array | ["markdown"] | Output formats: markdown, plainText, structuredJson, html |
| enableChunking | boolean | false | Split content into RAG-ready chunks |
| chunkSize | integer | 1000 | Target chunk size in characters (100-10,000) |
| chunkOverlap | integer | 200 | Overlap between chunks |
| chunkStrategy | string | "semantic" | Chunking strategy: semantic, fixed, sentence |
| renderJs | boolean | true | Use Playwright for JS rendering |
| waitForSelector | string | null | CSS selector to wait for before extraction |
| handleInfiniteScroll | boolean | false | Scroll to load lazy content |
| handlePagination | boolean | false | Follow pagination automatically |
| removeNavigation | boolean | true | Auto-remove nav/header/footer |
| removeAds | boolean | true | Auto-remove ad elements |
| extractMetadata | boolean | true | Extract OG tags, meta, JSON-LD |
| extractTables | boolean | false | Extract HTML tables as structured data |
| deduplicateContent | boolean | true | Skip duplicate pages |

See the Input Schema tab for the complete list of parameters with detailed descriptions.
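Among those parameters, deduplicateContent skips pages whose content has already been seen. A minimal sketch of one common approach, hashing whitespace-normalized text (an illustration of the idea, not the actor's actual algorithm):

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(text: str) -> bool:
    """True if an equivalent page body was already processed in this run."""
    fp = content_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```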

Output Format

Each scraped page produces a result object with the following structure:

Markdown Output

{
  "url": "https://example.com/blog/post-1",
  "statusCode": 200,
  "metadata": {
    "title": "How to Build a RAG Pipeline",
    "author": "Jane Doe",
    "wordCount": 2500,
    "readingTimeMinutes": 10.5
  },
  "content": {
    "markdown": "# How to Build a RAG Pipeline\n\nRAG (Retrieval-Augmented Generation) is..."
  }
}

Chunk Output

{
  "chunks": [
    {
      "chunkIndex": 0,
      "totalChunks": 5,
      "text": "RAG (Retrieval-Augmented Generation) is a technique...",
      "charCount": 980,
      "tokenEstimate": 245,
      "headingContext": "Introduction",
      "metadata": {
        "sourceUrl": "https://example.com/blog/post-1",
        "chunkStrategy": "semantic",
        "chunkSize": 1000,
        "overlap": 200
      }
    }
  ]
}

Integration with Vector Databases

LangChain (Python):

from langchain.docstore.document import Document
from langchain.document_loaders import ApifyDatasetLoader

loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={"source": item["url"], "title": item["metadata"]["title"]}
    )
)
docs = loader.load()
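Because dataset_mapping_function is just a plain callable over dataset items, it can be checked offline against the output shape shown earlier. A sketch using plain dicts in place of LangChain Document objects:

```python
def map_item(item: dict) -> dict:
    """Mirror of the dataset_mapping_function, returning a plain dict."""
    return {
        "page_content": item["content"]["markdown"],
        "metadata": {"source": item["url"], "title": item["metadata"]["title"]},
    }

# Sample item shaped like the actor's markdown output documented above
sample = {
    "url": "https://example.com/blog/post-1",
    "metadata": {"title": "How to Build a RAG Pipeline"},
    "content": {"markdown": "# How to Build a RAG Pipeline\n\n..."},
}
doc = map_item(sample)
```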

LlamaIndex (Python):

from llama_index import Document, download_loader

ApifyActor = download_loader("ApifyActor")
reader = ApifyActor()
documents = reader.load_data(
    actor_id="junipr/rag-web-extractor",
    run_input={"startUrls": [{"url": "https://example.com"}]},
    dataset_mapping_function=lambda item: Document(
        text=item["content"]["markdown"],
        extra_info={"url": item["url"]}
    )
)

Tips and Advanced Usage

Performance Optimization

  • Set renderJs: false for static sites — it's 10x faster and uses less compute
  • Use includeSelectors to target specific content areas instead of processing the entire page
  • For large crawls, start with maxPages: 10 to verify output quality before scaling up
  • Set maxDepth: 0 if you only need the start URLs (no link following)

Proxy Configuration

  • Default: Apify datacenter proxies (fastest, cheapest)
  • For sites that block datacenter IPs, switch to residential proxies via the proxy settings
  • You can also provide your own proxy URLs

Chunking Strategy Guide

  • Semantic (default): Best for most RAG use cases. Splits on paragraph/heading boundaries, preserving context. Each chunk is self-contained.
  • Fixed: Best for uniform embedding sizes. Splits at exact character counts regardless of content structure.
  • Sentence: Best for Q&A and chat applications. Preserves complete sentences within each chunk.
  • Chunk size tip for OpenAI: Use 500-1000 characters (125-250 tokens) for text-embedding-ada-002. Use 1000-2000 for text-embedding-3-large.
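To make the fixed strategy concrete, here is a minimal character-window chunker with overlap. It is a sketch of the general technique under the documented defaults (chunkSize 1000, chunkOverlap 200), not the actor's actual implementation:

```python
def fixed_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows; consecutive windows
    share `overlap` characters so context survives chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With a 2,500-character document and the defaults, this yields three chunks, and the first 200 characters of each chunk repeat the last 200 of the previous one.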

Custom Selectors

For complex layouts, use includeSelectors to extract only the main content:

{
  "includeSelectors": ["article.post-content", "div.documentation-body"],
  "removeSelectors": [".comments", ".related-posts", ".social-share"]
}

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $3.50 per 1,000 extracted pages.

A billable event occurs when the actor successfully loads a URL, extracts content, and pushes the result to the dataset. You are NOT charged for failed requests, CAPTCHAs, paywalls, filtered pages, or duplicates.

Cost Examples

| Scenario | Pages | Cost |
| --- | --- | --- |
| Blog extraction (50 posts) | 50 | $0.18 |
| Documentation site (500 pages) | 500 | $1.75 |
| News site daily scrape (200 articles) | 200 | $0.70 |
| Full site crawl (10,000 pages) | 10,000 | $35.00 |
| Enterprise RAG pipeline (100K pages/mo) | 100,000 | $350.00 |

Plus standard Apify platform compute costs based on memory and runtime.

FAQ

How does this compare to Firecrawl?

This actor is 75-90% cheaper than Firecrawl at scale ($3.50/1K vs ~$38/1K equivalent) with no monthly subscription. It includes built-in chunking with configurable strategies, content deduplication, and runs on Apify infrastructure so there's no API key management. Firecrawl requires separate API calls for chunking and charges monthly fees on top of per-page costs.

Can it handle JavaScript-rendered pages?

Yes. When renderJs is enabled (the default), the actor uses a full Playwright browser to render pages. This handles React, Next.js, Vue, Angular, and any other SPA framework. You can also use waitForSelector to wait for specific elements to load before extraction.

What chunk size should I use for OpenAI embeddings?

For text-embedding-ada-002, use 500-1000 characters (roughly 125-250 tokens). For text-embedding-3-large, you can go up to 2000 characters. Set chunkOverlap to 100-200 characters (10-20% of chunk size) to maintain context across chunk boundaries.
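The character-to-token conversion here, and the tokenEstimate field in the chunk output (980 characters → 245 tokens), are consistent with the common rule of thumb of roughly 4 characters per token for English text. A sketch of that heuristic; for exact counts you would use a real tokenizer such as tiktoken:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count via the ~4 characters per token heuristic.
    Good enough for sizing chunks, not for billing-exact token counts."""
    return max(1, len(text) // 4)
```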

Does it respect robots.txt?

Yes. The respectRobotsTxt option is enabled by default. Pages blocked by robots.txt will be skipped with a ROBOTS_BLOCKED error code. You can disable this if needed, but please be responsible.

How do I scrape pages behind a login?

Use the cookies input parameter to provide session cookies, or use httpHeaders to pass authentication tokens. For complex auth flows, consider using a pre-login actor to establish a session first.

What's the maximum number of pages per run?

Up to 100,000 pages per run. For very large crawls, increase the actor memory to 8192 MB and set an appropriate timeout (up to 24 hours).

Can I use my own proxies?

Yes. In the proxyConfiguration input, you can provide your own proxy URLs instead of using Apify's built-in proxies.

How is a "result" defined for pricing?

A result is one successfully extracted page that produces at least one non-empty output format and is pushed to the dataset. Failed requests, CAPTCHAs, paywalls, filtered pages (below minContentLength), and deduplicated pages are not charged.
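Expressed as code, that billing rule is a predicate over each crawled page. A hypothetical sketch: the parameter names are illustrative, and only minContentLength and the deduplication behavior come from the documentation above:

```python
def is_billable(status_code: int, content: str,
                min_content_length: int = 0, is_duplicate: bool = False) -> bool:
    """A page is billed only if it loaded successfully, produced non-empty
    content at or above minContentLength, and was not deduplicated."""
    return (
        200 <= status_code < 300
        and len(content.strip()) >= max(min_content_length, 1)
        and not is_duplicate
    )
```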