Pricing

from $4.99 / 1,000 results

RAG Web Browser

RAG Web Browser crawls websites and extracts relevant content for Retrieval-Augmented Generation (RAG). Collect clean text, metadata, links, and documents from web pages, then export structured data to JSON, CSV, Excel, or XML for AI assistants, knowledge bases, and semantic search.

Rating: 0.0 (0)

Developer: Scraper Engine

Last modified: a day ago

🌐 RAG Web Browser — Search & Scrape for AI Agents & LLM Pipelines

One actor. Any question. Clean Markdown back. Search Google → scrape the top results → return polished Markdown / HTML / plain text — ready to drop straight into your RAG pipeline, LangChain / LlamaIndex retriever, OpenAI Assistant, Claude, Gemini, or custom AI agent.

✨ Why Choose This Actor?

🔥	What you get
🚀	Blazing-fast async pipeline (aiohttp + selectolax + lexbor)
🧠	LLM-ready output — clean Markdown by default, HTML & plain text on demand
🛡️	Smart proxy ladder — starts direct, auto-upgrades to datacenter → residential if a site blocks us
🔁	Resilient retries — 3 residential attempts before giving up
🕸️	Bulk URLs or one search query — single input, two modes
🍪	Removes cookie / GDPR banners automatically
📰	Readability mode — isolates article body for cleaner context
💾	Live dataset writes — partial results survive crashes
🪟	Open-source friendly — Apify SDK 3.x, Python 3.13

🎯 Key Features

🔍 Google Search backbone — paginated, deduped, ranked results
🌐 Direct URL mode — paste a list of URLs and skip search entirely
🧹 Custom CSS scrub — strip nav, footer, scripts, modals, ads, …
📑 Per-page metadata — title, description, language, redirect chain
🔢 Per-section dataset views — Results · Metadata · Crawl status · Content
🎚️ Tunable concurrency — 1 to 50 parallel fetches
🐞 Debug mode — see byte length, final URL, content type
💸 Pay-per-usage pricing — no separate per-event charges

📥 Input

The form matches the official RAG Web Browser layout, plus an optional bulk URLs field.

Field	Type	Default	Description
`query`	string	`web browser for RAG pipelines -site:reddit.com`	Search keywords or a single URL.
`urls`	array	`[]`	Optional bulk URLs — skips search when set.
`maxResults`	integer	`3`	Top organic results to scrape (1–100).
`outputFormats`	array	`["markdown"]`	`text`, `markdown`, and/or `html`.
`serpProxyGroup`	string	`GOOGLE_SERP`	Proxy group for Google Search (`GOOGLE_SERP` or `SHADER`).
`serpMaxRetries`	integer	`2`	Retries when SERP fetch fails.
`proxyConfiguration`	object	`{ "useApifyProxy": true }`	Target-page proxies; auto-escalates to residential on block.
`scrapingTool`	string	`raw-http`	`raw-http` (supported) or `browser-playwright` (falls back to HTTP).
`removeElementsCssSelector`	string	(sensible default)	CSS to strip before extraction.
`htmlTransformer`	enum	`none`	`none` or `readable` (article body).
`maxRequestRetries`	integer	`1`	Target page retries (0–3).
`dynamicContentWaitSecs`	integer	`10`	For browser mode only (ignored for Raw HTTP).
`removeCookieWarnings`	boolean	`true`	Strip cookie & GDPR dialogs.
`debugMode`	boolean	`false`	Add per-page debug info.

Example input

```json
{
  "query": "best web scraping libraries 2026",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "removeCookieWarnings": true,
  "proxyConfiguration": { "useApifyProxy": false }
}
```

Or scrape specific URLs:

```json
{
  "urls": [
    "https://apify.com",
    "https://playwright.dev",
    "https://crawlee.dev"
  ],
  "outputFormats": ["markdown", "text"]
}
```
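If you build the input programmatically, a small helper can keep the two modes apart and catch out-of-range values before a run starts. The sketch below is a minimal illustration: `build_run_input` and `ALLOWED_FORMATS` are hypothetical names, and the checks only mirror the ranges and options listed in the input table, not the actor's own validation.

```python
from __future__ import annotations

from typing import Any

# Output formats listed in the input table.
ALLOWED_FORMATS = {"text", "markdown", "html"}


def build_run_input(
    query: str | None = None,
    urls: list[str] | None = None,
    max_results: int = 3,
    output_formats: tuple[str, ...] = ("markdown",),
) -> dict[str, Any]:
    """Build an input dict for search mode or bulk-URL mode (hypothetical helper)."""
    if not query and not urls:
        raise ValueError("provide a query or at least one URL")
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    run_input: dict[str, Any] = {"outputFormats": list(output_formats)}
    if urls:
        # Bulk-URL mode: per the input table, search is skipped when urls is set.
        run_input["urls"] = list(urls)
    else:
        run_input["query"] = query
        run_input["maxResults"] = max_results
    return run_input


search_input = build_run_input(query="best web scraping libraries 2026", max_results=5)
url_input = build_run_input(
    urls=["https://apify.com", "https://crawlee.dev"],
    output_formats=("markdown", "text"),
)
```

Either dict can be passed as `run_input` to the Apify client or sent as the JSON body of an API call.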

📤 Output

Each dataset row contains:

```json
{
  "crawl": {
    "httpStatusCode": 200,
    "httpStatusMessage": "OK",
    "loadedAt": "2026-05-19T12:50:40.591Z",
    "uniqueKey": "21f8d32712",
    "requestStatus": "handled"
  },
  "searchResult": {
    "title": "RAG Web Browser",
    "description": "Web search and fetch tool for AI agents and RAG pipelines ...",
    "url": "https://apify.com/apify/rag-web-browser",
    "resultType": "ORGANIC",
    "rank": 1
  },
  "metadata": {
    "title": "RAG Web Browser · Apify",
    "description": "Web search and fetch tool for AI agents and RAG pipelines.",
    "languageCode": "en",
    "url": "https://apify.com/apify/rag-web-browser",
    "redirectedUrl": "https://apify.com/apify/rag-web-browser"
  },
  "query": "web browser for RAG pipelines -site:reddit.com",
  "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents..."
}
```

The Apify Console renders the dataset with five tabs:

📋 Overview — everything at a glance
📄 Search results — rank, title, snippet, URL
📑 Page metadata — title, description, language, redirect chain
🛰️ Crawl status — HTTP code, request outcome, timestamps
📝 Extracted content — Markdown / HTML / plain text per page
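When consuming rows in code rather than in the Console, it helps to separate usable pages from failed ones before passing content to an LLM. The following is a rough sketch over the row shape shown above. The helper names are hypothetical, and it assumes that `text` and `html` appear as top-level keys in the same way `markdown` does, which the sample row does not confirm.

```python
from __future__ import annotations

from typing import Any, Iterable

# Assumption: "text" and "html" are top-level keys, like "markdown" in the sample row.
CONTENT_KEYS = ("markdown", "text", "html")


def pick_content(row: dict[str, Any]) -> str | None:
    """Return the first non-empty content field, preferring Markdown."""
    for key in CONTENT_KEYS:
        if row.get(key):
            return row[key]
    return None


def partition_rows(rows: Iterable[dict[str, Any]]) -> tuple[list[dict], list[dict]]:
    """Split rows into usable pages and failed or empty ones."""
    usable, failed = [], []
    for row in rows:
        status = (row.get("crawl") or {}).get("requestStatus")
        if status == "handled" and pick_content(row):
            usable.append(row)
        else:
            failed.append(row)
    return usable, failed


def was_redirected(row: dict[str, Any]) -> bool:
    """True when metadata.redirectedUrl differs from the requested URL."""
    meta = row.get("metadata") or {}
    return bool(meta.get("redirectedUrl")) and meta["redirectedUrl"] != meta.get("url")


sample_rows = [
    {
        "crawl": {"requestStatus": "handled"},
        "metadata": {
            "url": "https://apify.com/apify/rag-web-browser",
            "redirectedUrl": "https://apify.com/apify/rag-web-browser",
        },
        "markdown": "# RAG Web Browser",
    },
    {"crawl": {"requestStatus": "failed"}, "metadata": {"url": "https://example.com"}},
]
usable, failed = partition_rows(sample_rows)
print(len(usable), "usable,", len(failed), "failed")
```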

🚀 How to Use (Apify Console)

1. Go to Apify Console → Actors.
2. Open this actor (or import it as a task).
3. Set your 🔎 Search query or paste a list of 🔗 URLs.
4. Pick which 📝 Output formats you want (Markdown is the default).
5. Click ▶ Start.
6. Watch the run feed. You'll see emoji-prefixed live progress: 🔎 Searching…, 📄 Page 1 → +10 new, 🔗 [3] Fetching …, ✅ [3] 200 — Title…, 📊 Progress: 5/10 (50%).
7. Open the 📦 Output tab to browse results by section.
8. Export as JSON / CSV / XLSX, or pull via the Apify API.

🤖 Use via API / Integration

REST API

```shell
curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "query": "vector database benchmarks 2026",
       "maxResults": 5,
       "outputFormats": ["markdown"]
     }'
```

Python SDK

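```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "query": "LangChain vs LlamaIndex",
    "maxResults": 5,
})

# Iterate over the run's default dataset; failed pages may have no Markdown.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    title = (item.get("metadata") or {}).get("title", "")
    print(title, "→", (item.get("markdown") or "")[:200])
```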

Drop-in for LangChain retrievers

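```python
from langchain_core.documents import Document

# `items` is the list of dataset rows from the Python SDK example, e.g.
# items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
docs = [
    Document(
        page_content=item["markdown"],
        metadata={
            "source": item["metadata"]["url"],
            # .get() guards rows without a searchResult block.
            "rank": (item.get("searchResult") or {}).get("rank"),
        },
    )
    for item in items if item.get("markdown")
]
```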

🛡️ How blocking & proxies are handled

You don't need to think about proxies — the actor auto-tunes:

🟢 Direct by default (fastest, cheapest).
If a site blocks us → 🟡 Datacenter proxy is engaged.
Still blocked? → 🔴 Residential proxy with up to 3 retries.
Once residential kicks in, it sticks for the rest of the run so successive pages don't fight the same wall.

All escalations are logged so you can audit them, e.g. 🛡️ Switching to residential connection (sticky) — reason: site responded with 403.
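The escalation order described above can be pictured as a small state machine. The sketch below is an illustration only, not the actor's code: `ProxyLadder`, `Tier`, and `BLOCK_STATUSES` are hypothetical names, the set of "blocked" status codes is an assumption (the text only mentions 403), and the HTTP call is replaced by a stand-in function so the example runs offline.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Callable

# Assumption: these status codes count as "blocked"; the text only mentions 403.
BLOCK_STATUSES = {403, 429}


class Tier(IntEnum):
    DIRECT = 0
    DATACENTER = 1
    RESIDENTIAL = 2


class ProxyLadder:
    """Hypothetical escalation state: direct -> datacenter -> residential (sticky)."""

    def __init__(self, residential_attempts: int = 3) -> None:
        self.residential_attempts = residential_attempts
        self.start_tier = Tier.DIRECT
        self.log: list[str] = []

    def fetch(self, url: str, do_request: Callable[[str, Tier], int]) -> int | None:
        """Return the first non-blocked status, or None if every tier was blocked."""
        last_status = None
        for tier in Tier:
            if tier < self.start_tier:
                continue  # residential was already engaged earlier in the run
            if tier is Tier.RESIDENTIAL and self.start_tier is not Tier.RESIDENTIAL:
                self.start_tier = Tier.RESIDENTIAL  # sticks for later pages
                self.log.append(
                    f"switching to residential (sticky), reason: status {last_status}"
                )
            attempts = self.residential_attempts if tier is Tier.RESIDENTIAL else 1
            for _ in range(attempts):
                last_status = do_request(url, tier)
                if last_status not in BLOCK_STATUSES:
                    return last_status
        return None


calls: list[tuple[str, str]] = []


def fake_request(url: str, tier: Tier) -> int:
    """Stand-in for a real HTTP call: only residential gets through."""
    calls.append((url, tier.name))
    return 200 if tier is Tier.RESIDENTIAL else 403


ladder = ProxyLadder()
first = ladder.fetch("https://example.com/a", fake_request)
second = ladder.fetch("https://example.com/b", fake_request)
```

In this simulation the first page walks all three tiers, while the second page goes straight to residential because the tier is sticky.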

🎯 Best Use Cases

🧠 RAG pipelines — feed fresh web context to your LLM at query time
🤖 AI agents — give Claude / GPT / Gemini a real web-browsing skill
🔬 Research assistants — bulk-summarize top N results for a topic
📈 Competitive intelligence — track competitor pages on a schedule
📰 Content monitoring — convert articles to Markdown for analysis
🪄 Prompt enrichment — auto-grab fresh facts before generating text

💰 Pricing

This actor is pay-per-usage — you only pay for the Apify platform compute units (CUs) and proxy traffic it actually uses. There are no separate per-event charges.

Driver	Notes
⏱️ Compute units	Proportional to memory × runtime. Typical 10-result run = a few cents.
🛡️ Datacenter proxy	Used only if a site blocks the direct request.
🛡️ Residential proxy	Used as a last resort. Higher cost but unblocks most walls.
💾 Storage	A few KB per dataset row.

Want to lower cost further? Set `maxResults` lower, enable `htmlTransformer: "readable"`, or skip `html` output.
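To get a feel for the compute-unit driver: on the Apify platform, one compute unit corresponds to 1 GB of memory used for one hour. The sketch below turns that into a rough estimate. The function names are hypothetical, and the price per CU is deliberately a parameter, since it depends on your plan; the value in the example is a placeholder, not a quoted rate.

```python
def compute_units(memory_mb: int, runtime_secs: float) -> float:
    """Compute units = memory in GB multiplied by runtime in hours."""
    return (memory_mb / 1024) * (runtime_secs / 3600)


def estimate_cost(memory_mb: int, runtime_secs: float, usd_per_cu: float) -> float:
    """Rough compute cost only; proxy traffic and storage are billed separately."""
    return compute_units(memory_mb, runtime_secs) * usd_per_cu


# Example: a run with 1024 MB of memory that takes 90 seconds.
cus = compute_units(memory_mb=1024, runtime_secs=90)
# Placeholder price per CU: replace with the rate from your own Apify plan.
cost = estimate_cost(memory_mb=1024, runtime_secs=90, usd_per_cu=0.30)
print(f"{cus:.4f} CU, about ${cost:.4f} at the placeholder rate")
```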

❓ Frequently Asked Questions

Q: Do I need to configure a proxy myself?
A: No. Start with no proxy (the default). If a site blocks the direct request, the actor automatically tries datacenter, then residential. You only need to pick a proxy explicitly if you want a specific geography.

Q: How is the Markdown produced?
A: We parse HTML with selectolax (lexbor backend), strip noise via your CSS selectors, optionally isolate the article body, then convert to Markdown via markdownify with ATX-style headings.

Q: Can I scrape JavaScript-heavy sites?
A: This actor uses HTTP-only fetching for maximum speed. For sites that require a full browser (heavy SPA / login flows), use a Playwright-based actor.

Q: Does it handle redirects?
A: Yes. `metadata.redirectedUrl` captures the final URL after following redirects.

Q: What happens if half my pages succeed and half fail?
A: You still get the successful ones. Each record is pushed to the dataset live, so a crash mid-run cannot wipe earlier results. Failed pages are saved with `crawl.requestStatus: "failed"` and the error message.

Q: Can I export results?
A: Yes. JSON, CSV, XLSX, RSS, XML, and HTML table are all available in the Output tab and via the Apify API.

📜 Cautions / Legal

The actor scrapes only publicly available web content.
Don't use it to scrape private, gated, or authenticated content unless you have explicit authorization.
You are responsible for legal compliance (GDPR, CCPA, site Terms of Service, robots.txt, copyright).
Be a good citizen — avoid excessive maxResults on sites you do not own or operate.

📨 Support & Feedback

Found a bug or have a feature request? Open an issue from the actor page in the Apify Console and we'll take a look. PRs welcome.

Built with 💙 on the Apify platform.
