RAG Web Browser

Pricing

from $4.99 / 1,000 results

Rating: 0.0 (0)

Developer: Scraper Engine (maintained by Community)

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 2 days ago

🌐 RAG Web Browser: Search & Scrape for AI Agents & LLM Pipelines

One actor. Any question. Clean Markdown back. Search Google → scrape the top results → return polished Markdown / HTML / plain text, ready to drop straight into your RAG pipeline, LangChain / LlamaIndex retriever, OpenAI Assistant, Claude, Gemini, or custom AI agent.



✨ Why Choose This Actor?

| 🔥 | What you get |
| --- | --- |
| 🚀 | Blazing-fast async pipeline (aiohttp + selectolax + lexbor) |
| 🧠 | LLM-ready output: clean Markdown by default, HTML & plain text on demand |
| 🛡️ | Smart proxy ladder: starts direct, auto-upgrades to datacenter → residential if a site blocks us |
| 🔁 | Resilient retries: 3 residential attempts before giving up |
| 🕸️ | Bulk URLs or one search query: single input, two modes |
| 🍪 | Removes cookie / GDPR banners automatically |
| 📰 | Readability mode: isolates article body for cleaner context |
| 💾 | Live dataset writes: partial results survive crashes |
| 🪟 | Open-source friendly: Apify SDK 3.x, Python 3.13 |

🎯 Key Features

  • 🔍 Google Search backbone: paginated, deduped, ranked results
  • 🌐 Direct URL mode: paste a list of URLs and skip search entirely
  • 🧹 Custom CSS scrub: strip nav, footer, scripts, modals, ads, …
  • 📑 Per-page metadata: title, description, language, redirect chain
  • 🔢 Per-section dataset views: Results · Metadata · Crawl status · Content
  • 🎚️ Tunable concurrency: 1 to 50 parallel fetches
  • 🐞 Debug mode: see byte length, final URL, content type
  • 💸 Pay-per-usage pricing: no separate per-event charges

📥 Input

The form matches the official RAG Web Browser layout, plus an optional bulk URLs field.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | web browser for RAG pipelines -site:reddit.com | Search keywords or a single URL. |
| urls | array | [] | Optional bulk URLs; skips search when set. |
| maxResults | integer | 3 | Top organic results to scrape (1–100). |
| outputFormats | array | ["markdown"] | text, markdown, and/or html. |
| serpProxyGroup | string | GOOGLE_SERP | Proxy group for Google Search (GOOGLE_SERP or SHADER). |
| serpMaxRetries | integer | 2 | Retries when SERP fetch fails. |
| proxyConfiguration | object | { "useApifyProxy": true } | Target-page proxies; auto-escalates to residential on block. |
| scrapingTool | string | raw-http | raw-http (supported) or browser-playwright (falls back to HTTP). |
| removeElementsCssSelector | string | (sensible default) | CSS to strip before extraction. |
| htmlTransformer | enum | none | none or readable (article body). |
| maxRequestRetries | integer | 1 | Target page retries (0–3). |
| dynamicContentWaitSecs | integer | 10 | For browser mode only (ignored for Raw HTTP). |
| removeCookieWarnings | boolean | true | Strip cookie & GDPR dialogs. |
| debugMode | boolean | false | Add per-page debug info. |

Example input

{
  "query": "best web scraping libraries 2026",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "removeCookieWarnings": true,
  "proxyConfiguration": { "useApifyProxy": false }
}

Or scrape specific URLs:

{
  "urls": [
    "https://apify.com",
    "https://playwright.dev",
    "https://crawlee.dev"
  ],
  "outputFormats": ["markdown", "text"]
}
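The two input modes above can be assembled programmatically. The following is a minimal sketch, not part of the actor itself: `build_run_input` and `ALLOWED_FORMATS` are hypothetical names, and the validation rules simply mirror the ranges and allowed values listed in the input table.

```python
from typing import Any, Optional, Sequence

# Allowed values for outputFormats, taken from the input table.
ALLOWED_FORMATS = {"text", "markdown", "html"}


def build_run_input(
    query: Optional[str] = None,
    urls: Optional[Sequence[str]] = None,
    max_results: int = 3,
    output_formats: Sequence[str] = ("markdown",),
) -> dict[str, Any]:
    """Build an actor input dict (hypothetical helper, not an Apify API)."""
    if not query and not urls:
        raise ValueError("provide either a search query or a list of URLs")
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    run_input: dict[str, Any] = {"outputFormats": list(output_formats)}
    if urls:
        # Per the input table, a non-empty `urls` list skips the search step.
        run_input["urls"] = list(urls)
    else:
        run_input["query"] = query
        run_input["maxResults"] = max_results
    return run_input
```

Rejecting bad values before a run starts avoids paying for a run that would fail on its input.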

📤 Output

Each dataset row contains:

{
  "crawl": {
    "httpStatusCode": 200,
    "httpStatusMessage": "OK",
    "loadedAt": "2026-05-19T12:50:40.591Z",
    "uniqueKey": "21f8d32712",
    "requestStatus": "handled"
  },
  "searchResult": {
    "title": "RAG Web Browser",
    "description": "Web search and fetch tool for AI agents and RAG pipelines ...",
    "url": "https://apify.com/apify/rag-web-browser",
    "resultType": "ORGANIC",
    "rank": 1
  },
  "metadata": {
    "title": "RAG Web Browser · Apify",
    "description": "Web search and fetch tool for AI agents and RAG pipelines.",
    "languageCode": "en",
    "url": "https://apify.com/apify/rag-web-browser",
    "redirectedUrl": "https://apify.com/apify/rag-web-browser"
  },
  "query": "web browser for RAG pipelines -site:reddit.com",
  "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents..."
}

The Apify Console renders the dataset with five tabs:

  • 📋 Overview: everything at a glance
  • 📄 Search results: rank, title, snippet, URL
  • 📑 Page metadata: title, description, language, redirect chain
  • 🛰️ Crawl status: HTTP code, request outcome, timestamps
  • 📝 Extracted content: Markdown / HTML / plain text per page
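A dataset row shaped like the example above can be turned into a source-attributed context snippet for a prompt. This is a minimal sketch: `to_context_block` is a hypothetical helper, and the `text` and `html` keys are an assumption based on the output format names, since the sample row only shows `markdown`.

```python
from typing import Any


def to_context_block(row: dict[str, Any], max_chars: int = 2000) -> str:
    """Format one dataset row as a snippet with its source (hypothetical helper)."""
    meta = row.get("metadata") or {}
    # Prefer the final URL after redirects, then the originally requested one.
    url = meta.get("redirectedUrl") or meta.get("url") or ""
    title = meta.get("title") or "(untitled)"
    # Assumption: each requested output format is stored under a key of the same name.
    body = row.get("markdown") or row.get("text") or row.get("html") or ""
    return f"Source: {title} <{url}>\n\n{body[:max_chars]}"


if __name__ == "__main__":
    sample = {
        "metadata": {
            "title": "RAG Web Browser",
            "url": "https://apify.com/apify/rag-web-browser",
        },
        "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents...",
    }
    print(to_context_block(sample, max_chars=80))
```

Truncating with `max_chars` keeps each snippet within a predictable share of the model's context window.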

🚀 How to Use (Apify Console)

  1. Go to Apify Console → Actors.
  2. Open this actor (or import it as a task).
  3. Set your 🔎 Search query or paste a list of 🔗 URLs.
  4. Pick which 📝 Output formats you want (Markdown is the default).
  5. Click ▶ Start.
  6. Watch the run feed. You'll see emoji-prefixed live progress: 🔎 Searching…, 📄 Page 1 → +10 new, 🔗 [3] Fetching …, ✅ [3] 200 - Title…, 📊 Progress: 5/10 (50%).
  7. Open the 📦 Output tab to browse results by section.
  8. Export as JSON / CSV / XLSX, or pull via the Apify API.

🤖 Use via API / Integration

REST API

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "vector database benchmarks 2026",
    "maxResults": 5,
    "outputFormats": ["markdown"]
  }'

Python SDK

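from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "query": "LangChain vs LlamaIndex",
    "maxResults": 5,
})

# Iterate over the rows in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Rows without extracted Markdown fall back to an empty string.
    print(item["metadata"]["title"], "→", (item.get("markdown") or "")[:200])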

Drop-in for LangChain retrievers

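from langchain_core.documents import Document

# `items` is assumed to be the list of dataset rows from the run above, e.g.
# items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
docs = [
    Document(
        page_content=item["markdown"],
        metadata={"source": item["metadata"]["url"], "rank": item["searchResult"]["rank"]},
    )
    for item in items
    if item.get("markdown")  # skip rows without extracted Markdown
]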

πŸ›‘οΈ How blocking & proxies are handled

You don't need to think about proxies β€” the actor auto-tunes:

  1. 🟒 Direct by default (fastest, cheapest).
  2. If a site blocks us β†’ 🟑 Datacenter proxy is engaged.
  3. Still blocked? β†’ πŸ”΄ Residential proxy with up to 3 retries.
  4. Once residential kicks in, it sticks for the rest of the run so successive pages don't fight the same wall.

All escalations are logged so you can audit them, e.g. πŸ›‘οΈ Switching to residential connection (sticky) β€” reason: site responded with 403.
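The escalation steps above can be sketched as a small state machine. This is an illustration under stated assumptions, not the actor's actual code: `ProxyLadder`, `send`, and the set of status codes treated as a block are all hypothetical, and the network call is injected so the logic can be exercised without real requests.

```python
from typing import Callable

LADDER = ("direct", "datacenter", "residential")
BLOCK_STATUSES = {403, 429}  # assumption: which HTTP codes count as "blocked"
RESIDENTIAL_ATTEMPTS = 3     # "up to 3 retries" on the residential tier


class ProxyLadder:
    """Tracks the sticky residential state across one run (hypothetical class)."""

    def __init__(self) -> None:
        self.sticky_residential = False

    def fetch(self, url: str, send: Callable[[str, str], int]) -> tuple[str, int]:
        """Try each tier in order; return (tier, status) of the last attempt.

        `send(url, tier)` is a caller-supplied function that performs the
        request through the given tier and returns the HTTP status code.
        """
        tiers = ("residential",) if self.sticky_residential else LADDER
        tier, status = tiers[0], 0
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                attempts = RESIDENTIAL_ATTEMPTS
                # Once residential is engaged it stays on for later pages.
                self.sticky_residential = True
            for _ in range(attempts):
                status = send(url, tier)
                if status not in BLOCK_STATUSES:
                    return tier, status
        return tier, status
```

Keeping the sticky flag on the object rather than per request is what lets later pages skip the direct and datacenter attempts.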


🎯 Best Use Cases

  • 🧠 RAG pipelines: feed fresh web context to your LLM at query time
  • 🤖 AI agents: give Claude / GPT / Gemini a real web-browsing skill
  • 🔬 Research assistants: bulk-summarize top N results for a topic
  • 📈 Competitive intelligence: track competitor pages on a schedule
  • 📰 Content monitoring: convert articles to Markdown for analysis
  • 🪄 Prompt enrichment: auto-grab fresh facts before generating text

💰 Pricing

This actor is pay-per-usage: you only pay for the Apify platform compute units (CUs) and proxy traffic it actually uses. There are no separate per-event charges.

| Driver | Notes |
| --- | --- |
| ⏱️ Compute units | Proportional to memory × runtime. Typical 10-result run = a few cents. |
| 🛡️ Datacenter proxy | Used only if a site blocks the direct request. |
| 🛡️ Residential proxy | Used as a last resort. Higher cost but unblocks most walls. |
| 💾 Storage | A few KB per dataset row. |

Want to lower cost further? Set maxResults lower, enable htmlTransformer: "readable", or skip html output.
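As a rough illustration of the "memory × runtime" driver, the sketch below assumes one compute unit corresponds to 1 GB of memory used for 1 hour. The function names and the per-CU price in the example are placeholders, not actual rates, and proxy traffic and storage are not included.

```python
def compute_units(memory_mb: float, runtime_secs: float) -> float:
    """Return compute units as memory (GB) multiplied by runtime (hours)."""
    return (memory_mb / 1024) * (runtime_secs / 3600)


def estimate_compute_cost(memory_mb: float, runtime_secs: float, usd_per_cu: float) -> float:
    """Estimate the compute part of a run's cost; proxy and storage are excluded."""
    return compute_units(memory_mb, runtime_secs) * usd_per_cu


if __name__ == "__main__":
    # Hypothetical run: 1 GB of memory for 90 seconds at a made-up $0.30 per CU.
    print(f"{compute_units(1024, 90):.4f} CU")
    print(f"${estimate_compute_cost(1024, 90, 0.30):.4f}")
```

Since the estimate scales linearly with both memory and runtime, halving either one halves the compute cost.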


❓ Frequently Asked Questions

Q: Do I need to configure a proxy myself? A: No. Start with no proxy (the default). If a site blocks the direct request, the actor automatically tries datacenter, then residential. You only need to pick a proxy explicitly if you want a specific geography.

Q: How is the Markdown produced? A: We parse HTML with selectolax (lexbor backend), strip noise via your CSS selectors, optionally isolate the article body, then convert to Markdown via markdownify with ATX-style headings.
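The same strip-then-convert idea can be illustrated with the standard library alone. The sketch below deliberately swaps selectolax and markdownify for `html.parser`, drops a fixed set of tags instead of applying CSS selectors, and only handles headings and paragraphs, so it shows the shape of the pipeline rather than the actor's implementation. `MarkdownSketch` and `html_to_markdown` are hypothetical names.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}  # stand-in for the CSS scrub
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3}


class MarkdownSketch(HTMLParser):
    """Collects headings and paragraphs, skipping text inside noise elements."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0          # > 0 while inside a stripped element
        self.prefix = ""              # ATX marker for the block being read
        self.text: list[str] = []     # text fragments of the current block
        self.blocks: list[str] = []   # finished Markdown blocks

    def _flush(self) -> None:
        # Collapse whitespace and store the block if it has any content.
        content = " ".join("".join(self.text).split())
        if content:
            self.blocks.append(self.prefix + content)
        self.text, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in HEADING_LEVELS:
            self._flush()
            self.prefix = "#" * HEADING_LEVELS[tag] + " "  # ATX-style heading
        elif tag == "p":
            self._flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag in HEADING_LEVELS or tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.text.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that was not inside a block
    return "\n\n".join(parser.blocks)
```

A selector-based scrub is more flexible than this tag list because it can also target classes and IDs such as cookie banners or modals.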

Q: Can I scrape JavaScript-heavy sites? A: This actor uses HTTP-only fetching for maximum speed. For sites that require a full browser (heavy SPA / login flows), use a Playwright-based actor.

Q: Does it handle redirects? A: Yes. metadata.redirectedUrl captures the final URL after following redirects.

Q: What happens if half my pages succeed and half fail? A: You still get the successful ones. Each record is pushed to the dataset live, so a crash mid-run cannot wipe earlier results. Failed pages are saved with crawl.requestStatus: "failed" and the error message.
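Rows can be separated by outcome after a run. This is a small sketch that relies only on the `crawl.requestStatus` values shown in this document ("handled" in the sample row, "failed" for errors); `partition_rows` is a hypothetical helper name.

```python
from typing import Any, Iterable


def partition_rows(rows: Iterable[dict[str, Any]]) -> tuple[list[dict], list[dict]]:
    """Split dataset rows into (handled, not handled) by crawl.requestStatus."""
    handled: list[dict] = []
    other: list[dict] = []
    for row in rows:
        status = (row.get("crawl") or {}).get("requestStatus")
        # Anything that is not explicitly "handled" is treated as a failure.
        (handled if status == "handled" else other).append(row)
    return handled, other
```

Treating a missing status as a failure is the conservative choice, since such rows are unlikely to carry usable content.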

Q: Can I export results? A: Yes. JSON, CSV, XLSX, RSS, XML, and HTML table exports are all available in the Output tab and via the Apify API.


  • The actor scrapes only publicly available web content.
  • Don't use it to scrape private, gated, or authenticated content unless you have explicit authorization.
  • You are responsible for legal compliance (GDPR, CCPA, site Terms of Service, robots.txt, copyright).
  • Be a good citizen: avoid excessive maxResults on sites you do not own or operate.

📨 Support & Feedback

Found a bug or have a feature request? Open an issue from the actor page in the Apify Console and we'll take a look. PRs welcome.


Built with 💙 on the Apify platform.