RAG Web Browser
Pricing
from $4.99 / 1,000 results
Developer
API Empire
Maintained by Community
RAG Web Browser: Search & Scrape for AI Agents & LLM Pipelines
One actor. Any question. Clean Markdown back. Search Google → scrape the top results → return polished Markdown / HTML / plain text, ready to drop straight into your RAG pipeline, LangChain / LlamaIndex retriever, OpenAI Assistant, Claude, Gemini, or custom AI agent.
Why Choose This Actor?
| Feature | What you get |
|---|---|
| Blazing-fast async pipeline | aiohttp + selectolax + lexbor |
| LLM-ready output | Clean Markdown by default, HTML and plain text on demand |
| Smart proxy ladder | Starts direct, auto-upgrades to datacenter → residential if a site blocks the request |
| Resilient retries | 3 residential attempts before giving up |
| Bulk URLs or one search query | Single input, two modes |
| Cookie banner removal | Removes cookie / GDPR banners automatically |
| Readability mode | Isolates the article body for cleaner context |
| Live dataset writes | Partial results survive crashes |
| Open-source friendly | Apify SDK 3.x, Python 3.13 |
Key Features
- Google Search backbone: paginated, deduped, ranked results
- Direct URL mode: paste a list of URLs and skip search entirely
- Custom CSS scrub: strip nav, footer, scripts, modals, ads, …
- Per-page metadata: title, description, language, redirect chain
- Per-section dataset views: Results · Metadata · Crawl status · Content
- Tunable concurrency: 1 to 50 parallel fetches
- Debug mode: see byte length, final URL, content type
- Pay-per-usage pricing: no separate per-event charges
Input
The form matches the official RAG Web Browser layout, plus an optional bulk URLs field.
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | `web browser for RAG pipelines -site:reddit.com` | Search keywords or a single URL. |
| `urls` | array | `[]` | Optional bulk URLs; skips search when set. |
| `maxResults` | integer | `3` | Top organic results to scrape (1-100). |
| `outputFormats` | array | `["markdown"]` | `text`, `markdown`, and/or `html`. |
| `serpProxyGroup` | string | `GOOGLE_SERP` | Proxy group for Google Search (`GOOGLE_SERP` or `SHADER`). |
| `serpMaxRetries` | integer | `2` | Retries when the SERP fetch fails. |
| `proxyConfiguration` | object | `{ "useApifyProxy": true }` | Target-page proxies; auto-escalates to residential on block. |
| `scrapingTool` | string | `raw-http` | `raw-http` (supported) or `browser-playwright` (falls back to HTTP). |
| `removeElementsCssSelector` | string | (sensible default) | CSS selectors to strip before extraction. |
| `htmlTransformer` | enum | `none` | `none` or `readable` (article body). |
| `maxRequestRetries` | integer | `1` | Target-page retries (0-3). |
| `dynamicContentWaitSecs` | integer | `10` | For browser mode only (ignored for raw HTTP). |
| `removeCookieWarnings` | boolean | `true` | Strip cookie and GDPR dialogs. |
| `debugMode` | boolean | `false` | Add per-page debug info. |
Example input

    {
      "query": "best web scraping libraries 2026",
      "maxResults": 5,
      "outputFormats": ["markdown"],
      "removeCookieWarnings": true,
      "proxyConfiguration": { "useApifyProxy": false }
    }

Or scrape specific URLs:

    {
      "urls": [
        "https://apify.com",
        "https://playwright.dev",
        "https://crawlee.dev"
      ],
      "outputFormats": ["markdown", "text"]
    }

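If you build inputs programmatically, a small helper can catch out-of-range values before a run is started. The following is a minimal sketch, not part of the actor: `build_run_input` and its parameter names are hypothetical, and only the JSON field names and ranges are taken from the input table above.

```python
from typing import Any, Dict, List, Optional, Sequence

# Output formats listed in the input table.
ALLOWED_FORMATS = {"text", "markdown", "html"}


def build_run_input(
    query: Optional[str] = None,
    urls: Optional[List[str]] = None,
    max_results: int = 3,
    output_formats: Sequence[str] = ("markdown",),
    max_request_retries: int = 1,
) -> Dict[str, Any]:
    """Hypothetical helper: validate values and return an actor input dict."""
    if not query and not urls:
        raise ValueError("provide either a query or a list of URLs")
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    if not 0 <= max_request_retries <= 3:
        raise ValueError("maxRequestRetries must be between 0 and 3")
    if not output_formats:
        raise ValueError("outputFormats must not be empty")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    run_input: Dict[str, Any] = {
        "maxResults": max_results,
        "outputFormats": list(output_formats),
        "maxRequestRetries": max_request_retries,
    }
    if urls:
        run_input["urls"] = list(urls)  # bulk mode: search is skipped
    else:
        run_input["query"] = query
    return run_input
```

For example, `build_run_input(urls=["https://apify.com"], output_formats=("markdown", "text"))` yields a dict shaped like the second example input.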
Output
Each dataset row contains:

    {
      "crawl": {
        "httpStatusCode": 200,
        "httpStatusMessage": "OK",
        "loadedAt": "2026-05-19T12:50:40.591Z",
        "uniqueKey": "21f8d32712",
        "requestStatus": "handled"
      },
      "searchResult": {
        "title": "RAG Web Browser",
        "description": "Web search and fetch tool for AI agents and RAG pipelines ...",
        "url": "https://apify.com/apify/rag-web-browser",
        "resultType": "ORGANIC",
        "rank": 1
      },
      "metadata": {
        "title": "RAG Web Browser · Apify",
        "description": "Web search and fetch tool for AI agents and RAG pipelines.",
        "languageCode": "en",
        "url": "https://apify.com/apify/rag-web-browser",
        "redirectedUrl": "https://apify.com/apify/rag-web-browser"
      },
      "query": "web browser for RAG pipelines -site:reddit.com",
      "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents..."
    }

The Apify Console renders the dataset with five tabs:
- Overview: everything at a glance
- Search results: rank, title, snippet, URL
- Page metadata: title, description, language, redirect chain
- Crawl status: HTTP code, request outcome, timestamps
- Extracted content: Markdown / HTML / plain text per page
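When feeding rows to an LLM, it often helps to collapse each row into a single source-attributed snippet. The sketch below is illustrative only: `row_to_context` is a hypothetical helper, and the `text` and `html` keys are an assumption about how the non-Markdown formats are named in a row (only `markdown` appears in the sample above).

```python
from typing import Any, Dict, Optional


def row_to_context(row: Dict[str, Any], max_chars: int = 2000) -> Optional[str]:
    """Hypothetical helper: turn one dataset row into a prompt-ready snippet."""
    # Prefer Markdown; "text" and "html" are assumed key names for the
    # other output formats.
    content = row.get("markdown") or row.get("text") or row.get("html")
    if not content:
        return None  # nothing extracted, e.g. a failed page
    metadata = row.get("metadata") or {}
    title = metadata.get("title") or "(untitled)"
    # redirectedUrl is the final URL after redirects; fall back to url.
    url = metadata.get("redirectedUrl") or metadata.get("url") or ""
    return f"Source: {title} <{url}>\n\n{content[:max_chars]}"
```

Truncating with `max_chars` is a crude way to bound prompt size; a token-aware splitter would be a better fit for production pipelines.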
How to Use (Apify Console)
- Go to Apify Console → Actors.
- Open this actor (or import it as a task).
- Set your Search query or paste a list of URLs.
- Pick which Output formats you want (Markdown is the default).
- Click Start.
- Watch the run feed for live progress messages such as `Searching…`, `Page 1 - +10 new`, `[3] Fetching …`, `[3] 200 - Title…`, and `Progress: 5/10 (50%)`.
- Open the Output tab to browse results by section.
- Export as JSON / CSV / XLSX, or pull via the Apify API.
Use via API / Integration
REST API

    curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"query": "vector database benchmarks 2026", "maxResults": 5, "outputFormats": ["markdown"]}'

Python SDK

    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")
    run = client.actor("<ACTOR_ID>").call(run_input={
        "query": "LangChain vs LlamaIndex",
        "maxResults": 5,
    })
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item["metadata"]["title"], "-", item["markdown"][:200])

Drop-in for LangChain retrievers
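
    from langchain.schema import Document

    # `items` is the list of dataset rows, e.g. collected from
    # client.dataset(run["defaultDatasetId"]).iterate_items()
    docs = [
        Document(
            page_content=item["markdown"],
            metadata={"source": item["metadata"]["url"], "rank": item["searchResult"]["rank"]},
        )
        for item in items
        if item.get("markdown")
    ]
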
How blocking & proxies are handled
You don't need to think about proxies; the actor escalates automatically:
- Direct by default (fastest, cheapest).
- If a site blocks the request, a datacenter proxy is engaged.
- Still blocked? A residential proxy is used, with up to 3 retries.
- Once residential kicks in, it sticks for the rest of the run so successive pages don't fight the same wall.
All escalations are logged so you can audit them, e.g. `Switching to residential connection (sticky) - reason: site responded with 403`.
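The escalation order described above can be pictured as a small state machine. This is a rough sketch of the idea, not the actor's actual implementation: `ProxyLadder`, `fetch_once`, and the set of status codes treated as a block (403 comes from the log example, 429 is an assumption) are all hypothetical.

```python
from typing import Callable, Optional, Tuple

LADDER = ("direct", "datacenter", "residential")
BLOCK_STATUSES = {403, 429}  # assumption: which statuses count as "blocked"
RESIDENTIAL_ATTEMPTS = 3     # up to 3 tries on the residential tier


class ProxyLadder:
    """Tracks whether the run has already escalated to residential."""

    def __init__(self) -> None:
        self.sticky_residential = False

    def fetch(
        self, url: str, fetch_once: Callable[[str, str], int]
    ) -> Tuple[Optional[str], int]:
        """Try tiers in order; return (tier that worked or None, last status).

        fetch_once(url, tier) is a caller-supplied function that performs one
        request through the given tier and returns the HTTP status code.
        """
        tiers = ("residential",) if self.sticky_residential else LADDER
        status = 0
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                attempts = RESIDENTIAL_ATTEMPTS
                self.sticky_residential = True  # stays on for later pages
            for _ in range(attempts):
                status = fetch_once(url, tier)
                if status not in BLOCK_STATUSES:
                    return tier, status
        return None, status
```

Keeping the sticky flag on the object rather than per URL is what makes later pages skip the direct and datacenter tiers once one page has needed residential.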
Best Use Cases
- RAG pipelines: feed fresh web context to your LLM at query time
- AI agents: give Claude / GPT / Gemini a real web-browsing skill
- Research assistants: bulk-summarize the top N results for a topic
- Competitive intelligence: track competitor pages on a schedule
- Content monitoring: convert articles to Markdown for analysis
- Prompt enrichment: auto-grab fresh facts before generating text
Pricing
This actor is pay-per-usage: you only pay for the Apify platform compute units (CUs) and proxy traffic it actually uses. There are no separate per-event charges.
| Driver | Notes |
|---|---|
| Compute units | Proportional to memory × runtime. Typical 10-result run = a few cents. |
| Datacenter proxy | Used only if a site blocks the direct request. |
| Residential proxy | Used as a last resort. Higher cost but unblocks most walls. |
| Storage | A few KB per dataset row. |
Want to lower cost further? Set `maxResults` lower, enable `htmlTransformer: "readable"`, or skip `html` output.
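To get a feel for the "memory × runtime" driver, the arithmetic can be sketched as follows. This assumes Apify's usual definition of one compute unit as 1 GB of memory used for 1 hour; `estimate_compute_units` is a hypothetical helper, and the actual price per CU depends on your plan, so it is left as a required parameter rather than guessed.

```python
def estimate_compute_units(memory_mb: int, runtime_secs: float) -> float:
    """Estimate CUs, assuming 1 CU = 1 GB of memory for 1 hour."""
    return (memory_mb / 1024) * (runtime_secs / 3600)


def estimate_compute_cost(memory_mb: int, runtime_secs: float, usd_per_cu: float) -> float:
    """Compute-only cost; usd_per_cu comes from your own Apify plan."""
    return estimate_compute_units(memory_mb, runtime_secs) * usd_per_cu
```

Under that assumption, a run with 512 MB of memory lasting 90 seconds uses 0.5 × 0.025 = 0.0125 CU. Proxy traffic and storage are billed separately and are not covered by this estimate.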
Frequently Asked Questions
Q: Do I need to configure a proxy myself?
A: No. Start with no proxy (the default). If a site blocks the direct request, the actor automatically tries datacenter, then residential. You only need to pick a proxy explicitly if you want a specific geography.
Q: How is the Markdown produced?
A: We parse HTML with selectolax (lexbor backend), strip noise via your CSS selectors, optionally isolate the article body, then convert to Markdown via markdownify with ATX-style headings.
Q: Can I scrape JavaScript-heavy sites?
A: This actor uses HTTP-only fetching for maximum speed. For sites that require a full browser (heavy SPA / login flows), use a Playwright-based actor.
Q: Does it handle redirects?
A: Yes. `metadata.redirectedUrl` captures the final URL after following redirects.
Q: What happens if half my pages succeed and half fail?
A: You still get the successful ones. Each record is pushed to the dataset live, so a crash mid-run cannot wipe earlier results. Failed pages are saved with `crawl.requestStatus: "failed"` and the error message.
Q: Can I export results?
A: Yes. JSON, CSV, XLSX, RSS, XML, and HTML table are all available in the Output tab and via the Apify API.
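For the partial-failure case in the FAQ, rows can be separated on `crawl.requestStatus` after a run. The helpers below are a hedged sketch: `split_rows` and `retry_urls` are hypothetical names, and reading the failed page's URL from `searchResult.url` or `metadata.url` is an assumption about what a failed row contains.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_rows(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Partition dataset rows into (succeeded, failed)."""
    ok: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        status = (row.get("crawl") or {}).get("requestStatus")
        (failed if status == "failed" else ok).append(row)
    return ok, failed


def retry_urls(failed: Iterable[Row]) -> List[str]:
    """Collect unique URLs of failed rows, preserving order.

    Assumes the URL is available under searchResult.url or metadata.url.
    """
    urls: List[str] = []
    for row in failed:
        url = (row.get("searchResult") or {}).get("url") or (
            row.get("metadata") or {}
        ).get("url")
        if url and url not in urls:
            urls.append(url)
    return urls
```

The list returned by `retry_urls` could then be passed as the `urls` input of a follow-up run.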
Cautions / Legal
- The actor scrapes only publicly available web content.
- Don't use it to scrape private, gated, or authenticated content unless you have explicit authorization.
- You are responsible for legal compliance (GDPR, CCPA, site Terms of Service, robots.txt, copyright).
- Be a good citizen: avoid excessive `maxResults` on sites you do not own or operate.
Support & Feedback
Found a bug or have a feature request? Open an issue from the actor page in the Apify Console and we'll take a look. PRs welcome.
Built on the Apify platform.