🤖 RAG Web Browser
🚀 Give your LLM live web access in seconds. Search the web or fetch specific URLs and return clean Markdown with 17 metadata fields per page. No API key, no registration, no manual content cleaning.
🕒 Last updated: 2026-04-24 · 📊 17 fields per record · ⚡ 10 pages in ~6 seconds · 🔎 Search + fetch · 🧠 LLM-optimized output
The RAG Web Browser is built for retrieval-augmented generation pipelines, autonomous agents, and any workflow where an LLM needs grounded, up-to-date web content. Send a search query to get the top N results, or pass a list of URLs to fetch them directly. Every page is stripped of navigation, ads, and boilerplate, then converted to clean Markdown that feeds directly into embedding pipelines and vector databases.
Each record ships with rich metadata including title, description, author, published time, modified time, site name, Open Graph image, language, word count, and estimated reading time. Search results include a rankFromSearch field so you can weight retrieval by original engine position. Concurrent fetching keeps 10 URLs flying in parallel, so research agents stay snappy and RAG refreshes finish while your coffee is still hot.
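The "feeds directly into embedding pipelines" claim above typically comes down to splitting each record's markdown field into overlapping chunks before embedding. A minimal sketch of that downstream step (the chunk and overlap sizes here are illustrative choices, not Actor parameters):

```python
def split_for_embedding(markdown: str, chunk_words: int = 200, overlap: int = 40):
    """Split Markdown into overlapping word-window chunks for embedding."""
    words = markdown.split()
    if not words:
        return []
    step = max(chunk_words - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and upserted into a vector database alongside the record's url and publishedTime for citation and freshness filtering.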
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| AI engineers, RAG builders, research agent developers, LLM app teams, content researchers, data scientists | Live RAG context, agent web browsing, knowledge base refresh, competitive intelligence, fact-grounding |
📋 What the RAG Web Browser does
Five content workflows in a single run:
- 🔎 Search mode. Pass a text query and get the top N results from DuckDuckGo with clean content for each.
- 🎯 URL mode. Provide specific URLs and the scraper fetches them in parallel.
- 📝 Clean Markdown. Strips navigation, footers, sidebars, scripts, and ads. Preserves headings, lists, blockquotes, and code blocks.
- 📊 Rich metadata. Title, description, author, publishedTime, modifiedTime, siteName, og:image, language, word count, reading time.
- 🏆 Search rank preserved. When searching, every result keeps its rank position so you can weight retrieval accordingly.
Output comes as markdown, plain text, or raw HTML. You can also request an outbound-link dump when you need to follow references.
💡 Why it matters: LLMs trained on data older than six months cannot answer questions about today's news, pricing, or product documentation. This Actor gives them a live window on the web without you having to build browser automation, proxies, or content cleaners.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to wire the output into a RAG stack.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| query | string | "" | Search query (use this OR startUrls). Engine is DuckDuckGo. |
| startUrls | array of URLs | [] | Specific URLs to fetch (use this OR query). |
| maxResults | integer | 10 | Search results to fetch when using query mode. |
| maxItems | integer | 10 | Maximum records returned. Free plan caps at 10, paid plan at 1,000,000. |
| outputFormats | array | ["markdown","text"] | Any subset of markdown, text, html. |
| includeLinks | boolean | false | Include every outbound link from each page. |
Example: search mode for Claude pricing research.
```json
{
  "query": "Anthropic Claude API pricing 2026",
  "maxResults": 10,
  "maxItems": 10,
  "outputFormats": ["markdown", "text"]
}
```
Example: fetch a list of known URLs for a RAG refresh.
```json
{
  "startUrls": [
    { "url": "https://docs.apify.com/platform/actors" },
    { "url": "https://docs.apify.com/platform/schedules" },
    { "url": "https://docs.apify.com/api/v2" }
  ],
  "maxItems": 3,
  "outputFormats": ["markdown"],
  "includeLinks": true
}
```
⚠️ Good to Know: single-page apps with heavy client-side rendering sometimes return thin content because the scraper fetches server-rendered HTML. For JavaScript-heavy sites (Notion, Gitbook, some app dashboards), pair this Actor with Website Content Crawler and its browser rendering mode.
📊 Output
Each record contains 17 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 url | string | "https://www.anthropic.com/pricing" |
| 🏷️ title | string \| null | "Claude Pricing" |
| 📝 description | string \| null | "Explore pricing for Claude models." |
| 📃 markdown | string | "# Claude Pricing\n\n## Opus 4.6..." |
| 💬 text | string | "Claude Pricing Opus 4.6..." |
| 🧾 html | string \| null | raw HTML if requested |
| 🔗 links | array \| null | outbound links if requested |
| 🔢 wordCount | number | 1240 |
| ⏱️ readingTimeMinutes | number | 7 |
| 🌍 language | string \| null | "en" |
| 🧑 author | string \| null | "Anthropic Team" |
| 📅 publishedTime | ISO 8601 \| null | "2025-02-24T00:00:00Z" |
| 🔁 modifiedTime | ISO 8601 \| null | "2025-03-10T00:00:00Z" |
| 🏢 siteName | string \| null | "Anthropic" |
| 🖼️ imageUrl | string \| null | "https://.../og.png" |
| 🏆 rankFromSearch | number \| null | 1 |
| 🕒 fetchedAt | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| 🟢 httpStatus | number | 200 |
| ⏱️ responseTimeMs | number | 412 |
| ❗ error | string \| null | "Timeout" on failure |
📦 Sample records
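A hedged illustration of what a single search-mode record might look like (all values are invented for the example; only the shape follows the schema above):

```json
{
  "url": "https://www.anthropic.com/pricing",
  "title": "Claude Pricing",
  "description": "Explore pricing for Claude models.",
  "markdown": "# Claude Pricing\n\n...",
  "text": "Claude Pricing ...",
  "html": null,
  "links": null,
  "wordCount": 1240,
  "readingTimeMinutes": 7,
  "language": "en",
  "author": "Anthropic Team",
  "publishedTime": "2025-02-24T00:00:00Z",
  "modifiedTime": "2025-03-10T00:00:00Z",
  "siteName": "Anthropic",
  "imageUrl": null,
  "rankFromSearch": 1,
  "fetchedAt": "2026-04-21T12:00:00.000Z",
  "httpStatus": 200,
  "responseTimeMs": 412,
  "error": null
}
```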
✨ Why choose this Actor
| | Capability |
|---|---|
| 🧠 | LLM-ready output. Markdown is clean, deterministic, and free of navigation noise. |
| 🔎 | Search or fetch. One input for search, another for direct URLs, same clean output. |
| 📊 | 17 metadata fields. Enrich retrieval with author, publishedTime, reading time, and rank. |
| ⚡ | Fast. 10 pages in about 6 seconds with concurrency of 10. |
| 🔁 | Repeatable. Same URL + same query always produces the same structured record. |
| 🚫 | No authentication. Works with public URLs and the public DuckDuckGo HTML endpoint. |
| 🔌 | Integrations. Drop into LangChain, LlamaIndex, or any tool that can consume JSON records. |
📊 Clean markdown from live web context is the fastest way to extend an LLM beyond its training cutoff. This Actor delivers it without browser automation or custom cleaners.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ RAG Web Browser (this Actor) | $5 free credit, then pay-per-use | Any public URL | Live per run | search + URL list, format picker | ⚡ 2 min |
| Paid live search APIs | $99+/month | Search results only | Real-time | Query only | ⏳ Hours |
| DIY Playwright scrapers | Free | Your code | Your schedule | Whatever you build | 🐢 Days |
| Headless browser cloud | $$$ per hour | Any URL | Live | Custom scripts | 🕒 Variable |
Pick this Actor when you want LLM-ready web context in minutes without cloud-browser billing or custom cleaner code.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the RAG Web Browser page on the Apify Store.
- 🎯 Pick a mode. Enter a search query OR a list of URLs, set maxItems, and choose output formats.
- 🚀 Run it. Click Start and let the Actor collect your content.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating RAG Web Browser
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the apify-client NPM package.
- 🐍 Python. Use the apify-client PyPI package.
- 📚 See the Apify API documentation for full details.
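As a minimal sketch of a programmatic run, the Apify REST API exposes a run-sync-get-dataset-items endpoint that starts an Actor and returns its dataset in a single call. The Actor ID and token below are placeholders (use the ID shown on the Actor page), and the input shape follows the examples in the Input section:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_input(query: str, max_items: int = 10) -> dict:
    """Assemble a search-mode input matching the Actor's input schema."""
    return {
        "query": query,
        "maxResults": max_items,
        "maxItems": max_items,
        "outputFormats": ["markdown", "text"],
    }

def run_sync_url(actor_id: str, token: str) -> str:
    """Endpoint that runs the Actor and returns its dataset items in one call."""
    # Apify writes the "/" in an Actor ID as "~" in the URL path.
    path = actor_id.replace("/", "~")
    return f"{API_BASE}/acts/{path}/run-sync-get-dataset-items?token={token}"

if __name__ == "__main__":
    # Placeholder Actor ID and token -- substitute the values from your account.
    url = run_sync_url("parseforge/rag-web-browser", "<YOUR_APIFY_TOKEN>")
    body = json.dumps(build_run_input("Anthropic Claude API pricing 2026")).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for record in json.load(resp):
            print(record["url"], record.get("wordCount"))
```

The same call works from the apify-client packages, which add retries and pagination on top of this endpoint.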
The Apify Schedules feature lets you trigger this Actor on any cron interval. Hourly refreshes keep a RAG pipeline grounded in fresh content.
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge Actor in the AI assistant of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
🧩 How does it work?
Pass a search query or a list of URLs. For search mode, the Actor queries DuckDuckGo and fetches the top results in parallel. For URL mode, it fetches each URL directly. Every page is cleaned to remove navigation, ads, and boilerplate, then converted to Markdown with metadata.
📏 How accurate is the content extraction?
Very accurate for article and documentation pages. Pages that rely entirely on client-side JavaScript to render content may return thin results; pair with Website Content Crawler in browser mode for those.
🔁 Can I refresh a RAG index on a schedule?
Yes. Apify Schedules lets you run this Actor on any cron interval. Pipe the output into your vector database via webhooks or the Apify API.
🎯 Which search engine does search mode use?
DuckDuckGo HTML, which is reliable and does not require authentication. For Google-specific SERPs, use the Google Search Scraper.
⏰ Can I schedule regular runs?
Yes. Use Apify Schedules to run this Actor on any cron interval and keep your knowledge base in sync.
⚖️ Is it legal to use for RAG?
Fetching publicly available content is generally fine. Check your target sites' terms of service and robots.txt. Some publishers require attribution or block commercial reuse.
💼 Can I use this commercially?
Yes. Public web content is commonly used for RAG, research, and commercial AI products. Respect copyright and the licensing of each source.
💳 Do I need a paid Apify plan to use this Actor?
No. The free plan covers testing (10 pages per run). A paid plan lifts that cap, raises concurrency, and unlocks Apify residential proxies.
🔁 What happens if a run fails or gets interrupted?
Apify retries transient errors automatically. Partial datasets from failed runs are preserved. Failed URLs include an error field so you can filter them downstream.
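Because failed URLs surface as records with a non-null error field, downstream filtering is a one-liner. A small sketch with invented sample records:

```python
def split_by_status(records):
    """Separate successful records from those that carry an error field."""
    ok = [r for r in records if not r.get("error")]
    failed = [r for r in records if r.get("error")]
    return ok, failed
```

Feed only the ok list to your embedding pipeline and re-queue the failed URLs in a later run.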
🧾 Does it strip scripts and ads?
Yes. Script, style, noscript, iframe, nav, footer, header, and aside tags are removed before conversion to Markdown.
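To make the tag-stripping concrete, here is a toy sketch of that kind of cleaning using Python's standard-library HTML parser. This is not the Actor's actual implementation, only an illustration of dropping the listed boilerplate tags before extraction:

```python
from html.parser import HTMLParser

# Tags whose entire contents are discarded, per the FAQ answer above.
STRIPPED = {"script", "style", "noscript", "iframe", "nav", "footer", "header", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect text while skipping everything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many stripped tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return " ".join(parser.parts)
```

Running it on a page with a nav, a script, and a footer keeps only the body text.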
🔗 Can I also get outbound links?
Yes. Set includeLinks: true and each record will include every <a href> found on the page with its text label.
🆘 What if I need help?
Our team is available through the Apify platform and the Tally form below.
🔌 Integrate with any app
RAG Web Browser connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe content into your warehouse
- GitHub - Trigger runs from commits
- Google Drive - Export content to Sheets or Docs
You can also use webhooks to push freshly fetched Markdown into vector databases and any downstream RAG stack.
🔗 Recommended Actors
- 🕸️ Website Content Crawler - Deep-crawl a domain with depth and JS rendering
- 📰 Smart Article Extractor - Extract clean article text from news sites
- 🔍 Google Search Scraper - SERP results with rank and description
- 📧 Contact Info Scraper - Emails, phones, and socials from URLs
- 📸 URL Screenshot Tool - Full-page screenshots as PNG, JPEG, or PDF
💡 Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any search engine or website. Only publicly accessible web content is fetched. Respect the robots.txt and terms of service of every site you add to the input.