🤖 RAG Web Browser
🚀 Give your LLM live web access in seconds. Search the web or fetch specific URLs and return clean Markdown with 17 metadata fields per page. No API key, no registration, no manual content cleaning.
🕒 Last updated: 2026-04-24 · 📊 17 fields per record · ⚡ 10 pages in ~6 seconds · 🔎 Search + fetch · 🧠 LLM-optimized output
The RAG Web Browser is built for retrieval-augmented generation pipelines, autonomous agents, and any workflow where an LLM needs grounded, up-to-date web content. Send a search query to get the top N results, or pass a list of URLs to fetch them directly. Every page is stripped of navigation, ads, and boilerplate, then converted to clean Markdown that feeds directly into embedding pipelines and vector databases.
Each record ships with rich metadata including title, description, author, published time, modified time, site name, Open Graph image, language, word count, and estimated reading time. Search results include a rankFromSearch field so you can weight retrieval by original engine position. Concurrent fetching keeps 10 URLs flying in parallel, so research agents stay snappy and RAG refreshes finish while your coffee is still hot.
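The "feeds directly into embedding pipelines" claim above typically comes down to splitting each record's markdown field into overlapping chunks before embedding. A minimal sketch of that downstream step (the chunk and overlap sizes here are illustrative choices, not Actor parameters):

```python
def split_for_embedding(markdown: str, chunk_words: int = 200, overlap: int = 40):
    """Split Markdown into overlapping word-window chunks for embedding."""
    words = markdown.split()
    if not words:
        return []
    step = max(chunk_words - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded and upserted into a vector database alongside the record's url and publishedTime for citation and freshness filtering.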
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| AI engineers, RAG builders, research agent developers, LLM app teams, content researchers, data scientists | Live RAG context, agent web browsing, knowledge base refresh, competitive intelligence, fact-grounding |
📋 What the RAG Web Browser does
Five content workflows in a single run:
- 🔎 Search mode. Pass a text query and get the top N results from DuckDuckGo with clean content for each.
- 🎯 URL mode. Provide specific URLs and the scraper fetches them in parallel.
- 📝 Clean Markdown. Strips navigation, footers, sidebars, scripts, and ads. Preserves headings, lists, blockquotes, and code blocks.
- 📊 Rich metadata. Title, description, author, publishedTime, modifiedTime, siteName, og:image, language, word count, reading time.
- 🏆 Search rank preserved. When searching, every result keeps its rank position so you can weight retrieval accordingly.
Output comes as markdown, plain text, or raw HTML. You can also request an outbound-link dump when you need to follow references.
💡 Why it matters: LLMs trained on data older than six months cannot answer questions about today's news, pricing, or product documentation. This Actor gives them a live window on the web without you having to build browser automation, proxies, or content cleaners.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to wire the output into a RAG stack.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| query | string | "" | Search query (use this OR startUrls). Engine is DuckDuckGo. |
| startUrls | array of URLs | [] | Specific URLs to fetch (use this OR query). |
| maxResults | integer | 10 | Search results to fetch when using query mode. |
| maxItems | integer | 10 | Maximum records returned. Free plan caps at 10, paid plan at 1,000,000. |
| outputFormats | array | ["markdown","text"] | Any subset of markdown, text, html. |
| includeLinks | boolean | false | Include every outbound link from each page. |
Example: search mode for Claude pricing research.
```json
{
  "query": "Anthropic Claude API pricing 2026",
  "maxResults": 10,
  "maxItems": 10,
  "outputFormats": ["markdown", "text"]
}
```
Example: fetch a list of known URLs for a RAG refresh.
```json
{
  "startUrls": [
    { "url": "https://docs.apify.com/platform/actors" },
    { "url": "https://docs.apify.com/platform/schedules" },
    { "url": "https://docs.apify.com/api/v2" }
  ],
  "maxItems": 3,
  "outputFormats": ["markdown"],
  "includeLinks": true
}
```
⚠️ Good to Know: single-page apps with heavy client-side rendering sometimes return thin content because the scraper fetches server-rendered HTML. For JavaScript-heavy sites (Notion, Gitbook, some app dashboards), pair this Actor with Website Content Crawler and its browser rendering mode.
📊 Output
Each record contains 17 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 url | string | "https://www.anthropic.com/pricing" |
| 🏷️ title | string \| null | "Claude Pricing" |
| 📝 description | string \| null | "Explore pricing for Claude models." |
| 📃 markdown | string | "# Claude Pricing\n\n## Opus 4.6..." |
| 💬 text | string | "Claude Pricing Opus 4.6..." |
| 🧾 html | string \| null | raw HTML if requested |
| 🔗 links | array \| null | outbound links if requested |
| 🔢 wordCount | number | 1240 |
| ⏱️ readingTimeMinutes | number | 7 |
| 🌍 language | string \| null | "en" |
| 🧑 author | string \| null | "Anthropic Team" |
| 📅 publishedTime | ISO 8601 \| null | "2025-02-24T00:00:00Z" |
| 🔁 modifiedTime | ISO 8601 \| null | "2025-03-10T00:00:00Z" |
| 🏢 siteName | string \| null | "Anthropic" |
| 🖼️ imageUrl | string \| null | "https://.../og.png" |
| 🏆 rankFromSearch | number \| null | 1 |
| 🕒 fetchedAt | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| 🟢 httpStatus | number | 200 |
| ⏱️ responseTimeMs | number | 412 |
| ❗ error | string \| null | "Timeout" on failure |
📦 Sample records
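A hedged illustration of what a single search-mode record might look like (all values are invented for the example; only the shape follows the schema above):

```json
{
  "url": "https://www.anthropic.com/pricing",
  "title": "Claude Pricing",
  "description": "Explore pricing for Claude models.",
  "markdown": "# Claude Pricing\n\n...",
  "text": "Claude Pricing ...",
  "html": null,
  "links": null,
  "wordCount": 1240,
  "readingTimeMinutes": 7,
  "language": "en",
  "author": "Anthropic Team",
  "publishedTime": "2025-02-24T00:00:00Z",
  "modifiedTime": "2025-03-10T00:00:00Z",
  "siteName": "Anthropic",
  "imageUrl": null,
  "rankFromSearch": 1,
  "fetchedAt": "2026-04-21T12:00:00.000Z",
  "httpStatus": 200,
  "responseTimeMs": 412,
  "error": null
}
```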
✨ Why choose this Actor
| | Capability |
|---|---|
| 🧠 | LLM-ready output. Markdown is clean, deterministic, and free of navigation noise. |
| 🔎 | Search or fetch. One input for search, another for direct URLs, same clean output. |
| 📊 | 17 metadata fields. Enrich retrieval with author, publishedTime, reading time, and rank. |
| ⚡ | Fast. 10 pages in about 6 seconds with concurrency of 10. |
| 🔁 | Repeatable. Same URL + same query always produces the same structured record. |
| 🚫 | No authentication. Works with public URLs and the public DuckDuckGo HTML endpoint. |
| 🔌 | Integrations. Drop into LangChain, LlamaIndex, or any tool that can consume JSON records. |
📊 Clean markdown from live web context is the fastest way to extend an LLM beyond its training cutoff. This Actor delivers it without browser automation or custom cleaners.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ RAG Web Browser (this Actor) | $5 free credit, then pay-per-use | Any public URL | Live per run | search + URL list, format picker | ⚡ 2 min |
| Paid live search APIs | $99+/month | Search results only | Real-time | Query only | ⏳ Hours |
| DIY Playwright scrapers | Free | Your code | Your schedule | Whatever you build | 🐢 Days |
| Headless browser cloud | $$$ per hour | Any URL | Live | Custom scripts | 🕒 Variable |
Pick this Actor when you want LLM-ready web context in minutes without cloud-browser billing or custom cleaner code.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the RAG Web Browser page on the Apify Store.
- 🎯 Pick a mode. Enter a search query OR a list of URLs, set maxItems, and choose output formats.
- 🚀 Run it. Click Start and let the Actor collect your content.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating RAG Web Browser
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the apify-client NPM package.
- 🐍 Python. Use the apify-client PyPI package.
- 📚 See the Apify API documentation for full details.
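As a minimal sketch of a programmatic run, the Apify REST API exposes a run-sync-get-dataset-items endpoint that starts an Actor and returns its dataset in a single call. The Actor ID and token below are placeholders (use the ID shown on the Actor page), and the input shape follows the examples in the Input section:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_input(query: str, max_items: int = 10) -> dict:
    """Assemble a search-mode input matching the Actor's input schema."""
    return {
        "query": query,
        "maxResults": max_items,
        "maxItems": max_items,
        "outputFormats": ["markdown", "text"],
    }

def run_sync_url(actor_id: str, token: str) -> str:
    """Endpoint that runs the Actor and returns its dataset items in one call."""
    # Apify writes the "/" in an Actor ID as "~" in the URL path.
    path = actor_id.replace("/", "~")
    return f"{API_BASE}/acts/{path}/run-sync-get-dataset-items?token={token}"

if __name__ == "__main__":
    # Placeholder Actor ID and token -- substitute the values from your account.
    url = run_sync_url("parseforge/rag-web-browser", "<YOUR_APIFY_TOKEN>")
    body = json.dumps(build_run_input("Anthropic Claude API pricing 2026")).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for record in json.load(resp):
            print(record["url"], record.get("wordCount"))
```

The same call works from the apify-client packages, which add retries and pagination on top of this endpoint.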
The Apify Schedules feature lets you trigger this Actor on any cron interval. Hourly refreshes keep a RAG pipeline grounded in fresh content.
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge Actor in the AI assistant of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
🧩 How does it work?
Pass a search query or a list of URLs. For search mode, the Actor queries DuckDuckGo and fetches the top results in parallel. For URL mode, it fetches each URL directly. Every page is cleaned to remove navigation, ads, and boilerplate, then converted to Markdown with metadata.
📏 How accurate is the content extraction?
Very accurate for article and documentation pages. Pages that rely entirely on client-side JavaScript to render content may return thin results; pair with Website Content Crawler in browser mode for those.
🔁 Can I refresh a RAG index on a schedule?
Yes. Apify Schedules lets you run this Actor on any cron interval. Pipe the output into your vector database via webhooks or the Apify API.
🎯 Which search engine does search mode use?
DuckDuckGo HTML, which is reliable and does not require authentication. For Google-specific SERPs, use the Google Search Scraper.
⏰ Can I schedule regular runs?
Yes. Use Apify Schedules to run this Actor on any cron interval and keep your knowledge base in sync.
⚖️ Is it legal to use for RAG?
Fetching publicly available content is generally fine. Check your target sites' terms of service and robots.txt. Some publishers require attribution or block commercial reuse.
💼 Can I use this commercially?
Yes. Public web content is commonly used for RAG, research, and commercial AI products. Respect copyright and the licensing of each source.
💳 Do I need a paid Apify plan to use this Actor?
No. The free plan covers testing (10 pages per run). A paid plan lifts that cap, raises concurrency, and unlocks Apify residential proxies.
🔁 What happens if a run fails or gets interrupted?
Apify retries transient errors automatically. Partial datasets from failed runs are preserved. Failed URLs include an error field so you can filter them downstream.
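Because failed URLs surface as records with a non-null error field, downstream filtering is a one-liner. A small sketch with invented sample records:

```python
def split_by_status(records):
    """Separate successful records from those that carry an error field."""
    ok = [r for r in records if not r.get("error")]
    failed = [r for r in records if r.get("error")]
    return ok, failed
```

Feed only the ok list to your embedding pipeline and re-queue the failed URLs in a later run.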
🧾 Does it strip scripts and ads?
Yes. Script, style, noscript, iframe, nav, footer, header, and aside tags are removed before conversion to Markdown.
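To make the tag-stripping concrete, here is a toy sketch of that kind of cleaning using Python's standard-library HTML parser. This is not the Actor's actual implementation, only an illustration of dropping the listed boilerplate tags before extraction:

```python
from html.parser import HTMLParser

# Tags whose entire contents are discarded, per the FAQ answer above.
STRIPPED = {"script", "style", "noscript", "iframe", "nav", "footer", "header", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect text while skipping everything inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many stripped tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return " ".join(parser.parts)
```

Running it on a page with a nav, a script, and a footer keeps only the body text.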
🔗 Can I also get outbound links?
Yes. Set includeLinks: true and each record will include every <a href> found on the page with its text label.
🆘 What if I need help?
Our team is available through the Apify platform and the Tally form below.
🔌 Integrate with any app
RAG Web Browser connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe content into your warehouse
- GitHub - Trigger runs from commits
- Google Drive - Export content to Sheets or Docs
You can also use webhooks to push freshly fetched Markdown into vector databases and any downstream RAG stack.
🔗 Recommended Actors
- 🕸️ Website Content Crawler - Deep-crawl a domain with depth and JS rendering
- 📰 Smart Article Extractor - Extract clean article text from news sites
- 🔍 Google Search Scraper - SERP results with rank and description
- 📧 Contact Info Scraper - Emails, phones, and socials from URLs
- 📸 URL Screenshot Tool - Full-page screenshots as PNG, JPEG, or PDF
💡 Pro Tip: browse the complete ParseForge collection for more AI-ready web tools.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any search engine or website. Only publicly accessible web content is fetched. Respect the robots.txt and terms of service of every site you add to the input.