Pricing

from $1.00 / 1,000 results

RAG Web Browser

Search the web or fetch direct URLs and return clean markdown for LLM/RAG pipelines. Filters: domainAllowlist, domainBlocklist, minTextLength, keywordsAnyOf. No login, no cookies.

Developer

Crawler Bros

Last modified: 3 months ago

What this actor does

Sends a query to Google → extracts top organic results → fetches each → cleans HTML → emits structured records
Or fetches direct URLs (startUrls) and skips Google entirely
Strips boilerplate (nav, footer, ads, scripts) before extracting the main article
Outputs markdown (default), plain text, and/or raw html per page
Returns one clean record per page with title, description, language, word count, reading time

Output per page

url, loadedUrl, domain
title, description, languageCode
text (plain text) — when requested
markdown (LLM-ready) — when requested
html (raw cleaned HTML) — when requested
wordCount, readingTimeMinutes (220 wpm)
httpStatusCode, loadedTime (seconds)
searchRank (1-based — when the URL came from Google search)
recordType: "page", scrapedAt

Empty fields are omitted from the output (no nulls).
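
Because empty fields are dropped rather than set to null, consumer code should not index optional keys directly. The following is a minimal sketch of defensive access, not the actor's own code: the `summarize` helper and the sample record are hypothetical, and it assumes `url` is present on every record.

```python
from typing import Any


def summarize(record: dict[str, Any]) -> dict[str, Any]:
    """Build a compact summary from one output record (hypothetical helper).

    Optional fields may be absent entirely (never null), so use .get()
    with a fallback instead of record["field"].
    """
    # Use whichever body format was requested: markdown, then text, then html.
    body = record.get("markdown") or record.get("text") or record.get("html") or ""
    return {
        "url": record["url"],  # assumption: always present
        "title": record.get("title", "(untitled)"),
        "description": record.get("description", ""),
        "from_search": "searchRank" in record,  # only set for Google results
        "chars": len(body),
    }


# Hypothetical record: no description, fetched via startUrls (no searchRank).
sample = {
    "url": "https://example.com/post",
    "title": "Example post",
    "markdown": "# Example post\n\nHello.",
    "recordType": "page",
}
summary = summarize(sample)
```

Checking for key presence (`"searchRank" in record`) rather than truthiness keeps the distinction between "field omitted" and "field present with a falsy value" explicit.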

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | string | `"what is retrieval augmented generation"` | Search query OR a single URL |
| `startUrls` | array | `[]` | Direct URLs to fetch (skips Google search) |
| `maxResults` | int | `3` | Number of top organic Google results to fetch (1–100) |
| `outputFormats` | array | `["markdown"]` | Any combination of `markdown`, `text`, `html` |
| `requestTimeoutSecs` | int | `40` | Per-page HTTP timeout (1–300s) |
| `scrapingTool` | enum | `raw-http` | `raw-http` (curl_cffi) or `browser-playwright` |
| `removeElementsCssSelector` | string | nav/footer/aside/script/... | CSS selector(s) to strip before extraction |
| `htmlTransformer` | enum | `readable-text` | `readable-text` (main article) or `none` |
| `desiredConcurrency` | int | `5` | Parallel fetches (0 = auto) |
| `maxRequestRetries` | int | `2` | Retries on transient HTTP failures |
| `dynamicContentWaitSecs` | int | `5` | Wait time for JS content (browser mode only) |
| `removeCookieWarnings` | bool | `true` | Dismiss cookie/consent dialogs (browser mode) |
| `useApifyProxy` | bool | `true` | Route requests through Apify proxy |
| `domainAllowlist` | array | `[]` | Only emit pages whose host contains one of these substrings |
| `domainBlocklist` | array | `[]` | Drop pages whose host contains one of these substrings |
| `minTextLength` | int | – | Drop pages with fewer than N characters of extracted text |
| `excludeContentSelectors` | array | `[]` | Additional CSS selectors to strip |
| `keywordsAnyOf` | array | `[]` | Only emit pages containing at least one of these keywords |
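
The ranges and enum values in the table can be checked on the client before a run is started. The sketch below is one way to do that in Python; `build_input` is a hypothetical helper, and the actor itself may validate its input differently or more strictly.

```python
from typing import Any

# Allowed values and ranges as listed in the input table.
ALLOWED_FORMATS = {"markdown", "text", "html"}
ALLOWED_TOOLS = {"raw-http", "browser-playwright"}


def build_input(**fields: Any) -> dict[str, Any]:
    """Return the input dict unchanged after basic client-side checks."""
    run_input = dict(fields)

    max_results = run_input.get("maxResults", 3)
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")

    timeout = run_input.get("requestTimeoutSecs", 40)
    if not 1 <= timeout <= 300:
        raise ValueError("requestTimeoutSecs must be between 1 and 300")

    unknown = set(run_input.get("outputFormats", ["markdown"])) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unknown outputFormats: {sorted(unknown)}")

    if run_input.get("scrapingTool", "raw-http") not in ALLOWED_TOOLS:
        raise ValueError("scrapingTool must be raw-http or browser-playwright")

    return run_input


valid_input = build_input(query="vector embeddings tutorial", maxResults=10)
```

Fields that are not passed are left out of the returned dict, so the actor's own defaults still apply.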

Example: search query

{
  "query": "best vector database for RAG 2024",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "minTextLength": 500,
  "domainBlocklist": ["pinterest.com", "youtube.com"]
}

Example: direct URLs

{
  "startUrls": [
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    "https://docs.langchain.com/docs/use-cases/qa/"
  ],
  "outputFormats": ["markdown", "text"],
  "htmlTransformer": "readable-text"
}

Example: filter for relevance

{
  "query": "vector embeddings tutorial",
  "maxResults": 10,
  "keywordsAnyOf": ["embedding", "vector", "similarity"],
  "minTextLength": 1000,
  "outputFormats": ["markdown"]
}
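
To make the filter semantics concrete, here is a rough Python sketch of how the four filters could be applied to a fetched page. It is an illustration under stated assumptions, not the actor's implementation: `passes_filters` is a hypothetical name, matching is assumed to be case-insensitive, the host is assumed to come from `loadedUrl` (falling back to `url`), and the text length is assumed to be measured on the extracted text or markdown.

```python
from urllib.parse import urlparse


def passes_filters(record, allowlist=(), blocklist=(),
                   min_text_length=0, keywords_any_of=()):
    """Return True if a page record survives all configured filters."""
    # Assumption: the host is taken from the final URL after redirects.
    url = record.get("loadedUrl") or record["url"]
    host = (urlparse(url).hostname or "").lower()
    # Assumption: length and keyword checks run on the extracted body.
    text = record.get("text") or record.get("markdown") or ""

    # domainAllowlist: host must contain at least one substring.
    if allowlist and not any(s.lower() in host for s in allowlist):
        return False
    # domainBlocklist: host must contain none of the substrings.
    if any(s.lower() in host for s in blocklist):
        return False
    # minTextLength: drop pages with too little extracted text.
    if len(text) < min_text_length:
        return False
    # keywordsAnyOf: at least one keyword must appear in the text.
    if keywords_any_of:
        lowered = text.lower()
        if not any(k.lower() in lowered for k in keywords_any_of):
            return False
    return True


page = {
    "url": "https://www.youtube.com/watch?v=1",
    "markdown": "vector embeddings " * 100,
}
blocked = passes_filters(page, blocklist=["youtube.com"])
```

Since the domain filters are described as substring matches on the host, an entry such as `youtube.com` would also match a host like `notyoutube.com`; use more specific substrings if that matters.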

Use cases

RAG ingestion — pull fresh top-N Google results for a topic, hand markdown to your embedder
News briefings — daily query like "AI news today", filter by minTextLength to drop SEO thin pages
Competitive monitoring — domainAllowlist of competitor domains, scrape their blogs weekly
Reference enrichment — feed each citation URL from a paper into the actor for clean text extraction
LLM context — give an LLM the cleaned markdown of pages, not raw HTML, to save tokens

FAQ

Does it require a login or cookies? No. All fetches are anonymous.

Is a proxy needed? Apify proxy is enabled by default to avoid Google rate-limits and unblock some target sites. You can disable it with useApifyProxy: false.

What's the difference between raw-http and browser-playwright? Raw HTTP uses curl_cffi with chrome131 TLS impersonation — fast and works on ~80% of sites. Browser mode runs headless Chromium, waits for JS, and dismisses cookie banners — slower but handles SPAs.

Why is description missing on some pages? Some pages don't expose a <meta name="description"> or og:description. The omit-empty contract drops missing fields rather than emit nulls.

Why does markdown look stripped down? We intentionally output simple markdown (headings, lists, links, emphasis, code) — RAG embedders strip most formatting anyway, and simpler markdown reduces token bloat.

What if all my filters reject every result? The actor finishes cleanly with a status message instead of pushing placeholder rows.

How do I use this with my LLM/RAG pipeline? Trigger this actor from your indexing job, read the dataset (each record has a url and, when requested, markdown), embed the markdown, and store the vectors in your vector DB.
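
The step between reading the dataset and embedding can be sketched as follows. This is a minimal Python illustration that assumes the dataset items are already loaded as a list of dicts; `chunk_markdown` and `records_to_chunks` are hypothetical helpers, and the paragraph-based chunking is only one simple strategy. Fetching the dataset and calling an embedding model are left out.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay at or under max_chars, except that a single paragraph
    longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def records_to_chunks(records, max_chars: int = 1000) -> list[dict]:
    """Turn page records into embed-ready chunks that keep the source URL."""
    out = []
    for record in records:
        markdown = record.get("markdown")  # omitted unless requested or if empty
        if not markdown:
            continue
        for i, chunk in enumerate(chunk_markdown(markdown, max_chars)):
            out.append({"url": record["url"], "chunk_index": i, "text": chunk})
    return out


# Hypothetical dataset items; the second has no markdown and is skipped.
records = [
    {"url": "https://example.com/a",
     "markdown": "# T\n\n" + "a" * 30 + "\n\n" + "b" * 30},
    {"url": "https://example.com/b"},
]
chunks = records_to_chunks(records, max_chars=40)
```

Keeping `url` and `chunk_index` alongside each chunk lets the vector store return a citation for every retrieved passage.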
