RAG Web Browser

Pricing

from $1.00 / 1,000 results

Search the web or fetch direct URLs and return clean markdown for LLM/RAG pipelines. Filters: domainAllowlist, domainBlocklist, minTextLength, keywordsAnyOf. No login, no cookies.

Rating

5.0

(13)

Developer

Crawler Bros

Maintained by Community

Actor stats

  • Bookmarked: 6
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

Search the web (or fetch direct URLs) and return clean markdown ready for LLM/RAG pipelines. HTTP-first with chrome131 TLS impersonation; Playwright fallback when needed. Pro filters narrow the result set to exactly what your retrieval index needs.

What this actor does

  • Sends a query to Google → extracts top organic results → fetches each → cleans HTML → emits structured records
  • Or fetches direct URLs (startUrls) and skips Google entirely
  • Strips boilerplate (nav, footer, ads, scripts) before extracting the main article
  • Outputs markdown (default), plain text, and/or raw html per page
  • Returns one clean record per page with title, description, language, word count, reading time
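
The boilerplate-stripping step can be illustrated with a minimal sketch that uses only Python's standard library. It is not the actor's implementation: the `SKIP_TAGS` set and the `extract_main_text` helper are assumptions for illustration, and the real readable-text transformer presumably does more (main-article detection, markdown conversion).

```python
from html.parser import HTMLParser

# Assumed tag set for illustration; the actor's real default selector list
# is only partially shown in its documentation.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class BoilerplateStripper(HTMLParser):
    """Collect text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    """Return the page text with nav/footer/script-style boilerplate removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Tracking a depth counter rather than a boolean keeps the sketch correct when skipped elements are nested, for example an aside inside a footer.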

Output per page

  • url, loadedUrl, domain
  • title, description, languageCode
  • text (plain text) — when requested
  • markdown (LLM-ready) — when requested
  • html (raw cleaned HTML) — when requested
  • wordCount, readingTimeMinutes (220 wpm)
  • httpStatusCode, loadedTime (seconds)
  • searchRank (1-based — when the URL came from Google search)
  • recordType: "page", scrapedAt

Empty fields are omitted from the output (no nulls).
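
As a rough illustration of the record shape and the omit-empty rule, the sketch below builds one record in plain Python. The `build_record` helper is hypothetical, word counting by whitespace is an assumption, and so is rounding the reading time up to a whole minute. Only the 220 wpm rate and the field names come from the list above.

```python
import math
from datetime import datetime, timezone

WORDS_PER_MINUTE = 220  # reading rate stated in the field list


def build_record(url, title=None, description=None, text=None, markdown=None):
    """Assemble a page record and drop empty fields (hypothetical helper)."""
    body = text or markdown or ""
    word_count = len(body.split())  # assumption: whitespace-separated words
    record = {
        "url": url,
        "title": title,
        "description": description,
        "text": text,
        "markdown": markdown,
        "wordCount": word_count,
        # Assumption: round up, so a short page still reports 1 minute.
        "readingTimeMinutes": math.ceil(word_count / WORDS_PER_MINUTE),
        "recordType": "page",
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }
    # Omit-empty contract: missing values are dropped, never emitted as null.
    return {key: value for key, value in record.items() if value not in (None, "")}
```

Consumers of the dataset should therefore read optional fields with `record.get("description")` rather than `record["description"]`.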

Input

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | "what is retrieval augmented generation" | Search query OR a single URL |
| startUrls | array | [] | Direct URLs to fetch (skips Google search) |
| maxResults | int | 3 | Number of top organic Google results to fetch (1–100) |
| outputFormats | array | ["markdown"] | Any combination of markdown, text, html |
| requestTimeoutSecs | int | 40 | Per-page HTTP timeout (1–300s) |
| scrapingTool | enum | raw-http | raw-http (curl_cffi) or browser-playwright |
| removeElementsCssSelector | string | nav/footer/aside/script/... | CSS selector(s) to strip before extraction |
| htmlTransformer | enum | readable-text | readable-text (main article) or none |
| desiredConcurrency | int | 5 | Parallel fetches (0 = auto) |
| maxRequestRetries | int | 2 | Retries on transient HTTP failures |
| dynamicContentWaitSecs | int | 5 | Wait time for JS content (browser mode only) |
| removeCookieWarnings | bool | true | Dismiss cookie/consent dialogs (browser mode) |
| useApifyProxy | bool | true | Route requests through Apify proxy |
| domainAllowlist | array | [] | Only emit pages whose host contains one of these substrings |
| domainBlocklist | array | [] | Drop pages whose host contains one of these substrings |
| minTextLength | int | | Drop pages with fewer than N characters of extracted text |
| excludeContentSelectors | array | [] | Additional CSS selectors to strip |
| keywordsAnyOf | array | [] | Only emit pages containing at least one of these keywords |
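
A small pre-flight check can catch out-of-range values before a run is started. The sketch below is a hypothetical local helper, not part of the actor. It encodes only the ranges and enum values listed in the table, and it assumes missing fields fall back to the documented defaults.

```python
ALLOWED_FORMATS = {"markdown", "text", "html"}
ALLOWED_TOOLS = {"raw-http", "browser-playwright"}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    max_results = run_input.get("maxResults", 3)
    if not 1 <= max_results <= 100:
        errors.append("maxResults must be between 1 and 100")

    timeout = run_input.get("requestTimeoutSecs", 40)
    if not 1 <= timeout <= 300:
        errors.append("requestTimeoutSecs must be between 1 and 300")

    formats = run_input.get("outputFormats", ["markdown"])
    unknown = set(formats) - ALLOWED_FORMATS
    if not formats or unknown:
        errors.append("outputFormats must be a non-empty subset of markdown, text, html")

    tool = run_input.get("scrapingTool", "raw-http")
    if tool not in ALLOWED_TOOLS:
        errors.append("scrapingTool must be raw-http or browser-playwright")

    return errors
```

Returning all problems at once, instead of raising on the first, makes it easier to fix a misconfigured scheduled job in a single pass.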

Example: search query

{
  "query": "best vector database for RAG 2024",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "minTextLength": 500,
  "domainBlocklist": ["pinterest.com", "youtube.com"]
}

Example: direct URLs

{
  "startUrls": [
    "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    "https://docs.langchain.com/docs/use-cases/qa/"
  ],
  "outputFormats": ["markdown", "text"],
  "htmlTransformer": "readable-text"
}

Example: filter for relevance

{
  "query": "vector embeddings tutorial",
  "maxResults": 10,
  "keywordsAnyOf": ["embedding", "vector", "similarity"],
  "minTextLength": 1000,
  "outputFormats": ["markdown"]
}
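
To make the filter semantics concrete, here is a minimal sketch of how the four filters could be applied to a record. The `passes_filters` function is hypothetical. Substring matching on the host follows the input descriptions, while case-insensitive comparison and applying the filters to the text or markdown field are assumptions.

```python
from urllib.parse import urlparse


def passes_filters(record, domain_allowlist=(), domain_blocklist=(),
                   min_text_length=0, keywords_any_of=()):
    """Return True if the record survives every configured filter."""
    host = (urlparse(record["url"]).hostname or "").lower()
    body = record.get("text") or record.get("markdown") or ""

    # Allowlist: when set, the host must contain at least one substring.
    if domain_allowlist and not any(s.lower() in host for s in domain_allowlist):
        return False
    # Blocklist: any matching substring drops the page.
    if any(s.lower() in host for s in domain_blocklist):
        return False
    # Minimum length is measured in characters of extracted text.
    if len(body) < min_text_length:
        return False
    # Keywords: when set, at least one must appear (assumed case-insensitive).
    if keywords_any_of and not any(k.lower() in body.lower() for k in keywords_any_of):
        return False
    return True
```

Because matching is by substring, an entry such as "example.com" also matches "blog.example.com".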

Use cases

  • RAG ingestion: pull fresh top-N Google results for a topic and hand the markdown to your embedder
  • News briefings: run a daily query such as "AI news today" and set minTextLength to drop thin SEO pages
  • Competitive monitoring: set domainAllowlist to competitor domains and scrape their blogs weekly
  • Reference enrichment: feed each citation URL from a paper into the actor for clean text extraction
  • LLM context: give an LLM the cleaned markdown of pages instead of raw HTML to save tokens

FAQ

Does it require a login or cookies? No. All fetches are anonymous.

Is a proxy needed? Not strictly. Apify proxy is enabled by default to avoid Google rate limits and to unblock some target sites. You can disable it with useApifyProxy: false.

What's the difference between raw-http and browser-playwright? Raw HTTP uses curl_cffi with chrome131 TLS impersonation — fast and works on ~80% of sites. Browser mode runs headless Chromium, waits for JS, and dismisses cookie banners — slower but handles SPAs.
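
The HTTP-first strategy with a browser fallback can be sketched as below. The actor's actual fallback criteria are not documented here, so the rule used (non-200 status or a very short body) is an assumption, and `raw_fetch` and `browser_fetch` are hypothetical callables standing in for the curl_cffi and Playwright fetchers.

```python
from typing import Callable, NamedTuple, Tuple

Fetcher = Callable[[str], Tuple[int, str]]  # url -> (status code, html)


class FetchResult(NamedTuple):
    status: int
    html: str
    tool: str


def fetch_with_fallback(url: str, raw_fetch: Fetcher, browser_fetch: Fetcher,
                        min_html_chars: int = 500) -> FetchResult:
    """Try the cheap HTTP fetch first; use the browser only if it looks bad."""
    status, html = raw_fetch(url)
    # Assumed heuristic: a 200 with a reasonably sized body is good enough.
    if status == 200 and len(html) >= min_html_chars:
        return FetchResult(status, html, "raw-http")
    # Otherwise pay the cost of a full browser render (JS, cookie banners).
    status, html = browser_fetch(url)
    return FetchResult(status, html, "browser-playwright")
```

Passing the fetchers in as arguments keeps the decision logic testable without any network access.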

Why is description missing on some pages? Some pages don't expose a <meta name="description"> or og:description. The omit-empty contract drops missing fields rather than emit nulls.
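
A minimal sketch of that lookup, using Python's standard HTML parser, is shown below. Preferring the meta description over og:description is an assumption, and `DescriptionFinder` and `with_description` are hypothetical names.

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Remember the first meta description and og:description on a page."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.og_description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = (attributes.get("content") or "").strip()
        if not content:
            return
        if (attributes.get("name") or "").lower() == "description":
            self.meta_description = self.meta_description or content
        elif (attributes.get("property") or "").lower() == "og:description":
            self.og_description = self.og_description or content


def with_description(record: dict, html: str) -> dict:
    """Return a copy of record, adding description only when one exists."""
    finder = DescriptionFinder()
    finder.feed(html)
    description = finder.meta_description or finder.og_description
    enriched = dict(record)
    if description:  # omit-empty: no key at all when nothing was found
        enriched["description"] = description
    return enriched
```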

Why does markdown look stripped down? We intentionally output simple markdown (headings, lists, links, emphasis, code) — RAG embedders strip most formatting anyway, and simpler markdown reduces token bloat.

What if all my filters reject every result? The actor finishes cleanly with a status message instead of pushing placeholder rows.

How do I use this with my LLM/RAG pipeline? Trigger this actor from your indexing job, read the dataset (each record has url + markdown), embed the markdown, store in your vector DB.
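
The embedding side of that flow might look like the sketch below, which turns dataset records into chunked documents ready for an embedder. It assumes the records have already been read from the run's dataset as dictionaries. `chunk_markdown` and `records_to_documents` are hypothetical helpers, and the paragraph-packing strategy and the 1,000-character limit are illustrative choices, not something the actor prescribes.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks, current = [], ""
    for paragraph in markdown.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def records_to_documents(records, max_chars: int = 1000) -> list[dict]:
    """Turn actor records into chunk documents keyed by url and position."""
    documents = []
    for record in records:
        markdown = record.get("markdown")  # may be absent (omit-empty)
        if not markdown:
            continue
        for index, chunk in enumerate(chunk_markdown(markdown, max_chars)):
            documents.append({
                "id": f"{record['url']}#{index}",
                "url": record["url"],
                "text": chunk,
            })
    return documents
```

Each resulting document can then be passed to the embedding model of your choice and stored in the vector DB with its url as metadata for citation.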