RAG Web Browser
Web browser for OpenAI Assistants API and RAG pipelines, similar to a web browser in ChatGPT. It queries Google Search, scrapes the top N pages from the results, and returns their cleaned content as Markdown for further processing by an LLM.
Search term or URL
query
string, required
Enter Google Search keywords or the URL of a specific web page. The keywords may include advanced search operators. Examples:
san francisco weather
https://www.cnn.com
function calling site:openai.com
Maximum results
maxResults
integer, optional
The maximum number of top organic Google Search results whose web pages will be extracted. If query is a URL, then this field is ignored and the Actor only fetches the specific web page.
Default value of this property is 3
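The interaction between query and maxResults can be sketched as follows. This is only an illustration: the page does not say how the Actor decides that a query is a URL, so the http/https check below is an assumption, and the helper names is_url and pages_to_fetch are hypothetical.

```python
from urllib.parse import urlparse


def is_url(query: str) -> bool:
    """Assumed rule: a query counts as a URL if it has an http(s) scheme and a host."""
    parsed = urlparse(query.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def pages_to_fetch(query: str, max_results: int = 3) -> int:
    """Number of pages that would be extracted for a given query.

    A URL query fetches only that page, so maxResults is ignored;
    a keyword query extracts up to maxResults search results.
    """
    return 1 if is_url(query) else max_results


if __name__ == "__main__":
    for q in ("san francisco weather", "https://www.cnn.com"):
        print(q, "->", pages_to_fetch(q))
```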
Output formats
outputFormats
array, optional
Select one or more formats to which the target web pages will be extracted and saved in the resulting dataset.
Default value of this property is ["markdown"]
Request timeout
requestTimeoutSecs
integer, optional
The maximum time in seconds available for the request, including querying Google Search and scraping the target web pages. For example, OpenAI allows only 45 seconds for custom actions. If loading and extracting a target page exceeds this timeout, that page is skipped so that at least some results are returned within the timeout. If no page is extracted within the timeout, the whole request fails.
Default value of this property is 40
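The timeout rule above (slow pages are skipped, and the request fails only if nothing finishes in time) can be modelled with a small sketch. It is a simplified model, not the Actor's code: it assumes pages load in parallel against one shared budget, and the function name pages_within_timeout, the per-page durations, and the search_secs parameter are all hypothetical.

```python
def pages_within_timeout(page_secs, request_timeout_secs=40, search_secs=0.0):
    """Return the URLs whose extraction fits in the time left after the search step.

    page_secs maps URL -> hypothetical seconds needed to load and extract that page.
    Pages are assumed to load in parallel, so each is compared to the same budget.
    """
    budget = request_timeout_secs - search_secs
    kept = [url for url, secs in page_secs.items() if secs <= budget]
    if not kept:
        # Mirrors the documented behaviour: no extracted page means the request fails.
        raise TimeoutError("no page was extracted within the timeout")
    return kept


if __name__ == "__main__":
    durations = {"https://a.example": 5, "https://b.example": 50}
    # The 50-second page exceeds the 38 seconds left after a 2-second search.
    print(pages_within_timeout(durations, request_timeout_secs=40, search_secs=2))
```

With a caller that enforces its own limit, such as the 45 seconds mentioned above, keeping requestTimeoutSecs below that limit leaves some headroom for returning the response.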
SERP proxy group
serpProxyGroup
enum, optional
Enables overriding the default Apify Proxy group used for fetching Google Search results.
Value options:
"GOOGLE_SERP": string
"SHADER": string
Default value of this property is "GOOGLE_SERP"
SERP max retries
serpMaxRetries
integer, optional
The maximum number of times the Actor will retry fetching the Google Search results on error. If the last attempt fails, the entire request fails.
Default value of this property is 2
Proxy configuration
proxyConfiguration
object, optional
Apify Proxy configuration used for scraping the target web pages.
Default value of this property is {"useApifyProxy":true}
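As a rough illustration, the search and proxy fields described so far can be assembled into one input object. The defaults and the two serpProxyGroup values are taken from this page; the helper build_run_input and its validation rules are hypothetical and only sketch what a caller might check before starting a run.

```python
import copy
import json

# Defaults as documented on this page.
DEFAULT_INPUT = {
    "maxResults": 3,
    "outputFormats": ["markdown"],
    "requestTimeoutSecs": 40,
    "serpProxyGroup": "GOOGLE_SERP",
    "serpMaxRetries": 2,
    "proxyConfiguration": {"useApifyProxy": True},
}
SERP_PROXY_GROUPS = {"GOOGLE_SERP", "SHADER"}


def build_run_input(query, **overrides):
    """Merge caller overrides into the documented defaults (hypothetical helper)."""
    if not query or not query.strip():
        raise ValueError("query is required")
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    run_input = copy.deepcopy(DEFAULT_INPUT)
    run_input.update(overrides)
    if run_input["serpProxyGroup"] not in SERP_PROXY_GROUPS:
        raise ValueError("serpProxyGroup must be GOOGLE_SERP or SHADER")
    run_input["query"] = query.strip()
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_run_input("san francisco weather", maxResults=1), indent=2))
```

The resulting dictionary can be serialized to JSON and supplied as the Actor's input.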
Remove HTML elements (CSS selector)
removeElementsCssSelector
string, optional
A CSS selector matching HTML elements that will be removed from the DOM before converting it to text or Markdown, or saving it as HTML. This is useful for skipping irrelevant page content. The value must be a valid CSS selector as accepted by the document.querySelectorAll() function.
By default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline images. You can disable the removal by setting this value to some non-existent CSS selector like dummy_keep_everything.
Default value of this property is "nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]"
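A custom value replaces the default selector rather than extending it, so a caller who wants to keep the default removals and add more has to repeat them. The sketch below shows one way to build such a value; the helper removal_selector and the example selector .cookie-banner are hypothetical. The selectors are joined with ", " instead of the line breaks in the default shown above, which is equivalent in CSS.

```python
# The default selector list from this page, split into its individual selectors.
DEFAULT_SELECTORS = [
    "nav", "footer", "script", "style", "noscript", "svg",
    "img[src^='data:']",
    '[role="alert"]', '[role="banner"]', '[role="dialog"]', '[role="alertdialog"]',
    '[role="region"][aria-label*="skip" i]',
    '[aria-modal="true"]',
]


def removal_selector(extra=(), keep_everything=False):
    """Build a removeElementsCssSelector value (hypothetical helper).

    keep_everything=True returns the non-matching selector suggested above,
    which effectively disables element removal.
    """
    if keep_everything:
        return "dummy_keep_everything"
    return ", ".join([*DEFAULT_SELECTORS, *extra])


if __name__ == "__main__":
    print(removal_selector(extra=[".cookie-banner"]))
```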
HTML transformer
htmlTransformer
string, optional
Specify how to transform the HTML to extract meaningful content without any extra fluff, like navigation or modals. The HTML transformation happens after removing and clicking the DOM elements.
- None (default) - Only removes the HTML elements specified via 'Remove HTML elements' option.
- Readable text - Extracts the main contents of the webpage, without navigation and other fluff.
Default value of this property is "none"
Initial browsing concurrency
initialConcurrency
integer, optional
The initial number of web browsers running in parallel. The system automatically scales the number based on the CPU and memory usage, in the range specified by minConcurrency and maxConcurrency. If the initial value is 0, the Actor picks the number automatically based on the available memory.
Default value of this property is 4
Minimum browsing concurrency
minConcurrency
integer, optional
The minimum number of web browsers running in parallel.
Default value of this property is 1
Maximum browsing concurrency
maxConcurrency
integer, optional
The maximum number of web browsers running in parallel.
Default value of this property is 50
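The three concurrency settings are related, and a quick sanity check before starting a run can catch inconsistent values. The sketch below is an assumption-heavy illustration: this page only says that scaling happens between minConcurrency and maxConcurrency and that 0 means automatic, so requiring a non-zero initialConcurrency to fall inside that range is an assumed rule, and concurrency_problems is a hypothetical helper.

```python
def concurrency_problems(initial=4, minimum=1, maximum=50):
    """Return a list of human-readable problems; an empty list means no issue found.

    Defaults match the documented defaults (4, 1, 50).
    """
    problems = []
    if maximum < minimum:
        problems.append("maxConcurrency must not be lower than minConcurrency")
    # 0 means "let the Actor pick based on available memory", so it is always allowed.
    # Assumption: any other initial value should sit within the min/max range.
    if initial != 0 and not (minimum <= initial <= maximum):
        problems.append("initialConcurrency should be 0 or within the min/max range")
    return problems


if __name__ == "__main__":
    print(concurrency_problems())            # documented defaults
    print(concurrency_problems(initial=60))  # above the default maximum of 50
```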
Target page max retries
maxRequestRetries
integer, optional
The maximum number of times the Actor will retry loading the target web page on error. If the last attempt fails, the page will be skipped in the results.
Default value of this property is 1
Target page dynamic content timeout
dynamicContentWaitSecs
integer, optional
The maximum time in seconds to wait for dynamic page content to load. The Actor considers the web page fully loaded once this time elapses or the network becomes idle, whichever comes first.
Default value of this property is 10
Actor Metrics
133 monthly users
42 stars
96% runs succeeded
Created in Sep 2024
Modified a day ago