RAG Web Browser

apify/rag-web-browser
Web browser for retrieval-augmented generation (RAG) workflows. Retrieves and returns the content of websites listed in the top Google Search results.

Search term(s)

query (string, optional)

Use regular search words or enter Google Search URLs. You can also apply advanced Google search techniques, such as "AI site:twitter.com" or "javascript OR python".

Number of top search results to return from Google. Only organic results are returned and counted

maxResults (integer, optional)

The number of top organic search results to return and scrape text from

Output formats

outputFormats (array, optional)

Select the desired output formats for the retrieved content

Default value of this property is ["text"]

Request timeout in seconds

requestTimeoutSecs (integer, optional)

The maximum time (in seconds) allowed for the request. If the request exceeds this time, it is marked as failed and only the results that have already finished are returned.

Default value of this property is 60

Search Proxy Group

proxyGroupSearch (enum, optional)

Select the proxy group for loading search results

Value options:

"GOOGLE_SERP": string
"SHADER": string

Default value of this property is "GOOGLE_SERP"

Maximum number of retries for Google search request on network / server errors

maxRequestRetriesSearch (integer, optional)

The maximum number of times the Google search crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.

Default value of this property is 1
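
The retry rule above (n retries means at most n + 1 attempts) can be illustrated with a minimal sketch. This is not the Actor's actual crawler code; the helper `fetch_with_retries` and the choice of `ConnectionError` and `TimeoutError` as the retryable errors are assumptions made for illustration only.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], max_request_retries: int = 1) -> T:
    """Call `fetch` once, then retry up to `max_request_retries` more times."""
    last_error: Optional[Exception] = None
    # One initial attempt plus n retries gives n + 1 attempts in total.
    for _attempt in range(max_request_retries + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as error:
            # Hypothetical stand-ins for network, proxy, or server errors.
            last_error = error
    # The (n+1)-th attempt also failed, so the request is marked as failed.
    raise RuntimeError(
        f"request failed after {max_request_retries + 1} attempts"
    ) from last_error
```

With the default value of 1, a request that fails once and then succeeds still returns a result, while a request that fails twice is reported as failed.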

Crawler: Proxy configuration

proxyConfiguration (object, optional)

Enables loading websites from IP addresses in specific geographies and helps circumvent blocking.

Default value of this property is {"useApifyProxy":true}

Initial concurrency

initialConcurrency (integer, optional)

Initial number of Playwright browsers running in parallel. The system scales this value based on CPU and memory usage.

Default value of this property is 3

Minimal concurrency

minConcurrency (integer, optional)

Minimum number of Playwright browsers running in parallel. Useful for defining a base level of parallelism.

Default value of this property is 10

Maximal concurrency

maxConcurrency (integer, optional)

Maximum number of browsers or clients running in parallel to avoid overloading target websites.

Default value of this property is 10

Maximum number of retries for Playwright content crawler

maxRequestRetries (integer, optional)

Maximum number of retry attempts on network, proxy, or server errors. If the (n+1)-th request fails, it will be marked as failed.

Default value of this property is 1

Request timeout for content crawling

requestTimeoutContentCrawlSecs (integer, optional)

Timeout (in seconds) for making requests for each search result, including fetching and processing its content.

The value must be smaller than the 'Request timeout in seconds' setting.

Default value of this property is 30
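
The constraint between the two timeouts can be checked before a run is started. The following is a minimal sketch, assuming the input is held in a plain dictionary keyed by the property names above; `validate_timeouts` is a hypothetical helper, not part of the Actor or of any Apify library.

```python
def validate_timeouts(run_input: dict) -> None:
    """Raise ValueError if the content-crawl timeout is not below the overall timeout."""
    # Fall back to the documented defaults when a property is omitted.
    total_secs = run_input.get("requestTimeoutSecs", 60)
    content_secs = run_input.get("requestTimeoutContentCrawlSecs", 30)
    if content_secs >= total_secs:
        raise ValueError(
            "requestTimeoutContentCrawlSecs "
            f"({content_secs}) must be smaller than requestTimeoutSecs ({total_secs})"
        )
```

Calling `validate_timeouts({})` passes with the documented defaults (30 is smaller than 60), whereas setting both values to 60 raises an error.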

Wait for dynamic content (seconds)

dynamicContentWaitSecs (integer, optional)

Maximum time (in seconds) to wait for dynamic content to load. The crawler processes the page once this time elapses or when the network becomes idle.

Default value of this property is 10
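
The "whichever comes first" behaviour described above can be sketched with `asyncio`. This is only an illustration of the rule, not the Actor's Playwright code: the `network_idle` event is a hypothetical stand-in for whatever signal the real browser emits when the network becomes idle.

```python
import asyncio


async def wait_for_dynamic_content(
    network_idle: asyncio.Event, dynamic_content_wait_secs: float = 10
) -> str:
    """Return once the network is idle or the wait time elapses, whichever is first."""
    try:
        await asyncio.wait_for(network_idle.wait(), timeout=dynamic_content_wait_secs)
        return "network-idle"
    except asyncio.TimeoutError:
        return "timeout"


async def demo():
    idle_soon = asyncio.Event()   # becomes idle shortly after the wait starts
    never_idle = asyncio.Event()  # never becomes idle, so the timeout applies
    asyncio.get_running_loop().call_later(0.01, idle_soon.set)
    first = await wait_for_dynamic_content(idle_soon, 1.0)
    second = await wait_for_dynamic_content(never_idle, 0.05)
    return first, second
```

Running `asyncio.run(demo())` exercises both paths: the first page is processed as soon as the idle signal arrives, the second only after the wait time runs out.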

Remove cookie warnings

removeCookieWarnings (boolean, optional)

If enabled, removes cookie consent dialogs to improve text extraction accuracy. Note that this will impact latency.

Default value of this property is true

Debug mode (stores debugging information in dataset)

debugMode (boolean, optional)

If enabled, the Actor will store debugging information in the dataset's debug field

Default value of this property is false
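
Taken together, the properties above form a single input object. The sketch below assembles one from the documented default values and lets individual properties be overridden; `build_run_input` is a hypothetical helper written for this example, and the rejection of unknown keys is an assumption about what is convenient, not documented Actor behaviour.

```python
import copy
import json

# Defaults copied from the property descriptions above.
DEFAULT_INPUT = {
    "outputFormats": ["text"],
    "requestTimeoutSecs": 60,
    "proxyGroupSearch": "GOOGLE_SERP",
    "maxRequestRetriesSearch": 1,
    "proxyConfiguration": {"useApifyProxy": True},
    "initialConcurrency": 3,
    "minConcurrency": 10,
    "maxConcurrency": 10,
    "maxRequestRetries": 1,
    "requestTimeoutContentCrawlSecs": 30,
    "dynamicContentWaitSecs": 10,
    "removeCookieWarnings": True,
    "debugMode": False,
}

# "query" and "maxResults" have no documented default, so they are not pre-filled.
ALLOWED_KEYS = set(DEFAULT_INPUT) | {"query", "maxResults"}


def build_run_input(query: str, **overrides) -> dict:
    """Merge the documented defaults with a query and any overridden properties."""
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise KeyError(f"unknown input properties: {sorted(unknown)}")
    run_input = copy.deepcopy(DEFAULT_INPUT)
    run_input["query"] = query
    run_input.update(overrides)
    return run_input


run_input = build_run_input("javascript OR python", maxResults=3, debugMode=True)
print(json.dumps(run_input, indent=2))
```

A dictionary like this would typically be serialized to JSON and supplied as the Actor's input; check the Apify client or API documentation for the exact call used to start a run.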

Developer
Maintained by Apify
Actor metrics
  • 3 monthly users
  • 2 stars
  • 100.0% runs succeeded
  • Created in Sep 2024
  • Modified 3 days ago
Categories