
Web Scraper
- apify/web-scraper
- Users 42.1k
- Runs 143M
- Created by Apify
Crawls arbitrary websites using the Chrome browser and extracts data from pages using JavaScript code that you provide. The actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.
Optional
Enum
This property indicates the scraper's mode of operation. In DEVELOPMENT mode, the scraper ignores page timeouts, doesn't use sessionPool, opens pages one by one and enables debugging via Chrome DevTools. Open the live view tab or the container URL to access the debugger. Further debugging options can be configured in the Advanced configuration section. PRODUCTION mode disables debugging and enables timeouts and concurrency.
For details, see Run mode in README.
Required
array
A static list of URLs to scrape.
For details, see Start URLs in README.
Optional
boolean
Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited. Typically, URL fragments are used for page navigation only and should therefore be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.
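The effect of this option on deduplication can be illustrated with a minimal sketch. The helper name visitedKey and its keepFragment parameter are hypothetical and only model the idea described above; the scraper's actual implementation may differ.

```javascript
// Hypothetical helper: builds the key used to decide whether a URL
// has already been visited. Not part of the scraper's API.
function visitedKey(url, keepFragment) {
    const parsed = new URL(url);
    if (!keepFragment) {
        // Drop the "#fragment" part so that URLs differing only in
        // their fragment map to the same key.
        parsed.hash = '';
    }
    return parsed.href;
}

// With fragments ignored, both URLs count as the same page.
const sameWhenIgnored =
    visitedKey('http://example.com/#/products', false) ===
    visitedKey('http://example.com/#/about', false);

// With fragments kept, they count as two separate pages.
const sameWhenKept =
    visitedKey('http://example.com/#/products', true) ===
    visitedKey('http://example.com/#/about', true);

console.log(sameWhenIgnored, sameWhenKept);
```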
Optional
string
A CSS selector specifying which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Pseudo-URLs and/or Glob patterns setting.
If Link selector is empty, the page links are ignored.
For details, see Link selector in README.
Optional
array
Specifies which URLs found by Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*].
If Pseudo-URLs are omitted, the actor enqueues all links matched by the Link selector.
For details, see Pseudo-URLs in README.
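One way to picture pseudo-URL matching is as a conversion to an ordinary regular expression: text outside [] is treated literally, and text inside [] is used as a regular expression. The sketch below illustrates that idea only. The function names are hypothetical, and details such as case sensitivity in the actor's real matcher are not covered here.

```javascript
// Escapes characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical converter: literal parts are escaped, bracketed parts
// are inserted as regular expressions. Nested brackets such as
// [[a-z]+] are handled by counting the bracket depth.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '';
    let literal = '';
    let regex = '';
    let depth = 0;
    for (const ch of pseudoUrl) {
        if (depth === 0) {
            if (ch === '[') {
                pattern += escapeRegExp(literal);
                literal = '';
                depth = 1;
            } else {
                literal += ch;
            }
        } else {
            if (ch === '[') depth++;
            if (ch === ']') depth--;
            if (depth === 0) {
                pattern += `(${regex})`;
                regex = '';
            } else {
                regex += ch;
            }
        }
    }
    pattern += escapeRegExp(literal);
    return new RegExp(`^${pattern}$`);
}

const matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
console.log(matcher.test('http://www.example.com/products/1'));
console.log(matcher.test('http://www.other.com/products/1'));
```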
Required
string
JavaScript (ES6) function that is executed in the context of every page loaded in the Chrome browser. Use it to scrape data from the page, perform actions or add new URLs to the request queue.
For details, see Page function in README.
Optional
boolean
If enabled, the scraper will inject the jQuery library into every web page loaded, before Page function is invoked. Note that the jQuery object ($) will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery in Page function.
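A Page function that uses the injected jQuery might look like the sketch below. It assumes that jQuery injection is enabled and that the context object also exposes the current request as context.request, which is not stated above; the selectors and the returned fields are illustrative only.

```javascript
// A minimal Page function sketch. It runs in the context of each
// loaded page and returns the data to be stored for that page.
async function pageFunction(context) {
    // jQuery is not available as a global "$"; take it from the context.
    const $ = context.jQuery;

    return {
        // Assumption: context.request.url holds the URL of the loaded page.
        url: context.request.url,
        title: $('title').text(),
        heading: $('h1').first().text(),
    };
}
```

Assigning context.jQuery to a local $ keeps the familiar jQuery syntax inside the function without touching the page's global namespace.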
Required
object
Specifies proxy servers that will be used by the scraper in order to hide its origin.
For details, see Proxy configuration in README.
Optional
Enum
This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.
Optional
string
Use only English alphanumeric characters, dashes and underscores. A session is a representation of a user. It has its own IP and cookies, which are used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying proxies at random, a list of working sessions will be saved and a new actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.
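The naming rule can be checked before starting a run. The following is a small sketch with a hypothetical helper name; it only encodes the character restriction stated above and assumes nothing about length limits.

```javascript
// Hypothetical validator: accepts only English letters, digits,
// dashes and underscores, and rejects the empty string.
function isValidSessionPoolName(name) {
    return /^[a-zA-Z0-9_-]+$/.test(name);
}

console.log(isValidSessionPoolName('my_pool-1'));
console.log(isValidSessionPoolName('my pool'));
```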
Optional
array
A JSON array with cookies that will be set in every Chrome browser tab opened before loading the page, in the format accepted by Puppeteer's Page.setCookie() function. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this help article.
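As a rough illustration, a cookie array in the shape Puppeteer's Page.setCookie() accepts could look like this. The cookie name, value and domain are placeholders, not values from any real site.

```javascript
// Example cookie list. Each entry needs at least a name and a value;
// domain, path, httpOnly and secure are optional Puppeteer cookie fields.
const initialCookies = [
    {
        name: 'sessionid',          // placeholder cookie name
        value: 'PLACEHOLDER_VALUE', // placeholder cookie value
        domain: '.example.com',
        path: '/',
        httpOnly: true,
        secure: true,
    },
];

// The input expects plain JSON, so the array must survive serialization.
console.log(JSON.stringify(initialCookies, null, 2));
```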
Optional
integer
The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.
If set to 0, there is no limit.
Optional
integer
Specifies how many links away from Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using context.enqueuePage() in Page function are not subject to the maximum depth constraint.
If set to 0, there is no limit.
Optional
integer
Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.
Optional
integer
The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load within this time, it is considered to have failed and will be retried (subject to Max page retries), as with other page load errors.
Optional
array
Contains a JSON array with names of page events to wait for before considering a web page fully loaded. The scraper will wait until all of the events are triggered in the web page before executing Page function. Available events are domcontentloaded, load, networkidle2 and networkidle0.
For details, see the waitUntil option in Puppeteer's Page.goto() function documentation.
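To show how a page load timeout and a list of events fit together at the Puppeteer level, here is a sketch. The helper loadPage, its option names and its default values are assumptions made for illustration; only the timeout and waitUntil options of Page.goto() come from Puppeteer itself.

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page object.
// The defaults below are illustrative, not the actor's defaults.
async function loadPage(page, url, options = {}) {
    const {
        pageLoadTimeoutSecs = 60,
        waitUntil = ['networkidle2'],
    } = options;

    return page.goto(url, {
        // Puppeteer expects the timeout in milliseconds.
        timeout: pageLoadTimeoutSecs * 1000,
        // When several events are listed, navigation is considered
        // finished only after all of them have fired.
        waitUntil,
    });
}
```

Listing a stricter event such as networkidle0 generally makes each page take longer to load, so a stricter list may call for a longer timeout.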
Optional
Enum
This property has no effect if Run mode is set to PRODUCTION. When set to DEVELOPMENT, it inserts a breakpoint at the selected location in every page the scraper visits. Execution of code stops at the breakpoint until manually resumed in the DevTools window accessible via the Live View tab or Container URL. Additional breakpoints can be added by adding debugger; statements within your Page function.
See Run mode in README for details.
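A manual breakpoint inside a Page function could be placed as in the sketch below. It assumes the context exposes the current request as context.request, which is not stated above. The debugger; statement is standard JavaScript and pauses execution only when a debugger such as DevTools is attached; otherwise it does nothing.

```javascript
async function pageFunction(context) {
    // Assumption: context.request.url holds the URL of the loaded page.
    const { url } = context.request;

    // Execution pauses here when DevTools is attached, which lets you
    // inspect `context` and the page state before data is returned.
    debugger;

    return { url };
}
```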
Optional
boolean
If enabled, the actor log will include console messages produced by JavaScript executed by the web pages (e.g. using console.log()). Beware that this may result in the log being flooded by error messages, warnings and other messages of little value, especially with high concurrency.