
Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Apify Technologies
  • Used by 1,065 users
  • Used 11,013,832 times

Start URLs

startUrls

Required

array

A static list of URLs to scrape. To be able to add new URLs on the fly, enable the Use request queue option. For details, see the Start URLs section in the README.
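
For illustration, a Start URLs list might look like the sketch below (written here as a JavaScript object literal; the same shape is used as JSON in the actor input). The url property is the standard field; the userData label on the second entry is an assumption, used only to show passing custom data along with a URL:

    const input = {
        startUrls: [
            { url: 'https://www.example.com/products' },
            // Hypothetical label carried along with the request.
            { url: 'https://www.example.com/blog', userData: { label: 'BLOG' } },
        ],
    };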

URL #fragments identify unique pages

keepUrlFragments

Optional

boolean

Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.

Pseudo-URLs

pseudoUrls

Optional

array

Specifies what kind of URLs found by the Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*]. This setting only applies if Use request queue is enabled. If Pseudo-URLs are omitted, the actor enqueues all links matched by the Link selector. For details, see Pseudo-URLs in the README.
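
As a sketch, assuming the commonly used shape where each entry is an object with a purl property, Pseudo-URLs could be supplied like this:

    const input = {
        pseudoUrls: [
            // [.*] is a regular expression matching any characters in that part of the URL.
            { purl: 'http://www.example.com/products/[.*]' },
            // Only /category/men/... and /category/women/... URLs match this pattern.
            { purl: 'http://www.example.com/category/[(men|women)]/[.*]' },
        ],
    };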

Link selector

linkSelector

Optional

string

A CSS selector specifying which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. This setting only applies if Use request queue is enabled. To filter the links added to the queue, use the Pseudo-URLs field. If the Link selector is empty, the page links are ignored. For details, see Link selector in the README.
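
For example, to follow only pagination links and product-detail links (hypothetical selectors for illustration):

    const input = {
        linkSelector: 'div.pagination a[href], a.product-detail[href]',
    };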

Page function

pageFunction

Required

string

A JavaScript function that is executed server-side in Node.js 12 for every loaded page. Use it to scrape data from the page, perform actions or add new URLs to the request queue. For details, see Page function in the README.
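
A minimal sketch of a Page function, assuming the context object exposes the Cheerio handle as $, the current request and a log object, and that the returned object is stored as one record in the dataset:

    async function pageFunction(context) {
        const { $, request, log } = context;

        // Extract data from the parsed HTML using Cheerio.
        const title = $('title').text().trim();
        const h1s = $('h1').map((i, el) => $(el).text().trim()).get();

        log.info(`Scraped ${request.url}`);

        // The returned object is saved to the dataset.
        return {
            url: request.url,
            title,
            h1s,
        };
    }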

Proxy configuration

proxyConfiguration

Optional

object

Specifies proxy servers that will be used by the scraper in order to hide its origin. For details, see Proxy configuration in the README.
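
A sketch of the two typical shapes of this object, assuming the useApifyProxy / proxyUrls fields commonly used by Apify scrapers:

    const input = {
        // Use Apify Proxy with the default settings...
        proxyConfiguration: { useApifyProxy: true },

        // ...or, alternatively, supply your own proxy URLs:
        // proxyConfiguration: { proxyUrls: ['http://user:password@my-proxy.example.com:8000'] },
    };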

Proxy rotation

proxyRotation

Optional

string

This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.

Options:

"RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE"

Session pool name

sessionPoolName

Optional

string

Use only English alphanumeric characters, dashes and underscores. A session is a representation of a user. It has its own IP address and cookies, which are used together to emulate a real user. Usage of sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple actor runs. This is very useful when you need specific cookies to access websites, or when many of your proxies are already blocked. Instead of retrying randomly, a list of working sessions is saved and a new actor run can reuse them. Note that the IP lock on a session expires after 24 hours, unless the session is used again within that window.

Initial cookies

initialCookies

Optional

array

A JSON array with cookies that will be sent with every HTTP request made by the Cheerio Scraper, in the format accepted by the tough-cookie NPM package. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this help article.
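
A sketch of the Initial cookies array, assuming the tough-cookie JSON shape where each cookie has key, value and domain properties (the values below are hypothetical):

    const input = {
        initialCookies: [
            // A session cookie copied from a logged-in browser session.
            { key: 'sessionid', value: 'abc123', domain: 'example.com', path: '/' },
        ],
    };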

Prepare request function

prepareRequestFunction

Optional

string

A JavaScript (Node.js 12) function that is executed before making a request to a given URL. Its sole argument is an object with two properties: { request, Apify }. It can be used for pre-processing, updating headers, or setting cookies. IMPORTANT: The return value of this function is ignored; modify the request instance in place instead.
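
A minimal sketch of a Prepare request function; per the description above, the request is modified in place and the return value is ignored (the header and cookie values are hypothetical):

    async function prepareRequestFunction({ request, Apify }) {
        // Add a custom header and a cookie to every outgoing request.
        request.headers = {
            ...request.headers,
            'X-My-Header': 'my-value',
            Cookie: 'sessionid=abc123',
        };
    }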

Additional MIME types

additionalMimeTypes

Optional

array

A JSON array specifying additional MIME content types of web pages to support. By default, Cheerio Scraper supports the text/html and application/xhtml+xml content types, and skips all other resources. For details, see Content types in the README.
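
For example, to also process JSON and XML responses:

    const input = {
        additionalMimeTypes: ['application/json', 'text/xml'],
    };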

Suggest response encoding

suggestResponseEncoding

Optional

string

The scraper automatically determines the response encoding from the response headers. If the headers are invalid or the encoding information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fallback encoding for cases where it cannot be determined.

Force response encoding

forceResponseEncoding

Optional

boolean

If enabled, the suggested response encoding will be used even if the target website provides a valid response encoding. Use this only when you have inspected the responses thoroughly and are sure that the website's declared encoding is wrong.

Ignore SSL errors

ignoreSslErrors

Optional

boolean

If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.

Max request retries

maxRequestRetries

Optional

integer

The maximum number of times the scraper will retry loading each web page, in case of a page load error or an exception thrown by the Page function. If set to 0, the page is considered failed right after the first error.

Max pages per run

maxPagesPerCrawl

Optional

integer

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value. If set to 0, there is no limit.

Max result records

maxResultsPerCrawl

Optional

integer

The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached. If set to 0, there is no limit.

Max crawling depth

maxCrawlingDepth

Optional

integer

Specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using context.enqueuePage() in the Page function are not subject to the maximum depth constraint. If set to 0, there is no limit.

Max concurrency

maxConcurrency

Optional

integer

Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.

Page load timeout

pageLoadTimeoutSecs

Optional

integer

The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load within this timeframe, it is considered to have failed and will be retried (subject to Max request retries), similarly to other page load errors.

Page function timeout

pageFunctionTimeoutSecs

Optional

integer

The maximum amount of time the scraper will wait for the Page function to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in the Page function does not get the scraper stuck.

Enable debug log

debugLog

Optional

boolean

If enabled, the actor log will include debug messages. Beware that this can be quite verbose. Use context.log.debug('message') to log your own debug messages from the Page function.

Custom data

customData

Optional

object

A custom JSON object that is passed to the Page function as context.customData. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.
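
A sketch of passing Custom data and reading it back inside the Page function (the label and maxPrice fields are hypothetical):

    // Input fragment supplied when starting the actor, e.g. via the API:
    const input = {
        customData: { label: 'nightly-run', maxPrice: 100 },
    };

    // Inside the Page function, the same object is available as context.customData:
    async function pageFunction(context) {
        const { customData, request } = context;
        return { url: request.url, runLabel: customData.label };
    }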

Dataset name

datasetName

Optional

string

Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.

Key-value store name

keyValueStoreName

Optional

string

Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.

Request queue name

requestQueueName

Optional

string

Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.