Vanilla JS Scraper
Scrape the web using familiar JavaScript methods! Crawls websites using raw HTTP requests, parses the HTML with the JSDOM package, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a non-jQuery alternative to Cheerio Scraper.
Requests
requests
arrayRequired
A static list of URLs to scrape.
For details, see the Start URLs section in the README.
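As an illustration, a static list of start URLs might look like this in the actor input (the { "url": ... } request shape is an assumption based on the usual Apify request format; see the Start URLs section in the README for the authoritative shape):

```json
{
    "requests": [
        { "url": "http://www.example.com/page-1" },
        { "url": "http://www.example.com/page-2" }
    ]
}
```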
Pseudo-URLs
pseudoUrls
arrayOptional
Specifies which URLs found by the Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*].
If Pseudo-URLs are omitted, the actor enqueues all links matched by the Link selector.
For details, see Pseudo-URLs in README.
Default value of this property is []
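For illustration, a pseudo-URL entry that matches every page under a domain might look like this (the { "purl": ... } shape is an assumption based on the usual Apify input format):

```json
{
    "pseudoUrls": [
        { "purl": "http://www.example.com/[.*]" }
    ]
}
```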
Link selector
linkSelector
stringOptional
A CSS selector stating which links on the page (<a> elements with the href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Pseudo-URLs field.
If the Link selector is empty, the page links are ignored.
For details, see the Link selector in README.
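For example, to follow only pagination links, a selector like the following could be used (the class name is hypothetical and depends on the target site's markup):

```json
{
    "linkSelector": "a.pagination__link[href]"
}
```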
Page function
pageFunction
stringRequired
A JavaScript function that is executed server-side in Node.js 12 for every page loaded. Use it to scrape data from the page, perform actions, or add new URLs to the request queue.
For details, see Page function in README.
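A minimal sketch of such a function, assuming the context object exposes the JSDOM document of the loaded page and the current request (the exact property names are assumptions; verify them against Page function in the README). Optional chaining is avoided because the function runs in Node.js 12:

```javascript
// Sketch of a page function; the shape of `context` (document, request)
// is an assumption based on this actor's README conventions.
async function pageFunction(context) {
    const { document, request } = context;

    // Standard DOM APIs work here because the page is parsed with JSDOM.
    const titleEl = document.querySelector('title');
    const title = titleEl ? titleEl.textContent.trim() : null;
    const headings = Array.from(document.querySelectorAll('h1'))
        .map((el) => el.textContent.trim());

    // The returned object is stored as one item in the dataset.
    return { url: request.url, title: title, headings: headings };
}
```

The object returned from the function becomes one dataset item, so keeping it a flat, serializable object is the simplest way to get clean results.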
Pre-navigation hooks
preNavigationHooks
stringOptional
Async functions that are evaluated sequentially before navigation. Useful for setting additional cookies or other request properties before navigation. The functions accept two parameters, crawlingContext and requestAsBrowserOptions, which are passed to the requestAsBrowser() function the crawler calls to navigate.
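A sketch of such a hook, assuming the two parameters described above; it adds a Cookie header to the options passed to requestAsBrowser() (the cookie value is hypothetical):

```javascript
// Sketch of a pre-navigation hook. The cookie value is a placeholder;
// the parameter shapes are assumptions based on the description above.
async function addCookieHook(crawlingContext, requestAsBrowserOptions) {
    // Merge the extra header without discarding headers set elsewhere.
    requestAsBrowserOptions.headers = Object.assign(
        {},
        requestAsBrowserOptions.headers,
        { Cookie: 'session=example-session-id' },
    );
}
```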
Post-navigation hooks
postNavigationHooks
stringOptional
Async functions that are evaluated sequentially after navigation. Useful for checking whether the navigation succeeded. The functions accept crawlingContext as the only parameter.
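For example, a hook that fails the request when the server responded with an error status could be sketched as follows (the response and statusCode property names are assumptions about the crawling context):

```javascript
// Sketch of a post-navigation hook; property names on crawlingContext
// are assumptions and should be checked against the README.
async function failOnErrorStatus(crawlingContext) {
    const { request, response } = crawlingContext;
    if (response && response.statusCode >= 400) {
        // Throwing makes the crawler treat this page as failed and retry it.
        throw new Error(
            'Navigation to ' + request.url + ' failed with status ' + response.statusCode
        );
    }
}
```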
Proxy configuration
proxy
objectOptional
Specifies proxy servers that will be used by the scraper in order to hide its origin.
For details, see Proxy configuration in README.
Default value of this property is {"useApifyProxy":false}
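For instance, to route requests through Apify Proxy instead of accepting the default, the object could be set as follows:

```json
{
    "useApifyProxy": true
}
```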
Debug log
debug
booleanOptional
Include debug messages in the log?
Default value of this property is false
Max concurrency
maxConcurrency
integerOptional
Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.
Default value of this property is 50
Max request retries
maxRequestRetries
integerOptional
The maximum number of times the scraper will retry loading each web page, in case of a page load error or an exception thrown by the Page function.
If set to 0, the page is considered failed right after the first error.
Default value of this property is 3
Page load timeout
pageLoadTimeoutSecs
integerOptional
The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered failed and will be retried (subject to Max request retries), similarly to other page load errors.
Default value of this property is 60
Page function timeout
pageFunctionTimeoutSecs
integerOptional
The maximum amount of time the scraper will wait for the Page function to execute, in seconds. It is always a good idea to set this limit to ensure that unexpected behavior in the page function does not get the scraper stuck.
Default value of this property is 60
Ignore SSL errors
ignoreSslErrors
booleanOptional
If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.
Default value of this property is false
Additional MIME types
additionalMimeTypes
arrayOptional
A JSON array specifying additional MIME content types of web pages to support. By default, the scraper supports the text/html and application/xhtml+xml content types, and skips all other resources. For details, see Content types in README.
Default value of this property is []
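For example, to also process JSON and XML responses, the array could be set as follows (the listed MIME types are illustrative, not a recommendation from the README):

```json
{
    "additionalMimeTypes": ["application/json", "text/xml"]
}
```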
Dataset name
datasetName
stringOptional
Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.
- Created in Mar 2022
- Modified about 1 year ago