Pricing

Pay per usage

JSDOM Scraper

Developed by

Apify

Parses the HTML using the JSDOM library, providing the same DOM API as browsers do (e.g. `window`). It is able to process client-side JavaScript without using a real browser. Performance-wise, it stands somewhere between the Cheerio Scraper and the browser scrapers.

4.3 (3)

Last modified

3 months ago

Developer tools

Open source

Start URLs

startUrls (array, required)

A static list of URLs to scrape.

For details, see the Start URLs section in the README.

URL #fragments identify unique pages

keepUrlFragments (boolean, optional)

Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.

Default value of this property is false
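The effect of this option can be illustrated with a minimal sketch. The `visitedKey` helper below is hypothetical and is not the scraper's actual deduplication code; it only shows how ignoring or keeping the fragment changes which URLs count as the same page.

```javascript
// Hypothetical helper, not the scraper's real implementation: derives the key
// used to decide whether a URL has already been visited.
function visitedKey(url, keepUrlFragments = false) {
    const parsed = new URL(url);
    if (!keepUrlFragments) {
        parsed.hash = ''; // drop the "#fragment" part
    }
    return parsed.href;
}

// With the default (false), both URLs collapse to the same key.
const sameByDefault = visitedKey('http://example.com#about')
    === visitedKey('http://example.com#contact');

// With keepUrlFragments enabled, they are treated as separate pages.
const sameWhenKept = visitedKey('http://example.com#about', true)
    === visitedKey('http://example.com#contact', true);

console.log({ sameByDefault, sameWhenKept });
```

In this sketch, a single-page website that routes on fragments would have every route beyond the first skipped as a duplicate unless the option is enabled.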

Respect the robots.txt file

respectRobotsTxtFile (boolean, optional)

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

Default value of this property is false

Glob Patterns

globs (array, optional)

Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.

Default value of this property is []

Pseudo-URLs

pseudoUrls (array, optional)

Specifies what kind of URLs found by the Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*].

If Pseudo-URLs are omitted, the Actor enqueues all links matched by the Link selector.

For details, see Pseudo-URLs in README.

Default value of this property is []
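To make the pseudo-URL notation concrete, here is a simplified sketch that turns a pseudo-URL into a regular expression. The `pseudoUrlToRegExp` and `escapeLiteral` functions are hypothetical names, and the conversion is an assumption about how such matching could work rather than the Actor's own code; it does not handle nested brackets.

```javascript
// Escapes characters that have a special meaning in regular expressions.
function escapeLiteral(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Simplified, hypothetical conversion: text outside [] is matched literally,
// text inside [] is used as a regular expression. Nested brackets such as
// [[a-z]+] are not supported by this sketch.
function pseudoUrlToRegExp(pseudoUrl) {
    let source = '^';
    let position = 0;
    while (position < pseudoUrl.length) {
        const open = pseudoUrl.indexOf('[', position);
        if (open === -1) {
            source += escapeLiteral(pseudoUrl.slice(position));
            break;
        }
        const close = pseudoUrl.indexOf(']', open);
        if (close === -1) {
            throw new Error(`Unclosed "[" in pseudo-URL: ${pseudoUrl}`);
        }
        source += escapeLiteral(pseudoUrl.slice(position, open));
        source += `(${pseudoUrl.slice(open + 1, close)})`;
        position = close + 1;
    }
    return new RegExp(`${source}$`);
}

const matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
console.log(matcher.test('http://www.example.com/pages/1')); // true
console.log(matcher.test('http://other.example.org/pages/1')); // false
```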

Exclude Glob Patterns

excludes (array, optional)

Glob patterns to match links in the page that you want to exclude from being enqueued.

Default value of this property is []

Link selector

linkSelector (string, optional)

A CSS selector stating which links on the page (<a> elements with href attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the Pseudo-URLs and/or Glob patterns field.

If the Link selector is empty, the page links are ignored.

For details, see the Link selector in README.
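The link-related options above work together, so a combined input may be easier to read than the individual fields. The following is a sketch only: the URLs are placeholders, and the object shapes (`{ url }` for Start URLs and `{ glob }` for glob entries) are assumptions about the input JSON that should be checked against the README.

```javascript
// Hypothetical input sketch; the example.com URLs are placeholders and the
// { url } / { glob } entry shapes are assumptions, not confirmed by this page.
const input = {
    // Pages the crawl starts from.
    startUrls: [{ url: 'https://www.example.com/' }],
    // Where to look for links on each page.
    linkSelector: 'a[href]',
    // Only links matching these globs are enqueued.
    globs: [{ glob: 'https://www.example.com/products/**' }],
    // Links matching these globs are never enqueued.
    excludes: [{ glob: 'https://www.example.com/products/**/*.pdf' }],
    // Left empty here because globs already do the filtering.
    pseudoUrls: [],
};

console.log(JSON.stringify(input, null, 2));
```

Leaving `linkSelector` empty in such an input would mean that no links are followed and only the Start URLs are scraped.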

Run scripts

runScripts (boolean, optional)

Whether to execute JavaScript in the downloaded page. If enabled, the JSDOM engine will process the JavaScript in the page as if it were loaded in a browser. This is useful for pages that use JavaScript to render their content, but it can also cause security issues.

Default value of this property is true

Show internal console logs

showInternalConsole (boolean, optional)

Whether to show internal JSDOM console logs in the log output. This is useful for debugging the page function and seeing what's happening in the JSDOM environment.

Default value of this property is false

Page function

pageFunction (string, required)

A JavaScript function that is executed server-side in Node.js 12 for every loaded page. Use it to scrape data from the page, perform actions, or add new URLs to the request queue.

For details, see Page function in README.
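As a rough illustration, a Page function might look like the sketch below. The `context.log` object is mentioned elsewhere on this page; the `context.window` and `context.request.url` properties are assumptions made for this example and should be verified in the README.

```javascript
// Sketch of a Page function. context.window and context.request.url are
// assumed property names; only context.log is mentioned on this page.
async function pageFunction(context) {
    const { window, request, log } = context;

    // Use the browser-like DOM API that JSDOM provides.
    const titleElement = window.document.querySelector('title');
    const title = titleElement ? titleElement.textContent : null;

    log.info(`Scraped ${request.url}`);

    // The returned object is what would be saved as one result record.
    return { url: request.url, title };
}
```

Because the input field is a string, the function source would be supplied as text rather than as a function reference.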

Proxy configuration

proxyConfiguration (object, required)

Specifies proxy servers that will be used by the scraper in order to hide its origin.

For details, see Proxy configuration in README.

Default value of this property is {"useApifyProxy":true}

Proxy rotation

proxyRotation (enum, optional)

This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.

Value options:

"RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE" (all strings)

Default value of this property is "RECOMMENDED"

Session pool name

sessionPoolName (string, optional)

Use only English alphanumeric characters, dashes, and underscores. A session is a representation of a user. It has its own IP and cookies, which are used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.
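The naming rule can be checked before a run is started. The sketch below is a hypothetical client-side check based only on the character rule stated above; the platform may apply further constraints, such as length limits, that are not covered here.

```javascript
// Hypothetical check derived from the stated rule: English letters, digits,
// dashes, and underscores only. Any additional platform limits are unknown.
const SESSION_POOL_NAME_PATTERN = /^[A-Za-z0-9_-]+$/;

function isValidSessionPoolName(name) {
    return typeof name === 'string' && SESSION_POOL_NAME_PATTERN.test(name);
}

console.log(isValidSessionPoolName('my_shop-sessions')); // true
console.log(isValidSessionPoolName('my shop sessions')); // false (contains spaces)
```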

Initial cookies

initialCookies (array, optional)

A JSON array with cookies that will be sent with every HTTP request made by the JSDOM Scraper, in the format accepted by the tough-cookie NPM package. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this help article.

Default value of this property is []

Additional MIME types

additionalMimeTypes (array, optional)

A JSON array specifying additional MIME content types of web pages to support. By default, JSDOM Scraper supports the text/html and application/xhtml+xml content types, and skips all other resources. For details, see Content types in README.

Default value of this property is []

Suggest response encoding

suggestResponseEncoding (string, optional)

The scraper automatically determines response encoding from the response headers. If the headers are invalid or information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fall-back encoding to the Scraper for cases where it could not be determined.

Force response encoding

forceResponseEncoding (boolean, optional)

If enabled, the suggested response encoding will be used even if a valid response encoding is provided by the target website. Use this only when you've inspected the responses thoroughly and are sure that they are the ones doing it wrong.

Default value of this property is false
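The interaction between the Suggest and Force response encoding options can be summarized as a small decision function. This is a sketch of the behavior described above, not the scraper's code; `resolveEncoding` and `headerEncoding` are hypothetical names.

```javascript
// Hypothetical decision logic for the two encoding options.
// headerEncoding stands for whatever the response headers declared (or null).
function resolveEncoding({ headerEncoding, suggestResponseEncoding, forceResponseEncoding = false }) {
    // Forcing makes the suggested encoding win even over a valid header.
    if (forceResponseEncoding && suggestResponseEncoding) {
        return suggestResponseEncoding;
    }
    // Otherwise the header wins, and the suggestion is only a fallback.
    return headerEncoding || suggestResponseEncoding || null;
}

console.log(resolveEncoding({ headerEncoding: 'utf-8', suggestResponseEncoding: 'windows-1250' }));
// 'utf-8'
console.log(resolveEncoding({ headerEncoding: null, suggestResponseEncoding: 'windows-1250' }));
// 'windows-1250'
console.log(resolveEncoding({
    headerEncoding: 'utf-8',
    suggestResponseEncoding: 'windows-1250',
    forceResponseEncoding: true,
}));
// 'windows-1250'
```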

Ignore SSL errors

ignoreSslErrors (boolean, optional)

If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.

Default value of this property is false

Pre-navigation hooks

preNavigationHooks (string, optional)

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, crawlingContext and requestAsBrowserOptions, which are passed to the requestAsBrowser() function the crawler calls to navigate.

Post-navigation hooks

postNavigationHooks (string, optional)

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts crawlingContext as the only parameter.
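The two hook options could be filled in along the lines of the sketch below. The parameter names come from the descriptions above; the `headers` option, the `request`, `response`, and `log` properties, and the `runHooks` helper are assumptions added for illustration.

```javascript
// Sketch of pre-navigation hooks: an array of async functions.
// The "headers" property on requestAsBrowserOptions is an assumption.
const preNavigationHooks = [
    async (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.headers = {
            ...requestAsBrowserOptions.headers,
            'accept-language': 'en-US',
        };
    },
];

// Sketch of post-navigation hooks. The request, response, and log
// properties of crawlingContext are assumptions.
const postNavigationHooks = [
    async (crawlingContext) => {
        const { request, response, log } = crawlingContext;
        if (response && response.statusCode >= 400) {
            log.warning(`Got status ${response.statusCode} for ${request.url}`);
        }
    },
];

// Hypothetical runner showing what "sequentially evaluated" means:
// each hook finishes before the next one starts.
async function runHooks(hooks, ...args) {
    for (const hook of hooks) {
        await hook(...args);
    }
}
```

Since both input fields are strings, the array source would be passed as text in the Actor input.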

Max request retries

maxRequestRetries (integer, optional)

The maximum number of times the scraper will retry loading a web page after a page load error or an exception thrown by the Page function.

If set to 0, the page will be considered failed right after the first error.

Default value of this property is 3
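The retry semantics can be sketched as a loop in which the first attempt is not counted as a retry. The `processWithRetries` function is a hypothetical illustration of that counting, not the crawler's implementation.

```javascript
// Hypothetical illustration of the retry counting: one initial attempt plus
// up to maxRequestRetries additional attempts.
async function processWithRetries(handler, maxRequestRetries = 3) {
    let lastError;
    for (let attempt = 0; attempt <= maxRequestRetries; attempt++) {
        try {
            return await handler(attempt);
        } catch (error) {
            lastError = error; // remember the failure and try again
        }
    }
    // All attempts failed: the page is considered failed.
    throw lastError;
}
```

Under this reading, the default of 3 allows up to four attempts in total, and a value of 0 allows exactly one.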

Max pages per run

maxPagesPerCrawl (integer, optional)

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.

If set to 0, there is no limit.

Default value of this property is 0

Max result records

maxResultsPerCrawl (integer, optional)

The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached.

If set to 0, there is no limit.

Default value of this property is 0

Max crawling depth

maxCrawlingDepth (integer, optional)

Specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using context.enqueuePage() in Page function are not subject to the maximum depth constraint.

If set to 0, there is no limit.

Default value of this property is 0

Max concurrency

maxConcurrency (integer, optional)

Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.

Default value of this property is 50

Page load timeout

pageLoadTimeoutSecs (integer, optional)

The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to Max request retries), as with other page load errors.

Default value of this property is 60

Page function timeout

pageFunctionTimeoutSecs (integer, optional)

The maximum amount of time the scraper will wait for the Page function to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in the Page function will not get the scraper stuck.

Default value of this property is 60

Enable debug log

debugLog (boolean, optional)

If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use context.log.debug('message') to log your own debug messages from the Page function.

Default value of this property is false

Custom data

customData (object, optional)

A custom JSON object that is passed to the Page function as context.customData. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.

Default value of this property is {}
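A short sketch of how custom data and debug logging might be combined follows. The `context.customData` and `context.log.debug()` names come from the descriptions above; the `category` key is a hypothetical parameter chosen for this example.

```javascript
// Input fragment: "category" is a made-up parameter for illustration.
const input = {
    debugLog: true,
    customData: { category: 'books' },
};

// Page function sketch that reads the custom data passed via the input.
async function pageFunction(context) {
    const { customData, log } = context;
    // Only visible in the log when "Enable debug log" is switched on.
    log.debug(`Running with custom data: ${JSON.stringify(customData)}`);
    return { category: customData.category };
}
```

This keeps run-specific parameters out of the Page function source, so the same function can be reused across API calls with different inputs.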

Dataset name

datasetName (string, optional)

Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.

Key-value store name

keyValueStoreName (string, optional)

Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.

Request queue name

requestQueueName (string, optional)

Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.
