What is Web Scraper and what can it do?

Web Scraper is a versatile tool for extracting structured data from web pages using JavaScript code. It loads web pages in a browser, renders dynamic content, and allows you to extract data that can be stored in various formats such as JSON, XML, or CSV.

How can I use Web Scraper?

You can use Web Scraper either manually through a user interface or programmatically using the API. To get started, you need to specify the web pages to load and provide JavaScript code, called the Page function, to extract data from the pages.

What are the costs associated with using Web Scraper?

The average usage cost for Web Scraper can be found on the pricing page under the Detailed pricing breakdown section. The cost estimates are based on averages and may vary depending on the complexity of the pages you scrape.

Are there any limitations to using Web Scraper?

Web Scraper is designed to be user-friendly and generic, which may affect its performance and flexibility compared to more specialized solutions. It uses a resource-intensive Chromium browser and supports client-side JavaScript code only.

Can I control the crawling behavior of Web Scraper?

Yes, you can control the crawling behavior of Web Scraper. You can specify start URLs, define link selectors, glob patterns, and pseudo-URLs to guide the scraper in following specific page links. This allows recursive crawling of websites or targeted extraction of data.
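As a rough sketch of how these settings might fit together, the input fragment below combines start URLs, a link selector, glob patterns, and a pseudo-URL. The field names come from the input reference on this page; the example.com URLs and the { url }, { glob }, and { purl } item shapes are assumptions for illustration and should be checked against the Actor README. The required pageFunction and proxyConfiguration fields are left out for brevity.

```javascript
// Hypothetical crawl configuration; every URL here is a placeholder.
const crawlInput = {
    startUrls: [{ url: 'https://www.example.com/' }],
    // Follow every <a> element that has an href attribute.
    linkSelector: 'a[href]',
    // Enqueue only links under /products/ (assumed { glob } item shape).
    globs: [{ glob: 'https://www.example.com/products/**' }],
    // A similar filter written as a pseudo-URL (assumed { purl } item shape).
    pseudoUrls: [{ purl: 'https://www.example.com/products/[.*]' }],
    // Never enqueue the login page.
    excludes: [{ glob: 'https://www.example.com/login' }],
    // Stop two links away from the start URLs.
    maxCrawlingDepth: 2,
};

console.log(JSON.stringify(crawlInput, null, 2));
```

Leaving linkSelector empty would restrict the run to the start URLs only.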

How can I extract data from web pages using Web Scraper?

To extract data from web pages, you need to provide JavaScript code called the Page function. This function is executed in the context of each loaded web page. You can use client-side libraries like jQuery to manipulate the DOM and extract the desired data.
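A minimal sketch of what a Page function could look like is shown below. It assumes that jQuery injection is enabled so that context.jQuery is available, and that context.request.url and context.log.info exist as described in the Actor README; the selectors and the returned field names are made up for illustration.

```javascript
// Sketch of a Page function. In the Actor it runs inside the loaded page.
async function pageFunction(context) {
    // jQuery is read from the context, not from a global $ variable.
    const $ = context.jQuery;
    const title = $('title').first().text().trim();
    const heading = $('h1').first().text().trim();
    context.log.info(`Scraped ${context.request.url}`);
    // The returned object is stored as one record in the dataset.
    return { url: context.request.url, title, heading };
}
```

In the Actor input, this function is supplied as a string in the pageFunction field.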

Is it possible to use proxies with Web Scraper?

Yes, you can configure proxies for Web Scraper. You have the option to use Apify Proxy, custom HTTP proxies, or SOCKS5 proxies. Proxies can help prevent detection by target websites and provide additional anonymity.
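The two input fragments below sketch how the proxy options might be expressed. The useApifyProxy flag and the proxyRotation values appear in the input reference on this page; the proxyUrls field name and the proxy addresses are assumptions for illustration only.

```javascript
// Option 1: Apify Proxy, matching the default value in the input reference.
const apifyProxyInput = {
    proxyConfiguration: { useApifyProxy: true },
    // Rotation strategies apply only when Apify Proxy is used.
    proxyRotation: 'PER_REQUEST',
};

// Option 2: custom proxies. The proxyUrls field name and both URLs are
// assumptions; replace them with values from your proxy provider.
const customProxyInput = {
    proxyConfiguration: {
        useApifyProxy: false,
        proxyUrls: [
            'http://user:password@proxy.example.com:8000',
            'socks5://user:password@proxy.example.com:1080',
        ],
    },
};

console.log(JSON.stringify({ apifyProxyInput, customProxyInput }, null, 2));
```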

How can I handle authentication and login for websites with Web Scraper?

Web Scraper supports logging into websites by transferring cookies. You can set initial cookies in the “Initial cookies” field, which allows the scraper to use your session credentials. Cookies have a limited lifetime, so you may need to update them periodically.

How can I customize the behavior of Web Scraper?

Web Scraper provides advanced configuration options such as pre-navigation and post-navigation hooks. These options allow you to fine-tune the scraper’s behavior and perform additional actions during the scraping process.

How can I access and export the data scraped by Web Scraper?

The data scraped by Web Scraper is stored in a dataset. You can access and export this data in various formats such as JSON, XML, CSV, or as an Excel spreadsheet. The results can be downloaded using the Apify API or through the Apify Console. Check out the Apify API reference docs for full details.
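As an illustration, a small helper can build the download URL for a dataset in a chosen format. The endpoint path follows the Apify API v2 convention for dataset items and should be verified against the API reference; MY_DATASET_ID is a placeholder, and authentication is not covered here.

```javascript
// Build a download URL for dataset items in the requested format.
function buildDatasetItemsUrl(datasetId, format) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    // Typical values: json, xml, csv, xlsx.
    url.searchParams.set('format', format);
    return url.toString();
}

console.log(buildDatasetItemsUrl('MY_DATASET_ID', 'csv'));
// prints "https://api.apify.com/v2/datasets/MY_DATASET_ID/items?format=csv"
```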

Pricing

Pay per usage

Web Scraper

Developed by

Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (22)

695

Total users: 82.4k

Monthly users: 3.9k

Runs succeeded: >99%

Issue response: 32 days

Last modified: 18 days ago

Developer tools

Open source

Run mode

runMode (Enum, Optional)

This property indicates the scraper's mode of operation. In DEVELOPMENT mode, the scraper ignores page timeouts, doesn't use sessionPool, opens pages one by one and enables debugging via Chrome DevTools. Open the live view tab or the container URL to access the debugger. Further debugging options can be configured in the Advanced configuration section. PRODUCTION mode disables debugging and enables timeouts and concurrency.

For details, see Run mode in README.

Value options:

"PRODUCTION": string

"DEVELOPMENT": string

Default value of this property is "PRODUCTION"

Start URLs

startUrls (array, Required)

A static list of URLs to scrape.

For details, see Start URLs in README.

URL #fragments identify unique pages

keepUrlFragments (boolean, Optional)

Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such a case, this option should be enabled.

Default value of this property is false
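The effect of this option can be sketched with a small helper that computes the key used to decide whether a URL was already visited. This is only an illustration of the idea, not the Actor's actual deduplication code; the function name is hypothetical.

```javascript
// Return the key under which a URL would be remembered as "visited".
function visitedKey(rawUrl, keepUrlFragments = false) {
    const url = new URL(rawUrl);
    // By default the #fragment is dropped, so it does not create a new page.
    if (!keepUrlFragments) url.hash = '';
    return url.toString();
}

console.log(visitedKey('http://example.com/#/about'));       // prints "http://example.com/"
console.log(visitedKey('http://example.com/#/about', true)); // prints "http://example.com/#/about"
```

With the option disabled, http://example.com/#/about and http://example.com/#/contact collapse into one page; with it enabled, they are treated as two.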

Respect the robots.txt file

respectRobotsTxtFile (boolean, Optional)

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

Default value of this property is false

Link selector

linkSelector (string, Optional)

A CSS selector specifying which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Pseudo-URLs and/or Glob patterns setting.

If Link selector is empty, the page links are ignored.

For details, see Link selector in README.

Glob Patterns

globs (array, Optional)

Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.

Default value of this property is []

Pseudo-URLs

pseudoUrls (array, Optional)

Specifies what kind of URLs found by Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*].

If Pseudo-URLs are omitted, the Actor enqueues all links matched by the Link selector.

For details, see Pseudo-URLs in README.

Default value of this property is []
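To make the pseudo-URL syntax more concrete, the sketch below converts a pseudo-URL into a regular expression: text inside [] brackets is kept as a regular expression, and everything else is matched literally. It is a simplified illustration with a hypothetical function name, not the matching code the Actor uses, and it does not handle malformed patterns.

```javascript
// Simplified conversion of a pseudo-URL into a RegExp.
function pseudoUrlToRegExp(purl) {
    const escapeLiteral = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    let pattern = '';
    let chunk = '';
    let depth = 0;
    for (const char of purl) {
        if (char === '[') {
            if (depth === 0) {
                pattern += escapeLiteral(chunk); // flush the literal text so far
                chunk = '';
            } else {
                chunk += char; // a nested bracket belongs to the regex part
            }
            depth += 1;
        } else if (char === ']' && depth > 0) {
            depth -= 1;
            if (depth === 0) {
                pattern += `(${chunk})`; // flush the regex part
                chunk = '';
            } else {
                chunk += char;
            }
        } else {
            chunk += char;
        }
    }
    pattern += escapeLiteral(chunk);
    return new RegExp(`^${pattern}$`);
}

const matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
console.log(matcher.test('http://www.example.com/products/1')); // prints true
console.log(matcher.test('http://other.example.com/'));         // prints false
```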

Exclude Glob Patterns

excludes (array, Optional)

Glob patterns to match links in the page that you want to exclude from being enqueued.

Default value of this property is []

Page function

pageFunction (string, Required)

JavaScript (ES6) function that is executed in the context of every page loaded in the Chrome browser. Use it to scrape data from the page, perform actions or add new URLs to the request queue.

For details, see Page function in README.

Inject jQuery

injectJQuery (boolean, Optional)

If enabled, the scraper will inject the jQuery library into every web page loaded, before Page function is invoked. Note that the jQuery object ($) will not be registered into global namespace in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery in Page function.

Default value of this property is true

Proxy configuration

proxyConfiguration (object, Required)

Specifies proxy servers that will be used by the scraper in order to hide its origin.

For details, see Proxy configuration in README.

Default value of this property is {"useApifyProxy":true}

Proxy rotation

proxyRotation (Enum, Optional)

This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.

Value options:

"RECOMMENDED": string

"PER_REQUEST": string

"UNTIL_FAILURE": string

Default value of this property is "RECOMMENDED"

Session pool name

sessionPoolName (string, Optional)

Use only English alphanumeric characters, dashes, and underscores. A session is a representation of a user. It has its own IP and cookies, which are then used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.
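The naming rule can be checked before starting a run with a small helper like the one below. The function and constant names are hypothetical, and the pattern is only one reading of the rule stated above (letters, digits, dashes, and underscores).

```javascript
// Letters A-Z and a-z, digits, dashes, and underscores only.
const SESSION_POOL_NAME_PATTERN = /^[A-Za-z0-9_-]+$/;

function isValidSessionPoolName(name) {
    return SESSION_POOL_NAME_PATTERN.test(name);
}

console.log(isValidSessionPoolName('my-pool_1')); // prints true
console.log(isValidSessionPoolName('my pool'));   // prints false
```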

Initial cookies

initialCookies (array, Optional)

A JSON array with cookies that will be set to every Chrome browser tab opened before loading the page, in the format accepted by Puppeteer's Page.setCookie() function. This option is useful for transferring a logged-in session from an external web browser.

Default value of this property is []
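One way to prepare this array is to convert a Cookie header copied from a logged-in browser session. The helper below is a sketch with a hypothetical name; it emits the name, value, domain, and path fields that Puppeteer's Page.setCookie() accepts, and the cookie values and domain are placeholders.

```javascript
// Convert a "name=value; name2=value2" Cookie header into cookie objects.
function cookieHeaderToInitialCookies(cookieHeader, domain) {
    return cookieHeader
        .split(';')
        .map((pair) => pair.trim())
        .filter((pair) => pair.includes('='))
        .map((pair) => {
            // Split on the first "=" only, since values may contain "=".
            const index = pair.indexOf('=');
            return {
                name: pair.slice(0, index),
                value: pair.slice(index + 1),
                domain,
                path: '/',
            };
        });
}

const initialCookies = cookieHeaderToInitialCookies('sessionid=abc123; theme=dark', '.example.com');
console.log(JSON.stringify(initialCookies, null, 2));
```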

Use Chrome

useChrome (boolean, Optional)

If enabled, the scraper will use a real Chrome browser instead of Chromium bundled with Puppeteer. This option may help bypass certain anti-scraping protections, but might make the scraper unstable. Use at your own risk 🙂

Default value of this property is false

Run browsers in headless mode

headless (boolean, Optional)

By default, browsers run in headless mode. You can toggle this off to run them in headful mode, which can help with certain rare anti-scraping protections but is slower and more costly.

Default value of this property is true

Ignore SSL errors

ignoreSslErrors (boolean, Optional)

If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.

Default value of this property is false

Ignore CORS and CSP

ignoreCorsAndCsp (boolean, Optional)

If enabled, the scraper will ignore Content Security Policy (CSP) and Cross-Origin Resource Sharing (CORS) settings of visited pages and requested domains. This enables you to freely use XHR/Fetch to make HTTP requests from Page function.

Default value of this property is false

Download media files

downloadMedia (boolean, Optional)

If enabled, the scraper will download media such as images, fonts, videos and sound files, as usual. Disabling this option might speed up the scrape, but certain websites could stop working correctly.

Default value of this property is true

Download CSS files

downloadCss (boolean, Optional)

If enabled, the scraper will download CSS files with stylesheets, as usual. Disabling this option may speed up the scrape, but certain websites could stop working correctly, and the live view will not look as cool.

Default value of this property is true

Max page retries

maxRequestRetries (integer, Optional)

The maximum number of times the scraper will retry loading each web page after a page load error or an exception thrown by Page function.

If set to 0, the page will be considered failed right after the first error.

Default value of this property is 3
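The retry budget can be read as one initial attempt plus up to maxRequestRetries further attempts. The sketch below illustrates that reading with a hypothetical helper; it is not the Actor's implementation.

```javascript
// Run a task once, then retry it up to maxRequestRetries times on error.
async function processWithRetries(task, maxRequestRetries = 3) {
    let lastError;
    for (let attempt = 0; attempt <= maxRequestRetries; attempt += 1) {
        try {
            return { result: await task(attempt), attempts: attempt + 1 };
        } catch (error) {
            lastError = error; // remember the error and try again
        }
    }
    // All attempts failed: the page is considered failed.
    throw lastError;
}
```

Under this reading, the default of 3 allows up to four attempts in total, and a value of 0 allows exactly one.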

Max pages per run

maxPagesPerCrawl (integer, Optional)

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.

If set to 0, there is no limit.

Default value of this property is 0

Max result records

maxResultsPerCrawl (integer, Optional)

The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached.

If set to 0, there is no limit.

Default value of this property is 0

Max crawling depth

maxCrawlingDepth (integer, Optional)

Specifies how many links away from Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using context.enqueuePage() in Page function are not subject to the maximum depth constraint.

If set to 0, there is no limit. To crawl only the pages specified by the Start URLs, set linkSelector empty instead.

Default value of this property is 0

Max concurrency

maxConcurrency (integer, Optional)

Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.

Default value of this property is 50

Page load timeout

pageLoadTimeoutSecs (integer, Optional)

The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to Max page retries), as with other page load errors.

Default value of this property is 60

Page function timeout

pageFunctionTimeoutSecs (integer, Optional)

The maximum amount of time the scraper will wait for Page function to execute, in seconds. It's a good idea to set this limit to ensure that unexpected behavior in Page function will not get the scraper stuck.

Default value of this property is 60

Navigation waits until

waitUntil (array, Optional)

Contains a JSON array with names of page events to wait for before considering a web page fully loaded. The scraper will wait until all of the events are triggered in the web page before executing Page function. Available events are domcontentloaded, load, networkidle2 and networkidle0.

For details, see waitUntil option in Puppeteer's Page.goto() function documentation.

Default value of this property is ["networkidle2"]

Pre-navigation hooks

preNavigationHooks (string, Optional)

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. Each function accepts two parameters: crawlingContext and gotoOptions. The gotoOptions object is passed to the page.goto() function the crawler calls to navigate.

Post-navigation hooks

postNavigationHooks (string, Optional)

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. Each function accepts crawlingContext as the only parameter.
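The sketch below shows one possible shape for the two hook arrays, plus a tiny runner that mimics sequential evaluation. It assumes that gotoOptions accepts Puppeteer's referer option and that crawlingContext.response holds the navigation response in post-navigation hooks; runHooks is a hypothetical helper for local experimentation, not part of the Actor.

```javascript
// Pre-navigation: adjust the options forwarded to page.goto().
const preNavigationHooks = [
    async (crawlingContext, gotoOptions) => {
        gotoOptions.referer = 'https://www.example.com/'; // placeholder URL
    },
];

// Post-navigation: fail the page when the server returned an error status.
const postNavigationHooks = [
    async (crawlingContext) => {
        // Assumption: response is the Puppeteer response of the navigation.
        const status = crawlingContext.response?.status();
        if (status >= 400) throw new Error(`Navigation failed with HTTP ${status}`);
    },
];

// Evaluate hooks one after another, in array order.
async function runHooks(hooks, ...args) {
    for (const hook of hooks) {
        await hook(...args);
    }
}
```

In the Actor input, each array would be supplied as the text of the corresponding field.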

Insert breakpoint

breakpointLocation (Enum, Optional)

This property has no effect if Run mode is set to PRODUCTION. When Run mode is set to DEVELOPMENT, it inserts a breakpoint at the selected location in every page the scraper visits. Execution of code stops at the breakpoint until manually resumed in the DevTools window, accessible via the Live View tab or Container URL. Additional breakpoints can be added by adding debugger; statements within your Page function.

See Run mode in README for details.

Value options:

"NONE": string

"BEFORE_GOTO": string

"BEFORE_PAGE_FUNCTION": string

"AFTER_PAGE_FUNCTION": string

Default value of this property is "NONE"

Dismiss cookie modals

closeCookieModals (boolean, Optional)

If enabled, the crawler will automatically try to dismiss cookie consent modals using the I don't care about cookies browser extension. This can be useful when crawling European websites that show cookie consent modals.

Default value of this property is false

Maximum scrolling distance in pixels

maxScrollHeightPixels (integer, Optional)

The crawler will scroll down the page until all content is loaded or the maximum scrolling distance is reached. Setting this to 0 disables scrolling altogether.

Default value of this property is 5000

Debug log

debugLog (boolean, Optional)

If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use context.log.debug('message') to log your own debug messages from Page function.

Default value of this property is false

Browser log

browserLog (boolean, Optional)

If enabled, the Actor log will include console messages produced by JavaScript executed by the web pages (e.g. using console.log()). Beware that this may result in the log being flooded by error messages, warnings and other messages of little value, especially with high concurrency.

Default value of this property is false

Custom data

customData (object, Optional)

A custom JSON object that is passed to Page function as context.customData. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.

Default value of this property is {}
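A short sketch of how a Page function might read these parameters follows. The label key is a made-up example of custom data, and context.request.url is assumed to be available as described in the Actor README.

```javascript
// Page function that tags every record with a value passed via customData.
async function pageFunction(context) {
    // Fall back to a default when the caller did not pass a label.
    const { label = 'default' } = context.customData;
    context.log.debug(`Tagging ${context.request.url} with label ${label}`);
    return { url: context.request.url, label };
}
```

With an input containing customData set to { "label": "spring-sale" }, each stored record would carry that label.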

Dataset name

datasetName (string, Optional)

Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.

Key-value store name

keyValueStoreName (string, Optional)

Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.

Request queue name

requestQueueName (string, Optional)

Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.

Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Apify

7.5k

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

6.6k

Playwright Scraper

apify/playwright-scraper

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library using provided server-side Node.js code. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

1.4k

Scrape And Bypass Any Url Using Scrappey

dormic/apify-scrappey

A template for scraping data from web pages using the Scrappey.com API service integrated with an Apify Actor. This actor provides a robust solution for handling complex web scraping scenarios, including sites with anti-bot protection such as Cloudflare, Datadome, PerimeterX and all other forms.

Pim

Dynamic Web Scraper

josejet/dynamic-web-scraper

Dynamic Web Scraper is an Apify Actor that gathers information online by simulating user browsing behavior on the web. It reduces the time and amount of scraped web pages by using a model (ChatGPT) to make decisions regarding browser navigation and results evaluation.

Pepa J

Cloudflare Web Scraper

ecomscrape/cloudflare-web-scraper

Cloudflare Web Scraper extracts data from Cloudflare-protected websites. You can customize parameters such as proxies, timeouts, and JavaScript execution, making it ideal for reports, spreadsheets, and applications.

ecomscrape

Vanilla JS Scraper

mstephen190/vanilla-js-scraper

Scrape the web using familiar JavaScript methods! Crawls websites using raw HTTP requests, parses the HTML with the JSDOM package, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a non-jQuery alternative to Cheerio Scraper.

Matthias Stephens

461

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Apify

48.7k

Web Scraping API

zeeb0t/web-scraping-api---scrape-any-website

Web Scraping API that quickly and reliably scrapes any website—no selectors required. Premium proxies, CAPTCHA solving, JavaScript rendering, and automated structured data extraction are all included. It’s just $2 per 1,000 web pages scraped, with no minimum spend.

Anthony Ziebell

393

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.