Apify's universal web scrapers

Looking for web scraping boilerplate code to save you development time? Our web scrapers will provide a solid base so you don't have to build your own scraping or automation tool from scratch. Just indicate which web pages to load and how to extract data from each of them. Configure and run your web scrapers manually in a user interface or programmatically via an API.

How do those scrapers work?

Start off with Start URLs, follow page links with a Link selector or Pseudo-URLs, and extract data into a dataset with a pageFunction. Use these web scraping boilerplates for faster development of your data extraction or web automation solutions.
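
For instance, a minimal pageFunction might look like the sketch below. This is only an illustration, not the exact template any scraper ships with: the selector is hypothetical, it assumes a browser-based scraper such as Web Scraper, and the properties available on context vary between scrapers.

```javascript
// Minimal pageFunction sketch: runs once per loaded page and returns
// an object that is stored in the dataset.
async function pageFunction(context) {
    const { request } = context; // the request that led to this page

    return {
        url: request.url,
        title: document.title,
        // Hypothetical selector - adjust it to the page you're scraping.
        heading: document.querySelector('h1')?.textContent.trim(),
    };
}
```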

Pick a base for you

Let's pick a boilerplate for you

A list of generic, universal scrapers suited for different libraries, browsers, and frameworks. If it's a dynamic page with JavaScript rendering or you're building a browser automation tool, go for Web Scraper, Puppeteer Scraper, or Playwright Scraper. If all you want is to send an HTTP request and get HTML back, less resource-intensive scrapers like Cheerio, Vanilla JS, or JSDOM will cover your needs.

Web Scraper

Easiest-to-use scraping tool designed to navigate a headless Chromium browser. Gives you access to in-browser JavaScript with pageFunction executed in the browser context. Will extract structured data from a webpage with just a few lines of JavaScript code.

Puppeteer Scraper

Top alternative to Apify Web Scraper. Full-browser solution with support for website login, recursive crawling, and batches of URLs in Chrome. A pageFunction executed in the Node.js context allows easy control of the browser. Will handle React, Vue, or any other front-end-heavy website.

Playwright Scraper

Puppeteer on steroids. Full support for features that go beyond Chromium-based browsers. Allows full programmatic control of Firefox and WebKit (Safari's browser engine) with only a few commands executed in the Node.js environment. Suitable for building both scraping and web automation solutions.

Cheerio Scraper

Quick and lightweight alternative to Web Scraper. Suitable for websites that don't render content dynamically. Powered by the Cheerio library, this tool can process hundreds of raw HTML pages via plain HTTP requests. 20x faster scraping than using a full-browser solution.

Vanilla JS Scraper

Non-jQuery alternative to Cheerio Scraper. Well-suited for scraping web pages that do not rely on client-side JavaScript to serve their content. Achieves 20x faster performance than using a full-browser solution such as Puppeteer.

JSDOM Scraper

A balanced solution for HTML parsing. Fast like Cheerio Scraper, powerful like the browser scrapers. Powered by the JSDOM library, it can easily process client-side JavaScript without a real browser.
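
To illustrate what that means in practice, here is a tiny standalone jsdom sketch (not Apify-specific) that executes an inline script while parsing the HTML:

```javascript
const { JSDOM } = require('jsdom');

// A page whose visible text is produced by client-side JavaScript.
const html = `
  <body>
    <p id="msg">Loading...</p>
    <script>document.getElementById('msg').textContent = 'Hello from JS';</script>
  </body>`;

// runScripts: 'dangerously' tells jsdom to execute the page's scripts.
const dom = new JSDOM(html, { runScripts: 'dangerously' });

console.log(dom.window.document.querySelector('#msg').textContent);
// -> "Hello from JS"
```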

BeautifulSoup Scraper

Python alternative to Cheerio Scraper made for web pages that do not require client-side JavaScript. Beautiful Soup is a Python library for the easy parsing of HTML and XML documents. Its powerful search functions let you search for elements based on tags, attributes, or CSS classes.

Main features

Don't build your scraper from scratch

Explore web scraping boilerplates to suit your needs: lightweight or heavyweight, driving a full browser or relying on plain HTTP requests.

Launch web automation tools

Tap into Puppeteer and Playwright libraries. Run Chrome, Firefox and Safari, handle lists and queues of URLs, utilize automatic website login, and manage concurrency for maximum performance.

Extract data from any webpage

Extract data at any scale with a few lines of code and powerful infrastructure on your side. Use fingerprints based on real-world data, with no configuration necessary.

Scale your scrapers on an all-in-one platform

Rely on the Apify platform to simplify your web scraper development. Pick from the pool of proxies, create tasks, and schedule your scrapers. Bypass modern website anti-bot protection systems.

Frequently asked questions

What's the difference between Web Scraper and Puppeteer Scraper?

At a glance, it could seem like Web Scraper and Puppeteer Scraper are the same tool. In a way, they are, since Web Scraper uses Puppeteer under the hood.

The difference is in the amount of control each gives you. Where Web Scraper's pageFunction only gives you access to in-browser JavaScript, Puppeteer Scraper's pageFunction is executed in the Node.js context and practically turns the browser into your puppet - hence the name. Running in Node.js also makes it much easier to work with external APIs, databases, or the Apify SDK.
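
To make that concrete, here is a rough sketch of a Puppeteer Scraper-style pageFunction that mixes browser control with a Node.js-only capability: calling an external API. The endpoint is hypothetical, and the sketch assumes the function receives the Puppeteer page via context and runs on a Node.js version with a global fetch (18+).

```javascript
// Runs in Node.js, not in the browser, so Node APIs are available
// alongside full control of the Puppeteer page.
async function pageFunction(context) {
    const { page, request } = context;

    const title = await page.title();

    // Hypothetical external API call - something an in-browser
    // pageFunction could not easily do.
    const res = await fetch(`https://api.example.com/enrich?title=${encodeURIComponent(title)}`);
    const enrichment = await res.json();

    return { url: request.url, title, enrichment };
}
```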

So why would you choose one over the other? It depends on your priorities: is it power or ease of use? While Web Scraper is a simple, easy-to-start solution, Puppeteer Scraper is its more powerful older sibling, better suited for web scraping pros and complicated websites.

You can read more on their differences in our Docs.

Is Playwright better than Puppeteer?

Playwright is just a modern take on Puppeteer, so is one any better than the other? One of the major differences between them is that Playwright offers cross-browser support and is able to drive Chromium, WebKit (Safari's browser engine), and Firefox, while Puppeteer only supports Chromium. Besides, since Puppeteer is a Node.js library, it is restricted to JavaScript. But Playwright can go further than Node.js and be used with other languages - like Python or Java, for instance. The technology that you're going to use will be heavily dependent on your specific use case; differences in syntax matter a lot.

You can read more on the differences between Playwright and Puppeteer on our blog. Despite the differences between the technologies powering them, both our Puppeteer and Playwright scrapers achieve similar performance.

Why is Playwright better than Selenium?

While Selenium remains a popular browser automation library, it is an older technology than Playwright, which may be a major reason why it can appear slower in terms of performance. In addition, Playwright offers functionality that Selenium does not cover - for instance, Playwright's modern Test Generator feature.

How do I authenticate proxy with Puppeteer?

When paired with a proxy, Puppeteer can be practically unstoppable, especially when it comes to data extraction. Normally, when using a proxy that requires authentication in a non-headless browser (specifically Chrome), you have to enter credentials into a popup dialog. There are four ways to handle this:

  1. use the authenticate() method on the Puppeteer page object (example below)
  2. use the proxy-chain NPM package
  3. within ProxyConfigurationOptions in the Apify SDK
  4. set the Proxy-Authorization header
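
For example, option 1 might look like the sketch below; the proxy address and credentials are placeholders:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    // Point the browser at the proxy (placeholder host and port).
    const browser = await puppeteer.launch({
        args: ['--proxy-server=http://proxy.example.com:8000'],
    });

    const page = await browser.newPage();

    // Answer the proxy's authentication challenge programmatically
    // instead of typing credentials into the popup dialog.
    await page.authenticate({ username: 'user', password: 'pass' });

    await page.goto('https://example.com');
    console.log(await page.title());

    await browser.close();
})();
```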

You can read a detailed guide on how to use each one of those methods on our blog.

Can Puppeteer render JavaScript?

Yes. Puppeteer is, by design, an automation library with an off-label use for web scraping due to its ability to control browsers and render JavaScript. You can use Puppeteer Scraper to extract data from the websites that use JS to load their content dynamically.

What is Playwright used for?

Playwright is a powerful automation library. Due to its ability to control Chromium (and Chromium-based browsers such as Microsoft Edge), Firefox, and WebKit while flawlessly rendering JavaScript, Playwright has become very popular in the web scraping community. You can use Playwright Scraper to extract data from the web, particularly from websites that use JavaScript to load their content dynamically.

How do you scrape with Playwright?

You can build a scraper yourself by following these steps:

  1. Start a browser with Playwright
  2. Click buttons and wait for actions
  3. Extract data from a website

Or use our boilerplate, Playwright Scraper, and let it handle all the overhead for you. If you'd still rather build it yourself, follow this step-by-step guide on how to build a scraper using Playwright.
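
As a rough illustration of those three steps, a minimal Playwright script might look like the sketch below (the URL, button, and item selectors are placeholders):

```javascript
const { chromium } = require('playwright');

(async () => {
    // 1. Start a browser with Playwright
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com');

    // 2. Click buttons and wait for actions (hypothetical selectors)
    await page.click('button#load-more');
    await page.waitForSelector('.item');

    // 3. Extract data from the website
    const items = await page.$$eval('.item', (els) =>
        els.map((el) => el.textContent.trim())
    );
    console.log(items);

    await browser.close();
})();
```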

How do I make my scraper run faster?

One way is to opt for lightweight scrapers (Cheerio, Vanilla JS) instead of heavyweight, full-browser ones (Puppeteer, Playwright). Scraping a page with Cheerio Scraper alone can drastically reduce the time your scraper needs to finish the task. But for this option to work, you'll need some prior knowledge, such as a good understanding of the web page you intend to scrape.

Your second option is API scraping. In this case, you don't even need to build a scraper. Your task is to locate the website's API endpoints and collect the desired data directly from them, as opposed to parsing it out of rendered HTML pages. This approach sometimes requires a bit of detective work, but compared to the other options, it will be the fastest-performing.
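
As a rough illustration, once you've spotted a JSON endpoint in the browser's Network tab, collecting the data can be as simple as this (the endpoint and fields are hypothetical; assumes Node.js 18+ for the built-in fetch):

```javascript
(async () => {
    // Hypothetical endpoint discovered via the browser's Network tab.
    const res = await fetch('https://example.com/api/products?page=1');
    const { products } = await res.json();

    // No HTML parsing needed - the data arrives already structured.
    console.log(products.map((p) => p.name));
})();
```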

What is the Cheerio library?

Cheerio is a fast and flexible implementation of core jQuery designed to run server-side, working with raw HTML data. Unlike a web browser, it does not render the page: the library parses markup directly and provides an API for manipulating the resulting data structure. This makes Cheerio much faster than full-browser automation solutions.
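
A quick standalone sketch of what that looks like (the HTML string is just an inline example):

```javascript
const cheerio = require('cheerio');

// Parse raw HTML without spinning up a browser.
const $ = cheerio.load('<ul><li class="item">One</li><li class="item">Two</li></ul>');

// The familiar jQuery-style API works on the parsed structure.
const items = $('.item').map((i, el) => $(el).text()).get();
console.log(items); // [ 'One', 'Two' ]
```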

Our HTML scraping tool, Cheerio Scraper, runs purely on the Cheerio library. You can give it a try and see how you can extract data from 10K pages for a mere $9.

What is the main difference between Cheerio and jQuery?

Cheerio is a lean implementation of core jQuery designed specifically for the server. jQuery is an open-source JavaScript library that simplifies DOM manipulation by allowing users to find, select and manipulate elements with specific properties, making it easier to navigate web applications.

On the other hand, Cheerio is an implementation of core jQuery for the server, used for parsing HTML and XML in Node.js. This gives Cheerio a familiar syntax while providing a powerful API for traversing and manipulating the resulting data structures, making it an essential tool for web scraping.

Therefore, Cheerio and jQuery are not radically different, since Cheerio implements the core of jQuery's API. The practical difference is that jQuery is a general-purpose library built to manipulate live pages in the browser, while Cheerio is aimed specifically at querying and manipulating parsed HTML on the server.

If you don't feel like learning jQuery to use it for your scraping projects, take a look at Vanilla JS Scraper. It's a non-jQuery alternative which still uses raw HTTP requests to extract data from webpages using Node.js code.

Can Cheerio render JavaScript?

No, it can't. The Cheerio library is an HTML parser, which means it does not interpret the page the way a web browser does and, therefore, can't render JavaScript. If you need to scrape dynamically loading pages, you can opt for Puppeteer or Playwright Scraper.

How do I web scrape with Cheerio?

  1. Launch Cheerio Scraper.
  2. Set up Start URLs.
  3. Locate your data.
  4. Set up the Page function (see the sketch below).
  5. Run the scraper.
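
For step 4, the Page function might look roughly like this. It's only a sketch: the selectors are hypothetical, and it assumes the function receives the Cheerio handle ($) and the current request via context.

```javascript
// A rough Page function sketch for Cheerio Scraper.
async function pageFunction(context) {
    const { $, request } = context;

    // Hypothetical selectors - adapt them to the site you're scraping.
    const title = $('h1').first().text().trim();
    const prices = $('.price').map((i, el) => $(el).text().trim()).get();

    return { url: request.url, title, prices };
}
```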

We have a short guide on how to extract data from any website using Cheerio Scraper, so don't miss out on quite a few straightforward tips.

However, if the website you are scraping has content that must be loaded dynamically, we advise using Web Scraper instead.

How can I scrape a website with Cheerio without using jQuery?

jQuery takes some time to get the hang of. Here's how you can scrape a website without having to learn it:

  1. Add Start URLs to the queue.
  2. Construct a DOM from the fetched HTML string.
  3. Execute Page function.
  4. Find all links from the page using the Link selector.
  5. Add unvisited Pseudo-URLs to the queue.

For each of these steps, you can avoid using jQuery by using an alternative scraper, Vanilla JS. Since it's built on the Cheerio library, it's just as efficient as Cheerio Scraper. And while Vanilla JS Scraper cannot be used to automate actions on the website, you can still use it to send thousands of requests within minutes.
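
If you'd rather skip jQuery syntax altogether, the same kind of extraction can be written against the standard DOM API. The snippet below is a plain-DOM sketch with hypothetical selectors; run it in a browser context, such as the DevTools console on the page you're inspecting.

```javascript
// jQuery style:
//   const titles = $('.product .title').map((i, el) => $(el).text().trim()).get();

// Plain DOM equivalent - no jQuery required:
const titles = Array.from(document.querySelectorAll('.product .title'))
    .map((el) => el.textContent.trim());
console.log(titles);
```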