Apify's universal web scrapers
Looking for web scraping boilerplate code to save you development time? Our web scrapers provide a solid base, so you don't have to build your own scraping or automation tool from scratch. Just indicate which web pages to load and how to extract data from them. Configure and run your web scrapers manually in a user interface or programmatically via an API.
How do those scrapers work?
Start off with startURLs, follow page links with a Link selector or indicate Pseudo-URLs, and extract data into a dataset with a pageFunction. Use these web scraping boilerplates for faster development of your data extraction or web automation solutions.
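As a sketch, a hypothetical scraper input tying these pieces together might look like this (the field names follow the scraper UI, but the URLs and selectors below are made-up examples, not a real configuration):

```javascript
// Hypothetical scraper input: startUrls seed the crawl, the Link selector and
// Pseudo-URLs decide which links to follow, and pageFunction extracts the data.
const input = {
  startUrls: [{ url: 'https://example.com/products' }],
  linkSelector: 'a[href]',                                     // which links to consider
  pseudoUrls: [{ purl: 'https://example.com/products/[.*]' }], // which links to enqueue
  pageFunction: async function pageFunction(context) {
    const { $, request } = context; // $ is jQuery injected into the loaded page
    return {
      url: request.url,
      title: $('h1').first().text().trim(), // one dataset row per page
    };
  },
};
```

Each page the crawler visits runs the pageFunction once, and whatever object it returns is appended to the dataset.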
Let’s pick a boilerplate for you
A list of generic, universal scrapers suited for different libraries, browsers, and frameworks. If it's a dynamic page with JavaScript rendering, or you're building a browser automation tool, go for Web Scraper, Puppeteer Scraper, or Playwright Scraper. If all you want is to send an HTTP request and get HTML back, less resource-intensive scrapers like Cheerio, Vanilla JS, or JSDOM will cover your needs.
Puppeteer Scraper 👐
Top alternative to Apify Web Scraper. A full-browser solution with support for website login, recursive crawling, and batches of URLs in Chrome. Its pageFunction is executed in a Node.js context, allowing easy control of the browser. It will handle any React, Vue, or other front-end-heavy website.
BeautifulSoup Scraper 🍲
Python alternative to Cheerio Scraper made for web pages that do not require client-side JavaScript. Beautiful Soup is a Python library for the easy parsing of HTML and XML documents. Its powerful search functions let you search for elements based on tags, attributes, or CSS classes.
Explore web scraping boilerplates of both kinds: lightweight ones that work over plain HTTP requests and heavyweight ones that drive a full browser.
Tap into the Puppeteer and Playwright libraries. Run Chromium, Firefox, and WebKit (Safari's browser engine), handle lists and queues of URLs, use automatic website login, and manage concurrency for maximum performance.
Extract data at any scale with a few lines of code and powerful infrastructure on your side. Use fingerprints based on real-world data, no configuration necessary.
Rely on the Apify platform to simplify your web scraper development. Pick from the pool of proxies, create tasks, and schedule your scrapers. Bypass modern website anti-bot protection systems.
At a glance, it could seem like Web Scraper and Puppeteer Scraper are the same tool. In a way, they are since Web Scraper uses Puppeteer under the hood.
The difference is in the amount of control each gives you. Where Web Scraper's pageFunction only gives you access to in-browser JavaScript, Puppeteer Scraper's pageFunction is executed in a Node.js context, which practically turns the browser into your puppet; hence the naming. The Node.js context also makes Puppeteer Scraper much easier to use with external APIs, databases, or the Apify SDK.
So why would you choose one over the other? It depends on your priorities: is it power or ease of use? While Web Scraper is a simple, easy-to-start solution, Puppeteer Scraper is its more powerful older sibling, better suited for web scraping pros and complicated websites.
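To illustrate the difference, here is a hedged sketch of what a Puppeteer Scraper pageFunction might look like; unlike Web Scraper's, it runs in Node.js and receives the full Puppeteer page object (the selector below is a made-up example):

```javascript
// Hypothetical Puppeteer Scraper pageFunction: runs in Node.js, so it can
// drive the browser directly through the Puppeteer `page` object.
async function pageFunction(context) {
  const { page, request } = context;
  await page.waitForSelector('h1'); // wait out client-side rendering
  const title = await page.$eval('h1', el => el.textContent.trim());
  return { url: request.url, title }; // becomes one dataset row
}
```

Because the function runs in Node.js rather than inside the page, it can also call external APIs or databases between browser actions.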
You can read more on their differences in our Docs.
Playwright is just a modern take on Puppeteer, so is one any better than the other? One of the major differences between them is that Playwright offers cross-browser support and can drive Chromium, WebKit (Safari's browser engine), and Firefox, while Puppeteer only supports Chromium. Besides, since Puppeteer is a Node.js library, it is restricted to JavaScript, whereas Playwright goes beyond Node.js and can be used with other languages, such as Python or Java. The technology you choose will depend heavily on your specific use case, and differences in syntax matter a lot.
You can read more about the differences between Playwright and Puppeteer on our blog. Despite the different technologies powering them, our Puppeteer and Playwright scrapers achieve similar performance.
Although Selenium is a popular browser automation library, it is an older technology than Playwright, which may be a major reason why the former can appear slower, performance-wise. In addition, Playwright offers functionality that Selenium does not cover, such as its modern Test Generator feature.
When paired with a proxy, Puppeteer can be practically unstoppable, especially for data extraction. Normally, when using a proxy that requires authentication in a non-headless browser (specifically Chrome), you'll be asked to enter credentials into a popup dialog. There are 4 ways to do that:
- use the authenticate() method on the Puppeteer page object
- use the proxy-chain NPM package
- set it within ProxyConfigurationOptions in the Apify SDK
- set the Proxy-Authorization header
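As a minimal sketch of the first option, assuming a page already opened in a browser launched with a --proxy-server argument (the target URL below is a placeholder):

```javascript
// Option 1 sketched: answer the proxy's auth dialog with page.authenticate().
async function openThroughProxy(page, username, password) {
  await page.authenticate({ username, password }); // call before navigating
  return page.goto('https://example.com');         // placeholder target URL
}
```

The credentials are sent automatically whenever the proxy challenges a request, so no popup dialog ever appears.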
You can read a detailed guide on how to use each one of those methods on our blog.
Yes. Puppeteer is, by design, an automation library with an off-label use for web scraping, thanks to its ability to control browsers and render JavaScript. You can use Puppeteer Scraper to extract data from websites that use JS to load their content dynamically.
Playwright is a powerful automation library. Due to its ability to control Chromium (including Chromium-based browsers such as Microsoft Edge and Opera), Firefox, and WebKit while flawlessly rendering JavaScript, Playwright has become very popular in the web scraping community. You can use Playwright Scraper to extract data from the web, particularly from websites that use JavaScript to load their content dynamically.
You can build a scraper yourself by following these steps:
- Start a browser with Playwright
- Click buttons and wait for actions
- Extract data from a website
Or use our boilerplate, Playwright Scraper, and let it handle all the overhead for you. If you'd still rather build it yourself, follow this step-by-step guide on how to build a scraper using Playwright.
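The three steps above can be sketched against the Playwright API like this. Note that browserType is injected (e.g. chromium, firefox, or webkit from the playwright package) and the button selector is a made-up example:

```javascript
// Sketch of the three steps: start a browser, act on the page, extract data.
async function scrapeTitle(browserType, url) {
  const browser = await browserType.launch();     // 1. start a browser with Playwright
  const page = await browser.newPage();
  await page.goto(url);
  await page.click('#load-more').catch(() => {}); // 2. click a (hypothetical) button
  const title = await page.title();               // 3. extract data from the page
  await browser.close();
  return title;
}
```

Injecting the browser type is what gives Playwright its cross-browser flexibility: the same function can drive Chromium, Firefox, or WebKit unchanged.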
One way to do it is to opt for lightweight scrapers (Cheerio, Vanilla JS) instead of heavyweight, full-browser ones (Puppeteer, Playwright). Scraping the page with Cheerio Scraper alone would drastically reduce the time your scraper needs to finish the task. But for this option to work, you will need some prior knowledge, such as a good understanding of the web page you intend to scrape.
Your second option is API scraping. In this case, you don't even need to build a scraper: the task is to locate the website's API endpoints and collect the desired data directly from them, rather than parsing it out of rendered HTML pages. This route sometimes requires a bit of detective work, but compared to the other options, it will be the fastest-performing.
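As a hedged sketch of the API route (the endpoint URL and response shape below are invented; find the real ones in your browser DevTools' Network tab):

```javascript
// Hypothetical API scraping: call the site's own JSON endpoint directly
// instead of parsing rendered HTML. fetchImpl defaults to the global fetch.
async function fetchProducts(fetchImpl = fetch) {
  const res = await fetchImpl('https://example.com/api/products?page=1'); // invented endpoint
  const data = await res.json(); // the data arrives already structured
  return data.items.map(p => ({ name: p.name, price: p.price }));
}
```

Because the endpoint returns structured JSON, there is no HTML parsing at all, which is why this approach tends to be the fastest.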
Cheerio is a fast and flexible implementation of core jQuery designed to run server-side, working with raw HTML data. Unlike a web browser, it does not interpret the result: the library parses markup directly and provides an API for manipulating the resulting data structure. This makes Cheerio much faster than full-browser web automation solutions.
Our HTML scraping tool, Cheerio Scraper, runs purely on the Cheerio library. You can give it a try and see how you can extract data from 10K pages for a mere $9.
Cheerio is a lean implementation of core jQuery designed specifically for the server. jQuery is an open-source JavaScript library that simplifies DOM manipulation by allowing users to find, select and manipulate elements with specific properties, making it easier to navigate web applications.
Cheerio, on the other hand, brings that core jQuery API to the server for parsing HTML and XML in Node.js. This gives it a familiar syntax and a powerful API for traversing and manipulating the resulting data structures, making it an essential tool for web scraping.
So Cheerio and jQuery are not that different: Cheerio implements much of the core jQuery API. The practical difference is that jQuery runs in the browser and works with a live DOM, while Cheerio parses static markup on the server.
If you don't feel like learning jQuery for your scraping projects, take a look at Vanilla JS Scraper 🍦. It's a jQuery-free alternative that still uses plain HTTP requests and Node.js code to extract data from web pages.
No, it can't. Cheerio library is an HTML parser which means it is not capable of interpreting results as a web browser does and, therefore, can't render JavaScript. If you need to scrape dynamically loading pages, you can opt for Puppeteer or Playwright Scraper.
- Launch Cheerio Scraper.
- Set up StartURLs.
- Locate your data.
- Set up Page function.
- Run the scraper.
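For step 4, a hypothetical Cheerio Scraper Page function might look like the sketch below; $ is Cheerio loaded with the fetched page's HTML, and the .price selector is a made-up example:

```javascript
// Hypothetical Cheerio Scraper pageFunction: `$` queries the parsed HTML
// with plain CSS selectors, no browser involved.
async function pageFunction(context) {
  const { $, request } = context;
  const prices = [];
  $('.price').each((i, el) => prices.push($(el).text().trim())); // example selector
  return { url: request.url, prices }; // one dataset row per page
}
```

Since no browser is launched, each page costs only one HTTP request plus a parse, which is where Cheerio Scraper's speed comes from.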
We have a short guide on how to extract data from any website using Cheerio Scraper, so don't miss out on quite a few straightforward tips.
However, if the website you are scraping has content that must be loaded dynamically, we advise using Web Scraper instead.
jQuery takes some time to get the hang of. Here's how you can scrape a website without having to learn it:
- Add StartURLs to the queue.
- Construct a DOM from the fetched HTML string.
- Execute Page function.
- Find all links from the page using the Link selector.
- Add unvisited PseudoURLs to the queue.
For each of these steps, you can avoid using jQuery by using an alternative scraper, Vanilla JS. Since it's built on the Cheerio library, it's just as efficient as Cheerio Scraper. And while Vanilla JS Scraper cannot be used to automate actions on the website, you can still use it to send thousands of requests within minutes.
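The five steps above can be sketched as one dependency-free loop. Here, fetchHtml, extractLinks, pageFunction, and matches are injected stand-ins for the real fetching, DOM construction, Page function, and Pseudo-URL matching:

```javascript
// Dependency-free sketch of the crawl loop: queue, fetch, extract, enqueue.
async function crawl(startUrls, { fetchHtml, extractLinks, pageFunction, matches }) {
  const queue = [...startUrls];                       // 1. seed the queue with StartURLs
  const seen = new Set(queue);
  const dataset = [];
  while (queue.length) {
    const url = queue.shift();
    const html = await fetchHtml(url);                // 2. fetch and build a DOM from the HTML
    dataset.push(await pageFunction({ url, html }));  // 3. execute the Page function
    for (const link of extractLinks(html)) {          // 4. find links via the Link selector
      if (matches(link) && !seen.has(link)) {         // 5. enqueue unvisited PseudoURL matches
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return dataset;
}
```

The seen set is what keeps the crawl from revisiting pages, so only genuinely new PseudoURL matches ever enter the queue.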