Apify's universal web scrapers
Looking for web scraping boilerplate code to save you development time? Our web scrapers provide a solid base, so you don't have to build your own scraping or automation tool from scratch. Just indicate which web pages to load and how to extract data from them. Configure and run your web scrapers manually in a user interface or programmatically via an API.
How do those scrapers work?
Start off with startURLs, follow page links with a Link selector or indicate Pseudo-URLs, and extract data into a dataset with a pageFunction. Use these web scraping boilerplates for faster development of your data extraction or web automation solutions.
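As a sketch, a hypothetical scraper input tying these pieces together might look like this (the field names follow the scraper UI, but the URLs and selectors below are made-up examples, not a real configuration):

```javascript
// Hypothetical scraper input: startUrls seed the crawl, the Link selector and
// Pseudo-URLs decide which links to follow, and pageFunction extracts the data.
const input = {
  startUrls: [{ url: 'https://example.com/products' }],
  linkSelector: 'a[href]',                                     // which links to consider
  pseudoUrls: [{ purl: 'https://example.com/products/[.*]' }], // which links to enqueue
  pageFunction: async function pageFunction(context) {
    const { $, request } = context; // $ is jQuery injected into the loaded page
    return {
      url: request.url,
      title: $('h1').first().text().trim(), // one dataset row per page
    };
  },
};
```

Each page the crawler visits runs the pageFunction once, and whatever object it returns is appended to the dataset.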
Let’s pick a boilerplate for you
A list of generic, universal scrapers suited for different libraries, browsers, and frameworks. If it's a dynamic page with JavaScript rendering, or you're building a browser automation tool, go for Web Scraper, Puppeteer Scraper, or Playwright Scraper. If all you want is to send an HTTP request and get HTML back, less resource-intensive scrapers like Cheerio, Vanilla JS, or JSDOM will cover your needs.
Puppeteer Scraper 👐
Top alternative to Apify Web Scraper. A full-browser solution with support for website login, recursive crawling, and batches of URLs in Chrome. Its pageFunction is executed in a Node.js context, allowing easy control of the browser. It will handle any React, Vue, or other front-end-heavy website.
BeautifulSoup Scraper 🍲
Python alternative to Cheerio Scraper made for web pages that do not require client-side JavaScript. Beautiful Soup is a Python library for the easy parsing of HTML and XML documents. Its powerful search functions let you search for elements based on tags, attributes, or CSS classes.
Explore web scraping boilerplates of both kinds: lightweight ones that work over plain HTTP requests and heavyweight ones that drive a full browser.
Tap into the Puppeteer and Playwright libraries. Run Chromium, Firefox, and WebKit (Safari's browser engine), handle lists and queues of URLs, use automatic website login, and manage concurrency for maximum performance.
Extract data at any scale with a few lines of code and powerful infrastructure on your side. Use fingerprints based on real-world data, no configuration necessary.
Rely on the Apify platform to simplify your web scraper development. Pick from the pool of proxies, create tasks, and schedule your scrapers. Bypass modern website anti-bot protection systems.
At a glance, it could seem like Web Scraper and Puppeteer Scraper are the same tool. In a way, they are since Web Scraper uses Puppeteer under the hood.
The difference is in the amount of control each gives you. Where Web Scraper's pageFunction only gives you access to in-browser JavaScript, Puppeteer Scraper's pageFunction is executed in a Node.js context, which practically turns the browser into your puppet; hence the naming. The Node.js context also makes Puppeteer Scraper much easier to use with external APIs, databases, or the Apify SDK.
So why would you choose one over the other? It depends on your priorities: is it power or ease of use? While Web Scraper is a simple, easy-to-start solution, Puppeteer Scraper is its more powerful older sibling, better suited for web scraping pros and complicated websites.
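To illustrate the difference, here is a hedged sketch of what a Puppeteer Scraper pageFunction might look like; unlike Web Scraper's, it runs in Node.js and receives the full Puppeteer page object (the selector below is a made-up example):

```javascript
// Hypothetical Puppeteer Scraper pageFunction: runs in Node.js, so it can
// drive the browser directly through the Puppeteer `page` object.
async function pageFunction(context) {
  const { page, request } = context;
  await page.waitForSelector('h1'); // wait out client-side rendering
  const title = await page.$eval('h1', el => el.textContent.trim());
  return { url: request.url, title }; // becomes one dataset row
}
```

Because the function runs in Node.js rather than inside the page, it can also call external APIs or databases between browser actions.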
You can read more on their differences in our Docs.
Playwright is just a modern take on Puppeteer, so is one any better than the other? One of the major differences between them is that Playwright offers cross-browser support and can drive Chromium, WebKit (Safari's browser engine), and Firefox, while Puppeteer only supports Chromium. Besides, since Puppeteer is a Node.js library, it is restricted to JavaScript, whereas Playwright goes beyond Node.js and can be used with other languages, such as Python or Java. The technology you choose will depend heavily on your specific use case, and differences in syntax matter a lot.
You can read more about the differences between Playwright and Puppeteer on our blog. Despite the different technologies powering them, our Puppeteer and Playwright scrapers achieve similar performance.
Although Selenium is a popular browser automation library, it is an older technology than Playwright, which may be a major reason why the former can appear slower, performance-wise. In addition, Playwright offers functionality that Selenium does not cover, such as its modern Test Generator feature.
When paired with a proxy, Puppeteer can be practically unstoppable, especially for data extraction. Normally, when using a proxy that requires authentication in a non-headless browser (specifically Chrome), you'll be asked to enter credentials into a popup dialog. There are 4 ways to do that:
- use the authenticate() method on the Puppeteer page object
- use the proxy-chain NPM package
- set it within ProxyConfigurationOptions in the Apify SDK
- set the Proxy-Authorization header
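As a minimal sketch of the first option, assuming a page already opened in a browser launched with a --proxy-server argument (the target URL below is a placeholder):

```javascript
// Option 1 sketched: answer the proxy's auth dialog with page.authenticate().
async function openThroughProxy(page, username, password) {
  await page.authenticate({ username, password }); // call before navigating
  return page.goto('https://example.com');         // placeholder target URL
}
```

The credentials are sent automatically whenever the proxy challenges a request, so no popup dialog ever appears.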
You can read a detailed guide on how to use each one of those methods on our blog.
Yes. Puppeteer is, by design, an automation library with an off-label use for web scraping, thanks to its ability to control browsers and render JavaScript. You can use Puppeteer Scraper to extract data from websites that use JS to load their content dynamically.
Playwright is a powerful automation library. Due to its ability to control Chromium (including Chromium-based browsers such as Microsoft Edge and Opera), Firefox, and WebKit while flawlessly rendering JavaScript, Playwright has become very popular in the web scraping community. You can use Playwright Scraper to extract data from the web, particularly from websites that use JavaScript to load their content dynamically.
You can build a scraper yourself by following these steps:
- Start a browser with Playwright
- Click buttons and wait for actions
- Extract data from a website
Or use our boilerplate, Playwright Scraper, and let it handle all the overhead for you. If you'd still rather build it yourself, follow this step-by-step guide on how to build a scraper using Playwright.
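The three steps above can be sketched against the Playwright API like this. Note that browserType is injected (e.g. chromium, firefox, or webkit from the playwright package) and the button selector is a made-up example:

```javascript
// Sketch of the three steps: start a browser, act on the page, extract data.
async function scrapeTitle(browserType, url) {
  const browser = await browserType.launch();     // 1. start a browser with Playwright
  const page = await browser.newPage();
  await page.goto(url);
  await page.click('#load-more').catch(() => {}); // 2. click a (hypothetical) button
  const title = await page.title();               // 3. extract data from the page
  await browser.close();
  return title;
}
```

Injecting the browser type is what gives Playwright its cross-browser flexibility: the same function can drive Chromium, Firefox, or WebKit unchanged.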
One way to do it is to opt for lightweight scrapers (Cheerio, Vanilla JS) instead of heavyweight, full-browser ones (Puppeteer, Playwright). Scraping the page with Cheerio Scraper alone would drastically reduce the time your scraper needs to finish the task. But for this option to work, you will need some prior knowledge, such as a good understanding of the web page you intend to scrape.
Your second option is API scraping. In this case, you don't even need to build a scraper: the task is to locate the website's API endpoints and collect the desired data directly from them, rather than parsing it out of rendered HTML pages. This route sometimes requires a bit of detective work, but compared to the other options, it will be the fastest-performing.
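As a hedged sketch of the API route (the endpoint URL and response shape below are invented; find the real ones in your browser DevTools' Network tab):

```javascript
// Hypothetical API scraping: call the site's own JSON endpoint directly
// instead of parsing rendered HTML. fetchImpl defaults to the global fetch.
async function fetchProducts(fetchImpl = fetch) {
  const res = await fetchImpl('https://example.com/api/products?page=1'); // invented endpoint
  const data = await res.json(); // the data arrives already structured
  return data.items.map(p => ({ name: p.name, price: p.price }));
}
```

Because the endpoint returns structured JSON, there is no HTML parsing at all, which is why this approach tends to be the fastest.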
Cheerio is a fast and flexible implementation of core jQuery designed to run server-side, working with raw HTML data. Unlike a web browser, it does not interpret the result: the library parses markup directly and provides an API for manipulating the resulting data structure. This makes Cheerio much faster than full-browser web automation solutions.
Our HTML scraping tool, Cheerio Scraper, runs purely on the Cheerio library. You can give it a try and see how you can extract data from 10K pages for a mere $9.
Cheerio is a lean implementation of core jQuery designed specifically for the server. jQuery is an open-source JavaScript library that simplifies DOM manipulation by allowing users to find, select and manipulate elements with specific properties, making it easier to navigate web applications.
Cheerio, on the other hand, brings that core jQuery API to the server for parsing HTML and XML in Node.js. This gives it a familiar syntax and a powerful API for traversing and manipulating the resulting data structures, making it an essential tool for web scraping.
So Cheerio and jQuery are not that different: Cheerio implements much of the core jQuery API. The practical difference is that jQuery runs in the browser and works with a live DOM, while Cheerio parses static markup on the server.
If you don't feel like learning jQuery for your scraping projects, take a look at Vanilla JS Scraper 🍦. It's a jQuery-free alternative that still uses plain HTTP requests and Node.js code to extract data from web pages.
No, it can't. Cheerio library is an HTML parser which means it is not capable of interpreting results as a web browser does and, therefore, can't render JavaScript. If you need to scrape dynamically loading pages, you can opt for Puppeteer or Playwright Scraper.
- Launch Cheerio Scraper.
- Set up StartURLs.
- Locate your data.
- Set up Page function.
- Run the scraper.
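For step 4, a hypothetical Cheerio Scraper Page function might look like the sketch below; $ is Cheerio loaded with the fetched page's HTML, and the .price selector is a made-up example:

```javascript
// Hypothetical Cheerio Scraper pageFunction: `$` queries the parsed HTML
// with plain CSS selectors, no browser involved.
async function pageFunction(context) {
  const { $, request } = context;
  const prices = [];
  $('.price').each((i, el) => prices.push($(el).text().trim())); // example selector
  return { url: request.url, prices }; // one dataset row per page
}
```

Since no browser is launched, each page costs only one HTTP request plus a parse, which is where Cheerio Scraper's speed comes from.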
We have a short guide on how to extract data from any website using Cheerio Scraper, so don't miss out on quite a few straightforward tips.
However, if the website you are scraping has content that must be loaded dynamically, we advise using Web Scraper instead.
jQuery takes some time to get the hang of. Here's how you can scrape a website without having to learn it:
- Add StartURLs to the queue.
- Construct a DOM from the fetched HTML string.
- Execute Page function.
- Find all links from the page using the Link selector.
- Add unvisited PseudoURLs to the queue.
For each of these steps, you can avoid using jQuery by using an alternative scraper, Vanilla JS. Since it's built on the Cheerio library, it's just as efficient as Cheerio Scraper. And while Vanilla JS Scraper cannot be used to automate actions on the website, you can still use it to send thousands of requests within minutes.
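The five steps above can be sketched as one dependency-free loop. Here, fetchHtml, extractLinks, pageFunction, and matches are injected stand-ins for the real fetching, DOM construction, Page function, and Pseudo-URL matching:

```javascript
// Dependency-free sketch of the crawl loop: queue, fetch, extract, enqueue.
async function crawl(startUrls, { fetchHtml, extractLinks, pageFunction, matches }) {
  const queue = [...startUrls];                       // 1. seed the queue with StartURLs
  const seen = new Set(queue);
  const dataset = [];
  while (queue.length) {
    const url = queue.shift();
    const html = await fetchHtml(url);                // 2. fetch and build a DOM from the HTML
    dataset.push(await pageFunction({ url, html }));  // 3. execute the Page function
    for (const link of extractLinks(html)) {          // 4. find links via the Link selector
      if (matches(link) && !seen.has(link)) {         // 5. enqueue unvisited PseudoURL matches
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return dataset;
}
```

The seen set is what keeps the crawl from revisiting pages, so only genuinely new PseudoURL matches ever enter the queue.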