Crawlee + Puppeteer + Chrome
Example of a Puppeteer and headless Chrome web scraper. Headless browsers render JavaScript and are harder to block, but they're slower than plain HTTP.
src/main.js
```js
// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/).
import { Actor } from 'apify';
// Web scraping and browser automation library (Read more at https://crawlee.dev)
import { PuppeteerCrawler } from 'crawlee';
// This is an ESM project, and as such, it requires you to specify extensions in your relative imports.
// Read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
import { router } from './routes.js';

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init().
await Actor.init();

// Define the URLs to start the crawler with - get them from the input of the Actor or use a default list.
const input = await Actor.getInput();
const startUrls = input?.startUrls || [{ url: 'https://apify.com' }];

// Create a proxy configuration that will rotate proxies from Apify Proxy.
const proxyConfiguration = await Actor.createProxyConfiguration();

// Create a PuppeteerCrawler that will use the proxy configuration and handle requests with the router from the routes.js file.
const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: router,
    launchContext: {
        launchOptions: {
            args: [
                '--disable-gpu', // Mitigates the "crashing GPU process" issue in Docker containers
            ],
        },
    },
});

// Run the crawler with the start URLs and wait for it to finish.
await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit().
await Actor.exit();
```
JavaScript PuppeteerCrawler Actor template

This template is a production-ready boilerplate for developing with `PuppeteerCrawler`. `PuppeteerCrawler` provides a simple framework for parallel crawling of web pages using headless Chrome with Puppeteer. Since it uses headless Chrome to download web pages and extract data, it is useful for crawling websites that require JavaScript execution.
Included features
- Puppeteer Crawler - simple framework for parallel crawling of web pages using headless Chrome with Puppeteer
- Configurable Proxy - tool for working around IP blocking
- Input schema - define and easily validate a schema for your Actor's input (a sketch follows this list)
- Dataset - store structured data where each object stored has the same attributes
- Apify SDK - toolkit for building Actors
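The input schema is typically defined in the `.actor/input_schema.json` file of the project; that file is not shown above. The following is a minimal sketch of such a schema, assuming only the `startUrls` field that `main.js` reads:

```json
{
    "title": "PuppeteerCrawler Actor input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start the crawl with.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://apify.com" }]
        }
    }
}
```

With a schema like this, the Apify platform validates the input and renders an editing UI for it, and `Actor.getInput()` returns an object of the same shape.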
How it works

- `Actor.getInput()` gets the input from `INPUT.json`, where the start URLs are defined.
- Create a configuration for proxy servers to be used during the crawling with `Actor.createProxyConfiguration()` to work around IP blocking. Use Apify Proxy or your own proxy URLs, provided and rotated according to the configuration (see the configuration sketch after this list). You can read more about proxy configuration here.
- Create an instance of Crawlee's Puppeteer Crawler with `new PuppeteerCrawler()`. You can pass options to the crawler constructor, such as:
    - `proxyConfiguration` - provide the proxy configuration to the crawler
    - `requestHandler` - handle each request with the custom router defined in the `routes.js` file
- Handle requests with the custom router from the `routes.js` file (a sketch of the whole file follows this list). Read more about custom routing for the Puppeteer Crawler here.
    - Create a new router instance with `createPuppeteerRouter()`.
    - Define a default handler that will be called for all URLs that are not handled by other handlers by adding `router.addDefaultHandler(() => { ... })`.
    - Define additional handlers - here you can add your own handling of the page:

    ```js
    router.addHandler('detail', async ({ request, page, log }) => {
        const title = await page.title();
        // You can add your own page handling here

        await Dataset.pushData({
            url: request.loadedUrl,
            title,
        });
    });
    ```

- `crawler.run(startUrls);` starts the crawler and waits for it to finish.
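Both `Actor.createProxyConfiguration()` and the `PuppeteerCrawler` constructor accept more options than `main.js` uses. The following is a minimal sketch of a few commonly used ones; the `RESIDENTIAL` proxy group, the `US` country code, and the limit of 100 requests are illustrative values, not defaults of this template:

```js
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

import { router } from './routes.js';

await Actor.init();

// Rotate proxies from a specific Apify Proxy group and country
// (both values are examples, not template defaults).
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: router,
    maxRequestsPerCrawl: 100, // Stop after 100 pages - handy while developing.
    launchContext: {
        launchOptions: {
            args: ['--disable-gpu'],
        },
    },
});

await crawler.run([{ url: 'https://apify.com' }]);

await Actor.exit();
```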
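The `routes.js` file itself is not shown above. The following is a sketch of what it typically contains in this template: a default handler that enqueues links for the `detail` handler, which then scrapes them. The `https://apify.com/*` glob is an assumption that mirrors the default start URL:

```js
import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

// Default handler: runs for every request without a label and enqueues detail pages.
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Enqueueing new URLs');
    await enqueueLinks({
        globs: ['https://apify.com/*'], // Assumed to match the default start URL.
        label: 'detail',
    });
});

// 'detail' handler: scrapes the page title and stores it in the default Dataset.
router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    log.info(`Scraped ${title}`, { url: request.loadedUrl });

    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});
```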
Resources
If you're looking for examples or want to learn more visit:
- Crawlee + Apify Platform guide
- Documentation and examples
- Node.js tutorials in Academy
- How to scale Puppeteer and Playwright
- Video guide on getting data using Apify API
- Integration with Make, GitHub, Zapier, Google Drive, and other apps
- A short guide on how to create Actors using code templates