Crawlee + Puppeteer + Chrome

Example of a Puppeteer and headless Chrome web scraper. Headless browsers render JavaScript and are harder to block, but they're slower than plain HTTP.

src/main.ts

// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/).
import { Actor } from 'apify';
// Web scraping and browser automation library (Read more at https://crawlee.dev)
import { PuppeteerCrawler } from 'crawlee';

import { router } from './routes.js';

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init().
await Actor.init();

interface Input {
    startUrls: {
        url: string;
        method?: 'GET' | 'HEAD' | 'POST' | 'PUT' | 'DELETE' | 'TRACE' | 'OPTIONS' | 'CONNECT' | 'PATCH';
        headers?: Record<string, string>;
        userData: Record<string, unknown>;
    }[];
}
// Define the URLs to start the crawler with - get them from the input of the Actor or use a default list.
const { startUrls = ['https://apify.com'] } = (await Actor.getInput<Input>()) ?? {};

// Create a proxy configuration that will rotate proxies from Apify Proxy.
const proxyConfiguration = await Actor.createProxyConfiguration();

// Create a PuppeteerCrawler that will use the proxy configuration and handle requests with the router from the routes.ts file.
const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: router,
    launchContext: {
        launchOptions: {
            args: [
                '--disable-gpu', // Mitigates the "crashing GPU process" issue in Docker containers
                '--no-sandbox', // Mitigates the "sandboxed" process issue in Docker containers
            ],
        },
    },
});

// Run the crawler with the start URLs and wait for it to finish.
await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit().
await Actor.exit();

TypeScript PuppeteerCrawler Actor template

This template is a production-ready boilerplate for developing with PuppeteerCrawler. PuppeteerCrawler provides a simple framework for crawling web pages in parallel using headless Chrome with Puppeteer. Because it uses headless Chrome to download pages and extract data, it is useful for crawling websites that require JavaScript execution.

Included features

Puppeteer Crawler - simple framework for parallel crawling of web pages using headless Chrome with Puppeteer
Configurable Proxy - tool for working around IP blocking
Input schema - define and easily validate a schema for your Actor's input
Dataset - store structured data where each object stored has the same attributes
Apify SDK - toolkit for building Actors

How it works

Actor.getInput() gets the input from INPUT.json, where the start URLs are defined.
Actor.createProxyConfiguration() creates a configuration for the proxy servers used during crawling, which helps work around IP blocking. You can use Apify Proxy or your own proxy URLs, rotated according to the configuration. See the proxy configuration documentation for details.
new PuppeteerCrawler() creates an instance of Crawlee's Puppeteer Crawler. You can pass options to the crawler constructor, such as:
- proxyConfiguration - provide the proxy configuration to the crawler
- requestHandler - handle each request with the custom router defined in the routes.ts file
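
The input fallback from the first step can be sketched in isolation. This is a minimal sketch, not part of the template: resolveStartUrls, StartUrl, and DEFAULT_START_URLS are hypothetical names, the Input shape is simplified, and the function only mirrors the "?? {}" plus destructuring-default pattern from src/main.ts so that it can run without the Apify SDK.

```typescript
// Shape of one start URL entry, simplified from the Input interface in src/main.ts.
interface StartUrl {
    url: string;
    userData?: Record<string, unknown>;
}

// Simplified input: startUrls is optional here so the fallback can be shown.
interface Input {
    startUrls?: (StartUrl | string)[];
}

// Hypothetical constant, matching the default list used in src/main.ts.
const DEFAULT_START_URLS: (StartUrl | string)[] = ['https://apify.com'];

// Hypothetical helper: mirrors `(await Actor.getInput<Input>()) ?? {}` followed by
// a destructuring default. `input` is null when the Actor runs without any input.
function resolveStartUrls(input: Input | null): (StartUrl | string)[] {
    const safeInput: Input = input ?? {};
    const { startUrls = DEFAULT_START_URLS } = safeInput;
    return startUrls;
}

// No input at all: the default list is used.
console.log(resolveStartUrls(null));
// Input with start URLs: they are passed through unchanged.
console.log(resolveStartUrls({ startUrls: [{ url: 'https://crawlee.dev' }] }));
```

Note that the default applies only when startUrls is missing. An empty array in the input is passed through as-is, which would leave the crawler with nothing to do.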

Handle requests with the custom router from the routes.ts file. See the Crawlee documentation on custom routing for details.

Create a new router instance with createPuppeteerRouter() (a factory function, so it is called without new)
Define a default handler, which is called for all requests not handled by other handlers, with router.addDefaultHandler(() => { ... })

Define additional handlers - here you can add your own handling of the page

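import { createPuppeteerRouter, Dataset } from 'crawlee';

// The router that src/main.ts imports and passes as requestHandler
export const router = createPuppeteerRouter();

// Runs for every request labelled 'detail'
router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    // You can add your own page handling here
    log.info(`${title}`, { url: request.loadedUrl });

    // Store the result in the default dataset
    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});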
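
The dispatch rule described above, where labelled handlers take precedence and the default handler catches everything else, can be illustrated without a browser. The sketch below is a simplified stand-in, not Crawlee's implementation: MiniRouter, FakeRequest, and Handler are hypothetical names, and the only behaviour modelled is lookup by request label with a fallback to the default handler.

```typescript
// Hypothetical, simplified request: only the fields this sketch needs.
interface FakeRequest {
    url: string;
    label?: string;
}

// Hypothetical handler type: returns a string so the chosen route is easy to observe.
type Handler = (request: FakeRequest) => string;

// Hypothetical stand-in for a label-based router.
class MiniRouter {
    private readonly handlers = new Map<string, Handler>();
    private defaultHandler?: Handler;

    addHandler(label: string, handler: Handler): void {
        this.handlers.set(label, handler);
    }

    addDefaultHandler(handler: Handler): void {
        this.defaultHandler = handler;
    }

    // Pick the handler registered for the request's label, else the default one.
    route(request: FakeRequest): string {
        const labelled = request.label === undefined ? undefined : this.handlers.get(request.label);
        const handler = labelled ?? this.defaultHandler;
        if (!handler) {
            throw new Error(`No handler for label "${request.label}" and no default handler`);
        }
        return handler(request);
    }
}

const miniRouter = new MiniRouter();
miniRouter.addDefaultHandler((request) => `default:${request.url}`);
miniRouter.addHandler('detail', (request) => `detail:${request.url}`);

// Unlabelled request: falls through to the default handler.
console.log(miniRouter.route({ url: 'https://apify.com' }));
// Request labelled 'detail': reaches the 'detail' handler.
console.log(miniRouter.route({ url: 'https://apify.com/store', label: 'detail' }));
```

In Crawlee, a request's label is usually assigned when the request is enqueued, for example through the label option of enqueueLinks, so a default handler that enqueues links with the label 'detail' would send those pages to the handler shown earlier.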

await crawler.run(startUrls) starts the crawler and waits for it to finish.

Resources

If you're looking for examples or want to learn more, visit:

Crawlee + Apify Platform guide
Documentation and examples
Node.js tutorials in Academy
How to scale Puppeteer and Playwright
Video guide on getting data using Apify API
Integration with Make, GitHub, Zapier, Google Drive, and other apps
A short guide on how to build web scrapers using code templates

Related templates

Start with TypeScript

Scrape a single page from a provided URL with Axios and extract data from the page's HTML with Cheerio.

Crawlee + Cheerio

A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.

Crawlee + Playwright + Chrome

Web scraper example with Crawlee, Playwright and headless Chrome. Playwright is more modern, user-friendly and harder to block than Puppeteer.

Crawlee + Playwright + Camoufox

Web scraper example with Crawlee, Playwright and headless Camoufox. Camoufox is a custom stealthy fork of Firefox. Try this template if you're facing anti-scraping challenges.

Playwright + Chrome Test Runner

Example of using the Playwright Test project to run automated website tests in the cloud and display their results. Usable as an API.

Empty TypeScript project

Empty template with a basic Actor structure built on the Apify SDK, so you can easily add your own functionality.

Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.
