Crawlee + Cheerio

A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.

src/main.ts

// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/)
import { Actor } from 'apify';
// Crawlee - web scraping and browser automation library (Read more at https://crawlee.dev)
import { CheerioCrawler, Dataset } from 'crawlee';

// this is ESM project, and as such, it requires you to specify extensions in your relative imports
// read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
// note that we need to use `.js` even when inside TS files
// import { router } from './routes.js';

interface Input {
    startUrls: {
        url: string;
        method?: 'GET' | 'HEAD' | 'POST' | 'PUT' | 'DELETE' | 'TRACE' | 'OPTIONS' | 'CONNECT' | 'PATCH';
        headers?: Record<string, string>;
        userData: Record<string, unknown>;
    }[];
    maxRequestsPerCrawl: number;
}

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init()
await Actor.init();

// Structure of input is defined in input_schema.json
const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } =
    (await Actor.getInput<Input>()) ?? ({} as Input);

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    requestHandler: async ({ enqueueLinks, request, $, log }) => {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract title from the page.
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });

        // Save url and title to Dataset - a table-like storage.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit()
await Actor.exit();

TypeScript Crawlee & CheerioCrawler template

A template example built with Crawlee to scrape data from a website using Cheerio wrapped into CheerioCrawler.

Included features

Apify SDK - toolkit for building Actors
Crawlee - web scraping and browser automation library
Input schema - define and easily validate a schema for your Actor's input
Dataset - store structured data where each object stored has the same attributes
Cheerio - a fast, flexible & elegant library for parsing and manipulating HTML and XML

How it works

This TypeScript script uses Crawlee's CheerioCrawler to crawl a website and extract data from each crawled page with Cheerio. It then stores the page titles and URLs in a dataset.

The crawler starts with the URLs provided in the startUrls input field, which is defined by the input schema. The number of scraped pages is limited by the maxRequestsPerCrawl field from the input schema.
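
As a rough illustration of how those two input fields could be resolved, the sketch below applies the same fallback values that src/main.ts uses (https://apify.com and 100). The resolveInput helper and the simplified CrawlInput type are hypothetical names introduced here, not part of the Apify SDK, and the default start URL is written in object form as an assumption.

```typescript
// Simplified, hypothetical input shape: only the two fields discussed above.
interface StartUrl {
    url: string;
}

interface CrawlInput {
    startUrls?: StartUrl[];
    maxRequestsPerCrawl?: number;
}

// Hypothetical helper (not an Apify SDK function): fills in defaults
// when the Actor input is missing or leaves a field out.
function resolveInput(input: CrawlInput | null): Required<CrawlInput> {
    const {
        startUrls = [{ url: 'https://apify.com' }],
        maxRequestsPerCrawl = 100,
    } = input ?? {};
    return { startUrls, maxRequestsPerCrawl };
}

// Usage: an input that only overrides the page limit keeps the default start URL.
const resolved = resolveInput({ maxRequestsPerCrawl: 5 });
console.log(resolved.startUrls.map((s) => s.url), resolved.maxRequestsPerCrawl);
```

In the template itself, the same destructuring happens directly on the result of Actor.getInput().
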
The crawler calls requestHandler for each URL to enqueue the links found on the page, extract data from the page with the Cheerio library, and save the title and URL of each page to the dataset. It also logs each result as it is saved.
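
The flow described above can be pictured with a small, dependency-free sketch. It is not how Crawlee is implemented: a hypothetical in-memory map stands in for HTTP requests and Cheerio parsing, a plain array stands in for the dataset, and crawlSketch, Page and Result are names made up for this illustration. It only shows the general idea of a request queue, link enqueueing without duplicates, and a cap on handled requests.

```typescript
// Hypothetical in-memory "website": URL -> page title and outgoing links.
// This replaces real HTTP requests and Cheerio parsing in this sketch.
interface Page {
    title: string;
    links: string[];
}

interface Result {
    url: string;
    title: string;
}

function crawlSketch(site: Map<string, Page>, startUrls: string[], maxRequestsPerCrawl: number): Result[] {
    const queue = [...startUrls]; // pending URLs, processed in FIFO order
    const seen = new Set<string>(startUrls); // URLs already enqueued, to skip duplicates
    const dataset: Result[] = []; // stands in for Dataset.pushData()

    while (queue.length > 0 && dataset.length < maxRequestsPerCrawl) {
        const url = queue.shift()!;
        const page = site.get(url);
        if (!page) continue; // unknown URL: simply skipped in this sketch

        // Rough equivalent of enqueueLinks(): queue every link not seen before.
        for (const link of page.links) {
            if (!seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }

        // Rough equivalent of reading the title and saving one row per page.
        dataset.push({ url, title: page.title });
    }
    return dataset;
}

// Usage with a three-page demo site.
const demoSite = new Map<string, Page>([
    ['https://example.com/', { title: 'Home', links: ['https://example.com/a', 'https://example.com/b'] }],
    ['https://example.com/a', { title: 'A', links: ['https://example.com/'] }],
    ['https://example.com/b', { title: 'B', links: [] }],
]);
console.log(crawlSketch(demoSite, ['https://example.com/'], 2));
```

The real crawler additionally handles retries, proxies, concurrency, and link filtering, none of which is modeled here.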

Resources

Video tutorial on building a scraper using CheerioCrawler
Written tutorial on building a scraper using CheerioCrawler
Web scraping with Cheerio in 2023
How to scrape a dynamic page using Cheerio
TypeScript vs. JavaScript: which to use for web scraping?
Integration with Zapier, Make, Google Drive and others
Video guide on getting scraped data using Apify API
A short guide on how to build web scrapers using code templates

Related templates

Start with TypeScript

Scrape a single page at a provided URL with Axios and extract data from the page's HTML with Cheerio.

Starter

Crawlee + Puppeteer + Chrome

Example of a Puppeteer and headless Chrome web scraper. Headless browsers render JavaScript and are harder to block, but they're slower than plain HTTP.

Crawlee + Playwright + Chrome

Web scraper example with Crawlee, Playwright and headless Chrome. Playwright is more modern, user-friendly and harder to block than Puppeteer.

Crawlee + Playwright + Camoufox

Web scraper example with Crawlee, Playwright and headless Camoufox. Camoufox is a custom stealthy fork of Firefox. Try this template if you're facing anti-scraping challenges.

Playwright + Chrome Test Runner

Example of using the Playwright Test project to run automated website tests in the cloud and display their results. Usable as an API.

Empty TypeScript project

An empty template with the basic structure of an Actor built with the Apify SDK, ready for you to add your own functionality.

Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.
