
Crawlee + Cheerio

A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.

Language

javascript

Tools

nodejs

crawlee

cheerio

Use cases

Web scraping

src/main.js

// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/)
import { Actor } from 'apify';
// Crawlee - web scraping and browser automation library (Read more at https://crawlee.dev)
import { CheerioCrawler, Dataset } from 'crawlee';
// This is an ESM project, and as such, it requires you to specify extensions in your relative imports.
// Read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
// import { router } from './routes.js';

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init().
await Actor.init();

// The structure of the input is defined in input_schema.json.
const {
    startUrls = ['https://crawlee.dev'],
    maxRequestsPerCrawl = 100,
} = await Actor.getInput() ?? {};

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    async requestHandler({ enqueueLinks, request, $, log }) {
        log.info('enqueueing new URLs');
        await enqueueLinks();

        // Extract the title from the page.
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });

        // Save the URL and title to the Dataset - a table-like storage.
        await Dataset.pushData({ url: request.loadedUrl, title });
    },
});

await crawler.run(startUrls);

// Gracefully exit the Actor process. It's recommended to end every Actor with an exit().
await Actor.exit();

JavaScript Crawlee & CheerioCrawler template

This example template was built with Crawlee and scrapes data from a website using Cheerio, wrapped in a CheerioCrawler.

Included features

  • Apify SDK - toolkit for building Actors
  • Crawlee - web scraping and browser automation library
  • Input schema - define and easily validate a schema for your Actor's input (see the schema sketch after this list)
  • Dataset - store structured data where each object stored has the same attributes
  • Cheerio - a fast, flexible & elegant library for parsing and manipulating HTML and XML
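
As a rough illustration of the Input schema feature, the template's input_schema.json could look something like the sketch below. The property names mirror the defaults destructured in src/main.js (startUrls and maxRequestsPerCrawl); the titles, descriptions, editor choice, and prefill values are assumptions for illustration and may differ from the schema actually shipped with the template.

{
    "title": "CheerioCrawler Template",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs where the crawler begins.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://crawlee.dev" }]
        },
        "maxRequestsPerCrawl": {
            "title": "Max requests per crawl",
            "type": "integer",
            "description": "Upper bound on the number of pages the crawler will open.",
            "default": 100
        }
    }
}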

How it works

This JavaScript script uses Cheerio to scrape data from a website and stores the page titles in a dataset.

  • The crawler starts with the URLs provided in the startUrls input field, defined by the input schema. The number of scraped pages is limited by the maxRequestsPerCrawl field from the input schema.
  • For each URL, the crawler runs requestHandler to extract data from the page with the Cheerio library and to save the title and URL of each page to the dataset. It also logs each result as it is saved. A variation on this handler is sketched after this list.
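
If you want to keep the crawl on a single domain and extract a little more data per page, requestHandler can be adjusted along the lines below. This is a minimal, hypothetical sketch rather than part of the template: the same-domain strategy and the h1/meta-description selectors are illustrative choices, and the snippet is meant as a drop-in replacement for the crawler definition in src/main.js (it reuses the imports and proxyConfiguration from that file).

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl,
    async requestHandler({ enqueueLinks, request, $, log }) {
        // Only follow links that stay on the same domain as the current page.
        await enqueueLinks({ strategy: 'same-domain' });

        // Extract a few illustrative fields with Cheerio selectors.
        const title = $('title').text();
        const h1 = $('h1').first().text().trim();
        const description = $('meta[name="description"]').attr('content') ?? null;

        log.info(`Scraped ${title}`, { url: request.loadedUrl });

        // Store one structured record per page in the default Dataset.
        await Dataset.pushData({ url: request.loadedUrl, title, h1, description });
    },
});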


Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.