One‑Page HTML Scraper with Cheerio

Scrape a single page at a provided URL with Axios and extract data from the page's HTML with Cheerio.

src/main.ts

// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/).
import { Actor, log } from 'apify';
// Axios - Promise based HTTP client for the browser and node.js (Read more at https://axios-http.com/docs/intro).
import axios from 'axios';
// Cheerio - The fast, flexible & elegant library for parsing and manipulating HTML and XML (Read more at https://cheerio.js.org/).
import * as cheerio from 'cheerio';

// This is an ESM project, so relative imports must specify file extensions.
// Read more about this here: https://nodejs.org/docs/latest-v18.x/api/esm.html#mandatory-file-extensions
// Note that we need to use `.js` even inside TS files.
// import { router } from './routes.js';

// The init() call configures the Actor to work correctly with the Apify-provided environment - mainly the storage infrastructure. Every Actor must perform an init() call.
await Actor.init();

interface Input {
    url: string;
}
// The structure of the input is defined in input_schema.json.
const input = await Actor.getInput<Input>();
if (!input) throw new Error('Input is missing!');
const { url } = input;

// Fetch the HTML content of the page.
const response = await axios.get(url);

// Parse the downloaded HTML with Cheerio to enable data extraction.
const $ = cheerio.load(response.data);

// Extract all headings from the page (tag name and text).
const headings: { level: string; text: string }[] = [];
$('h1, h2, h3, h4, h5, h6').each((_i, element) => {
    const headingObject = {
        level: $(element).prop('tagName')!.toLowerCase(),
        text: $(element).text(),
    };
    log.info('Extracted heading', headingObject);
    headings.push(headingObject);
});

// Save headings to Dataset - a table-like storage.
await Actor.pushData(headings);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit().
await Actor.exit();

Scrape single-page in TypeScript template

A template for scraping data from a single web page in TypeScript (Node.js). The URL of the web page is passed in via input, which is defined by the input schema. The template uses the Axios client to get the HTML of the page and the Cheerio library to parse the data from it. The data are then stored in a dataset where you can easily access them.

The scraped data in this template are page headings, but you can easily edit the code to scrape whatever you want from the page.
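
As one illustration of editing the extraction step, the sketch below collects links instead of headings. The helper name `extractLinks`, the `LinkRecord` shape, and the choice to resolve relative links against a base URL are assumptions made for this example and are not part of the template.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical record shape for this example; the template itself stores headings.
interface LinkRecord {
    text: string;
    url: string;
}

// Hypothetical helper: collects every <a href> in the HTML as an absolute URL.
function extractLinks(html: string, baseUrl: string): LinkRecord[] {
    const $ = cheerio.load(html);
    const links: LinkRecord[] = [];
    $('a[href]').each((_i, element) => {
        const href = $(element).attr('href');
        if (!href) return;
        try {
            links.push({
                text: $(element).text().trim(),
                // Resolve relative links such as "/docs" against the page URL.
                url: new URL(href, baseUrl).href,
            });
        } catch {
            // Skip hrefs that cannot be resolved into a valid URL.
        }
    });
    return links;
}
```

In the template, a function like this could take `response.data` and the input `url`, and its result could be passed to `Actor.pushData()` in place of `headings`.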

Included features

Apify SDK - a toolkit for building Actors
Input schema - define and easily validate a schema for your Actor's input
Dataset - store structured data where each object stored has the same attributes
Axios client - promise-based HTTP Client for Node.js and the browser
Cheerio - library for parsing and manipulating HTML and XML

How it works

Actor.getInput() gets the input where the page URL is defined
axios.get(url) fetches the page
cheerio.load(response.data) loads the page data and enables parsing the headings

The following call extracts the headings from the page. Edit it to extract whatever you need from the page:

$("h1, h2, h3, h4, h5, h6").each((_i, element) => {...});

Actor.pushData(headings) stores the headings in the dataset
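
The first step above only checks that some input exists. A minimal sketch of a stricter guard follows, assuming you want to reject missing, malformed, or non-HTTP URLs before calling `axios.get(url)`. The helper name `parseInputUrl` and its error messages are hypothetical and are not part of the template or the Apify SDK.

```typescript
interface Input {
    url: string;
}

// Hypothetical helper: narrows unknown Actor input to a validated http(s) URL.
function parseInputUrl(input: unknown): string {
    if (typeof input !== 'object' || input === null) {
        throw new Error('Input is missing!');
    }
    const { url } = input as Partial<Input>;
    if (typeof url !== 'string' || url.trim() === '') {
        throw new Error('Input field "url" must be a non-empty string.');
    }
    let parsed: URL;
    try {
        // The URL constructor throws a TypeError on strings it cannot parse.
        parsed = new URL(url.trim());
    } catch {
        throw new Error(`Input field "url" is not a valid URL: ${url}`);
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
        throw new Error(`Unsupported protocol "${parsed.protocol}"; expected http or https.`);
    }
    // Return the normalized form, for example with a trailing slash on bare origins.
    return parsed.href;
}
```

One way to use it would be `const url = parseInputUrl(await Actor.getInput());`, which replaces the manual `if (!input)` check. The input schema can also validate the field on the platform side, so a guard like this is mainly useful for local runs or as a second line of defense.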

Resources

Web scraping in Node.js with Axios and Cheerio
Web scraping with Cheerio in 2023
Video tutorial on building a scraper using CheerioCrawler
Written tutorial on building a scraper using CheerioCrawler
Integration with Zapier, Make, Google Drive, and others
Video guide on getting scraped data using Apify API
A short guide on how to build web scrapers using code templates:

Related templates

Crawlee + Cheerio

A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.


Crawlee + Puppeteer + Chrome

Example of a Puppeteer and headless Chrome web scraper. Headless browsers render JavaScript and are harder to block, but they're slower than plain HTTP.

Crawlee + Playwright + Chrome

Web scraper example with Crawlee, Playwright and headless Chrome. Playwright is more modern, user-friendly and harder to block than Puppeteer.

Crawlee + Playwright + Camoufox

Web scraper example with Crawlee, Playwright and headless Camoufox. Camoufox is a custom stealthy fork of Firefox. Try this template if you're facing anti-scraping challenges.

Playwright + Chrome Test Runner

Example of using the Playwright Test project to run automated website tests in the cloud and display their results. Usable as an API.

Empty TypeScript project

Empty template with basic structure for the Actor with Apify SDK that allows you to easily add your own functionality.

Already have a solution in mind?

Sign up for a free Apify account and deploy your code to the platform in just a few minutes! If you want a head start without coding it yourself, browse our Store of existing solutions.
