Cheerio Scraper
Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.
Cheerio Scraper is a ready-made solution for crawling websites using plain HTTP requests. It retrieves the HTML pages, parses them using the Cheerio Node.js library and lets you extract any data from them. Fast.
Cheerio is a server-side version of the popular jQuery library. It does not require a browser but instead constructs a DOM from an HTML string. It then provides the user an API to work with that DOM.
Cheerio Scraper is ideal for scraping web pages that do not rely on client-side JavaScript to serve their content and can be up to 20 times faster than using a full-browser solution such as Puppeteer.
If you're unfamiliar with web scraping or web development in general, you might prefer to start with the Scraping with Web Scraper tutorial from the Apify documentation and then continue with Scraping with Cheerio Scraper, a tutorial that walks you through all the steps and provides a number of examples.
Cost of usage
You can find the average usage cost for this actor on the pricing page under the "Which plan do I need?" section. Cheerio Scraper is equivalent to "Simple HTML pages", while Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to "Full web pages". These cost estimates are based on averages and might be lower or higher depending on how heavy the pages you scrape are.
Usage
To get started with Cheerio Scraper, you only need two things. First, tell the scraper which web pages it should load. Second, tell it how to extract data from each page.
The scraper starts by loading the pages specified in the Start URLs field. You can make the scraper follow page links on the fly by setting a Link selector, Glob Patterns and/or Pseudo-URLs to tell the scraper which links it should add to the crawling queue. This is useful for the recursive crawling of entire websites, e.g. to find all products in an online store.
To tell the scraper how to extract data from web pages, you need to provide a Page function. This is JavaScript code that is executed for every web page loaded. Since the scraper does not use a full web browser, writing the Page function is equivalent to writing server-side Node.js code that uses the server-side library Cheerio.
In summary, Cheerio Scraper works as follows:
1. Adds each Start URL to the crawling queue.
2. Fetches the first URL from the queue and constructs a DOM from the fetched HTML string.
3. Executes the Page function on the loaded page and saves its results.
4. Optionally, finds all links from the page using the Link selector. If a link matches any of the Glob Patterns and/or Pseudo-URLs and has not yet been visited, adds it to the queue.
5. If there are more items in the queue, repeats step 2, otherwise finishes.
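The loop described in these steps can be sketched in plain JavaScript. This is a minimal in-memory illustration, not the scraper's actual implementation: the `pages` map stands in for HTTP fetching, and `extractLinks`, `matchesPattern` and `crawl` are hypothetical names introduced only for this sketch.

```javascript
// Hypothetical in-memory "website": URL -> HTML string (stands in for HTTP requests).
const pages = {
    'https://example.com/': '<a href="https://example.com/a">A</a><a href="https://other.com/">X</a>',
    'https://example.com/a': '<a href="https://example.com/">Home</a><a href="https://example.com/b">B</a>',
    'https://example.com/b': '<p>No links here</p>',
};

// Naive stand-in for the Link selector: collect every href value.
function extractLinks(html) {
    return [...html.matchAll(/href="([^"]+)"/g)].map((match) => match[1]);
}

// Stand-in for Glob Patterns / Pseudo-URLs: only follow links on example.com.
function matchesPattern(url) {
    return url.startsWith('https://example.com/');
}

function crawl(startUrls, pageFunction) {
    const queue = [...startUrls]; // Step 1: add each Start URL to the queue.
    const seen = new Set(startUrls);
    const results = [];
    while (queue.length > 0) { // Step 5: repeat while the queue is not empty.
        const url = queue.shift(); // Step 2: take the first URL and load its HTML.
        const html = pages[url] ?? '';
        results.push(pageFunction({ url, html })); // Step 3: run the Page function.
        for (const link of extractLinks(html)) { // Step 4: enqueue matching, unseen links.
            if (matchesPattern(link) && !seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return results;
}

const visited = crawl(['https://example.com/'], ({ url }) => url);
console.log(visited);
```

Each URL is visited once, in the order it was discovered, and the link to `other.com` is never followed because it does not match the pattern.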
Cheerio Scraper has a number of advanced configuration settings to improve performance, set cookies for login to websites, limit the number of records, etc. See their tooltips for more information.
Under the hood, Cheerio Scraper is built using the `CheerioCrawler` class from Crawlee. If you'd like to learn more about the inner workings of the scraper, see the respective documentation.
Content types
By default, Cheerio Scraper only processes web pages with the `text/html`, `application/json`, `application/xml` and `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header), and skips pages with other content types. If you want the crawler to process other content types, use the Additional MIME types (`additionalMimeTypes`) input option.
Note that while the default `Accept` HTTP header will allow any content type to be received, HTML and XML are preferred over JSON and other types. Thus, if you're allowing additional MIME types, and you're still receiving invalid responses, be sure to override the `Accept` HTTP header setting in the requests from the scraper, either in Start URLs, Pseudo URLs or in the Prepare request function.
Web pages with different content types are parsed differently, and thus the `context` parameter of the Page function will have different values:
Content types | context.body | context.$ | context.json |
---|---|---|---|
text/html, application/xhtml+xml, application/xml | String | Function | null |
application/json | String | null | Object |
Other | Buffer | null | null |
The `Content-Type` HTTP header of the web page is parsed using the content-type NPM package and the result is stored in the `context.contentType` object.
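As a rough sketch of how a Page function might handle the three rows of the table, the example below branches on which `context` properties are set. The returned field names (`kind`, `keys`, `title`, `bytes`) are illustrative choices for this sketch, not part of the scraper's output format.

```javascript
async function pageFunction(context) {
    const { $, json, body, contentType, request } = context;

    if (json) {
        // application/json: the parsed object is available in context.json.
        return { url: request.url, kind: 'json', keys: Object.keys(json) };
    }
    if ($) {
        // HTML or XML: context.$ is the Cheerio function.
        return { url: request.url, kind: 'markup', title: $('title').first().text() };
    }
    // Any other content type: context.body is a Buffer.
    return { url: request.url, kind: contentType.type, bytes: body.length };
}
```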
Limitations
The actor does not employ a full-featured web browser such as Chromium or Firefox, so it will not be sufficient for web pages that render their content dynamically using client-side JavaScript. To scrape such sites, you might prefer to use Web Scraper (`apify/web-scraper`), which loads pages in a full browser and renders dynamic content.
Since Cheerio Scraper's Page function is executed in the context of the server, it only supports server-side code running in Node.js. If you need to combine client- and server-side libraries in Chromium using the Puppeteer library, you might prefer to use Puppeteer Scraper (`apify/puppeteer-scraper`). If you prefer Firefox and/or Playwright, check out Playwright Scraper (`apify/playwright-scraper`). For even more flexibility and control, you might develop a new actor from scratch in Node.js using Apify SDK and Crawlee.
In the Page function and Prepare request function, you can only use NPM modules that are already installed in this actor. If you require other modules for your scraping, you'll need to develop a completely new actor. You can use the `CheerioCrawler` class from Crawlee to get most of the functionality of Cheerio Scraper out of the box.
Input configuration
As input, the Cheerio Scraper actor accepts a number of configuration options. These can be entered either manually in the user interface in Apify Console, or programmatically in a JSON object using the Apify API. For a complete list of input fields and their types, please visit the Input tab.
Start URLs
The Start URLs (`startUrls`) field represents the initial list of pages that the scraper will visit. You can either enter the URLs manually one by one, upload them in a CSV file, or link URLs from a Google Sheet document. Each URL must start with either the `http://` or the `https://` protocol prefix.
The scraper supports adding new URLs to scrape on the fly, either using the Link selector and Glob Patterns/Pseudo-URLs options or by calling `context.enqueueRequest()` inside the Page function.
Optionally, each URL can be associated with custom user data, a JSON object that can be referenced from your JavaScript code in the Page function under `context.request.userData`. This is useful for determining which start URL is currently loaded, in order to perform some page-specific actions. For example, when crawling an online store, you might want to perform different actions on a page listing the products vs. a product detail page. For details, see the Web scraping tutorial in the Apify documentation.
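To illustrate the user data mechanism, here is a minimal sketch of a Start URLs list with attached user data and a Page function that branches on it. The `pageType` key, the example URLs and the `h1` selector are assumptions made for this sketch; any JSON-compatible structure can be used.

```javascript
// Example "startUrls" input; the "pageType" key is an arbitrary name chosen for this sketch.
const startUrls = [
    { url: 'https://www.example.com/products', userData: { pageType: 'listing' } },
    { url: 'https://www.example.com/products/123', userData: { pageType: 'detail' } },
];

async function pageFunction(context) {
    const { $, request } = context;
    const { pageType } = request.userData;

    if (pageType === 'detail') {
        // Product detail page: extract data (the selector is a placeholder).
        return { url: request.url, name: $('h1').first().text().trim() };
    }
    // Listing page (or anything else): store nothing for this page.
    return null;
}
```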
Link selector
The Link selector (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the Link selector. It checks that the target URL matches one of the Glob Patterns/Pseudo-URLs, and if so, adds the URL to the request queue, to be loaded by the scraper later.
By default, new scrapers are created with the following selector that matches all links:
a[href]
If the Link selector is empty, page links are ignored, and the scraper only loads pages that were specified in the Start URLs input or that were manually added to the request queue by calling `context.enqueueRequest()` in the Page function.
Glob Patterns
The Glob Patterns (`globs`) field specifies which types of URLs found by the Link selector should be added to the request queue. A glob pattern is simply a string with wildcard characters. For example, the glob pattern `http://www.example.com/pages/**/*` will match all the following URLs:
http://www.example.com/pages/deeper-level/page
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/something
Note that you don't need to use the Glob Patterns setting at all, because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the Page function.
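To build some intuition for how the wildcards behave, the following sketch converts a glob into a regular expression. It is a simplified approximation that assumes only the `*` and `**` wildcards; `globToRegExp` is a hypothetical helper written for this example, and the scraper's real glob matching supports more syntax than this.

```javascript
// Simplified glob-to-RegExp conversion supporting only "*" and "**".
function globToRegExp(glob) {
    let pattern = '';
    for (let i = 0; i < glob.length; i++) {
        if (glob.startsWith('**/', i)) {
            pattern += '(?:.*/)?'; // "**/" matches zero or more whole path segments
            i += 2;
        } else if (glob.startsWith('**', i)) {
            pattern += '.*'; // "**" matches anything, including "/"
            i += 1;
        } else if (glob[i] === '*') {
            pattern += '[^/]*'; // "*" matches within a single path segment
        } else {
            pattern += glob[i].replace(/[.+?^${}()|[\]\\]/g, '\\$&'); // escape literal characters
        }
    }
    return new RegExp(`^${pattern}$`);
}

const pagesGlob = globToRegExp('http://www.example.com/pages/**/*');
const candidateUrls = [
    'http://www.example.com/pages/deeper-level/page',
    'http://www.example.com/pages/my-awesome-page',
    'http://www.example.com/other/page',
];
const globMatches = candidateUrls.filter((url) => pagesGlob.test(url));
console.log(globMatches);
```

Under this approximation the first two URLs match and the third is rejected because it is outside `/pages/`.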
Pseudo-URLs
The Pseudo-URLs (`pseudoUrls`) field specifies which types of URLs found by the Link selector should be added to the request queue. A pseudo-URL is simply a URL with special directives enclosed in `[]` brackets. Currently, the only supported directive is `[regexp]`, which defines a JavaScript-style regular expression to match against the URL. For example, the pseudo-URL `http://www.example.com/pages/[(\w|-)*]` will match all the following URLs:
http://www.example.com/pages/
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/something
If either `[` or `]` is part of the normal query string, the symbol must be encoded as `[\x5B]` or `[\x5D]`, respectively. For example, the following pseudo-URL:
http://www.example.com/search?do[\x5B]load[\x5D]=1
will match the URL:
http://www.example.com/search?do[load]=1
Optionally, each pseudo-URL can be associated with user data that can be referenced from your Page function using `context.request.label` to determine which kind of page is currently loaded.
Note that you don't need to use the Pseudo-URLs setting at all, because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the Page function.
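The bracket syntax can be pictured as a URL in which everything outside the brackets is literal text and everything inside is a regular expression fragment. The sketch below follows that reading; `pseudoUrlToRegExp` is a hypothetical helper for illustration and is not the scraper's own parser. `String.raw` is used so that backslashes in the examples reach the function unchanged.

```javascript
// Converts a pseudo-URL into a RegExp: text inside [...] is used as a regular
// expression, everything else is matched literally.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '';
    let directive = ''; // text collected inside the outermost brackets
    let depth = 0; // nesting level of "[" brackets
    for (const char of pseudoUrl) {
        if (char === '[') {
            depth += 1;
            if (depth === 1) continue; // outermost "[" opens a directive
        } else if (char === ']' && depth > 0) {
            depth -= 1;
            if (depth === 0) { // outermost "]" closes the directive
                pattern += directive;
                directive = '';
                continue;
            }
        }
        if (depth > 0) directive += char; // regex text, copied verbatim
        else pattern += char.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // literal text, escaped
    }
    return new RegExp(`^${pattern}$`);
}

const pagesPattern = pseudoUrlToRegExp(String.raw`http://www.example.com/pages/[(\w|-)*]`);
console.log(pagesPattern.test('http://www.example.com/pages/my-awesome-page'));

const searchPattern = pseudoUrlToRegExp(String.raw`http://www.example.com/search?do[\x5B]load[\x5D]=1`);
console.log(searchPattern.test('http://www.example.com/search?do[load]=1'));
```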
Page function
The Page function (`pageFunction`) field contains a single JavaScript function that enables the user to extract data from the web page, access its DOM, add new URLs to the request queue, and otherwise control Cheerio Scraper's operation.
Example:
```javascript
async function pageFunction(context) {
    const { $, request, log } = context;

    // The "$" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();

    // The "request" property contains various information about the web page loaded.
    const url = request.url;

    // Use "log" object to print information to actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url,
        pageTitle
    };
}
```
The code runs in Node.js 16 and the function accepts a single argument, the `context` object, whose properties are listed below.
The return value of the page function is an object (or an array of objects) representing the data extracted from the web page. The return value must be stringify-able to JSON, i.e. it can only contain basic types and no circular references. If you prefer not to extract any data from the page and skip it in the clean results, simply return `null` or `undefined`.
The Page function supports the JavaScript ES6 syntax and is asynchronous, which means you can use the `await` keyword to wait for background operations to finish. To learn more about `async` functions, visit the Mozilla documentation.
Properties of the `context` object:
- `$`: Function

  A reference to Cheerio's function representing the root scope of the DOM of the current HTML page.

  This function is the starting point for traversing the DOM document and extracting data from it. Like with jQuery, it is the primary method for selecting elements in the document, but unlike jQuery it is built on top of the `css-select` library, which implements most of the `Sizzle` selectors.

  For more information, see the Cheerio documentation.

  Example:

  ```html
  <ul id="movies">
    <li class="fun-movie">Fun Movie</li>
    <li class="sad-movie">Sad Movie</li>
    <li class="horror-movie">Horror Movie</li>
  </ul>
  ```

  ```javascript
  // The second argument is the context in which the selector is searched.
  $('.fun-movie', '#movies').text();
  //=> Fun Movie
  $('ul .sad-movie').attr('class');
  //=> sad-movie
  $('li[class=horror-movie]').html();
  //=> Horror Movie
  ```
- `Actor`: Object

  A reference to the Actor object from Apify SDK. This is equivalent to:

  ```javascript
  import { Actor } from 'apify';
  ```

- `Apify`: Object

  A reference to the Actor object from Apify SDK. Included for backward compatibility.
- `crawler`: Object

  A reference to the `CheerioCrawler` object, see Crawlee docs for more information.

- `body`: String|Buffer

  The body from the target web page. If the web page is in HTML or XML format, the `body` will be a string that contains the HTML or XML content. In other cases, the `body` will be a Buffer. If you need to process the `body` as a string, you can use the information from the `contentType` property to convert the binary data into a string.

  Example:

  ```javascript
  const stringBody = context.body.toString(context.contentType.encoding)
  ```
- `cheerio`: Object

  Reference to the `Cheerio` module. Being the server-side version of the jQuery library, Cheerio features a very similar API with nearly identical selector implementation. This means DOM traversing, manipulation, querying, and data extraction are just as easy as with jQuery.

  This is equivalent to:

  ```javascript
  import * as cheerio from 'cheerio';
  ```
- `contentType`: Object

  The `Content-Type` HTTP header parsed into an object with 2 properties, `type` and `encoding`.

  Example:

  ```javascript
  // Content-Type: application/json; charset=utf-8
  const mimeType = contentType.type; // "application/json"
  const encoding = contentType.encoding; // "utf-8"
  ```
- `customData`: Object

  Contains the object provided in the Custom data (`customData`) input field. This is useful for passing dynamic parameters to your Cheerio Scraper using API.

- `enqueueRequest(request, [options])`: AsyncFunction

  Adds a new URL to the request queue, if it wasn't already there.

  The `request` parameter is an object containing details of the request, with properties such as `url`, `userData`, `headers` etc. For the full list of the supported properties, see the `Request` object's constructor in Crawlee's documentation.

  The optional `options` parameter is an object with additional options. Currently, it only supports the `forefront` boolean flag. If `true`, the request is added to the beginning of the queue. By default, requests are added to the end.

  Example:

  ```javascript
  await context.enqueueRequest({ url: 'https://www.example.com' });
  await context.enqueueRequest({ url: 'https://www.example.com/first' }, { forefront: true });
  ```
- `env`: Object

  A map of all relevant values set by the Apify platform to the actor run via the `APIFY_` environment variables. For example, here you can find information such as actor run ID, timeouts, actor run memory, etc. For the full list of available values, see the `Actor.getEnv()` function in the Apify SDK documentation.

  Example:

  ```javascript
  console.log(`Actor run ID: ${context.env.actorRunId}`);
  ```
- `getValue(key)`: AsyncFunction

  Gets a value from the default key-value store associated with the actor run. The key-value store is useful for persisting named data records, such as state objects, files, etc. The function is very similar to the `Actor.getValue()` function in Apify SDK.

  To set the value, use the dual function `context.setValue(key, value)`.

  Example:

  ```javascript
  const value = await context.getValue('my-key');
  console.dir(value);
  ```
- `globalStore`: Object

  Represents an in-memory store that can be used to share data across page function invocations, e.g. state variables, API responses, or other data. The `globalStore` object has an interface similar to JavaScript's `Map` object, with a few important differences:

  - All `globalStore` functions are `async`; use `await` when calling them.
  - Keys must be strings and values must be JSON stringify-able.
  - The `forEach()` function is not supported.

  Note that stored data is not persisted. If the actor run is restarted or migrated to another worker server, the content of `globalStore` is reset. Therefore, never depend on a specific value to be present in the store.

  Example:

  ```javascript
  let movies = await context.globalStore.get('cached-movies');
  if (!movies) {
      // Parse the response body so that the stored value is JSON stringify-able.
      const response = await fetch('http://example.com/movies.json');
      movies = await response.json();
      await context.globalStore.set('cached-movies', movies);
  }
  console.dir(movies);
  ```
- `input`: Object

  An object containing the actor run input, i.e. Cheerio Scraper's configuration. Each page function invocation gets a fresh copy of the `input` object, so changing its properties has no effect.

- `json`: Object

  The parsed object from a JSON string if the response contains the content type `application/json`.

- `log`: Object

  An object containing logging functions, with the same interface as provided by the `crawlee.utils.log` object in Crawlee. The log messages are written directly to the actor run log, which is useful for monitoring and debugging. Note that `log.debug()` only logs messages if the Debug log input setting is set.

  Example:

  ```javascript
  const log = context.log;
  log.debug('Debug message', { hello: 'world!' });
  log.info('Information message', { all: 'good' });
  log.warning('Warning message');
  log.error('Error message', { details: 'This is bad!' });
  try {
      throw new Error('Not good!');
  } catch (e) {
      log.exception(e, 'Exception occurred', { details: 'This is really bad!' });
  }
  ```
- `saveSnapshot()`: AsyncFunction

  Saves the full HTML of the current page to the key-value store associated with the actor run, under the `SNAPSHOT-BODY` key. This feature is useful when debugging your scraper.

  Note that each snapshot overwrites the previous one and the `saveSnapshot()` calls are throttled to at most one call in two seconds, in order to avoid excess consumption of resources and slowdown of the actor.

- `setValue(key, data, options)`: AsyncFunction

  Sets a value to the default key-value store associated with the actor run. The key-value store is useful for persisting named data records, such as state objects, files, etc. The function is very similar to the `Actor.setValue()` function in Apify SDK.

  To get the value, use the dual function `context.getValue(key)`.

  Example:

  ```javascript
  await context.setValue('my-key', { hello: 'world' });
  ```
- `skipLinks()`: AsyncFunction

  Calling this function ensures that page links from the current page will not be added to the request queue, even if they match the Link selector and/or Glob Patterns/Pseudo-URLs settings. This is useful to programmatically stop recursive crawling, e.g. if you know there are no more interesting links on the current page to follow.
- `request`: Object

  An object containing information about the currently loaded web page, such as the URL, number of retries, a unique key, etc. Its properties are equivalent to the `Request` object in Crawlee.

- `response`: Object

  An object containing information about the HTTP response from the web server. Currently, it only contains the `status` and `headers` properties. For example:

  ```javascript
  {
      // HTTP status code
      status: 200,

      // HTTP headers
      headers: {
          'content-type': 'text/html; charset=utf-8',
          'date': 'Wed, 06 Nov 2019 16:01:53 GMT',
          'cache-control': 'no-cache',
          'content-encoding': 'gzip',
      }
  }
  ```
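Several of these properties are often used together. The following sketch combines `globalStore`, `skipLinks()` and `enqueueRequest()` to stop following links after a rough page budget is used up. The `MAX_PAGES` constant, the `pages-seen` key, the `parentUrl` field and the enqueued URL are all assumptions made for this example, and because pages may be processed concurrently, the counter should be treated as approximate.

```javascript
// Hypothetical page budget chosen for this sketch.
const MAX_PAGES = 100;

async function pageFunction(context) {
    const { request, log } = context;

    // Count processed pages in the in-memory store (not persisted across restarts).
    const pagesSeen = ((await context.globalStore.get('pages-seen')) ?? 0) + 1;
    await context.globalStore.set('pages-seen', pagesSeen);

    if (pagesSeen >= MAX_PAGES) {
        // Budget used up: do not enqueue any links found on this page.
        log.info('Page budget reached, skipping links', { url: request.url });
        await context.skipLinks();
        return null;
    }

    // Manually enqueue a related page at the front of the queue (placeholder URL).
    await context.enqueueRequest(
        { url: 'https://www.example.com/next', userData: { parentUrl: request.url } },
        { forefront: true },
    );
    return { url: request.url, pagesSeen };
}
```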
Proxy configuration
The Proxy configuration (`proxyConfiguration`) option enables you to set proxies that will be used by the scraper in order to prevent its detection by target web pages. You can use both the Apify Proxy and custom HTTP or SOCKS5 proxy servers.
Proxy is required to run the scraper. The following table lists the available options of the proxy configuration setting:
Apify Proxy (automatic) | The scraper will load all web pages using the Apify Proxy in automatic mode. In this mode, the proxy uses all proxy groups that are available to the user. For each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname in order to reduce the chance of detection by the web page. You can view the list of available proxy groups on the Proxy page in Apify Console. |
---|---|
Apify Proxy (selected groups) | The scraper will load all web pages using the Apify Proxy with specific groups of target proxy servers. |
Custom proxies | The scraper will use a custom list of proxy servers. The proxies must be specified in the `scheme://user:password@host:port` format. |
The proxy configuration can be set programmatically when calling the actor using the API by setting the `proxyConfiguration` field. It accepts a JSON object with the following structure:
```
{
    // Indicates whether to use the Apify Proxy or not.
    "useApifyProxy": Boolean,

    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, the Apify Proxy will use automatic mode.
    "apifyProxyGroups": String[],

    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
```
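For illustration, two input fragments that follow this structure might look as shown below. The group name `RESIDENTIAL` and the proxy URL are example values only; use the groups available to your account or your own proxy servers.

```javascript
// Input fragment using the Apify Proxy with selected groups
// ("RESIDENTIAL" is an example group name).
const withApifyProxy = {
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL'],
    },
};

// Input fragment using custom proxies only (the URL is a placeholder).
const withCustomProxies = {
    proxyConfiguration: {
        useApifyProxy: false,
        proxyUrls: ['http://user:password@proxy.example.com:8000'],
    },
};

// Either object can be serialized and merged into the JSON input sent to the API.
console.log(JSON.stringify(withApifyProxy, null, 2));
console.log(JSON.stringify(withCustomProxies, null, 2));
```

Omitting `apifyProxyGroups` while keeping `useApifyProxy: true` corresponds to the automatic mode described in the structure above.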
Advanced Configuration
Pre-navigation hooks
This is an array of functions that will be executed before the main `pageFunction` is run. A `context` object similar to the one passed into the `pageFunction` is passed into each of these functions; however, a second `gotOptions` object is also passed in.
The available options can be seen here:
```javascript
preNavigationHooks: [
    async ({ id, request, session, proxyInfo, customData, Actor }, { url, method, headers, proxyUrl }) => {}
]
```
Check out the docs for Pre-navigation hooks and the `CheerioHook` type for more info regarding the objects passed into these functions. The available properties are extended with `Actor` (alternatively `Apify`) and `customData` in this scraper.
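As one possible use, a pre-navigation hook can adjust the second argument before the request is sent, for example to override the `Accept` header when the target serves JSON. This is a minimal sketch that assumes the hook may replace `gotOptions.headers`; the header value is an example, and header names are assumed to be lowercase.

```javascript
const preNavigationHooks = [
    // Ask the server for JSON by overriding the Accept header for every request.
    async ({ request }, gotOptions) => {
        gotOptions.headers = {
            ...gotOptions.headers, // keep any headers that are already set
            accept: 'application/json',
        };
    },
];
```

In the actor input, the hooks are provided as the array itself; the `const` declaration here only makes the snippet runnable on its own.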
Post-navigation hooks
An array of functions that will be executed after each page has been fetched (that is, after navigation), before the main `pageFunction` processes it. The only available parameter is the `CrawlingContext` object. The available properties are extended with `Actor` (alternatively `Apify`) and `customData` in this scraper.
```javascript
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response, customData, Actor }) => {}
]
```
Check out the docs for Post-navigation hooks for more info regarding the objects passed into these functions.
Results
The scraping results returned by Page function are stored in the default dataset associated with the actor run, from where you can export them to formats such as JSON, XML, CSV or Excel. For each object returned by the Page function, Cheerio Scraper pushes one record into the dataset and extends it with metadata such as the URL of the web page where the results come from.
For example, if your page function returned the following object:
```javascript
{
    message: 'Hello world!'
}
```
The full object stored in the dataset will look as follows (in JSON format, including the metadata fields `#error` and `#debug`):
```json
{
    "message": "Hello world!",
    "#error": false,
    "#debug": {
        "requestId": "fvwscO2UJLdr10B",
        "url": "https://www.example.com/",
        "loadedUrl": "https://www.example.com/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "statusCode": 200
    }
}
```
To download the results, call the Get dataset items API endpoint:
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json
where `[DATASET_ID]` is the ID of the actor run's dataset, which you can find in the Run object returned when starting the actor. Alternatively, you'll find the download links for the results in Apify Console.
To skip the `#error` and `#debug` metadata fields from the results and not include empty result records, simply add the `clean=true` query parameter to the API URL, or select the Clean items option when downloading the dataset in Apify Console.
To get the results in other formats, set the `format` query parameter to `xml`, `xlsx`, `csv`, `html`, etc. For more information, see Datasets in documentation or the Get dataset items endpoint in Apify API reference.
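The query parameters described in this section can be assembled with the standard `URL` class. The helper below is a small sketch; `buildDatasetItemsUrl` is a hypothetical name and `DATASET_ID` is a placeholder for a real dataset ID.

```javascript
// Builds the "Get dataset items" URL with the format and clean parameters.
function buildDatasetItemsUrl(datasetId, { format = 'json', clean = false } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

// "DATASET_ID" is a placeholder; use the ID from the Run object.
const itemsUrl = buildDatasetItemsUrl('DATASET_ID', { format: 'csv', clean: true });
console.log(itemsUrl);
// => https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true
```

The resulting URL can be opened in a browser or passed to any HTTP client.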
Additional resources
Congratulations! You've learned how Cheerio Scraper works. You might also want to see these other resources:
- Web scraping tutorial - An introduction to web scraping with Apify.
- Scraping with Cheerio Scraper - A step-by-step tutorial on how to use Cheerio Scraper, with a detailed explanation and examples.
- Web Scraper (apify/web-scraper) - Apify's basic tool for web crawling and scraping. It uses a full Chrome browser to render dynamic content. A similar web scraping actor to Puppeteer Scraper, but is simpler to use and only runs in the context of the browser. Uses the Puppeteer library.
- Puppeteer Scraper (apify/puppeteer-scraper) - An actor similar to Web Scraper, which provides lower-level control of the underlying Puppeteer library and the ability to use server-side libraries.
- Playwright Scraper (apify/playwright-scraper) - A similar web scraping actor to Puppeteer Scraper, but using the Playwright library instead.
- Actors documentation - Documentation for the Apify Actors cloud computing platform.
- Apify SDK documentation - Learn more about the tools required to run your own Apify actors.
- Crawlee documentation - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
Actor Metrics
Created in Apr 2019
Modified a month ago