Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.
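
To make the two crawling modes concrete, here is a minimal sketch of how the Actor input might differ between them. The field names (`startUrls`, `linkSelector`, `globs`, `pageFunction`) are taken from the input JSON on this page; the `buildInput` helper is hypothetical, and the sketch assumes that an empty `linkSelector` makes the scraper visit only the start URLs.

```javascript
// Sketch only: builds Cheerio Scraper input for either crawling mode.
// `buildInput` is an illustrative helper, not part of any Apify SDK.
const PAGE_FUNCTION = `async function pageFunction(context) {
    const { $, request } = context;
    return { url: request.url, pageTitle: $('title').first().text() };
}`;

function buildInput(urls, followGlobs = []) {
    const recursive = followGlobs.length > 0;
    return {
        // Every URL to load goes into startUrls in both modes.
        startUrls: urls.map((url) => ({ url })),
        // Assumption: with an empty linkSelector, page links are ignored
        // and only the start URLs are scraped (list-of-URLs mode).
        linkSelector: recursive ? 'a[href]' : '',
        // Globs restrict which of the discovered links get enqueued.
        globs: followGlobs.map((glob) => ({ glob })),
        pageFunction: PAGE_FUNCTION,
    };
}

// List of URLs: scrape exactly these pages.
const listInput = buildInput(['https://crawlee.dev']);

// Recursive crawl: follow links that match the glob.
const crawlInput = buildInput(['https://crawlee.dev'], ['https://crawlee.dev/*/*']);
```

Either object can be serialized with `JSON.stringify` and sent as the request body when starting a run.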

The code example below shows how to start a run of the Actor through the Apify API. To run the code, you need an Apify account. Replace <YOUR_API_TOKEN> in the code with your API token, which you can find under Settings > Integrations in Apify Console.

# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json <<'EOF'
{
  "startUrls": [
    {
      "url": "https://crawlee.dev"
    }
  ],
  "globs": [
    {
      "glob": "https://crawlee.dev/*/*"
    }
  ],
  "pseudoUrls": [],
  "excludes": [
    {
      "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
  ],
  "linkSelector": "a[href]",
  "pageFunction": "async function pageFunction(context) {\n    const { $, request, log } = context;\n\n    // The \"$\" property contains the Cheerio object which is useful\n    // for querying DOM elements and extracting data from them.\n    const pageTitle = $('title').first().text();\n\n    // The \"request\" property contains various information about the web page loaded. \n    const url = request.url;\n    \n    // Use \"log\" object to print information to actor log.\n    log.info('Page scraped', { url, pageTitle });\n\n    // Return an object with the data extracted from the page.\n    // It will be stored to the resulting dataset.\n    return {\n        url,\n        pageTitle\n    };\n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "initialCookies": [],
  "additionalMimeTypes": [],
  "preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"requestAsBrowserOptions\" which are passed to the `requestAsBrowser()`\n// function the crawler calls to navigate.\n[\n    async (crawlingContext, requestAsBrowserOptions) => {\n        // ...\n    }\n]",
  "postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n    async (crawlingContext) => {\n        // ...\n    },\n]",
  "customData": {}
}
EOF

# Run the Actor using an HTTP API
# See the full API reference at https://docs.apify.com/api/v2
curl "https://api.apify.com/v2/acts/apify~cheerio-scraper/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'
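
The request above only starts a run; the response describes the run rather than the scraped data. As a rough sketch, the results could then be fetched in two steps: wait for the run to reach a final status, then download the items of its default dataset. The example assumes Node.js 18+ (for the global `fetch`), the Apify API v2 endpoints `actor-runs/{runId}` and `datasets/{datasetId}/items`, and a run ID read from the `data.id` field of the response. The helper names and the `APIFY_TOKEN` and `RUN_ID` environment variables are illustrative.

```javascript
// Sketch only: waits for an Actor run to finish and downloads its results.
const API_BASE = 'https://api.apify.com/v2';

// Statuses after which a run no longer changes.
const TERMINAL_STATUSES = ['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT'];

function isFinished(status) {
    return TERMINAL_STATUSES.includes(status);
}

// waitForFinish asks the API to hold the request open (up to 60 seconds)
// until the run finishes, so the loop below does not poll in a tight cycle.
function runUrl(runId, waitSecs = 60) {
    return `${API_BASE}/actor-runs/${encodeURIComponent(runId)}?waitForFinish=${waitSecs}`;
}

function datasetItemsUrl(datasetId) {
    return `${API_BASE}/datasets/${encodeURIComponent(datasetId)}/items?format=json`;
}

async function getJson(url, token) {
    const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return res.json();
}

async function fetchResults(runId, token) {
    let run;
    do {
        ({ data: run } = await getJson(runUrl(runId), token));
    } while (!isFinished(run.status));
    if (run.status !== 'SUCCEEDED') {
        throw new Error(`Run ended with status ${run.status}`);
    }
    // Each item is one object returned by the page function.
    return getJson(datasetItemsUrl(run.defaultDatasetId), token);
}

// Runs only when both variables are set, for example:
//   APIFY_TOKEN=... RUN_ID=... node fetch-results.js
if (process.env.APIFY_TOKEN && process.env.RUN_ID) {
    fetchResults(process.env.RUN_ID, process.env.APIFY_TOKEN)
        .then((items) => console.log(JSON.stringify(items, null, 2)))
        .catch((err) => { console.error(err); process.exitCode = 1; });
}
```

Sending the token in an `Authorization` header instead of the `token` query parameter keeps it out of URLs that may end up in logs.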

Developer
Maintained by Apify
Actor metrics
  • 526 monthly users
  • 75 stars
  • 100.0% runs succeeded
  • 1.5 days response time
  • Created in Apr 2019
  • Modified 3 months ago