Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.
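
The fetch, parse, and extract cycle described above can be illustrated with a minimal Python sketch that uses only the standard library. This is not the Actor's implementation (the Actor runs Node.js code with Cheerio); the names `TitleAndLinkParser` and `extract_page` are hypothetical, and the HTML is passed in as a string rather than fetched over HTTP.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinkParser(HTMLParser):
    """Collects the first <title> text and every a[href] target from raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "a":
            # Roughly mirrors the "a[href]" link selector: skip anchors without href.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()


def extract_page(url, html):
    """Return a record shaped like the example page function's output, plus links."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "pageTitle": parser.title or "",
        # Resolve relative links against the page URL, as a crawler would
        # before enqueueing them.
        "links": [urljoin(url, href) for href in parser.links],
    }
```

A recursive crawl would feed each resolved link back into the same cycle; a list-of-URLs crawl would skip link discovery and process only the given pages.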

The code example below shows how to run the Actor and get its results. To run the code, you need an Apify account. Replace <YOUR_API_TOKEN> in the code with your API token, which you can find under Settings > Integrations in Apify Console.
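
As an alternative to pasting the token into the source file, the token could be read from an environment variable. The sketch below assumes a variable named `APIFY_TOKEN`; that name and the helper `load_api_token` are illustrative assumptions, not part of the original example.

```python
import os


def load_api_token(env_var="APIFY_TOKEN"):
    """Return the API token stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your deployment defines.
    """
    token = os.environ.get(env_var, "").strip()
    # Reject both a missing value and the unreplaced placeholder.
    if not token or token == "<YOUR_API_TOKEN>":
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Apify API token."
        )
    return token


# Possible usage with the client from this example:
# client = ApifyClient(load_api_token())
```

This keeps the token out of version control and fails early with a clear message if it was never set.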

from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://crawlee.dev" }],
    "globs": [{ "glob": "https://crawlee.dev/*/*" }],
    "pseudoUrls": [],
    "excludes": [{ "glob": "/**/*.{png,jpg,jpeg,pdf}" }],
    "linkSelector": "a[href]",
    "pageFunction": """async function pageFunction(context) {
    const { $, request, log } = context;

    // The \"$\" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();

    // The \"request\" property contains various information about the web page loaded.
    const url = request.url;

    // Use \"log\" object to print information to actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url,
        pageTitle
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "additionalMimeTypes": [],
    "preNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the \"crawlingContext\" object
// and \"requestAsBrowserOptions\" which are passed to the `requestAsBrowser()`
// function the crawler calls to navigate.
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // ...
    }
]""",
    "postNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the \"crawlingContext\" object.
[
    async (crawlingContext) => {
        // ...
    },
]""",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("apify/cheerio-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start
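
Instead of printing each item, the results could be collected into CSV text. The sketch below assumes every dataset item is a dict with the `url` and `pageTitle` keys returned by the example page function; the helper name `items_to_csv` is hypothetical.

```python
import csv
import io


def items_to_csv(items, fields=("url", "pageTitle")):
    """Serialize dataset items to CSV text, keeping only the given fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing keys become empty cells; extra keys are dropped.
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


# Possible usage with the run from this example:
# csv_text = items_to_csv(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Because the helper accepts any iterable of dicts, it works with the iterator returned by `iterate_items()` without loading the whole dataset into a list first.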
Developer
Maintained by Apify
Actor metrics
  • 526 monthly users
  • 75 stars
  • 100.0% runs succeeded
  • 1.5 days response time
  • Created in Apr 2019
  • Modified 3 months ago