Example Sitemap Cheerio
jancurn/example-sitemap-cheerio

An example Actor that first downloads a sitemap in XML format and then crawls each page listed in the sitemap using the fast CheerioCrawler from the Apify SDK.
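The first step, turning sitemap XML into a list of page URLs, can be sketched without any dependencies. This is only an illustration: the Actor itself parses the sitemap with cheerio, whereas the `extractSitemapUrls` function below is a hypothetical helper that uses a regular expression, and the sample XML and `example.com` URLs are made up. It does not handle CDATA sections or namespaced tags.

```javascript
// Hypothetical, dependency-free sketch: pull every <loc> value out of a
// sitemap XML string. A regular expression is a simplification chosen for
// illustration; main.js uses cheerio for this step.
function extractSitemapUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
    let match;
    while ((match = locPattern.exec(xml)) !== null) {
        // URLs inside XML escape "&" as "&amp;", so undo that escaping
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Made-up sample input, shaped like a small sitemap
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc></url>
  <url><loc> https://example.com/products/b?x=1&amp;y=2 </loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleXml));
```

Each extracted URL then becomes one request for the crawler.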

Dockerfile

# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

main.js

const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const input = await Apify.getInput();

    // Download the sitemap. requestAsBrowser() resolves to an HTTP response
    // object, so the XML text is read from its `body` property.
    const response = await Apify.utils.requestAsBrowser({
        url: input?.url || 'http://beachwaver.com/sitemap_products_1.xml',
        headers: {
            'User-Agent': 'curl/7.54.0',
        },
    });

    // Parse the sitemap as XML and collect the RequestList sources from it
    const $ = cheerio.load(response.body, { xmlMode: true });
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Without this header the target refuses to serve the page!
                'User-Agent': 'curl/7.54.0',
            },
        });
    });
    console.log(`Found ${sources.length} URLs in the sitemap`);
    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from the sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            // Store the page URL and its <title> text in the default dataset
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
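The source-building loop in main.js can be isolated into a small pure function, which makes it easy to check without running a crawl. The sketch below is an assumption-laden illustration: `buildSources` is a hypothetical helper, not part of the Apify SDK, and the whitespace trimming, http(s) filter, and de-duplication are additions that the original loop does not perform. It produces plain objects of the same `{ url, headers }` shape that main.js pushes into `sources`.

```javascript
// Hypothetical helper (not part of the Apify SDK): turn raw sitemap URLs into
// plain source objects of the same shape main.js pushes into `sources`.
function buildSources(urls, userAgent = 'curl/7.54.0') {
    const seen = new Set();
    const sources = [];
    for (const raw of urls) {
        const url = String(raw).trim();
        // Assumption: only absolute http(s) URLs are worth crawling
        if (!/^https?:\/\//i.test(url)) continue;
        // Assumption: a URL listed twice should be requested once
        if (seen.has(url)) continue;
        seen.add(url);
        // Attach the same User-Agent header to every request
        sources.push({ url, headers: { 'User-Agent': userAgent } });
    }
    return sources;
}

// Made-up input with a duplicate, a blank entry, and a non-HTTP URL
const exampleSources = buildSources([
    ' https://example.com/a ',
    'https://example.com/a',
    '',
    'ftp://example.com/file',
    'http://example.com/b',
]);
console.log(`Found ${exampleSources.length} URLs in the sitemap`);
```

Filtering before the list is built means the logged count reflects the number of distinct pages that would actually be requested.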

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "2.2.2",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}
Developer
Maintained by Community
Actor metrics
  • Created in Jan 2019
  • Modified over 1 year ago