
Example Sitemap Cheerio
An example actor that first downloads a sitemap in XML format and then crawls each page from the sitemap using the fast CheerioCrawler from the Apify SDK.
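A sitemap is a standard XML document that lists page URLs inside `<loc>` elements. As a rough sketch of what the actor extracts (the sample XML below is made up; the actor itself uses cheerio for the parsing), a plain regular expression over the sitemap text yields the same list of URLs:

```javascript
// Hypothetical sample of the sitemap format the actor downloads.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc></url>
  <url><loc>https://example.com/products/b</loc></url>
</urlset>`;

// Pull the text of every <loc> element, mirroring the actor's $('loc') selection.
const urls = [...sampleSitemap.matchAll(/<loc>(.*?)<\/loc>/g)]
    .map((m) => m[1].trim());

console.log(urls);
```

Each extracted URL then becomes one entry in the crawler's request list.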
Rating: 0.0 (0)
Pricing: Pay per usage
Total users: 38
Monthly users: 5
Runs succeeded: >99%
Last modified: 3 years ago
Dockerfile
# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
main.js
const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const input = await Apify.getInput();

    // Download the sitemap
    const response = await Apify.utils.requestAsBrowser({
        url: input?.url || 'http://beachwaver.com/sitemap_products_1.xml',
        headers: {
            'User-Agent': 'curl/7.54.0',
        },
    });

    // Parse the sitemap and create a RequestList from it
    const $ = cheerio.load(response.body, { xmlMode: true });
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Without this header the target doesn't allow downloading the page!
                'User-Agent': 'curl/7.54.0',
            },
        });
    });
    console.log(`Found ${sources.length} URLs in the sitemap`);

    const requestList = new Apify.RequestList({ sources });
    await requestList.initialize();

    // Crawl each page from the sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
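Each call to `Apify.pushData()` appends one `{ url, title }` record to the actor's default dataset. As an illustration only (the `buildRecord` helper and the HTML below are hypothetical; in the actor the title comes from cheerio's `$('title').text()`), this is the record shape one crawled page produces:

```javascript
// Hypothetical helper mirroring what handlePageFunction stores per page.
function buildRecord(url, html) {
    const match = /<title>(.*?)<\/title>/i.exec(html);
    return { url, title: match ? match[1] : '' };
}

const record = buildRecord(
    'https://example.com/products/a',
    '<html><head><title>Product A</title></head><body></body></html>',
);
console.log(record); // { url: 'https://example.com/products/a', title: 'Product A' }
```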
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "2.2.2",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}