Example Sitemap Cheerio

jancurn/example-sitemap-cheerio
An example actor that first downloads a sitemap in XML format and then crawls each page listed in the sitemap using the fast CheerioCrawler from the Apify SDK.
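The actor accepts a single optional input field, `url`, pointing to the sitemap to crawl (see main.js below). As an illustrative sketch only, assuming you have an Apify API token and the `apify-client` package installed, a run could be started from your own code roughly like this:

// Illustrative sketch, not part of the actor: call it via the Apify JS client
// and read the scraped { url, title } records from the run's default dataset.
// The token value is a placeholder.
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

(async () => {
    // Start the actor and wait for the run to finish
    const run = await client.actor('jancurn/example-sitemap-cheerio').call({
        url: 'http://beachwaver.com/sitemap_products_1.xml',
    });

    // Fetch the results the actor pushed to its default dataset
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Got ${items.length} pages`);
})();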

Dockerfile

# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
    && npm install --only=prod --no-optional \
    && echo "Installed NPM packages:" \
    && (npm list --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

main.js

const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const input = await Apify.getInput();

    // Download sitemap
    const xml = await Apify.utils.requestAsBrowser({
        url: input?.url || 'http://beachwaver.com/sitemap_products_1.xml',
        headers: {
            'User-Agent': 'curl/7.54.0',
        },
    });

    // Parse sitemap and create RequestList from it
    const $ = cheerio.load(xml.body);
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Without this User-Agent, the target site refuses to serve the page!
                'User-Agent': 'curl/7.54.0',
            },
        });
    });
    console.log(`Found ${sources.length} URLs in the sitemap`);
    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from the sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
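The records pushed by Apify.pushData() end up in the run's default dataset. As a rough sketch (not part of the actor), they could be read back with the SDK after the crawl, for example to print a quick summary; the summary format below is an assumption:

const Apify = require('apify');

// Hypothetical add-on: read the { url, title } records written by the crawler
// back out of the default dataset and log a short summary.
Apify.main(async () => {
    const dataset = await Apify.openDataset();
    const { items } = await dataset.getData();
    console.log(`Scraped ${items.length} pages`);
    items.slice(0, 5).forEach(({ url, title }) => console.log(`- ${title} <${url}>`));
});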

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "2.2.2",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}
Developer
Maintained by Community
Actor stats
  • 24 users
  • 60 runs
  • Modified about 1 year ago
