Example Sitemap Cheerio
jancurn/example-sitemap-cheerio
An example actor that first downloads a sitemap in XML format and then crawls each page from the sitemap using the fast CheerioCrawler from the Apify SDK.
Dockerfile
# This is a template for a Dockerfile used to run acts in Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json since it should be
# the only file that affects "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to have fast build if only the source code changed
COPY . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]
main.js
const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const input = await Apify.getInput();

    // Download sitemap. requestAsBrowser() resolves to a response object,
    // so the XML text is read from its `body` property.
    const { body: xml } = await Apify.utils.requestAsBrowser({
        url: input?.url || 'http://beachwaver.com/sitemap_products_1.xml',
        headers: {
            'User-Agent': 'curl/7.54.0',
        },
    });

    // Parse sitemap (in XML mode) and create RequestList from it
    const $ = cheerio.load(xml.toString(), { xmlMode: true });
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Otherwise the target doesn't allow to download the page!
                'User-Agent': 'curl/7.54.0',
            },
        });
    });
    console.log(`Found ${sources.length} URLs in the sitemap`);
    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            // Store the page URL and its <title> text in the default dataset
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
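The sitemap-parsing step in main.js can be tried in isolation, without running the actor. The following is a minimal sketch, not part of the actor: `extractSitemapUrls` is a hypothetical helper, the sample XML and example.com URLs are made up, and a regular expression stands in for cheerio so the snippet has no dependencies. It assumes a well-formed sitemap with plain `<loc>` elements (no CDATA sections or namespace prefixes).

```javascript
// Hypothetical helper (not part of the actor): collects the text of every
// <loc> element in a sitemap, trimmed and with XML entities decoded.
function extractSitemapUrls(xml) {
    const entities = {
        '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&apos;': "'",
    };
    const urls = [];
    // Lazy match keeps each capture inside a single <loc>...</loc> pair.
    const locPattern = /<loc>\s*([\s\S]*?)\s*<\/loc>/gi;
    for (const match of xml.matchAll(locPattern)) {
        // Sitemaps entity-escape characters such as "&" in URLs; undo that here.
        const url = match[1].replace(/&(amp|lt|gt|quot|apos);/g, (entity) => entities[entity]);
        urls.push(url);
    }
    return urls;
}

// Made-up sample input for illustration only.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc></url>
  <url><loc> https://example.com/products/b?x=1&amp;y=2 </loc></url>
</urlset>`;

// Same shape as the `sources` array that main.js passes to RequestList.
const sources = extractSitemapUrls(sampleXml).map((url) => ({
    url,
    headers: { 'User-Agent': 'curl/7.54.0' },
}));

console.log(sources.map((source) => source.url));
```

For real sitemaps, the cheerio-based parsing used in main.js is likely the more robust choice, since a regular expression does not handle every XML construct.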
package.json
{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "2.2.2",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}
Developer
Maintained by Community
Actor Metrics
1 monthly user
2 stars
>99% runs succeeded
Created in Jan 2019
Modified 2 years ago