# This is a template for a Dockerfile used to run actors in the Apify Actor system.
# The base image name below is set during the act build, based on user settings.
# IMPORTANT: The base image must set a correct working directory, such as /usr/src/app or /home/user
FROM apify/actor-node

# Second, copy just package.json and package-lock.json, since they should be
# the only files that affect "npm install" in the next step, to speed up the build
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to
# keep the image small. Avoid logging too much and print the dependency
# tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed NPM packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "NPM version:" \
 && npm --version

# Copy source code to container
# Do this in the last step, to keep the build fast when only the source code changed
COPY  . ./

# NOTE: The CMD is already defined by the base image.
# Uncomment this for local node inspector debugging:
# CMD [ "node", "--inspect=0.0.0.0:9229", "main.js" ]

main.js

const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const input = await Apify.getInput();
    // Download sitemap.
    // requestAsBrowser resolves to a response object; the XML is in its body.
    const { body: xml } = await Apify.utils.requestAsBrowser({
        url: input?.url || 'http://beachwaver.com/sitemap_products_1.xml',
        headers: {
            'User-Agent': 'curl/7.54.0',
        },
    });

    // Parse sitemap and create RequestList from it
    const $ = cheerio.load(xml.toString());
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Otherwise the target doesn't allow to download the page!
                'User-Agent': 'curl/7.54.0',
            },
        });
    });
    console.log(`Found ${sources.length} URLs in the sitemap`);
    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
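
The sitemap-parsing step above can be tried in isolation, without the Apify SDK or a network call. The following is a minimal sketch under stated assumptions, not the actor's own code: `extractSitemapUrls` and `toRequestSources` are hypothetical helper names, the sample XML and its example.com URLs are made up, and a regular expression stands in for cheerio so that the snippet runs with no dependencies. It does not handle CDATA sections or entities other than `&amp;`.

```javascript
// Hypothetical helper, not part of the actor: collects the text of every
// <loc> element in a sitemap XML string. A regular expression is used here
// instead of cheerio (which main.js uses) to keep the sketch dependency-free.
function extractSitemapUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([\s\S]*?)\s*<\/loc>/gi;
    for (const match of xml.matchAll(locPattern)) {
        // Sitemaps escape "&" as "&amp;"; undo that common case only.
        urls.push(match[1].replace(/&amp;/g, '&'));
    }
    return urls;
}

// Hypothetical helper: builds objects with the same shape as the `sources`
// array that main.js passes to RequestList (url plus a User-Agent header).
function toRequestSources(urls, userAgent = 'curl/7.54.0') {
    return urls.map((url) => ({ url, headers: { 'User-Agent': userAgent } }));
}

// Made-up sample input for illustration.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc></url>
  <url><loc> https://example.com/products/b?x=1&amp;y=2 </loc></url>
</urlset>`;

const sampleSources = toRequestSources(extractSitemapUrls(sampleXml));
console.log(`Found ${sampleSources.length} URLs in the sitemap`);
```

Keeping the parsing in a pure function like this makes it possible to check the extraction against a saved sitemap file before running the full crawl.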

package.json

{
    "name": "apify-project",
    "version": "0.0.1",
    "description": "",
    "author": "It's not you it's me",
    "license": "ISC",
    "dependencies": {
        "apify": "2.2.2",
        "cheerio": "latest"
    },
    "scripts": {
        "start": "node main.js"
    }
}
