Sitemap Finder & URL Extractor · Crawl Any XML Sitemap

Pricing

from $0.99 / 1,000 results

Find and crawl XML sitemaps from any website. Follows sitemap indexes, handles gzip, and exports every page URL with source file and lastmod into a clean dataset. No config needed.


Developer

Corentin Robert

Maintained by Community

Actor stats

0 bookmarked · 2 total users · 1 monthly active user · last modified 2 days ago

Turn a list of websites into a full URL inventory from their XML sitemaps—no manual sitemap.xml links, no knobs to tune. The actor discovers sitemaps for you, walks sitemap indexes, handles gzip, and writes a tidy dataset: every page URL, the source sitemap file, and lastmod when the site publishes it.

Built for technical SEO audits, migrations, content inventories, and any workflow where you need “all public URLs the site exposes in sitemaps.”
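The gzip handling mentioned above can be sketched in a few lines (an illustrative sketch, not the actor's actual code; the function name and magic-byte check are assumptions):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def decode_sitemap_body(raw: bytes) -> str:
    """Return sitemap XML as text, transparently decompressing .xml.gz bodies."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")
```

The same code path then serves both plain `sitemap.xml` and compressed `sitemap.xml.gz` responses.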


Support & contact

Need help? If something’s unclear, a run fails in an unexpected way, or you’d like a small customization, you can email Corentin Robert (corentin@outreacher.fr).


What you get

  • One row per URL in the default dataset, ready to export (JSON, CSV, API).
  • Automatic discovery from each site’s origin: robots.txt Sitemap: directives plus common paths (/sitemap.xml, WordPress-style paths, etc.).
  • Nested sitemap indexes followed for you (up to 10 levels deep—fixed in code).
  • Live progress in RUN_LOG, plus a per-site summary at the end (URL count and sitemap fetch count per input origin).
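The discovery step above can be approximated like this (a hedged sketch; the actor's real candidate-path list and ordering are not public, so `CANDIDATE_PATHS` is an illustrative subset):

```python
from urllib.parse import urlsplit

# Illustrative subset only; the actor ships its own built-in list.
CANDIDATE_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/wp-sitemap.xml"]

def origin_of(start_url: str) -> str:
    """Reduce any page URL on a site to its origin (scheme + host)."""
    parts = urlsplit(start_url)
    return f"{parts.scheme}://{parts.netloc}"

def sitemap_candidates(start_url: str, robots_txt: str) -> list[str]:
    """Entry sitemap URLs: robots.txt 'Sitemap:' directives first, then common paths."""
    origin = origin_of(start_url)
    found = [line.split(":", 1)[1].strip()
             for line in robots_txt.splitlines()
             if line.lower().startswith("sitemap:")]
    return found + [origin + p for p in CANDIDATE_PATHS]
```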

How it works (short)

  1. You add one start URL per site (any page on that domain—the actor only uses the origin).
  2. It resolves entry sitemap URLs, then fetches each file (HTTP 200 only).
  3. It understands both urlset (URLs inline) and sitemapindex (pointers to more sitemaps), then flattens everything into page URLs.

Many sites use a single urlset; larger sites often ship a root sitemapindex that fans out to multiple urlset files—both patterns are supported.
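The two formats differ only in their root element, so telling them apart is cheap. A minimal parse using Python's stdlib (not the actor's implementation) might look like:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('urlset', page_urls) or ('sitemapindex', child_sitemap_urls)."""
    root = ET.fromstring(xml_text)
    kind = root.tag.removeprefix(NS)  # 'urlset' or 'sitemapindex'
    locs = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]
    return kind, locs
```

A `urlset` result yields page URLs directly; a `sitemapindex` result yields more sitemap URLs to fetch and parse in turn.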


Quick start

  1. In the Apify console, open Site URLs and paste one URL per site (bulk paste or file import works).
  2. Start the run. That’s it—timeouts, pacing, and headers are fixed defaults (see below).

Input

Field       What to enter
startUrls   One URL per website (homepage, blog article, anything on that domain). Same domain twice → one logical “site” in the summary.

Output (dataset)

Field              Meaning
url                Page URL from <loc>
sourceSitemapUrl   The urlset file this row came from
lastmod            From the sitemap if present, otherwise null
discoveredFrom     robots (from robots.txt), candidate (default paths), or nested (via an index)
fetchedAt          ISO timestamp when the row was written
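Put together, a typical dataset row (with illustrative values) looks like:

```json
{
  "url": "https://example.com/blog/hello-world",
  "sourceSitemapUrl": "https://example.com/post-sitemap.xml",
  "lastmod": "2024-11-02T08:15:00+00:00",
  "discoveredFrom": "robots",
  "fetchedAt": "2025-01-10T12:34:56.789Z"
}
```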

Run log

During the run, RUN_LOG in the key-value store stays compact—for example:

12,400 urls · 15 sitemaps

Failed fetches appear as ERR: lines (timeout, non-200, or body that isn’t XML)—without dumping raw responses.

When the run finishes, you’ll see Per input site (origin): with a line per distinct origin (URLs collected + sitemaps fetched). The same breakdown is emitted to Apify’s Log tab via log.info.


Built-in defaults (not shown in the UI)

Topic                            Value
HTTP timeout                     60 s per request
Pause between sitemap requests   100 ms
Extra sitemap paths              None beyond the built-in discovery list
robots.txt                       Read during discovery
User-Agent                       SitemapDetector/1.0 (+https://apify.com)

Limits & good citizenship

  • Index depth is capped at 10 nested sitemapindex levels (by design).
  • Endpoints that return non-XML (e.g. JSON “sitemap APIs”) are skipped—you’ll see not XML in the log.
  • Very large sitemaps mean longer runs and higher memory use on Apify; that’s expected.
  • Use this on sites you’re allowed to query; it only requests public sitemap URLs.
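The depth cap from the first bullet can be pictured as a simple guard in the index traversal (a sketch under assumed names; `fetch_and_parse` is a hypothetical callback, not part of the actor):

```python
MAX_INDEX_DEPTH = 10  # matches the fixed cap described above

def collect_urls(sitemap_url, fetch_and_parse, depth=0):
    """Recursively flatten a sitemapindex tree into page URLs, stopping at the cap."""
    if depth > MAX_INDEX_DEPTH:
        return []  # deeper nesting is ignored by design
    kind, locs = fetch_and_parse(sitemap_url)
    if kind == "urlset":
        return locs
    urls = []
    for child in locs:  # sitemapindex: each loc points at another sitemap
        urls += collect_urls(child, fetch_and_parse, depth + 1)
    return urls
```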

Example input

One site

{
  "startUrls": [{ "url": "https://example.com/" }]
}

Several sites in one run

{
  "startUrls": [
    { "url": "https://apify.com/" },
    { "url": "https://vercel.com/docs" }
  ]
}

Testing & deploy

Smoke tests (parser + live fetches, no full Actor container):

$ npm run test:smoke

Uses Apify, Doctolib, and Vercel sitemaps as fixtures (the script hits sitemap.xml URLs directly; the Actor itself only needs start URLs).

Run the Actor locally

$ apify run --input-file tests/manual-runs/smoke-apify-actor.json

More samples: tests/manual-runs/.

Publish — bump .actor/actor.json version to MAJOR.MINOR (e.g. 1.7), not npm-style 1.0.13, then:

$ apify push