Sitemap Finder & URL Extractor · Crawl Any XML Sitemap
Pricing
from $0.99 / 1,000 results
Find and crawl XML sitemaps from any website. Follows sitemap indexes, handles gzip, and exports every page URL with source file and lastmod into a clean dataset. No config needed.
Developer: Corentin Robert
Turn a list of websites into a full URL inventory from their XML sitemaps—no manual sitemap.xml links, no knobs to tune. The actor discovers sitemaps for you, walks sitemap indexes, handles gzip, and writes a tidy dataset: every page URL, the source sitemap file, and lastmod when the site publishes it.
Built for technical SEO audits, migrations, content inventories, and any workflow where you need “all public URLs the site exposes in sitemaps.”
Support & contact
Need help? If something’s unclear, a run fails in an unexpected way, or you’d like a small customization, you can email Corentin Robert (corentin@outreacher.fr).
What you get
- One row per URL in the default dataset, ready to export (JSON, CSV, API).
- Automatic discovery from each site's origin: `robots.txt` `Sitemap:` directives plus common paths (`/sitemap.xml`, WordPress-style paths, etc.).
- Nested sitemap indexes followed for you (up to 10 levels deep, fixed in code).
- Live progress in `RUN_LOG`, plus a per-site summary at the end (URL count and sitemap fetch count per input origin).
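The discovery step can be sketched as follows. This is a minimal illustration, not the actor's actual internals: the function name, the exact candidate-path list, and the robots.txt handling shown here are assumptions.

```javascript
// Illustrative discovery sketch: collect sitemap candidates for one origin.
// COMMON_PATHS and discoverSitemapCandidates are hypothetical names.
const COMMON_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/wp-sitemap.xml'];

function discoverSitemapCandidates(origin, robotsTxt) {
  const candidates = new Set();
  // Sitemap: directives in robots.txt are collected first.
  for (const line of (robotsTxt || '').split('\n')) {
    const m = line.match(/^\s*sitemap:\s*(\S+)/i);
    if (m) candidates.add(m[1]);
  }
  // Supplement with well-known paths resolved against the same origin.
  for (const path of COMMON_PATHS) candidates.add(new URL(path, origin).href);
  return [...candidates];
}
```

Both sources feed one deduplicated candidate list, so a `Sitemap:` directive pointing at `/sitemap.xml` does not produce a second fetch.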
How it works (short)
- You add one start URL per site (any page on that domain—the actor only uses the origin).
- It resolves entry sitemap URLs, then fetches each file (HTTP 200 only).
- It understands both `urlset` (URLs inline) and `sitemapindex` (pointers to more sitemaps), then flattens everything into page URLs.
Many sites use a single urlset; larger sites often ship a root sitemapindex that fans out to multiple urlset files—both patterns are supported.
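The two formats can be told apart by their root element. Here is a minimal sketch of that classification; it uses regexes for brevity (a real implementation would use an XML parser), and `parseSitemapXml` is a hypothetical name, not the actor's code.

```javascript
// Classify a fetched sitemap file and extract its <loc> entries.
// urlset → page URLs; sitemapindex → child sitemaps to fetch next.
function parseSitemapXml(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map((m) => m[1]);
  return isIndex
    ? { type: 'sitemapindex', childSitemaps: locs }
    : { type: 'urlset', pageUrls: locs };
}
```

Child sitemaps returned for a `sitemapindex` are fetched and parsed the same way, which is how nested indexes flatten into one list of page URLs.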
Quick start
- In the Apify console, open Site URLs and paste one URL per site (bulk paste or file import works).
- Start the run. That’s it—timeouts, pacing, and headers are fixed defaults (see below).
Input
| Field | What to enter |
|---|---|
| startUrls | One URL per website (homepage, blog article, anything on that domain). Same domain twice → one logical “site” in the summary. |
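The "same domain twice → one logical site" behavior follows from keying on the URL origin. A sketch, with an illustrative function name:

```javascript
// Collapse start URLs to unique origins: any path on a domain maps to
// the same logical "site" in the summary.
function toOrigins(startUrls) {
  return [...new Set(startUrls.map(({ url }) => new URL(url).origin))];
}
```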
Output (dataset)
| Field | Meaning |
|---|---|
| url | Page URL from `<loc>` |
| sourceSitemapUrl | The `urlset` file this row came from |
| lastmod | From the sitemap if present, otherwise `null` |
| discoveredFrom | `robots` (from robots.txt), `candidate` (default paths), or `nested` (via an index) |
| fetchedAt | ISO timestamp when the row was written |
Run log
During the run, RUN_LOG in the key-value store stays compact—for example:
12,400 urls · 15 sitemaps
Failed fetches appear as `ERR:` lines (timeout, non-200, or a body that isn't XML), without dumping raw responses.
When the run finishes, you'll see `Per input site (origin):` with a line per distinct origin (URLs collected + sitemaps fetched). The same breakdown is emitted to Apify's Log tab via `log.info`.
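A progress line like the one above is cheap to produce with locale-aware number formatting. This formatter is a guess at how such a line could be built, not the actor's actual code:

```javascript
// Format the compact RUN_LOG progress line, e.g. "12,400 urls · 15 sitemaps".
function formatProgress(urlCount, sitemapCount) {
  return `${urlCount.toLocaleString('en-US')} urls · ${sitemapCount} sitemaps`;
}
```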
Built-in defaults (not shown in the UI)
| Topic | Value |
|---|---|
| HTTP timeout | 60 s per request |
| Pause between sitemap requests | 100 ms |
| Extra sitemap paths | None beyond the built-in discovery list |
| robots.txt | Read during discovery |
| User-Agent | SitemapDetector/1.0 (+https://apify.com) |
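Applied to a request, those defaults might look like the sketch below. The constant names and the `fetchSitemap` helper are illustrative; only the values come from the table above.

```javascript
// Built-in defaults from the table above, applied per request (sketch).
const DEFAULTS = {
  timeoutMs: 60_000, // HTTP timeout per request
  pauseMs: 100,      // pause between sitemap requests
  userAgent: 'SitemapDetector/1.0 (+https://apify.com)',
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchSitemap(url) {
  await sleep(DEFAULTS.pauseMs); // polite pacing between requests
  return fetch(url, {
    headers: { 'User-Agent': DEFAULTS.userAgent },
    signal: AbortSignal.timeout(DEFAULTS.timeoutMs), // abort after 60 s
  });
}
```

`AbortSignal.timeout` (Node 17.3+) keeps the timeout handling in one place instead of wiring up a manual `AbortController` per request.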
Limits & good citizenship
- Index depth is capped at 10 nested `sitemapindex` levels (by design).
- Endpoints that return non-XML (e.g. JSON "sitemap APIs") are skipped; you'll see `not XML` in the log.
- Very large sitemaps mean longer runs and higher memory use on Apify; that's expected.
- Use this on sites you’re allowed to query; it only requests public sitemap URLs.
Example input
One site
{"startUrls": [{ "url": "https://example.com/" }]}
Several sites in one run
{"startUrls": [{ "url": "https://apify.com/" }, { "url": "https://vercel.com/docs" }]}
Testing & deploy
Smoke tests (parser + live fetches, no full Actor container):
$ npm run test:smoke
Uses Apify, Doctolib, and Vercel sitemaps as fixtures (the script hits sitemap.xml URLs directly; the Actor itself only needs start URLs).
Run the Actor locally
$ apify run --input-file tests/manual-runs/smoke-apify-actor.json
More samples: tests/manual-runs/.
Publish: bump the version in .actor/actor.json to MAJOR.MINOR (e.g. 1.7), not npm-style 1.0.13, then:
$ apify push