Find and crawl XML sitemaps from any website. Follows sitemap
indexes, handles gzip, and exports every page URL with source
file and lastmod into a clean dataset. No config needed.
Per-input-site stats: after the run, RUN_LOG and log.info list each start URL origin with its page URL count and sitemap fetch count (nested sitemaps are attributed to the seed origin that discovered them).
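As a rough illustration of the per-origin tally described above, the sketch below keys counters by the seed URL's origin. The helper name `createOriginStats`, its methods, and the output line format are assumptions for illustration, not the actor's actual code.

```javascript
// Hypothetical helper: tally page URLs and sitemap fetches per seed origin.
// Nested sitemaps are recorded under the seed URL that led to them.
function createOriginStats() {
  const stats = new Map(); // origin -> { urls, sitemaps }

  const entryFor = (seedUrl) => {
    const origin = new URL(seedUrl).origin;
    if (!stats.has(origin)) stats.set(origin, { urls: 0, sitemaps: 0 });
    return stats.get(origin);
  };

  return {
    recordSitemap(seedUrl) {
      entryFor(seedUrl).sitemaps += 1;
    },
    recordUrls(seedUrl, count) {
      entryFor(seedUrl).urls += count;
    },
    // One summary line per origin (format is an assumption).
    lines() {
      return [...stats].map(
        ([origin, s]) => `${origin}: ${s.urls} urls · ${s.sitemaps} sitemaps`,
      );
    },
  };
}
```

Keying on `new URL(seedUrl).origin` means two start URLs on the same site share one row, which matches the "per origin" wording of the entry.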
[1.0.12] - 2026-03-28
Removed
sitemapEntryUrls — input is site URLs only (startUrls). Sitemaps are always discovered per origin (robots.txt + default paths); users no longer paste direct sitemap.xml URLs.
[1.0.11] - 2026-03-28
Changed
Input UI no longer exposes HTTP timeout, delay, extra paths, Skip robots.txt, or User-Agent; those values are fixed in code (documented in README).
[1.0.10] - 2026-03-28
Removed
maxUrls from the input UI and from supported input — the actor always collects all URLs from sitemaps (no row cap). RUN_LOG is now urls · sitemaps only (no % / ETA).
[1.0.9] - 2026-03-28
Changed
SEO: Marketplace title, seoTitle, seoDescription, and short description in actor.json — keyword-focused (XML sitemap, URL extract, robots.txt, sitemap index, technical SEO). Input schema title aligned. README H1 updated.
[1.0.8] - 2026-03-28
Fixed
apify push on Apify: actor.json version set to 1.1 (MAJOR.MINOR only; three-part semver is invalid on the platform).
[1.0.7] - 2026-03-28
Added
npm run test:smoke — tests/smoke-sitemaps.mjs: live checks against apify.com, doctolib.fr, vercel.com (index vs flat urlset).
.nvmrc (22) and README Troubleshooting for Homebrew Node + simdjson dyld errors (brew reinstall node@22).
[1.0.6] - 2026-03-28
Changed
All user-facing copy is English: input schema, README, changelog, code comments, and Run Log line format (the urls / sitemaps labels were already English).
[1.0.5] - 2026-03-28
Removed
maxSitemapDepth from the input UI and from API input — index depth is fixed to 10 levels in code (MAX_SITEMAP_INDEX_DEPTH).
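A minimal sketch of a depth-capped traversal, assuming the root sitemap counts as depth 0 and that anything nested deeper than the cap is skipped. `MAX_SITEMAP_INDEX_DEPTH` and its value come from the entry above. `collectUrls` and the injected `fetchSitemap` function are hypothetical.

```javascript
const MAX_SITEMAP_INDEX_DEPTH = 10; // name and value from the changelog entry

// fetchSitemap is a hypothetical async function that resolves to
// { type: 'sitemapindex' | 'urlset', locs: string[] } for a sitemap URL.
async function collectUrls(sitemapUrl, fetchSitemap, depth = 0, seen = new Set()) {
  // Stop at the depth cap and never fetch the same sitemap twice.
  if (depth > MAX_SITEMAP_INDEX_DEPTH || seen.has(sitemapUrl)) return [];
  seen.add(sitemapUrl);

  const { type, locs } = await fetchSitemap(sitemapUrl);
  if (type === 'urlset') return locs; // leaf: these are page URLs

  // sitemapindex: each loc is another sitemap, one level deeper.
  const pages = [];
  for (const child of locs) {
    pages.push(...(await collectUrls(child, fetchSitemap, depth + 1, seen)));
  }
  return pages;
}
```

The `seen` set guards against indexes that reference themselves, so the depth cap only has to handle sites with very deep nesting.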
[1.0.4] - 2026-03-28
Changed
Input UI: grouped sections (Sources, Limits, HTTP & options), short labels, bulk emphasis — startUrls via requestListSources (list + file import), sitemapEntryUrls and extraSitemapPaths as stringList (one URL/path per line).
[1.0.3] - 2026-03-28
Documentation
Clarified that a sitemap may be a single urlset (no nested / sub-sitemaps) or a sitemapindex; the actor supports both. README and input schema updated.
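To make the urlset / sitemapindex distinction concrete, here is a small sketch that classifies a document by its root element and pulls out the `<loc>` values. It uses regular expressions rather than a real XML parser, does not decode entities or CDATA, and `classifySitemap` is a hypothetical name, so treat it as an illustration only.

```javascript
// Hypothetical helper: decide whether a sitemap document is a flat urlset
// or a sitemapindex, and extract its <loc> entries.
function classifySitemap(xml) {
  // The first urlset/sitemapindex tag (optionally namespace-prefixed) is taken as the root.
  const root = xml.match(/<(?:[\w-]+:)?(urlset|sitemapindex)[\s>]/i);
  const type = root ? root[1].toLowerCase() : 'unknown';

  // Plain <loc> only; prefixed tags such as <image:loc> are ignored.
  const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map((m) => m[1]);

  return { type, locs };
}
```

For a `urlset` the returned `locs` are page URLs. For a `sitemapindex` they are further sitemap URLs to fetch.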
[1.0.2] - 2026-03-28
Added
sitemapEntryUrls: optional list of full sitemap XML URLs — when set, only these are used as entry points (no robots.txt / default path list). Suited for flows like https://apify.com/sitemap.xml → sitemapindex → each nested urlset.
Changed
Input form: startUrls is no longer strictly required if sitemapEntryUrls is provided.
[1.0.1] - 2026-03-28
Changed
maxUrls is optional by default: omit it to collect every URL from sitemaps (no cap). RUN_LOG shows % and ETA only when maxUrls is set.
[1.0.0] - 2026-03-28
Added
Initial release: discover sitemaps via robots.txt + common paths + optional extra paths.
Follow sitemapindex recursively with configurable max depth.
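The discovery step can be sketched roughly as follows: read `Sitemap:` lines from robots.txt, then append a few default paths on the same origin. The function name `sitemapCandidates` and the `DEFAULT_SITEMAP_PATHS` list are assumptions, since the changelog does not list the actor's actual paths.

```javascript
// Assumed default paths; the actor's real list is not given in this changelog.
const DEFAULT_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap_index.xml'];

// Hypothetical helper: build the list of sitemap URLs to try for one start URL.
function sitemapCandidates(startUrl, robotsTxt = '') {
  const origin = new URL(startUrl).origin;

  // "Sitemap:" directives are case-insensitive and carry an absolute URL.
  const fromRobots = robotsTxt
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap\s*:\s*(\S+)/i))
    .filter(Boolean)
    .map((m) => m[1]);

  const defaults = DEFAULT_SITEMAP_PATHS.map((path) => origin + path);

  // Deduplicate while keeping robots.txt entries first.
  return [...new Set([...fromRobots, ...defaults])];
}
```

Listing robots.txt entries first lets a site's declared sitemaps take priority, with the default paths acting as a fallback when robots.txt is missing or silent.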