Sitemap Walker Pro — Recursive URL Discovery
Pricing
Pay per event
Walk sitemaps and sitemap-index trees recursively. robots.txt fallback. Include/exclude globs/regex, lastmodSince filter, priorityMin filter, optional chunked output. Per-URL rows with image / video / news / hreflang metadata. Pure HTTP, no browser.
Developer
BowTiedRaccoon
Maintained by Community
Sitemap Walker Pro (Recursive URL Discovery + Filters)
Walk sitemaps and sitemap-index trees recursively. Falls back to robots.txt when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.
Sitemap Walker Pro Features
- Recursive sitemap-index walk: handles real-world sites that fan out across nested indexes (most large sites do), with cycle detection and a depth cap of 10 levels.
- robots.txt fallback discovers the sitemap location when you pass a bare site root.
- Auto-tries `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap-index.xml`, and `/sitemaps.xml` when no sitemap is given.
- Glob and regex include / exclude patterns. Wrap a pattern in `/.../` to treat it as regex; everything else is a picomatch glob.
- `lastmodSince` and `priorityMin` filters for incremental crawls and SEO triage.
- Optional `chunkSize` tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
- Captures hreflang alternates and image / video / news sitemap metadata when publishers ship them.
- Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
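The filter features listed above can be sketched in a few lines of Python. This is an illustration only, not the actor's code: `compile_pattern`, `parse_iso`, and `filter_rows` are hypothetical names, `fnmatch` only approximates picomatch glob semantics, and dropping rows that lack `lastmod` or `priority` when the matching filter is set is an assumption.

```python
import fnmatch
import re
from datetime import datetime, timezone


def compile_pattern(pattern):
    """Return a predicate: /.../ is treated as regex, anything else as a glob."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        rx = re.compile(pattern[1:-1])
        return lambda url: rx.search(url) is not None
    # Approximation: fnmatch's * also crosses "/", unlike picomatch's single *.
    return lambda url: fnmatch.fnmatchcase(url, pattern)


def parse_iso(value):
    """Parse an ISO date or timestamp; naive values are assumed to be UTC."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def filter_rows(rows, include=(), exclude=(), lastmod_since=None,
                priority_min=None, chunk_size=None):
    """Apply include, exclude, lastmod, and priority filters, then tag chunks."""
    inc = [compile_pattern(p) for p in include]
    exc = [compile_pattern(p) for p in exclude]
    since = parse_iso(lastmod_since) if lastmod_since else None
    kept = []
    for row in rows:
        url = row["url"]
        if inc and not any(match(url) for match in inc):
            continue
        if any(match(url) for match in exc):
            continue
        # Assumption: rows without lastmod are dropped when lastmod_since is set.
        if since is not None and (
            not row.get("lastmod") or parse_iso(row["lastmod"]) < since
        ):
            continue
        # Assumption: rows without priority are dropped when priority_min is set.
        if priority_min is not None and (
            row.get("priority") is None or row["priority"] < priority_min
        ):
            continue
        chunk = len(kept) // chunk_size if chunk_size else 0
        kept.append({**row, "chunk": chunk})
    return kept
```

Chunk indexes are assigned from the position among surviving rows, so a filtered-out URL never leaves a gap in a chunk.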
Who Uses Sitemap Walker Data?
- SEO teams: discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
- Content engineering: feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
- Migration QA: diff the URL set before and after a CMS migration, with `lastmodSince` for incremental snapshots.
- AI training-set curators: pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
- Competitive research: see exactly which content competitors mark up for indexing, and how often each section updates.
How Sitemap Walker Pro Works
- Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
- For each seed, the actor either fetches the sitemap directly, tries the standard `/sitemap.xml` paths, or (when `fallbackToRobotsTxt` is on) parses `/robots.txt` for `Sitemap:` directives.
- The walker descends into `<sitemapindex>` entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
- URLs are filtered in order: `includePatterns` → `excludePatterns` → `lastmodSince` → `priorityMin`. Chunk indexes are applied last when `chunkSize` is set.
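The recursive descent with a shared seen-set can be sketched as follows. This is a minimal illustration, not the actor's implementation: `walk`, `walk_all`, and the injected `fetch` callable (URL in, XML text or bytes out) are hypothetical, and counting the seed as depth 0 is an assumption about how the 10-level cap is measured.

```python
import xml.etree.ElementTree as ET

MAX_DEPTH = 10  # depth cap stated in the text; seed counted as depth 0 (assumption)


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def walk(sitemap_url, fetch, seen, depth=0):
    """Yield (url, source_sitemap) pairs, descending into sitemap indexes."""
    if sitemap_url in seen or depth > MAX_DEPTH:
        return  # already visited (cycle) or too deep
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    # Only direct <loc> children of <sitemap> / <url> entries are collected.
    locs = [
        el.text.strip()
        for entry in root
        for el in entry
        if local(el.tag) == "loc" and el.text
    ]
    if local(root.tag) == "sitemapindex":
        for loc in locs:
            yield from walk(loc, fetch, seen, depth + 1)
    else:
        for loc in locs:
            yield loc, sitemap_url


def walk_all(seeds, fetch):
    """Walk every seed with one seen-set shared across all of them."""
    seen = set()
    for seed in seeds:
        yield from walk(seed, fetch, seen)
```

Injecting `fetch` keeps the sketch testable without network access; a real walker would also need the courtesy delay and retry handling described elsewhere in this document.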
Input
    {
      "seeds": ["https://www.apify.com/sitemap.xml"],
      "fallbackToRobotsTxt": true,
      "recurseSitemapIndex": true,
      "includePatterns": ["**/blog/**"],
      "excludePatterns": ["/draft/"],
      "lastmodSince": "2026-01-01",
      "priorityMin": 0.5,
      "chunkSize": 100,
      "maxUrls": 0,
      "maxItems": 15
    }

| Field | Type | Default | Description |
|---|---|---|---|
| `seeds` | array | required | Sitemap URLs or site roots. Site roots fall back to robots.txt when `fallbackToRobotsTxt` is on. |
| `fallbackToRobotsTxt` | boolean | true | When a seed lacks an explicit sitemap, parse `/robots.txt` for `Sitemap:` directives. |
| `recurseSitemapIndex` | boolean | true | Walk into nested `<sitemapindex>` entries (most large sites use these). |
| `includePatterns` | array | none | Glob (`**/blog/**`) or `/regex/`. Empty = include all. |
| `excludePatterns` | array | none | Same syntax, applied after include. Exclude wins. |
| `lastmodSince` | string | none | ISO date. Only emit URLs with lastmod >= this. |
| `priorityMin` | number | none | Only emit URLs with priority >= this (0.0-1.0). |
| `chunkSize` | integer | none | Group output into chunks of this size; tag each row with a chunk index. |
| `maxUrls` | integer | 0 | Hard cap on emitted URLs. 0 = unlimited. |
| `maxItems` | integer | 15 | Apify-tester safety cap. Override (or set to 0) for production batches. |

The effective cap is the smaller of maxUrls and maxItems when both are set.
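The cap rule can be written as a small helper. This is a sketch under the assumption that 0 means "no cap" for both fields (the table states this for `maxUrls` and implies it for `maxItems`); `effective_cap` is a hypothetical name.

```python
def effective_cap(max_urls=0, max_items=15):
    """Return the row limit to enforce, or None when neither cap is active."""
    # Assumption: a value of 0 disables that cap.
    active = [cap for cap in (max_urls, max_items) if cap and cap > 0]
    return min(active) if active else None
```

With the defaults, the limit is 15 rows, so a production run needs `maxItems` raised or set to 0 even when `maxUrls` is left unlimited.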
Sitemap Walker Pro Output Fields
    {
      "url": "https://example.com/blog/post-1",
      "lastmod": "2026-04-15T10:00:00Z",
      "changefreq": "weekly",
      "priority": 0.8,
      "sourceSitemap": "https://example.com/sitemap.xml",
      "chunk": 0,
      "alternates": ["es=https://example.com/es/blog/post-1"],
      "imageRefs": ["https://example.com/img.jpg;Hero shot"],
      "videoRefs": [],
      "newsTitle": null,
      "newsPublication": null,
      "newsPublicationDate": null,
      "newsLanguage": null,
      "scrapedAt": "2026-04-30T18:00:00Z"
    }

| Field | Type | Description |
|---|---|---|
| `url` | string | Discovered URL. |
| `lastmod` | string | Last-modified timestamp from the sitemap (ISO string). |
| `changefreq` | string | `always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, or `never`. |
| `priority` | number | Priority hint from the sitemap (0.0-1.0). |
| `sourceSitemap` | string | URL of the sitemap that contained this entry. |
| `chunk` | number | 0-based chunk index when `chunkSize` is set; 0 otherwise. |
| `alternates` | array | Pipe-joined `hreflang=href` entries from `xhtml:link rel=alternate`. |
| `imageRefs` | array | Pipe-joined `loc;title` entries from image sitemaps. |
| `videoRefs` | array | Pipe-joined `title;content_loc` entries from video sitemaps. |
| `newsTitle` | string | Google News sitemap title (when present). |
| `newsPublication` | string | Google News publication name (when present). |
| `newsPublicationDate` | string | Google News publication date (when present). |
| `newsLanguage` | string | Google News language code (when present). |
| `scrapedAt` | string | Timestamp when this URL was discovered. |

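A downstream consumer might group rows by `chunk` and unpack the `hreflang=href` entries roughly like this. The helper names are hypothetical, and the sketch assumes rows shaped like the sample output above, with `alternates` delivered as a list of strings.

```python
from collections import defaultdict


def group_by_chunk(rows):
    """Map each chunk index to the list of URLs tagged with it."""
    batches = defaultdict(list)
    for row in rows:
        batches[row.get("chunk", 0)].append(row["url"])
    return dict(sorted(batches.items()))


def parse_alternates(entries):
    """Turn 'hreflang=href' strings into a {hreflang: href} dict."""
    # Split on the first "=" only, because the href itself may contain "=".
    return dict(entry.split("=", 1) for entry in entries if "=" in entry)
```

Each value returned by `group_by_chunk` can then be handed to one parallel downstream run.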
Pricing
Token charge — functionally free. Apify rejects truly $0 PPE events, so the per-URL price is the smallest practical floor.
| Event | Price |
|---|---|
| Actor start | $0.10 |
| Per discovered URL | $0.0001 |

| Volume | Cost |
|---|---|
| 100 URLs | $0.11 |
| 1,000 URLs | $0.20 |
| 10,000 URLs | $1.10 |
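The volume figures follow from the start fee plus the per-URL price. A minimal sketch that reproduces them, using the two prices from the event table (`estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding):

```python
from decimal import Decimal

ACTOR_START = Decimal("0.10")   # "Actor start" event price from the table
PER_URL = Decimal("0.0001")     # "Per discovered URL" event price from the table


def estimate_cost(url_count):
    """Estimated run cost in USD for a given number of discovered URLs."""
    return ACTOR_START + PER_URL * url_count
```

For example, `estimate_cost(1000)` gives 0.10 + 1000 × 0.0001 = 0.20, matching the 1,000-URL row.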
This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.
Limits
- `maxItems` defaults to 15, sized for the Apify tester's 5-minute timeout. Override for production batches by setting `maxItems` higher and / or relying on `maxUrls`.
- The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
- Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
- `robots.txt` `Disallow:` rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
- `Crawl-delay:` directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
- Some publishers compress sitemaps as `.xml.gz`; these are auto-decompressed.
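One common way to handle compressed sitemaps is to check the response body for the gzip magic bytes rather than trusting the file extension. The sketch below shows that approach; it is an assumption about how such detection could work, not a description of the actor's internals, and `maybe_decompress` is a hypothetical name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_decompress(body: bytes) -> bytes:
    """Return the decompressed body if it is gzip data, else the body unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```

Checking the content instead of the URL also covers servers that serve a `.xml.gz` path with transparent decompression already applied.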
Related Actors
- Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
- SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
- Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.
Need More Features?
Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.
Why Use Sitemap Walker Pro?
- Functionally free: $0.0001 per URL. Walk a million-URL sitemap for about $100 and stop arguing about cost.
- Recursive index walk + robots.txt fallback: handles the messy real world. Most other walkers handle one sitemap and call it a day.
- Chunked output: tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.
Built by OrbTop.