Sitemap Walker Pro — Recursive URL Discovery

Walk sitemaps and sitemap-index trees recursively. robots.txt fallback. Include/exclude globs/regex, lastmodSince filter, priorityMin filter, optional chunked output. Per-URL rows with image / video / news / hreflang metadata. Pure HTTP, no browser.

Pricing: Pay per event
Rating: 0.0 (0)
Developer: BowTiedRaccoon (Maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 12 days ago

Sitemap Walker Pro (Recursive URL Discovery + Filters)

Walk sitemaps and sitemap-index trees recursively. Falls back to robots.txt when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.


Sitemap Walker Pro Features

  • Recursive sitemap-index walk: handles real-world sites that fan out across nested indexes (most large sites do), descending up to 10 levels deep with cycle detection.
  • robots.txt fallback discovers the sitemap location when you pass a bare site root.
  • Auto-tries /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /sitemaps.xml when no sitemap is given.
  • Glob and regex include / exclude patterns. Wrap a pattern in /.../ to treat it as regex; everything else is picomatch glob.
  • lastmodSince and priorityMin filters for incremental crawls and SEO triage.
  • Optional chunkSize tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
  • Captures hreflang alternates, image / video / news sitemap metadata when publishers ship them.
  • Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
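
The seed-discovery features above can be sketched as two small pure functions. This is a minimal illustration, not the actor's implementation: the function names are hypothetical, and the order shown (robots.txt directives first, conventional paths second) is an assumption, since the text does not say which is tried first.

```python
from __future__ import annotations

from urllib.parse import urljoin

# Conventional paths named in the feature list above.
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/sitemaps.xml"]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named in `Sitemap:` directives, in file order."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (simplification: assumes no '#' inside the URL).
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(site_root: str, robots_txt: str | None) -> list[str]:
    """Hypothetical helper: prefer declared sitemaps, else guess the usual paths."""
    declared = sitemaps_from_robots(robots_txt) if robots_txt else []
    return declared or [urljoin(site_root, path) for path in FALLBACK_PATHS]
```

Each candidate would still need to be fetched and checked, because a guessed path may not exist on the site.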

Who Uses Sitemap Walker Data?

  • SEO teams — discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
  • Content engineering — feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
  • Migration QA — diff the URL set before and after a CMS migration, with lastmodSince for incremental snapshots.
  • AI training-set curators — pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
  • Competitive research — see exactly which content competitors mark up for indexing, and how often each section updates.

How Sitemap Walker Pro Works

  1. Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
  2. For each seed the actor either fetches the sitemap directly, tries the standard /sitemap.xml paths, or (when fallbackToRobotsTxt is on) parses /robots.txt for Sitemap: directives.
  3. The walker descends into <sitemapindex> entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
  4. URLs are filtered in order: includePatterns → excludePatterns → lastmodSince → priorityMin. Chunk indexes are applied last when chunkSize is set.
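
Step 3 above can be sketched as a recursive generator over sitemap XML. This is a rough illustration under stated assumptions: `walk` and its injected `fetch` callable are hypothetical names, the seed counts as depth 0, and only `loc` and `lastmod` are read. The actor's own parser is not shown in this document.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 10  # recursion cap stated in the text


def walk(url: str, fetch: Callable[[str], str], seen: Set[str], depth: int = 0) -> Iterator[dict]:
    """Yield one dict per <url> entry reachable from `url`."""
    if depth > MAX_DEPTH or url in seen:
        return  # too deep, or already visited (cycle or duplicate seed)
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("sitemapindex"):
        # Index document: descend into every child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from walk(loc.text.strip(), fetch, seen, depth + 1)
    else:
        # Leaf document: emit its <url> entries.
        for entry in root.findall("sm:url", NS):
            loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
            if loc:
                yield {
                    "url": loc,
                    "lastmod": entry.findtext("sm:lastmod", default=None, namespaces=NS),
                    "sourceSitemap": url,
                }
```

Passing the same `seen` set to every seed's `walk` call is what makes cycle detection shared across seeds, as the step describes.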

Input

{
  "seeds": ["https://www.apify.com/sitemap.xml"],
  "fallbackToRobotsTxt": true,
  "recurseSitemapIndex": true,
  "includePatterns": ["**/blog/**"],
  "excludePatterns": ["/draft/"],
  "lastmodSince": "2026-01-01",
  "priorityMin": 0.5,
  "chunkSize": 100,
  "maxUrls": 0,
  "maxItems": 15
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| seeds | array | required | Sitemap URLs OR site roots. Site roots fall back to robots.txt when fallbackToRobotsTxt is on. |
| fallbackToRobotsTxt | boolean | true | When a seed lacks an explicit sitemap, parse /robots.txt for Sitemap: directives. |
| recurseSitemapIndex | boolean | true | Walk into nested <sitemapindex> entries (most large sites use these). |
| includePatterns | array | | Glob (**/blog/**) or /regex/. Empty = include all. |
| excludePatterns | array | | Same syntax, applied after include. Exclude wins. |
| lastmodSince | string | | ISO date. Only emit URLs with lastmod >= this. |
| priorityMin | number | | Only emit URLs with priority >= this (0.0-1.0). |
| chunkSize | integer | | Group output into chunks of this size; tag each row with a chunk index. |
| maxUrls | integer | 0 | Hard cap on emitted URLs. 0 = unlimited. |
| maxItems | integer | 15 | Apify-tester safety cap. Override (or set to 0) for production batches. |

The effective cap is the smaller of maxUrls and maxItems when both are set.
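
The filter order and cap rule can be sketched as follows. Treat this as an approximation built on several assumptions rather than the actor's code: Python's `fnmatchcase` stands in for picomatch (its `*` also matches `/`, so glob results can differ), `lastmod` values are compared as ISO strings, rows without `lastmod` are dropped when `lastmodSince` is set, and rows without `priority` use 0.5, the sitemap protocol's default. All function names are hypothetical.

```python
from __future__ import annotations

import re
from fnmatch import fnmatchcase


def matches(pattern: str, url: str) -> bool:
    """A pattern wrapped in /.../ is a regex; anything else is a glob."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], url) is not None
    return fnmatchcase(url, pattern)  # stand-in for picomatch, not equivalent


def effective_cap(max_urls: int, max_items: int) -> int | None:
    """Smaller of the two caps; 0 means 'not set'. None means unlimited."""
    caps = [c for c in (max_urls, max_items) if c > 0]
    return min(caps) if caps else None


def apply_filters(rows, *, include=(), exclude=(), lastmod_since=None,
                  priority_min=None, max_urls=0, max_items=0, chunk_size=0):
    cap = effective_cap(max_urls, max_items)
    kept = []
    for row in rows:
        url = row["url"]
        if include and not any(matches(p, url) for p in include):
            continue  # 1. includePatterns
        if any(matches(p, url) for p in exclude):
            continue  # 2. excludePatterns (exclude wins)
        if lastmod_since and (row.get("lastmod") or "") < lastmod_since:
            continue  # 3. lastmodSince (assumption: missing lastmod is dropped)
        priority = row.get("priority")
        priority = 0.5 if priority is None else priority
        if priority_min is not None and priority < priority_min:
            continue  # 4. priorityMin
        # Chunk index is assigned last, over the rows that survived filtering.
        chunk = len(kept) // chunk_size if chunk_size else 0
        kept.append({**row, "chunk": chunk})
        if cap is not None and len(kept) >= cap:
            break
    return kept
```

With the sample input above, `effective_cap(0, 15)` is 15, so a run stops after 15 rows until `maxItems` is raised or set to 0.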


Sitemap Walker Pro Output Fields

{
  "url": "https://example.com/blog/post-1",
  "lastmod": "2026-04-15T10:00:00Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "sourceSitemap": "https://example.com/sitemap.xml",
  "chunk": 0,
  "alternates": ["es=https://example.com/es/blog/post-1"],
  "imageRefs": ["https://example.com/img.jpg;Hero shot"],
  "videoRefs": [],
  "newsTitle": null,
  "newsPublication": null,
  "newsPublicationDate": null,
  "newsLanguage": null,
  "scrapedAt": "2026-04-30T18:00:00Z"
}

| Field | Type | Description |
| --- | --- | --- |
| url | string | Discovered URL. |
| lastmod | string | Last-modified timestamp from the sitemap (ISO string). |
| changefreq | string | always, hourly, daily, weekly, monthly, yearly, or never. |
| priority | number | Priority hint from the sitemap (0.0-1.0). |
| sourceSitemap | string | URL of the sitemap that contained this entry. |
| chunk | number | 0-based chunk index when chunkSize is set; 0 otherwise. |
| alternates | array | Pipe-joined hreflang=href entries from xhtml:link rel=alternate. |
| imageRefs | array | Pipe-joined loc;title entries from image sitemaps. |
| videoRefs | array | Pipe-joined title;content_loc entries from video sitemaps. |
| newsTitle | string | Google News sitemap title (when present). |
| newsPublication | string | Google News publication name (when present). |
| newsPublicationDate | string | Google News publication date (when present). |
| newsLanguage | string | Google News language code (when present). |
| scrapedAt | string | Timestamp when this URL was discovered. |
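
A downstream consumer might unpack these rows roughly as follows. This is a hedged sketch with hypothetical helper names: it accepts both array items and pipe-joined strings, since the sample shows arrays while the field descriptions say "pipe-joined", and it assumes that `|`, and `;` in image locations, do not occur inside the URLs themselves.

```python
from collections import defaultdict


def split_entries(values):
    """Flatten array items, also splitting any pipe-joined item."""
    return [part for item in values or [] for part in item.split("|") if part]


def parse_alternates(row):
    """Map each hreflang code to its href ('es=https://...' style entries)."""
    pairs = (entry.partition("=") for entry in split_entries(row.get("alternates")))
    return {lang: href for lang, sep, href in pairs if sep}


def parse_images(row):
    """Split 'loc;title' entries; a missing title becomes None."""
    pairs = (entry.partition(";") for entry in split_entries(row.get("imageRefs")))
    return [{"loc": loc, "title": title or None} for loc, _, title in pairs]


def group_by_chunk(rows):
    """Collect URLs per chunk index so each chunk can feed one downstream run."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[row.get("chunk", 0)].append(row["url"])
    return dict(chunks)
```

Splitting on the first `=` keeps query strings in the href intact, because hreflang codes never contain `=`.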

Pricing

Token charge — functionally free. Apify rejects truly $0 PPE events, so the per-URL price is the smallest practical floor.

| Event | Price |
| --- | --- |
| Actor start | $0.10 |
| Per discovered URL | $0.0001 |

| Volume | Cost |
| --- | --- |
| 100 URLs | $0.11 |
| 1,000 URLs | $0.20 |
| 10,000 URLs | $1.10 |

This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.
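
The volume figures follow from one start event plus the per-URL charge. A small estimator, using the two prices listed above, reproduces them; `run_cost` is a hypothetical helper name, and `Decimal` is used so the cents come out exact.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.10")   # price per actor start, from the event table
PER_URL = Decimal("0.0001")     # price per discovered URL, from the event table


def run_cost(url_count: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for `runs` starts emitting `url_count` URLs in total."""
    return ACTOR_START * runs + PER_URL * url_count
```

For example, `run_cost(1000)` gives 0.10 + 1000 × 0.0001 = 0.20, matching the table.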


Limits

  • maxItems defaults to 15 — sized for the Apify tester's 5-minute timeout. Override for production batches by setting maxItems higher and / or relying on maxUrls.
  • The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
  • Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
  • robots.txt Disallow: rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
  • Crawl-delay: directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
  • Some publishers compress sitemaps as .xml.gz — these are auto-decompressed.
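
For readers handling .xml.gz sitemaps in their own tooling, decompression can be sketched like this. Detecting gzip by its magic bytes is an assumption made for the example; the actor may instead rely on the file extension or response headers, and `decode_sitemap_body` is a hypothetical name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap_body(body: bytes) -> str:
    """Return sitemap XML as text, decompressing first when the body is gzipped."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return body.decode("utf-8")  # assumes UTF-8, the usual sitemap encoding
```

Sniffing the bytes rather than the URL suffix also covers the case where an HTTP client has already decompressed the response.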

Related Actors

  • Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
  • SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
  • Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.

Need More Features?

Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.

Why Use Sitemap Walker Pro?

  • Functionally free — $0.0001 per URL. Walk a million-URL sitemap for $100 and stop arguing about cost.
  • Recursive index walk + robots.txt fallback — handles the messy real world. Most other walkers handle one sitemap and call it a day.
  • Chunked output — tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.

Built by OrbTop.