Pricing

Pay per event

Sitemap Walker Pro — Recursive URL Discovery

Walk sitemaps and sitemap-index trees recursively. robots.txt fallback. Include/exclude globs/regex, lastmodSince filter, priorityMin filter, optional chunked output. Per-URL rows with image / video / news / hreflang metadata. Pure HTTP, no browser.

Developer

BowTiedRaccoon

Last modified

21 days ago

Sitemap URL Extractor Pro — Sitemap Walker & XML Sitemap Scraper (Recursive URL Discovery + Filters)

Extract URLs from any sitemap. This sitemap URL extractor walks sitemap.xml and sitemap-index trees recursively, parses XML sitemaps and gzipped .xml.gz sitemaps, and falls back to robots.txt sitemap discovery when only a site root is given. Filters URLs by glob, regex, last-modified date, and priority. Emits per-URL rows tagged with optional chunk indexes for parallel downstream actors.

Sitemap Walker Pro Features

Recursive sitemap-index walk: handles real-world sites that fan out across nested indexes (most large sites do), recursing up to 10 levels deep with cycle detection.
robots.txt fallback discovers the sitemap location when you pass a bare site root.
Auto-tries /sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, and /sitemaps.xml when no sitemap is given.
Glob and regex include / exclude patterns. Wrap a pattern in /.../ to treat it as regex; everything else is picomatch glob.
lastmodSince and priorityMin filters for incremental crawls and SEO triage.
Optional chunkSize tags each row with a 0-based chunk index so you can fan out to parallel downstream actors without re-implementing pagination.
Captures hreflang alternates and image / video / news sitemap metadata when publishers ship them.
Pure HTTP, no browser, no proxies. Polite 250 ms courtesy delay per host with automatic 429 / 5xx retry.
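
The pattern rule in the list above (slash-wrapped means regex, anything else is a glob) can be sketched in a few lines. This is an illustration only, not the actor's code: `compile_pattern` is a hypothetical helper, Python's `fnmatch` stands in for picomatch (its `*` also crosses `/`, so edge cases differ), and matching against the full URL rather than just the path is an assumption.

```python
import re
from fnmatch import fnmatchcase
from typing import Callable


def compile_pattern(pattern: str) -> Callable[[str], bool]:
    """Turn one include/exclude pattern into a URL predicate (hypothetical helper)."""
    # A pattern wrapped in /.../ is treated as a regular expression.
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        regex = re.compile(pattern[1:-1])
        return lambda url: regex.search(url) is not None
    # Everything else is a glob; fnmatch is only a rough stand-in for picomatch.
    return lambda url: fnmatchcase(url, pattern)


is_blog = compile_pattern("**/blog/**")   # glob
is_draft = compile_pattern("/draft/")     # regex "draft"

print(is_blog("https://example.com/blog/post-1"))
print(is_draft("https://example.com/draft/post-2"))
```

Note that under this rule a glob that happens to start and end with a slash would be read as a regex, so globs are safer written with wildcards on both sides.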

Who Uses Sitemap Walker Data?

SEO teams — discover the canonical URL set for a site before crawling, deduping, or running rich-result audits.
Content engineering — feed downstream Lighthouse, screenshot, or structured-data validator runs with the canonical URL list.
Migration QA — diff the URL set before and after a CMS migration, with lastmodSince for incremental snapshots.
AI training-set curators — pull the publisher-blessed URL list straight from the sitemap, instead of crawling and guessing.
Competitive research — see exactly which content competitors mark up for indexing, and how often each section updates.

How To Extract URLs From A Sitemap

Pass in a list of seed URLs. Each seed can be a sitemap directly, a sitemap-index, or a bare site root.
For each seed the actor either fetches the sitemap directly, tries the standard paths (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /sitemaps.xml), or (when fallbackToRobotsTxt is on) parses /robots.txt for Sitemap: directives.
The walker descends into <sitemapindex> entries recursively up to 10 levels deep, dropping cycles via a shared seen-set across all seeds.
URLs are filtered in order: includePatterns → excludePatterns → lastmodSince → priorityMin. Chunk indexes are applied last when chunkSize is set.
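
The filter order described above can be sketched as a small generator. This is a minimal illustration under stated assumptions, not the actor's implementation: `filter_rows` and `parse_iso` are hypothetical names, patterns are passed in as ready-made predicates, rows with no `lastmod` or `priority` are dropped when the matching filter is set, and the chunk index is taken to be the position among kept rows divided by `chunkSize`.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, List, Optional

Predicate = Callable[[str], bool]


def parse_iso(value: str) -> datetime:
    """Parse an ISO date or timestamp; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def filter_rows(
    rows: Iterable[dict],
    include: List[Predicate],
    exclude: List[Predicate],
    lastmod_since: Optional[str] = None,
    priority_min: Optional[float] = None,
    chunk_size: Optional[int] = None,
) -> Iterator[dict]:
    since = parse_iso(lastmod_since) if lastmod_since else None
    kept = 0
    for row in rows:
        url = row["url"]
        # 1. includePatterns: an empty list includes everything.
        if include and not any(match(url) for match in include):
            continue
        # 2. excludePatterns: exclude wins over include.
        if any(match(url) for match in exclude):
            continue
        # 3. lastmodSince (assumption: rows without lastmod are dropped).
        if since is not None:
            lastmod = row.get("lastmod")
            if not lastmod or parse_iso(lastmod) < since:
                continue
        # 4. priorityMin (assumption: rows without priority are dropped).
        if priority_min is not None:
            priority = row.get("priority")
            if priority is None or priority < priority_min:
                continue
        # 5. Chunk index is applied last, over the rows that survived.
        chunk = kept // chunk_size if chunk_size else 0
        kept += 1
        yield {**row, "chunk": chunk}
```

Applying the chunk index after filtering keeps chunks evenly sized, which is what a downstream fan-out would want.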

Input

{
  "seeds": ["https://www.apify.com/sitemap.xml"],
  "fallbackToRobotsTxt": true,
  "recurseSitemapIndex": true,
  "includePatterns": ["**/blog/**"],
  "excludePatterns": ["/draft/"],
  "lastmodSince": "2026-01-01",
  "priorityMin": 0.5,
  "chunkSize": 100,
  "maxUrls": 0,
  "maxItems": 15
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `seeds` | array | required | Sitemap URLs or site roots. Site roots fall back to robots.txt when `fallbackToRobotsTxt` is on. |
| `fallbackToRobotsTxt` | boolean | true | When a seed lacks an explicit sitemap, parse `/robots.txt` for `Sitemap:` directives. |
| `recurseSitemapIndex` | boolean | true | Walk into nested `<sitemapindex>` entries (most large sites use these). |
| `includePatterns` | array | none | Glob (`**/blog/**`) or `/regex/`. Empty = include all. |
| `excludePatterns` | array | none | Same syntax, applied after include. Exclude wins. |
| `lastmodSince` | string | none | ISO date. Only emit URLs with `lastmod >= this`. |
| `priorityMin` | number | none | Only emit URLs with `priority >= this` (0.0-1.0). |
| `chunkSize` | integer | none | Group output into chunks of this size; tag each row with a chunk index. |
| `maxUrls` | integer | 0 | Hard cap on emitted URLs. `0` = unlimited. |
| `maxItems` | integer | 15 | Apify-tester safety cap. Override (or set to 0) for production batches. |

The effective cap is the smaller of maxUrls and maxItems when both are set.
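
As a worked example of that rule, here is a hedged sketch of how the two caps might combine. `effective_cap` is a hypothetical helper, and treating `0` (or a missing value) as "not set" for both fields is an assumption based on the table above.

```python
from typing import Optional


def effective_cap(max_urls: int = 0, max_items: int = 15) -> Optional[int]:
    """Return the smaller of the caps that are set, or None for unlimited."""
    # Assumption: 0 or a negative value means "no cap" for either field.
    limits = [n for n in (max_urls, max_items) if n and n > 0]
    return min(limits) if limits else None


print(effective_cap(max_urls=0, max_items=15))     # sample input above
print(effective_cap(max_urls=5000, max_items=0))   # production-style override
```

With the sample input, only `maxItems` is set, so the run would stop at 15 rows.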

Sitemap Walker Pro Output Fields

{
  "url": "https://example.com/blog/post-1",
  "lastmod": "2026-04-15T10:00:00Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "sourceSitemap": "https://example.com/sitemap.xml",
  "chunk": 0,
  "alternates":  ["es=https://example.com/es/blog/post-1"],
  "imageRefs":   ["https://example.com/img.jpg;Hero shot"],
  "videoRefs":   [],
  "newsTitle":           null,
  "newsPublication":     null,
  "newsPublicationDate": null,
  "newsLanguage":        null,
  "scrapedAt": "2026-04-30T18:00:00Z"
}

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | Discovered URL. |
| `lastmod` | string | Last-modified timestamp from the sitemap (ISO string). |
| `changefreq` | string | `always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, or `never`. |
| `priority` | number | Priority hint from the sitemap (0.0-1.0). |
| `sourceSitemap` | string | URL of the sitemap that contained this entry. |
| `chunk` | number | 0-based chunk index when `chunkSize` is set; `0` otherwise. |
| `alternates` | array | `hreflang=href` strings, one per `xhtml:link rel=alternate` entry. |
| `imageRefs` | array | `loc;title` strings, one per image-sitemap entry. |
| `videoRefs` | array | `title;content_loc` strings, one per video-sitemap entry. |
| `newsTitle` | string | Google News sitemap title (when present). |
| `newsPublication` | string | Google News publication name (when present). |
| `newsPublicationDate` | string | Google News publication date (when present). |
| `newsLanguage` | string | Google News language code (when present). |
| `scrapedAt` | string | Timestamp when this URL was discovered. |
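
The `alternates` and `imageRefs` strings pack two values into one field, so a consumer has to split them. The sketch below shows one way to do that for the sample row above. The helper names are hypothetical, and splitting on the first `=` or `;` is an assumption that holds for the sample but could misfire on an image URL that itself contains a semicolon.

```python
from typing import Tuple


def split_alternate(entry: str) -> Tuple[str, str]:
    """Split 'hreflang=href' on the first '=' so query strings stay intact."""
    lang, _, href = entry.partition("=")
    return lang, href


def split_image_ref(entry: str) -> Tuple[str, str]:
    """Split 'loc;title' on the first ';' (assumes the URL has no semicolon)."""
    loc, _, title = entry.partition(";")
    return loc, title


row = {
    "url": "https://example.com/blog/post-1",
    "alternates": ["es=https://example.com/es/blog/post-1"],
    "imageRefs": ["https://example.com/img.jpg;Hero shot"],
}

alternates = dict(split_alternate(entry) for entry in row["alternates"])
images = [split_image_ref(entry) for entry in row["imageRefs"]]
print(alternates)
print(images)
```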

Pricing

Token charge, functionally free. Apify rejects pay-per-event (PPE) prices of exactly $0, so the per-URL price is set at the smallest practical floor.

| Event | Price |
| --- | --- |
| Actor start | $0.10 |
| Per discovered URL | $0.0001 |

| Volume | Cost |
| --- | --- |
| 100 URLs | $0.11 |
| 1,000 URLs | $0.20 |
| 10,000 URLs | $1.10 |
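
The volume figures follow directly from the two event prices: one start fee plus the per-URL price times the URL count. The snippet below reproduces that arithmetic. `run_cost` is a hypothetical helper, and it assumes exactly one actor-start event per run and no other platform charges.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.10")   # charged once per run
PER_URL = Decimal("0.0001")     # charged per discovered URL


def run_cost(url_count: int, runs: int = 1) -> Decimal:
    """Total charge in USD for `runs` starts and `url_count` discovered URLs."""
    return ACTOR_START * runs + PER_URL * url_count


for count in (100, 1_000, 10_000):
    print(count, run_cost(count).quantize(Decimal("0.01")))
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.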

This actor is the cheap discovery primitive that pairs with paid downstream actors. Walk sitemaps liberally.

Limits

maxItems defaults to 15 — sized for the Apify tester's 5-minute timeout. Override for production batches by setting maxItems higher and / or relying on maxUrls.
The actor does not fetch the URLs it discovers. Pair with a downstream actor (HTML scraper, Lighthouse, screenshot, structured-data validator) for that.
Sitemap-index recursion caps at 10 levels deep. Cycles are detected via a shared seen-set across all seeds.
robots.txt Disallow: rules are not enforced. Sitemaps are explicitly the publisher's invitation to fetch the listed URLs.
Crawl-delay: directives are not parsed. The walker uses a fixed 250 ms courtesy delay between requests, plus automatic 429 / 5xx retry handling.
Some publishers compress sitemaps as .xml.gz — these are auto-decompressed.
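
To make the gzip handling and the index-versus-urlset distinction concrete, here is a minimal standard-library sketch. It is not the actor's parser: `parse_sitemap` is a hypothetical function, gzip is detected by its magic bytes rather than by file extension or headers, and only `<loc>` values are extracted.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace defined by the sitemaps.org protocol.
SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(body: bytes) -> Tuple[str, List[str]]:
    """Return ('index' or 'urlset', list of <loc> values) for a sitemap body."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    kind = "index" if root.tag == SM_NS + "sitemapindex" else "urlset"
    child = "sitemap" if kind == "index" else "url"
    locs = [
        (element.findtext(SM_NS + "loc") or "").strip()
        for element in root.findall(SM_NS + child)
    ]
    return kind, [loc for loc in locs if loc]


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"</urlset>"
)
print(parse_sitemap(gzip.compress(sample)))
```

A walker would fetch each `<loc>` of an `index` result and call the same function again, stopping at the depth limit or when a URL is already in the seen-set.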

FAQ

How do I extract all URLs from an XML sitemap? Pass the sitemap URL (or a bare site root) in seeds. The extractor fetches the sitemap.xml, recurses into any nested sitemap-index entries, and returns one row per discovered URL with its lastmod, priority, and source sitemap. Gzipped .xml.gz sitemaps are auto-decompressed, and robots.txt is parsed for the sitemap location when no explicit sitemap is given.
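
For the bare-site-root case, discovery comes down to reading `Sitemap:` lines from robots.txt and probing a few well-known paths. The sketch below shows both steps without doing any network I/O. The function names are hypothetical, and the order in which the actor tries these sources is not specified here.

```python
from typing import List
from urllib.parse import urljoin

# The four paths the feature list says are tried when no sitemap is given.
STANDARD_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/sitemaps.xml")


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the URLs of all 'Sitemap:' directives, in order, without duplicates."""
    found: List[str] = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found


def candidate_sitemaps(site_root: str) -> List[str]:
    """Build the standard sitemap URLs to probe for a bare site root."""
    return [urljoin(site_root, path) for path in STANDARD_PATHS]


robots = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml
"""
print(sitemaps_from_robots(robots))
print(candidate_sitemaps("https://example.com"))
```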

Structured Data Validator Pro — feed the discovered URL list straight into structured-data audits.
SSL & Security Headers Checker — discover URLs for a site, then probe each one's TLS and header posture.
Angular SSR State Extractor — discover an Angular site's URLs, then pull each page's TransferState payload.

Need More Features?

Need a different output shape, a warehouse integration, or a pre-wired sitemap → fetch → validate chain? File an issue or get in touch.

Why Use Sitemap Walker Pro?

Functionally free: $0.0001 per URL plus a $0.10 start fee. Walk a million-URL sitemap for about $100 and stop arguing about cost.
Recursive index walk + robots.txt fallback — handles the messy real world. Most other walkers handle one sitemap and call it a day.
Chunked output — tag each row with a 0-based chunk index and fan out to N parallel downstream actors without writing a coordinator.

Built by OrbTop.
