Sitemap URL Extractor
Extract every URL from any site's sitemap.xml. Handles sitemap index files (nested sitemaps), gzipped sitemaps, and robots.txt discovery. Returns URL, lastmod, changefreq, priority, and optional image/video/alternate-language fields. No proxy, no cookies, no login.
Pricing
from $1.00 / 1,000 results
Rating
5.0
(14)
Developer
Crawler Bros
Maintained by Community
Sitemap URL Extractor — Pull Every URL From Any Website's Sitemap
Extract the complete URL inventory of any website in seconds — straight from its XML sitemap, no proxy or login required.
What this actor does
Point this actor at a website and it returns every page URL that site publishes in its sitemap. You can pass a direct sitemap URL (https://example.com/sitemap.xml), a gzipped sitemap (https://example.com/sitemap.xml.gz), a sitemap index file that points at dozens of child sitemaps, or just a bare domain (https://example.com) and the actor will discover the sitemap for you via robots.txt and common fallback paths.
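As a rough illustration of how bare-domain discovery can work (this is a sketch of the general technique, not the actor's actual code), a standard-library Python script could read `robots.txt` and fall back to common paths like this:

```python
import urllib.request

# Fallback paths to try when robots.txt does not declare a sitemap.
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]

def discover_sitemaps(domain: str) -> list[str]:
    """Return candidate sitemap URLs for a bare domain such as https://example.com."""
    base = domain.rstrip("/")
    sitemaps = []
    try:
        with urllib.request.urlopen(f"{base}/robots.txt", timeout=10) as resp:
            for line in resp.read().decode("utf-8", errors="replace").splitlines():
                # robots.txt declares sitemaps as "Sitemap: <absolute-url>"
                if line.lower().startswith("sitemap:"):
                    sitemaps.append(line.split(":", 1)[1].strip())
    except OSError:
        pass  # no robots.txt (or unreachable): rely on the fallback paths
    return sitemaps or [f"{base}{path}" for path in FALLBACK_PATHS]

print(discover_sitemaps("https://example.com"))
```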
Behind the scenes the actor performs pure XML parsing over standard HTTP. It walks nested sitemap index files to any depth, decompresses gzipped sitemaps automatically, and preserves per-URL metadata such as last-modified dates, change frequency, priority, and image/video/hreflang annotations whenever the source sitemap includes them. Include and exclude regex filters let you narrow the output to just the section of the site you care about (e.g. only product pages, only blog posts).
Because sitemaps are public endpoints designed to be consumed by search engines, this actor runs without cookies, without proxies, and without any authentication — making it one of the cheapest and fastest ways to map a website end-to-end.
Key features
- Bare-domain auto-discovery — pass `https://example.com` and the actor finds the sitemap via `robots.txt` and common paths (`/sitemap.xml`, `/sitemap_index.xml`).
- Gzipped sitemap support — `.xml.gz` files are decompressed transparently.
- Recursive sitemap index walking — nested sitemap trees are fetched and merged into a single flat list (see the sketch after this list).
- Image sitemap extraction — preserves image URLs, captions, titles when the sitemap contains them.
- Video & hreflang annotations — kept per URL whenever the source exposes them.
- Regex include / exclude filters — trim the output to a specific section of the site.
- Max URL cap — stop early once you have enough data.
- Public data only — no proxy, no cookies, no login.
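To make the recursive walking, gzip handling, and regex filtering concrete, here is a self-contained Python sketch of the same ideas. It assumes only the standard library and is an illustration of the technique, not the actor's implementation:

```python
import gzip
import re
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    """Download one sitemap file, transparently decompressing .xml.gz payloads."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    if url.endswith(".gz") or data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    return ET.fromstring(data)

def walk(url: str, include: str = "", max_urls: int = 0) -> list[dict]:
    """Expand sitemap indexes recursively into a flat list of URL records."""
    root = fetch_xml(url)
    records: list[dict] = []
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            records.extend(walk(loc.text.strip(), include, max_urls))
            if max_urls and len(records) >= max_urls:
                break
    else:
        # A regular urlset lists page URLs with optional metadata.
        for node in root.findall("sm:url", NS):
            page = node.findtext("sm:loc", default="", namespaces=NS).strip()
            if include and not re.search(include, page):
                continue
            records.append({
                "url": page,
                "lastmod": node.findtext("sm:lastmod", namespaces=NS),
                "source": url,
            })
    return records[:max_urls] if max_urls else records

print(len(walk("https://example.com/sitemap.xml", include=r"/blog/", max_urls=1000)))
```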
Input
| Field | Type | Description |
|---|---|---|
| `startUrls` | array | One or more sitemap URLs, sitemap index URLs, `.xml.gz` URLs, or bare domains. |
| `maxUrls` | integer | Upper bound on total URLs returned. `0` means unlimited. |
| `followSitemapIndexes` | boolean | When `true` (default), nested sitemap index files are expanded recursively. |
| `urlFilterInclude` | string | Optional regex. Only URLs matching this pattern are kept. |
| `urlFilterExclude` | string | Optional regex. URLs matching this pattern are dropped. |
Example input
{"startUrls": [{ "url": "https://www.nytimes.com" },{ "url": "https://example.com/sitemap.xml" }],"maxUrls": 5000,"followSitemapIndexes": true,"urlFilterInclude": "/202[4-6]/","urlFilterExclude": "/tag/|/category/"}
Output
Each dataset record represents one URL found in a sitemap:
{"url": "https://example.com/products/red-shoes","lastmod": "2026-03-12T08:14:00+00:00","changefreq": "weekly","priority": 0.8,"source": "https://example.com/sitemap-products.xml","images": [{ "loc": "https://example.com/img/red-shoes.jpg", "title": "Red Shoes" }],"alternates": [{ "hreflang": "de", "href": "https://example.com/de/schuhe-rot" }]}
Field descriptions
- `url` — the page URL listed in the sitemap.
- `lastmod` — last-modified timestamp reported by the sitemap, if present.
- `changefreq` — publisher-declared change frequency (`daily`, `weekly`, `monthly`, …).
- `priority` — publisher-declared priority hint (0.0–1.0).
- `source` — the sitemap file this URL was extracted from (useful with nested indexes).
- `images` — image sitemap entries attached to the URL (when present).
- `videos` — video sitemap entries attached to the URL (when present).
- `alternates` — `hreflang` alternate-language URLs (when present).
Empty or missing fields are omitted rather than emitted as null, so records stay compact.
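After a run, you can post-process the results locally. The sketch below assumes the dataset was exported to a `dataset.json` file; it lists the most recently modified URLs and counts how many URLs each source sitemap contributed:

```python
import json
from collections import Counter

# Items exported from the actor's dataset (e.g. "Export results" -> JSON).
with open("dataset.json", encoding="utf-8") as f:
    records = json.load(f)

# lastmod is optional; ISO 8601 timestamps with the same offset sort correctly
# as plain strings, and records without lastmod sort to the end.
for record in sorted(records, key=lambda r: r.get("lastmod", ""), reverse=True)[:20]:
    print(record.get("lastmod", "-"), record["url"])

# Count URLs per source sitemap, which is handy after a nested index was expanded.
print(Counter(r.get("source", "unknown") for r in records).most_common(10))
```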
Use cases
- SEO audits — pull every indexable URL a site exposes, then cross-check against Google Search Console coverage.
- Site migrations — generate a full URL inventory before cutover to build redirect maps.
- Competitive research — map a competitor's product catalog, blog, or news archive.
- Content crawling seed list — feed the extracted URLs into a downstream scraper or LLM ingestion pipeline.
- Broken-link discovery — pair the URL list with a link checker to find 404s across large sites (a minimal checker sketch follows this list).
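Here is a minimal sketch of the broken-link use case: it HEAD-requests each URL from an exported `dataset.json` and prints anything that does not return 200. Some servers reject HEAD requests, so treat non-200 results as candidates to re-check rather than confirmed breakage.

```python
import concurrent.futures
import json
import urllib.error
import urllib.request

def status(url: str):
    """Return (url, HTTP status) or (url, error name) for one extracted URL."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return url, resp.status
    except urllib.error.HTTPError as err:
        return url, err.code            # e.g. 404, 410, 500
    except OSError as err:
        return url, type(err).__name__  # DNS failures, timeouts, TLS errors

with open("dataset.json", encoding="utf-8") as f:
    urls = [record["url"] for record in json.load(f)]

# Check URLs concurrently and report anything that is not a plain 200.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for url, code in pool.map(status, urls):
        if code != 200:
            print(code, url)
```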
FAQ
Do I need a proxy?
No. Sitemap endpoints are public and designed for consumption by search engines, so they do not gate by IP or require cookies.
What if I only have a domain and don't know the sitemap URL?
Pass the bare domain (`https://example.com`). The actor reads `robots.txt`, falls back to common sitemap paths, and walks any index files it finds.
How does it handle very large sitemaps?
Each sitemap file is streamed and parsed as it's received. Use `maxUrls` to cap the output when you do not need the entire site.
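Streaming parsing of this kind can be approximated with `xml.etree.ElementTree.iterparse`, which processes elements as the response arrives instead of buffering the whole file. A rough sketch with a `maxUrls`-style cap (again, an illustration rather than the actor's code):

```python
import urllib.request
import xml.etree.ElementTree as ET

MAX_URLS = 5000  # a maxUrls-style cap so the run stops early
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

count = 0
with urllib.request.urlopen("https://example.com/sitemap.xml", timeout=30) as resp:
    # iterparse consumes the response incrementally instead of buffering it all.
    for _, elem in ET.iterparse(resp, events=("end",)):
        if elem.tag == LOC_TAG and count < MAX_URLS:
            print(elem.text.strip())
            count += 1
        elem.clear()  # release parsed elements to keep memory flat
        if count >= MAX_URLS:
            break
```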
Does it follow nested sitemap index files?
Yes, by default. Set `followSitemapIndexes` to `false` to stop at the first level and receive only index entries.
Can it extract images from image sitemaps?
Yes. If the sitemap uses the `<image:image>` extension, the image entries are preserved on the per-URL record.
Why is `lastmod` missing on some records?
Sitemap fields are optional. If the publisher did not include a last-modified date for a URL, it is omitted from the output instead of padded with a placeholder.
What happens if the sitemap URL returns a 404 or HTML page?
The actor emits a compact error record describing the failure and continues with any other inputs. Your dataset is never silently empty.
Known limitations
- HTML-only websites with no sitemap cannot be mapped. If `robots.txt` does not declare a sitemap and common fallback paths are empty, the actor has nothing to parse. Use a full crawler for those cases.
- JavaScript-built sitemap pages (rare — some single-page apps render `/sitemap` as HTML) are not XML and are not supported.
- Extremely deep nested indexes — sitemap trees are expanded recursively, but the total URL count is still bounded by `maxUrls` to keep runs predictable.
- Sitemap accuracy depends on the publisher — if the site forgets to list a URL in its sitemap, this actor cannot discover it (there is no HTML fallback crawling here by design).