Sitemap URL Extractor — robots.txt + sitemap.xml Crawl
Extract every URL a site exposes via its public sitemap chain. Reads robots.txt for Sitemap: declarations, falls back to /sitemap.xml, recursively descends sitemap-index files, and returns one row per discovered URL with lastmod, changefreq, and priority.
Example output row
{"domain": "vercel.com","url": "https://vercel.com/blog/nextjs-14","lastmod": "2024-03-15","changefreq": "weekly","priority": 0.8,"source": "https://vercel.com/sitemap-blog.xml"}
How to use
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `domains` | string[] | `["stripe.com","shopify.com","vercel.com"]` | Domains to crawl (no scheme, no trailing slash) |
| `maxUrlsPerDomain` | integer | `2000` | Hard cap on URLs returned per domain |
| `followSitemapIndex` | boolean | `true` | Recursively follow `<sitemapindex>` child links (up to depth 5) |
Minimal run
{"domains": ["example.com"],"maxUrlsPerDomain": 500,"followSitemapIndex": true}
Output fields
| Field | Type | Notes |
|---|---|---|
| `domain` | string | Input domain |
| `url` | string | Discovered URL from `<loc>` |
| `lastmod` | string | ISO date, `null` if absent |
| `changefreq` | string | e.g. `weekly`, `null` if absent |
| `priority` | float | 0.0–1.0, `null` if absent |
| `source` | string | Sitemap file the URL was found in |
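Downstream code usually wants these rows as typed records with the nullable fields handled explicitly. A minimal sketch, assuming each row arrives as one JSON object per line; `SitemapRow`, `parse_row`, and `days_since_lastmod` are hypothetical names, not part of the actor's output API.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SitemapRow:
    domain: str
    url: str
    source: str
    lastmod: Optional[str] = None      # ISO date string, None if absent
    changefreq: Optional[str] = None   # e.g. "weekly", None if absent
    priority: Optional[float] = None   # 0.0 to 1.0, None if absent


def parse_row(line: str) -> SitemapRow:
    """Parse one JSON output row; missing or null optional fields become None."""
    data = json.loads(line)
    return SitemapRow(
        domain=data["domain"],
        url=data["url"],
        source=data["source"],
        lastmod=data.get("lastmod"),
        changefreq=data.get("changefreq"),
        priority=data.get("priority"),
    )


def days_since_lastmod(row: SitemapRow, today: date) -> Optional[int]:
    """Age of the page in days, using only the date part of lastmod."""
    if row.lastmod is None:
        return None
    return (today - date.fromisoformat(row.lastmod[:10])).days
```

Slicing `lastmod` to its first ten characters is a simplification that assumes values start with `YYYY-MM-DD`, whether or not a time component follows.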
Pricing
| Event | Cost | When charged |
|---|---|---|
| `url_extracted` | $0.0001 per URL | Once per run; total = number of URLs pushed |
A 2,000-URL run costs $0.20. Unused budget is not charged: if a domain has only 300 URLs, you pay for 300 ($0.03).
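The arithmetic can be checked before a run. This is a small sketch based only on the per-URL price listed above; `estimate_cost` and `max_cost` are hypothetical helper names, and `Decimal` is used to avoid floating-point rounding on small amounts.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.0001")  # url_extracted price from the pricing table


def estimate_cost(urls_pushed: int) -> Decimal:
    """Cost in USD for a run that pushed `urls_pushed` rows."""
    return PRICE_PER_URL * urls_pushed


def max_cost(domain_count: int, max_urls_per_domain: int = 2000) -> Decimal:
    """Upper bound for a run: every domain hits its per-domain cap."""
    return estimate_cost(domain_count * max_urls_per_domain)
```

With the three default domains and the default cap of 2000, the upper bound is `max_cost(3)`, i.e. $0.60; the actual charge is lower whenever a domain exposes fewer URLs than the cap.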
Buyer
- SEO teams auditing crawl coverage: verify every page is in the sitemap.
- Content operations checking `lastmod` staleness across thousands of URLs.
- Competitive intelligence: map a competitor's full URL structure.
- QA pipelines validating sitemap health after deploys.
- Link-building researchers finding indexable pages at scale.
Source
Crawl order per domain:
1. `GET https://{domain}/robots.txt`: parse all `Sitemap:` lines.
2. If none found, fall back to `GET https://{domain}/sitemap.xml`.
3. For each sitemap URL: fetch + parse XML.
   - If `<sitemapindex>`, enqueue each `<sitemap><loc>` (up to depth 5).
   - If `<urlset>`, emit one row per `<url>` until `maxUrlsPerDomain` is reached.
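The crawl order above can be sketched as a breadth-first walk over sitemap files. This is an illustrative reconstruction, not the actor's actual source. The `fetch` callable is a hypothetical stand-in that returns a response body as text, or `None` for a 404 or empty response. Treating the first sitemap as depth 0 and skipping unparseable XML are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable, Iterator, List, Optional

MAX_DEPTH = 5  # limit from the README; counting the first sitemap as depth 0 is an assumption


def local(tag: str) -> str:
    """Strip an XML namespace prefix: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def child_text(node: ET.Element, name: str) -> Optional[str]:
    """Stripped text of the first child element called `name`, or None."""
    for child in node:
        if local(child.tag) == name and child.text:
            return child.text.strip()
    return None


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the values of all 'Sitemap:' lines (key matched case-insensitively)."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def to_float(text: Optional[str]) -> Optional[float]:
    try:
        return float(text) if text is not None else None
    except ValueError:
        return None


def crawl(domain: str, fetch: Callable[[str], Optional[str]],
          max_urls: int = 2000, follow_index: bool = True) -> Iterator[dict]:
    """Yield one row per <url>, following the robots.txt -> sitemap chain."""
    robots = fetch(f"https://{domain}/robots.txt") or ""
    starts = sitemaps_from_robots(robots) or [f"https://{domain}/sitemap.xml"]
    queue = deque((url, 0) for url in starts)
    seen = set(starts)  # avoid refetching a sitemap listed twice
    emitted = 0
    while queue and emitted < max_urls:
        sitemap_url, depth = queue.popleft()
        body = fetch(sitemap_url)
        if not body:
            continue  # 404 or empty response: skip
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            continue  # assumption: malformed XML is skipped as well
        kind = local(root.tag)
        if kind == "sitemapindex" and follow_index and depth < MAX_DEPTH:
            for entry in root:
                loc = child_text(entry, "loc")
                if loc and loc not in seen:
                    seen.add(loc)
                    queue.append((loc, depth + 1))
        elif kind == "urlset":
            for node in root:
                if emitted >= max_urls:
                    break
                loc = child_text(node, "loc")
                if not loc:
                    continue
                yield {
                    "domain": domain,
                    "url": loc,
                    "lastmod": child_text(node, "lastmod"),
                    "changefreq": child_text(node, "changefreq"),
                    "priority": to_float(child_text(node, "priority")),
                    "source": sitemap_url,
                }
                emitted += 1
```

Passing `fetch` in as a parameter keeps the traversal logic testable against an in-memory dictionary of pages, for example `list(crawl("example.com", pages.get))`.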
All requests use a polite User-Agent and are paced at 250–600 ms between calls. 404 and empty responses are skipped gracefully.
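One way to get this pacing and skip behaviour is a small fetch wrapper. It is a sketch under assumptions: the `USER_AGENT` string is made up, sleeping before every request only approximates "between calls", and the injectable `sleep` and `opener` parameters exist only so the function can be exercised without network access.

```python
import random
import time
import urllib.error
import urllib.request
from typing import Optional

USER_AGENT = "sitemap-extractor-sketch/0.1"  # hypothetical value, not the actor's real User-Agent


def polite_delay() -> float:
    """Random pause in seconds, within the 250-600 ms range stated above."""
    return random.uniform(0.250, 0.600)


def fetch(url: str, timeout: float = 10.0, sleep=time.sleep,
          opener=urllib.request.urlopen) -> Optional[str]:
    """Return the response body as text, or None for a 404 or an empty body."""
    sleep(polite_delay())  # pace every request
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(request, timeout=timeout) as response:
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # missing robots.txt or sitemap: skip
        raise  # other HTTP errors are left to the caller in this sketch
    if not body.strip():
        return None  # empty response: skip
    return body.decode("utf-8", errors="replace")
```

Other failures (timeouts, DNS errors, 5xx responses) are deliberately not handled here, since the text above only specifies behaviour for 404 and empty responses.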