Sitemap URL Extractor — robots.txt + sitemap.xml Crawl

Discover every URL a site exposes via its public sitemap chain. Reads robots.txt, follows Sitemap declarations, recursively descends sitemap-index files, extracts URLs with lastmod, changefreq, priority.

Developer: vøiddo (maintained by Community)

Extract every URL a site exposes via its public sitemap chain. Reads robots.txt for Sitemap: declarations, falls back to /sitemap.xml, recursively descends sitemap-index files, and returns one row per discovered URL with lastmod, changefreq, and priority.

Example output row

{
  "domain": "vercel.com",
  "url": "https://vercel.com/blog/nextjs-14",
  "lastmod": "2024-03-15",
  "changefreq": "weekly",
  "priority": 0.8,
  "source": "https://vercel.com/sitemap-blog.xml"
}

How to use

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| domains | string[] | ["stripe.com","shopify.com","vercel.com"] | Domains to crawl (no scheme, no trailing slash) |
| maxUrlsPerDomain | integer | 2000 | Hard cap on URLs returned per domain |
| followSitemapIndex | boolean | true | Recursively follow <sitemapindex> child links (up to depth 5) |

Minimal run

{
  "domains": ["example.com"],
  "maxUrlsPerDomain": 500,
  "followSitemapIndex": true
}
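
The input table asks for bare domains with no scheme and no trailing slash. The following is a minimal sketch of how a caller might clean up raw values before submitting a run. `normalize_domain` and `build_input` are hypothetical helper names and are not part of the Actor; only the three input field names come from the table above.

```python
import json
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Reduce a URL or host string to a bare lowercase host (hypothetical helper)."""
    raw = raw.strip()
    # urlsplit only fills .hostname when a scheme or "//" is present, so add "//" if missing.
    if "://" not in raw:
        raw = "//" + raw
    host = urlsplit(raw).hostname
    if not host:
        raise ValueError(f"cannot extract a host from {raw!r}")
    return host


def build_input(raw_domains, max_urls=2000, follow_index=True) -> dict:
    """Assemble a run input using the field names from the Input table."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    domains = list(dict.fromkeys(normalize_domain(d) for d in raw_domains))
    return {
        "domains": domains,
        "maxUrlsPerDomain": max_urls,
        "followSitemapIndex": follow_index,
    }


if __name__ == "__main__":
    run_input = build_input(["https://Example.com/", "example.com"], max_urls=500)
    print(json.dumps(run_input, indent=2))
```

Deduplicating after normalization avoids crawling (and paying for) the same domain twice when it was supplied in two spellings.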

Output fields

| Field | Type | Notes |
| --- | --- | --- |
| domain | string | Input domain |
| url | string | Discovered URL from <loc> |
| lastmod | string | ISO date, null if absent |
| changefreq | string | e.g. weekly, null if absent |
| priority | float | 0.0–1.0, null if absent |
| source | string | Sitemap file the URL was found in |
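
To illustrate how these fields relate to the sitemap XML, here is a rough sketch that maps one `<url>` element to an output row. It assumes the standard sitemaps.org namespace. `url_element_to_row` is a hypothetical name, and the Actor's actual parsing may differ (for example in how it treats a malformed priority).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def url_element_to_row(url_el, domain, source):
    """Map one <url> element to the documented output fields (hypothetical helper)."""

    def text(tag):
        value = url_el.findtext(f"sm:{tag}", default=None, namespaces=NS)
        value = value.strip() if value else None
        return value or None  # absent or empty tags become None (null in JSON)

    priority = text("priority")
    try:
        priority = float(priority) if priority is not None else None
    except ValueError:
        priority = None  # assumption: an unparseable priority is treated as absent
    return {
        "domain": domain,
        "url": text("loc"),
        "lastmod": text("lastmod"),
        "changefreq": text("changefreq"),
        "priority": priority,
        "source": source,
    }


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <lastmod>2024-03-15</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)
row = url_element_to_row(root.find("sm:url", NS), "example.com", "https://example.com/sitemap.xml")
print(row)
```

The sample entry has no `<changefreq>`, so that field comes back as `None`, matching the "null if absent" note in the table.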

Pricing

| Event | Cost | When charged |
| --- | --- | --- |
| url_extracted | $0.0001 per URL | Once per run, total = URLs pushed |

A 2,000-URL run costs $0.20 (2,000 × $0.0001). Unused budget is not charged: if a domain has only 300 URLs, you pay for 300 ($0.03).
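
The arithmetic can be sketched as a small estimator. This assumes billing is exactly the per-URL price times the number of URLs pushed, with each domain capped at `maxUrlsPerDomain`, as described above. `estimate_cost` is a hypothetical helper, not an Apify API.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.0001")  # url_extracted price from the pricing table


def estimate_cost(urls_found_per_domain, max_urls_per_domain=2000):
    """Estimate run cost in USD: each domain is billed for min(found, cap) URLs."""
    billed = sum(min(found, max_urls_per_domain) for found in urls_found_per_domain)
    return billed * PRICE_PER_URL


print(estimate_cost([2000]))       # one domain that reaches the cap
print(estimate_cost([300, 5000]))  # 300 URLs billed for the first domain, 2000 for the second
```

`Decimal` is used instead of floats so that very small per-URL prices add up without rounding drift.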

Who it's for

  • SEO teams auditing crawl coverage — verify every page is in the sitemap.
  • Content operations checking lastmod staleness across thousands of URLs.
  • Competitive intelligence — map a competitor's full URL structure.
  • QA pipelines validating sitemap health after deploys.
  • Link-building researchers finding indexable pages at scale.

Source

Crawl order per domain:

  1. GET https://{domain}/robots.txt — parse all Sitemap: lines.
  2. If none found, fall back to GET https://{domain}/sitemap.xml.
  3. For each sitemap URL: fetch + parse XML.
  4. If <sitemapindex>, enqueue each <sitemap><loc> (up to depth 5).
  5. If <urlset>, emit one row per <url> until maxUrlsPerDomain is reached.
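
The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's source: `crawl_domain` and `sitemaps_from_robots` are hypothetical names, the `fetch` callable is injected so the logic can be exercised without network access, rows are trimmed to three fields, and the depth counter treats sitemaps named in robots.txt as depth 0.

```python
import xml.etree.ElementTree as ET
from collections import deque

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 5  # depth limit stated in the steps above


def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines (field name is case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def crawl_domain(domain, fetch, max_urls=2000, follow_index=True):
    """Walk the sitemap chain for one domain; fetch(url) returns text or None."""
    robots = fetch(f"https://{domain}/robots.txt") or ""
    start = sitemaps_from_robots(robots) or [f"https://{domain}/sitemap.xml"]
    queue = deque((url, 0) for url in start)
    seen, rows = set(start), []
    while queue and len(rows) < max_urls:
        sitemap_url, depth = queue.popleft()
        body = fetch(sitemap_url)
        if not body:
            continue  # 404 or empty response: skip
        try:
            root = ET.fromstring(body)
        except ET.ParseError:
            continue  # assumption: malformed XML is skipped as well
        if root.tag.endswith("sitemapindex"):
            if not follow_index or depth >= MAX_DEPTH:
                continue
            for loc in root.iterfind("sm:sitemap/sm:loc", NS):
                child = (loc.text or "").strip()
                if child and child not in seen:  # guard against index loops
                    seen.add(child)
                    queue.append((child, depth + 1))
        elif root.tag.endswith("urlset"):
            for url_el in root.iterfind("sm:url", NS):
                loc = (url_el.findtext("sm:loc", default="", namespaces=NS) or "").strip()
                if loc:
                    rows.append({"domain": domain, "url": loc, "source": sitemap_url})
                if len(rows) >= max_urls:
                    break
    return rows


# In-memory stand-in for a site, so the sketch runs offline.
PAGES = {
    "https://example.com/robots.txt": "User-agent: *\nSitemap: https://example.com/index.xml\n",
    "https://example.com/index.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/blog.xml</loc></sitemap></sitemapindex>"
    ),
    "https://example.com/blog.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url></urlset>"
    ),
}
rows = crawl_domain("example.com", PAGES.get)
print(rows)
```

Passing `PAGES.get` as the fetcher means any URL missing from the dictionary behaves like a 404, which exercises the fallback and skip paths.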

All requests use a polite User-Agent and are paced at 250–600 ms between calls. 404 and empty responses are skipped gracefully.
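
One way such pacing could look with only the Python standard library is sketched below. The User-Agent string is a placeholder, `polite_delay` and `polite_fetch` are hypothetical names, and the Actor's real HTTP client and retry behavior are not documented here.

```python
import random
import time
import urllib.request

# Placeholder value; the Actor's actual User-Agent string is not documented.
USER_AGENT = "example-sitemap-client/0.1 (+https://example.com/contact)"


def polite_delay(low_ms=250, high_ms=600, rng=random):
    """Pick a pause in seconds within the 250-600 ms window."""
    return rng.uniform(low_ms, high_ms) / 1000.0


def polite_fetch(url, timeout=15.0, sleep=time.sleep):
    """Fetch url after a randomized pause; return text, or None on errors and empty bodies."""
    sleep(polite_delay())
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
    except OSError:
        # URLError, HTTPError (e.g. 404) and socket timeouts are all OSError subclasses.
        return None
    return body.decode("utf-8", errors="replace") or None
```

Returning `None` for both failures and empty bodies lets the caller treat "nothing usable here" as a single case and move on to the next sitemap.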