Pricing

$0.90 / 1,000 checked URLs

Sitemap Validator

Validate XML sitemaps and sitemap indexes. Check listed URLs for HTTP status, redirects, final URL, response time, malformed URLs, and sitemap metadata.

Developer

Maxime Dupré

Last modified: 20 days ago

🗺️ Sitemap validator for URL health checks

Sitemap Validator checks public XML sitemaps, sitemap indexes, website roots, bare domains, and robots.txt files. Add a target such as apify.com/sitemap.xml, and the Actor parses the sitemap, follows child sitemap indexes within your depth limit, checks the listed URLs, and saves one row per checked URL.

Use this sitemap validator when you need a fast technical SEO check before a migration, release, crawl-budget review, client audit, or broken-link cleanup. Each row keeps the page URL, source sitemap URL, parent sitemap index, HTTP status, final URL after redirects, redirect count, response time, sitemap metadata, and a plain issue category when something needs attention.

For a quick first run, keep the prefilled Apify sitemap target and the default Maximum checked URLs value. You will get a focused dataset you can inspect in Apify, export as JSON, CSV, Excel, XML, RSS, or HTML, or consume through the Apify API, schedules, webhooks, and integrations.

✅ What this Actor does

Accepts direct sitemap URLs, sitemap-index URLs, website roots, bare domains, and robots.txt URLs.
Discovers sitemap files from robots.txt and common sitemap paths when you submit a website root.
Parses XML sitemap URL sets, XML sitemap indexes, plain-text sitemaps, and gzipped sitemap responses.
Follows nested sitemap indexes up to your Maximum index depth.
Checks sitemap-listed URLs for HTTP status, redirects, final URL, response time, malformed URLs, and network issues.
Preserves sitemap-native lastmod, changefreq, and priority values when the source sitemap provides them.
Saves one dataset row per checked sitemap-listed URL.
Logs empty or unreachable targets without saving placeholder rows.

This Actor validates URLs that are already listed in public sitemap files. It does not crawl arbitrary internal links, scrape page content, generate sitemaps, submit sitemaps to search engines, or check whether search engines have indexed a URL.
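
To illustrate the parsing step described above, here is a minimal Python sketch that tells a URL set apart from a sitemap index and keeps the optional metadata. It is not the Actor's own code. The function names and the sample XML are hypothetical, and the output keys simply mirror the field names used in this README.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def _optional_text(node, tag):
    """Return the stripped text of a child element, or None when absent or empty."""
    text = node.findtext(NS + tag)
    return text.strip() if text and text.strip() else None


def parse_sitemap(xml_text):
    """Parse sitemap XML into {"kind": ..., "entries": [...]} (hypothetical shape)."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        kind, child_tag = "sitemapindex", "sitemap"
    elif root.tag == NS + "urlset":
        kind, child_tag = "urlset", "url"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")

    entries = []
    for node in root.findall(NS + child_tag):
        loc = _optional_text(node, "loc")
        if loc is None:
            continue  # an entry without <loc> cannot be checked
        priority = _optional_text(node, "priority")
        entries.append({
            "loc": loc,
            "sitemapLastmod": _optional_text(node, "lastmod"),
            "changefreq": _optional_text(node, "changefreq"),
            "priority": float(priority) if priority is not None else None,
        })
    return {"kind": kind, "entries": entries}


# Illustrative sample, not taken from a real site.
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

parsed = parse_sitemap(SAMPLE_URLSET)
```

In this sketch, missing metadata comes back as None, which matches how the README describes absent lastmod, changefreq, and priority values.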

📊 Data you get

Each dataset item represents one checked URL from a parsed sitemap. Rows include:

pageUrl - URL listed in the sitemap.
host - host parsed from the listed URL.
sourceSitemapUrl - sitemap file that declared the URL.
parentSitemapIndexUrl - sitemap index that linked to the source sitemap, or null.
indexDepth - depth of the source sitemap below the submitted or discovered target.
sitemapLastmod, changefreq, and priority - sitemap metadata when present.
urlStatus - ok, redirect, broken, timeout, or malformed.
httpStatus - observed HTTP status, or null when no response was available.
finalUrl - final URL after redirects, or null when unavailable.
redirectCount - number of redirects followed.
responseTimeMs - elapsed time for the URL check.
issue - issue category and message, or null for healthy URLs.

🚀 How to run it

Open the Input tab.
Add one or more sitemap, website, domain, or robots.txt targets.
Keep Maximum checked URLs small for your first run, then raise it when the output looks right.
Use Maximum index depth to control nested sitemap-index expansion. Use 0 to check only the submitted sitemap target.
Run the Actor and open the dataset.

No cookies, login credentials, source API key, or custom proxy settings are needed from you. Targets must expose public sitemap assets over http or https.

✍️ Input example

{
	"targets": [
		"https://apify.com/sitemap.xml",
		"https://apify.com",
		"example.com/robots.txt"
	],
	"maxCheckedUrls": 550,
	"maxIndexDepth": 2
}

Sitemap or website targets is the only required input. You can mix known sitemap URLs, sitemap indexes, website roots, bare domains, and robots.txt URLs in the same run.
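
Because targets can be full URLs or bare domains, it can help to picture how a mixed list might be normalized before fetching. The sketch below is an assumption, not the Actor's documented behavior. In particular, defaulting bare domains to https and the helper name normalize_target are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_target(raw):
    """Turn a user-supplied target into an absolute http(s) URL.

    Assumption: targets without a scheme are treated as https.
    """
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a public http(s) target: {raw!r}")
    return candidate


targets = ["https://apify.com/sitemap.xml", "https://apify.com", "example.com/robots.txt"]
normalized = [normalize_target(t) for t in targets]
```

Under these assumptions, the bare `example.com/robots.txt` target becomes `https://example.com/robots.txt`, while targets that already carry a scheme pass through unchanged.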

Maximum checked URLs caps how many sitemap-listed URLs are checked across all targets. Large sitemap indexes can contain thousands of URLs, so this limit keeps first runs predictable.

Maximum index depth controls how many sitemap-index levels are followed. A value of 2 covers common sitemap index structures. A value of 0 limits validation to the submitted or directly discovered sitemap.
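
The interaction between the two limits can be sketched with an in-memory stand-in for fetched sitemaps. This is one plausible reading of the limits, not the Actor's implementation. The collect_urls function and the FAKE_SITEMAPS mapping are hypothetical, and the sketch assumes that an index at the depth limit is not expanded.

```python
from collections import deque


def collect_urls(start, sitemaps, max_index_depth, max_checked_urls):
    """Walk sitemap indexes breadth-first, bounded by depth and URL count.

    `sitemaps` maps a sitemap URL to ("index", [child sitemap URLs])
    or ("urlset", [page URLs]). Unknown URLs are skipped as unreachable.
    """
    rows = []
    seen = {start}
    queue = deque([(start, None, 0)])  # (sitemap URL, parent index, depth)
    while queue and len(rows) < max_checked_urls:
        sitemap_url, parent, depth = queue.popleft()
        entry = sitemaps.get(sitemap_url)
        if entry is None:
            continue
        kind, items = entry
        if kind == "index":
            if depth >= max_index_depth:
                continue  # following children would exceed the depth limit
            for child in items:
                if child not in seen:
                    seen.add(child)
                    queue.append((child, sitemap_url, depth + 1))
        else:
            for page in items:
                if len(rows) >= max_checked_urls:
                    break
                rows.append({
                    "pageUrl": page,
                    "sourceSitemapUrl": sitemap_url,
                    "parentSitemapIndexUrl": parent,
                    "indexDepth": depth,
                })
    return rows


# Illustrative structure: an index that links a URL set and a nested index.
FAKE_SITEMAPS = {
    "https://example.com/sitemap.xml": ("index", [
        "https://example.com/pages.xml",
        "https://example.com/blog-index.xml",
    ]),
    "https://example.com/pages.xml": ("urlset", [
        "https://example.com/",
        "https://example.com/about",
    ]),
    "https://example.com/blog-index.xml": ("index", ["https://example.com/blog.xml"]),
    "https://example.com/blog.xml": ("urlset", ["https://example.com/blog/a"]),
}

ROOT = "https://example.com/sitemap.xml"
depth_two = collect_urls(ROOT, FAKE_SITEMAPS, max_index_depth=2, max_checked_urls=550)
```

With this fake structure, a depth of 1 reaches only the two URLs in pages.xml, and a depth of 2 also reaches the URL behind the nested blog index.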

📦 Output example

{
	"pageUrl": "https://apify.com/actors",
	"host": "apify.com",
	"sourceSitemapUrl": "https://apify.com/sitemap/pages.xml",
	"parentSitemapIndexUrl": "https://apify.com/sitemap.xml",
	"indexDepth": 1,
	"sitemapLastmod": "2026-06-20T15:31:00.000Z",
	"changefreq": "weekly",
	"priority": 0.8,
	"urlStatus": "redirect",
	"httpStatus": 301,
	"finalUrl": "https://apify.com/store",
	"redirectCount": 1,
	"responseTimeMs": 184,
	"issue": {
		"category": "redirect",
		"message": "Sitemap URL redirects to a different final URL."
	}
}

Healthy URLs use urlStatus: "ok" and issue: null. Redirects, broken responses, timeouts, network issues, and malformed sitemap-listed URLs are still saved as validation results because they are the rows you need to review.
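
Once a dataset is exported as JSON, the rows that need review can be separated with a few lines of Python. The sketch below assumes rows shaped like the output example above. The triage helper and the sample rows are hypothetical.

```python
from collections import Counter


def triage(rows):
    """Split out rows with an issue and count rows per urlStatus."""
    needs_review = [row for row in rows if row.get("issue") is not None]
    status_counts = Counter(row["urlStatus"] for row in rows)
    return needs_review, status_counts


# Illustrative rows using the field names from this README.
sample_rows = [
    {"pageUrl": "https://example.com/", "urlStatus": "ok", "httpStatus": 200, "issue": None},
    {
        "pageUrl": "https://example.com/old",
        "urlStatus": "redirect",
        "httpStatus": 301,
        "issue": {
            "category": "redirect",
            "message": "Sitemap URL redirects to a different final URL.",
        },
    },
]

needs_review, status_counts = triage(sample_rows)
```

Here needs_review holds only the redirecting row, and status_counts gives a quick per-status summary for an audit report.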

💳 Pricing

This Actor uses pay-per-event pricing. You are charged once for each sitemap-listed URL checked and saved to the dataset. The pricing event is called Checked URL.

Failed target discovery, unreachable sitemap files, empty sitemaps, and invalid submitted targets are logged and skipped instead of being saved as charged output rows.
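
As a rough budgeting aid, the listed price of $0.90 per 1,000 checked URLs can be turned into a per-run estimate. This sketch covers only the Checked URL event at that listed price. The function name is hypothetical.

```python
from decimal import Decimal

# Listed price: $0.90 per 1,000 checked URLs.
PRICE_PER_1000_URLS = Decimal("0.90")


def estimate_cost(checked_urls):
    """Estimate the Checked URL charge in USD for a number of saved rows."""
    if checked_urls < 0:
        raise ValueError("checked_urls must be non-negative")
    return PRICE_PER_1000_URLS * checked_urls / 1000


example_cost = estimate_cost(550)
```

For the 550 URLs used in the input example, the estimate is 550 × $0.0009 = $0.495.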

⚠️ Limits and caveats

Sitemap files must be publicly reachable over http or https.
The Actor checks URLs listed in sitemaps. It does not crawl pages that are not listed in a sitemap.
Sitemap metadata is only as complete as the source file. Missing lastmod, changefreq, or priority values are returned as null.
Very large sitemap indexes can contain many child sitemaps and URLs. Use Maximum checked URLs and Maximum index depth to keep runs bounded.
HTTP status and response time are observed at run time and can change as the source site changes.
The Actor reports URL health signals. It does not prove that Google, Bing, or another search engine has indexed the URL.

❓ FAQ

🔐 Do I need cookies, login credentials, or an API key?

No. Sitemap Validator reads public sitemap assets and checks public URLs. You do not need to provide cookies, login credentials, a source API key, or custom proxy settings.

🧭 Can it crawl my whole website?

No. It checks URLs found in sitemap files. If you need a rendered page crawl and link map, use Website URL Crawler.

🧩 Can it validate sitemap indexes?

Yes. The Actor parses sitemap indexes and follows child sitemaps up to your Maximum index depth.

📉 Why did my run save no rows?

The submitted target may not expose a public sitemap, the sitemap may be empty, or the target may be unreachable at run time. Those cases are logged and skipped instead of creating placeholder dataset rows.

📝 Changelog

0.0: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

🔗 Other actors

Sitemap Sniffer ↗ - Find sitemap files and export sitemap URL inventory before validation.
Website URL Crawler ↗ - Crawl rendered public pages and export a website link map.
Webpage Text Extractor ↗ - Extract clean text or Markdown from public web pages.
SSL Certificate Checker ↗ - Check public HTTPS certificates, expiry, trust, and TLS details.
Robots.txt Generator ↗ - Generate deployable robots.txt files with sitemap directives.

Made with ❤️ by Maxime Dupré
