Sitemap Validator

Pricing

$0.90 / 1,000 checked urls

Validate XML sitemaps and sitemap indexes. Check listed URLs for HTTP status, redirects, final URL, response time, malformed URLs, and sitemap metadata.

Developer

Maxime Dupré

Maintained by Community

🗺️ Sitemap validator for URL health checks

Sitemap Validator checks public XML sitemaps, sitemap indexes, website roots, bare domains, and robots.txt files. Add a target such as apify.com/sitemap.xml, and the Actor parses the sitemap, follows child sitemap indexes within your depth limit, checks the listed URLs, and saves one row per checked URL.

Use this sitemap validator when you need a fast technical SEO check before a migration, release, crawl-budget review, client audit, or broken-link cleanup. Each row keeps the page URL, source sitemap URL, parent sitemap index, HTTP status, final URL after redirects, redirect count, response time, sitemap metadata, and a plain issue category when something needs attention.

For a quick first run, keep the prefilled Apify sitemap target and the default Maximum checked URLs value. You will get a focused dataset you can inspect in Apify, export as JSON, CSV, Excel, XML, RSS, or HTML, or consume through the Apify API, schedules, webhooks, and integrations.

✅ What this Actor does

  • Accepts direct sitemap URLs, sitemap-index URLs, website roots, bare domains, and robots.txt URLs.
  • Discovers sitemap files from robots.txt and common sitemap paths when you submit a website root.
  • Parses XML sitemap URL sets, XML sitemap indexes, plain-text sitemaps, and gzipped sitemap responses.
  • Follows nested sitemap indexes up to your Maximum index depth.
  • Checks sitemap-listed URLs for HTTP status, redirects, final URL, response time, malformed URLs, and network issues.
  • Preserves sitemap-native lastmod, changefreq, and priority values when the source sitemap provides them.
  • Saves one dataset row per checked sitemap-listed URL.
  • Logs empty or unreachable targets without saving placeholder rows.

This Actor validates URLs that are already listed in public sitemap files. It does not crawl arbitrary internal links, scrape page content, generate sitemaps, submit sitemaps to search engines, or check whether search engines have indexed a URL.
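
To make the discovery and parsing steps above concrete, here is a minimal sketch using only the Python standard library. It is not the Actor's actual implementation: the function names `sitemaps_from_robots` and `parse_sitemap` are hypothetical, and the sketch assumes standard `Sitemap:` lines in robots.txt and the usual `urlset` / `sitemapindex` root elements.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared by 'Sitemap:' lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def _local(tag: str) -> str:
    """Strip the XML namespace from a tag such as '{ns}urlset'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> dict:
    """Classify a sitemap document and collect its entries.

    Returns {"kind": "urlset" | "sitemapindex", "entries": [...]}, where each
    entry maps child tag names (loc, lastmod, changefreq, priority) to text.
    """
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    if kind not in ("urlset", "sitemapindex"):
        raise ValueError(f"not a sitemap document: <{kind}>")
    entries = []
    for child in root:  # <url> or <sitemap> elements
        fields = {_local(el.tag): (el.text or "").strip() for el in child}
        if fields.get("loc"):  # skip entries without a location
            entries.append(fields)
    return {"kind": kind, "entries": entries}
```

Metadata such as lastmod only appears in an entry when the source sitemap provides it, which mirrors how the Actor preserves sitemap-native values.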

📊 Data you get

Each dataset item represents one checked URL from a parsed sitemap. Rows include:

  • pageUrl - URL listed in the sitemap.
  • host - host parsed from the listed URL.
  • sourceSitemapUrl - sitemap file that declared the URL.
  • parentSitemapIndexUrl - sitemap index that linked to the source sitemap, or null.
  • indexDepth - depth of the source sitemap below the submitted or discovered target.
  • sitemapLastmod, changefreq, and priority - sitemap metadata when present.
  • urlStatus - ok, redirect, broken, timeout, or malformed.
  • httpStatus - observed HTTP status, or null when no response was available.
  • finalUrl - final URL after redirects, or null when unavailable.
  • redirectCount - number of redirects followed.
  • responseTimeMs - elapsed time for the URL check.
  • issue - issue category and message, or null for healthy URLs.

🚀 How to run it

  1. Open the Input tab.
  2. Add one or more sitemap, website, domain, or robots.txt targets.
  3. Keep Maximum checked URLs small for your first run, then raise it when the output looks right.
  4. Use Maximum index depth to control nested sitemap-index expansion. Use 0 to check only the submitted sitemap target.
  5. Run the Actor and open the dataset.

No cookies, login credentials, source API key, or custom proxy settings are needed from you. Targets must expose public sitemap assets over http or https.

✍️ Input example

{
  "targets": [
    "https://apify.com/sitemap.xml",
    "https://apify.com",
    "example.com/robots.txt"
  ],
  "maxCheckedUrls": 550,
  "maxIndexDepth": 2
}

The Sitemap or website targets field is the only required input. You can mix known sitemap URLs, sitemap indexes, website roots, bare domains, and robots.txt URLs in the same run.

Maximum checked URLs caps how many sitemap-listed URLs are checked across all targets. Large sitemap indexes can contain thousands of URLs, so this limit keeps first runs predictable.

Maximum index depth controls how many sitemap-index levels are followed. A value of 2 covers common sitemap index structures. A value of 0 keeps validation to the submitted or directly discovered sitemap.
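
If you prepare the input programmatically, a small helper along these lines can catch obvious mistakes before a run. This is only a sketch: `build_run_input` is a hypothetical name, the field names come from the input example above, and the lower bounds it enforces (at least 1 checked URL, depth of 0 or more) are assumptions rather than documented limits.

```python
import json
from typing import Optional


def build_run_input(
    targets: list[str],
    max_checked_urls: Optional[int] = None,
    max_index_depth: Optional[int] = None,
) -> dict:
    """Build an input object shaped like the example above."""
    # Drop empty or whitespace-only targets; "targets" is the only required field.
    cleaned = [t.strip() for t in targets if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one target is required")

    run_input: dict = {"targets": cleaned}
    if max_checked_urls is not None:
        if max_checked_urls < 1:  # assumed lower bound
            raise ValueError("max_checked_urls must be at least 1")
        run_input["maxCheckedUrls"] = max_checked_urls
    if max_index_depth is not None:
        if max_index_depth < 0:  # 0 means: only the submitted sitemap target
            raise ValueError("max_index_depth must be 0 or greater")
        run_input["maxIndexDepth"] = max_index_depth
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_run_input(["https://apify.com/sitemap.xml"], 550, 2), indent=2))
```

Optional fields are left out when not set, so the Actor's own defaults apply.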

📦 Output example

{
  "pageUrl": "https://apify.com/actors",
  "host": "apify.com",
  "sourceSitemapUrl": "https://apify.com/sitemap/pages.xml",
  "parentSitemapIndexUrl": "https://apify.com/sitemap.xml",
  "indexDepth": 1,
  "sitemapLastmod": "2026-06-20T15:31:00.000Z",
  "changefreq": "weekly",
  "priority": 0.8,
  "urlStatus": "redirect",
  "httpStatus": 301,
  "finalUrl": "https://apify.com/store",
  "redirectCount": 1,
  "responseTimeMs": 184,
  "issue": {
    "category": "redirect",
    "message": "Sitemap URL redirects to a different final URL."
  }
}

Healthy URLs use urlStatus: "ok" and issue: null. Redirects, broken responses, timeouts, network issues, and malformed sitemap-listed URLs are still saved as validation results because they are the rows you need to review.
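
Once you have exported the dataset, a quick triage pass might look like the sketch below. It assumes rows shaped like the output example (with `urlStatus` and `issue` fields) already loaded as Python dictionaries; the function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(rows: list[dict]) -> tuple[Counter, list[dict]]:
    """Count rows per urlStatus and collect the rows that carry an issue."""
    counts = Counter(row["urlStatus"] for row in rows)
    # Healthy rows have issue set to null (None after JSON decoding).
    needs_review = [row for row in rows if row.get("issue") is not None]
    return counts, needs_review


if __name__ == "__main__":
    sample = [
        {"pageUrl": "https://example.com/", "urlStatus": "ok", "issue": None},
        {
            "pageUrl": "https://example.com/old",
            "urlStatus": "redirect",
            "issue": {"category": "redirect", "message": "Sitemap URL redirects to a different final URL."},
        },
    ]
    counts, needs_review = summarize(sample)
    print(dict(counts))
    for row in needs_review:
        print(row["pageUrl"], row["issue"]["category"])
```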

💳 Pricing

This Actor uses pay-per-event pricing. You are charged once for each sitemap-listed URL checked and saved to the dataset. The pricing event is called Checked URL.

Failed target discovery, unreachable sitemap files, empty sitemaps, and invalid submitted targets are logged and skipped instead of being saved as charged output rows.
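
As a rough worked example of the Checked URL charge, the sketch below multiplies the listed price of $0.90 per 1,000 checked URLs by a row count. The helper name `estimate_cost` is hypothetical, and the result only reflects the per-event price shown on this listing.

```python
from decimal import Decimal

# Listed price: $0.90 per 1,000 checked URLs.
PRICE_PER_1000 = Decimal("0.90")


def estimate_cost(checked_urls: int) -> Decimal:
    """Estimate the Checked URL charge in USD for a number of saved rows."""
    if checked_urls < 0:
        raise ValueError("checked_urls must not be negative")
    return PRICE_PER_1000 * checked_urls / 1000


if __name__ == "__main__":
    # The input example caps the run at 550 checked URLs.
    print(estimate_cost(550))
```

Because skipped targets and empty sitemaps produce no rows, they contribute nothing to this estimate.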

⚠️ Limits and caveats

  • Sitemap files must be publicly reachable over http or https.
  • The Actor checks URLs listed in sitemaps. It does not crawl pages that are not listed in a sitemap.
  • Sitemap metadata is only as complete as the source file. Missing lastmod, changefreq, or priority values are returned as null.
  • Very large sitemap indexes can contain many child sitemaps and URLs. Use Maximum checked URLs and Maximum index depth to keep runs bounded.
  • HTTP status and response time are observed at run time and can change as the source site changes.
  • The Actor reports URL health signals. It does not prove that Google, Bing, or another search engine has indexed the URL.

❓ FAQ

🔐 Do I need login credentials or an API key?

No. Sitemap Validator reads public sitemap assets and checks public URLs. You do not need to provide cookies, login credentials, a source API key, or custom proxy settings.

🧭 Can it crawl my whole website?

No. It checks URLs found in sitemap files. If you need a rendered page crawl and link map, use Website URL Crawler.

🧩 Can it validate sitemap indexes?

Yes. The Actor parses sitemap indexes and follows child sitemaps up to your Maximum index depth.
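
The depth limit can be pictured as a bounded breadth-first walk over sitemap indexes. The sketch below is an illustration under assumptions, not the Actor's code: `collect_urls` and the `fetch` callback are hypothetical, `fetch` is assumed to return an already parsed document (or None when a sitemap is unreachable), and depth 0 is taken to mean the submitted target itself.

```python
from collections import deque
from typing import Callable, Optional


def collect_urls(
    target: str,
    fetch: Callable[[str], Optional[dict]],
    max_index_depth: int,
    max_checked_urls: int,
) -> list[dict]:
    """Walk sitemap indexes breadth-first and list URLs within both limits.

    fetch(url) is assumed to return
    {"kind": "urlset" | "sitemapindex", "locs": [...]} or None.
    """
    rows: list[dict] = []
    seen: set[str] = set()
    queue = deque([(target, None, 0)])  # (sitemap URL, parent index URL, depth)
    while queue and len(rows) < max_checked_urls:
        sitemap_url, parent, depth = queue.popleft()
        if sitemap_url in seen:
            continue  # guard against indexes that reference each other
        seen.add(sitemap_url)
        doc = fetch(sitemap_url)
        if doc is None:
            continue  # unreachable sitemap: skipped, no placeholder row
        if doc["kind"] == "sitemapindex":
            if depth < max_index_depth:  # only descend while within the limit
                queue.extend((child, sitemap_url, depth + 1) for child in doc["locs"])
            continue
        for loc in doc["locs"]:
            if len(rows) >= max_checked_urls:
                break
            rows.append({
                "pageUrl": loc,
                "sourceSitemapUrl": sitemap_url,
                "parentSitemapIndexUrl": parent,
                "indexDepth": depth,
            })
    return rows
```

With a limit of 0 and a sitemap index as the target, this walk yields no URLs, because the child sitemaps sit one level below the submitted target.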

📉 Why did my run save no rows?

The submitted target may not expose a public sitemap, the sitemap may be empty, or the target may be unreachable at run time. Those cases are logged and skipped instead of creating placeholder dataset rows.

📝 Changelog

  • 0.0: Initial release.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

Made with ❤️ by Maxime Dupré