Sitemap URL Status Auditor
Audit XML sitemaps for broken URLs, redirects, HTTP status codes, response timing, content type, canonical tags, and robots metadata.
Pricing: Pay per event
Developer: Stas Persiianenko (maintained by Community)
Audit every URL listed in XML sitemaps and sitemap indexes for HTTP status, redirects, response timing, content type, and optional SEO metadata.
Use this actor when you need a repeatable sitemap checker for migrations, releases, content QA, and technical SEO monitoring.
What does Sitemap URL Status Auditor do?
Sitemap URL Status Auditor starts from one or more XML sitemap URLs, sitemap indexes, website roots, or domains.
It downloads sitemap XML files, follows nested sitemap indexes, extracts <loc> URLs, deduplicates them, and checks each listed page URL.
For each URL, it records status code, final URL, redirect count, redirect chain, response time, content type, content length, and a normalized error category.
Optionally, it can fetch page HTML to extract canonical URLs and robots meta tags.
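The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the actor's actual code: the function name `parse_sitemap` is hypothetical, and it assumes the standard sitemap layout in which each `<url>` or `<sitemap>` entry carries a direct `<loc>` child.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", deduplicated <loc> values) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if local_name(root.tag) == "sitemapindex" else "urlset"
    locs = []
    for entry in root:            # <url> or <sitemap> entries
        for child in entry:       # direct children only, so nested extension tags are skipped
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return kind, list(dict.fromkeys(locs))
```

When `kind` is `"index"`, the returned values are further sitemap files to download; otherwise they are the page URLs to check.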
Who is it for?
SEO specialists use it to catch broken URLs in sitemaps before search engines waste crawl budget.
Web QA teams use it after deployments to confirm that sitemap URLs still resolve.
Migration teams use it to check final URLs and redirect counts after domain, CMS, or URL-structure changes.
Agencies use it for recurring client health checks and exportable audit evidence.
Developers use it as a fast HTTP-only smoke test for public sitemap quality.
Why use this actor?
🗺️ It is sitemap-first, not just a generic URL checker.
🔁 It recursively expands sitemap indexes.
🧹 It deduplicates URLs before checking them.
🚦 It uses HEAD first with GET fallback for servers that reject HEAD.
📊 It returns one clean dataset table for export, dashboards, and alerts.
⚙️ It includes concurrency, caps, timeout, retry, and polite User-Agent controls.
What data can you extract?
| Field | Description |
|---|---|
| url | Page URL discovered in the sitemap. |
| sourceSitemap | Sitemap XML file where the URL was found. |
| sitemapDepth | Recursion depth inside sitemap indexes. |
| statusCode | HTTP status code or null for network failures. |
| ok | True for 2xx and 3xx responses. |
| method | HEAD, GET, SITEMAP, or NONE. |
| finalUrl | Final URL after redirects. |
| redirectCount | Number of redirects followed. |
| redirectChain | Redirect URLs exposed by the HTTP client. |
| contentType | Content-Type response header. |
| contentLength | Content-Length header when available. |
| responseTimeMs | Request duration in milliseconds. |
| errorCategory | none, http_error, timeout, dns_error, tls_error, network_error, parse_error, blocked, or not_checked. |
| errorMessage | Human-readable error message. |
| canonicalUrl | Canonical link when metadata extraction is enabled. |
| robotsMeta | Robots meta tag when metadata extraction is enabled. |
| xRobotsTag | X-Robots-Tag response header. |
| checkedAt | ISO timestamp of the check. |
How much does it cost to audit sitemap URL status?
This actor uses pay-per-event pricing.
There is a $0.005 start event for each run and a per-URL result event for each dataset row produced.
| Plan tier | Per URL result |
|---|---|
| Free | $0.000029952 |
| Starter / Bronze | $0.000026046 |
| Scale / Silver | $0.000020316 |
| Business / Gold | $0.000015627 |
| Platinum | $0.000010418 |
| Diamond | $0.000010000 |
For most users, cost scales with the number of sitemap URLs checked.
Use maxUrls for small first tests and increase it after you confirm the sitemap source is correct.
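To budget a run, the pricing above reduces to one start event plus one result event per dataset row. The sketch below uses the prices from the table; the function name `estimate_cost` is hypothetical, and the estimate covers only the actor's pay-per-event charges listed here.

```python
START_EVENT_USD = 0.005

# Per-URL result prices copied from the pricing table above.
PER_URL_USD = {
    "Free": 0.000029952,
    "Starter / Bronze": 0.000026046,
    "Scale / Silver": 0.000020316,
    "Business / Gold": 0.000015627,
    "Platinum": 0.000010418,
    "Diamond": 0.000010000,
}


def estimate_cost(rows: int, tier: str = "Free") -> float:
    """Estimated USD cost of one run that produces `rows` dataset rows."""
    return START_EVENT_USD + rows * PER_URL_USD[tier]


print(round(estimate_cost(1000, "Free"), 6))
```

Because every dataset row counts as a result event, rows emitted for broken sitemap files are included in `rows` as well.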
How to use it
- Open the actor on Apify.
- Add one or more sitemap URLs, website roots, or domains.
- Set maxUrls to the number of URLs you want to audit.
- Keep headFirst enabled for faster status checks.
- Enable includePageMetadata only when you need canonical and robots meta extraction.
- Run the actor.
- Export the dataset as JSON, CSV, Excel, or connect it to your workflow.
Input settings
Sitemap URLs or websites
Use startUrls for XML sitemap URLs, sitemap index URLs, website roots, or domains.
Examples:
- https://example.com/sitemap.xml
- https://example.com/sitemap_index.xml
- https://example.com/
- example.com
Website roots and bare domains automatically resolve to /sitemap.xml.
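As a rough illustration of that resolution rule, the sketch below maps each accepted input form to a sitemap URL. The helper name `to_sitemap_url` and the choice of `https://` for bare domains are assumptions; the actor's real normalization may differ.

```python
from urllib.parse import urlsplit


def to_sitemap_url(value: str) -> str:
    """Map a sitemap URL, website root, or bare domain to the sitemap URL to fetch."""
    value = value.strip()
    if "://" not in value:
        value = "https://" + value  # assumption: bare domains default to HTTPS
    parts = urlsplit(value)
    if parts.path in ("", "/"):
        # Website root or bare domain: fall back to the conventional location.
        return f"{parts.scheme}://{parts.netloc}/sitemap.xml"
    return value  # already points at a specific file, keep it unchanged
```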
Additional domains
Use domains when you want a simple list of extra domains in addition to startUrls.
Each domain is converted to a sitemap URL.
Maximum URLs
maxUrls controls the maximum unique page URLs audited.
Start with 100 for a cheap test.
Increase to 1,000 or more for full-site checks.
Maximum sitemap files
maxSitemaps prevents very large sitemap indexes from expanding forever.
Large ecommerce sites can have hundreds of sitemap files.
Maximum sitemap index depth
maxDepth controls how deeply nested sitemap indexes are followed.
The default of 3 is enough for normal sitemap structures.
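The two caps interact during index expansion roughly as in the sketch below. It is an illustration under stated assumptions: `expand` and the injected `fetch` callable are hypothetical names, the `max_sitemaps` default of 50 is a placeholder, and the starting sitemap is treated as depth 0.

```python
from collections import deque


def expand(start, fetch, max_depth=3, max_sitemaps=50):
    """Breadth-first expansion of sitemap indexes.

    fetch(url) must return ("index" or "urlset", list of <loc> values).
    Returns {page_url: (source_sitemap, depth)}, keeping the first sighting of each URL.
    """
    queue = deque([(start, 0)])
    seen_sitemaps = {start}
    pages = {}
    fetched = 0
    while queue and fetched < max_sitemaps:   # cap on sitemap files downloaded
        sitemap_url, depth = queue.popleft()
        kind, locs = fetch(sitemap_url)
        fetched += 1
        for loc in locs:
            if kind == "index":
                # Only descend while the child stays within the depth cap.
                if depth + 1 <= max_depth and loc not in seen_sitemaps:
                    seen_sitemaps.add(loc)
                    queue.append((loc, depth + 1))
            else:
                pages.setdefault(loc, (sitemap_url, depth))
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from HTTP and XML handling, which also makes it easy to try against an in-memory dictionary of sitemaps.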
Concurrency
concurrency controls how many URL checks run in parallel.
Use lower values for small or fragile sites.
Use higher values for robust sites and faster audits.
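One common way to bound parallel checks is a semaphore, sketched below with asyncio. This illustrates the idea only; `check_all`, the injected `check_one` coroutine, and the default of 10 are assumptions, not the actor's internals.

```python
import asyncio


async def check_all(urls, check_one, concurrency=10):
    """Run check_one(url) for every URL with at most `concurrency` checks in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with semaphore:        # wait here while the limit is reached
            return await check_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

A lower limit spreads the same number of requests over a longer period, which is why it helps with small or fragile sites.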
Request timeout and retries
requestTimeoutSecs and maxRetries balance speed and reliability.
Timeouts are recorded as dataset rows rather than crashing the whole run.
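The "record instead of crash" behavior can be pictured as a retry loop that always returns a row. The sketch below is hypothetical: `check_with_retries` and the injected `fetch` callable are illustrative names, and mapping 4xx/5xx responses to `http_error` is an assumption based on the field list above.

```python
def check_with_retries(url, fetch, max_retries=2):
    """fetch(url) returns an HTTP status code or raises TimeoutError.

    Always returns a dataset-style row, even when every attempt times out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            status = fetch(url)
        except TimeoutError as exc:
            if attempts > max_retries:   # first try plus max_retries retries all failed
                return {"url": url, "statusCode": None, "ok": False,
                        "errorCategory": "timeout", "errorMessage": str(exc)}
            continue                     # retry
        ok = 200 <= status < 400         # 2xx and 3xx count as OK
        return {"url": url, "statusCode": status, "ok": ok,
                "errorCategory": "none" if ok else "http_error",
                "errorMessage": None}
```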
Use HEAD before GET
headFirst checks URLs with HEAD first and falls back to GET when needed.
This keeps most audits fast and lightweight.
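A minimal sketch of the HEAD-then-GET pattern follows. The set of statuses that trigger the fallback is an assumption (the actor's exact list is not documented here), and `head_then_get` with its injected `request` callable is a hypothetical helper.

```python
# Assumption: statuses that often mean "HEAD is not supported or is blocked".
FALLBACK_STATUSES = {403, 405, 501}


def head_then_get(url, request):
    """request(method, url) returns an HTTP status code.

    Returns (method_used, status_code), retrying with GET only when HEAD looks rejected.
    """
    status = request("HEAD", url)
    if status in FALLBACK_STATUSES:
        return "GET", request("GET", url)
    return "HEAD", status
```

HEAD responses carry no body, so most URLs are checked without downloading the page; only the rejected ones pay for a full GET.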
Follow redirects
followRedirects records the final URL and redirect count.
This is useful for migrations and canonicalization checks.
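To make the redirect fields concrete, the sketch below follows Location headers by hand and builds the same four values. It is illustrative only: `follow_redirects` and the injected `fetch` callable are hypothetical, the cap of 10 hops is an assumption, and whether the actor's redirectChain includes the final URL depends on its HTTP client.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_redirects=10):
    """fetch(url) returns (status_code, location_header_or_None)."""
    chain = []
    current = url
    status, location = fetch(current)
    while status in REDIRECT_STATUSES and location and len(chain) < max_redirects:
        current = urljoin(current, location)   # Location may be relative
        chain.append(current)
        status, location = fetch(current)
    return {
        "statusCode": status,
        "finalUrl": current,
        "redirectCount": len(chain),
        "redirectChain": chain,
    }
```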
Include page metadata
includePageMetadata fetches full HTML pages with GET and extracts canonical and robots meta tags.
Enable it for deeper SEO audits.
Leave it off for faster pure status checks.
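For a sense of what metadata extraction involves, here is a small standard-library sketch that pulls the first canonical link and robots meta tag out of an HTML string. The class and function names are hypothetical, and the actor's own parser may handle edge cases differently.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the first <link rel="canonical"> and <meta name="robots"> values."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and self.canonical is None:
            if "canonical" in attr.get("rel", "").lower().split():
                self.canonical = attr.get("href") or None
        elif tag == "meta" and self.robots is None:
            if attr.get("name", "").lower() == "robots":
                self.robots = attr.get("content") or None


def extract_metadata(html: str) -> dict:
    parser = MetaExtractor()
    parser.feed(html)
    return {"canonicalUrl": parser.canonical, "robotsMeta": parser.robots}
```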
User-Agent
The default User-Agent identifies the actor politely.
You can override it for internal policies or target-specific requirements.
Output example
```json
{
    "url": "https://example.com/about/",
    "sourceSitemap": "https://example.com/sitemap.xml",
    "sitemapDepth": 0,
    "statusCode": 200,
    "ok": true,
    "method": "HEAD",
    "finalUrl": "https://example.com/about/",
    "redirectCount": 0,
    "redirectChain": [],
    "contentType": "text/html; charset=utf-8",
    "contentLength": 12345,
    "responseTimeMs": 184,
    "errorCategory": "none",
    "errorMessage": null,
    "canonicalUrl": null,
    "robotsMeta": null,
    "xRobotsTag": null,
    "checkedAt": "2026-06-27T00:00:00.000Z"
}
```
Common workflows
Broken sitemap URL audit
Run the actor with maxUrls set to at least the number of URLs in your sitemap.
Filter output where ok is false.
Review errorCategory and errorMessage.
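Once the dataset is exported as JSON, the filtering step above takes only a few lines. The sketch assumes `rows` is a list of dictionaries shaped like the output example; `summarize_failures` is a hypothetical helper name.

```python
from collections import Counter


def summarize_failures(rows):
    """Return (failed rows, count of failures per errorCategory)."""
    failed = [row for row in rows if not row.get("ok")]
    by_category = Counter(row.get("errorCategory") for row in failed)
    return failed, by_category
```

Sorting the counter with `by_category.most_common()` shows which failure type dominates before you read individual errorMessage values.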
Redirect migration QA
Run before and after a migration.
Compare finalUrl, redirectCount, and statusCode.
Flag URLs with long redirect chains or unexpected final domains.
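A before-and-after comparison can be scripted against two exported datasets. In the sketch below, `migration_issues`, the `expected_host` argument, and the one-redirect threshold are assumptions chosen for illustration; adjust them to your migration plan.

```python
from urllib.parse import urlsplit


def migration_issues(before, after, expected_host, max_redirects=1):
    """Compare two runs (lists of dataset rows) and list suspicious URLs with reasons."""
    previous = {row["url"]: row for row in before}
    issues = []
    for row in after:
        reasons = []
        old = previous.get(row["url"])
        if old is not None and old.get("statusCode") != row.get("statusCode"):
            reasons.append(f"status {old.get('statusCode')} -> {row.get('statusCode')}")
        if (row.get("redirectCount") or 0) > max_redirects:
            reasons.append("long redirect chain")
        final_host = urlsplit(row.get("finalUrl") or "").hostname
        if final_host and final_host != expected_host:
            reasons.append(f"unexpected final host {final_host}")
        if reasons:
            issues.append((row["url"], reasons))
    return issues
```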
Canonical and robots review
Enable includePageMetadata.
Filter rows where canonical URLs are missing, unexpected, or off-domain.
Review robotsMeta and xRobotsTag for accidental noindex directives.
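These review rules translate into a small per-row check. The sketch is a starting point under assumptions: `review_row` is a hypothetical helper, "off-domain" is taken to mean a different hostname than the audited URL, and the missing-canonical finding is only meaningful for runs with includePageMetadata enabled.

```python
from urllib.parse import urlsplit


def review_row(row):
    """Return a list of SEO findings for one dataset row."""
    findings = []
    canonical = row.get("canonicalUrl")
    if not canonical:
        findings.append("missing canonical")
    elif urlsplit(canonical).hostname != urlsplit(row["url"]).hostname:
        findings.append("off-domain canonical")
    # Check both the meta tag and the response header for noindex.
    directives = f"{row.get('robotsMeta') or ''},{row.get('xRobotsTag') or ''}".lower()
    if "noindex" in directives:
        findings.append("noindex")
    return findings
```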
Release smoke test
Schedule a small run after deployment.
Use maxUrls to audit the most important sitemap subset.
Send failures into Slack, email, or a dashboard with Apify integrations.
Integrations
Connect the dataset to Google Sheets for SEO reports.
Use Apify webhooks to send failed URL rows to monitoring systems.
Pull results with the Apify API into Looker Studio, BigQuery, Snowflake, or your internal QA tools.
Run it from CI/CD after a website deployment.
Use recurring tasks for daily or weekly sitemap status monitoring.
API usage
Node.js
```js
import { ApifyClient } from 'apify-client';

// Read the API token from the environment rather than hard-coding it.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/sitemap-url-status-auditor').call({
    startUrls: [{ url: 'https://apify.com/sitemap.xml' }],
    maxUrls: 100,
});

console.log(run.defaultDatasetId);
```
Python
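```python
import os

from apify_client import ApifyClient

# Read the API token from the environment rather than hard-coding it.
client = ApifyClient(os.environ['APIFY_TOKEN'])

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/sitemap-url-status-auditor').call(
    run_input={
        'startUrls': [{'url': 'https://apify.com/sitemap.xml'}],
        'maxUrls': 100,
    }
)

print(run['defaultDatasetId'])
```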
cURL
```bash
curl -X POST 'https://api.apify.com/v2/acts/automation-lab~sitemap-url-status-auditor/runs?token=YOUR_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"startUrls":[{"url":"https://apify.com/sitemap.xml"}],"maxUrls":100}'
```
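The examples above stop at the dataset ID. Reading the rows back might look like the sketch below, which assumes `client` is an `ApifyClient` from the Python `apify-client` package and relies on its `dataset(...).iterate_items()` method; `failed_rows` is a hypothetical helper name.

```python
def failed_rows(client, dataset_id):
    """Yield dataset rows whose ok field is false."""
    for item in client.dataset(dataset_id).iterate_items():
        if not item.get("ok"):
            yield item


# Usage with the Python example above (requires a finished run):
# for row in failed_rows(client, run['defaultDatasetId']):
#     print(row['url'], row['statusCode'], row['errorCategory'])
```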
MCP usage
Use this actor from MCP-compatible clients through Apify MCP Server.
Claude Desktop MCP URL:
https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor
Claude Code MCP URL:
https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor
Claude Code setup command:
claude mcp add apify-sitemap-auditor "https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor"
Claude Desktop JSON config example:
```json
{
    "mcpServers": {
        "apify-sitemap-auditor": {
            "url": "https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor"
        }
    }
}
```
Example prompts:
- "Audit this sitemap and summarize broken URLs."
- "Check redirect counts for URLs in this sitemap index."
- "Find sitemap URLs that return 404, 500, timeout, or blocked responses."
- "Run a canonical and robots metadata audit for this sitemap."
Tips for best results
Start with a small maxUrls value.
Use exact sitemap XML URLs when you know them.
Reduce concurrency when a site returns 429 or intermittent errors.
Enable metadata extraction only when canonical or robots tags matter.
Keep sitemap and URL caps aligned with your budget.
Review sitemap error rows; they often reveal invalid sitemap indexes or blocked XML files.
Troubleshooting
The actor says no <loc> URLs were found
Check that the input URL points to XML sitemap content, not an HTML page.
If you entered a website root, confirm /sitemap.xml exists.
Many URLs show blocked or 403
Lower concurrency and use a clear User-Agent.
Some sites block automated HEAD requests; the actor falls back to GET for common blocked HEAD statuses.
Metadata fields are empty
Canonical and robots meta fields require includePageMetadata to be enabled.
Headers such as xRobotsTag can still appear without metadata mode.
The run is slower than expected
Large sitemap indexes, slow target servers, metadata extraction, redirects, and retries increase runtime.
Lower maxUrls or increase concurrency carefully.
Data quality notes
HTTP status checks reflect the response seen during the run.
Target websites can rate-limit, geo-vary, or serve different responses to different clients.
The actor records those outcomes instead of hiding them.
Redirect chains depend on what the HTTP client exposes after following redirects.
Legality and ethics
This actor is designed for public XML sitemaps and public URLs.
Use it on websites you own, manage, audit, or are otherwise authorized to check.
Respect target website terms, rate limits, and robots guidance.
Reduce concurrency if a site appears stressed or rate-limited.
FAQ
Can this actor audit sitemap indexes?
Yes. It detects sitemap indexes and recursively expands nested sitemap files up to maxDepth and maxSitemaps.
Does it need a browser?
No. It is an HTTP-only actor for public XML and URL checks.
Can it audit password-protected staging sites?
Not in the default workflow. Public unauthenticated URLs are the intended use case.
Does it use proxies?
No proxy is required by default. It performs plain HTTP requests from the actor runtime.
Can I schedule recurring audits?
Yes. Use Apify tasks and schedules to run the same input daily, weekly, or after deployments.
Can I export results?
Yes. Apify datasets export to JSON, CSV, Excel, XML, RSS, and HTML table formats.
How do I audit more than one domain?
Add multiple sitemap URLs to startUrls or put additional domains in domains.
What counts as OK?
The ok field is true for HTTP 2xx and 3xx responses.
What happens when a sitemap URL is broken?
The actor emits an error row for the sitemap itself with method SITEMAP and a normalized error category.
What happens when a page URL times out?
The actor emits a row for that URL with statusCode null, ok false, and errorCategory set to timeout.