Sitemap URL Status Auditor

Pricing: Pay per event

Audit XML sitemaps for broken URLs, redirects, HTTP status codes, response timing, content type, canonical tags, and robots metadata.

Developer: Stas Persiianenko (Maintained by Community)

Audit every URL listed in XML sitemaps and sitemap indexes for HTTP status, redirects, response timing, content type, and optional SEO metadata.

Use this actor when you need a repeatable sitemap checker for migrations, releases, content QA, and technical SEO monitoring.

What does Sitemap URL Status Auditor do?

Sitemap URL Status Auditor starts from one or more XML sitemap URLs, sitemap indexes, website roots, or domains.

It downloads sitemap XML files, follows nested sitemap indexes, extracts <loc> URLs, deduplicates them, and checks each listed page URL.

For each URL, it records status code, final URL, redirect count, redirect chain, response time, content type, content length, and a normalized error category.

Optionally, it can fetch page HTML to extract canonical URLs and robots meta tags.
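
The extraction and deduplication step described above can be sketched roughly as follows. This is a minimal illustration using Python's standard library, not the actor's actual implementation; the helper names `_local` and `extract_locs` are hypothetical.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    # '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' -> 'loc'
    return tag.rsplit("}", 1)[-1]


def extract_locs(xml_text: str) -> tuple[str, list[str]]:
    """Return (root kind, unique <loc> values in document order).

    The root kind is 'urlset' for a regular sitemap and 'sitemapindex'
    for a sitemap index.
    """
    root = ET.fromstring(xml_text)
    kind = _local(root.tag)
    unique: dict[str, None] = {}  # dict keys keep insertion order
    for entry in root:            # <url> or <sitemap> elements
        for child in entry:       # direct children only, so nested image <loc> tags are skipped
            if _local(child.tag) == "loc" and child.text:
                unique.setdefault(child.text.strip(), None)
    return kind, list(unique)
```

Reading only the direct `<loc>` child of each entry is a design choice in this sketch: it avoids picking up `<loc>` elements from image or video sitemap extensions.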

Who is it for?

SEO specialists use it to catch broken URLs in sitemaps before search engines waste crawl budget.

Web QA teams use it after deployments to confirm that sitemap URLs still resolve.

Migration teams use it to check final URLs and redirect counts after domain, CMS, or URL-structure changes.

Agencies use it for recurring client health checks and exportable audit evidence.

Developers use it as a fast HTTP-only smoke test for public sitemap quality.

Why use this actor?

🗺️ It is sitemap-first, not just a generic URL checker.

🔁 It recursively expands sitemap indexes.

🧹 It deduplicates URLs before checking them.

🚦 It uses HEAD first with GET fallback for servers that reject HEAD.

📊 It returns one clean dataset table for export, dashboards, and alerts.

⚙️ It includes concurrency, caps, timeout, retry, and polite User-Agent controls.

What data can you extract?

Field            Description
url              Page URL discovered in the sitemap.
sourceSitemap    Sitemap XML file where the URL was found.
sitemapDepth     Recursion depth inside sitemap indexes.
statusCode       HTTP status code, or null for network failures.
ok               True for 2xx and 3xx responses.
method           HEAD, GET, SITEMAP, or NONE.
finalUrl         Final URL after redirects.
redirectCount    Number of redirects followed.
redirectChain    Redirect URLs exposed by the HTTP client.
contentType      Content-Type response header.
contentLength    Content-Length header when available.
responseTimeMs   Request duration in milliseconds.
errorCategory    none, http_error, timeout, dns_error, tls_error, network_error, parse_error, blocked, or not_checked.
errorMessage     Human-readable error message.
canonicalUrl     Canonical link when metadata extraction is enabled.
robotsMeta       Robots meta tag when metadata extraction is enabled.
xRobotsTag       X-Robots-Tag response header.
checkedAt        ISO timestamp of the check.

How much does it cost to audit sitemap URL status?

This actor uses pay-per-event pricing.

There is a $0.005 start event for each run and a per-URL result event for each dataset row produced.

Plan tier          Per URL result
Free               $0.000029952
Starter / Bronze   $0.000026046
Scale / Silver     $0.000020316
Business / Gold    $0.000015627
Platinum           $0.000010418
Diamond            $0.000010000

For most users, cost scales with the number of sitemap URLs checked.

Use maxUrls for small first tests and increase it after you confirm the sitemap source is correct.
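
To budget a run before starting it, the pricing above reduces to simple arithmetic: the start event plus the per-row price times the number of dataset rows. The sketch below uses the prices listed on this page; the function name `estimate_run_cost` is hypothetical, and actual billing is determined by Apify, not by this calculation.

```python
# Prices copied from the pricing table above (USD).
START_EVENT_USD = 0.005
PER_ROW_USD = {
    "Free": 0.000029952,
    "Starter / Bronze": 0.000026046,
    "Scale / Silver": 0.000020316,
    "Business / Gold": 0.000015627,
    "Platinum": 0.000010418,
    "Diamond": 0.000010000,
}


def estimate_run_cost(rows: int, tier: str = "Free") -> float:
    """Estimate the cost of one run that produces `rows` dataset rows."""
    return START_EVENT_USD + rows * PER_ROW_USD[tier]
```

For example, 1,000 rows on the Free tier comes to $0.005 + 1,000 × $0.000029952, or about $0.035. Sitemap error rows are dataset rows too, so they count toward the total.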

How to use it

  1. Open the actor on Apify.
  2. Add one or more sitemap URLs, website roots, or domains.
  3. Set maxUrls to the number of URLs you want to audit.
  4. Keep headFirst enabled for faster status checks.
  5. Enable includePageMetadata only when you need canonical and robots meta extraction.
  6. Run the actor.
  7. Export the dataset as JSON, CSV, Excel, or connect it to your workflow.

Input settings

Sitemap URLs or websites

Use startUrls for XML sitemap URLs, sitemap index URLs, website roots, or domains.

Examples:

  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/
  • example.com

Website roots and bare domains automatically resolve to /sitemap.xml.
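
The resolution rule above can be pictured with a small sketch. It assumes that bare domains default to HTTPS, which this page does not state explicitly, and the function name `to_sitemap_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_sitemap_url(value: str) -> str:
    """Turn a sitemap URL, website root, or bare domain into a sitemap URL."""
    value = value.strip()
    if "://" not in value:
        value = "https://" + value  # assumption: bare domains use HTTPS
    parts = urlsplit(value)
    path = parts.path
    if path in ("", "/"):
        path = "/sitemap.xml"       # roots and bare domains resolve to /sitemap.xml
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
```

Explicit sitemap URLs pass through unchanged, so `https://example.com/sitemap_index.xml` stays as it is while `example.com` becomes `https://example.com/sitemap.xml`.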

Additional domains

Use domains when you want a simple list of extra domains in addition to startUrls.

Each domain is converted to a sitemap URL.

Maximum URLs

maxUrls controls the maximum unique page URLs audited.

Start with 100 for a cheap test.

Increase to 1,000 or more for full-site checks.

Maximum sitemap files

maxSitemaps caps how many sitemap files are fetched, so very large sitemap indexes cannot expand without limit.

Large ecommerce sites can have hundreds of sitemap files.

Maximum sitemap index depth

maxDepth controls how deeply nested sitemap indexes are followed.

The default of 3 is enough for normal sitemap structures.
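
How the two caps interact can be illustrated with a breadth-first walk over nested sitemap indexes. This is a sketch under assumptions, not the actor's code: `fetch_xml` is a hypothetical callable that returns the XML text for a URL, the starting sitemap is counted as depth 0, and the default of 50 for `max_sitemaps` is made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable


def _local(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix


def expand_sitemaps(
    start_url: str,
    fetch_xml: Callable[[str], str],
    max_sitemaps: int = 50,  # hypothetical default
    max_depth: int = 3,
) -> list[tuple[str, str, int]]:
    """Return (page_url, source_sitemap, depth) for each unique page URL."""
    queue = deque([(start_url, 0)])
    visited: set[str] = set()
    seen_pages: set[str] = set()
    pages: list[tuple[str, str, int]] = []
    while queue and len(visited) < max_sitemaps:
        sitemap_url, depth = queue.popleft()
        if sitemap_url in visited:
            continue
        visited.add(sitemap_url)
        root = ET.fromstring(fetch_xml(sitemap_url))
        is_index = _local(root.tag) == "sitemapindex"
        for entry in root:
            for child in entry:
                if _local(child.tag) != "loc" or not child.text:
                    continue
                loc = child.text.strip()
                if is_index:
                    if depth + 1 <= max_depth:  # do not descend past the depth cap
                        queue.append((loc, depth + 1))
                elif loc not in seen_pages:
                    seen_pages.add(loc)
                    pages.append((loc, sitemap_url, depth))
    return pages
```

In this sketch the sitemap cap stops the walk once enough files have been fetched, while the depth cap only prevents descending further into nested indexes.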

Concurrency

concurrency controls how many URL checks run in parallel.

Use lower values for small or fragile sites.

Use higher values for robust sites and faster audits.
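
As a rough mental model of what a concurrency limit means, the sketch below runs checks on a bounded thread pool. The actor's own scheduling is not documented here, so treat this as an illustration only; `check_one` is a hypothetical callable that checks a single URL and returns a result row.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def check_all(
    urls: list[str],
    check_one: Callable[[str], dict],
    concurrency: int = 10,
) -> list[dict]:
    """Run check_one over urls with at most `concurrency` checks in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(check_one, urls))
```

A lower `concurrency` value means fewer simultaneous requests against the target server, at the cost of a longer total runtime.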

Request timeout and retries

requestTimeoutSecs and maxRetries balance speed and reliability.

Timeouts are recorded as dataset rows rather than crashing the whole run.
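
The idea of recording a timeout as a row instead of failing the run can be sketched like this. It assumes that only timeouts are retried and that non-2xx/3xx responses map to `http_error`; neither detail is confirmed by this page. `fetch_status` is a hypothetical callable that returns an HTTP status code or raises `TimeoutError`.

```python
import socket
import time
from typing import Callable


def check_with_retries(
    url: str,
    fetch_status: Callable[[str], int],
    max_retries: int = 2,
    backoff_secs: float = 0.5,
) -> dict:
    """Try a URL up to max_retries + 1 times and always return a row."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            status = fetch_status(url)
        except (TimeoutError, socket.timeout) as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(backoff_secs * 2 ** attempt)  # exponential backoff
            continue
        ok = 200 <= status < 400  # ok is true for 2xx and 3xx
        return {
            "url": url,
            "statusCode": status,
            "ok": ok,
            "errorCategory": "none" if ok else "http_error",
            "errorMessage": None,
        }
    # Every attempt timed out: emit a row rather than raising.
    return {
        "url": url,
        "statusCode": None,
        "ok": False,
        "errorCategory": "timeout",
        "errorMessage": str(last_error),
    }
```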

Use HEAD before GET

headFirst checks URLs with HEAD first and falls back to GET when needed.

This keeps most audits fast and lightweight.
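
The HEAD-then-GET pattern can be sketched as below. Which statuses trigger the fallback is an assumption (403, 405, and 501 are used here for illustration); the actor's actual list is not given on this page. Both function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumption: statuses that suggest the server rejects HEAD requests.
FALLBACK_STATUSES = frozenset({403, 405, 501})


def urllib_request(method: str, url: str, timeout: float = 10.0) -> int:
    """Send one request with the standard library and return the status code."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def head_first(
    url: str,
    request: Callable[[str, str], int] = urllib_request,
) -> tuple[str, int]:
    """Return (method used, status code), trying HEAD before GET."""
    status = request("HEAD", url)
    if status in FALLBACK_STATUSES:
        return "GET", request("GET", url)
    return "HEAD", status
```

HEAD responses carry no body, which is why most checks stay lightweight; the GET fallback costs a full download only for the URLs that need it.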

Follow redirects

When followRedirects is enabled, the actor follows redirects and records the final URL and redirect count.

This is useful for migrations and canonicalization checks.

Include page metadata

includePageMetadata fetches full HTML pages with GET and extracts canonical and robots meta tags.

Enable it for deeper SEO audits.

Leave it off for faster pure status checks.
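
To show what canonical and robots meta extraction involves, here is a minimal sketch built on Python's standard `html.parser`. It is not the actor's parser; the class and function names are hypothetical, and it keeps only the first matching tag of each kind.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaExtractor(HTMLParser):
    """Collect the first canonical link and the first robots meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_url: Optional[str] = None
        self.robots_meta: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        a = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if self.canonical_url is None:
                self.canonical_url = a.get("href") or None
        elif tag == "meta" and a.get("name", "").lower() == "robots":
            if self.robots_meta is None:
                self.robots_meta = a.get("content") or None


def extract_metadata(html: str) -> dict:
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"canonicalUrl": parser.canonical_url, "robotsMeta": parser.robots_meta}
```

Extraction like this needs the HTML body, which is why metadata mode requires a GET request for every page.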

User-Agent

The default User-Agent identifies the actor politely.

You can override it for internal policies or target-specific requirements.

Output example

{
    "url": "https://example.com/about/",
    "sourceSitemap": "https://example.com/sitemap.xml",
    "sitemapDepth": 0,
    "statusCode": 200,
    "ok": true,
    "method": "HEAD",
    "finalUrl": "https://example.com/about/",
    "redirectCount": 0,
    "redirectChain": [],
    "contentType": "text/html; charset=utf-8",
    "contentLength": 12345,
    "responseTimeMs": 184,
    "errorCategory": "none",
    "errorMessage": null,
    "canonicalUrl": null,
    "robotsMeta": null,
    "xRobotsTag": null,
    "checkedAt": "2026-06-27T00:00:00.000Z"
}

Common workflows

Broken sitemap URL audit

Run the actor with maxUrls set to your sitemap size.

Filter output where ok is false.

Review errorCategory and errorMessage.
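
Once the dataset rows are loaded as a list of dictionaries, this workflow can be sketched as a filter plus a count per error category. The function name `summarize_failures` is hypothetical; the field names come from the output table on this page.

```python
from collections import Counter


def summarize_failures(rows: list[dict]) -> tuple[list[dict], Counter]:
    """Return the rows where ok is false, plus a count per errorCategory."""
    failures = [row for row in rows if not row.get("ok")]
    by_category = Counter(row.get("errorCategory", "unknown") for row in failures)
    return failures, by_category
```

The per-category counts give a quick overview before reading individual `errorMessage` values.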

Redirect migration QA

Run before and after a migration.

Compare finalUrl, redirectCount, and statusCode.

Flag URLs with long redirect chains or unexpected final domains.
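
One possible way to run this comparison on exported rows is sketched below. The function names, the `expected_host` parameter, and the default threshold of one redirect are all hypothetical choices for illustration.

```python
from urllib.parse import urlsplit

COMPARED_FIELDS = ("statusCode", "finalUrl", "redirectCount")


def diff_runs(before: list[dict], after: list[dict]) -> dict:
    """Map url -> {field: (old, new)} for URLs in both runs whose fields changed."""
    old_by_url = {row["url"]: row for row in before}
    changes = {}
    for row in after:
        old = old_by_url.get(row["url"])
        if old is None:
            continue  # URL only exists in the newer run
        changed = {
            field: (old.get(field), row.get(field))
            for field in COMPARED_FIELDS
            if old.get(field) != row.get(field)
        }
        if changed:
            changes[row["url"]] = changed
    return changes


def flag_redirects(rows: list[dict], expected_host: str, max_redirects: int = 1) -> list[dict]:
    """Flag rows with long redirect chains or an unexpected final host."""
    flagged = []
    for row in rows:
        reasons = []
        if (row.get("redirectCount") or 0) > max_redirects:
            reasons.append("long_chain")
        final_url = row.get("finalUrl")
        if final_url and urlsplit(final_url).hostname != expected_host:
            reasons.append("unexpected_host")
        if reasons:
            flagged.append({"url": row["url"], "reasons": reasons})
    return flagged
```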

Canonical and robots review

Enable includePageMetadata.

Filter rows where canonical URLs are missing, unexpected, or off-domain.

Review robotsMeta and xRobotsTag for accidental noindex directives.
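
A review pass along these lines might look like the sketch below. It treats a canonical URL as off-domain when its host differs from the page URL's host, which is a simplification (for example, it would flag a www to non-www canonical). The function name and issue labels are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def review_metadata(rows: list[dict]) -> list[dict]:
    """Flag missing or off-domain canonicals and noindex directives."""
    flagged = []
    for row in rows:
        issues = []
        canonical = row.get("canonicalUrl")
        if not canonical:
            issues.append("missing_canonical")
        else:
            # Resolve relative canonicals against the page URL before comparing hosts.
            absolute = urljoin(row["url"], canonical)
            if urlsplit(absolute).hostname != urlsplit(row["url"]).hostname:
                issues.append("off_domain_canonical")
        directives = ",".join(
            value for value in (row.get("robotsMeta"), row.get("xRobotsTag")) if value
        ).lower()
        if "noindex" in directives:
            issues.append("noindex")
        if issues:
            flagged.append({"url": row["url"], "issues": issues})
    return flagged
```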

Release smoke test

Schedule a small run after deployment.

Use maxUrls to audit the most important sitemap subset.

Send failures into Slack, email, or a dashboard with Apify integrations.

Integrations

Connect the dataset to Google Sheets for SEO reports.

Use Apify webhooks to send failed URL rows to monitoring systems.

Pull results with the Apify API into Looker Studio, BigQuery, Snowflake, or your internal QA tools.

Run it from CI/CD after a website deployment.

Use recurring tasks for daily or weekly sitemap status monitoring.

API usage

Node.js

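import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/sitemap-url-status-auditor').call({
    startUrls: [{ url: 'https://apify.com/sitemap.xml' }],
    maxUrls: 100,
});

// The audit rows are stored in the run's default dataset.
console.log(run.defaultDatasetId);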

Python

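import os

from apify_client import ApifyClient

# Authenticate with the API token from the environment.
client = ApifyClient(os.environ['APIFY_TOKEN'])

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/sitemap-url-status-auditor').call(run_input={
    'startUrls': [{'url': 'https://apify.com/sitemap.xml'}],
    'maxUrls': 100,
})

# The audit rows are stored in the run's default dataset.
print(run['defaultDatasetId'])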
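
The Python example above prints only the dataset ID. To read the audit rows themselves, a helper along these lines should work with the `apify-client` package, whose dataset client provides `iterate_items()`. The helper name `load_rows` is hypothetical, and the client is passed in as a parameter so the sketch has no import-time dependency.

```python
def load_rows(client, dataset_id: str) -> list[dict]:
    """Read every row from a dataset into a list.

    `client` is expected to be an apify_client.ApifyClient instance.
    """
    return list(client.dataset(dataset_id).iterate_items())


# Usage with the `client` and `run` objects from the Python example above:
#   rows = load_rows(client, run['defaultDatasetId'])
#   failed = [row for row in rows if not row['ok']]
```

Loading everything into a list is fine for small audits; for very large datasets, iterating over `iterate_items()` directly avoids holding all rows in memory.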

cURL

curl -X POST 'https://api.apify.com/v2/acts/automation-lab~sitemap-url-status-auditor/runs?token=YOUR_TOKEN' \
    -H 'Content-Type: application/json' \
    -d '{"startUrls":[{"url":"https://apify.com/sitemap.xml"}],"maxUrls":100}'

MCP usage

Use this actor from MCP-compatible clients through Apify MCP Server.

Claude Desktop and Claude Code use the same MCP URL:

https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor

Claude Code setup command:

$ claude mcp add --transport http apify-sitemap-auditor "https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor"

Claude Desktop JSON config example:

{
    "mcpServers": {
        "apify-sitemap-auditor": {
            "url": "https://mcp.apify.com/?tools=automation-lab/sitemap-url-status-auditor"
        }
    }
}

Example prompts:

  • "Audit this sitemap and summarize broken URLs."
  • "Check redirect counts for URLs in this sitemap index."
  • "Find sitemap URLs that return 404, 500, timeout, or blocked responses."
  • "Run a canonical and robots metadata audit for this sitemap."

Tips for best results

Start with a small maxUrls value.

Use exact sitemap XML URLs when you know them.

Reduce concurrency when a site returns 429 or intermittent errors.

Enable metadata extraction only when canonical or robots tags matter.

Keep sitemap and URL caps aligned with your budget.

Review sitemap error rows; they often reveal invalid sitemap indexes or blocked XML files.

Troubleshooting

The actor says no <loc> URLs were found

Check that the input URL points to XML sitemap content, not an HTML page.

If you entered a website root, confirm /sitemap.xml exists.

Many URLs show blocked or 403

Lower concurrency and use a clear User-Agent.

Some sites block automated HEAD requests; the actor falls back to GET for common blocked HEAD statuses.

Metadata fields are empty

Canonical and robots meta fields require includePageMetadata to be enabled.

Header-based fields such as xRobotsTag can still be populated without metadata mode, because they come from the HTTP response headers rather than the page HTML.

The run is slower than expected

Large sitemap indexes, slow target servers, metadata extraction, redirects, and retries increase runtime.

Lower maxUrls or increase concurrency carefully.

Data quality notes

HTTP status checks reflect the response seen during the run.

Target websites can rate-limit, geo-vary, or serve different responses to different clients.

The actor records those outcomes instead of hiding them.

Redirect chains depend on what the HTTP client exposes after following redirects.

Legality and ethics

This actor is designed for public XML sitemaps and public URLs.

Use it on websites you own, manage, audit, or are otherwise authorized to check.

Respect target website terms, rate limits, and robots guidance.

Reduce concurrency if a site appears stressed or rate-limited.

FAQ

Can this actor audit sitemap indexes?

Yes. It detects sitemap indexes and recursively expands nested sitemap files up to maxDepth and maxSitemaps.

Does it need a browser?

No. It is an HTTP-only actor for public XML and URL checks.

Can it audit password-protected staging sites?

Not in the default workflow. Public unauthenticated URLs are the intended use case.

Does it use proxies?

No proxy is required by default. It performs plain HTTP requests from the actor runtime.

Can I schedule recurring audits?

Yes. Use Apify tasks and schedules to run the same input daily, weekly, or after deployments.

Can I export results?

Yes. Apify datasets export to JSON, CSV, Excel, XML, RSS, and HTML table formats.

How do I audit more than one domain?

Add multiple sitemap URLs to startUrls or put additional domains in domains.

What counts as OK?

The ok field is true for HTTP 2xx and 3xx responses.

What happens when a sitemap URL is broken?

The actor emits an error row for the sitemap itself with method SITEMAP and a normalized error category.

What happens when a page URL times out?

The actor emits a row for that URL with statusCode null, ok false, and errorCategory set to timeout.