Sitemap Analyzer — Recursive Parse, Health Check, AI Tags

Pricing

Pay per usage

Parse any sitemap.xml recursively, extract all URLs with metadata, check HTTP health status, and optionally cluster URLs by topic using Claude AI. Perfect for SEO audits and site migration.

Developer

Andrei

Maintained by Community

Sitemap Analyzer

Parse any sitemap recursively, extract every URL with metadata, check HTTP health, and optionally cluster URLs by topic with AI. Built for SEO audits, site migrations, and competitor analysis.

What this actor does

Most sitemap extractors just give you a flat list of URLs. This actor goes further:

  • Smart resolution — accepts a bare domain, a subfolder URL, or a direct sitemap URL, and finds the right sitemap for each
  • Recursive parsing — handles nested sitemap-index files (parent to child) automatically
  • Gzip support — transparently decompresses .xml.gz sitemaps at any nesting level
  • URL analysis — depth, path segments, and source sitemap for every URL
  • HTTP health checks — verify each URL returns 200 and measure response time (optional)
  • AI clustering — group URLs into topic clusters with Claude (optional, bring your own key)
  • Tree mode — map a site's sitemap structure instead of extracting URLs
  • Anti-bot resilience — browser-grade fingerprinting, request throttling, and automatic retry on soft-bans
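
The recursive parsing and gzip handling listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's own code: `walk_sitemap`, `local_name`, and the injected `fetch` callable are hypothetical names, and the exact meaning of the depth limit is a guess based on the `maxDepth` description.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def walk_sitemap(
    url: str,
    fetch: Callable[[str], bytes],
    max_depth: int = 6,
    depth: int = 0,
) -> Iterator[tuple[str, str]]:
    """Yield (page_url, source_sitemap) pairs, following nested indexes."""
    body = fetch(url)
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    kind = local_name(root.tag)
    for entry in root:
        # Only direct <loc> children count; this skips e.g. image:loc extensions.
        loc = next(
            (c.text.strip() for c in entry if local_name(c.tag) == "loc" and c.text),
            None,
        )
        if loc is None:
            continue
        if kind == "sitemapindex":
            # Assumption: the limit counts how many index levels are followed.
            if depth < max_depth:
                yield from walk_sitemap(loc, fetch, max_depth, depth + 1)
        elif kind == "urlset":
            yield loc, url
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP client; in practice it would be a function that downloads the URL and returns the raw bytes.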

Quick start

Provide a URL and run:

{
  "siteUrl": "https://example.com"
}

The actor checks robots.txt, tries common sitemap paths, parses everything it finds recursively, and returns every URL with metadata.

You can also point it straight at a sitemap:

{
  "sitemapUrl": "https://example.com/sitemap_index.xml"
}
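
For running the actor from a script rather than the Apify console, a minimal sketch with the `apify-client` Python package might look like this. The helper names are made up for illustration, the token is assumed to live in an `APIFY_TOKEN` environment variable, and the actor ID is a placeholder because this page does not state it.

```python
import os


def build_run_input(site_url: str, max_urls: int = 1000) -> dict:
    """Build the minimal input shown in the Quick start section."""
    return {"siteUrl": site_url, "maxUrls": max_urls}


def run_actor(actor_id: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return all dataset items."""
    # Imported here so build_run_input stays usable without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # "<username>/<actor-name>" is a placeholder, not the real actor ID.
    items = run_actor("<username>/<actor-name>", build_run_input("https://example.com"))
    print(len(items), "items")
```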

How the sitemap is resolved

The actor accepts three kinds of entry and resolves each differently:

  1. Bare domain (example.com) — reads robots.txt, then probes common paths (/sitemap.xml, /sitemap_index.xml, /sitemap.xml.gz).
  2. Subfolder URL (example.com/blog) — filters robots.txt sitemap directives by the subpath, then probes subfolder-specific paths (/blog/sitemap.xml, /blog/sitemaps/sitemap.xml, including .gz). Falls back to the root sitemap only as a last resort, and flags it with a warning.
  3. Direct sitemap URL (sitemapUrl field, or any .xml / .xml.gz link) — used as-is, no auto-detection.

The first item in the dataset is always a manifest record describing exactly how the sitemap was resolved, including the full resolution trail. This makes it easy to see why a given sitemap was chosen.
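
The three resolution cases above can be approximated in a few lines. This is a sketch of the described behavior, not the actor's implementation: the function names are hypothetical, the probe order is assumed, and bare domains are assumed to default to `https://`.

```python
from urllib.parse import urlsplit

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz"]


def normalize(entry: str) -> str:
    """Add a scheme to bare entries such as 'example.com/blog'."""
    return entry if "://" in entry else "https://" + entry


def classify_entry(entry: str) -> str:
    """Return 'direct', 'subfolder', or 'domain' for an entry URL."""
    path = urlsplit(normalize(entry)).path
    if path.endswith((".xml", ".xml.gz")):
        return "direct"
    return "subfolder" if path.strip("/") else "domain"


def candidate_urls(entry: str) -> list[str]:
    """List the sitemap URLs to probe for an entry, in order."""
    url = normalize(entry)
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    kind = classify_entry(entry)
    if kind == "direct":
        return [url]  # used as-is, no auto-detection
    if kind == "domain":
        return [root + p for p in COMMON_PATHS]
    sub = "/" + parts.path.strip("/")
    paths = [f"{sub}/sitemap.xml", f"{sub}/sitemaps/sitemap.xml"]
    return [root + p + suffix for p in paths for suffix in ("", ".gz")]


def robots_sitemaps(robots_txt: str, subpath: str = "") -> list[str]:
    """Extract 'Sitemap:' directives, keeping only those under subpath if given."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    if subpath:
        found = [u for u in found if urlsplit(u).path.startswith(subpath)]
    return found
```

For example, `classify_entry("example.com/blog")` yields `"subfolder"`, and `robots_sitemaps(text, "/blog")` keeps only the directives whose path starts with `/blog`.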

Input fields

  • siteUrl (required) — Domain, subfolder URL, or direct sitemap URL
  • sitemapUrl — Explicit sitemap URL; overrides all auto-detection
  • maxUrls — Limit total URLs returned (0 = unlimited, default 1000)
  • maxSitemaps — Limit how many child sitemaps are walked (0 = unlimited)
  • maxDepth — Max sitemap-index nesting depth to follow (default 6)
  • requestDelayMs — Pause between sitemap fetches (default 300; raise to 800-1500 for strict anti-bot sites)
  • enableHealthCheck — HEAD-request each URL to check it returns 200 (default false)
  • healthCheckConcurrency — Parallel health checks (default 10, max 50)
  • enableAiClustering — Group URLs by topic with Claude (default false)
  • anthropicApiKey — Your Anthropic API key (BYOK, required if AI clustering enabled)
  • respectRobotsTxt — Honor robots.txt directives (default true)
  • sitemapTreeMode — Analyze child sitemaps instead of extracting URLs (default false)
  • sampleUrlsPerSitemap — In tree mode, sample this many URLs per child sitemap (0-100)
  • useProxy — Route requests through Apify proxy (default false)
  • proxyType — datacenter (fast, cheap) or residential (bypasses more anti-bot systems)
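
Before submitting a run, the documented defaults and limits can be checked locally. The sketch below is a hypothetical helper, not part of the actor. It assumes that either `siteUrl` or `sitemapUrl` is enough (as the Quick start suggests) and that the minimum concurrency is 1; fields with no documented default are left out of `DEFAULTS`.

```python
PROXY_TYPES = {"datacenter", "residential"}

# Defaults as documented in the field list above.
DEFAULTS = {
    "maxUrls": 1000,
    "maxDepth": 6,
    "requestDelayMs": 300,
    "enableHealthCheck": False,
    "healthCheckConcurrency": 10,
    "enableAiClustering": False,
    "respectRobotsTxt": True,
    "sitemapTreeMode": False,
    "useProxy": False,
}


def validate_input(raw: dict) -> dict:
    """Merge documented defaults into raw and raise ValueError on bad values."""
    cfg = {**DEFAULTS, **raw}
    errors = []
    if not (cfg.get("siteUrl") or cfg.get("sitemapUrl")):
        errors.append("provide siteUrl or sitemapUrl")
    if not 1 <= cfg["healthCheckConcurrency"] <= 50:
        errors.append("healthCheckConcurrency must be between 1 and 50")
    if not 0 <= cfg.get("sampleUrlsPerSitemap", 0) <= 100:
        errors.append("sampleUrlsPerSitemap must be between 0 and 100")
    if cfg["enableAiClustering"] and not cfg.get("anthropicApiKey"):
        errors.append("anthropicApiKey is required when enableAiClustering is true")
    if "proxyType" in cfg and cfg["proxyType"] not in PROXY_TYPES:
        errors.append("proxyType must be datacenter or residential")
    if errors:
        raise ValueError("; ".join(errors))
    return cfg
```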

Output format

Every dataset item has a type field. There are four types:

manifest — always the first item. Describes how the sitemap was resolved:

{
  "type": "manifest",
  "entryUrl": "https://example.com/blog",
  "rootDomain": "https://example.com",
  "resolvedSitemapUrl": "https://example.com/blog/sitemap.xml",
  "resolutionMethod": "subpath-common-path",
  "resolutionTrail": ["entry: ...", "fetching robots.txt: ...", "..."],
  "sitemapsDiscovered": 1,
  "warning": null,
  "mode": "urls"
}

url — one per extracted URL (URL mode):

{
  "type": "url",
  "url": "https://example.com/blog/post-1",
  "loc": "https://example.com/blog/post-1",
  "lastmod": "2025-12-01",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": "2",
  "pathSegments": "blog/post-1",
  "fromSitemap": "https://example.com/sitemap.xml",
  "statusCode": "200",
  "responseTimeMs": "234",
  "contentType": "text/html",
  "aiCluster": "blog-post"
}

statusCode, responseTimeMs, and contentType appear only when the health check is enabled. aiCluster appears only when AI clustering is enabled.
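
The depth and pathSegments values in the sample record are consistent with counting the non-empty segments of the URL path. The sketch below reproduces that, assuming this is how the fields are derived; `path_info` is a hypothetical helper, and it returns depth as a string only to mirror the sample.

```python
from urllib.parse import urlsplit


def path_info(url: str) -> dict:
    """Derive depth and pathSegments from a URL's path."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return {"depth": str(len(segments)), "pathSegments": "/".join(segments)}
```

For the sample URL `https://example.com/blog/post-1` this gives a depth of "2" and pathSegments of "blog/post-1", matching the record above.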

sitemap — one per child sitemap (tree mode):

{
  "type": "sitemap",
  "sitemapUrl": "https://example.com/sitemaps/sitemap0.xml.gz",
  "urlCount": "50000",
  "sitemapDepth": "1",
  "parentSitemap": "https://example.com/sitemap.xml",
  "sampleUrls": "https://... | https://... | https://..."
}

summary — always the last item. Totals and why the run stopped:

{
  "type": "summary",
  "totalUrls": 12500,
  "sitemapsParsed": 8,
  "stoppedReason": "completed"
}

stoppedReason can be completed, maxUrls reached, maxSitemaps reached, or maxDepth reached.
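
Because every record carries a type field, a downloaded dataset can be split into its four kinds with a small helper. This is a sketch for consumers of the output; `split_items` is a hypothetical name.

```python
from collections import defaultdict


def split_items(items: list[dict]) -> dict[str, list[dict]]:
    """Group dataset items by their 'type' field, preserving order."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("type", "unknown")].append(item)
    return dict(groups)
```

A typical use is `groups = split_items(items)`, then reading `groups["manifest"][0]` for the resolution details and `groups["summary"][0]["stoppedReason"]` to see whether a limit cut the run short.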

Sitemap tree mode

Instead of extracting individual URLs, tree mode maps the structure of a site's sitemap. For each child sitemap it returns the URL, how many URLs it contains, its depth, its parent, and optionally a few sample URLs.

This is built for analyzing a competitor's scale. A large site may have a sitemap index pointing to 160 child sitemaps; tree mode reports "160 sitemaps, 8M URLs total" and lets you peek at what each one holds, without returning millions of individual URLs.

Enable it with sitemapTreeMode: true. Set sampleUrlsPerSitemap to 3-10 to also get example URLs from each child sitemap. Results are written to the dataset progressively, one sitemap at a time, so partial output survives even if a run is interrupted.
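
Totals such as "160 sitemaps, 8M URLs" can be computed from the sitemap records of a tree-mode run. The sketch below assumes the record shape shown in the Output format section, where urlCount is a string and sampleUrls is a single " | "-separated string; both helper names are hypothetical.

```python
def tree_totals(items: list[dict]) -> tuple[int, int]:
    """Return (number of child sitemaps, total URLs across them)."""
    sitemaps = [i for i in items if i.get("type") == "sitemap"]
    total_urls = sum(int(s["urlCount"]) for s in sitemaps)
    return len(sitemaps), total_urls


def sample_urls(item: dict) -> list[str]:
    """Split the ' | '-separated sampleUrls string into a list."""
    raw = item.get("sampleUrls") or ""
    return [u.strip() for u in raw.split(" | ") if u.strip()]
```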

Anti-bot handling

The actor uses browser-grade request fingerprinting, which gets past many basic anti-bot defenses that block plain HTTP clients. For stricter sites it adds three more layers:

  • Throttling — requestDelayMs spaces out sitemap requests. Strict sites with anti-bot protection start returning HTML decoy pages instead of XML under request-rate pressure; raising the delay to 800-1500 ms greatly reduces this.
  • Soft-ban retry — when a server returns HTTP 200 but the body is an HTML anti-bot page rather than a sitemap, the actor detects it and retries that request with exponential backoff.
  • Proxy — set useProxy: true and choose proxyType. datacenter is fast and cheap; residential bypasses more systems but is slower and costs more.
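
The soft-ban retry described above can be illustrated with a content sniff plus exponential backoff. This is a sketch of the general technique, not the actor's detection logic: the function names, the 1024-byte sniff window, the retry count, and the base delay are all assumptions.

```python
import time
from typing import Callable


def looks_like_sitemap(body: bytes) -> bool:
    """Heuristic: gzip payload, or an XML sitemap root near the start."""
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes
        return True
    head = body[:1024].lower()
    return b"<urlset" in head or b"<sitemapindex" in head


def fetch_with_backoff(
    url: str,
    fetch: Callable[[str], bytes],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Retry when a 200 response body is not a sitemap (a soft-ban)."""
    for attempt in range(retries + 1):
        body = fetch(url)
        if looks_like_sitemap(body):
            return body
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"still receiving non-sitemap content from {url}")
```

The `sleep` parameter exists only so the delay can be replaced in tests; real callers would leave it at the default.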

Limitation: none of these layers bypasses full JavaScript-challenge systems (Cloudflare/DataDome interactive challenges). Sites with that level of protection cannot be reached by any non-browser tool.

Technical notes

  • Handles nested sitemap-index files recursively, up to maxDepth levels
  • Transparently decompresses gzip-compressed sitemaps (.xml.gz) at any level
  • Respects maxUrls, maxSitemaps, and maxDepth as hard limits
  • Health checks run with configurable parallelism
  • AI clustering uses Claude Haiku (BYOK)
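
Health checks with configurable parallelism can be sketched with a thread pool and HEAD requests from the standard library. This illustrates the idea only: the function names and the timeout are assumptions, and because `urllib` follows redirects, the status reported is that of the final response.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def head_check(url: str, timeout: float = 10.0) -> dict:
    """Send one HEAD request and record status, timing, and content type."""
    start = time.monotonic()
    status: Optional[int] = None
    content_type: Optional[str] = None
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            content_type = response.headers.get("Content-Type")
    except urllib.error.HTTPError as err:  # 4xx/5xx still carry a status code
        status = err.code
        content_type = err.headers.get("Content-Type")
    except OSError:  # DNS failure, refused connection, timeout
        pass
    return {
        "url": url,
        "statusCode": status,
        "responseTimeMs": round((time.monotonic() - start) * 1000),
        "contentType": content_type,
    }


def check_all(
    urls: list[str],
    concurrency: int = 10,
    check: Callable[[str], dict] = head_check,
) -> list[dict]:
    """Run checks in parallel, clamped to the documented maximum of 50."""
    workers = max(1, min(concurrency, 50))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, urls))  # results keep the input order
```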

Use cases

SEO audit — find dead pages (404), slow pages, and pages missing from your sitemap.

Site migration — get the complete URL inventory before moving to a new CMS.

Competitive research — map a competitor's full site structure and scale with tree mode.

Sitemap validation — verify that every URL in your sitemap returns 200.
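
For the audit and validation use cases, the url records of a run with health checks enabled can be filtered for problem pages. The sketch below assumes the record shape from the Output format section, with statusCode and responseTimeMs as numeric strings. The `audit` helper and its 1000 ms threshold are hypothetical choices, not actor settings.

```python
def audit(items: list[dict], slow_ms: int = 1000) -> dict[str, list[str]]:
    """Split health-checked URLs into non-200 and slow responses."""
    not_ok, slow = [], []
    for item in items:
        if item.get("type") != "url" or item.get("statusCode") is None:
            continue  # skip other record types and URLs without a health check
        if int(item["statusCode"]) != 200:
            not_ok.append(item["url"])
        elif int(item.get("responseTimeMs", 0)) > slow_ms:
            slow.append(item["url"])
    return {"not_ok": not_ok, "slow": slow}
```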

Support

Found a bug or have a feature request? Contact the developer through the actor's page on Apify.