Pricing

Pay per usage

Sitemap Analyzer — Recursive Parse, Health Check, AI Tags

Parse any sitemap.xml recursively, extract all URLs with metadata, check HTTP health status, and optionally cluster URLs by topic using Claude AI. Perfect for SEO audits and site migration.

Developer

Andrei

Last modified

2 days ago

Sitemap Analyzer

Parse any sitemap recursively, extract every URL with metadata, check HTTP health, and optionally cluster URLs by topic with AI. Built for SEO audits, site migrations, and competitor analysis.

What this actor does

Most sitemap extractors just give you a flat list of URLs. This actor goes further:

Smart resolution — accepts a bare domain, a subfolder URL, or a direct sitemap URL, and finds the right sitemap for each
Recursive parsing — handles nested sitemap-index files (parent to child) automatically
Gzip support — transparently decompresses .xml.gz sitemaps at any nesting level
URL analysis — depth, path segments, and source sitemap for every URL
HTTP health checks — verify each URL returns 200 and measure response time (optional)
AI clustering — group URLs into topic clusters with Claude (optional, bring your own key)
Tree mode — map a site's sitemap structure instead of extracting URLs
Anti-bot resilience — browser-grade fingerprinting, request throttling, and automatic retry on soft-bans

Quick start

Provide a URL and run:

{
  "siteUrl": "https://example.com"
}

The actor checks robots.txt, tries common sitemap paths, parses everything it finds recursively, and returns every URL with metadata.

You can also point it straight at a sitemap:

{
  "sitemapUrl": "https://example.com/sitemap_index.xml"
}

How the sitemap is resolved

The actor accepts three kinds of entry and resolves each differently:

Bare domain (example.com) — reads robots.txt, then probes common paths (/sitemap.xml, /sitemap_index.xml, /sitemap.xml.gz).
Subfolder URL (example.com/blog) — filters robots.txt sitemap directives by the subpath, then probes subfolder-specific paths (/blog/sitemap.xml, /blog/sitemaps/sitemap.xml, including .gz). Falls back to the root sitemap only as a last resort, and flags it with a warning.
Direct sitemap URL (sitemapUrl field, or any .xml / .xml.gz link) — used as-is, no auto-detection.

The first item in the dataset is always a manifest record describing exactly how the sitemap was resolved, including the full resolution trail. This makes it easy to see why a given sitemap was chosen.
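
The three entry cases above can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual code: the function names `classify_entry` and `candidate_paths` are hypothetical, the robots.txt step is left out, and the probing order is an assumption based on the order the paths are listed in.

```python
from __future__ import annotations

from urllib.parse import urlparse


def _parse(entry: str):
    # Accept inputs without a scheme, such as "example.com/blog".
    return urlparse(entry if "://" in entry else "https://" + entry)


def classify_entry(entry: str) -> str:
    """Return 'direct-sitemap', 'subfolder', or 'bare-domain'."""
    path = _parse(entry).path.rstrip("/")
    if path.endswith((".xml", ".xml.gz")):
        return "direct-sitemap"
    return "subfolder" if path else "bare-domain"


def candidate_paths(entry: str) -> list[str]:
    """List the sitemap URLs to probe for an entry (robots.txt not included)."""
    parsed = _parse(entry)
    root = f"{parsed.scheme}://{parsed.netloc}"
    path = parsed.path.rstrip("/")
    kind = classify_entry(entry)
    if kind == "direct-sitemap":
        return [parsed.geturl()]  # used as-is, no auto-detection
    if kind == "subfolder":
        names = ("/sitemap.xml", "/sitemaps/sitemap.xml")
        # Each subfolder path is tried plain and as a .gz variant.
        return [root + path + name + gz for name in names for gz in ("", ".gz")]
    return [root + p for p in ("/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz")]
```

For example, `candidate_paths("example.com/blog")` starts with `https://example.com/blog/sitemap.xml`, which matches the resolvedSitemapUrl shown in the manifest sample.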

Input fields

siteUrl (required) — Domain, subfolder URL, or direct sitemap URL
sitemapUrl — Explicit sitemap URL; overrides all auto-detection
maxUrls — Limit total URLs returned (0 = unlimited, default 1000)
maxSitemaps — Limit how many child sitemaps are walked (0 = unlimited)
maxDepth — Max sitemap-index nesting depth to follow (default 6)
requestDelayMs — Pause between sitemap fetches (default 300; raise to 800-1500 for strict anti-bot sites)
enableHealthCheck — HEAD-request each URL to check it returns 200 (default false)
healthCheckConcurrency — Parallel health checks (default 10, max 50)
enableAiClustering — Group URLs by topic with Claude (default false)
anthropicApiKey — Your Anthropic API key (BYOK, required if AI clustering enabled)
respectRobotsTxt — Honor robots.txt directives (default true)
sitemapTreeMode — Analyze child sitemaps instead of extracting URLs (default false)
sampleUrlsPerSitemap — In tree mode, sample this many URLs per child sitemap (0-100)
useProxy — Route requests through Apify proxy (default false)
proxyType — datacenter (fast, cheap) or residential (bypasses more anti-bot systems)
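
If you build the input programmatically, it can help to check it before starting a run. The sketch below applies the defaults and limits listed above. The helper name `validate_input` is hypothetical, and a few rules are assumptions rather than documented behavior: accepting either siteUrl or sitemapUrl, a minimum concurrency of 1, and requiring proxyType whenever useProxy is on.

```python
# Defaults taken from the field list above.
DEFAULTS = {
    "maxUrls": 1000,
    "maxDepth": 6,
    "requestDelayMs": 300,
    "enableHealthCheck": False,
    "healthCheckConcurrency": 10,
    "enableAiClustering": False,
    "respectRobotsTxt": True,
    "sitemapTreeMode": False,
    "useProxy": False,
}


def validate_input(raw):
    """Merge defaults into raw input and return (config, list of problems)."""
    cfg = {**DEFAULTS, **raw}
    errors = []
    # Assumption: one of the two URL fields is enough to start a run.
    if not cfg.get("siteUrl") and not cfg.get("sitemapUrl"):
        errors.append("provide siteUrl or sitemapUrl")
    # Documented maximum is 50; the lower bound of 1 is an assumption.
    if not 1 <= cfg["healthCheckConcurrency"] <= 50:
        errors.append("healthCheckConcurrency must be between 1 and 50")
    if not 0 <= cfg.get("sampleUrlsPerSitemap", 0) <= 100:
        errors.append("sampleUrlsPerSitemap must be between 0 and 100")
    if cfg["enableAiClustering"] and not cfg.get("anthropicApiKey"):
        errors.append("anthropicApiKey is required when enableAiClustering is true")
    # Assumption: proxyType must be set explicitly when useProxy is true.
    if cfg["useProxy"] and cfg.get("proxyType") not in ("datacenter", "residential"):
        errors.append("proxyType must be 'datacenter' or 'residential'")
    return cfg, errors
```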

Output format

Every dataset item has a type field. There are four types:

manifest — always the first item. Describes how the sitemap was resolved:

{
  "type": "manifest",
  "entryUrl": "https://example.com/blog",
  "rootDomain": "https://example.com",
  "resolvedSitemapUrl": "https://example.com/blog/sitemap.xml",
  "resolutionMethod": "subpath-common-path",
  "resolutionTrail": ["entry: ...", "fetching robots.txt: ...", "..."],
  "sitemapsDiscovered": 1,
  "warning": null,
  "mode": "urls"
}

url — one per extracted URL (URL mode):

{
  "type": "url",
  "url": "https://example.com/blog/post-1",
  "loc": "https://example.com/blog/post-1",
  "lastmod": "2025-12-01",
  "changefreq": "weekly",
  "priority": "0.8",
  "depth": "2",
  "pathSegments": "blog/post-1",
  "fromSitemap": "https://example.com/sitemap.xml",
  "statusCode": "200",
  "responseTimeMs": "234",
  "contentType": "text/html",
  "aiCluster": "blog-post"
}

statusCode, responseTimeMs, and contentType appear only when the health check is enabled. aiCluster appears only when AI clustering is enabled.
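
The depth and pathSegments fields in the sample above are consistent with counting the non-empty path segments of the URL. A minimal sketch of that calculation follows. The function name `analyze_url` is hypothetical, and the handling of a root URL (depth "0", empty pathSegments) is an assumption. Values are returned as strings only to mirror the sample record.

```python
from urllib.parse import urlparse


def analyze_url(url: str) -> dict:
    """Derive depth and path segments from a URL's path."""
    # Drop empty pieces caused by leading, trailing, or doubled slashes.
    segments = [part for part in urlparse(url).path.split("/") if part]
    return {
        "url": url,
        "depth": str(len(segments)),
        "pathSegments": "/".join(segments),
    }
```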

sitemap — one per child sitemap (tree mode):

{
  "type": "sitemap",
  "sitemapUrl": "https://example.com/sitemaps/sitemap0.xml.gz",
  "urlCount": "50000",
  "sitemapDepth": "1",
  "parentSitemap": "https://example.com/sitemap.xml",
  "sampleUrls": "https://... | https://... | https://..."
}

summary — always the last item. Totals and why the run stopped:

{
  "type": "summary",
  "totalUrls": 12500,
  "sitemapsParsed": 8,
  "stoppedReason": "completed"
}

stoppedReason can be completed, maxUrls reached, maxSitemaps reached, or maxDepth reached.
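
Because every item carries a type field, a downloaded dataset can be split into its four parts with a short helper. The sketch below assumes the items are already loaded as a list of dicts (for example from a JSON export). The names `split_dataset` and `was_truncated` are hypothetical.

```python
def split_dataset(items):
    """Split dataset items into (manifest, url records, sitemap records, summary)."""
    # The manifest is documented as the first item and the summary as the last.
    manifest = items[0] if items and items[0].get("type") == "manifest" else None
    summary = items[-1] if items and items[-1].get("type") == "summary" else None
    urls = [item for item in items if item.get("type") == "url"]
    sitemaps = [item for item in items if item.get("type") == "sitemap"]
    return manifest, urls, sitemaps, summary


def was_truncated(summary):
    """True when the run stopped because a limit was hit rather than finishing."""
    return summary is not None and summary.get("stoppedReason") != "completed"
```

Checking `was_truncated` before using the URL list is a cheap way to notice that maxUrls, maxSitemaps, or maxDepth cut the run short.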

Sitemap tree mode

Instead of extracting individual URLs, tree mode maps the structure of a site's sitemap. For each child sitemap it returns the URL, how many URLs it contains, its depth, its parent, and optionally a few sample URLs.

This is built for competitor scale analysis. A large site may have a sitemap index pointing to 160 child sitemaps; tree mode tells you instantly "160 sitemaps, 8M URLs total" and lets you peek at what each one holds, without downloading millions of URLs.

Enable it with sitemapTreeMode: true. Set sampleUrlsPerSitemap to 3-10 to also get example URLs from each child sitemap. Results are written to the dataset progressively, one sitemap at a time, so partial output survives even if a run is interrupted.
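
To get the "N sitemaps, M URLs total" figure from a tree-mode run, you can add up the sitemap records yourself. This is a sketch under two assumptions taken from the sample record in the output section: urlCount arrives as a string, and sampleUrls is a single string joined with " | ". The function names are hypothetical.

```python
def summarize_tree(items):
    """Count child sitemaps and total URLs across all sitemap records."""
    sitemaps = [item for item in items if item.get("type") == "sitemap"]
    counts = [int(record["urlCount"]) for record in sitemaps]
    largest = max(sitemaps, key=lambda r: int(r["urlCount"]), default=None)
    return {
        "sitemaps": len(sitemaps),
        "totalUrls": sum(counts),
        "largestSitemap": largest["sitemapUrl"] if largest else None,
    }


def sample_urls(record):
    """Split the joined sampleUrls string back into a list."""
    raw = record.get("sampleUrls") or ""
    return [url.strip() for url in raw.split("|") if url.strip()]
```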

Anti-bot handling

The actor uses browser-grade request fingerprinting, which gets past many basic anti-bot defenses that block plain HTTP clients. For stricter sites it also has three more layers:

Throttling — requestDelayMs spaces out sitemap requests. Strict sites with anti-bot protection start returning HTML decoy pages instead of XML under request-rate pressure; raising the delay to 800-1500ms greatly reduces this.
Soft-ban retry — when a server returns HTTP 200 but the body is an HTML anti-bot page rather than a sitemap, the actor detects it and retries that request with exponential backoff.
Proxy — set useProxy: true and choose proxyType. datacenter is fast and cheap; residential bypasses more systems but is slower and costs more.

Limitation: none of these layers bypasses full JavaScript-challenge systems (Cloudflare/DataDome interactive challenges). Sites with that level of protection cannot be reached by any non-browser tool.
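
The soft-ban retry idea can be illustrated with a small sketch. It is not the actor's implementation: the detection heuristic (look for a sitemap root tag or a gzip header near the start of the body), the attempt count, and the delays are all assumptions, and `fetch` is a hypothetical callable that returns a (status, body) pair.

```python
import time


def looks_like_sitemap(body: bytes) -> bool:
    """Heuristic: gzip payload, or a sitemap root tag near the start of the body."""
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes; decompress later
        return True
    head = body[:2048].lower()
    return b"<urlset" in head or b"<sitemapindex" in head


def fetch_with_soft_ban_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry when a 200 response carries an HTML decoy instead of a sitemap."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200 and looks_like_sitemap(body):
            return body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"no sitemap body from {url} after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.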

Technical notes

Handles nested sitemap-index files recursively, up to maxDepth levels
Transparently decompresses gzip-compressed sitemaps (.xml.gz) at any level
Respects maxUrls, maxSitemaps, and maxDepth as hard limits
Health checks run with configurable parallelism
AI clustering uses Claude Haiku (BYOK)
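
For readers who want to see how recursive parsing, gzip handling, and the hard limits fit together, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the actor's code: `fetch` is a hypothetical callable returning the raw bytes for a URL, gzip is detected by its magic bytes, the root sitemap counts as depth 0, and a max_urls of 0 means unlimited.

```python
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def decode(body: bytes) -> bytes:
    """Decompress gzip payloads; pass plain XML through unchanged."""
    return gzip.decompress(body) if body[:2] == b"\x1f\x8b" else body


def walk(url, fetch, max_depth=6, max_urls=1000, depth=0, out=None):
    """Collect URL entries from a sitemap, following sitemap-index files."""
    out = [] if out is None else out
    if depth > max_depth or (max_urls and len(out) >= max_urls):
        return out
    root = ET.fromstring(decode(fetch(url)))
    if root.tag == NS + "sitemapindex":
        # Index file: recurse into every child sitemap it lists.
        for loc in root.iter(NS + "loc"):
            if loc.text:
                walk(loc.text.strip(), fetch, max_depth, max_urls, depth + 1, out)
    else:
        for node in root.iter(NS + "url"):
            if max_urls and len(out) >= max_urls:
                break
            loc = node.findtext(NS + "loc")
            if loc:
                out.append({
                    "loc": loc.strip(),
                    "lastmod": node.findtext(NS + "lastmod"),
                    "fromSitemap": url,
                })
    return out
```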

Use cases

SEO audit — find dead pages (404), slow pages, and pages missing from your sitemap.

Site migration — get the complete URL inventory before moving to a new CMS.

Competitive research — map a competitor's full site structure and scale with tree mode.

Sitemap validation — verify that every URL in your sitemap returns 200.
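
For the audit and validation cases, the url records from a run with health checks enabled can be filtered after the fact. The sketch below assumes the field names and string values shown in the output section. The function name `find_problem_urls` and the 1000 ms "slow" threshold are hypothetical choices.

```python
def find_problem_urls(items, slow_ms=1000):
    """Return URLs that did not answer 200 and URLs slower than slow_ms."""
    dead, slow = [], []
    for item in items:
        # Skip non-URL records and runs where health checks were off.
        if item.get("type") != "url" or "statusCode" not in item:
            continue
        if str(item["statusCode"]) != "200":
            dead.append(item["url"])
        elif int(item.get("responseTimeMs", 0)) > slow_ms:
            slow.append(item["url"])
    return {"dead": dead, "slow": slow}
```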

Support

Found a bug or have a feature request? Contact the developer through the actor's page on Apify.
