Sitemap Structure Analyzer

Pricing: Pay per usage

Analyze any website's sitemap in seconds using sitemap.xml data. Get URL counts by type (product, blog, docs), content freshness, URL patterns, and SEO anomalies — no page fetching required.

Rating: 5.0 (2)
Developer: One Scales
Maintained by Community
Actor stats: 3 bookmarked, 5 total users, 4 monthly active users, last modified 4 days ago

Sitemap Structure Analyzer tells you what a website is made of without fetching a single page. Point it at any domain and get back a full breakdown: how many product, blog, docs, and utility URLs the site has; what URL templates drive it; how fresh the content is; and where the anomalies are.

Works on sites from 50 URLs to hundreds of thousands of URLs, and typically runs in seconds. Because it reads only public sitemap files and never fetches pages, it sidesteps most of the rate limits, IP blocks, and legal questions that come with page scraping.

Pairs naturally with the Sitemap URL Extractor (get the raw URLs) and the Bulk AI Markdown Maker (scrape only the pages you actually need).

Use cases include:

  • SEO audits — surface stale content, utility URLs that shouldn't be indexed, and low-lastmod coverage across entire sites
  • Competitive research — understand a competitor's content shape before investing in a content strategy
  • AI / RAG pipeline building — identify exactly what URL types and sections to include before scraping
  • Agency prospecting — bulk-check potential client sites for content mix and content freshness
  • Content strategy benchmarking — compare your site's product/blog/docs ratio against competitors
  • Technical SEO QA — detect account, cart, and search filter URLs appearing in sitemaps where they shouldn't

Features

  • Content type classification — every URL labeled as one of: product, blog, documentation, profile, category, page, media, other (utility/auth/search), or unclassified
  • Site archetype detection — labels the site as ecommerce, content, documentation, community, marketing, or general based on its dominant content type
  • Docs-context awareness — sites on docs., developer., api., support., help., learn., wiki., kb., or knowledgebase. subdomains, or with a meaningful share of /docs/-style paths, get a smarter classification pass that recognizes modern documentation platforms (Mintlify, Docusaurus, Nextra, GitBook)
  • URL pattern detection — groups URLs into templates (e.g. /products/{slug}) with counts, dominant classification, and example URLs. Patterns with only one URL are suppressed
  • Freshness analysis — lastmod coverage, newest/oldest URLs, content velocity (30/90/365-day windows), stale URL counts (1+/2+/3+ years), posting cadence by section
  • Anomaly detection — flags utility URLs in sitemaps, stale content concentration, low lastmod coverage, and silent zero-result sitemaps
  • Sitemap index support — automatically fetches and recurses through all child sitemaps in a sitemap index, with cycle protection
  • Proxy support — residential proxy by default with automatic no-proxy fallback for small sites with bot protection that blocks proxied requests
  • Budget capping — caps domains processed to stay within your configured budget
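
The URL pattern detection feature can be illustrated with a minimal Python sketch. The actor's real grouping logic is not documented here, so the heuristic below (replace the last path segment with {slug}) and the names to_template and detect_patterns are assumptions made for illustration only. The minimum of 2 URLs per pattern and the top-20 cutoff follow the description of the urlPatterns output field.

```python
from collections import defaultdict
from urllib.parse import urlparse


def to_template(url: str) -> str:
    """Replace the last path segment with {slug}; keep parent segments as-is."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        # Top-level pages such as /about have no parent section to template on.
        return "/" + "/".join(segments)
    return "/" + "/".join(segments[:-1]) + "/{slug}"


def detect_patterns(urls: list[str], min_count: int = 2, top_n: int = 20) -> list[dict]:
    """Group URLs by template and drop templates matched by fewer than min_count URLs."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[to_template(url)].append(urlparse(url).path)
    patterns = [
        {"pattern": template, "count": len(paths), "examples": paths[:2]}
        for template, paths in groups.items()
        if len(paths) >= min_count
    ]
    # Largest groups first, then keep only the top_n.
    patterns.sort(key=lambda p: p["count"], reverse=True)
    return patterns[:top_n]
```

A real implementation would likely also template numeric IDs, dates, and locale prefixes; this sketch only shows the group, count, and suppress steps.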

How to Use

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| domains | String list | Yes | Domains to analyze. Accepts example.com, https://example.com, www.example.com, or a direct sitemap URL like https://example.com/sitemap.xml. |
| maxUrls | Integer | No | Cap on URL processing per domain. 0 = no cap. Useful for very large sites. |
| proxyConfiguration | Object | No | Proxy settings. Residential proxy recommended and set by default. For sites with Cloudflare/WAF bot protection, pinning the proxy to a specific country (e.g. US) often improves reliability. |

Example input:

{
  "domains": ["onescales.com", "shopify.com"],
  "maxUrls": 0,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
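
For programmatic runs, the same input can be built and submitted from Python. The following is a sketch, not an official client for this actor: build_run_input and run_analyzer are hypothetical helper names, the actor ID must be copied from the actor's Apify Store page (it is not stated in this document), and the apifyProxyCountry key is assumed to be the standard Apify proxy field for country pinning. The submission step relies on the third-party apify-client package.

```python
def build_run_input(domains, max_urls=0, proxy_country=None):
    """Build an input object shaped like the example input above."""
    if not domains:
        raise ValueError("domains must contain at least one entry")
    if max_urls < 0:
        raise ValueError("max_urls must be 0 (no cap) or a positive integer")
    proxy = {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]}
    if proxy_country:
        # Assumption: country pinning uses Apify's standard apifyProxyCountry key.
        proxy["apifyProxyCountry"] = proxy_country
    return {"domains": list(domains), "maxUrls": max_urls, "proxyConfiguration": proxy}


def run_analyzer(token, actor_id, run_input):
    """Start a run, wait for it to finish, and return its dataset rows (one per domain)."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not finish")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then look like run_analyzer(token, actor_id, build_run_input(["onescales.com"], proxy_country="US")), with token and actor_id supplied from your own Apify account and the Store page.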

Output

One row per domain. Every row includes:

| Field | Description |
| --- | --- |
| domain | Analyzed domain |
| sitemapUrl | Sitemap URL that was used |
| sitemapType | index (sitemap index) or urlset (single sitemap) |
| childSitemaps | Child sitemap URLs (only present when sitemapType is index) |
| analyzedAt | ISO timestamp |
| error | Error message if analysis failed |
| summary | Total URLs, classified/unclassified counts, classification coverage, site archetype |
| byType | URL counts and percentages per content type |
| bySection | URL counts per top-level path prefix |
| urlPatterns | Detected URL templates with count, classification, and examples (top 20, minimum 2 URLs each) |
| freshness | lastmod coverage, content velocity, stale URL breakdown, posting cadence by section |
| anomalies | Detected anomalies with severity, count, and description |

Example output row (Shopify ecommerce site):

{
  "domain": "onescales.com",
  "sitemapUrl": "https://www.onescales.com/sitemap.xml",
  "sitemapType": "index",
  "analyzedAt": "2026-05-19T10:23:11.940Z",
  "summary": {
    "totalUrls": 464,
    "classified": 464,
    "unclassified": 0,
    "classificationCoverage": 1,
    "siteArchetype": "ecommerce"
  },
  "byType": {
    "product": { "count": 239, "percentage": 51.5 },
    "blog": { "count": 96, "percentage": 20.7 },
    "category": { "count": 63, "percentage": 13.6 },
    "page": { "count": 66, "percentage": 14.2 },
    "documentation": { "count": 0, "percentage": 0 },
    "profile": { "count": 0, "percentage": 0 },
    "media": { "count": 0, "percentage": 0 },
    "other": { "count": 0, "percentage": 0 },
    "unclassified": { "count": 0, "percentage": 0 }
  },
  "bySection": {
    "/products/": 239,
    "/blog/": 98,
    "/pages/": 64,
    "/collections/": 61
  },
  "urlPatterns": [
    {
      "pattern": "/products/{slug}",
      "count": 239,
      "classification": "product",
      "examples": ["/products/table", "/products/the-perfect-day"]
    },
    {
      "pattern": "/blog/{slug}",
      "count": 98,
      "classification": "blog",
      "examples": ["/blog/resources/are-we-here", "/blog/resources/bank-look"]
    }
  ],
  "freshness": {
    "lastmodCoverage": 1,
    "newestUrlLastmod": "2026-05-19",
    "oldestUrlLastmod": "2019-02-12",
    "contentVelocity": {
      "urlsModifiedLast30Days": 301,
      "urlsModifiedLast90Days": 302,
      "urlsModifiedLast365Days": 309
    },
    "staleUrls": {
      "olderThan1Year": 153,
      "olderThan2Years": 115,
      "olderThan3Years": 102
    },
    "postingCadenceBySection": {
      "/products/": "approx 19.9 updates per month over last 12 months",
      "/collections/": "approx 5.1 updates per month over last 12 months",
      "/blog/": "approx 0.3 updates per month over last 12 months"
    }
  },
  "anomalies": [
    {
      "type": "stale_content_concentration",
      "severity": "low",
      "count": 102,
      "description": "102 URLs have not been modified in over 3 years. Consider auditing for relevance."
    }
  ]
}
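
To make the freshness fields in the row above concrete, here is a rough Python sketch of how such numbers could be derived from a list of lastmod dates. It is an illustration under stated assumptions, not the actor's code: freshness_summary is a hypothetical name, a year is approximated as 365 days, and the windows are treated as inclusive. Only the output key names are taken from the example row.

```python
from datetime import date
from typing import Optional


def freshness_summary(lastmods: list[Optional[date]], today: date) -> dict:
    """Summarize lastmod dates; None means the URL had no lastmod entry."""
    dated = [d for d in lastmods if d is not None]

    def modified_within(days: int) -> int:
        # Assumption: the window is inclusive and future dates are ignored.
        return sum(1 for d in dated if 0 <= (today - d).days <= days)

    def older_than(years: int) -> int:
        # Assumption: one year is approximated as 365 days.
        return sum(1 for d in dated if (today - d).days > 365 * years)

    return {
        "lastmodCoverage": round(len(dated) / len(lastmods), 3) if lastmods else 0,
        "newestUrlLastmod": max(dated).isoformat() if dated else None,
        "oldestUrlLastmod": min(dated).isoformat() if dated else None,
        "contentVelocity": {
            "urlsModifiedLast30Days": modified_within(30),
            "urlsModifiedLast90Days": modified_within(90),
            "urlsModifiedLast365Days": modified_within(365),
        },
        "staleUrls": {
            "olderThan1Year": older_than(1),
            "olderThan2Years": older_than(2),
            "olderThan3Years": older_than(3),
        },
    }
```

Note that the stale buckets are cumulative: a URL older than 3 years is also counted in the 1-year and 2-year buckets, which is consistent with the decreasing counts in the example row.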

Anomaly Types

| Type | Triggered When | Severity |
| --- | --- | --- |
| other_urls_in_sitemap | Utility URLs (auth, cart, search filters) appear in the sitemap and typically shouldn't be indexed | medium (≤20 URLs) or high (>20) |
| stale_content_concentration | More than 50 URLs haven't been modified in over 3 years | low (≤500) or high (>500) |
| low_lastmod_coverage | Fewer than 30% of URLs have a lastmod date (on sites with 50+ URLs) | low |
| sitemap_returned_no_entries | Sitemap was fetched but no URLs were extracted (proxy blocking, parse failure, or all child sitemaps empty) | high |
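
The thresholds in this table can be written out as a small rule function. This is a sketch that encodes only what the table states; the function name detect_anomalies and its parameters are hypothetical, and the actor's internal implementation may differ.

```python
def detect_anomalies(total_urls, utility_urls, stale_3y_urls, lastmod_coverage):
    """Return anomaly types and severities using the thresholds from the table."""
    if total_urls == 0:
        # Nothing was extracted, so no other rule can be evaluated.
        return [{"type": "sitemap_returned_no_entries", "severity": "high"}]

    anomalies = []
    if utility_urls > 0:
        severity = "high" if utility_urls > 20 else "medium"
        anomalies.append({"type": "other_urls_in_sitemap", "severity": severity})
    if stale_3y_urls > 50:
        severity = "high" if stale_3y_urls > 500 else "low"
        anomalies.append({"type": "stale_content_concentration", "severity": severity})
    if total_urls >= 50 and lastmod_coverage < 0.30:
        anomalies.append({"type": "low_lastmod_coverage", "severity": "low"})
    return anomalies
```

Applied to the example output row (464 URLs, no utility URLs, 102 URLs older than 3 years, full lastmod coverage), these rules yield a single low-severity stale_content_concentration entry, matching the anomalies array shown there.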

Tips

  • Sitemap index sites — for large sites with a sitemap index, the actor automatically fetches and aggregates all child sitemaps
  • Large sites — use maxUrls to cap processing and control costs on sites with 100,000+ URLs
  • No sitemap found — the actor checks robots.txt first, then falls back to /sitemap.xml and /sitemap_index.xml. If none work, the row will contain an error message
  • Sites with bot protection — small WordPress sites and Cloudflare-fronted docs sites sometimes block residential proxies. The actor automatically retries without the proxy when an XML fetch returns non-XML. If a domain still fails, try setting the proxy to a country-specific residential group (e.g. RESIDENTIAL pinned to US)
  • Direct sitemap URLs — you can pass a full sitemap URL like https://example.com/sitemaps/posts.xml to skip discovery and analyze only that sitemap
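
The discovery order described in these tips (a direct sitemap URL skips discovery; otherwise robots.txt first, then /sitemap.xml and /sitemap_index.xml) can be sketched as follows. The function name sitemap_candidates is hypothetical, and treating any input whose path ends in .xml as a direct sitemap URL is an assumption, since the actor's exact detection rule is not documented. The function only builds the ordered candidate list; fetching is left out.

```python
from urllib.parse import urlparse


def sitemap_candidates(entry: str, robots_txt: str = "") -> list[str]:
    """Return sitemap URLs to try, in order, for one domains entry."""
    entry = entry.strip()
    if "://" not in entry:
        entry = "https://" + entry  # bare domains such as example.com
    parsed = urlparse(entry)

    # Assumption: a path ending in .xml is a direct sitemap URL, so skip discovery.
    if parsed.path.lower().endswith(".xml"):
        return [entry]

    origin = f"{parsed.scheme}://{parsed.netloc}"
    candidates = []
    # Sitemap locations declared in robots.txt come first.
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    # Conventional fallback locations, skipping any already listed.
    for fallback in ("/sitemap.xml", "/sitemap_index.xml"):
        url = origin + fallback
        if url not in candidates:
            candidates.append(url)
    return candidates
```

A caller would try each candidate in turn and record an error for the domain if none returns parseable sitemap XML.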

Support

For bugs, feature requests, or questions — reach us at https://docs.google.com/forms/d/e/1FAIpQLSfsKyzZ3nRED7mML47I4LAfNh_mBwkuFMp1FgYYJ4AkDRgaRw/viewform?usp=dialog
