Pricing

Pay per usage

Sitemap Structure Analyzer

Analyze any website's sitemap in seconds using sitemap.xml data. Get URL counts by type (product, blog, docs), content freshness, URL patterns, and SEO anomalies — no page fetching required.

Rating: 5.0 (2)

Developer: One Scales

Last modified: 9 days ago

Features

Content type classification — every URL labeled as one of: product, blog, documentation, profile, category, page, media, other (utility/auth/search), or unclassified
Site archetype detection — labels the site as ecommerce, content, documentation, community, marketing, or general based on its dominant content type
Docs-context awareness — sites on docs., developer., api., support., help., learn., wiki., kb., or knowledgebase. subdomains, or with a meaningful share of /docs/-style paths, get a smarter classification pass that recognizes modern documentation platforms (Mintlify, Docusaurus, Nextra, GitBook)
URL pattern detection — groups URLs into templates (e.g. /products/{slug}) with counts, dominant classification, and example URLs. Patterns with only one URL are suppressed
Freshness analysis — lastmod coverage, newest/oldest URLs, content velocity (30/90/365-day windows), stale URL counts (1+/2+/3+ years), posting cadence by section
Anomaly detection — flags utility URLs in sitemaps, stale content concentration, low lastmod coverage, and silent zero-result sitemaps
Sitemap index support — automatically fetches and recurses through all child sitemaps in a sitemap index, with cycle protection
Proxy support — residential proxy by default with automatic no-proxy fallback for small sites with bot protection that blocks proxied requests
Budget capping — caps domains processed to stay within your configured budget
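
The URL pattern detection described above can be illustrated with a minimal Python sketch. It is not the actor's actual implementation: the rule used here (keep the first path segment literal, replace every deeper segment with `{slug}`) and the function names `to_pattern` and `group_patterns` are assumptions made for illustration. Only the suppression of single-URL patterns and the top-20 limit come from this page.

```python
from collections import defaultdict
from urllib.parse import urlparse


def to_pattern(url: str) -> str:
    """Turn a URL into a template such as /products/{slug}.

    Assumed rule: the first path segment stays literal and every
    deeper segment is replaced with {slug}.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) <= 1:
        return "/" + "/".join(segments)
    return "/" + "/".join([segments[0]] + ["{slug}"] * (len(segments) - 1))


def group_patterns(urls, min_count=2, top_n=20):
    """Group URLs by template, drop templates with fewer than
    min_count URLs, and return the top_n largest groups."""
    groups = defaultdict(list)
    for url in urls:
        groups[to_pattern(url)].append(urlparse(url).path)
    rows = [
        {"pattern": pattern, "count": len(paths), "examples": paths[:2]}
        for pattern, paths in groups.items()
        if len(paths) >= min_count
    ]
    # Largest groups first; ties broken alphabetically for stable output.
    rows.sort(key=lambda row: (-row["count"], row["pattern"]))
    return rows[:top_n]


sample = [
    "https://example.com/products/table",
    "https://example.com/products/chair",
    "https://example.com/about",
]
print(group_patterns(sample))
```

In this sketch `/about` forms a pattern with a single URL, so it is dropped from the result, mirroring the "patterns with only one URL are suppressed" behavior.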

How to Use

Input

Field	Type	Required	Description
`domains`	String list	Yes	Domains to analyze. Accepts `example.com`, `https://example.com`, `www.example.com`, or a direct sitemap URL like `https://example.com/sitemap.xml`.
`maxUrls`	Integer	No	Cap on URL processing per domain. `0` = no cap. Useful for very large sites.
`proxyConfiguration`	Object	No	Proxy settings. Residential proxy recommended and set by default. For sites with Cloudflare/WAF bot protection, pinning the proxy to a specific country (e.g. `US`) often improves reliability.

Example input:

{
    "domains": ["onescales.com", "shopify.com"],
    "maxUrls": 0,
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}
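
If you build the input programmatically, a small helper along these lines may be useful. This is a sketch only: `build_run_input` is a hypothetical helper, not part of the actor, and the `apifyProxyCountry` key is assumed to be the field used for country pinning (it is the usual key in Apify proxy configuration objects, but check the actor's input schema before relying on it).

```python
import json


def build_run_input(domains, max_urls=0, country=None):
    """Build an input object with the three documented fields.

    max_urls=0 means no cap, as described in the input table.
    """
    if not domains:
        raise ValueError("domains is required and must not be empty")
    if max_urls < 0:
        raise ValueError("maxUrls must be 0 (no cap) or a positive integer")

    proxy = {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]}
    if country:
        # Assumption: country pinning uses the apifyProxyCountry key.
        proxy["apifyProxyCountry"] = country.upper()

    return {
        "domains": list(domains),
        "maxUrls": max_urls,
        "proxyConfiguration": proxy,
    }


print(json.dumps(build_run_input(["onescales.com", "shopify.com"], country="us"), indent=4))
```

Without the `country` argument the helper produces the same object as the example input above.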

Output

One row per domain. Every row includes:

Field	Description
`domain`	Analyzed domain
`sitemapUrl`	Sitemap URL that was used
`sitemapType`	`index` (sitemap index) or `urlset` (single sitemap)
`childSitemaps`	Child sitemap URLs (only present when `sitemapType` is `index`)
`analyzedAt`	ISO timestamp
`error`	Error message if analysis failed
`summary`	Total URLs, classified/unclassified counts, classification coverage, site archetype
`byType`	URL counts and percentages per content type
`bySection`	URL counts per top-level path prefix
`urlPatterns`	Detected URL templates with count, classification, and examples (top 20, minimum 2 URLs each)
`freshness`	lastmod coverage, content velocity, stale URL breakdown, posting cadence by section
`anomalies`	Detected anomalies with severity, count, and description

Example output row (Shopify ecommerce site):

{
    "domain": "onescales.com",
    "sitemapUrl": "https://www.onescales.com/sitemap.xml",
    "sitemapType": "index",
    "analyzedAt": "2026-05-19T10:23:11.940Z",
    "summary": {
        "totalUrls": 464,
        "classified": 464,
        "unclassified": 0,
        "classificationCoverage": 1,
        "siteArchetype": "ecommerce"
    },
    "byType": {
        "product":       { "count": 239, "percentage": 51.5 },
        "blog":          { "count": 96,  "percentage": 20.7 },
        "category":      { "count": 63,  "percentage": 13.6 },
        "page":          { "count": 66,  "percentage": 14.2 },
        "documentation": { "count": 0,   "percentage": 0 },
        "profile":       { "count": 0,   "percentage": 0 },
        "media":         { "count": 0,   "percentage": 0 },
        "other":         { "count": 0,   "percentage": 0 },
        "unclassified":  { "count": 0,   "percentage": 0 }
    },
    "bySection": {
        "/products/":    239,
        "/blog/":        98,
        "/pages/":       64,
        "/collections/": 61
    },
    "urlPatterns": [
        {
            "pattern": "/products/{slug}",
            "count": 239,
            "classification": "product",
            "examples": ["/products/table", "/products/the-perfect-day"]
        },
        {
            "pattern": "/blog/{slug}",
            "count": 98,
            "classification": "blog",
            "examples": ["/blog/resources/are-we-here", "/blog/resources/bank-look"]
        }
    ],
    "freshness": {
        "lastmodCoverage": 1,
        "newestUrlLastmod": "2026-05-19",
        "oldestUrlLastmod": "2019-02-12",
        "contentVelocity": {
            "urlsModifiedLast30Days": 301,
            "urlsModifiedLast90Days": 302,
            "urlsModifiedLast365Days": 309
        },
        "staleUrls": {
            "olderThan1Year": 153,
            "olderThan2Years": 115,
            "olderThan3Years": 102
        },
        "postingCadenceBySection": {
            "/products/":    "approx 19.9 updates per month over last 12 months",
            "/collections/": "approx 5.1 updates per month over last 12 months",
            "/blog/":        "approx 0.3 updates per month over last 12 months"
        }
    },
    "anomalies": [
        {
            "type": "stale_content_concentration",
            "severity": "low",
            "count": 102,
            "description": "102 URLs have not been modified in over 3 years. Consider auditing for relevance."
        }
    ]
}
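
To make the `freshness` block in the example easier to read, here is a rough Python sketch of how such numbers could be derived from `lastmod` dates. It is an illustration under stated assumptions, not the actor's code: a year is treated as 365 days, the stale buckets are cumulative (which matches the example, where 153 ≥ 115 ≥ 102), and the function name `freshness_summary` is hypothetical.

```python
from datetime import date, timedelta


def freshness_summary(lastmods, today):
    """Summarize lastmod dates.

    lastmods: one entry per sitemap URL, a date or None when the
    URL has no lastmod. today: the reference date for all windows.
    """
    dated = [d for d in lastmods if d is not None]

    def modified_within(days):
        cutoff = today - timedelta(days=days)
        return sum(1 for d in dated if d >= cutoff)

    def older_than(years):
        # Assumption: one year is counted as 365 days.
        cutoff = today - timedelta(days=365 * years)
        return sum(1 for d in dated if d < cutoff)

    return {
        "lastmodCoverage": round(len(dated) / len(lastmods), 3) if lastmods else 0,
        "newestUrlLastmod": max(dated).isoformat() if dated else None,
        "oldestUrlLastmod": min(dated).isoformat() if dated else None,
        "contentVelocity": {
            "urlsModifiedLast30Days": modified_within(30),
            "urlsModifiedLast90Days": modified_within(90),
            "urlsModifiedLast365Days": modified_within(365),
        },
        "staleUrls": {
            "olderThan1Year": older_than(1),
            "olderThan2Years": older_than(2),
            "olderThan3Years": older_than(3),
        },
    }


sample = [date(2026, 5, 10), date(2026, 3, 1), date(2025, 1, 1), date(2019, 2, 12), None]
print(freshness_summary(sample, today=date(2026, 5, 19)))
```

Because the stale buckets are cumulative in this sketch, a URL last modified in 2019 is counted in all three of them.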

Anomaly Types

Type	Triggered When	Severity
`other_urls_in_sitemap`	The sitemap contains utility URLs (auth, cart, search filters) that typically shouldn't be indexed	`medium` (≤20 URLs) or `high` (>20)
`stale_content_concentration`	More than 50 URLs haven't been modified in over 3 years	`low` (≤500) or `high` (>500)
`low_lastmod_coverage`	Fewer than 30% of URLs have a `lastmod` date (on sites with 50+ URLs)	`low`
`sitemap_returned_no_entries`	Sitemap was fetched but no URLs were extracted (proxy blocking, parse failure, or all child sitemaps empty)	`high`
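
The thresholds in the table can be expressed as a short rule set. The Python sketch below encodes them as written; the function name `detect_anomalies` and its parameters are hypothetical, and it assumes that a single utility URL is enough to trigger `other_urls_in_sitemap`, which the table does not state explicitly.

```python
def detect_anomalies(total_urls, other_urls, stale_3y, lastmod_coverage):
    """Apply the documented thresholds.

    total_urls: URLs extracted from the sitemap
    other_urls: URLs classified as "other" (utility/auth/search)
    stale_3y: URLs not modified in over 3 years
    lastmod_coverage: share of URLs with a lastmod date, 0 to 1
    """
    if total_urls == 0:
        # Fetched, but nothing could be extracted.
        return [{"type": "sitemap_returned_no_entries", "severity": "high"}]

    anomalies = []
    if other_urls > 0:  # assumption: any utility URL triggers the flag
        severity = "high" if other_urls > 20 else "medium"
        anomalies.append({"type": "other_urls_in_sitemap", "severity": severity})
    if stale_3y > 50:
        severity = "high" if stale_3y > 500 else "low"
        anomalies.append({"type": "stale_content_concentration", "severity": severity})
    if total_urls >= 50 and lastmod_coverage < 0.30:
        anomalies.append({"type": "low_lastmod_coverage", "severity": "low"})
    return anomalies


# Values taken from the example output row above.
print(detect_anomalies(total_urls=464, other_urls=0, stale_3y=102, lastmod_coverage=1))
```

With the values from the example output row, only `stale_content_concentration` with severity `low` is reported, which agrees with the `anomalies` array shown there.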

Tips

Sitemap index sites — for large sites with a sitemap index, the actor automatically fetches and aggregates all child sitemaps
Large sites — use maxUrls to cap processing and control costs on sites with 100,000+ URLs
No sitemap found — the actor checks robots.txt first, then falls back to /sitemap.xml and /sitemap_index.xml. If none work, the row will contain an error message
Sites with bot protection — small WordPress sites and Cloudflare-fronted docs sites sometimes block residential proxies. The actor automatically retries without the proxy when an XML fetch returns non-XML. If a domain still fails, try setting the proxy to a country-specific residential group (e.g. RESIDENTIAL pinned to US)
Direct sitemap URLs — you can pass a full sitemap URL like https://example.com/sitemaps/posts.xml to skip discovery and analyze only that sitemap
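
The discovery order described in these tips (robots.txt first, then `/sitemap.xml`, then `/sitemap_index.xml`, with direct sitemap URLs skipping discovery) can be sketched as follows. This is an approximation, not the actor's code: `sitemap_candidates` is a hypothetical function, bare domains are assumed to default to `https`, and a direct sitemap URL is assumed to be any input whose path ends in `.xml`. No network requests are made; the robots.txt body is passed in as text.

```python
from urllib.parse import urlparse


def sitemap_candidates(domain_input, robots_txt=""):
    """Return the sitemap URLs to try, in order, for one input value."""
    value = domain_input.strip()
    if "://" not in value:
        value = "https://" + value  # assumption: default to https
    parsed = urlparse(value)

    # Assumption: a path ending in .xml is a direct sitemap URL.
    if parsed.path.lower().endswith(".xml"):
        return [value]

    base = f"{parsed.scheme}://{parsed.netloc}"
    candidates = []

    # 1. Sitemap: directives found in robots.txt
    for line in robots_txt.splitlines():
        key, _, rest = line.partition(":")
        if key.strip().lower() == "sitemap" and rest.strip():
            candidates.append(rest.strip())

    # 2. Conventional fallback locations
    for fallback in ("/sitemap.xml", "/sitemap_index.xml"):
        url = base + fallback
        if url not in candidates:
            candidates.append(url)
    return candidates


robots = "User-agent: *\nSitemap: https://example.com/custom.xml\n"
print(sitemap_candidates("example.com", robots))
print(sitemap_candidates("https://example.com/sitemaps/posts.xml"))
```

A caller would fetch each candidate in turn and stop at the first one that returns valid sitemap XML. Compressed sitemaps (for example `.xml.gz`) are not handled in this sketch.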

Support

For bugs, feature requests, or questions — reach us at https://docs.google.com/forms/d/e/1FAIpQLSfsKyzZ3nRED7mML47I4LAfNh_mBwkuFMp1FgYYJ4AkDRgaRw/viewform?usp=dialog
