# Sitemap Structure Analyzer (`onescales/sitemap-structure-analyzer`) Actor

Analyze any website's sitemap in seconds using sitemap.xml data. Get URL counts by type (product, blog, docs), content freshness, URL patterns, and SEO anomalies — no page fetching required.

- **URL**: https://apify.com/onescales/sitemap-structure-analyzer.md
- **Developed by:** [One Scales](https://apify.com/onescales) (community)
- **Categories:** SEO tools, E-commerce, Developer tools
- **Stats:** 5 total users, 4 monthly users, 81.8% runs succeeded, 3 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

Pay per usage

This Actor is free to use. You pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

**Sitemap Structure Analyzer** tells you what a website is made of — without fetching a single page. Point it at any domain and get back a full breakdown: how many product, blog, docs, and utility URLs the site has; what URL templates drive it; how fresh the content is; and where the anomalies are.

Works on sites from 50 URLs to hundreds of thousands of URLs. Runs in seconds. No page fetching means no rate limits, no IP blocks, and no scraping legality concerns — just pure structural analysis of public sitemap data.

Pairs naturally with the **Sitemap URL Extractor** (get the raw URLs) and the **Bulk AI Markdown Maker** (scrape only the pages you actually need).

**Use cases include:**
- **SEO audits** — surface stale content, utility URLs that shouldn't be indexed, and low `lastmod` coverage across entire sites
- **Competitive research** — understand a competitor's content shape before investing in a content strategy
- **AI / RAG pipeline building** — identify exactly what URL types and sections to include before scraping
- **Agency prospecting** — bulk-check potential client sites for content mix and content freshness
- **Content strategy benchmarking** — compare your site's product/blog/docs ratio against competitors
- **Technical SEO QA** — detect account, cart, and search filter URLs appearing in sitemaps where they shouldn't

---

### Features

- **Content type classification** — every URL labeled as one of: `product`, `blog`, `documentation`, `profile`, `category`, `page`, `media`, `other` (utility/auth/search), or `unclassified`
- **Site archetype detection** — labels the site as `ecommerce`, `content`, `documentation`, `community`, `marketing`, or `general` based on its dominant content type
- **Docs-context awareness** — sites on `docs.`, `developer.`, `api.`, `support.`, `help.`, `learn.`, `wiki.`, `kb.`, or `knowledgebase.` subdomains, or with a meaningful share of `/docs/`-style paths, get a smarter classification pass that recognizes modern documentation platforms (Mintlify, Docusaurus, Nextra, GitBook)
- **URL pattern detection** — groups URLs into templates (e.g. `/products/{slug}`) with counts, dominant classification, and example URLs. Patterns with only one URL are suppressed
- **Freshness analysis** — lastmod coverage, newest/oldest URLs, content velocity (30/90/365-day windows), stale URL counts (1+/2+/3+ years), posting cadence by section
- **Anomaly detection** — flags utility URLs in sitemaps, stale content concentration, low lastmod coverage, and silent zero-result sitemaps
- **Sitemap index support** — automatically fetches and recurses through all child sitemaps in a sitemap index, with cycle protection
- **Proxy support** — residential proxy by default with automatic no-proxy fallback for small sites with bot protection that blocks proxied requests
- **Budget capping** — caps domains processed to stay within your configured budget

---

### How to Use

#### Input

| Field | Type | Required | Description |
|---|---|---|---|
| `domains` | String list | Yes | Domains to analyze. Accepts `example.com`, `https://example.com`, `www.example.com`, or a direct sitemap URL like `https://example.com/sitemap.xml`. |
| `maxUrls` | Integer | No | Cap on URL processing per domain. `0` = no cap. Useful for very large sites. |
| `proxyConfiguration` | Object | No | Proxy settings. Residential proxy recommended and set by default. For sites with Cloudflare/WAF bot protection, pinning the proxy to a specific country (e.g. `US`) often improves reliability. |

**Example input:**

```json
{
    "domains": ["onescales.com", "shopify.com"],
    "maxUrls": 0,
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}
````
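For scripted runs, the input object can be assembled by a small helper. The sketch below is illustrative only: `build_input` is a hypothetical name, and `apifyProxyCountry` is the key Apify's standard proxy configuration uses for country pinning, which this Actor is assumed (not confirmed here) to honor.

```python
from typing import Any, Optional


def build_input(domains: list[str], max_urls: int = 0,
                country: Optional[str] = None) -> dict[str, Any]:
    """Assemble an input object shaped like the example input (hypothetical helper)."""
    cleaned = [d.strip() for d in domains if d.strip()]
    if not cleaned:
        raise ValueError("`domains` is required and must not be empty")
    if max_urls < 0:
        raise ValueError("`maxUrls` must be 0 (no cap) or a positive integer")

    proxy: dict[str, Any] = {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    }
    if country:
        # Assumption: country pinning uses Apify's `apifyProxyCountry` key.
        proxy["apifyProxyCountry"] = country.upper()

    return {"domains": cleaned, "maxUrls": max_urls, "proxyConfiguration": proxy}
```

Calling `build_input(["onescales.com"], country="US")` yields the example input plus the country pin.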

#### Output

One row per domain. Rows can include the following fields:

| Field | Description |
|---|---|
| `domain` | Analyzed domain |
| `sitemapUrl` | Sitemap URL that was used |
| `sitemapType` | `index` (sitemap index) or `urlset` (single sitemap) |
| `childSitemaps` | Child sitemap URLs (only present when `sitemapType` is `index`) |
| `analyzedAt` | ISO timestamp |
| `error` | Error message if analysis failed |
| `summary` | Total URLs, classified/unclassified counts, classification coverage, site archetype |
| `byType` | URL counts and percentages per content type |
| `bySection` | URL counts per top-level path prefix |
| `urlPatterns` | Detected URL templates with count, classification, and examples (top 20, minimum 2 URLs each) |
| `freshness` | lastmod coverage, content velocity, stale URL breakdown, posting cadence by section |
| `anomalies` | Detected anomalies with severity, count, and description |
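To make the `urlPatterns` field more concrete, here is a minimal sketch of how URLs might be grouped into templates. The Actor's real algorithm is not documented here, so this assumes the simplest rule: the last path segment is the variable `{slug}` and everything before it is fixed. The name `detect_patterns` is hypothetical, and the `classification` field is left out.

```python
from collections import defaultdict
from urllib.parse import urlparse


def detect_patterns(urls: list[str], top: int = 20, min_count: int = 2) -> list[dict]:
    """Group URLs by a simple template: fixed parent path + `{slug}` for the last segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        if len(segments) < 2:
            continue  # "/" and "/about" have no parent section to template on
        template = "/" + "/".join(segments[:-1]) + "/{slug}"
        groups[template].append(path)

    patterns = [
        {"pattern": t, "count": len(paths), "examples": paths[:2]}
        for t, paths in groups.items()
        if len(paths) >= min_count  # patterns with a single URL are suppressed
    ]
    patterns.sort(key=lambda p: p["count"], reverse=True)
    return patterns[:top]
```

The `top=20` and `min_count=2` defaults mirror the limits stated in the table. A production version would likely also recognize numeric IDs and dates as separate placeholders.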

**Example output row** (Shopify ecommerce site):

```json
{
    "domain": "onescales.com",
    "sitemapUrl": "https://www.onescales.com/sitemap.xml",
    "sitemapType": "index",
    "analyzedAt": "2026-05-19T10:23:11.940Z",
    "summary": {
        "totalUrls": 464,
        "classified": 464,
        "unclassified": 0,
        "classificationCoverage": 1,
        "siteArchetype": "ecommerce"
    },
    "byType": {
        "product":       { "count": 239, "percentage": 51.5 },
        "blog":     { "count": 96,  "percentage": 20.7 },
        "category":      { "count": 63,  "percentage": 13.6 },
        "page":          { "count": 66,  "percentage": 14.2 },
        "documentation": { "count": 0,   "percentage": 0 },
        "profile":       { "count": 0,   "percentage": 0 },
        "media":         { "count": 0,   "percentage": 0 },
        "other":         { "count": 0,   "percentage": 0 },
        "unclassified":  { "count": 0,   "percentage": 0 }
    },
    "bySection": {
        "/products/":    239,
        "/blog/":       98,
        "/pages/":       64,
        "/collections/": 61
    },
    "urlPatterns": [
        {
            "pattern": "/products/{slug}",
            "count": 239,
            "classification": "product",
            "examples": ["/products/table", "/products/the-perfect-day"]
        },
        {
            "pattern": "/blog/{slug}",
            "count": 98,
            "classification": "blog",
            "examples": ["/blog/resources/are-we-here", "/blog/resources/bank-look"]
        }
    ],
    "freshness": {
        "lastmodCoverage": 1,
        "newestUrlLastmod": "2026-05-19",
        "oldestUrlLastmod": "2019-02-12",
        "contentVelocity": {
            "urlsModifiedLast30Days": 301,
            "urlsModifiedLast90Days": 302,
            "urlsModifiedLast365Days": 309
        },
        "staleUrls": {
            "olderThan1Year": 153,
            "olderThan2Years": 115,
            "olderThan3Years": 102
        },
        "postingCadenceBySection": {
            "/products/":    "approx 19.9 updates per month over last 12 months",
            "/collections/": "approx 5.1 updates per month over last 12 months",
            "/blog/":       "approx 0.3 updates per month over last 12 months"
        }
    },
    "anomalies": [
        {
            "type": "stale_content_concentration",
            "severity": "low",
            "count": 102,
            "description": "102 URLs have not been modified in over 3 years. Consider auditing for relevance."
        }
    ]
}
```
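The `freshness` block can be read as a set of counts over `lastmod` dates. The sketch below shows one plausible way to derive those counts. It is an assumption about the arithmetic, not the Actor's code: `freshness_summary` is a hypothetical name, a year is treated as 365 days, and `postingCadenceBySection` is omitted.

```python
from datetime import date, timedelta
from typing import Optional


def freshness_summary(lastmods: list[Optional[date]], today: date) -> dict:
    """Summarize lastmod dates into coverage, velocity windows, and stale buckets."""
    dated = [d for d in lastmods if d is not None]
    coverage = round(len(dated) / len(lastmods), 3) if lastmods else 0

    def modified_within(days: int) -> int:
        cutoff = today - timedelta(days=days)
        return sum(1 for d in dated if d >= cutoff)

    def older_than(years: int) -> int:
        cutoff = today - timedelta(days=365 * years)  # assumption: 1 year = 365 days
        return sum(1 for d in dated if d < cutoff)

    return {
        "lastmodCoverage": coverage,
        "newestUrlLastmod": max(dated).isoformat() if dated else None,
        "oldestUrlLastmod": min(dated).isoformat() if dated else None,
        "contentVelocity": {
            "urlsModifiedLast30Days": modified_within(30),
            "urlsModifiedLast90Days": modified_within(90),
            "urlsModifiedLast365Days": modified_within(365),
        },
        "staleUrls": {
            "olderThan1Year": older_than(1),
            "olderThan2Years": older_than(2),
            "olderThan3Years": older_than(3),
        },
    }
```

Note that the windows are cumulative: a URL modified last week counts in all three velocity windows, and a URL untouched for four years counts in all three stale buckets.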

#### Anomaly Types

| Type | Triggered When | Severity |
|---|---|---|
| `other_urls_in_sitemap` | Utility URLs (auth, cart, search filters) appear in the sitemap and typically shouldn't be indexed | `medium` (≤20 URLs) or `high` (>20) |
| `stale_content_concentration` | More than 50 URLs haven't been modified in over 3 years | `low` (≤500) or `high` (>500) |
| `low_lastmod_coverage` | Fewer than 30% of URLs have a `lastmod` date (on sites with 50+ URLs) | `low` |
| `sitemap_returned_no_entries` | Sitemap was fetched but no URLs were extracted (proxy blocking, parse failure, or all child sitemaps empty) | `high` |
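The thresholds in this table can be restated as a small function, which is handy for re-checking rows or applying the same rules to your own data. This is a transcription of the table, not the Actor's source. The name `severity` and its parameters are hypothetical, and it assumes that a single utility URL is enough to trigger `other_urls_in_sitemap`.

```python
from typing import Optional


def severity(anomaly_type: str, count: int = 0, total_urls: int = 0,
             lastmod_coverage: float = 1.0) -> Optional[str]:
    """Return the severity from the anomaly table, or None if not triggered."""
    if anomaly_type == "other_urls_in_sitemap":
        if count == 0:  # assumption: any utility URL triggers the anomaly
            return None
        return "medium" if count <= 20 else "high"
    if anomaly_type == "stale_content_concentration":
        if count <= 50:  # needs more than 50 URLs older than 3 years
            return None
        return "low" if count <= 500 else "high"
    if anomaly_type == "low_lastmod_coverage":
        triggered = total_urls >= 50 and lastmod_coverage < 0.30
        return "low" if triggered else None
    if anomaly_type == "sitemap_returned_no_entries":
        return "high" if total_urls == 0 else None
    raise ValueError(f"unknown anomaly type: {anomaly_type}")
```

For instance, 102 stale URLs fall in the `low` band, which matches the anomaly shown in the example output row.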

---

### Tips

- **Sitemap index sites** — for large sites with a sitemap index, the Actor automatically fetches and aggregates all child sitemaps
- **Large sites** — use `maxUrls` to cap processing and control costs on sites with 100,000+ URLs
- **No sitemap found** — the Actor checks `robots.txt` first, then falls back to `/sitemap.xml` and `/sitemap_index.xml`. If none work, the row will contain an error message
- **Sites with bot protection** — small WordPress sites and Cloudflare-fronted docs sites sometimes block residential proxies. The Actor automatically retries without the proxy when an XML fetch returns non-XML. If a domain still fails, try setting the proxy to a country-specific residential group (e.g. `RESIDENTIAL` pinned to `US`)
- **Direct sitemap URLs** — you can pass a full sitemap URL like `https://example.com/sitemaps/posts.xml` to skip discovery and analyze only that sitemap
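
The discovery order described in these tips (`robots.txt` first, then the two fallback paths, with direct sitemap URLs skipping discovery) can be sketched as a pure function. This is an illustration under stated assumptions: `sitemap_candidates` is a hypothetical name, the caller supplies the already-fetched `robots.txt` text, and a direct sitemap URL is recognized only by a `.xml` suffix.

```python
from urllib.parse import urlparse


def sitemap_candidates(entry: str, robots_txt: str = "") -> list[str]:
    """List sitemap URLs to try for one `domains` entry, in discovery order."""
    entry = entry.strip()
    url = entry if "://" in entry else f"https://{entry}"
    parsed = urlparse(url)

    # Assumption: a path ending in .xml is a direct sitemap URL, so skip discovery.
    if parsed.path.lower().endswith(".xml"):
        return [url]

    origin = f"{parsed.scheme}://{parsed.netloc}"
    candidates = []
    # robots.txt can declare sitemaps with `Sitemap: <url>` lines (key is case-insensitive).
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())

    for fallback in (f"{origin}/sitemap.xml", f"{origin}/sitemap_index.xml"):
        if fallback not in candidates:
            candidates.append(fallback)
    return candidates
```

Keeping the network fetch outside the function makes the ordering logic easy to test on its own.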

---

### Support

For bugs, feature requests, or questions — reach us at https://docs.google.com/forms/d/e/1FAIpQLSfsKyzZ3nRED7mML47I4LAfNh_mBwkuFMp1FgYYJ4AkDRgaRw/viewform?usp=dialog

### Related Keywords

sitemap analyzer, sitemap structure, sitemap data, sitemap.xml data, sitemap analysis, website structure analyzer, content type classifier, URL classifier, SEO audit, site audit, sitemap extractor, content velocity, sitemap freshness, stale content, URL pattern analyzer, competitive research, RAG pipeline, AI dataset builder, site architecture, content shape, bulk sitemap analyzer, sitemap index, docs site analyzer, documentation analyzer, actor, AI, API, apify, at scale, automated, automation, batch, bulk, checker, crawler, CSV, dataset, detector, Excel, export, extractor, finder, generator, Google Sheets, JSON, lookup, make, make.com, MCP, n8n, no-code, no API key required, parser, pipeline, report, scanner, schedule, scheduled, scraper, spreadsheet, tool, validator, webhook, workflow, XML, zapier

# Actor input Schema

## `domains` (type: `array`):

Domains to analyze. Accepts example.com, https://example.com, www.example.com, or a direct sitemap URL such as https://example.com/sitemap.xml.

## `maxUrls` (type: `integer`):

Cap on URL processing per domain. 0 = no cap. Useful for massive sites.

## `proxyConfiguration` (type: `object`):

Proxy settings for fetching sitemaps.

## Actor input object example

```json
{
  "domains": [
    "onescales.com"
  ],
  "maxUrls": 0,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}
```

# Actor output Schema

## `analysisResults` (type: `string`):

All sitemap structure analysis data.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "domains": [
        "onescales.com"
    ],
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ]
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("onescales/sitemap-structure-analyzer").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "domains": ["onescales.com"],
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Run the Actor and wait for it to finish
run = client.actor("onescales/sitemap-structure-analyzer").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
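
Once the dataset items are fetched, a short post-processing step can separate failed domains from those needing attention. The sketch below assumes each item follows the output row layout documented in the README (`domain`, `error`, `anomalies` with `type` and `severity`). The helper name `triage` is hypothetical.

```python
def triage(items: list[dict]) -> dict[str, list]:
    """Split dataset rows into failed domains and domains with high-severity anomalies."""
    failed, flagged = [], []
    for row in items:
        if row.get("error"):
            # Analysis failed for this domain; keep the message for follow-up.
            failed.append((row.get("domain"), row["error"]))
            continue
        high = [a["type"] for a in row.get("anomalies", []) if a.get("severity") == "high"]
        if high:
            flagged.append((row.get("domain"), high))
    return {"failed": failed, "flagged": flagged}
```

With the client example above, this could be called as `triage(list(client.dataset(run["defaultDatasetId"]).iterate_items()))`.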

## CLI example

```bash
echo '{
  "domains": [
    "onescales.com"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ]
  }
}' |
apify call onescales/sitemap-structure-analyzer --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=onescales/sitemap-structure-analyzer",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Structure Analyzer",
        "description": "Analyze any website's sitemap in seconds using sitemap.xml data. Get URL counts by type (product, blog, docs), content freshness, URL patterns, and SEO anomalies — no page fetching required.",
        "version": "1.0",
        "x-build-id": "bsunDwuVUoTDwMIEf"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/onescales~sitemap-structure-analyzer/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-onescales-sitemap-structure-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/onescales~sitemap-structure-analyzer/runs": {
            "post": {
                "operationId": "runs-sync-onescales-sitemap-structure-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/onescales~sitemap-structure-analyzer/run-sync": {
            "post": {
                "operationId": "run-sync-onescales-sitemap-structure-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "domains"
                ],
                "properties": {
                    "domains": {
                        "title": "Domains",
                        "type": "array",
                        "description": "Domains to analyze. Accepts example.com, https://example.com, or www.example.com.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxUrls": {
                        "title": "Max URLs to Analyze Per Domain",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Cap on URL processing per domain. 0 = no cap. Useful for massive sites.",
                        "default": 0
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings for fetching sitemaps."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
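
For projects that call the REST API directly, the `run-sync-get-dataset-items` operation from the specification above maps to a single POST request. The following is a minimal standard-library sketch that only builds the request, using the server URL, path, and `token` query parameter from the spec. The helper name `build_run_request` is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/onescales~sitemap-structure-analyzer/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> Request:
    """Build (but do not send) the POST request described by the spec."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})
```

Passing the result to `urllib.request.urlopen` would start a run and block until the dataset items come back. That call consumes platform usage, so it is left out of the sketch.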
