# Sitemap URL Extractor (`mikolabs/sitemap-url-extractor`) Actor

Extract every URL and its metadata from any sitemap.xml in seconds. Paste one or more sitemap URLs, run the Actor, and get a clean, structured dataset with url, lastmod, changefreq, priority, and more — ready to export as CSV, JSON, or Excel.

- **URL**: https://apify.com/mikolabs/sitemap-url-extractor.md
- **Developed by:** [mikolabs](https://apify.com/mikolabs) (community)
- **Categories:** Developer tools, SEO tools, E-commerce
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $1.50 / 1,000 results

This Actor is paid per event and usage. You are charged a fixed price for specific events, plus the cost of Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap URL Extractor — Bulk XML Sitemap Parser for SEO & Content Audits

Extract every URL and its metadata from any `sitemap.xml` in seconds. Paste one or more sitemap URLs, run the Actor, and get a clean, structured dataset with `url`, `lastmod`, `changefreq`, `priority`, and more — ready to export as CSV, JSON, or Excel.

> 💡 **The most affordable sitemap extractor on Apify.**
> Free plan users get **20 results free**. Paying users pay just **$1.50 per 1,000 results** — a fraction of competing tools.

---

### What This Actor Does

This Actor accepts one or more `sitemap.xml` URLs and:

- Crawls and parses all URLs from standard sitemaps (`urlset`)
- Automatically follows nested **sitemap index files** (`sitemapindex`) up to a configurable depth
- Extracts all standard sitemap fields: `url`, `lastmod`, `changefreq`, `priority`
- Supports **image sitemap** and **Google News sitemap** extensions
- Optionally filters results using a custom **regex pattern**
- Returns structured, export-ready data
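
The parsing step for a standard `urlset` can be sketched in a few lines of Python. This is an illustration only, not the Actor's actual implementation: the `parse_urlset` helper and the `SAMPLE` document are hypothetical, and the image and news extensions are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap used only for this illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-17</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post</loc>
  </url>
</urlset>"""


def parse_urlset(xml_text):
    """Return one dict per <url> entry; absent optional tags become None."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("sm:url", NS):
        priority = entry.findtext("sm:priority", namespaces=NS)
        records.append({
            "url": entry.findtext("sm:loc", namespaces=NS),
            "lastmod": entry.findtext("sm:lastmod", namespaces=NS),
            "changefreq": entry.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority is not None else None,
        })
    return records


for record in parse_urlset(SAMPLE):
    print(record)
```

In this sketch, entries without optional tags yield `None` for those fields. How the Actor itself represents missing values may differ.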

---

### Use Cases

#### 🔍 SEO Analysis
Extract every page URL from a website's sitemap to audit indexation, spot orphaned pages, or validate that all key content is discoverable by search engines.

#### 📋 Content Inventory
Build a complete list of all pages on a website before a migration, redesign, or CMS switch. Know exactly what exists before you move it.

#### 🔗 Broken Link Checking
Pull all sitemap URLs and feed them into a link checker to find 404s, redirects, or server errors across your entire site.

#### 🏆 Competitive Analysis
Discover how a competitor structures their website by parsing their public sitemap. Understand which pages they prioritize and how frequently they publish.

#### 📰 Content Monitoring
Track `lastmod` dates across your sitemap over time to monitor publishing frequency and detect stale content.
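
As a rough sketch of stale-content detection, the snippet below flags URLs whose `lastmod` is older than a chosen threshold. The `find_stale_urls` helper, the 180-day default, and the sample records are assumptions made for illustration. Only the `url` and `lastmod` field names come from the Actor's output.

```python
from datetime import date, timedelta


def find_stale_urls(records, today, max_age_days=180):
    """Return URLs whose lastmod date is more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for record in records:
        lastmod = record.get("lastmod")
        if not lastmod:
            continue  # no lastmod reported, so staleness is unknown
        try:
            # Keep only the YYYY-MM-DD part; lastmod may also carry a time.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue  # partial dates such as "2026" are skipped in this sketch
        if modified < cutoff:
            stale.append(record["url"])
    return stale


# Hypothetical records shaped like the Actor's dataset items.
records = [
    {"url": "https://example.com/", "lastmod": "2026-04-17"},
    {"url": "https://example.com/old", "lastmod": "2025-01-01T08:00:00+00:00"},
    {"url": "https://example.com/unknown", "lastmod": None},
]
print(find_stale_urls(records, today=date(2026, 4, 18)))
```

Passing `today` explicitly keeps the check reproducible when you compare exports from different runs.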

---

### How to Use

#### Step 1 — Open the Actor
Go to the **Input** tab in the Apify Console.

#### Step 2 — Configure Your Inputs

| Field | Description | Default |
|---|---|---|
| **Sitemap URLs** | One or more `sitemap.xml` URLs to extract from | Required |
| **Max Depth** | How deep to follow nested sitemapindex files | `3` |
| **Request Timeout** | Seconds to wait per request | `30` |
| **Filter URL Pattern** | Optional regex to keep only matching URLs | *(none)* |
| **Proxy Configuration** | Optional proxy for rate-limited sites | *(none)* |

**Example sitemap URL:**
```
https://mikolabs.xyz/sitemap.xml
```

You can add multiple URLs — the Actor processes them all in one run.

#### Step 3 — Run the Actor
Click **Start** and the Actor will crawl your sitemap(s), follow any nested indexes, and collect all URL records into the dataset.

#### Step 4 — Export Your Results
Once the run finishes, go to the **Storage → Dataset** tab and export your data in:
- **CSV** — open directly in Excel or Google Sheets
- **JSON** — use in APIs or pipelines
- **XLSX** — ready-made spreadsheet

---

### Output Example

Each row in the output dataset represents one URL found in the sitemap:

```json
[
  {
    "url": "https://mikolabs.xyz/",
    "lastmod": "2026-04-17",
    "changefreq": "monthly",
    "priority": 1.0,
    "sitemapUrl": "https://mikolabs.xyz/sitemap.xml",
    "sitemapType": "urlset",
    "scrapedAt": "2026-04-18T10:29:37.665452+00:00"
  },
  {
    "url": "https://mikolabs.xyz/apis",
    "lastmod": "2026-04-17",
    "changefreq": "weekly",
    "priority": 0.8,
    "sitemapUrl": "https://mikolabs.xyz/sitemap.xml",
    "sitemapType": "urlset",
    "scrapedAt": "2026-04-18T10:29:37.665452+00:00"
  },
  {
    "url": "https://mikolabs.xyz/pricing",
    "lastmod": "2026-04-17",
    "changefreq": "monthly",
    "priority": 0.7,
    "sitemapUrl": "https://mikolabs.xyz/sitemap.xml",
    "sitemapType": "urlset",
    "scrapedAt": "2026-04-18T10:29:37.665452+00:00"
  }
]
````

#### Output Fields

| Field | Type | Description |
|---|---|---|
| `url` | string | The page URL from the sitemap |
| `lastmod` | string | Last modified date (ISO 8601) |
| `changefreq` | string | How often the page changes (daily, weekly, monthly…) |
| `priority` | number | Page priority relative to the rest of the site (0.0–1.0) |
| `sitemapUrl` | string | The source sitemap this URL was found in |
| `sitemapType` | string | `urlset` or `sitemapindex` |
| `images` | array | Image entries from image sitemap extensions (if present) |
| `news` | object | Google News metadata (if present) |
| `scrapedAt` | string | Timestamp of when the record was collected |

---

### Input Reference

```json
{
  "sitemapUrls": ["https://mikolabs.xyz/sitemap.xml"],
  "maxDepth": 3,
  "requestTimeoutSecs": 30,
  "filterUrlPattern": ""
}
```

**`sitemapUrls`** *(required)*
An array of one or more `sitemap.xml` URLs. Accepts both standard `urlset` sitemaps and `sitemapindex` files that point to other sitemaps.

**`maxDepth`** *(optional, default: 3)*
Controls how many levels of nested sitemap index files the Actor will follow. Set to `1` to only parse the provided sitemaps without following any child links.
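
One way to read the `maxDepth` rule is as a depth counter that starts at 1 for the sitemaps you provide and grows by one for each nested index that is followed. The sketch below illustrates that reading with an in-memory stand-in for HTTP fetches. `collect_urls` and `FAKE_WEB` are hypothetical names, and the Actor's real traversal logic may differ.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# Hypothetical in-memory "web": URL -> XML body, standing in for HTTP fetches.
FAKE_WEB = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/blog/a</loc></url>"
        "<url><loc>https://example.com/blog/b</loc></url>"
        "</urlset>"
    ),
}


def collect_urls(sitemap_url, max_depth=3, depth=1, fetch=FAKE_WEB.get):
    """Collect page URLs, following sitemap indexes while depth < max_depth."""
    xml_text = fetch(sitemap_url)
    if xml_text is None:
        return []  # unknown sitemap: nothing to parse
    root = ET.fromstring(xml_text)
    if root.tag.endswith("}sitemapindex"):
        if depth >= max_depth:
            return []  # depth budget exhausted: do not follow child sitemaps
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text.strip(), max_depth, depth + 1, fetch))
        return urls
    # A plain urlset: return its page URLs.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(collect_urls("https://example.com/sitemap.xml", max_depth=1))
print(collect_urls("https://example.com/sitemap.xml", max_depth=2))
```

Under this reading, `max_depth=1` on an index file returns nothing because the child sitemaps are never opened, while `max_depth=2` reaches the URLs one level down.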

**`requestTimeoutSecs`** *(optional, default: 30)*
Maximum time in seconds to wait for each sitemap response. Increase this for slow servers.

**`filterUrlPattern`** *(optional)*
A regular expression to filter which URLs are saved to the dataset. For example, `https://example\.com/blog/.*` will only save blog URLs. Leave empty to collect all URLs.
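
To preview what a pattern would keep before spending a run on it, you can test it locally. The sketch below assumes the pattern may match anywhere in the URL (search semantics), which is an assumption about the Actor's behavior rather than something documented here. `filter_urls` is a hypothetical helper.

```python
import re


def filter_urls(urls, pattern):
    """Keep URLs that the pattern matches anywhere; an empty pattern keeps all."""
    if not pattern:
        return list(urls)
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Hypothetical URLs for illustration.
urls = [
    "https://example.com/",
    "https://example.com/blog/first-post",
    "https://example.com/pricing",
]
print(filter_urls(urls, r"https://example\.com/blog/.*"))
print(filter_urls(urls, ""))
```

If the Actor anchors the pattern at the start of the URL instead, a full-URL pattern such as the one above behaves the same either way, which makes it the safer choice.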

**`proxyConfiguration`** *(optional)*
Enables Apify proxy rotation to avoid IP blocks on rate-limited websites. Not required for most public sitemaps.

---

### Pricing

| Plan | Price |
|---|---|
| Free | 20 results free |
| Pay-as-you-go / Subscription | **$1.50 per 1,000 results** |

This Actor is among the most competitively priced sitemap extractors on the Apify platform — ideal for one-off audits, scheduled monitoring, and large-scale extractions alike.
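
For budgeting, the per-result charge is simple arithmetic on the listed rate. The sketch below covers only the $1.50 per 1,000 results event price. It deliberately ignores Apify platform usage costs, Store discounts, and the free-plan allowance, and `estimate_result_charge` is a hypothetical helper.

```python
from decimal import Decimal

# Listed rate: $1.50 per 1,000 results.
PRICE_PER_1000_RESULTS = Decimal("1.50")


def estimate_result_charge(result_count):
    """Estimate the per-result charge in USD, rounded to cents."""
    if result_count < 0:
        raise ValueError("result_count must not be negative")
    charge = PRICE_PER_1000_RESULTS * result_count / 1000
    return charge.quantize(Decimal("0.01"))


print(estimate_result_charge(10_000))
```

Using `Decimal` avoids binary floating-point rounding surprises when the estimate feeds into a report or invoice check.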

---

### Frequently Asked Questions

**Does it support sitemap index files?**
Yes. If your sitemap URL points to a `sitemapindex` (a sitemap of sitemaps), the Actor will automatically follow all child sitemap links up to the configured `maxDepth`.

**Can I extract from multiple sitemaps in one run?**
Yes. Add as many sitemap URLs as you need in the `sitemapUrls` input field — all will be processed in a single run.

**What if the sitemap URL redirects?**
The Actor handles HTTP redirects automatically.

**Can I filter results to only specific URL patterns?**
Yes — use the `filterUrlPattern` field with a regular expression (e.g. `/blog/.*` to keep only blog pages).

**Is the data exportable to Excel or Google Sheets?**
Yes. After the run, export as CSV from the Dataset tab and open it directly in Excel or Google Sheets.

**What happens if a sitemap is behind a bot check?**
Enable the **Proxy Configuration** option to route requests through Apify's residential or datacenter proxies.

# Actor input Schema

## `sitemapUrls` (type: `array`):

List of sitemap.xml URLs to extract data from (e.g. https://onescales.com/sitemap.xml). Supports nested sitemapindex files.

## `maxDepth` (type: `integer`):

How deep to follow nested sitemap index files (sitemapindex). 1 = only the given sitemaps, 2 = follow one level of nested indexes, etc.

## `proxyConfiguration` (type: `object`):

Optional proxy settings. Leave empty to use no proxy.

## `requestTimeoutSecs` (type: `integer`):

Maximum time in seconds to wait for each sitemap request.

## `filterUrlPattern` (type: `string`):

Optional regex pattern to filter output URLs. Only URLs matching the pattern will be saved. Leave empty to keep all URLs.

## Actor input object example

```json
{
  "sitemapUrls": [
    "https://mikolabs.xyz/sitemap.xml"
  ],
  "maxDepth": 3,
  "requestTimeoutSecs": 30,
  "filterUrlPattern": "https://example\\.com/blog/.*"
}
```
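
If you build the input object in code, it can help to check it locally before starting a run. The sketch below mirrors the bounds stated in this page's OpenAPI specification (at least one sitemap URL, `maxDepth` from 1 to 10, `requestTimeoutSecs` from 5 to 120). The `build_input` helper is hypothetical and is not part of any Apify client.

```python
import re


def build_input(sitemap_urls, max_depth=3, request_timeout_secs=30, filter_url_pattern=""):
    """Validate values against the documented bounds and return the input dict."""
    if not sitemap_urls:
        raise ValueError("sitemapUrls needs at least one URL")
    if not 1 <= max_depth <= 10:
        raise ValueError("maxDepth must be between 1 and 10")
    if not 5 <= request_timeout_secs <= 120:
        raise ValueError("requestTimeoutSecs must be between 5 and 120")
    try:
        re.compile(filter_url_pattern)  # catch regex typos before the run starts
    except re.error as exc:
        raise ValueError(f"filterUrlPattern is not a valid regex: {exc}") from exc
    return {
        "sitemapUrls": list(sitemap_urls),
        "maxDepth": max_depth,
        "requestTimeoutSecs": request_timeout_secs,
        "filterUrlPattern": filter_url_pattern,
    }


print(build_input(["https://mikolabs.xyz/sitemap.xml"], filter_url_pattern=r"/blog/.*"))
```

The pattern is checked with Python's `re` module, so a regex dialect difference on the Actor's side could still surface at run time.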

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "sitemapUrls": [
        "https://mikolabs.xyz/sitemap.xml"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("mikolabs/sitemap-url-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "sitemapUrls": ["https://mikolabs.xyz/sitemap.xml"] }

# Run the Actor and wait for it to finish
run = client.actor("mikolabs/sitemap-url-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "sitemapUrls": [
    "https://mikolabs.xyz/sitemap.xml"
  ]
}' |
apify call mikolabs/sitemap-url-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=mikolabs/sitemap-url-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap URL Extractor",
        "description": "Extract every URL and its metadata from any sitemap.xml in seconds. Paste one or more sitemap URLs, run the Actor, and get a clean, structured dataset with url, lastmod, changefreq, priority, and more — ready to export as CSV, JSON, or Excel.",
        "version": "0.0",
        "x-build-id": "PDclRcpjdN3bcoA4B"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/mikolabs~sitemap-url-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-mikolabs-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/mikolabs~sitemap-url-extractor/runs": {
            "post": {
                "operationId": "runs-sync-mikolabs-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/mikolabs~sitemap-url-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-mikolabs-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "sitemapUrls"
                ],
                "properties": {
                    "sitemapUrls": {
                        "title": "Sitemap URLs",
                        "minItems": 1,
                        "type": "array",
                        "description": "List of sitemap.xml URLs to extract data from (e.g. https://onescales.com/sitemap.xml). Supports nested sitemapindex files.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxDepth": {
                        "title": "Max Sitemap Index Depth",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How deep to follow nested sitemap index files (sitemapindex). 1 = only the given sitemaps, 2 = follow one level of nested indexes, etc.",
                        "default": 3
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional proxy settings. Leave empty to use no proxy."
                    },
                    "requestTimeoutSecs": {
                        "title": "Request Timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Maximum time in seconds to wait for each sitemap request.",
                        "default": 30
                    },
                    "filterUrlPattern": {
                        "title": "Filter URL Pattern (regex)",
                        "type": "string",
                        "description": "Optional regex pattern to filter output URLs. Only URLs matching the pattern will be saved. Leave empty to keep all URLs."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
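
For projects that cannot use a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch: `build_request` and `run_actor` are hypothetical helpers, the 300-second timeout is an arbitrary choice, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/mikolabs~sitemap-url-extractor/run-sync-get-dataset-items"


def build_request(token, actor_input):
    """Build the POST request; the token goes in the query string per the spec."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token, actor_input, timeout=300):
    """Send the request and decode the response body as JSON (assumed format)."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request makes no network call; run_actor() does.
request = build_request("<YOUR_API_TOKEN>", {"sitemapUrls": ["https://mikolabs.xyz/sitemap.xml"]})
print(request.get_method(), request.full_url.split("?")[0])
```

Calling `run_actor("<YOUR_API_TOKEN>", {...})` with a real token starts a billable run, so keep it out of import-time code and automated tests.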
