# Sitemap URL Extractor (`wiry_kingdom/sitemap-url-extractor`) Actor

Extract every URL from any website's sitemap.xml with lastmod, changefreq, and priority. Recursively expands sitemap index files, reads robots.txt, and handles gzipped sitemaps. Useful for SEO audits, content migration, site inventory, and competitor research.

- **URL**: https://apify.com/wiry_kingdom/sitemap-url-extractor.md
- **Developed by:** [Mohieldin Mohamed](https://apify.com/wiry_kingdom) (community)
- **Categories:** SEO tools, Developer tools
- **Stats:** 2 total users, 1 monthly users, 0.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap URL Extractor

**Extract every URL from any website in seconds — with lastmod, changefreq, and priority metadata intact.**

This Actor reads a site's `robots.txt`, discovers every declared sitemap, recursively expands sitemap index files, and writes every URL it finds to a structured Apify dataset that you can download as JSON, CSV, or Excel.

### What does Sitemap URL Extractor do?

Give it one website URL. It returns every URL that site publishes in its `sitemap.xml` — including URLs buried inside multi-level sitemap index files, gzipped sitemaps, and sitemaps referenced from `robots.txt`. Perfect for **SEO audits**, **content migrations**, **site inventory**, and **competitor research**. No API keys. No browser. No proxy required for most sites.

Try it: paste `https://apify.com` into the Start URLs field, press Start, and watch the dataset fill up with every indexable URL on the site. A typical mid-sized company site (5,000–50,000 URLs) finishes in under a minute.

Apify platform advantages include scheduled runs (daily sitemap snapshots), API access, webhook integrations, proxy rotation when needed, and run history.

### Why use Sitemap URL Extractor?

- **SEO audits** — see every URL Google is supposed to index and compare against your canonical list
- **Content migration** — pull your entire old site's URL list before moving to a new CMS
- **Competitor intelligence** — see every public page a competitor publishes, including product catalogs and blog archives
- **Link checking** — feed the output into a link checker to find every broken link on a site
- **Snapshots over time** — schedule daily runs and diff URL lists to detect content changes
- **Dataset for LLM training** — get a clean list of URLs to feed into a content extractor

### How to use Sitemap URL Extractor

1. Click **Try for free** (or **Start** if you're already logged in)
2. In the **Start URLs** field, paste one or more website root URLs (e.g. `https://example.com`)
3. Optionally set **Max URLs per site** to cap output size
4. Click **Start**
5. Watch the dataset populate in real time in the Output tab
6. Download as JSON, CSV, or Excel, or hit the API endpoint directly

### Input

- **Start URLs** — one or more website root URLs to crawl (e.g. `https://apify.com`)
- **Max URLs per site** — safety cap (default 10,000, use 0 for unlimited)
- **Include metadata** — attach `lastmod`, `changefreq`, `priority` to each URL (default: yes)
- **Follow sitemap index** — recursively expand nested `<sitemapindex>` files (default: yes)
- **Proxy configuration** — optional Apify Proxy for sites that block raw server IPs

### Output

The Actor pushes one dataset item per extracted URL. You can download the dataset as JSON, CSV, HTML, or Excel.

```json
{
    "url": "https://apify.com/apify/instagram-scraper",
    "lastmod": "2025-03-14",
    "changefreq": "daily",
    "priority": 0.8,
    "sourceWebsite": "https://apify.com",
    "sitemapUrl": "https://apify.com/sitemap.xml",
    "sitemapDepth": 0,
    "discoveredAt": "2026-04-15T18:30:00.000Z"
}
````

### Data table

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | The extracted URL from the sitemap |
| `lastmod` | string | Last modification date from the sitemap (if present) |
| `changefreq` | string | How often the page is expected to change (`daily`, `weekly`, `monthly`, etc.) |
| `priority` | number | SEO priority hint from 0.0 to 1.0 |
| `sourceWebsite` | string | The root URL you started from |
| `sitemapUrl` | string | The specific sitemap file where this URL was found |
| `sitemapDepth` | number | Nesting depth in sitemap index (0 = root sitemap) |
| `discoveredAt` | string | ISO timestamp of when the URL was extracted |
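
For downstream processing, the fields in the table above can be mapped onto a typed record. The following is a minimal sketch, not part of the Actor or the Apify client: the `SitemapItem` class and the `parse_item` helper are hypothetical names, and the sketch assumes that `lastmod`, `changefreq`, and `priority` may be absent (for example when the sitemap omits them or `includeMetadata` is disabled).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SitemapItem:
    """Hypothetical typed view of one dataset item (field names from the table above)."""

    url: str
    source_website: str
    sitemap_url: str
    sitemap_depth: int
    discovered_at: str
    # Assumption: metadata fields can be missing, so they default to None.
    lastmod: Optional[str] = None
    changefreq: Optional[str] = None
    priority: Optional[float] = None


def parse_item(raw: dict[str, Any]) -> SitemapItem:
    """Convert a raw dataset item (a plain dict) into a SitemapItem."""
    priority = raw.get("priority")
    return SitemapItem(
        url=raw["url"],
        source_website=raw["sourceWebsite"],
        sitemap_url=raw["sitemapUrl"],
        sitemap_depth=int(raw["sitemapDepth"]),
        discovered_at=raw["discoveredAt"],
        lastmod=raw.get("lastmod"),
        changefreq=raw.get("changefreq"),
        priority=float(priority) if priority is not None else None,
    )


# The sample item from the Output section.
example = parse_item({
    "url": "https://apify.com/apify/instagram-scraper",
    "lastmod": "2025-03-14",
    "changefreq": "daily",
    "priority": 0.8,
    "sourceWebsite": "https://apify.com",
    "sitemapUrl": "https://apify.com/sitemap.xml",
    "sitemapDepth": 0,
    "discoveredAt": "2026-04-15T18:30:00.000Z",
})
print(example.url, example.priority)
```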

### Pricing

This Actor uses Apify's **pay-per-event** pricing model, so you pay only for what you get:

- **Actor start**: $0.01 per run (covers robots.txt + sitemap fetches)
- **Per URL extracted**: $0.0005 per URL added to your dataset

**Example costs:**

- A small blog with 500 URLs → ~$0.26
- A mid-sized site with 5,000 URLs → ~$2.51
- A large catalog with 50,000 URLs → ~$25.01

The Apify free tier includes $5/month in platform credits, which covers roughly 10,000 extracted URLs per month.
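
The example costs above follow directly from the two event prices. As a small worked sketch (the function names `estimate_cost` and `urls_within_budget` are illustrative, not part of any Apify API), the arithmetic can be reproduced like this:

```python
from decimal import Decimal

# Event prices quoted in the Pricing section above.
ACTOR_START_USD = Decimal("0.01")
PER_URL_USD = Decimal("0.0005")


def estimate_cost(url_count: int, runs: int = 1) -> Decimal:
    """Estimated cost in USD for extracting url_count URLs across `runs` runs."""
    if url_count < 0 or runs < 0:
        raise ValueError("url_count and runs must be non-negative")
    return ACTOR_START_USD * runs + PER_URL_USD * url_count


def urls_within_budget(budget_usd: Decimal, runs: int = 1) -> int:
    """Largest number of URLs that fits in budget_usd after paying the start fees."""
    remaining = budget_usd - ACTOR_START_USD * runs
    return max(0, int(remaining // PER_URL_USD))


for count in (500, 5_000, 50_000):
    print(f"{count} URLs -> ${estimate_cost(count):.2f}")

print(urls_within_budget(Decimal("5")))
```

With a single run, a $5 budget leaves $4.99 for URL events, which works out to 9,980 URLs. This is where the "roughly 10,000" figure comes from.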

### Tips and advanced options

- **Set `maxUrlsPerSite`** to a safe cap during testing (e.g. 100) to verify that the Actor works before running without a limit
- **Disable `includeMetadata`** if you only need URLs — this produces a much smaller dataset and faster downloads
- **Disable `followSitemapIndex`** to get only the top-level sitemap contents (useful for homepage/landing-page inventories)
- **Enable Apify Proxy** for sites that return 403 or rate-limit direct requests (government sites, some news publishers)
- **Schedule daily runs** via Apify's scheduler to track how a competitor's URL list changes over time — diff the datasets to see new product launches or archived content
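
Diffing two snapshots, as suggested in the last tip, can be done with plain set operations on the `url` field. The sketch below is one possible approach under stated assumptions: `diff_snapshots` is a hypothetical helper, the two lists stand in for dataset items downloaded from two separate runs, and the sample URLs are made up for illustration.

```python
from typing import Iterable, Optional


def diff_snapshots(
    old_items: Iterable[dict], new_items: Iterable[dict]
) -> dict[str, list[str]]:
    """Compare two lists of dataset items by URL and by lastmod value."""
    old: dict[str, Optional[str]] = {i["url"]: i.get("lastmod") for i in old_items}
    new: dict[str, Optional[str]] = {i["url"]: i.get("lastmod") for i in new_items}
    return {
        # URLs present only in the newer snapshot.
        "added": sorted(new.keys() - old.keys()),
        # URLs present only in the older snapshot.
        "removed": sorted(old.keys() - new.keys()),
        # URLs present in both snapshots whose lastmod differs.
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }


# Made-up sample data standing in for two daily runs.
yesterday = [
    {"url": "https://example.com/", "lastmod": "2025-03-01"},
    {"url": "https://example.com/old-product", "lastmod": "2025-02-10"},
    {"url": "https://example.com/blog", "lastmod": "2025-03-01"},
]
today = [
    {"url": "https://example.com/", "lastmod": "2025-03-01"},
    {"url": "https://example.com/blog", "lastmod": "2025-03-14"},
    {"url": "https://example.com/new-product", "lastmod": "2025-03-14"},
]

changes = diff_snapshots(yesterday, today)
print(changes)
```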

### FAQ and support

**Is this legal?** The Actor only reads publicly declared sitemap files. Sitemaps exist to be read by crawlers: by convention (and by the intent of the site owner who published them), they are meant for public consumption. Always respect the target site's Terms of Service and `robots.txt` disallow rules.

**What about gzipped sitemaps?** Fully supported. The Actor auto-detects `.gz` URLs and `Content-Encoding: gzip` responses and decompresses them transparently.

**What about nested sitemap indexes?** Supported up to 5 levels deep. Most sites have at most 2 levels (index → sitemap → urls).
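
To make the gzip handling and index expansion concrete, here is a minimal sketch of how such a traversal could work. It is not the Actor's source code. It assumes a caller-supplied `fetch` function that returns raw bytes, detects gzip by its magic bytes rather than by the `.gz` suffix or `Content-Encoding` header mentioned above, and interprets the 5-level limit as "stop following index files at depth 5". The sample site is fabricated and held in memory so the example runs offline.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 5  # nesting limit quoted in the FAQ; its exact semantics are assumed


def decode_body(body: bytes) -> bytes:
    """Decompress the body if it starts with the gzip magic bytes."""
    return gzip.decompress(body) if body[:2] == b"\x1f\x8b" else body


def expand_sitemap(
    url: str, fetch: Callable[[str], bytes], depth: int = 0
) -> Iterator[tuple[str, int]]:
    """Yield (page_url, depth) pairs, following <sitemapindex> children."""
    root = ET.fromstring(decode_body(fetch(url)))
    if root.tag.endswith("sitemapindex"):
        if depth >= MAX_DEPTH:
            return  # do not descend any further
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from expand_sitemap(loc.text.strip(), fetch, depth + 1)
    else:
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                yield (loc.text.strip(), depth)


# Fabricated in-memory site: an index pointing at one gzipped child sitemap.
INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml.gz</loc></sitemap>
</sitemapindex>"""
PAGES = gzip.compress(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>""")
FAKE_SITE = {
    "https://example.com/sitemap.xml": INDEX,
    "https://example.com/sitemap-pages.xml.gz": PAGES,
}

found = list(expand_sitemap("https://example.com/sitemap.xml", FAKE_SITE.__getitem__))
print(found)
```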

**The Actor returned 0 URLs, help!** The site probably doesn't publish a sitemap at a standard location or declare one in `robots.txt`. Try adding a custom sitemap URL explicitly in the Start URLs field (e.g. `https://example.com/sitemap_index.xml`).

**Found a bug or missing feature?** Open an issue on the Issues tab of this Actor. Custom solutions are available for enterprise use cases.

# Actor input Schema

## `startUrls` (type: `array`):

Root URLs of websites to crawl. The actor will look up robots.txt and /sitemap.xml for each site and recursively expand every sitemap it finds.
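
As an illustration of the lookup described above, the sketch below collects `Sitemap:` directives from a `robots.txt` body using Python's standard `urllib.robotparser` and falls back to `/sitemap.xml` when none are declared. The helper name `candidate_sitemaps` is hypothetical, and the fallback order is an assumption; the Actor's internal logic is not documented here.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def candidate_sitemaps(root_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, or the default /sitemap.xml."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps() or []  # site_maps() returns None when empty
    return declared or [urljoin(root_url, "/sitemap.xml")]


# Fabricated robots.txt content for illustration.
ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/news-sitemap.xml.gz
"""

print(candidate_sitemaps("https://example.com", ROBOTS))
print(candidate_sitemaps("https://example.com", "User-agent: *\nDisallow:\n"))
```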

## `maxUrlsPerSite` (type: `integer`):

Safety cap on how many URLs to extract from a single website. Use 0 for unlimited.

## `includeMetadata` (type: `boolean`):

Include sitemap metadata (lastmod, changefreq, priority) with each extracted URL. Set to false to get only URLs (smaller dataset).

## `followSitemapIndex` (type: `boolean`):

When a sitemap is a `<sitemapindex>`, recursively fetch every child sitemap. Disable for a shallow crawl of only the root sitemap.

## `proxyConfiguration` (type: `object`):

Optional proxy settings. Some sites block direct server IPs — enable Apify Proxy to bypass.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "maxUrlsPerSite": 10000,
  "includeMetadata": true,
  "followSitemapIndex": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```
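
When building this input object in code, it can help to validate it before starting a run. The helper below is a hypothetical sketch (`build_input` is not part of the Apify client). It uses the defaults shown above and the `maxUrlsPerSite` bounds (minimum 0, maximum 1,000,000) from the OpenAPI specification at the end of this document; requiring at least one URL is an assumption based on `startUrls` being a required field.

```python
from typing import Any

MAX_URLS_LIMIT = 1_000_000  # "maximum" for maxUrlsPerSite in the OpenAPI schema


def build_input(
    websites: list[str],
    max_urls_per_site: int = 10_000,
    include_metadata: bool = True,
    follow_sitemap_index: bool = True,
    use_apify_proxy: bool = False,
) -> dict[str, Any]:
    """Build an Actor input dict, rejecting values outside the documented range."""
    if not websites:
        # Assumption: an empty startUrls list is treated as invalid.
        raise ValueError("startUrls is required: pass at least one website URL")
    if not 0 <= max_urls_per_site <= MAX_URLS_LIMIT:
        raise ValueError(f"maxUrlsPerSite must be between 0 and {MAX_URLS_LIMIT}")
    return {
        "startUrls": [{"url": url} for url in websites],
        "maxUrlsPerSite": max_urls_per_site,
        "includeMetadata": include_metadata,
        "followSitemapIndex": follow_sitemap_index,
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


# A small test run capped at 100 URLs.
test_run_input = build_input(["https://apify.com"], max_urls_per_site=100)
print(test_run_input)
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API section.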

# Actor output Schema

## `dataset` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://apify.com"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("wiry_kingdom/sitemap-url-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://apify.com" }] }

# Run the Actor and wait for it to finish
run = client.actor("wiry_kingdom/sitemap-url-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ]
}' |
apify call wiry_kingdom/sitemap-url-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=wiry_kingdom/sitemap-url-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap URL Extractor",
        "description": "Extract every URL from any website's sitemap.xml with lastmod, changefreq, priority. Recursively expands sitemap index files, reads robots.txt, handles gzipped sitemaps. SEO audits, content migration, site inventory, competitor research.",
        "version": "0.1",
        "x-build-id": "7vitDxQXuffUZEbHv"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/wiry_kingdom~sitemap-url-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-wiry_kingdom-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/wiry_kingdom~sitemap-url-extractor/runs": {
            "post": {
                "operationId": "runs-sync-wiry_kingdom-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/wiry_kingdom~sitemap-url-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-wiry_kingdom-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Website URLs",
                        "type": "array",
                        "description": "Root URLs of websites to crawl. The actor will look up robots.txt and /sitemap.xml for each site and recursively expand every sitemap it finds.",
                        "default": [
                            {
                                "url": "https://apify.com"
                            }
                        ],
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxUrlsPerSite": {
                        "title": "Max URLs per site",
                        "minimum": 0,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Safety cap on how many URLs to extract from a single website. Use 0 for unlimited.",
                        "default": 10000
                    },
                    "includeMetadata": {
                        "title": "Include lastmod / changefreq / priority",
                        "type": "boolean",
                        "description": "Include sitemap metadata (lastmod, changefreq, priority) with each extracted URL. Set to false to get only URLs (smaller dataset).",
                        "default": true
                    },
                    "followSitemapIndex": {
                        "title": "Recursively follow sitemap index files",
                        "type": "boolean",
                        "description": "When a sitemap is a <sitemapindex>, recursively fetch every child sitemap. Disable for a shallow crawl of only the root sitemap.",
                        "default": true
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Optional proxy settings. Some sites block direct server IPs — enable Apify Proxy to bypass.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
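
For projects that call the REST API directly, the `run-sync-get-dataset-items` path in the specification above can be used with only the Python standard library. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the token is passed as the `token` query parameter as the specification describes, and the response body is assumed to be JSON (the specification does not define a response schema for this endpoint). The example only builds the request; `run_sync` performs the actual network call and needs a valid token.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.apify.com/v2"  # from "servers" in the specification
ACTOR_PATH = "wiry_kingdom~sitemap-url-extractor"


def build_sync_request(token: str, actor_input: dict) -> Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urlencode({"token": token})
    url = f"{BASE_URL}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300):
    """Send the request and decode the body (assumed to be JSON)."""
    with urlopen(build_sync_request(token, actor_input), timeout=timeout) as response:
        return json.load(response)


# Build (but do not send) a request with a placeholder token.
request = build_sync_request(
    "my-secret-token", {"startUrls": [{"url": "https://apify.com"}]}
)
print(request.get_method(), request.full_url.split("?")[0])
```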