# Sitemap URL Extractor (`crawlerbros/sitemap-url-extractor`) Actor

Extract every URL from any site's sitemap.xml. Handles sitemap index files (nested sitemaps), gzipped sitemaps, and robots.txt discovery. Returns URL, lastmod, changefreq, priority, and optional image/video/alternate-language fields. No proxy, no cookies, no login.

- **URL**: https://apify.com/crawlerbros/sitemap-url-extractor.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Developer tools, SEO tools, Automation
- **Stats:** 2 total users, 0 monthly users, 100.0% runs succeeded, 14 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is priced per event plus usage: you pay a fixed price for specific events and, in addition, for Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases as your subscription plan gets higher.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap URL Extractor — Pull Every URL From Any Website's Sitemap

Extract the complete URL inventory of any website in seconds — straight from its XML sitemap, no proxy or login required.

### What this Actor does

Point this Actor at a website and it returns every page URL that site publishes in its sitemap. You can pass a direct sitemap URL (`https://example.com/sitemap.xml`), a gzipped sitemap (`https://example.com/sitemap.xml.gz`), a sitemap index file that points at dozens of child sitemaps, or just a bare domain (`https://example.com`). For a bare domain, the Actor discovers the sitemap for you via `robots.txt` and common fallback paths.

Behind the scenes the Actor performs pure XML parsing over standard HTTP. It walks nested sitemap index files to any depth, decompresses gzipped sitemaps automatically, and preserves per-URL metadata such as last-modified dates, change frequency, priority, and image/video/hreflang annotations whenever the source sitemap includes them. Include and exclude regex filters let you narrow the output to just the section of the site you care about (e.g. only product pages, only blog posts).

Because sitemaps are public endpoints designed to be consumed by search engines, this Actor runs without cookies, proxies, or authentication, which makes it one of the cheapest and fastest ways to map a website end-to-end.

### Key features

- **Bare-domain auto-discovery** — pass `https://example.com` and the actor finds the sitemap via `robots.txt` and common paths (`/sitemap.xml`, `/sitemap_index.xml`).
- **Gzipped sitemap support** — `.xml.gz` files are decompressed transparently.
- **Recursive sitemap index walking** — nested sitemap trees are fetched and merged into a single flat list.
- **Image sitemap extraction** — preserves image URLs, captions, titles when the sitemap contains them.
- **Video & hreflang annotations** — kept per URL whenever the source exposes them.
- **Regex include / exclude filters** — trim the output to a specific section of the site.
- **Max URL cap** — stop early once you have enough data.
- **Public data only** — no proxy, no cookies, no login.

### Input

| Field | Type | Description |
|---|---|---|
| `startUrls` | array | One or more sitemap URLs, sitemap index URLs, `.xml.gz` URLs, or bare domains. |
| `maxUrls` | integer | Upper bound on total URLs returned. `0` means unlimited. |
| `followSitemapIndexes` | boolean | When true (default), nested sitemap index files are expanded recursively. |
| `urlFilterInclude` | string | Optional Python regex (case-insensitive). Only URLs matching this pattern are kept. |
| `urlFilterExclude` | string | Optional Python regex (case-insensitive). URLs matching this pattern are dropped; applied after the include filter. |

**Example input**

```json
{
  "startUrls": [
    "https://www.nytimes.com",
    "https://example.com/sitemap.xml"
  ],
  "maxUrls": 5000,
  "followSitemapIndexes": true,
  "urlFilterInclude": "/202[4-6]/",
  "urlFilterExclude": "/tag/|/category/"
}
````
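
To preview which URLs a pair of filters like the ones in the example would keep, the following minimal Python sketch mimics the documented behavior: case-insensitive patterns, include applied before exclude, and `maxUrls` of `0` meaning no cap. The function name `filter_urls` and the sample URLs are hypothetical, and the use of `re.search` (match anywhere in the URL) is an assumption about how the Actor applies the patterns, not a confirmed detail of its implementation.

```python
import re
from typing import Iterable, Iterator, Optional


def filter_urls(
    urls: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
    max_urls: int = 0,
) -> Iterator[str]:
    """Yield URLs that pass the include/exclude patterns, up to max_urls (0 = no cap)."""
    include_re = re.compile(include, re.IGNORECASE) if include else None
    exclude_re = re.compile(exclude, re.IGNORECASE) if exclude else None
    emitted = 0
    for url in urls:
        # Include filter first: keep only URLs that match somewhere in the string
        # (assumption: search semantics rather than full-string match).
        if include_re and not include_re.search(url):
            continue
        # Exclude filter second: drop anything that matches.
        if exclude_re and exclude_re.search(url):
            continue
        yield url
        emitted += 1
        if max_urls and emitted >= max_urls:
            break


# Hypothetical sample URLs, filtered with the patterns from the example input.
sample = [
    "https://example.com/2025/03/launch",
    "https://example.com/2023/01/old-post",
    "https://example.com/tag/2024/news",
    "https://example.com/2026/01/roadmap",
]
kept = list(filter_urls(sample, include="/202[4-6]/", exclude="/tag/|/category/", max_urls=5000))
print(kept)
```

Testing patterns locally like this can save a paid run when a regex turns out to be too broad or too narrow.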

### Output

Each dataset record represents one URL found in a sitemap:

```json
{
  "url": "https://example.com/products/red-shoes",
  "lastmod": "2026-03-12T08:14:00+00:00",
  "changefreq": "weekly",
  "priority": 0.8,
  "source": "https://example.com/sitemap-products.xml",
  "images": [
    { "loc": "https://example.com/img/red-shoes.jpg", "title": "Red Shoes" }
  ],
  "alternates": [
    { "hreflang": "de", "href": "https://example.com/de/schuhe-rot" }
  ]
}
```

**Field descriptions**

- `url` — the page URL listed in the sitemap.
- `lastmod` — last-modified timestamp reported by the sitemap, if present.
- `changefreq` — publisher-declared change frequency (`daily`, `weekly`, `monthly`, …).
- `priority` — publisher-declared priority hint (0.0–1.0).
- `source` — the sitemap file this URL was extracted from (useful with nested indexes).
- `images` — image sitemap entries attached to the URL (when present).
- `videos` — video sitemap entries attached to the URL (when present).
- `alternates` — `hreflang` alternate-language URLs (when present).

Empty or missing fields are omitted rather than emitted as `null`, so records stay compact.
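
As a rough illustration of how a sitemap `<url>` entry could map onto a record of this shape, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the Actor's actual code: the helper `url_to_record` is hypothetical, and the namespace URIs are the commonly used ones for the sitemap protocol, the Google image extension, and XHTML `hreflang` links. Fields that are absent in the XML are simply left out of the dictionary.

```python
import xml.etree.ElementTree as ET

# Commonly used namespace URIs (assumed to be what the source sitemap declares).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def url_to_record(url_el: ET.Element, source: str) -> dict:
    """Build a compact record from one <url> element, omitting missing fields."""
    record = {"url": url_el.findtext("sm:loc", "", NS).strip(), "source": source}
    for field in ("lastmod", "changefreq"):
        value = url_el.findtext(f"sm:{field}", None, NS)
        if value and value.strip():
            record[field] = value.strip()
    priority = url_el.findtext("sm:priority", None, NS)
    if priority and priority.strip():
        record["priority"] = float(priority)
    images = []
    for img in url_el.findall("image:image", NS):
        entry = {}
        for key in ("loc", "title", "caption"):
            value = img.findtext(f"image:{key}", None, NS)
            if value and value.strip():
                entry[key] = value.strip()
        if entry:
            images.append(entry)
    if images:
        record["images"] = images
    alternates = [
        {"hreflang": link.get("hreflang"), "href": link.get("href")}
        for link in url_el.findall("xhtml:link", NS)
        if link.get("rel") == "alternate" and link.get("hreflang") and link.get("href")
    ]
    if alternates:
        record["alternates"] = alternates
    return record


SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/products/red-shoes</loc>
    <lastmod>2026-03-12T08:14:00+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
    <image:image>
      <image:loc>https://example.com/img/red-shoes.jpg</image:loc>
      <image:title>Red Shoes</image:title>
    </image:image>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/schuhe-rot"/>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)
records = [
    url_to_record(el, "https://example.com/sitemap-products.xml")
    for el in root.findall("sm:url", NS)
]
for rec in records:
    print(rec)
```

The second `<url>` in the sample carries only a `<loc>`, so its record contains just `url` and `source`.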

### Use cases

- **SEO audits** — pull every indexable URL a site exposes, then cross-check against Google Search Console coverage.
- **Site migrations** — generate a full URL inventory before cutover to build redirect maps.
- **Competitive research** — map a competitor's product catalog, blog, or news archive.
- **Content crawling seed list** — feed the extracted URLs into a downstream scraper or LLM ingestion pipeline.
- **Broken-link discovery** — pair the URL list with a link checker to find 404s across large sites.

### FAQ

**Do I need a proxy?**
No. Sitemap endpoints are public and designed for consumption by search engines, so they do not gate by IP or require cookies.

**What if I only have a domain and don't know the sitemap URL?**
Pass the bare domain (`https://example.com`). The Actor reads `robots.txt`, falls back to common sitemap paths, and walks any index files it finds.
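
The discovery step can be pictured roughly as follows. This is a sketch under stated assumptions, not the Actor's source: `sitemap_candidates` is a hypothetical helper, the fallback paths are the two named under Key features, and the ordering (declared sitemaps first, fallbacks after) is a guess at a sensible policy.

```python
from typing import List
from urllib.parse import urljoin

# Fallback paths mentioned in the README; the Actor may try others as well.
FALLBACK_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemap_candidates(base_url: str, robots_txt: str) -> List[str]:
    """Return sitemap URLs to try: robots.txt declarations first, then fallbacks."""
    declared = []
    for line in robots_txt.splitlines():
        # "Sitemap: <url>" directives are case-insensitive; split on the first colon only.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            declared.append(value.strip())
    fallbacks = [urljoin(base_url, path) for path in FALLBACK_PATHS]
    # De-duplicate while preserving order.
    return list(dict.fromkeys(declared + fallbacks))


robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.com/news-sitemap.xml\n"
    "sitemap: https://example.com/sitemap.xml\n"
)
candidates = sitemap_candidates("https://example.com", robots)
print(candidates)
```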

**How does it handle very large sitemaps?**
Each sitemap file is streamed and parsed as it's received. Use `maxUrls` to cap the output when you do not need the entire site.
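
For readers curious what streamed parsing with an early stop can look like, here is a minimal standard-library sketch. The function `iter_locs` is hypothetical, and detecting gzip by its two magic bytes is an assumption about one reasonable approach; the Actor's real implementation may differ.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_locs(stream: BinaryIO, max_urls: int = 0) -> Iterator[str]:
    """Yield <loc> values from a plain or gzipped sitemap, stopping at max_urls (0 = no cap)."""
    buffered = io.BufferedReader(stream)
    # Gzip data starts with the magic bytes 0x1f 0x8b; peek does not consume them.
    is_gzip = buffered.peek(2)[:2] == b"\x1f\x8b"
    source = gzip.GzipFile(fileobj=buffered) if is_gzip else buffered
    count = 0
    # iterparse reads the source in chunks and emits each element as soon as it closes.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != SITEMAP_NS + "url":
            continue
        loc = elem.findtext(SITEMAP_NS + "loc")
        if loc and loc.strip():
            yield loc.strip()
            count += 1
            if max_urls and count >= max_urls:
                return
        elem.clear()  # drop the processed subtree to limit memory growth


PLAIN = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"<url><loc>https://example.com/b</loc></url>"
    b"<url><loc>https://example.com/c</loc></url>"
    b"</urlset>"
)

all_locs = list(iter_locs(io.BytesIO(gzip.compress(PLAIN))))
first_two = list(iter_locs(io.BytesIO(PLAIN), max_urls=2))
print(all_locs)
print(first_two)
```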

**Does it follow nested sitemap index files?**
Yes, by default. Set `followSitemapIndexes` to `false` to stop at the first level and receive only index entries.
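
The difference between the two settings can be sketched as below. Everything here is illustrative: `walk` and the in-memory `DOCS` fetcher are hypothetical stand-ins for real HTTP requests, and the shape of the records returned when following is disabled is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def walk(
    url: str,
    fetch: Callable[[str], bytes],
    follow_indexes: bool = True,
    _seen: Optional[Set[str]] = None,
) -> Iterator[dict]:
    """Yield records from a sitemap, expanding <sitemapindex> files when follow_indexes is true."""
    seen = _seen if _seen is not None else set()
    if url in seen:  # guard against index files that reference each other
        return
    seen.add(url)
    root = ET.fromstring(fetch(url))
    if root.tag.endswith("}sitemapindex"):
        for loc in root.iterfind("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if not child:
                continue
            if follow_indexes:
                yield from walk(child, fetch, True, seen)
            else:
                # Stop at the first level and report the index entry itself.
                yield {"url": child, "source": url}
    else:
        for loc in root.iterfind("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page:
                yield {"url": page, "source": url}


XMLNS = b'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {
    "https://example.com/sitemap.xml": b"<sitemapindex " + XMLNS + b">"
    b"<sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>"
    b"<sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>"
    b"</sitemapindex>",
    "https://example.com/sitemap-products.xml": b"<urlset " + XMLNS + b">"
    b"<url><loc>https://example.com/products/red-shoes</loc></url>"
    b"</urlset>",
    "https://example.com/sitemap-blog.xml": b"<urlset " + XMLNS + b">"
    b"<url><loc>https://example.com/blog/hello</loc></url>"
    b"</urlset>",
}

expanded = list(walk("https://example.com/sitemap.xml", DOCS.__getitem__))
index_only = list(walk("https://example.com/sitemap.xml", DOCS.__getitem__, follow_indexes=False))
print(expanded)
print(index_only)
```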

**Can it extract images from image sitemaps?**
Yes. If the sitemap uses the `<image:image>` extension, the image entries are preserved on the per-URL record.

**Why is `lastmod` missing on some records?**
Sitemap fields are optional. If the publisher did not include a last-modified date for a URL, it is omitted from the output instead of padded with a placeholder.

**What happens if the sitemap URL returns a 404 or HTML page?**
The Actor emits a compact error record describing the failure and continues with any other inputs. Your dataset is never silently empty.

### Known limitations

- **HTML-only websites with no sitemap** cannot be mapped. If `robots.txt` does not declare a sitemap and common fallback paths are empty, the Actor has nothing to parse. Use a full crawler for those cases.
- **JavaScript-built sitemap pages** (rare — some single-page apps render `/sitemap` as HTML) are not XML and are not supported.
- **Extremely deep nested indexes** — sitemap trees are expanded recursively, but the total URL count is still bounded by `maxUrls` to keep runs predictable.
- **Sitemap accuracy depends on the publisher** — if the site forgets to list a URL in its sitemap, this actor cannot discover it (there is no HTML fallback crawling here by design).

# Actor input Schema

## `startUrls` (type: `array`):

One or more URLs. Accepts a direct sitemap.xml URL (e.g., `https://example.com/sitemap.xml`), a sitemap index file, a gzipped sitemap (`.xml.gz`), or a bare domain (e.g., `https://example.com`). For a bare domain, the Actor auto-discovers the sitemap via `robots.txt` and common paths.

## `maxUrls` (type: `integer`):

Maximum number of URLs to extract across all sitemaps. Set to 0 for no limit.

## `followSitemapIndexes` (type: `boolean`):

If the input points to a sitemap index (`<sitemapindex>`), fetch and merge every child sitemap.

## `urlFilterInclude` (type: `string`):

Only emit URLs matching this Python regex (case-insensitive). Leave blank to include everything.

## `urlFilterExclude` (type: `string`):

Drop URLs matching this Python regex (case-insensitive). Applied after the include filter.

## Actor input object example

```json
{
  "startUrls": [
    "https://onescales.com/sitemap.xml"
  ],
  "maxUrls": 500,
  "followSitemapIndexes": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://onescales.com/sitemap.xml"
    ],
    "maxUrls": 500,
    "followSitemapIndexes": true
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/sitemap-url-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": ["https://onescales.com/sitemap.xml"],
    "maxUrls": 500,
    "followSitemapIndexes": True,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/sitemap-url-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://onescales.com/sitemap.xml"
  ],
  "maxUrls": 500,
  "followSitemapIndexes": true
}' |
apify call crawlerbros/sitemap-url-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/sitemap-url-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap URL Extractor",
        "description": "Extract every URL from any site's sitemap.xml. Handles sitemap index files (nested sitemaps), gzipped sitemaps, and robots.txt discovery. Returns URL, lastmod, changefreq, priority, and optional image/video/alternate-language fields. No proxy, no cookies, no login.",
        "version": "1.0",
        "x-build-id": "xj9eBBKHnvjaDG0QZ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~sitemap-url-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~sitemap-url-extractor/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~sitemap-url-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Sitemap or site URLs",
                        "minItems": 1,
                        "type": "array",
                        "description": "One or more URLs. Accepts: a direct sitemap.xml URL (e.g., https://example.com/sitemap.xml), a sitemap index file, a gzipped sitemap (.xml.gz), or a bare domain (e.g., https://example.com) — the actor will auto-discover the sitemap via robots.txt and common paths.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxUrls": {
                        "title": "Max URLs",
                        "minimum": 0,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Maximum number of URLs to extract across all sitemaps. Set to 0 for no limit.",
                        "default": 0
                    },
                    "followSitemapIndexes": {
                        "title": "Follow sitemap index files",
                        "type": "boolean",
                        "description": "If the input points to a sitemap index (<sitemapindex>), fetch and merge every child sitemap.",
                        "default": true
                    },
                    "urlFilterInclude": {
                        "title": "URL include regex (optional)",
                        "type": "string",
                        "description": "Only emit URLs matching this Python regex (case-insensitive). Leave blank to include everything."
                    },
                    "urlFilterExclude": {
                        "title": "URL exclude regex (optional)",
                        "type": "string",
                        "description": "Drop URLs matching this Python regex (case-insensitive). Applied after the include filter."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
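
If you prefer not to install a client library, the `run-sync-get-dataset-items` endpoint from the specification above can be called with Python's standard library alone. The sketch below is a minimal example under a few assumptions: the helper names `build_request` and `run_sync` are hypothetical, the 300-second timeout is arbitrary, and the response body is assumed to be JSON. The token is passed as the `token` query parameter, as the specification describes.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/crawlerbros~sitemap-url-extractor/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(token, actor_input), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


actor_input = {
    "startUrls": ["https://onescales.com/sitemap.xml"],
    "maxUrls": 500,
    "followSitemapIndexes": True,
}
request = build_request("<YOUR_API_TOKEN>", actor_input)
print(request.get_method(), request.full_url)
# To actually run the Actor (this is billed to your account):
# items = run_sync("<YOUR_API_TOKEN>", actor_input)
```

Keeping request construction separate from sending makes it easy to inspect or log the call before spending credits on a run.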
