# Sitemap Sniffer (`crawlerbros/sitemap-sniffer`) Actor

Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and recursively unpacks sitemap-index files. HTTP-only, no proxy or cookies needed.

- **URL**: https://apify.com/crawlerbros/sitemap-sniffer.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Developer tools, SEO tools, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 14 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is paid per event and usage. You are charged both a fixed price for specific events and the cost of Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap Sniffer

Discover every sitemap file for a website — automatically. Reads `robots.txt` for `Sitemap:` directives, probes 16 common sitemap paths (Yoast, WordPress, sitemap-index, gzipped variants), and recursively unpacks sitemap-index files. HTTP-only, no proxy, no cookies, no API key.

### What it does

You point this actor at any website and get back a structured list of every sitemap file it could find:

- **`/robots.txt` directives** — the canonical place sites declare their sitemaps.
- **Common sitemap paths** — `sitemap.xml`, `sitemap_index.xml`, `wp-sitemap.xml`, `post-sitemap.xml`, `sitemap.xml.gz`, and 11 more.
- **Sitemap-index expansion** — when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.

For each discovered sitemap, the actor reports the URL, type (`sitemap` / `sitemap_index` / `txt`), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.
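
As a rough illustration of the discovery steps above, the sketch below parses `Sitemap:` lines from a robots.txt body and merges them with a few common paths. It is not the Actor's code: `COMMON_PATHS` holds only the five paths named in this README (the full set of 16 is not reproduced here), and the function names `sitemaps_from_robots` and `candidate_urls` are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: only the five paths named in the README; the Actor probes 16.
COMMON_PATHS = [
    "sitemap.xml",
    "sitemap_index.xml",
    "wp-sitemap.xml",
    "post-sitemap.xml",
    "sitemap.xml.gz",
]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return URLs declared by `Sitemap:` lines, in order, without duplicates."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        url = value.strip()
        if sep and key.strip().lower() == "sitemap" and url and url not in found:
            found.append(url)
    return found


def candidate_urls(origin: str, robots_txt: str) -> list[str]:
    """robots.txt directives first, then common paths not already declared."""
    candidates = sitemaps_from_robots(robots_txt)
    base = origin.rstrip("/") + "/"
    for path in COMMON_PATHS:
        url = urljoin(base, path)
        if url not in candidates:
            candidates.append(url)
    return candidates
```

Putting the robots.txt entries first mirrors the README's description of robots.txt as the canonical place for sitemap declarations.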

### Input

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string (required) | `https://apify.com` | Root URL or bare host (e.g. `example.com`). The actor extracts the origin and probes that. |
| `followIndexes` | boolean | `true` | When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to. |
| `maxSitemaps` | integer | `50` (1–1000) | Hard cap on the number of records emitted. Probing stops once this many are discovered. |
| `fetchUrlCounts` | boolean | `true` | Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download. |
| `emitUrls` | boolean | `false` | When `true`, the actor also emits one record per URL found inside each discovered sitemap (with `lastmod`, `changefreq`, `priority`, `hreflang` when present). |
| `maxUrls` | integer | `10000` (1–1000000) | Hard cap on per-URL records when `emitUrls: true`. Has no effect when `emitUrls: false`. |
| `userAgent` | string (optional) | Chrome 131 | Override only if a target server filters by UA. |

#### Example input

```json
{
  "url": "https://www.bbc.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true
}
````
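
The `url` field accepts either a full URL or a bare host, and the input table says the origin is extracted from it. A minimal sketch of that normalization follows, assuming bare hosts default to `https://`; the README does not state which scheme the Actor assumes, and `to_origin` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def to_origin(url_or_host: str) -> str:
    """Reduce a URL or bare host to scheme://host[:port]."""
    raw = url_or_host.strip()
    if "://" not in raw:
        # Assumption: bare hosts such as "example.com" are treated as HTTPS.
        raw = "https://" + raw
    parts = urlsplit(raw)
    if not parts.netloc:
        raise ValueError(f"cannot extract a host from {url_or_host!r}")
    # Path, query, and fragment are dropped; only the origin is kept.
    return f"{parts.scheme}://{parts.netloc}"
```

With this helper, `https://www.bbc.com/news?x=1` and `www.bbc.com` both reduce to `https://www.bbc.com`.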

### Output

By default, one record per discovered sitemap. When `emitUrls: true`, the dataset also contains one record per URL found inside each sitemap. The two shapes can be disambiguated by `recordType`. Empty fields are omitted (no nulls).

#### Sitemap record (`recordType: "sitemap"`)

```json
{
  "recordType": "sitemap",
  "url": "https://www.bbc.com/sitemap.xml",
  "domainHost": "www.bbc.com",
  "type": "sitemap_index",
  "httpStatus": 200,
  "contentType": "application/xml",
  "byteCount": 13450,
  "urlCount": 78,
  "isCompressed": false,
  "lastmod": "2024-12-15",
  "discoveredVia": "robots.txt",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
```

#### URL record (`recordType: "url"`, only when `emitUrls: true`)

```json
{
  "recordType": "url",
  "url": "https://www.bbc.com/news/articles/c-12345",
  "domainHost": "www.bbc.com",
  "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
  "lastmod": "2024-12-15",
  "changefreq": "hourly",
  "priority": 0.8,
  "hreflang": [{"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}],
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
```

#### Output fields

- **`recordType`** — `"sitemap"` for sitemap-file records (always emitted), or `"url"` for per-URL records (only when `emitUrls: true`).
- **`url`** — absolute URL of the sitemap (or the URL referenced inside a sitemap, when `recordType: "url"`).
- **`domainHost`** — parsed hostname of `url` (handy for grouping records by site when ingesting from multiple runs).
- **`type`** — `"sitemap"` (a `<urlset>`), `"sitemap_index"` (a `<sitemapindex>` of child sitemaps), or `"txt"` (plain-text sitemap with one URL per line). Sitemap records only.
- **`httpStatus`** — HTTP status code returned (200 = success). Sitemap records only.
- **`contentType`** — `Content-Type` header value (without charset). Sitemap records only.
- **`byteCount`** — response body size in bytes. Sitemap records only.
- **`urlCount`** — number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
- **`isCompressed`** — `true` when the body is gzipped (e.g. `.xml.gz` paths). Sitemap records only.
- **`lastmod`** — first `<lastmod>` value found in the sitemap, or the per-URL `<lastmod>` when `recordType: "url"`.
- **`discoveredVia`** — `"robots.txt"`, `"common-path"`, or `"sitemap-index"` (parent index pointed here). Sitemap records only.
- **`sitemapUrl`** — (URL records only) URL of the sitemap that contained this URL.
- **`changefreq`** / **`priority`** / **`hreflang`** — (URL records only) standard sitemap fields when present.
- **`scrapedAt`** — ISO-8601 timestamp of the discovery.
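
Because one dataset can mix both record shapes and empty fields are omitted, downstream code usually needs to split by `recordType` and read optional fields defensively. The sketch below is one way to do that; `split_records` and `total_declared_urls` are hypothetical helper names, not part of the Actor or the Apify client.

```python
from collections import defaultdict


def split_records(items: list[dict]) -> dict[str, list[dict]]:
    """Group dataset items by their `recordType` value."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Records without a recordType are kept under "unknown" rather than dropped.
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


def total_declared_urls(sitemap_records: list[dict]) -> int:
    """Sum `urlCount` over non-index sitemaps.

    For a sitemap_index, urlCount counts child sitemaps rather than page URLs,
    so index records are skipped. Missing urlCount is treated as 0.
    """
    return sum(
        record.get("urlCount", 0)
        for record in sitemap_records
        if record.get("type") != "sitemap_index"
    )
```

Skipping index records avoids adding child-sitemap counts to page-URL counts, which would inflate the total.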

### When to use this

- **Before crawling** — feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
- **SEO audits** — confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
- **Competitive research** — measure how many URLs a site exposes, broken down by sitemap type (news / video / image / page).
- **Content migration** — get a complete inventory of URLs declared by the source site.

### FAQ

**Does it need cookies, login, or a proxy?**
No. Sitemaps are public assets, designed to be machine-readable. The actor uses `curl_cffi` with a Chrome User-Agent and connects directly.

**What if the site has no sitemap at all?**
The actor emits a single record `{"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"}` with a hint to check `robots.txt` manually. The run still completes successfully — empty datasets are not treated as failures.
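
Since the run still succeeds in this case, a consumer has to inspect the dataset itself to notice that nothing was found. A minimal check, based only on the record shape quoted above (`find_sniffer_error` is a hypothetical helper name):

```python
from typing import Optional


def find_sniffer_error(items: list[dict]) -> Optional[str]:
    """Return the `reason` of the first error record, or None if there is none."""
    for item in items:
        if item.get("type") == "sitemap_sniffer_error":
            # `reason` is read defensively in case the record omits it.
            return item.get("reason", "unknown")
    return None
```

A caller could treat a return value of `"no_sitemaps_found"` as a signal to fall back to link-based crawling.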

**Does it handle gzipped sitemaps?**
Yes. `.xml.gz` files are transparently decompressed in-memory before parsing.
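
For illustration, in-memory decompression and counting can be sketched with the Python standard library alone. This is an assumption about one reasonable approach, not the Actor's implementation: detection here uses the gzip magic bytes, and `decode_body` and `classify_and_count` are hypothetical names.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body: bytes) -> tuple[bytes, bool]:
    """Return (decompressed_bytes, was_gzipped), working entirely in memory."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False


def _local(tag: str) -> str:
    """Strip an XML namespace: '{http://...}urlset' becomes 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def classify_and_count(body: bytes) -> tuple[str, int]:
    """Return (type, entry_count) using the README's type names."""
    raw, _ = decode_body(body)
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        # Not XML: treat as a plain-text sitemap with one URL per line.
        text = raw.decode("utf-8", errors="replace")
        urls = [ln for ln in text.splitlines()
                if ln.strip().startswith(("http://", "https://"))]
        return "txt", len(urls)
    kind = _local(root.tag)
    if kind == "sitemapindex":
        return "sitemap_index", sum(1 for el in root if _local(el.tag) == "sitemap")
    if kind == "urlset":
        return "sitemap", sum(1 for el in root if _local(el.tag) == "url")
    raise ValueError(f"unexpected root element: {kind}")
```

For untrusted input, a hardened parser such as `defusedxml` is a safer choice than `xml.etree.ElementTree`.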

**How does it handle giant sites with thousands of sitemap files?**
`maxSitemaps` (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
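
The priority order and the cap can be pictured as a simple ordered merge. The sketch below is a simplification under stated assumptions: it takes three already-collected URL lists, whereas the Actor presumably interleaves fetching and discovery. `select_sitemaps` is a hypothetical name; the `discoveredVia` labels are the ones listed under Output fields.

```python
def select_sitemaps(
    robots_urls: list[str],
    common_path_urls: list[str],
    index_children: list[str],
    max_sitemaps: int = 50,
) -> list[dict]:
    """Merge sources in priority order, dropping duplicates, up to the cap."""
    sources = (
        ("robots.txt", robots_urls),
        ("common-path", common_path_urls),
        ("sitemap-index", index_children),
    )
    selected: list[dict] = []
    seen: set[str] = set()
    for via, urls in sources:
        for url in urls:
            if len(selected) >= max_sitemaps:
                return selected  # cap reached: lower-priority entries are skipped
            if url in seen:
                continue  # keep the first (highest-priority) discovery only
            seen.add(url)
            selected.append({"url": url, "discoveredVia": via})
    return selected
```

With a small cap, lower-priority sources are cut first, which matches the FAQ's point that the most useful sitemaps arrive before the run stops.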

**Can I get the URLs inside each sitemap?**
Yes — set `emitUrls: true` and the actor will also push one record per URL inside each discovered sitemap, with `lastmod`, `changefreq`, `priority`, and `hreflang` when present. `maxUrls` caps the total (default 10,000). Use `recordType: "sitemap"` vs `recordType: "url"` to disambiguate the two record shapes.

**Is it safe to run on any website?**
Yes — the actor only fetches `robots.txt` and 16 well-known public paths. It makes at most ~17 requests on initial probe, plus one per sitemap-index child if `followIndexes` is enabled. No login pages, no admin paths, no API endpoints.

# Actor input Schema

## `url` (type: `string`):

Root URL of the website to probe (e.g. 'https://example.com' or just 'example.com'). The actor extracts the host and probes that origin for sitemap files.

## `followIndexes` (type: `boolean`):

When a `<sitemapindex>` (sitemap-of-sitemaps) is found, also fetch and emit the child sitemap URLs it points to. Disable to only emit the index itself.

## `maxSitemaps` (type: `integer`):

Hard cap on the number of sitemap records emitted. Probing stops once this many are discovered. Helps keep large sites bounded.

## `fetchUrlCounts` (type: `boolean`):

Parse each discovered sitemap and emit the count of `<url>` entries (urlCount field). Adds ~0.2-0.5s per sitemap. Disable to only report metadata (URL, status, type) without download.

## `emitUrls` (type: `boolean`):

Also push one record per URL found inside discovered sitemaps. Each URL record includes lastmod, changefreq, priority. Default: only sitemap-file records.

## `maxUrls` (type: `integer`):

Hard cap on per-URL records emitted when emitUrls is true.

## `userAgent` (type: `string`):

Override the default Chrome User-Agent string. Most sites accept the default; only set this if a target server filters by UA.

## Actor input object example

```json
{
  "url": "https://apify.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true,
  "emitUrls": false,
  "maxUrls": 10000
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://apify.com",
    "maxSitemaps": 50
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/sitemap-sniffer").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": "https://apify.com",
    "maxSitemaps": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/sitemap-sniffer").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "https://apify.com",
  "maxSitemaps": 50
}' |
apify call crawlerbros/sitemap-sniffer --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/sitemap-sniffer",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Sniffer",
        "description": "Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and recursively unpacks sitemap-index files. HTTP-only, no proxy or cookies needed.",
        "version": "0.1",
        "x-build-id": "dqcPmCf0ryPSglOwW"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~sitemap-sniffer/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~sitemap-sniffer/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~sitemap-sniffer/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "Website URL or domain",
                        "type": "string",
                        "description": "Root URL of the website to probe (e.g. 'https://example.com' or just 'example.com'). The actor extracts the host and probes that origin for sitemap files."
                    },
                    "followIndexes": {
                        "title": "Follow sitemap indexes",
                        "type": "boolean",
                        "description": "When a `<sitemapindex>` (sitemap-of-sitemaps) is found, also fetch and emit the child sitemap URLs it points to. Disable to only emit the index itself.",
                        "default": true
                    },
                    "maxSitemaps": {
                        "title": "Maximum sitemaps to emit",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Hard cap on the number of sitemap records emitted. Probing stops once this many are discovered. Helps keep large sites bounded.",
                        "default": 50
                    },
                    "fetchUrlCounts": {
                        "title": "Fetch URL counts",
                        "type": "boolean",
                        "description": "Parse each discovered sitemap and emit the count of `<url>` entries (urlCount field). Adds ~0.2-0.5s per sitemap. Disable to only report metadata (URL, status, type) without download.",
                        "default": true
                    },
                    "emitUrls": {
                        "title": "Emit per-URL records",
                        "type": "boolean",
                        "description": "Also push one record per URL found inside discovered sitemaps. Each URL record includes lastmod, changefreq, priority. Default: only sitemap-file records.",
                        "default": false
                    },
                    "maxUrls": {
                        "title": "Maximum URLs",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Hard cap on per-URL records emitted when emitUrls is true.",
                        "default": 10000
                    },
                    "userAgent": {
                        "title": "Custom User-Agent (optional)",
                        "type": "string",
                        "description": "Override the default Chrome User-Agent string. Most sites accept the default; only set this if a target server filters by UA."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
