# Sitemap Sniffer (`maximedupre/sitemap-sniffer`) Actor

Find sitemap files from website roots, domains, robots.txt, and direct sitemap URLs. Export sitemap metadata, URL counts, nested index depth, and optional URL inventory rows.

- **URL**: https://apify.com/maximedupre/sitemap-sniffer.md
- **Developed by:** [Maxime Dupré](https://apify.com/maximedupre) (community)
- **Categories:** Developer tools, Marketing
- **Stats:** 2 total users, 1 monthly users, 85.7% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.90 / 1,000 discovered sitemap items

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price is lower on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### 🗺️ Sitemap sniffer for SEO audits

Sitemap Sniffer finds public sitemap files for websites, domains, `robots.txt` files, direct sitemap URLs, and sitemap indexes. Use this sitemap sniffer when you need a quick SEO sitemap audit, a sitemap finder for multiple sites, or a sitemap URL extractor before a crawl.

Start with a public website such as [apify.com](https://apify.com), a bare domain such as `example.com`, or a known sitemap such as `https://example.com/sitemap.xml`. The Actor checks public sitemap sources, follows sitemap indexes when enabled, and saves clean output rows you can export from Apify or use through the API.

### 🔎 What this Actor does

- Reads public `robots.txt` files and follows `Sitemap:` directives.
- Checks common sitemap paths for website roots and bare domains.
- Accepts direct sitemap, sitemap index, and `robots.txt` URLs.
- Parses XML sitemap indexes, XML URL sets, plain-text sitemaps, and gzipped sitemap responses.
- Follows nested sitemap indexes within your depth and output limits.
- Saves one sitemap row per discovered sitemap file.
- Optionally emits URL inventory rows from sitemap contents.
- Adds one target summary row per submitted target, including no-sitemap outcomes.

This Actor is focused on public sitemap discovery. It does not crawl arbitrary internal links, scrape page content, check broken links, submit sitemaps to search engines, or validate whether URLs are indexed.
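
To make the `robots.txt` step concrete, here is a minimal sketch of how `Sitemap:` directives can be pulled out of a `robots.txt` body. It is not the Actor's own code; the function name `extract_sitemap_directives` is hypothetical, and the sample file is made up for illustration.

```python
def extract_sitemap_directives(robots_txt: str) -> list[str]:
    """Return sitemap URLs declared via `Sitemap:` lines, in order, without duplicates."""
    found: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and whitespace. Simplification: a URL that
        # itself contains '#' would be cut short here.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively; only the first
        # colon splits the line, so the URL keeps its own colons.
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found


robots = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml.gz  # news section
"""
print(extract_sitemap_directives(robots))
```

Directives like these are one discovery source; common sitemap paths and direct URLs are the others listed above.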

### 📦 Data you get

Each run can return three output types.

Sitemap rows describe discovered sitemap files:

- sitemap URL, canonical URL, parent sitemap URL, and index depth
- target website, normalized origin, and domain host
- sitemap type, HTTP status, content type, byte count, and compression flag
- URL count, child sitemap count, first `lastmod`, and discovery source
- all discovery sources when the same sitemap is found more than one way
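
As a rough illustration of where fields such as `byteCount`, `isCompressed`, `urlCount`, and `childSitemapCount` can come from, the sketch below summarizes a raw sitemap response body using only the Python standard library. It is an assumption-laden example, not the Actor's implementation: `describe_sitemap` is a hypothetical helper, the type labels `"urlset"` and `"text"` are placeholders (only `"sitemap_index"` appears in the output example), and whether the Actor counts bytes before or after decompression is not documented here.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_sitemap(body: bytes) -> dict:
    """Summarize a sitemap body; field names mirror the sitemap row example."""
    is_compressed = body.startswith(GZIP_MAGIC)
    data = gzip.decompress(body) if is_compressed else body
    summary = {
        "byteCount": len(body),  # assumption: size as received
        "isCompressed": is_compressed,
        "type": "text",          # hypothetical label for plain-text sitemaps
        "urlCount": 0,
        "childSitemapCount": 0,
    }
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not XML: treat it as a plain-text sitemap with one URL per line.
        lines = data.decode("utf-8", errors="replace").splitlines()
        summary["urlCount"] = sum(
            1 for line in lines if line.strip().startswith(("http://", "https://"))
        )
        return summary
    children = [local_name(child.tag) for child in root]
    if local_name(root.tag) == "sitemapindex":
        summary["type"] = "sitemap_index"
        summary["childSitemapCount"] = children.count("sitemap")
    else:
        summary["type"] = "urlset"  # hypothetical label for XML URL sets
        summary["urlCount"] = children.count("url")
    return summary
```

For untrusted XML in production, a hardened parser is usually a safer choice than the standard library one used in this sketch.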

URL inventory rows are optional. When enabled, they include each URL found inside parsed sitemaps, the source sitemap URL, `lastmod`, `changefreq`, `priority`, and `hreflang` alternates when the sitemap provides them.
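
The sketch below shows how those optional fields map onto a standard XML URL set. It is an illustration only: `url_rows` is a hypothetical helper, and the list-of-objects shape used for `hreflang` is an assumption, since the exact shape of that field in the Actor's output is not shown in this README.

```python
import xml.etree.ElementTree as ET

# Namespaces from the sitemap protocol and the xhtml:link hreflang extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def url_rows(xml_text: str, sitemap_url: str) -> list[dict]:
    """Turn one XML URL set into URL inventory style rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for url_el in root.findall("sm:url", NS):
        row = {
            "recordType": "url",
            "url": url_el.findtext("sm:loc", default="", namespaces=NS).strip(),
            "sitemapUrl": sitemap_url,
        }
        # Missing optional tags stay None; nothing is guessed.
        for field in ("lastmod", "changefreq", "priority"):
            value = url_el.findtext(f"sm:{field}", namespaces=NS)
            row[field] = value.strip() if value else None
        # Assumed shape: one {"hreflang", "href"} object per alternate link.
        row["hreflang"] = [
            {"hreflang": link.get("hreflang"), "href": link.get("href")}
            for link in url_el.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        ]
        rows.append(row)
    return rows


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/</loc>
    <lastmod>2026-06-01</lastmod>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/"/>
  </url>
</urlset>"""

for row in url_rows(SAMPLE, "https://example.com/sitemap.xml"):
    print(row)
```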

Target summary rows make batch runs easier to filter. They report whether each target was completed, skipped, or produced no public sitemap files.

### 🚀 How to run it

1. Add one or more website or sitemap targets.
2. Keep sitemap index following enabled for normal SEO audits.
3. Leave URL inventory rows off for a fast sitemap-file audit.
4. Turn on URL inventory rows when you want the URLs listed inside the sitemaps.
5. Set sitemap and URL row limits to control output size and cost.
6. Run the Actor and open the dataset overview.

No cookies, login, source API key, or proxy settings are needed from you. The target must expose public sitemap assets over `http` or `https`.

### ⚙️ Input example

```json
{
	"targets": [
		"https://apify.com",
		"example.com",
		"https://example.com/sitemap.xml"
	],
	"followSitemapIndexes": true,
	"maxIndexDepth": 1,
	"parseSitemapDetails": true,
	"emitUrlRows": false,
	"maxSitemapRows": 10,
	"maxUrlRows": 10000
}
````

`Website or sitemap targets` is the only required input. You can paste roots, bare domains, `robots.txt` URLs, sitemap URLs, or sitemap index URLs in the same list.

Use `Follow sitemap indexes` (`followSitemapIndexes`) and `Maximum index depth` (`maxIndexDepth`) to control nested index expansion. Use `Count URLs in sitemaps` (`parseSitemapDetails`) when you want counts, type, size, compression, and URL metadata. Use `Emit URL inventory rows` (`emitUrlRows`) only when you want individual URLs from the sitemaps in the dataset.
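
If you build the input in code, a small helper can catch out-of-range values before a run starts. The sketch below is one possible approach, not part of the Actor: `build_input` is a hypothetical name, and the defaults and limits are copied from the input schema in the OpenAPI specification further down this page.

```python
# Ranges and defaults taken from the Actor's published input schema.
LIMITS = {
    "maxIndexDepth": (0, 10),
    "maxSitemapRows": (1, 1000),
    "maxUrlRows": (1, 100000),
}
DEFAULTS = {
    "followSitemapIndexes": True,
    "maxIndexDepth": 1,
    "parseSitemapDetails": True,
    "emitUrlRows": False,
    "maxSitemapRows": 10,
    "maxUrlRows": 10000,
}


def build_input(targets: list[str], **overrides) -> dict:
    """Return a run input dict, raising ValueError on invalid values."""
    # The schema requires at least one non-empty, unique target.
    cleaned = list(dict.fromkeys(t.strip() for t in targets if t.strip()))
    if not cleaned:
        raise ValueError("at least one target is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"targets": cleaned, **DEFAULTS, **overrides}
    for name, (low, high) in LIMITS.items():
        value = run_input[name]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(f"{name} must be an integer between {low} and {high}")
    return run_input


print(build_input(["example.com", "https://apify.com"], maxIndexDepth=2))
```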

### 🧾 Output example

```json
{
	"recordType": "sitemap",
	"target": "https://apify.com",
	"targetIndex": 0,
	"normalizedOrigin": "https://apify.com",
	"domainHost": "apify.com",
	"url": "https://apify.com/sitemap.xml",
	"canonicalUrl": "https://apify.com/sitemap.xml",
	"type": "sitemap_index",
	"httpStatus": 200,
	"contentType": "application/xml",
	"byteCount": 1240,
	"urlCount": 0,
	"childSitemapCount": 8,
	"isCompressed": false,
	"lastmod": "2026-06-01",
	"discoveredVia": "robots.txt",
	"discoverySources": ["robots.txt"],
	"parentSitemapUrl": null,
	"depth": 0,
	"scrapedAt": "2026-06-15T12:00:00.000Z"
}
```

When URL inventory is enabled, URL rows use `recordType: "url"` and include `url`, `sitemapUrl`, `lastmod`, `changefreq`, `priority`, and `hreflang` when available.
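
Because one dataset can mix several row types, it often helps to split items by `recordType` right after download. The helper below is a minimal sketch with a hypothetical name (`split_by_record_type`); it only relies on the `recordType` field shown above and makes no assumption about the value used for target summary rows.

```python
from collections import defaultdict


def split_by_record_type(items: list[dict]) -> dict[str, list[dict]]:
    """Group dataset items by their `recordType` value."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Rows without the field land in an "unknown" bucket instead of failing.
        groups[item.get("recordType", "unknown")].append(item)
    return dict(groups)


sample_items = [
    {"recordType": "sitemap", "url": "https://apify.com/sitemap.xml"},
    {"recordType": "url", "url": "https://apify.com/store"},
    {"recordType": "url", "url": "https://apify.com/pricing"},
]
groups = split_by_record_type(sample_items)
print({record_type: len(rows) for record_type, rows in groups.items()})
```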

### 💳 Pricing

Sitemap Sniffer uses pay-per-event pricing. One charged event is one discovered sitemap item, URL inventory item, or target summary saved by the run.

Keep URL inventory rows off when you only need sitemap-file metadata. Turn them on when you need a larger URL export for crawl planning, migrations, RAG source lists, or SEO checks.
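
For budgeting, you can estimate an upper bound on charged events from the input alone. The sketch below rests on two assumptions that you should check against the Store listing: that the row limits cap the number of charged sitemap and URL items, and that every event type costs the listed starting price of $0.90 per 1,000 items. The function names are hypothetical.

```python
def max_charged_events(run_input: dict) -> int:
    """Upper bound on charged events for one run, based on the input limits."""
    events = len(run_input["targets"])             # one target summary per target
    events += run_input.get("maxSitemapRows", 10)  # sitemap rows, capped by the limit
    if run_input.get("emitUrlRows", False):
        events += run_input.get("maxUrlRows", 10000)  # URL inventory rows, if enabled
    return events


def estimate_max_cost_usd(run_input: dict, price_per_1000: float = 0.90) -> float:
    """Worst-case cost, assuming one flat price for every event type."""
    return max_charged_events(run_input) * price_per_1000 / 1000


example_input = {
    "targets": ["https://apify.com", "example.com", "https://example.com/sitemap.xml"],
    "maxSitemapRows": 10,
    "emitUrlRows": False,
    "maxUrlRows": 10000,
}
print(max_charged_events(example_input))
print(round(estimate_max_cost_usd(example_input), 4))
```

With URL inventory rows off, the bound here is 3 target summaries plus 10 sitemap rows; turning them on adds up to `maxUrlRows` further events.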

### ⚠️ Limits and caveats

- Sitemap files must be publicly reachable.
- Some websites do not publish sitemap files, or publish them only for selected sections.
- Very large sitemap indexes can create many child sitemap or URL rows, so use the row limits for predictable output.
- Sitemap metadata is only as complete as the source file. Missing `lastmod`, `changefreq`, `priority`, or `hreflang` values are not guessed.
- This Actor reports public sitemap assets. It does not prove that search engines have indexed the URLs.
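
To see how a depth limit and a row limit interact on a large index, consider this simplified breadth-first walk. It is a sketch, not the Actor's algorithm: `walk_sitemaps` and `fetch_children` are hypothetical names, the fetcher is replaced by an in-memory dictionary, and it assumes depth 0 means a directly discovered sitemap, as in the output example.

```python
from collections import deque
from typing import Callable


def walk_sitemaps(
    start_urls: list[str],
    fetch_children: Callable[[str], list[str]],
    max_index_depth: int = 1,
    max_sitemap_rows: int = 10,
) -> list[dict]:
    """Breadth-first walk over sitemap indexes with depth and row limits."""
    rows: list[dict] = []
    seen: set[str] = set()
    queue = deque((url, None, 0) for url in start_urls)
    while queue and len(rows) < max_sitemap_rows:
        url, parent, depth = queue.popleft()
        if url in seen:
            continue  # the same sitemap can be listed by more than one index
        seen.add(url)
        rows.append({"url": url, "parentSitemapUrl": parent, "depth": depth})
        # Only expand children while still inside the depth budget.
        if depth < max_index_depth:
            for child in fetch_children(url):
                queue.append((child, url, depth + 1))
    return rows


# Stand-in for network fetches: index URL -> child sitemap URLs.
TREE = {
    "https://example.com/sitemap.xml": [
        "https://example.com/a.xml",
        "https://example.com/b.xml",
    ],
    "https://example.com/a.xml": ["https://example.com/a1.xml"],
}

for row in walk_sitemaps(["https://example.com/sitemap.xml"], lambda u: TREE.get(u, [])):
    print(row["depth"], row["url"])
```

Raising `max_index_depth` to 2 in this example would also reach `a1.xml`, while a low `max_sitemap_rows` stops the walk early regardless of depth.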

### ❓ FAQ

#### 🔐 Do I need login credentials or an API key?

No. This Actor reads public sitemap assets. You do not need to provide cookies, login credentials, a source API key, or proxy settings.

#### 🧭 Can it crawl my whole website?

No. Use this Actor to discover sitemap files and, optionally, the URLs listed inside those sitemap files. For rendered page crawling and link maps, use Website URL Crawler.

#### 🧩 Can I submit more than one website?

Yes. Add multiple targets to the same run. The output keeps `target` and `targetIndex` fields so you can filter each website separately.

#### 📄 Why did I get a target summary but no sitemap rows?

That usually means the target did not expose a public sitemap through `robots.txt`, common sitemap paths, or the direct URL you submitted. The run still completes so you can audit batches without one empty target failing the whole job.
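
To list such targets after a batch run, you can compare the submitted targets with the sitemap rows in the dataset. This is a small sketch with a hypothetical helper name; it assumes `targetIndex` is the zero-based position of the target in the submitted list, which matches the output example but is not stated explicitly.

```python
def targets_without_sitemaps(submitted_targets: list[str], items: list[dict]) -> list[str]:
    """Return submitted targets that have no `recordType: "sitemap"` rows."""
    indexes_with_sitemaps = {
        item["targetIndex"]
        for item in items
        if item.get("recordType") == "sitemap" and "targetIndex" in item
    }
    return [
        target
        for index, target in enumerate(submitted_targets)
        if index not in indexes_with_sitemaps
    ]


submitted = ["https://apify.com", "example.com"]
dataset_items = [
    {"recordType": "sitemap", "target": "https://apify.com", "targetIndex": 0},
]
print(targets_without_sitemaps(submitted, dataset_items))
```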

### 📝 Changelog

- 0.1: Initial release.

### 🆘 Support

For issues, questions, or feature requests, [file a ticket](https://console.apify.com/actors/maximedupre~sitemap-sniffer/issues) and I'll fix or implement it in less than 24h 🫡

### 🔗 Other Actors

- [Robots.txt Generator ↗](https://apify.com/maximedupre/robots-txt-generator) - Generate deployable robots.txt files with sitemap directives and crawler rules.
- [Website URL Crawler ↗](https://apify.com/maximedupre/website-url-crawler) - Crawl rendered website pages and export discovered links with source context.
- [Webpage Text Extractor ↗](https://apify.com/maximedupre/webpage-text-extractor) - Extract clean text or Markdown from public webpages after you collect URLs.
- [Web Images Scraper ↗](https://apify.com/maximedupre/web-images-scraper) - Extract image URLs and optional image files from public webpages.
- [RSS Feed Reader ↗](https://apify.com/maximedupre/rss-feed-reader) - Read public RSS, Atom, RDF, and JSON Feed URLs into clean dataset rows.

**Made with ❤️ by Maxime Dupré**

# Actor input Schema

## `targets` (type: `array`):

Add website roots, bare domains, robots.txt URLs, sitemap files, or sitemap indexes.

## `followSitemapIndexes` (type: `boolean`):

Fetch child sitemaps listed inside sitemap-index files.

## `maxIndexDepth` (type: `integer`):

Set how many nested sitemap-index levels to follow for each target.

## `parseSitemapDetails` (type: `boolean`):

Download sitemap bodies to return URL counts, child-sitemap counts, size, compression, and lastmod facts.

## `emitUrlRows` (type: `boolean`):

Add one row for each URL found inside discovered sitemap files.

## `maxSitemapRows` (type: `integer`):

Limit discovered sitemap-file rows across all targets.

## `maxUrlRows` (type: `integer`):

Limit URL inventory rows when URL row output is enabled.

## Actor input object example

```json
{
  "targets": [
    "https://apify.com",
    "apify.com/sitemap.xml"
  ],
  "followSitemapIndexes": true,
  "maxIndexDepth": 1,
  "parseSitemapDetails": true,
  "emitUrlRows": false,
  "maxSitemapRows": 10,
  "maxUrlRows": 10000
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "targets": [
        "https://apify.com",
        "apify.com/sitemap.xml"
    ],
    "followSitemapIndexes": true,
    "maxIndexDepth": 1,
    "parseSitemapDetails": true,
    "emitUrlRows": false,
    "maxSitemapRows": 10,
    "maxUrlRows": 10000
};

// Run the Actor and wait for it to finish
const run = await client.actor("maximedupre/sitemap-sniffer").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "targets": [
        "https://apify.com",
        "apify.com/sitemap.xml",
    ],
    "followSitemapIndexes": True,
    "maxIndexDepth": 1,
    "parseSitemapDetails": True,
    "emitUrlRows": False,
    "maxSitemapRows": 10,
    "maxUrlRows": 10000,
}

# Run the Actor and wait for it to finish
run = client.actor("maximedupre/sitemap-sniffer").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "targets": [
    "https://apify.com",
    "apify.com/sitemap.xml"
  ],
  "followSitemapIndexes": true,
  "maxIndexDepth": 1,
  "parseSitemapDetails": true,
  "emitUrlRows": false,
  "maxSitemapRows": 10,
  "maxUrlRows": 10000
}' |
apify call maximedupre/sitemap-sniffer --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=maximedupre/sitemap-sniffer",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Sniffer",
        "description": "Find sitemap files from website roots, domains, robots.txt, and direct sitemap URLs. Export sitemap metadata, URL counts, nested index depth, and optional URL inventory rows.",
        "version": "0.1",
        "x-build-id": "rw7U1GoICcWG1q3oF"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/maximedupre~sitemap-sniffer/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-maximedupre-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/maximedupre~sitemap-sniffer/runs": {
            "post": {
                "operationId": "runs-sync-maximedupre-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/maximedupre~sitemap-sniffer/run-sync": {
            "post": {
                "operationId": "run-sync-maximedupre-sitemap-sniffer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "targets"
                ],
                "properties": {
                    "targets": {
                        "title": "Website or sitemap targets",
                        "minItems": 1,
                        "uniqueItems": true,
                        "type": "array",
                        "description": "Add website roots, bare domains, robots.txt URLs, sitemap files, or sitemap indexes.",
                        "items": {
                            "type": "string",
                            "minLength": 1
                        }
                    },
                    "followSitemapIndexes": {
                        "title": "Follow sitemap indexes",
                        "type": "boolean",
                        "description": "Fetch child sitemaps listed inside sitemap-index files.",
                        "default": true
                    },
                    "maxIndexDepth": {
                        "title": "Maximum index depth",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Set how many nested sitemap-index levels to follow for each target.",
                        "default": 1
                    },
                    "parseSitemapDetails": {
                        "title": "Count URLs in sitemaps",
                        "type": "boolean",
                        "description": "Download sitemap bodies to return URL counts, child-sitemap counts, size, compression, and lastmod facts.",
                        "default": true
                    },
                    "emitUrlRows": {
                        "title": "Emit URL inventory rows",
                        "type": "boolean",
                        "description": "Add one row for each URL found inside discovered sitemap files.",
                        "default": false
                    },
                    "maxSitemapRows": {
                        "title": "Maximum sitemap rows",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Limit discovered sitemap-file rows across all targets.",
                        "default": 10
                    },
                    "maxUrlRows": {
                        "title": "Maximum URL inventory rows",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Limit URL inventory rows when URL row output is enabled.",
                        "default": 10000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
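
If you prefer to call the REST API without a client library, the specification above translates into a plain HTTP POST. The sketch below builds such a request for the `run-sync-get-dataset-items` path using only the Python standard library. The helper name `build_run_request` is hypothetical, and the request is constructed but not sent, so you can inspect it first.

```python
import json
import urllib.parse
import urllib.request

# Server URL and path as listed in the OpenAPI specification above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/maximedupre~sitemap-sniffer/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that runs the Actor synchronously."""
    # The spec declares `token` as a required query parameter.
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request("<YOUR_API_TOKEN>", {"targets": ["https://apify.com"]})
print(request.get_method(), request.full_url.split("?")[0])

# To actually send it (assumes a valid token and a JSON response body):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Keep the token out of logs and source control, since it travels in the query string in this form.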
