# Sitemap & URL Extractor — Get Every URL of a Website (`dataquarry/sitemap-url-extractor`) Actor

Get every URL of a website: parses sitemap.xml and sitemap-indexes (discovered via robots.txt or the default location), with a same-site crawl fallback when there's no sitemap. Returns each URL + lastmod. No API key.

- **URL**: https://apify.com/dataquarry/sitemap-url-extractor.md
- **Developed by:** [Daniel Brenner](https://apify.com/dataquarry) (community)
- **Categories:** Developer tools, SEO tools, AI
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap & URL Extractor — Get Every URL of a Website

**Free.** Give it a website (or a sitemap URL) and get back **every URL on the site** — parsed from `sitemap.xml` and sitemap-indexes (auto-discovered via `robots.txt` and the default location), with a **same-site crawl fallback** when a site has no sitemap. No API key.

Perfect for **feeding an LLM/RAG pipeline** (find every page to ingest), site audits, migrations, link checking, and SEO.

### What you get (per URL)

- `url` — the page URL (absolute, deduped)
- `lastmod` — last-modified date from the sitemap, when present (`null` otherwise)
- `source` — `"sitemap"` or `"crawl"` (how the URL was found)
- `discoveredAt` — when the URL was discovered
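As a rough illustration of working with these fields, the sketch below counts records per `source` and picks out URLs modified on or after a cutoff, skipping records whose `lastmod` is `null` (`None` in Python). The sample records are made up, and the plain `YYYY-MM-DD` date format is an assumption; the real `lastmod` value depends on what each sitemap provides.

```python
from collections import Counter

# Hypothetical records shaped like the fields listed above (not real Actor output).
items = [
    {"url": "https://example.com/", "lastmod": "2024-05-01", "source": "sitemap"},
    {"url": "https://example.com/about", "lastmod": None, "source": "sitemap"},
    {"url": "https://example.com/blog", "lastmod": "2024-06-15", "source": "sitemap"},
    {"url": "https://example.com/contact", "lastmod": None, "source": "crawl"},
]


def modified_since(records, cutoff):
    """Return URLs whose lastmod is known and not older than cutoff.

    Assumes both values are ISO dates in the same format, so that
    plain string comparison matches chronological order.
    """
    return [
        record["url"]
        for record in records
        if record["lastmod"] is not None and record["lastmod"] >= cutoff
    ]


def count_by_source(records):
    """Count how many URLs came from the sitemap and how many from the crawl."""
    return dict(Counter(record["source"] for record in records))


recent = modified_since(items, "2024-06-01")
by_source = count_by_source(items)
```

Records with a `null` `lastmod` are left out of the date filter rather than treated as old or new, which matches the "never guessed" policy described in this README.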

### How to use it

```json
{ "startUrls": ["https://example.com"], "maxResults": 5000 }
````

Pass a **site URL** (the sitemap is found automatically) or a **direct sitemap URL**. It handles **sitemap-indexes** (sites that split their sitemap into many files) by following each child sitemap, and if there's no sitemap at all it falls back to a polite, same-site **crawl**. It respects `robots.txt`, identifies itself, and fetches one request at a time.
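To make the sitemap-index handling concrete, here is a minimal sketch using only the Python standard library. It is not the Actor's source code: `parse_sitemap`, `collect_urls`, the in-memory `SAMPLE` documents, and the `fetch` callback are all hypothetical names introduced for illustration. It distinguishes a `<sitemapindex>` from a `<urlset>` and follows child sitemaps up to a cap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, entries): kind is 'index' or 'urlset',
    entries is a list of (loc, lastmod_or_None) tuples."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        kind, child_tag = "index", "sitemap"
    elif root.tag == NS + "urlset":
        kind, child_tag = "urlset", "url"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")
    entries = []
    for node in root.findall(NS + child_tag):
        loc = node.findtext(NS + "loc")
        if not loc:
            continue  # skip entries without a location
        lastmod = node.findtext(NS + "lastmod")
        entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return kind, entries


def collect_urls(start_url, fetch, max_sitemaps):
    """Follow sitemap-indexes breadth-first, fetching at most max_sitemaps files."""
    pending, fetched, seen, results = [start_url], 0, set(), []
    while pending and fetched < max_sitemaps:
        kind, entries = parse_sitemap(fetch(pending.pop(0)))
        fetched += 1
        for loc, lastmod in entries:
            if kind == "index":
                pending.append(loc)  # child sitemap: queue it
            elif loc not in seen:
                seen.add(loc)  # page URL: dedupe and record
                results.append({"url": loc, "lastmod": lastmod, "source": "sitemap"})
    return results


# Hypothetical in-memory documents standing in for HTTP responses.
SAMPLE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc><lastmod>2024-05-01</lastmod></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
    "https://example.com/sitemap-posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/blog</loc><lastmod>2024-06-15</lastmod></url>"
        "</urlset>"
    ),
}

all_urls = collect_urls("https://example.com/sitemap.xml", SAMPLE.__getitem__, max_sitemaps=10)
```

Passing `fetch` as a callback keeps the parsing logic independent of any HTTP library; a real run would swap in a function that downloads each sitemap.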

### Pair it: discover → extract → audit

This is the **discover** step of a clean "feed-your-AI" toolkit by **dataquarry**:

1. **Discover** — *this Actor*: every URL of a site.
2. **Extract** — [`dataquarry/website-to-markdown`](https://apify.com/dataquarry/website-to-markdown): turn those URLs into clean, LLM-ready Markdown.
3. **Audit** — [`dataquarry/website-seo-metadata-checker`](https://apify.com/dataquarry/website-seo-metadata-checker): SEO & metadata for each page.

Also see the [dataquarry OSM place-data scrapers](https://apify.com/dataquarry) and free guides at [openplacedata.com](https://openplacedata.com).

### Clean & honest

Reads only public `sitemap.xml`/`robots.txt` and (in fallback) public pages; respects `robots.txt`; sends a descriptive User-Agent; no logins, no PII. Missing values are `null`, never guessed.

### FAQ

**Do I need an API key?** No — give it a URL and run it. It's free.

**What if the site has no sitemap?** It crawls the site's own links (same-domain, bounded) so you still get a URL list.

**Does it handle huge sitemap-indexes?** Yes — it follows child sitemaps up to the `maxSitemaps` and `maxResults` caps you set.
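For readers curious what a bounded, same-site crawl fallback might look like, here is a minimal standard-library sketch. It is an illustration under assumptions, not the Actor's implementation: `LinkCollector`, `crawl_same_site`, and the in-memory `PAGES` map are hypothetical, and "same site" is simplified here to an exact host match.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_same_site(start_url, fetch, max_pages):
    """Breadth-first crawl that visits at most max_pages pages.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    Only http(s) links on the start URL's host are kept.
    """
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()
        html = fetch(page)
        visited += 1
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            url, _fragment = urldefrag(urljoin(page, href))
            parts = urlsplit(url)
            if parts.scheme in ("http", "https") and parts.netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return sorted(seen)


# Hypothetical pages standing in for HTTP responses.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.example/x">X</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/c">C</a> <a href="/">Home</a>',
    "https://example.com/b": "",
    "https://example.com/c": "",
}

found = crawl_same_site("https://example.com/", PAGES.get, max_pages=2)
```

The `max_pages` bound plays the role that `maxCrawlPages` plays in the Actor input: it limits pages visited, so discovered-but-unvisited URLs can still appear in the result.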

# Actor input Schema

## `startUrls` (type: `array`):

Site URLs (e.g. https://example.com) or direct sitemap URLs. For a site, the sitemap is auto-discovered via robots.txt and /sitemap.xml.
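The discovery order described above can be sketched roughly as follows: read `Sitemap:` lines from `robots.txt`, and fall back to `/sitemap.xml` when none are declared. The function name `sitemap_candidates` is hypothetical, and the Actor's actual discovery logic may differ in details.

```python
from urllib.parse import urljoin


def sitemap_candidates(site_url, robots_txt):
    """Return sitemap URLs declared in robots.txt, else the default location."""
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        # The field name is matched case-insensitively ("Sitemap", "sitemap", ...).
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found or [urljoin(site_url, "/sitemap.xml")]


robots = """User-agent: *
Disallow: /private
Sitemap: https://example.com/sitemap_index.xml
# Sitemap: https://example.com/old.xml
"""

declared = sitemap_candidates("https://example.com/", robots)
fallback = sitemap_candidates("https://example.com/docs/", "")
```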

## `maxResults` (type: `integer`):

Maximum number of URLs to return.

## `crawlFallback` (type: `boolean`):

If a site has no sitemap, discover URLs by crawling its same-site links instead.

## `maxCrawlPages` (type: `integer`):

Hard cap on pages visited when using the crawl fallback.

## `maxSitemaps` (type: `integer`):

Hard cap on sitemap files fetched (sitemap-indexes can reference many).

## `respectRobotsTxt` (type: `boolean`):

In crawl-fallback mode, skip URLs disallowed by robots.txt for our user-agent.
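As an illustration of this kind of filtering, Python's standard `urllib.robotparser` can check URLs against a `robots.txt` body. This is a sketch only: the `example-bot` user-agent string and the `allowed_urls` helper are hypothetical, and the Actor's own user-agent and matching rules are not documented here.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that robots.txt permits for the given user-agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = """User-agent: *
Disallow: /private/
"""

candidates = [
    "https://example.com/",
    "https://example.com/private/page",
    "https://example.com/blog",
]

kept = allowed_urls(robots, "example-bot", candidates)
```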

## `requestDelayMs` (type: `integer`):

Optional politeness delay between requests, in milliseconds.

## Actor input object example

```json
{
  "startUrls": [
    "https://nodejs.org/"
  ],
  "maxResults": 5000,
  "crawlFallback": true,
  "maxCrawlPages": 100,
  "maxSitemaps": 50,
  "respectRobotsTxt": true,
  "requestDelayMs": 0
}
```
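When building this input in code, it can help to apply the defaults and minimums up front. The sketch below mirrors the `default` and `minimum` values listed in the OpenAPI specification in this document; the `build_input` helper itself is hypothetical, and requiring a non-empty `startUrls` is an assumption rather than something the schema states.

```python
# Defaults and minimums copied from the input schema in this document.
DEFAULTS = {
    "maxResults": 5000,
    "crawlFallback": True,
    "maxCrawlPages": 100,
    "maxSitemaps": 50,
    "respectRobotsTxt": True,
    "requestDelayMs": 0,
}
MINIMUMS = {"maxResults": 1, "maxCrawlPages": 1, "maxSitemaps": 1, "requestDelayMs": 0}


def build_input(start_urls, **overrides):
    """Merge overrides into the schema defaults and validate integer bounds."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    if not start_urls:
        # Assumption: an empty list would give the Actor nothing to do.
        raise ValueError("startUrls must contain at least one URL")
    run_input = {"startUrls": list(start_urls), **DEFAULTS, **overrides}
    for key, minimum in MINIMUMS.items():
        value = run_input[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
            raise ValueError(f"{key} must be an integer >= {minimum}")
    return run_input


run_input = build_input(["https://nodejs.org/"], maxSitemaps=10, requestDelayMs=250)
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the API section.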

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://nodejs.org/"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("dataquarry/sitemap-url-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": ["https://nodejs.org/"] }

# Run the Actor and wait for it to finish
run = client.actor("dataquarry/sitemap-url-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://nodejs.org/"
  ]
}' |
apify call dataquarry/sitemap-url-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=dataquarry/sitemap-url-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap & URL Extractor — Get Every URL of a Website",
        "description": "Get every URL of a website: parses sitemap.xml and sitemap-indexes (discovered via robots.txt or the default location), with a same-site crawl fallback when there's no sitemap. Returns each URL + lastmod. No API key.",
        "version": "0.0",
        "x-build-id": "yn9P6Gp99bXpjNaw3"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/dataquarry~sitemap-url-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-dataquarry-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/dataquarry~sitemap-url-extractor/runs": {
            "post": {
                "operationId": "runs-sync-dataquarry-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/dataquarry~sitemap-url-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-dataquarry-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Websites / sitemap URLs",
                        "type": "array",
                        "description": "Site URLs (e.g. https://example.com) or direct sitemap URLs. For a site, the sitemap is auto-discovered via robots.txt and /sitemap.xml.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxResults": {
                        "title": "Max URLs",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of URLs to return.",
                        "default": 5000
                    },
                    "crawlFallback": {
                        "title": "Crawl if no sitemap",
                        "type": "boolean",
                        "description": "If a site has no sitemap, discover URLs by crawling its same-site links instead.",
                        "default": true
                    },
                    "maxCrawlPages": {
                        "title": "Max pages to crawl (fallback)",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Hard cap on pages visited when using the crawl fallback.",
                        "default": 100
                    },
                    "maxSitemaps": {
                        "title": "Max sitemap files",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Hard cap on sitemap files fetched (sitemap-indexes can reference many).",
                        "default": 50
                    },
                    "respectRobotsTxt": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "In crawl-fallback mode, skip URLs disallowed by robots.txt for our user-agent.",
                        "default": true
                    },
                    "requestDelayMs": {
                        "title": "Delay between requests (ms)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Optional politeness delay between requests, in milliseconds.",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
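For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be used with nothing but the standard library. This is a hedged sketch: `build_request` and `fetch_items` are hypothetical helper names, and treating the response body as a JSON array of dataset items is an assumption, since the specification only declares a `200` response of `OK`.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/dataquarry~sitemap-url-extractor/run-sync-get-dataset-items"


def build_request(token, run_input):
    """Build the POST request: token in the query string, JSON input in the body."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_items(token, run_input, timeout=300.0):
    """Send the request and decode the response (assumed to be a JSON array)."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; only fetch_items() does.
request = build_request("<YOUR_API_TOKEN>", {"startUrls": ["https://nodejs.org/"]})
```

Because the call blocks until the run finishes, long crawls may be better served by the asynchronous `/runs` endpoint followed by a dataset read.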
