# Sitemap Scraper (`scraperoka/sitemap-scraper`) Actor

🔎 Sitemap Scraper extracts URLs from XML sitemaps fast and accurately. 🚀 Perfect for SEO audits, link building, content discovery, and crawling planning. 📈 Get organized site maps in minutes—save time, boost rankings!

- **URL**: https://apify.com/scraperoka/sitemap-scraper.md
- **Developed by:** [Scraperoka](https://apify.com/scraperoka) (community)
- **Categories:** SEO tools, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.01 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.
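
As a rough illustration of the per-result part of the price, the sketch below multiplies a result count by the advertised starting rate. The helper name is hypothetical, and it assumes every result is billed at the "from" rate; Apify platform usage is charged on top and is not modeled here.

```python
# Starting price quoted above: $0.01 per 1,000 results.
PRICE_PER_1000_RESULTS_USD = 0.01


def estimate_event_cost_usd(result_count: int) -> float:
    """Lower-bound estimate of per-result charges (hypothetical helper).

    Assumes each result is billed at the advertised "from" rate.
    Apify platform usage is billed separately and is not included.
    """
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    return result_count / 1000 * PRICE_PER_1000_RESULTS_USD


if __name__ == "__main__":
    # Example: a run that extracts 50,000 URLs.
    print(f"${estimate_event_cost_usd(50_000):.2f}")
```

Treat the result as a floor, since the listed price is a starting rate.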

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Sitemap Scraper ⚡

Manually hunting down every page URL across a website takes hours and often misses important sections. **Sitemap Scraper** extracts all URLs from a sitemap (including sitemap indexes) and saves them to an Apify dataset, which suits marketers, SEO specialists, and researchers who need **sitemap scraper** results in bulk. Use it as a **website sitemap scraper** or **xml sitemap scraper** to turn sitemap parsing into a repeatable workflow that can produce thousands of extracted URLs in a single run.

---

### What You Get: Sample Output

Here's a sample record from a single run:

```json
{
  "url": "https://example.com/blog/technical-seo-checklist",
  "lastMod": "2025-05-21"
}
````

| Field | Type | What It Tells You |
|---|---|---|
| `url` | string | The extracted page URL from the sitemap (including sitemap indexes) |
| `lastMod` | string or null | The sitemap’s `lastmod` date (YYYY-MM-DD format) when available |
| `success` | (not present in output) | No extra success flag is added per record by this actor |
| `error_message` | (not present in output) | Errors are logged during the run; records pushed contain `url` and `lastMod` only |
| `charged_event_name` | (not present in output) | Not a record field; the Actor pushes extracted URL batches with `charged_event_name="result"` |

Export your dataset as JSON, CSV, or Excel — straight from the Apify dashboard.

***

### Why Sitemap Scraper?

There are a lot of ways to pull data from sitemaps—here’s what sets Sitemap Scraper apart for **website sitemap scraper** workflows and **xml sitemap scraper** needs.

#### Handles sitemap indexes automatically

If your sitemap is an index that points to many sub-sitemaps, Sitemap Scraper recursively fetches and parses them. That means you can feed in a single entry and still get complete coverage for **sitemap url extractor** use cases.

#### Extracts clean URL records from `urlset`

For regular sitemap files, it extracts each `<loc>` as a URL and captures `lastmod` when present. This makes it a practical **sitemap parsing tool** for building SEO lists like “all URLs for content audit” and “competitor sitemap scraper” style research.

#### Resilient fetching with retries

When a sitemap request fails, the actor includes retries and backs off between attempts to improve reliability. This helps when hosting servers throttle or intermittently block requests while you’re running **bulk sitemap URL extraction** jobs.

#### Output is written in batches for efficiency

Extracted URL records are pushed to the dataset in batches for faster processing during larger runs. The result is smoother execution when you’re using Sitemap Scraper for **sitemap link extraction** at scale.
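
The batching idea can be sketched in a few lines of Python. This is an illustration only, not the Actor's source: the `batched` helper and the batch size are assumptions.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


if __name__ == "__main__":
    urls = [{"url": f"https://example.com/page-{i}", "lastMod": None} for i in range(5)]
    for chunk in batched(urls, batch_size=2):
        # In an Actor, each chunk would go to a single dataset push call.
        print(len(chunk))
```

Pushing one list per call rather than one record per call reduces the number of dataset writes for large sitemaps.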

***

### Configuring Your Run

Drop this into your `input.json` to get started:

```json
{
  "startUrls": [
    { "url": "https://example.com/sitemap.xml" },
    { "url": "https://example.com/sitemap_index.xml" }
  ]
}
```

| Parameter | Required | What It Does |
|---|---|---|
| `startUrls` | ✅ | List of sitemap URLs to crawl (supports both sitemap files and sitemap indexes) |
| ↳ `startUrls[].url` | ✅ | The actual sitemap URL to fetch and parse |

> Note: The actor also reads `proxyConfiguration` from the run input (if you provide it). If proxy settings are present, it will use them to fetch sitemaps; otherwise it runs without proxy support.
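
A run input with proxy settings might be assembled as below. This is a sketch: `build_run_input` is a hypothetical helper, and the `{"useApifyProxy": true}` object is the shape commonly used for Apify proxy inputs, assumed here because the input schema on this page lists only `startUrls`.

```python
import json


def build_run_input(sitemap_urls, use_apify_proxy=False):
    """Build a run input dict (hypothetical helper, not part of the Actor)."""
    run_input = {"startUrls": [{"url": url} for url in sitemap_urls]}
    if use_apify_proxy:
        # Assumed shape: the common Apify proxy input object.
        run_input["proxyConfiguration"] = {"useApifyProxy": True}
    return run_input


if __name__ == "__main__":
    example = build_run_input(["https://example.com/sitemap.xml"], use_apify_proxy=True)
    print(json.dumps(example, indent=2))
```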

***

### Core Capabilities

#### Sitemap crawling for complete URL coverage

Sitemap Scraper fetches your provided sitemap URLs and parses the XML to extract URLs. If a sitemap is a sitemap index, it follows through to the underlying sub-sitemaps to find all URLs.

#### URL extraction with optional `lastmod`

For each URL entry, it outputs `url` and, when available, a `lastMod` value derived from the sitemap’s `lastmod`. This is useful when you’re building datasets for SEO prioritization with **sitemap data extraction** in mind.
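
Sitemap `lastmod` values may be plain dates or full timestamps. One plausible way to derive a `YYYY-MM-DD` value is to keep only the leading date portion, as sketched below. The Actor's actual conversion is not documented here, so `normalize_lastmod` and its truncation rule are assumptions.

```python
import re
from typing import Optional

# Matches a leading calendar date such as 2025-05-21.
_DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def normalize_lastmod(raw: Optional[str]) -> Optional[str]:
    """Return the YYYY-MM-DD prefix of a lastmod value, or None.

    Assumption: time and timezone parts are dropped without conversion,
    and values without a full date (for example "2025-05") become None.
    """
    if raw is None:
        return None
    match = _DATE_PREFIX.match(raw.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(normalize_lastmod("2025-05-21T10:15:00+00:00"))
    print(normalize_lastmod(None))
```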

#### Recursive sitemap parsing

Sitemap Scraper recursively handles both sitemap index structures and standard URL sets. That makes it well-suited for “extract urls from sitemap” workflows that need consistent results regardless of sitemap format.
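
The recursion described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the Actor's implementation: `iter_sitemap_urls` and the injected `fetch` callable are hypothetical, and only the standard sitemaps.org namespace is handled.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield {"url", "lastMod"} records, following sitemap indexes recursively.

    `fetch` takes a URL and returns the XML text, so HTTP handling
    (retries, proxies) stays outside this function.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == f"{NS}sitemapindex":
        # An index lists further sitemaps: recurse into each one.
        for loc in root.iterfind(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                yield from iter_sitemap_urls(loc.text.strip(), fetch, seen)
    elif root.tag == f"{NS}urlset":
        # A regular sitemap lists page URLs with an optional lastmod.
        for entry in root.iterfind(f"{NS}url"):
            loc_text = entry.findtext(f"{NS}loc")
            if loc_text:
                lastmod = entry.findtext(f"{NS}lastmod")
                yield {
                    "url": loc_text.strip(),
                    "lastMod": lastmod.strip() if lastmod else None,
                }
```

Passing `fetch` as a parameter keeps the parsing logic easy to test with in-memory XML strings.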

#### Resilience for real-world endpoints

It includes retry logic (up to 3 attempts) and uses exponential backoff between attempts. This helps keep long extraction runs stable when endpoints are temporarily unavailable or rate-limited.
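
A retry loop with exponential backoff can look like the sketch below. The three-attempt limit comes from the description above; the helper name, the one-second base delay, and the doubling factor are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_secs: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base, 2 * base, 4 * base, ...
            sleep(base_delay_secs * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.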

#### Dataset-ready output for automation

Extracted results are pushed into your Apify dataset as they’re parsed. You can then connect the output to your downstream pipeline for reporting, auditing, or research without manual copying.

***

### Who Gets the Most Out of This

Sitemap Scraper is ideal for SEO specialists who need a reliable **sitemap scraper for SEO** workflow to audit what a site actually publishes. It’s also a strong fit for competitive research teams running a **competitor sitemap scraper** process—building URL datasets faster than manual browsing.

Marketing and growth analysts use this **xml sitemap scraper** output to segment content catalogs, estimate crawl scope, and validate campaign landing pages. Data researchers can use it to **find all URLs in sitemap** files and get datasets with consistent fields (`url` and `lastMod`) for analysis and downstream enrichment.

If you’re an automation-focused technical user, Sitemap Scraper works as a clean “URL ingestion” step in a larger pipeline, turning sitemap parsing into a repeatable job you can trigger and export programmatically.

***

### Step-by-Step: How to Use It

No coding needed. Here's how to run Sitemap Scraper from start to finish:

1. **Open the actor on Apify** — go to [console.apify.com](https://console.apify.com) and search for Sitemap Scraper.
2. **Enter your inputs** — provide your sitemap(s) in `startUrls` using the `url` values from your own site.
3. **Configure proxy settings (optional)** — if your environment needs it, set the run’s proxy configuration options.
4. **Hit Run and watch the live log** — confirm it’s fetching and parsing your sitemap(s).
5. **View results in the dataset tab** — you’ll see extracted URL records as the actor pushes them.
6. **Export as JSON, CSV, or Excel** — download your dataset directly from the Apify dashboard.

The whole process takes under 5 minutes to set up.

***

### Integrations & Export Options

Once your data is collected, Sitemap Scraper plugs directly into your existing workflow.

You can export your Apify dataset from the dashboard in common formats like **JSON, CSV, or Excel**, which makes **extract urls from sitemap** outputs easy to share with stakeholders.

You can also access the results via the **Apify API** for programmatic pipelines, and use **webhooks** and automation tools (such as Zapier or Make) to trigger downstream actions when runs complete. For setup details, refer to the Apify documentation at https://apify.com/docs/api.
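
For programmatic export, a download link for a finished run's dataset can be built as sketched below. The helper name is hypothetical; the `/datasets/{id}/items` path and the `format` query parameter follow the Apify API v2 as I understand it, so check them against the API reference before relying on them.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_export_url(dataset_id: str, fmt: str = "csv", token: str = "") -> str:
    """Build a dataset items URL for the given export format (sketch)."""
    params = {"format": fmt}
    if token:
        # Assumption: the token is passed as a query parameter.
        params["token"] = token
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


if __name__ == "__main__":
    # "<DATASET_ID>" stands for the run's defaultDatasetId.
    print(dataset_export_url("<DATASET_ID>", fmt="json"))
```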

For recurring workflows (for example, frequent sitemap checks), schedule the actor to run automatically on a cron schedule through Apify.

***

### Pricing & Free Trial

Sitemap Scraper runs on the Apify platform, which offers a **free tier** — no credit card required to get started.

Apify provides initial free platform credits on sign-up, which is typically enough for several test runs. For production usage, this Actor is charged per event in addition to Apify platform usage such as compute units (CU), and you can choose from Apify’s available starter/scale plans depending on your workload. Start for free at [apify.com](https://apify.com) and scale when you're ready.

***

### Reliability & Performance

| What We Handle | How |
|---|---|
| Rate-limited / blocked sitemap requests | Retries and backoff to improve fetch success |
| Proxy needs | Optional proxy support if you configure it in your run input |
| Large sitemap indexes | Recursive parsing to reach all sub-sitemaps |
| Error resilience | Failures during fetch or parse are logged so you can inspect run logs |
| Output readiness | Extracted URLs are pushed to your dataset for immediate use |

Limitations: If a sitemap endpoint is inaccessible or returns invalid/unparseable XML, extraction can be incomplete. Sitemap Scraper only extracts what’s present in the provided sitemap files; it cannot invent URLs that aren’t listed.

For enterprise-scale runs, contact us to discuss custom configurations.

***

### Frequently Asked Questions

#### Is there a free plan or trial?

Yes—Apify offers a free tier so you can test Sitemap Scraper without needing a credit card.

#### Do I need to log in to use Sitemap Scraper?

No. Sitemap Scraper only fetches and parses sitemap content from the sitemap URLs you provide.

#### How accurate is the data?

The output is as accurate as the XML in the sitemap. It extracts `url` values from the sitemap entries and includes `lastMod` when the sitemap provides a `lastmod`.

#### How many results can I get per run?

You can typically extract many URLs per run, depending on how large the provided sitemaps are and what the host server allows during your job window.

#### How often is the data updated / how fresh is it?

Freshness depends on when you run the actor. The extracted data includes `lastMod` values from the sitemap, but the actor only reflects what’s available at the time of fetching.
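
To focus on recently changed pages, the exported records can be filtered by `lastMod` after a run. The sketch below assumes records shaped like the sample output (`url` plus a `YYYY-MM-DD` `lastMod` or `null`); `modified_since` is a hypothetical helper.

```python
from datetime import date
from typing import Dict, Iterable, List, Optional


def modified_since(
    records: Iterable[Dict[str, Optional[str]]], cutoff: date
) -> List[Dict[str, Optional[str]]]:
    """Return records whose lastMod is on or after `cutoff`.

    Records without a lastMod value are skipped, since their age is unknown.
    """
    fresh = []
    for record in records:
        last_mod = record.get("lastMod")
        if last_mod and date.fromisoformat(last_mod) >= cutoff:
            fresh.append(record)
    return fresh


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com/blog/technical-seo-checklist", "lastMod": "2025-05-21"},
        {"url": "https://example.com/about", "lastMod": None},
    ]
    print(modified_since(sample, date(2025, 1, 1)))
```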

#### Is this legal? Does it comply with GDPR / CCPA?

Sitemap Scraper works with **publicly available data** from sitemaps. You’re responsible for ensuring your use complies with applicable regulations (including GDPR/CCPA) and the website’s terms for accessing and using that information.

#### Can I export results to Google Sheets or Excel?

Yes. You can export your Apify dataset from the dashboard in formats like JSON and CSV, and import into tools like Excel or set up integrations for spreadsheets.

#### Can I run this on a schedule automatically?

Yes. You can schedule Apify actor runs on a cron schedule so your sitemap parsing happens automatically at whatever frequency you choose.

#### Can I access this via API?

Yes. You can use the Apify API to trigger runs and retrieve results programmatically. See https://apify.com/docs/api for details.

#### What happens if the actor hits an error?

If a sitemap fetch fails, the actor logs the failure and retries with backoff. Parsing errors are also logged, and whatever URLs can be extracted will still be pushed to the dataset.

***

### Need Help or Have a Request?

Got a question about Sitemap Scraper or want a new feature added? Reach out at <dataforleads@gmail.com>. We welcome requests like enhanced export options and webhook notifications on completion. We actively maintain this actor based on user feedback.

***

### Disclaimer & Responsible Use

*Sitemap Scraper is the fastest, most reliable way to extract URLs from sitemaps—start your free run today.*

**Sitemap Scraper uses publicly available data** from the sitemap URLs you provide. It does not access private accounts, login-gated content, or password-protected pages. You are responsible for complying with GDPR, CCPA, and any relevant platform terms. For data-removal requests, contact <dataforleads@gmail.com>. Use responsibly, ethically, and only for lawful purposes.

# Actor input Schema

## `startUrls` (type: `array`):

List of sitemap URLs to crawl.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://blog.apify.com/sitemap.xml"
    }
  ]
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://blog.apify.com/sitemap.xml"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("scraperoka/sitemap-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": [{ "url": "https://blog.apify.com/sitemap.xml" }] }

# Run the Actor and wait for it to finish
run = client.actor("scraperoka/sitemap-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://blog.apify.com/sitemap.xml"
    }
  ]
}' |
apify call scraperoka/sitemap-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=scraperoka/sitemap-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Scraper",
        "description": "🔎 Sitemap Scraper extracts URLs from XML sitemaps fast and accurately. 🚀 Perfect for SEO audits, link building, content discovery, and crawling planning. 📈 Get organized site maps in minutes—save time, boost rankings!",
        "version": "1.0",
        "x-build-id": "MxW4Vu8jdMh26yAfc"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/scraperoka~sitemap-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-scraperoka-sitemap-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/scraperoka~sitemap-scraper/runs": {
            "post": {
                "operationId": "runs-sync-scraperoka-sitemap-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/scraperoka~sitemap-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-scraperoka-sitemap-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Sitemap URLs",
                        "type": "array",
                        "description": "List of sitemap URLs to crawl.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
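
The `run-sync-get-dataset-items` path from the specification above can also be called without a client library. The sketch below uses only the Python standard library. The function names and the 300-second timeout are assumptions, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.request
from urllib.parse import urlencode

ENDPOINT = (
    "https://api.apify.com/v2/acts/"
    "scraperoka~sitemap-scraper/run-sync-get-dataset-items"
)


def build_sync_request(token: str, sitemap_urls) -> urllib.request.Request:
    """Build the POST request: token in the query string, input as JSON body."""
    body = json.dumps({"startUrls": [{"url": u} for u in sitemap_urls]}).encode("utf-8")
    return urllib.request.Request(
        f"{ENDPOINT}?{urlencode({'token': token})}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_get_items(token: str, sitemap_urls, timeout_secs: float = 300):
    """Run the Actor synchronously and return the parsed response body."""
    request = build_sync_request(token, sitemap_urls)
    with urllib.request.urlopen(request, timeout=timeout_secs) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Replace '<YOUR_API_TOKEN>' with your token before running.
    items = run_and_get_items("<YOUR_API_TOKEN>", ["https://blog.apify.com/sitemap.xml"])
    for item in items:
        print(item)
```

A synchronous call suits small sitemaps; for long runs, the `/runs` path starts the run and returns immediately.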
