# Sitemap Url Extractor (`solid-scraper/sitemap-url-extractor`) Actor

🔎 Extract URLs from any sitemap fast and accurately. Sitemap Url Extractor helps you discover, audit, and optimize website links for SEO, crawling, and migrations—ideal for webmasters, marketers, and developers. 🚀⚙️

- **URL**: https://apify.com/solid-scraper/sitemap-url-extractor.md
- **Developed by:** [SolidScraper](https://apify.com/solid-scraper) (community)
- **Categories:** SEO tools, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $2.99 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
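
As a rough budgeting aid, the event-based part of a run can be estimated from the listed starting rate. This is a minimal sketch that assumes every result is billed at $2.99 per 1,000; the function name is hypothetical, and Apify platform usage is charged on top and is not included.

```python
def estimate_event_cost(num_results: int, price_per_1000: float = 2.99) -> float:
    """Estimate the pay-per-event part of a run's cost in USD.

    Assumes each result is billed at the listed starting rate.
    Apify platform usage is billed separately and is not included.
    """
    if num_results < 0:
        raise ValueError("num_results must be non-negative")
    return round(num_results * price_per_1000 / 1000, 2)


# Estimated event cost for a sitemap with 25,000 URLs.
print(estimate_event_cost(25_000))
```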

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Sitemap URL Extractor 🔍

**Sitemap URL Extractor** automatically extracts URLs from a sitemap (including sitemap indexes) and saves them to an Apify dataset. Whether you’re doing website research, SEO auditing, or building a bulk URL list from an existing sitemap, it turns a single `root_sitemap_url` into structured output you can use at scale, saving you hours of manual work.

---

### Why choose Sitemap URL Extractor?

| Feature | Benefit |
| --- | --- |
| ✅ **All-in-one sitemap parsing** | Extracts URLs from both direct sitemaps and sitemap indexes (recursively) |
| ✅ **Reliability-first fetching** | Includes residential proxy support for more dependable data collection |
| ✅ **Structured output saving** | Writes extracted records directly to the output dataset as they’re collected |
| ✅ **URL-focused results** | Produces a clean table of URL, lastmod, and changefreq you can export and analyze |
| ✅ **Scales from one sitemap to many** | Handles multiple sub-sitemaps when the root is a sitemap index |
| ✅ **Easy workflow integration** | Output dataset is ready for downstream processing in your pipeline |

---

### Key features

- 🧾 **Sitemap URL extraction (urlset)**: Parses `<urlset>` files and extracts each entry’s `url`, `lastmod`, and `changefreq`
- 🗂️ **Sitemap index support**: Detects `<sitemapindex>` files and fetches each listed sub-sitemap to extract the links it contains
- 🔁 **Recursive sitemap parsing**: Automatically walks through sitemap indexes to gather URLs from included sitemaps
- 🌐 **XML sitemap parsing**: Works with the standard sitemap XML structures (`<urlset>` and `<sitemapindex>`)
- 💾 **Live dataset saving**: Pushes each extracted URL record to the dataset immediately (so you don’t lose progress)
- 🛡️ **Residential proxy support**: Designed to support reliable scraping for public web data
- 📦 **Simple, analyst-friendly output**: Saves fields in a consistent structure for easy export and review
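
The parsing behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Actor’s actual source: `extract_records` and the `fetch` callable are hypothetical names, and the sketch has no proxy handling, error handling, or guard against sitemap indexes that reference each other.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_records(sitemap_url: str, fetch: Callable[[str], str]) -> Iterator[dict]:
    """Yield url/lastmod/changefreq records, following sitemap indexes recursively.

    `fetch` is a hypothetical callable that returns the XML text for a URL.
    """
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == NS + "sitemapindex":
        # A sitemap index lists other sitemaps; recurse into each one.
        for sitemap in root.findall(NS + "sitemap"):
            child_url = sitemap.findtext(NS + "loc")
            if child_url:
                yield from extract_records(child_url.strip(), fetch)
    elif root.tag == NS + "urlset":
        # A urlset lists page entries; changefreq falls back to "weekly".
        for entry in root.findall(NS + "url"):
            yield {
                "url": entry.findtext(NS + "loc"),
                "lastmod": entry.findtext(NS + "lastmod"),
                "changefreq": entry.findtext(NS + "changefreq", default="weekly"),
            }


# In-memory stand-in for HTTP fetching, so the sketch runs offline.
PAGES = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/page-1</loc><lastmod>2024-01-15</lastmod></url>"
        "</urlset>"
    ),
}

records = list(extract_records("https://example.com/sitemap.xml", lambda url: PAGES[url]))
print(records)
```

Passing the fetcher in as a parameter keeps the parsing logic testable without network access.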

---

### Input

Provide input via an `input.json` file. Example structure:

```json
{
  "root_sitemap_url": "https://onescales.com/sitemap.xml"
}
````

#### Input Fields

| Field | Required | Description |
| --- | --- | --- |
| `root_sitemap_url` | Yes | The URL of the sitemap or sitemap index to start with. This is the entry point for the extraction. |

***

### Output

The Actor saves each extracted URL record to the Apify dataset as JSON items.

```json
{
  "url": "https://example.com/page-1",
  "lastmod": "2024-01-15",
  "changefreq": "weekly"
}
```

#### Output Fields

| Field | Type | Description |
| --- | --- | --- |
| `url` | string \| null | The extracted `<loc>` value for each sitemap entry |
| `lastmod` | string \| null | The extracted `<lastmod>` value (if present in the sitemap) |
| `changefreq` | string | The extracted `<changefreq>` value; if missing, it defaults to `"weekly"` |

Note: The output dataset view is configured to display `url` as a link, and `lastmod` and `changefreq` as text.
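
Because `lastmod` can be null, downstream code should handle missing values explicitly. The following is a minimal sketch, assuming items shaped like the example above; the helper names `parse_lastmod` and `modified_since` are hypothetical, and values that do not start with a `YYYY-MM-DD` date are treated as missing.

```python
from datetime import date
from typing import Iterable, List, Optional


def parse_lastmod(value: Optional[str]) -> Optional[date]:
    """Return the calendar date of a `lastmod` value, or None if absent or unparseable."""
    if not value:
        return None
    try:
        # Keep only the leading YYYY-MM-DD part; any time component is ignored.
        return date.fromisoformat(value[:10])
    except ValueError:
        return None


def modified_since(items: Iterable[dict], cutoff: date) -> List[dict]:
    """Keep items whose `lastmod` date is on or after `cutoff`."""
    kept = []
    for item in items:
        modified = parse_lastmod(item.get("lastmod"))
        if modified is not None and modified >= cutoff:
            kept.append(item)
    return kept


sample = [
    {"url": "https://example.com/page-1", "lastmod": "2024-01-15", "changefreq": "weekly"},
    {"url": "https://example.com/page-2", "lastmod": None, "changefreq": "weekly"},
    {"url": "https://example.com/page-3", "lastmod": "2023-06-01T08:00:00+00:00", "changefreq": "daily"},
]
print(modified_since(sample, date(2024, 1, 1)))
```

Note that a `changefreq` of `"weekly"` may be either the sitemap’s own value or the default, so it cannot be used to detect entries that omitted the tag.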

***

### How to use Sitemap URL Extractor (via Apify Console)

1. **Open Apify Console**\
   Log in at https://console.apify.com and go to the **Actors** tab.

2. **Find the Actor**\
   Search for **Sitemap URL Extractor** and open the Actor page.

3. **Go to the INPUT section**\
   Use the built-in form (or switch to editing `input.json` directly) to provide the required input.

4. **Set `root_sitemap_url`**\
   Paste a direct sitemap URL (XML) or a sitemap index URL. The Actor starts from this URL.

5. **Run the Actor**\
   Click **Run**. During the run, you’ll see logs about fetching and whether it detected a direct urlset or a sitemap index.

6. **Monitor progress**\
   As the Actor works through the sitemap (and any sub-sitemaps, if the root is a sitemap index), it pushes extracted records to the dataset.

7. **Open the OUTPUT dataset**\
   After completion, open the dataset named **Sitemap URLs** to view the extracted URL records.

8. **Export your results**\
   Export the dataset to JSON/CSV using Apify’s standard dataset export options (based on what your workflow needs).

No coding required—get URLs from XML sitemap files in minutes with Sitemap URL Extractor. ✅

***

### Advanced features & SEO optimization

- 🔍 **Engineered for sitemap URL extraction**: Built specifically to extract URLs from sitemaps and sitemap indexes in one go
- 🗂️ **Handles sitemap indexes automatically**: Perfect for extracting sitemap links across multiple nested sitemaps
- 📊 **Built for website sitemap parsing**: Produces a consistent structure ideal for SEO audits and crawl preparation
- 💾 **Real-time saving to dataset**: Extracted records are pushed as they’re collected, which is helpful for large sites
- 🛡️ **Residential proxy support for public web data**: Designed to improve reliability when collecting from external hosts

***

### Best use cases

- 📈 **SEO teams auditing a website**: Build a complete URL list from an XML sitemap to verify coverage and indexation expectations
- 🧭 **Content strategists planning site-wide updates**: Quickly get URLs, last modification dates, and change frequency signals for prioritization
- 🔎 **Digital marketers running large-scale URL research**: Create bulk lists for analysis without manually opening sitemap files
- 🧪 **Data analysts preparing datasets**: Transform the extracted output into spreadsheets, BI dashboards, or downstream models
- 🌐 **Web developers building crawling pipelines**: Use the extracted URLs as a seed list before running your own crawler
- 🧑‍💻 **Engineering teams automating reporting**: Incorporate the extracted URL lists into scheduled workflows and exports

***

### Technical specifications

- **Supported Input Formats**
  - ✅ `root_sitemap_url` as a **string** pointing to a sitemap or sitemap index URL

- **Proxy Support**
  - ✅ Residential proxy support is used to improve reliability when fetching public web data

- **Retry Mechanism**
  - ⚠️ Not specified in the available Actor source metadata

- **Dataset Structure**
  - ✅ Outputs JSON records with `url`, `lastmod`, and `changefreq`

- **Rate Limits & Performance**
  - ⚠️ Processing speed and limits are not specified in the available Actor documentation

- **Limitations**
  - ⚠️ If the sitemap cannot be fetched or parsed, results may be incomplete (the Actor logs the error and stops processing in those cases)

***

### FAQ

#### Does Sitemap URL Extractor handle both sitemap indexes and direct sitemaps?

✅ Yes. It detects whether the root is a sitemap index or a direct urlset, then extracts accordingly. For sitemap indexes, it fetches and processes sub-sitemaps to extract sitemap links across the full structure.

#### What does the Actor extract from each sitemap entry?

✅ It extracts the entry’s `url` (from `<loc>`), `lastmod` (from `<lastmod>`, if present), and `changefreq` (from `<changefreq>`). If `changefreq` is missing, it defaults to `"weekly"`.

#### Where do the results go after the run?

✅ The Actor saves extracted items to the Apify dataset configured as **Sitemap URLs**, with fields `url`, `lastmod`, and `changefreq`.

#### Do I need to write any code to use this tool?

✅ No. You can provide input via Apify Console and then export the dataset after the Actor finishes.

#### Is this meant for private websites or authenticated pages?

❌ No. This tool is intended for **publicly available** sitemap XML content. It does not target private, authenticated, or password-protected resources.

#### Can I export the extracted URLs for use in other tools?

✅ Yes. Since the Actor outputs to a dataset, you can export it in standard dataset formats (for example, JSON/CSV) using Apify’s dataset export features.
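
If you would rather build the CSV yourself from items already fetched through the API, a minimal sketch using only the Python standard library could look like this. The function name `items_to_csv` is hypothetical, and the column order is an assumption based on the output fields listed above.

```python
import csv
import io
from typing import Iterable

FIELDS = ["url", "lastmod", "changefreq"]


def items_to_csv(items: Iterable[dict]) -> str:
    """Serialize dataset items to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing or null fields become empty cells.
        writer.writerow({field: item.get(field) or "" for field in FIELDS})
    return buffer.getvalue()


print(items_to_csv([
    {"url": "https://example.com/page-1", "lastmod": None, "changefreq": "weekly"},
]))
```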

#### How do I request removal of dataset data?

If you need data removed from outputs produced by this Actor, contact <dataforleads@gmail.com>.

***

### Support & feature requests

Want to improve your sitemap URL extraction workflow with Sitemap URL Extractor? We’d love your feedback. 💡

- 💡 **Feature Requests**: Tell us what would make this tool fit your pipeline better, for example additional sitemap fields, alternate output formats, or more dataset controls.
- 📧 **Contact**: Reach out via <dataforleads@gmail.com>.

Your input helps shape what we build next for Sitemap URL Extractor.

***

*If you’re looking for a sitemap URL extractor that turns XML sitemaps into usable datasets, Sitemap URL Extractor is built for exactly that.*\
*Run it on a sitemap URL, a sitemap index, or both, and extract URLs from sitemap structures at scale.*

***

### Disclaimer

This tool only accesses **publicly accessible sources** (public sitemap XML). It does not access private profiles, authenticated data, or password-protected pages.

It’s your responsibility to ensure your use complies with applicable laws and regulations (including GDPR and CCPA where relevant), as well as each website’s terms of service and any applicable anti-abuse or rate-limit requirements.

For data removal requests, contact <dataforleads@gmail.com>. Please use Sitemap URL Extractor responsibly, ethically, and for legitimate purposes only.

# Actor input Schema

## `root_sitemap_url` (type: `string`):

The URL of the sitemap or sitemap index to start with.

## Actor input object example

```json
{
  "root_sitemap_url": "https://onescales.com/sitemap.xml"
}
```
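
Since `root_sitemap_url` is the only input, it can be worth checking it locally before starting a paid run. The sketch below is a hypothetical client-side helper, not part of the Actor: it only verifies that the value is an absolute `http` or `https` URL and makes no request.

```python
from urllib.parse import urlparse


def build_input(root_sitemap_url: str) -> dict:
    """Return an Actor input object, rejecting values that are not http(s) URLs."""
    candidate = root_sitemap_url.strip()
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {root_sitemap_url!r}")
    return {"root_sitemap_url": candidate}


print(build_input(" https://onescales.com/sitemap.xml "))
```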

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "root_sitemap_url": "https://onescales.com/sitemap.xml"
};

// Run the Actor and wait for it to finish
const run = await client.actor("solid-scraper/sitemap-url-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "root_sitemap_url": "https://onescales.com/sitemap.xml" }

# Run the Actor and wait for it to finish
run = client.actor("solid-scraper/sitemap-url-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "root_sitemap_url": "https://onescales.com/sitemap.xml"
}' |
apify call solid-scraper/sitemap-url-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=solid-scraper/sitemap-url-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Url Extractor",
        "description": "🔎 Extract URLs from any sitemap fast and accurately. Sitemap Url Extractor helps you discover, audit, and optimize website links for SEO, crawling, and migrations—ideal for webmasters, marketers, and developers. 🚀⚙️",
        "version": "0.1",
        "x-build-id": "QpuNPojDYoELF0jrU"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/solid-scraper~sitemap-url-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-solid-scraper-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/solid-scraper~sitemap-url-extractor/runs": {
            "post": {
                "operationId": "runs-sync-solid-scraper-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/solid-scraper~sitemap-url-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-solid-scraper-sitemap-url-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "root_sitemap_url"
                ],
                "properties": {
                    "root_sitemap_url": {
                        "title": "Root Sitemap URL",
                        "type": "string",
                        "description": "The URL of the sitemap or sitemap index to start with."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
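
To show what the `run-sync-get-dataset-items` operation from the specification looks like as a raw HTTP call, here is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the response is assumed to be a JSON array of dataset items; for Python projects the official client shown earlier remains the recommended route.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/solid-scraper~sitemap-url-extractor/run-sync-get-dataset-items"


def build_request(token: str, root_sitemap_url: str) -> urllib.request.Request:
    """Build the POST request described by the OpenAPI spec, without sending it."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps({"root_sitemap_url": root_sitemap_url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_get_items(token: str, root_sitemap_url: str) -> list:
    """Send the request and decode the JSON response (assumed to be a list of items)."""
    with urllib.request.urlopen(build_request(token, root_sitemap_url)) as response:
        return json.load(response)
```

Calling `run_and_get_items("<YOUR_API_TOKEN>", "https://onescales.com/sitemap.xml")` starts a real, billable run and blocks until it finishes, so `build_request` is kept separate for inspection and testing.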
