# Website Sitemap Extractor
Extract all URLs from any website's XML sitemap. Auto-discovers sitemaps from robots.txt, supports sitemap index files and .gz compressed sitemaps. Filter by URL pattern or date range.
## What does Website Sitemap Extractor do?
Website Sitemap Extractor is a universal tool that pulls every URL listed in a website's XML sitemaps. Unlike tools that require an exact sitemap URL, you only need to provide the website URL: the actor finds sitemaps automatically by checking robots.txt, then falling back to common sitemap paths like `/sitemap.xml`.
The actor handles the full complexity of real-world sitemaps: sitemap index files that reference dozens of child sitemaps, .gz compressed sitemaps, and deeply nested sitemap hierarchies. It follows every reference until all URLs are collected.
Each extracted URL includes its metadata from the sitemap: last modification date, change frequency, priority score, and which specific sitemap file it came from. You can filter results by URL pattern (glob matching) or by date range to get exactly the data you need.
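For illustration, here is a minimal sketch of what parsing a single `<urlset>` file involves. This is not the actor's source code; `parse_urlset` is a hypothetical helper whose field names simply mirror the output schema documented below:

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_urlset(xml_bytes: bytes, sitemap_url: str) -> list[dict]:
    """Extract each <url> entry and its optional metadata from a <urlset> sitemap."""
    def text(parent, tag):
        el = parent.find(f"sm:{tag}", NS)
        # Element.text returns CDATA-wrapped values transparently.
        return el.text.strip() if el is not None and el.text else None

    root = ET.fromstring(xml_bytes)
    return [
        {
            "url": text(url_el, "loc"),
            "lastmod": text(url_el, "lastmod"),        # optional in the spec
            "changefreq": text(url_el, "changefreq"),  # optional
            "priority": text(url_el, "priority"),      # optional
            "sitemapUrl": sitemap_url,
        }
        for url_el in root.findall("sm:url", NS)
    ]
```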
## Use Cases
- SEO auditors -- Quickly inventory all indexed pages on a website, check lastmod dates, and identify orphan pages not in the sitemap
- Content marketers -- Discover all blog posts, landing pages, and content URLs on competitor sites for content gap analysis
- Web crawl planning -- Get a complete URL list before running expensive browser-based scrapers, so you can plan crawl budgets accurately
- Migration teams -- Extract the full URL map of a website before domain migration to ensure every page gets a proper redirect
- AI training data discovery -- Find all publicly listed URLs on a domain to build targeted web datasets for model training
- Developers -- Validate sitemap structure, check for broken sitemap references, and audit sitemap completeness
## Features
- Auto-discovers sitemaps from robots.txt (no need to know the sitemap URL)
- Falls back to /sitemap.xml and other common locations if robots.txt has no sitemap directive
- Follows sitemap index files (nested sitemaps) recursively
- Supports .gz compressed sitemaps
- Extracts full metadata: URL, lastmod, changefreq, priority, source sitemap
- Filter by URL glob pattern (e.g., `*/blog/*`, `*.html`)
- Filter by date range (`lastmod`-based)
- Works on any website with an XML sitemap
- Handles CDATA-wrapped content in sitemaps
- Exports to JSON, CSV, Excel, or connect via API
- No proxies needed for most sites (sitemaps are public)
- Proxy support available for restricted networks
## How much will it cost?
This actor is free to use. You only pay for Apify platform usage (compute and storage).
| URLs Extracted | Estimated Cost |
|---|---|
| 1,000 | ~$0.01 |
| 10,000 | ~$0.05 |
| 100,000 | ~$0.25 |
| 1,000,000 | ~$1.50 |
A rough breakdown of where that cost goes:

| Cost Component | Per 100,000 URLs |
|---|---|
| Platform compute (256 MB) | ~$0.15 |
| Storage | ~$0.05 |
| Proxy (if used) | ~$0.00 (not needed) |
| Total | ~$0.20 |
Sitemaps are lightweight XML files, so extraction is extremely fast and cheap compared to page-by-page scraping.
## How it discovers sitemaps

- robots.txt -- The actor first fetches `{website}/robots.txt` and looks for `Sitemap:` directives. Most well-configured websites list their sitemaps here.
- Common paths -- If robots.txt has no sitemap references, the actor tries standard locations: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.xml.gz`, `/sitemaps.xml`, `/sitemap/sitemap.xml`.
- Sitemap index files -- When a sitemap turns out to be an index (containing references to other sitemaps), the actor follows every child sitemap recursively.
- Compressed sitemaps -- `.gz` compressed sitemaps are automatically decompressed (the full flow is sketched below).
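To make this concrete, here is a condensed sketch of the discovery flow in Python. All helper names are hypothetical, and the real actor is certainly more robust (retries, size limits, parallel fetching); this only illustrates the four steps above:

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz",
                "/sitemaps.xml", "/sitemap/sitemap.xml"]

def fetch(url: str) -> bytes | None:
    try:
        data = urlopen(url, timeout=30).read()
    except OSError:
        return None
    # Transparently decompress .gz sitemaps (gzip magic bytes 0x1f 0x8b).
    return gzip.decompress(data) if data[:2] == b"\x1f\x8b" else data

def discover_sitemaps(website: str) -> list[str]:
    """Return sitemap URLs from robots.txt, falling back to common paths."""
    robots = fetch(urljoin(website, "/robots.txt"))
    if robots:
        found = [line.split(":", 1)[1].strip()
                 for line in robots.decode(errors="replace").splitlines()
                 if line.lower().startswith("sitemap:")]
        if found:
            return found
    return [urljoin(website, p) for p in COMMON_PATHS if fetch(urljoin(website, p))]

def collect_urls(sitemap_url: str) -> list[str]:
    """Recurse through sitemap index files until only <urlset> URLs remain."""
    data = fetch(sitemap_url)
    if not data:
        return []
    root = ET.fromstring(data)
    locs = [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]
    if root.tag.endswith("sitemapindex"):  # index file: follow every child sitemap
        return [u for child in locs for u in collect_urls(child)]
    return locs
```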
## How to use

- Go to the Website Sitemap Extractor page on Apify Store
- Click "Start" or "Try for free"
- Enter website URLs (e.g., `https://www.apple.com`) or direct sitemap URLs
- Optionally set a URL filter pattern and date range
- Set the maximum number of URLs to extract
- Click "Start" and wait for the results
## Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| startUrls | array | Website URLs to discover sitemaps from | - |
| sitemapUrls | array | Direct sitemap XML URLs | - |
| urlFilterPattern | string | Glob pattern to filter URLs (e.g., `*/blog/*`) | - |
| dateFrom | string | Only include URLs modified after this date | - |
| dateTo | string | Only include URLs modified before this date | - |
| maxItems | number | Maximum URLs to extract (0 = unlimited) | 10000 |
| proxyConfig | object | Proxy settings (not needed for most sites) | - |
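For example, to pull up to 5,000 blog URLs modified during 2025, the input could combine these parameters as follows (shown as the `run_input` dict for the Python client used in the API examples below; the plain `YYYY-MM-DD` date format is an assumption -- check the actor's input schema for the exact accepted format):

```python
run_input = {
    "startUrls": [{"url": "https://www.apple.com"}],
    "urlFilterPattern": "*/blog/*",  # glob match against each sitemap URL
    "dateFrom": "2025-01-01",        # keep URLs whose lastmod is on/after this date
    "dateTo": "2025-12-31",          # ...and on/before this date
    "maxItems": 5000,                # stop after 5,000 URLs (0 = unlimited)
}
```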
## Output
The actor produces a dataset with the following fields:
{"url": "https://www.apple.com/shop/buy-iphone","loc": "https://www.apple.com/shop/buy-iphone","lastmod": "2025-12-01T00:00:00.000Z","changefreq": "daily","priority": "0.8","sitemapUrl": "https://www.apple.com/sitemap.xml","scrapedAt": "2026-04-23T10:30:00.000Z"}
| Field | Type | Description |
|---|---|---|
| url | string | The URL found in the sitemap |
| loc | string | Raw loc value from the sitemap XML |
| lastmod | string/null | Last modification date (if provided in sitemap) |
| changefreq | string/null | Change frequency: always, hourly, daily, weekly, monthly, yearly, never |
| priority | string/null | Priority value from 0.0 to 1.0 |
| sitemapUrl | string | The sitemap file this URL was extracted from |
| scrapedAt | string | ISO 8601 timestamp of extraction |
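Once a run finishes, the dataset can be post-processed like any other Apify dataset. A small sketch using the Python client from the examples below (`YOUR_DATASET_ID` is a placeholder) that counts URLs per source sitemap and guards against the optional fields being null:

```python
from collections import Counter

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
items = client.dataset("YOUR_DATASET_ID").list_items().items

# Count how many URLs each child sitemap contributed.
per_sitemap = Counter(item["sitemapUrl"] for item in items)
for sitemap, count in per_sitemap.most_common():
    print(f"{sitemap}: {count} URLs")

# lastmod/changefreq/priority are optional in the sitemap spec, so check for None.
dated = [item for item in items if item.get("lastmod")]
print(f"{len(dated)} of {len(items)} URLs have a lastmod date")
```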
## Integrations
Connect Website Sitemap Extractor with other tools:
- Apify API -- REST API for programmatic access
- Webhooks -- get notified when a run finishes
- Zapier / Make -- connect to 5,000+ apps
- Google Sheets -- export directly to spreadsheets
## API Example (Node.js)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_USERNAME/website-sitemap-extractor').call({
    startUrls: [{ url: 'https://www.apple.com' }],
    maxItems: 1000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} URLs`);
```
## API Example (Python)
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('YOUR_USERNAME/website-sitemap-extractor').call(run_input={
    'startUrls': [{'url': 'https://www.apple.com'}],
    'maxItems': 1000,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(items)} URLs')
```
## API Example (cURL)
curl "https://api.apify.com/v2/acts/YOUR_USERNAME~website-sitemap-extractor/runs" \-X POST \-H "Content-Type: application/json" \-H "Authorization: Bearer YOUR_TOKEN" \-d '{"startUrls": [{"url": "https://www.apple.com"}], "maxItems": 1000}'
## Tips and tricks

- Start with a small `maxItems` (100) to test before running large extractions
- Use `urlFilterPattern` to extract only specific sections, e.g., `*/blog/*` for blog posts only
- If you already know the sitemap URL, use `sitemapUrls` for faster processing (skips robots.txt discovery)
- Date filters work on the `lastmod` field -- URLs without `lastmod` are always included (see the filtering sketch below)
- For very large sites (1M+ URLs), increase the actor memory to 1024 MB or higher
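If you want to re-slice a dataset you have already extracted, the same filters are easy to reproduce client-side. A sketch, under the assumption that the actor's glob matching behaves like Python's `fnmatch` and that the date bounds are inclusive:

```python
from datetime import datetime, timezone
from fnmatch import fnmatch

def keep(item: dict, pattern: str | None = None,
         date_from: datetime | None = None,
         date_to: datetime | None = None) -> bool:
    """Assumed filter semantics: glob match on the URL, date range on lastmod.
    URLs without lastmod always pass the date filter, as the tips above note."""
    if pattern and not fnmatch(item["url"], pattern):
        return False
    if item.get("lastmod") and (date_from or date_to):
        lastmod = datetime.fromisoformat(item["lastmod"].replace("Z", "+00:00"))
        if date_from and lastmod < date_from:
            return False
        if date_to and lastmod > date_to:
            return False
    return True

# Example: keep(item, "*/blog/*", date_from=datetime(2025, 1, 1, tzinfo=timezone.utc))
```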
## FAQ
Q: Does this actor need login credentials?
A: No. XML sitemaps are publicly accessible files. No authentication is required.

Q: How fast is the extraction?
A: Very fast. The actor processes XML files, not rendered web pages. Expect 10,000-50,000 URLs per minute, depending on sitemap size and server response times.

Q: What if a website has no sitemap?
A: The actor checks robots.txt and common sitemap paths. If no sitemap is found, the dataset stays empty and a warning is logged. Not all websites have sitemaps.

Q: Does it work with non-standard sitemap formats?
A: The actor handles standard XML sitemaps (`urlset`), sitemap index files (`sitemapindex`), `.gz` compressed sitemaps, and CDATA-wrapped content. Non-XML formats (e.g., plain text sitemap lists) are not supported.
Q: Why are some fields null?
A: The sitemap standard only requires the `<loc>` (URL) element. Fields like `lastmod`, `changefreq`, and `priority` are optional, and many websites don't include them.
Q: Do I need proxies?
A: Usually no. Sitemaps are meant to be publicly accessible. Only use proxies if the website blocks direct access.
## Is it legal to scrape sitemaps?
XML sitemaps are explicitly published by website owners for search engines and web crawlers to consume. They are public files designed to be read by automated tools. Web scraping of publicly available data is generally legal based on precedents like the hiQ Labs v. LinkedIn case. Always review and respect the target site's Terms of Service. For more information, see Apify's blog on web scraping legality.
## Limitations

- Only processes standard XML sitemaps (not plain text URL lists or HTML sitemaps)
- URLs without `lastmod` cannot be filtered by date range (they are included by default)
- Very large sitemaps (10M+ URLs) may require increased memory allocation
- Some websites may rate-limit or block rapid sitemap requests (enable proxies in this case)
## Changelog
- v0.1 (2026-04-23) -- Initial release with auto-discovery, sitemap index support, .gz compression, URL and date filtering