# Website Sitemap Extractor
Extract all URLs from any website's XML sitemap. Auto-discovers sitemaps from robots.txt, supports sitemap index files and .gz compressed sitemaps. Filter by URL pattern or date range.
## What does Website Sitemap Extractor do?
Website Sitemap Extractor is a universal tool that pulls every URL listed in a website's XML sitemaps. Unlike tools that require an exact sitemap URL, you only need to provide the website URL: the actor finds sitemaps automatically by checking robots.txt, then falling back to common sitemap paths like `/sitemap.xml`.
The actor handles the full complexity of real-world sitemaps: sitemap index files that reference dozens of child sitemaps, .gz compressed sitemaps, and deeply nested sitemap hierarchies. It follows every reference until all URLs are collected.
Each extracted URL includes its metadata from the sitemap: last modification date, change frequency, priority score, and which specific sitemap file it came from. You can filter results by URL pattern (glob matching) or by date range to get exactly the data you need.
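For illustration, here is a minimal sketch of what parsing a single `<urlset>` file involves. This is not the actor's source code; `parse_urlset` is a hypothetical helper whose field names simply mirror the output schema documented below:

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_urlset(xml_bytes: bytes, sitemap_url: str) -> list[dict]:
    """Extract each <url> entry and its optional metadata from a <urlset> sitemap."""
    def text(parent, tag):
        el = parent.find(f"sm:{tag}", NS)
        # Element.text returns CDATA-wrapped values transparently.
        return el.text.strip() if el is not None and el.text else None

    root = ET.fromstring(xml_bytes)
    return [
        {
            "url": text(url_el, "loc"),
            "lastmod": text(url_el, "lastmod"),        # optional in the spec
            "changefreq": text(url_el, "changefreq"),  # optional
            "priority": text(url_el, "priority"),      # optional
            "sitemapUrl": sitemap_url,
        }
        for url_el in root.findall("sm:url", NS)
    ]
```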
## Use Cases
- SEO auditors -- Quickly inventory all indexed pages on a website, check lastmod dates, and identify orphan pages not in the sitemap
- Content marketers -- Discover all blog posts, landing pages, and content URLs on competitor sites for content gap analysis
- Web crawl planning -- Get a complete URL list before running expensive browser-based scrapers, so you can plan crawl budgets accurately
- Migration teams -- Extract the full URL map of a website before domain migration to ensure every page gets a proper redirect
- AI training data discovery -- Find all publicly listed URLs on a domain to build targeted web datasets for model training
- Developers -- Validate sitemap structure, check for broken sitemap references, and audit sitemap completeness
## Features
- Auto-discovers sitemaps from robots.txt (no need to know the sitemap URL)
- Falls back to /sitemap.xml and other common locations if robots.txt has no sitemap directive
- Follows sitemap index files (nested sitemaps) recursively
- Supports .gz compressed sitemaps
- Extracts full metadata: URL, lastmod, changefreq, priority, source sitemap
- Filter by URL glob pattern (e.g., `*/blog/*`, `*.html`)
- Filter by date range (`lastmod`-based)
- Works on any website with an XML sitemap
- Handles CDATA-wrapped content in sitemaps
- Exports to JSON, CSV, Excel, or connect via API
- No proxies needed for most sites (sitemaps are public)
- Proxy support available for restricted networks
## How much will it cost?
This actor is free to use. You only pay for Apify platform usage (compute and storage).
| URLs Extracted | Estimated Cost |
|---|---|
| 1,000 | ~$0.01 |
| 10,000 | ~$0.05 |
| 100,000 | ~$0.25 |
| 1,000,000 | ~$1.50 |
A rough breakdown of where that cost goes:

| Cost Component | Per 100,000 URLs |
|---|---|
| Platform compute (256 MB) | ~$0.15 |
| Storage | ~$0.05 |
| Proxy (if used) | ~$0.00 (not needed) |
| Total | ~$0.20 |
Sitemaps are lightweight XML files, so extraction is extremely fast and cheap compared to page-by-page scraping.
## How it discovers sitemaps

- robots.txt -- The actor first fetches `{website}/robots.txt` and looks for `Sitemap:` directives. Most well-configured websites list their sitemaps here.
- Common paths -- If robots.txt has no sitemap references, the actor tries standard locations: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.xml.gz`, `/sitemaps.xml`, `/sitemap/sitemap.xml`.
- Sitemap index files -- When a sitemap turns out to be an index (containing references to other sitemaps), the actor follows every child sitemap recursively.
- Compressed sitemaps -- `.gz` compressed sitemaps are automatically decompressed (the full flow is sketched below).
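To make this concrete, here is a condensed sketch of the discovery flow in Python. All helper names are hypothetical, and the real actor is certainly more robust (retries, size limits, parallel fetching); this only illustrates the four steps above:

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz",
                "/sitemaps.xml", "/sitemap/sitemap.xml"]

def fetch(url: str) -> bytes | None:
    try:
        data = urlopen(url, timeout=30).read()
    except OSError:
        return None
    # Transparently decompress .gz sitemaps (gzip magic bytes 0x1f 0x8b).
    return gzip.decompress(data) if data[:2] == b"\x1f\x8b" else data

def discover_sitemaps(website: str) -> list[str]:
    """Return sitemap URLs from robots.txt, falling back to common paths."""
    robots = fetch(urljoin(website, "/robots.txt"))
    if robots:
        found = [line.split(":", 1)[1].strip()
                 for line in robots.decode(errors="replace").splitlines()
                 if line.lower().startswith("sitemap:")]
        if found:
            return found
    return [urljoin(website, p) for p in COMMON_PATHS if fetch(urljoin(website, p))]

def collect_urls(sitemap_url: str) -> list[str]:
    """Recurse through sitemap index files until only <urlset> URLs remain."""
    data = fetch(sitemap_url)
    if not data:
        return []
    root = ET.fromstring(data)
    locs = [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]
    if root.tag.endswith("sitemapindex"):  # index file: follow every child sitemap
        return [u for child in locs for u in collect_urls(child)]
    return locs
```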
## How to use

- Go to the Website Sitemap Extractor page on Apify Store
- Click "Start" or "Try for free"
- Enter website URLs (e.g., `https://www.apple.com`) or direct sitemap URLs
- Optionally set a URL filter pattern and date range
- Set the maximum number of URLs to extract
- Click "Start" and wait for the results
## Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| startUrls | array | Website URLs to discover sitemaps from | - |
| sitemapUrls | array | Direct sitemap XML URLs | - |
| urlFilterPattern | string | Glob pattern to filter URLs (e.g., `*/blog/*`) | - |
| dateFrom | string | Only include URLs modified after this date | - |
| dateTo | string | Only include URLs modified before this date | - |
| maxItems | number | Maximum URLs to extract (0 = unlimited) | 10000 |
| proxyConfig | object | Proxy settings (not needed for most sites) | - |
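For example, to pull up to 5,000 blog URLs modified during 2025, the input could combine these parameters as follows (shown as the `run_input` dict for the Python client used in the API examples below; the plain `YYYY-MM-DD` date format is an assumption -- check the actor's input schema for the exact accepted format):

```python
run_input = {
    "startUrls": [{"url": "https://www.apple.com"}],
    "urlFilterPattern": "*/blog/*",  # glob match against each sitemap URL
    "dateFrom": "2025-01-01",        # keep URLs whose lastmod is on/after this date
    "dateTo": "2025-12-31",          # ...and on/before this date
    "maxItems": 5000,                # stop after 5,000 URLs (0 = unlimited)
}
```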
## Output
The actor produces a dataset with the following fields:
{"url": "https://www.apple.com/shop/buy-iphone","loc": "https://www.apple.com/shop/buy-iphone","lastmod": "2025-12-01T00:00:00.000Z","changefreq": "daily","priority": "0.8","sitemapUrl": "https://www.apple.com/sitemap.xml","scrapedAt": "2026-04-23T10:30:00.000Z"}
| Field | Type | Description |
|---|---|---|
| url | string | The URL found in the sitemap |
| loc | string | Raw loc value from the sitemap XML |
| lastmod | string/null | Last modification date (if provided in sitemap) |
| changefreq | string/null | Change frequency: always, hourly, daily, weekly, monthly, yearly, never |
| priority | string/null | Priority value from 0.0 to 1.0 |
| sitemapUrl | string | The sitemap file this URL was extracted from |
| scrapedAt | string | ISO 8601 timestamp of extraction |
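Once a run finishes, the dataset can be post-processed like any other Apify dataset. A small sketch using the Python client from the examples below (`YOUR_DATASET_ID` is a placeholder) that counts URLs per source sitemap and guards against the optional fields being null:

```python
from collections import Counter

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
items = client.dataset("YOUR_DATASET_ID").list_items().items

# Count how many URLs each child sitemap contributed.
per_sitemap = Counter(item["sitemapUrl"] for item in items)
for sitemap, count in per_sitemap.most_common():
    print(f"{sitemap}: {count} URLs")

# lastmod/changefreq/priority are optional in the sitemap spec, so check for None.
dated = [item for item in items if item.get("lastmod")]
print(f"{len(dated)} of {len(items)} URLs have a lastmod date")
```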
## Integrations
Connect Website Sitemap Extractor with other tools:
- Apify API -- REST API for programmatic access
- Webhooks -- get notified when a run finishes
- Zapier / Make -- connect to 5,000+ apps
- Google Sheets -- export directly to spreadsheets
## API Example (Node.js)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_USERNAME/website-sitemap-extractor').call({
    startUrls: [{ url: 'https://www.apple.com' }],
    maxItems: 1000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} URLs`);
```
## API Example (Python)
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('YOUR_USERNAME/website-sitemap-extractor').call(run_input={
    'startUrls': [{'url': 'https://www.apple.com'}],
    'maxItems': 1000,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(items)} URLs')
```
## API Example (cURL)
curl "https://api.apify.com/v2/acts/YOUR_USERNAME~website-sitemap-extractor/runs" \-X POST \-H "Content-Type: application/json" \-H "Authorization: Bearer YOUR_TOKEN" \-d '{"startUrls": [{"url": "https://www.apple.com"}], "maxItems": 1000}'
## Tips and tricks

- Start with a small `maxItems` (100) to test before running large extractions
- Use `urlFilterPattern` to extract only specific sections, e.g., `*/blog/*` for blog posts only
- If you already know the sitemap URL, use `sitemapUrls` for faster processing (skips robots.txt discovery)
- Date filters work on the `lastmod` field -- URLs without `lastmod` are always included (see the filtering sketch below)
- For very large sites (1M+ URLs), increase the actor memory to 1024 MB or higher
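If you want to re-slice a dataset you have already extracted, the same filters are easy to reproduce client-side. A sketch, under the assumption that the actor's glob matching behaves like Python's `fnmatch` and that the date bounds are inclusive:

```python
from datetime import datetime, timezone
from fnmatch import fnmatch

def keep(item: dict, pattern: str | None = None,
         date_from: datetime | None = None,
         date_to: datetime | None = None) -> bool:
    """Assumed filter semantics: glob match on the URL, date range on lastmod.
    URLs without lastmod always pass the date filter, as the tips above note."""
    if pattern and not fnmatch(item["url"], pattern):
        return False
    if item.get("lastmod") and (date_from or date_to):
        lastmod = datetime.fromisoformat(item["lastmod"].replace("Z", "+00:00"))
        if date_from and lastmod < date_from:
            return False
        if date_to and lastmod > date_to:
            return False
    return True

# Example: keep(item, "*/blog/*", date_from=datetime(2025, 1, 1, tzinfo=timezone.utc))
```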
## FAQ
Q: Does this actor need login credentials?
A: No. XML sitemaps are publicly accessible files. No authentication is required.

Q: How fast is the extraction?
A: Very fast. The actor processes XML files, not rendered web pages. Expect 10,000-50,000 URLs per minute, depending on sitemap size and server response times.

Q: What if a website has no sitemap?
A: The actor checks robots.txt and common sitemap paths. If no sitemap is found, the dataset stays empty and a warning is logged. Not all websites have sitemaps.

Q: Does it work with non-standard sitemap formats?
A: The actor handles standard XML sitemaps (`urlset`), sitemap index files (`sitemapindex`), `.gz` compressed sitemaps, and CDATA-wrapped content. Non-XML formats (e.g., plain text sitemap lists) are not supported.
Q: Why are some fields null?
A: The sitemap standard only requires the `<loc>` (URL) element. Fields like `lastmod`, `changefreq`, and `priority` are optional, and many websites don't include them.
Q: Do I need proxies?
A: Usually no. Sitemaps are meant to be publicly accessible. Only use proxies if the website blocks direct access.
## Is it legal to scrape sitemaps?
XML sitemaps are explicitly published by website owners for search engines and web crawlers to consume. They are public files designed to be read by automated tools. Web scraping of publicly available data is generally legal based on precedents like the hiQ Labs v. LinkedIn case. Always review and respect the target site's Terms of Service. For more information, see Apify's blog on web scraping legality.
## Limitations

- Only processes standard XML sitemaps (not plain text URL lists or HTML sitemaps)
- URLs without `lastmod` cannot be filtered by date range (they are included by default)
- Very large sitemaps (10M+ URLs) may require increased memory allocation
- Some websites may rate-limit or block rapid sitemap requests (enable proxies in this case)
## Changelog
- v0.1 (2026-04-23) -- Initial release with auto-discovery, sitemap index support, .gz compression, URL and date filtering