Website Sitemap Extractor

Pricing: Pay per usage
Developer: Glass Ventures (Maintained by Community)

Extract all URLs from any website's XML sitemap. Auto-discovers sitemaps from robots.txt, supports sitemap index files and .gz compressed sitemaps. Filter by URL pattern or date range.

What does Website Sitemap Extractor do?

Website Sitemap Extractor is a universal tool that pulls every URL listed in a website's XML sitemaps. You don't need to know the sitemap location in advance -- just provide the website URL and the actor finds sitemaps automatically by checking robots.txt, then falling back to common sitemap paths like /sitemap.xml.

The actor handles the full complexity of real-world sitemaps: sitemap index files that reference dozens of child sitemaps, .gz compressed sitemaps, and deeply nested sitemap hierarchies. It follows every reference until all URLs are collected.

Each extracted URL includes its metadata from the sitemap: last modification date, change frequency, priority score, and which specific sitemap file it came from. You can filter results by URL pattern (glob matching) or by date range to get exactly the data you need.

Use Cases

  • SEO auditors -- Quickly inventory all sitemap-listed pages on a website, check lastmod dates, and compare against crawl data to spot pages missing from the sitemap
  • Content marketers -- Discover all blog posts, landing pages, and content URLs on competitor sites for content gap analysis
  • Web crawl planning -- Get a complete URL list before running expensive browser-based scrapers, so you can plan crawl budgets accurately
  • Migration teams -- Extract the full URL map of a website before domain migration to ensure every page gets a proper redirect
  • AI training data discovery -- Find all publicly listed URLs on a domain to build targeted web datasets for model training
  • Developers -- Validate sitemap structure, check for broken sitemap references, and audit sitemap completeness

Features

  • Auto-discovers sitemaps from robots.txt (no need to know the sitemap URL)
  • Falls back to /sitemap.xml and other common locations if robots.txt has no sitemap directive
  • Follows sitemap index files (nested sitemaps) recursively
  • Supports .gz compressed sitemaps
  • Extracts full metadata: URL, lastmod, changefreq, priority, source sitemap
  • Filter by URL glob pattern (e.g., */blog/*, *.html)
  • Filter by date range (lastmod-based)
  • Works on any website with an XML sitemap
  • Handles CDATA-wrapped content in sitemaps
  • Exports to JSON, CSV, Excel, or connect via API
  • No proxies needed for most sites (sitemaps are public)
  • Proxy support available for restricted networks

How much will it cost?

This actor is free to use. You only pay for Apify platform usage (compute and storage).

URLs Extracted   Estimated Cost
1,000            ~$0.01
10,000           ~$0.05
100,000          ~$0.25
1,000,000        ~$1.50

Cost Component              Per 100,000 URLs
Platform compute (256 MB)   ~$0.15
Storage                     ~$0.05
Proxy (if used)             ~$0.00 (not needed)
Total                       ~$0.20

Sitemaps are lightweight XML files, so extraction is extremely fast and cheap compared to page-by-page scraping.

How it discovers sitemaps

  1. robots.txt -- The actor first fetches {website}/robots.txt and looks for Sitemap: directives. Most well-configured websites list their sitemaps here.
  2. Common paths -- If robots.txt has no sitemap references, the actor tries standard locations: /sitemap.xml, /sitemap_index.xml, /sitemap.xml.gz, /sitemaps.xml, /sitemap/sitemap.xml.
  3. Sitemap index files -- When a sitemap turns out to be an index (containing references to other sitemaps), the actor follows every child sitemap recursively.
  4. Compressed sitemaps -- .gz compressed sitemaps are automatically decompressed.
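Steps 1 and 2 of the discovery flow can be sketched in a few lines of Python. The fallback paths below are the ones listed above; the regex-based robots.txt parsing is a simplified stand-in for whatever the actor actually does.

```python
import re

# Fallback locations tried when robots.txt has no Sitemap: directive
# (list taken from the discovery steps above).
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz",
                "/sitemaps.xml", "/sitemap/sitemap.xml"]

def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect all 'Sitemap:' directives from a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        m = re.match(r"(?i)^\s*sitemap:\s*(\S+)", line)
        if m:
            found.append(m.group(1))
    return found

def candidate_sitemaps(base: str, robots_txt: str) -> list[str]:
    """Step 1: robots.txt directives; step 2: common fallback paths."""
    from_robots = sitemap_urls_from_robots(robots_txt)
    if from_robots:
        return from_robots
    return [base.rstrip("/") + p for p in COMMON_PATHS]
```

A sitemap returned by either step may itself be an index file, in which case its child sitemaps are fetched and parsed the same way until only URL-level sitemaps remain.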

How to use

  1. Go to the Website Sitemap Extractor page on Apify Store
  2. Click "Start" or "Try for free"
  3. Enter website URLs (e.g., https://www.apple.com) or direct sitemap URLs
  4. Optionally set URL filter pattern and date range
  5. Set the maximum number of URLs to extract
  6. Click "Start" and wait for the results

Input parameters

Parameter          Type     Description                                    Default
startUrls          array    Website URLs to discover sitemaps from         -
sitemapUrls        array    Direct sitemap XML URLs                        -
urlFilterPattern   string   Glob pattern to filter URLs (e.g., */blog/*)   -
dateFrom           string   Only include URLs modified after this date     -
dateTo             string   Only include URLs modified before this date    -
maxItems           number   Maximum URLs to extract (0 = unlimited)        10000
proxyConfig        object   Proxy settings (not needed for most sites)     -

Output

The actor produces a dataset with the following fields:

{
  "url": "https://www.apple.com/shop/buy-iphone",
  "loc": "https://www.apple.com/shop/buy-iphone",
  "lastmod": "2025-12-01T00:00:00.000Z",
  "changefreq": "daily",
  "priority": "0.8",
  "sitemapUrl": "https://www.apple.com/sitemap.xml",
  "scrapedAt": "2026-04-23T10:30:00.000Z"
}
Field        Type          Description
url          string        The URL found in the sitemap
loc          string        Raw loc value from the sitemap XML
lastmod      string/null   Last modification date (if provided in sitemap)
changefreq   string/null   Change frequency: always, hourly, daily, weekly, monthly, yearly, never
priority     string/null   Priority value from 0.0 to 1.0
sitemapUrl   string        The sitemap file this URL was extracted from
scrapedAt    string        ISO 8601 timestamp of extraction

Integrations

Connect Website Sitemap Extractor with other tools:

  • Apify API -- REST API for programmatic access
  • Webhooks -- get notified when a run finishes
  • Zapier / Make -- connect to 5,000+ apps
  • Google Sheets -- export directly to spreadsheets

API Example (Node.js)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('YOUR_USERNAME/website-sitemap-extractor').call({
  startUrls: [{ url: 'https://www.apple.com' }],
  maxItems: 1000,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} URLs`);

API Example (Python)

from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')
run = client.actor('YOUR_USERNAME/website-sitemap-extractor').call(run_input={
    'startUrls': [{'url': 'https://www.apple.com'}],
    'maxItems': 1000,
})
items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(items)} URLs')

API Example (cURL)

curl "https://api.apify.com/v2/acts/YOUR_USERNAME~website-sitemap-extractor/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"startUrls": [{"url": "https://www.apple.com"}], "maxItems": 1000}'

Tips and tricks

  • Start with a small maxItems (100) to test before running large extractions
  • Use urlFilterPattern to extract only specific sections, e.g., */blog/* for blog posts only
  • If you already know the sitemap URL, use sitemapUrls for faster processing (skips robots.txt discovery)
  • Date filters work on the lastmod field -- URLs without lastmod are always included
  • For very large sites (1M+ URLs), increase the actor memory to 1024 MB or higher

FAQ

Q: Does this actor need login credentials? A: No. XML sitemaps are publicly accessible files. No authentication is required.

Q: How fast is the extraction? A: Very fast. The actor processes XML files, not rendered web pages. Expect 10,000-50,000 URLs per minute depending on sitemap size and server response times.

Q: What if a website has no sitemap? A: The actor will check robots.txt and common sitemap paths. If no sitemap is found, the dataset will be empty and a warning is logged. Not all websites have sitemaps.

Q: Does it work with non-standard sitemap formats? A: The actor handles standard XML sitemaps (urlset), sitemap index files (sitemapindex), .gz compressed sitemaps, and CDATA-wrapped content. Non-XML formats (e.g., plain text sitemap lists) are not supported.
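The two supported formats differ only in their root element, so distinguishing them is straightforward. This is a minimal sketch of standard sitemap parsing with the stdlib (xml.etree handles CDATA-wrapped text transparently); it is not the actor's code.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace required by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(raw: bytes):
    """Return ('index', child sitemap refs) for a sitemapindex root,
    or ('urlset', url entries) for a regular sitemap."""
    if raw[:2] == b"\x1f\x8b":          # gzip magic bytes -> decompress first
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    if root.tag == f"{NS}sitemapindex":
        refs = [{"loc": e.findtext(f"{NS}loc")}
                for e in root.findall(f"{NS}sitemap")]
        return "index", refs
    entries = []
    for e in root.findall(f"{NS}url"):
        entries.append({
            "loc": e.findtext(f"{NS}loc"),           # required by the spec
            "lastmod": e.findtext(f"{NS}lastmod"),   # optional -> may be None
            "changefreq": e.findtext(f"{NS}changefreq"),
            "priority": e.findtext(f"{NS}priority"),
        })
    return "urlset", entries
```

Because optional elements come back as None from findtext, this also illustrates why the lastmod, changefreq, and priority output fields can be null.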

Q: Why are some fields null? A: The sitemap standard only requires the <loc> (URL) element. Fields like lastmod, changefreq, and priority are optional and many websites don't include them.

Q: Do I need proxies? A: Usually no. Sitemaps are meant to be publicly accessible. Only use proxies if the website blocks direct access.

XML sitemaps are explicitly published by website owners for search engines and web crawlers to consume. They are public files designed to be read by automated tools. Scraping publicly available data is generally considered legal in the US, based on precedents such as hiQ Labs v. LinkedIn, but always review and respect the target site's Terms of Service. For more information, see Apify's blog on web scraping legality.

Limitations

  • Only processes standard XML sitemaps (not plain text URL lists or HTML sitemaps)
  • URLs without lastmod cannot be filtered by date range (they are included by default)
  • Very large sitemaps (10M+ URLs) may require increased memory allocation
  • Some websites may rate-limit or block rapid sitemap requests (enable proxies in this case)

Changelog

  • v0.1 (2026-04-23) -- Initial release with auto-discovery, sitemap index support, .gz compression, URL and date filtering