Sitemap URL Extractor
Extract every URL from any website's sitemap.xml with lastmod, changefreq, priority. Recursively expands sitemap index files, reads robots.txt, handles gzipped sitemaps. SEO audits, content migration, site inventory, competitor research.
Developer: Mohieldin Mohamed
Extract every URL from any website in seconds — with lastmod, changefreq, and priority metadata intact.
This actor walks a site's robots.txt, discovers every declared sitemap, recursively expands sitemap index files, and dumps every single URL it finds into a structured Apify dataset you can download as JSON, CSV, or Excel.
What does Sitemap URL Extractor do?
Give it one website URL. It returns every URL that site publishes in its sitemap.xml — including URLs buried inside multi-level sitemap index files, gzipped sitemaps, and sitemaps referenced from robots.txt. Perfect for SEO audits, content migrations, site inventory, and competitor research. No API keys. No browser. No proxy required for most sites.
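The actor's source isn't published here, so as a minimal sketch of the first step it describes — reading robots.txt to discover declared sitemaps — the `Sitemap:` directives can be collected with a case-insensitive line scan (the function name is illustrative, not the actor's actual code):

```python
import re

def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the Sitemap: directives declared in a robots.txt body.

    The directive is case-insensitive and may appear anywhere in the file,
    outside any User-agent group.
    """
    return re.findall(r"(?im)^sitemap:\s*(\S+)", robots_txt)

robots = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml.gz
"""
print(sitemaps_from_robots(robots))
```

Each discovered sitemap URL then gets fetched and parsed; index files are expanded recursively as described below.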
Try it: paste https://apify.com into the Start URLs field, press Start, and watch the dataset fill up with every indexable URL on the site. A typical mid-sized company site (5,000–50,000 URLs) finishes in under a minute.
Apify platform advantages include scheduled runs (daily sitemap snapshots), API access, webhook integrations, proxy rotation when needed, and run history.
Why use Sitemap URL Extractor?
- SEO audits — see every URL Google is supposed to index and compare against your canonical list
- Content migration — pull your entire old site's URL list before moving to a new CMS
- Competitor intelligence — see every public page a competitor publishes, including product catalogs and blog archives
- Link checking — feed the output into a link checker to find every broken link on a site
- Snapshots over time — schedule daily runs and diff URL lists to detect content changes
- Dataset for LLM training — get a clean list of URLs to feed into a content extractor
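The snapshot-diffing workflow from the last two bullets reduces to a set comparison once you have two URL lists — for example, yesterday's dataset against today's (a sketch; the variable names are illustrative):

```python
def diff_snapshots(yesterday: set[str], today: set[str]) -> dict[str, set[str]]:
    """Compare two URL snapshots and report what appeared and disappeared."""
    return {"added": today - yesterday, "removed": yesterday - today}

old = {"https://example.com/a", "https://example.com/b"}
new = {"https://example.com/b", "https://example.com/c"}
print(diff_snapshots(old, new))
```

New entries in `added` often signal fresh product or blog pages; entries in `removed` signal archived or deleted content.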
How to use Sitemap URL Extractor
- Click Try for free (or Start if you're already logged in)
- In the Start URLs field, paste one or more website root URLs (e.g. https://example.com)
- Optionally set Max URLs per site to cap output size
- Click Start
- Watch the dataset populate in real time in the Output tab
- Download as JSON, CSV, or Excel, or hit the API endpoint directly
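For the API route, Apify exposes run datasets at a public dataset-items endpoint. A minimal sketch, assuming you have the dataset ID from the run details and an Apify API token (verify the exact parameters against Apify's API docs):

```python
import json
import urllib.request

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    # Apify API v2 dataset-items endpoint; fmt can be json, csv, xlsx, or html.
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

def fetch_items(dataset_id: str, token: str) -> list[dict]:
    """Download all dataset items as parsed JSON."""
    req = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Scheduled runs plus this fetch give you the daily snapshots mentioned above without touching the UI.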
Input
- Start URLs — one or more website root URLs to crawl (e.g. https://apify.com)
- Max URLs per site — safety cap (default 10,000; use 0 for unlimited)
- Include metadata — attach `lastmod`, `changefreq`, and `priority` to each URL (default: yes)
- Follow sitemap index — recursively expand nested `<sitemapindex>` files (default: yes)
- Proxy configuration — optional Apify Proxy for sites that block raw server IPs
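Assuming the camelCase field names shown in the Tips section (`maxUrlsPerSite`, `includeMetadata`, `followSitemapIndex`) and Apify's usual start-URL list shape, a run input might look like this — check the actor's Input tab for the exact schema:

```json
{
  "startUrls": [{ "url": "https://apify.com" }],
  "maxUrlsPerSite": 10000,
  "includeMetadata": true,
  "followSitemapIndex": true
}
```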
Output
The actor pushes one dataset item per extracted URL. You can download in JSON, CSV, HTML, or Excel.
```json
{
  "url": "https://apify.com/apify/instagram-scraper",
  "lastmod": "2025-03-14",
  "changefreq": "daily",
  "priority": 0.8,
  "sourceWebsite": "https://apify.com",
  "sitemapUrl": "https://apify.com/sitemap.xml",
  "sitemapDepth": 0,
  "discoveredAt": "2026-04-15T18:30:00.000Z"
}
```
Data table
| Field | Type | Description |
|---|---|---|
| `url` | string | The extracted URL from the sitemap |
| `lastmod` | string | Last modification date from the sitemap (if present) |
| `changefreq` | string | How often the page is expected to change (daily, weekly, monthly, etc.) |
| `priority` | number | SEO priority hint from 0.0 to 1.0 |
| `sourceWebsite` | string | The root URL you started from |
| `sitemapUrl` | string | The specific sitemap file where this URL was found |
| `sitemapDepth` | number | Nesting depth in the sitemap index (0 = root sitemap) |
| `discoveredAt` | string | ISO 8601 timestamp of when the URL was extracted |
Pricing
This actor uses Apify's pay-per-event pricing model so you only pay for what you get:
- Actor start: $0.01 per run (covers robots.txt + sitemap fetches)
- Per URL extracted: $0.0005 per URL added to your dataset
Example costs:
- A small blog with 500 URLs → ~$0.26
- A mid-sized site with 5,000 URLs → ~$2.51
- A large catalog with 50,000 URLs → ~$25.01
Free Apify tier members get $5/month in platform credits, which covers ~10,000 URLs of extraction per month.
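The examples above follow a simple formula — a flat start fee plus a per-URL charge:

```python
def run_cost(url_count: int, start_fee: float = 0.01, per_url: float = 0.0005) -> float:
    """Estimated cost of one run under the pay-per-event prices listed above."""
    return round(start_fee + url_count * per_url, 2)

for n in (500, 5_000, 50_000):
    print(f"{n} URLs -> ${run_cost(n):.2f}")
```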
Tips and advanced options
- Set `maxUrlsPerSite` to a safe cap during testing (e.g. 100) to verify the actor works before running unlimited
- Disable `includeMetadata` if you only need URLs — this produces a much smaller dataset and faster downloads
- Disable `followSitemapIndex` to get only the top-level sitemap contents (useful for homepage/landing-page inventories)
- Enable Apify Proxy for sites that return 403 or rate-limit direct requests (government sites, some news publishers)
- Schedule daily runs via Apify's scheduler to track how a competitor's URL list changes over time — diff the datasets to see new product launches or archived content
FAQ and support
Is this legal? The actor only reads publicly declared sitemap files. Sitemaps exist to be read by crawlers — by convention (and by the intent of the site owner who published them) they are meant for public consumption. Always respect the target site's Terms of Service and robots.txt disallow rules.
What about gzipped sitemaps? Fully supported. The actor auto-detects .gz URLs and Content-Encoding: gzip responses and decompresses transparently.
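Gzip auto-detection is straightforward to illustrate: gzipped payloads start with the two magic bytes `1f 8b`, so a fetcher can decompress only when it sees them. A sketch of the technique (not the actor's actual code):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def maybe_decompress(body: bytes) -> bytes:
    """Transparently decompress a sitemap response if it is gzipped."""
    return gzip.decompress(body) if body.startswith(GZIP_MAGIC) else body

xml = b"<urlset><url><loc>https://example.com/</loc></url></urlset>"
assert maybe_decompress(gzip.compress(xml)) == xml  # gzipped input is unpacked
assert maybe_decompress(xml) == xml                 # plain input passes through
```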
What about nested sitemap indexes? Supported up to 5 levels deep. Most sites have at most 2 levels (index → sitemap → urls).
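Recursive expansion with a depth cap can be sketched as follows — a `<sitemapindex>` recurses into each child sitemap until `max_depth`, while a `<urlset>` yields page URLs. The fetcher is stubbed with a dict here; the real actor fetches over HTTP:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def expand(sitemap_url, fetch, depth=0, max_depth=5):
    """Yield page URLs, recursing into <sitemapindex> files up to max_depth."""
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == f"{NS}sitemapindex" and depth < max_depth:
        for loc in root.iter(f"{NS}loc"):  # child sitemap locations
            yield from expand(loc.text, fetch, depth + 1, max_depth)
    else:
        for url in root.iter(f"{NS}url"):  # leaf <urlset> entries
            yield url.find(f"{NS}loc").text

DOCS = {
    "https://example.com/sitemap_index.xml":
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<sitemap><loc>https://example.com/pages.xml</loc></sitemap>'
        '</sitemapindex>',
    "https://example.com/pages.xml":
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>https://example.com/a</loc></url>'
        '<url><loc>https://example.com/b</loc></url>'
        '</urlset>',
}
print(list(expand("https://example.com/sitemap_index.xml", DOCS.__getitem__)))
```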
The actor returned 0 URLs, help! The site probably doesn't publish a public sitemap. Try adding a custom sitemap URL explicitly in the Start URLs field (e.g. https://example.com/sitemap_index.xml).
Found a bug or missing feature? Open an issue on the Issues tab of this actor. Custom solutions available for enterprise use cases.