Sitemap URL Extractor
Pricing
From $0.30 per 1,000 URLs extracted
Extract every URL from a website's sitemap.xml. Recursively walks nested sitemap indexes and returns loc, lastmod, changefreq, and priority for each page.
Developer
Andrew
Maintained by Community
Pull every URL from any website's sitemap.xml — automatically walks nested sitemap indexes and returns a clean dataset with loc, lastmod, changefreq, and priority for each page.
What you get
- URL (loc) for every page listed in the site's sitemap
- Last modified date (lastmod): when each page was last updated
- Change frequency (changefreq): always, hourly, daily, weekly, monthly, yearly, or never
- Priority (priority): relative importance of each URL (0.0 to 1.0)
- Source sitemap: which sitemap file the URL came from (useful when a site splits its sitemap by section)
- Auto-discovery: point at a homepage and the actor finds the sitemap via robots.txt or /sitemap.xml
- Gzipped sitemap support: handles .xml.gz files transparently
- Recursive sitemap index walking: follows nested <sitemapindex> files up to 5 levels deep
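The auto-discovery step can be pictured with a small Python sketch. It is not the actor's actual code: the function name `discover_sitemaps` and the idea of passing in the already-downloaded robots.txt text are assumptions made for illustration. It collects `Sitemap:` directives and falls back to `/sitemap.xml` when none are found.

```python
from typing import Optional
from urllib.parse import urljoin


def discover_sitemaps(homepage_url: str, robots_txt: Optional[str]) -> list:
    """Return sitemap URLs listed in robots.txt, or the /sitemap.xml fallback.

    Hypothetical helper: robots_txt is the file's text, or None if it
    could not be fetched.
    """
    found = []
    for line in (robots_txt or "").splitlines():
        # Drop trailing comments, then split "Sitemap: <url>" on the first colon only.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            # urljoin leaves absolute URLs untouched and resolves relative ones.
            found.append(urljoin(homepage_url, value.strip()))
    # No directive found (or robots.txt missing): use the conventional path.
    return found or [urljoin(homepage_url, "/sitemap.xml")]
```

Calling `discover_sitemaps("https://www.example.com", None)` yields the single fallback URL `https://www.example.com/sitemap.xml`.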
Use cases
- SEO audits — pull a full URL inventory before running site-wide checks (broken links, missing meta tags, schema validation)
- Content migration — build a complete URL list when moving a site between platforms
- Crawl budget planning — see how many URLs a site exposes and how recently each was updated
- Competitor research — map out every page a competitor publishes
- Sitemap validation — verify that your published sitemap actually contains the pages you expect
- Bulk URL scraping pipelines — feed the output into another actor for screenshots, content extraction, or AI summarization
How to use
- Enter a Website or Sitemap URL: either a homepage like https://www.example.com (the actor auto-discovers the sitemap) or a direct sitemap URL like https://www.example.com/sitemap.xml
- Set Max Items: 0 returns every URL in the entire sitemap tree
- Choose whether to Follow Sitemap Index: on by default, so a single run pulls every URL from every child sitemap
- Run the actor: results land in the Dataset tab
- Export to JSON, CSV, Excel, or Google Sheets directly from the Apify console
Extract every URL on a site
{"websiteUrl": "https://www.apify.com","maxItems": 0,"followSitemapIndex": true}
Extract only the top-level sitemap
{"websiteUrl": "https://www.apify.com/sitemap.xml","maxItems": 0,"followSitemapIndex": false}
Output format
One dataset record per URL:
{"loc": "https://www.apify.com/store","lastmod": "2024-08-12","changefreq": "daily","priority": "0.8","sourceSitemap": "https://www.apify.com/sitemap.xml"}
Fields not present in the sitemap entry come back as null.
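To make the record shape concrete, here is a minimal Python sketch of how one `<urlset>` document could be turned into records like the one above. It is an illustration under assumptions, not the actor's implementation: `parse_urlset` and `local_name` are hypothetical names, and values are kept as strings to match the sample record. Missing fields become `None`, which serializes to `null` in JSON.

```python
import gzip
import xml.etree.ElementTree as ET

FIELDS = ("loc", "lastmod", "changefreq", "priority")


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_urlset(data: bytes, source_sitemap: str) -> list:
    """Turn one <urlset> document into a list of per-URL records."""
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes, e.g. a .xml.gz file
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    records = []
    for url_el in root:
        if local_name(url_el.tag) != "url":
            continue
        record = dict.fromkeys(FIELDS)  # every field starts as None
        for child in url_el:
            name = local_name(child.tag)
            if name in FIELDS and child.text and child.text.strip():
                record[name] = child.text.strip()
        if record["loc"] is None:
            continue  # an entry without <loc> is not a usable URL
        record["sourceSitemap"] = source_sitemap
        records.append(record)
    return records
```

Checking the first two bytes rather than the file extension is a design choice in this sketch: it also covers servers that serve gzipped content under a plain `.xml` name.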
Parameters
| Field | Default | Description |
|---|---|---|
| Website or Sitemap URL | https://www.apify.com | Homepage URL (the sitemap is auto-discovered) or a direct .xml / .xml.gz sitemap URL |
| Max Items | 0 | Maximum URLs to return per run. 0 = unlimited |
| Follow Sitemap Index | true | Recurse into child sitemaps when the top-level file is a sitemap index |
Notes
- Sitemap discovery first looks for Sitemap: directives in /robots.txt, then falls back to /sitemap.xml
- Nested sitemap indexes are walked breadth-first; the actor de-duplicates sitemap URLs so circular references are safe
- Recursion is capped at 5 levels deep and 1,000 total sitemaps as a safety net against runaway loops
- Each fetched sitemap has a 30-second timeout; slow or unreachable child sitemaps are logged and skipped, and the run continues
- Gzip-compressed sitemaps (*.xml.gz) are decompressed automatically
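The traversal rules in these notes can be sketched in Python as a breadth-first walk with a seen-set and two caps. This is an illustration, not the actor's code: `walk_sitemaps` and the `fetch_children` callback are hypothetical, and counting the starting sitemap as level 1 is an assumption, since the notes do not say how levels are counted.

```python
from collections import deque
from typing import Callable, Optional

MAX_DEPTH = 5        # depth cap from the notes
MAX_SITEMAPS = 1000  # total-sitemap cap from the notes


def walk_sitemaps(
    start_url: str,
    fetch_children: Callable[[str], Optional[list]],
) -> list:
    """Breadth-first walk over sitemap indexes; returns sitemaps visited, in order.

    fetch_children(url) is a hypothetical callback. It returns the child
    sitemap URLs of an index, [] for a plain <urlset>, or None when the
    fetch failed or timed out.
    """
    seen = {start_url}                # de-duplication makes cycles harmless
    queue = deque([(start_url, 1)])   # assumption: the start counts as level 1
    visited = []
    while queue and len(visited) < MAX_SITEMAPS:
        url, depth = queue.popleft()
        children = fetch_children(url)
        if children is None:
            continue  # slow or unreachable: skip it, the run continues
        visited.append(url)
        if depth >= MAX_DEPTH:
            continue  # do not descend past the depth cap
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return visited
```

In a real run the callback would do the network fetch, for example with `urllib.request.urlopen(url, timeout=30)`, and return `None` on any error so that one bad child sitemap does not stop the walk.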