Sitemap URL Extractor

Pricing

from $0.30 / 1,000 URLs extracted

Extract every URL from a website's sitemap.xml. Recursively walks nested sitemap indexes and returns loc, lastmod, changefreq, and priority for each page.

Rating: 0.0 (0)

Developer: Andrew

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 1
  • Monthly active users: 0
  • Last modified: 5 days ago

Pull every URL from any website's sitemap.xml. The actor automatically walks nested sitemap indexes and returns a clean dataset with loc, lastmod, changefreq, and priority for each page.

What you get

  • URL (loc) for every page listed in the site's sitemap
  • Last modified date (lastmod) — when each page was last updated
  • Change frequency (changefreq) — always, hourly, daily, weekly, monthly, yearly, never
  • Priority (priority) — relative importance of each URL (0.0 - 1.0)
  • Source sitemap — which sitemap file the URL came from (useful when a site splits its sitemap by section)
  • Auto-discovery — point at a homepage and the actor finds the sitemap via robots.txt or /sitemap.xml
  • Gzipped sitemap support — handles .xml.gz files transparently
  • Recursive sitemap index walking — follows nested <sitemapindex> files up to 5 levels deep
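
To make the fields above concrete, here is a minimal Python sketch of how a single sitemap file could be turned into records of that shape. It is an illustration, not the actor's own code: the names maybe_gunzip, parse_urlset, and SAMPLE are hypothetical, gzip is detected by its magic bytes (an assumption about what "transparently" might mean), and fields missing from an entry become None to mirror the null values in the output.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress the payload if it starts with the gzip magic bytes."""
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw


def parse_urlset(raw: bytes, source_sitemap: str) -> list[dict]:
    """Return one record per <url> entry; absent fields become None."""
    root = ET.fromstring(maybe_gunzip(raw))
    records = []
    for url in root.iter(f"{SITEMAP_NS}url"):
        record = {}
        for field in FIELDS:
            text = url.findtext(f"{SITEMAP_NS}{field}")
            record[field] = text.strip() if text and text.strip() else None
        record["sourceSitemap"] = source_sitemap
        records.append(record)
    return records


# Hypothetical two-entry sitemap; the second entry only carries <loc>.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-08-12</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
  </url>
</urlset>"""

# Compress the sample to exercise the .xml.gz path.
records = parse_urlset(
    gzip.compress(SAMPLE), "https://www.example.com/sitemap.xml.gz"
)
for record in records:
    print(record)
```

Passing the source sitemap URL in explicitly keeps the parser free of any network code, so the same function works on files read from disk.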

Use cases

  • SEO audits — pull a full URL inventory before running site-wide checks (broken links, missing meta tags, schema validation)
  • Content migration — build a complete URL list when moving a site between platforms
  • Crawl budget planning — see how many URLs a site exposes and how recently each was updated
  • Competitor research — map out every page a competitor publishes
  • Sitemap validation — verify that your published sitemap actually contains the pages you expect
  • Bulk URL scraping pipelines — feed the output into another actor for screenshots, content extraction, or AI summarization

How to use

  1. Enter a Website or Sitemap URL — either a homepage like https://www.example.com (the actor auto-discovers the sitemap) or a direct sitemap URL like https://www.example.com/sitemap.xml
  2. Set Max Items (0 returns every URL in the entire sitemap tree)
  3. Choose whether to Follow Sitemap Index — on by default, so a single run pulls every URL from every child sitemap
  4. Run the actor — results land in the Dataset tab
  5. Export to JSON, CSV, Excel, or Google Sheets directly from the Apify console

Extract every URL on a site

{
  "websiteUrl": "https://www.apify.com",
  "maxItems": 0,
  "followSitemapIndex": true
}

Extract only the top-level sitemap

{
  "websiteUrl": "https://www.apify.com/sitemap.xml",
  "maxItems": 0,
  "followSitemapIndex": false
}
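
The same inputs can also be built and submitted from a script. The sketch below is one possible way to do it in Python, assuming the apify-client package is installed. build_run_input and run_actor are hypothetical helper names, the validation rules are assumptions rather than documented behavior, and the token and actor ID are placeholders that this page does not specify.

```python
def build_run_input(website_url: str, max_items: int = 0,
                    follow_sitemap_index: bool = True) -> dict:
    """Build the actor input; keys mirror the JSON examples above."""
    # Assumed validation: the page does not say what the actor itself rejects.
    if not website_url.startswith(("http://", "https://")):
        raise ValueError("website_url must be an absolute http(s) URL")
    if max_items < 0:
        raise ValueError("max_items must be 0 (unlimited) or a positive integer")
    return {
        "websiteUrl": website_url,
        "maxItems": max_items,
        "followSitemapIndex": follow_sitemap_index,
    }


def run_actor(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Equivalent to the "top-level sitemap only" example above.
top_level_only = build_run_input(
    "https://www.apify.com/sitemap.xml", follow_sitemap_index=False
)
print(top_level_only)

# Placeholders: substitute your own API token and the actor's ID.
# items = run_actor("<YOUR_APIFY_TOKEN>", "<username>/<actor-name>", top_level_only)
```

The network call is left commented out so the snippet runs without credentials; only build_run_input executes as written.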

Output format

One dataset record per URL:

{
  "loc": "https://www.apify.com/store",
  "lastmod": "2024-08-12",
  "changefreq": "daily",
  "priority": "0.8",
  "sourceSitemap": "https://www.apify.com/sitemap.xml"
}

Fields not present in the sitemap entry come back as null.
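
Because priority is a string and any optional field may be null, downstream code needs a little care. The following Python sketch shows one way to post-process exported records. The function names and the second sample record are hypothetical, and the sketch assumes lastmod values start with a YYYY-MM-DD date, skipping anything else.

```python
from collections import defaultdict
from datetime import date


def priority_of(record: dict, default: float = 0.5) -> float:
    """Convert the string priority to a float; null falls back to 0.5,
    the default priority defined by the sitemap protocol."""
    value = record.get("priority")
    return float(value) if value is not None else default


def summarize(records: list[dict]) -> dict:
    """Per source sitemap: URL count and most recent lastmod date (or None)."""
    summary = defaultdict(lambda: {"urls": 0, "latest": None})
    for record in records:
        entry = summary[record["sourceSitemap"]]
        entry["urls"] += 1
        lastmod = record.get("lastmod")
        if not lastmod:
            continue  # null lastmod: counted, but ignored for recency
        try:
            # Keep only the date part of a longer timestamp.
            day = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue  # unexpected format: skip rather than guess
        if entry["latest"] is None or day > entry["latest"]:
            entry["latest"] = day
    return dict(summary)


SOURCE = "https://www.apify.com/sitemap.xml"
records = [
    {"loc": "https://www.apify.com/store", "lastmod": "2024-08-12",
     "changefreq": "daily", "priority": "0.8", "sourceSitemap": SOURCE},
    # Hypothetical entry whose sitemap record only contained <loc>.
    {"loc": "https://www.apify.com/pricing", "lastmod": None,
     "changefreq": None, "priority": None, "sourceSitemap": SOURCE},
]

summary = summarize(records)
print(summary)
print([priority_of(r) for r in records])
```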

Parameters

| Field | Default | Description |
| --- | --- | --- |
| Website or Sitemap URL | https://www.apify.com | Homepage URL (auto-discovered) or direct .xml / .xml.gz sitemap URL |
| Max Items | 0 | Maximum URLs to return per run. 0 = unlimited |
| Follow Sitemap Index | true | Recurse into child sitemaps when the top-level file is a sitemap index |

Notes

  • Sitemap discovery first looks for Sitemap: directives in /robots.txt, then falls back to /sitemap.xml
  • Nested sitemap indexes are walked breadth-first; the actor de-duplicates sitemap URLs so circular references are safe
  • Recursion is capped at 5 levels deep and 1,000 total sitemaps as a safety net against runaway loops
  • Each fetched sitemap has a 30-second timeout; slow or unreachable child sitemaps are logged and skipped, and the run continues
  • Gzip-compressed sitemaps (*.xml.gz) are decompressed automatically
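
The discovery and traversal rules in these notes can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the actor's implementation: discover, walk, and the fake in-memory site are hypothetical, fetching is injected as a function so the example runs offline, and the way depth is counted (start sitemap at depth 0, children of depth-5 indexes not followed) is a guess at what "5 levels deep" means.

```python
import re
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable, Optional

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_DEPTH = 5         # depth cap from the notes above
MAX_SITEMAPS = 1000   # total-sitemap cap from the notes above


def discover(robots_txt: str, base_url: str) -> list[str]:
    """Use Sitemap: directives from robots.txt, else fall back to /sitemap.xml."""
    found = re.findall(r"(?im)^[ \t]*sitemap:[ \t]*(\S+)", robots_txt)
    return found or [base_url.rstrip("/") + "/sitemap.xml"]


def walk(start: list[str], fetch: Callable[[str], Optional[bytes]]) -> list[str]:
    """Breadth-first walk; fetch returns XML bytes, or None on timeout/error."""
    queue = deque((url, 0) for url in start)
    seen = set(start)  # de-duplicates sitemap URLs, so cycles terminate
    page_urls: list[str] = []
    fetched = 0
    while queue and fetched < MAX_SITEMAPS:
        sitemap_url, depth = queue.popleft()
        raw = fetch(sitemap_url)
        fetched += 1
        if raw is None:
            continue  # skipped; the walk carries on with the next sitemap
        root = ET.fromstring(raw)
        locs = [(loc.text or "").strip() for loc in root.iter(f"{NS}loc")]
        if root.tag != f"{NS}sitemapindex":
            page_urls.extend(locs)  # ordinary <urlset>: collect page URLs
        elif depth < MAX_DEPTH:
            for child in locs:
                if child and child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
    return page_urls


def _xml(tag: str, item: str, locs: list[str]) -> bytes:
    body = "".join(f"<{item}><loc>{u}</loc></{item}>" for u in locs)
    return (f'<{tag} xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{body}</{tag}>").encode()


# Fake site: the index lists itself (a cycle) and one unreachable child.
ROOT = "https://example.com/sitemap.xml"
FAKE_SITE = {
    ROOT: _xml("sitemapindex", "sitemap",
               ["https://example.com/a.xml", ROOT, "https://example.com/slow.xml"]),
    "https://example.com/a.xml": _xml(
        "urlset", "url", ["https://example.com/1", "https://example.com/2"]),
}

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
pages = walk(discover(robots, "https://example.com"), FAKE_SITE.get)
print(pages)
```

Injecting fetch keeps the traversal logic testable; a real version would wrap an HTTP request with a 30-second timeout and return None when it fails.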