Sitemap URL Extractor

Under maintenance

Pricing

Pay per event

Extract every URL from any website's sitemap.xml, together with its lastmod, changefreq, and priority values. The actor recursively expands sitemap index files, reads robots.txt, and handles gzipped sitemaps. Useful for SEO audits, content migration, site inventory, and competitor research.

Developer

Mohieldin Mohamed

Maintained by Community

Last modified: 4 days ago

Extract every URL listed in a website's sitemaps in seconds, with lastmod, changefreq, and priority metadata intact.

This actor walks a site's robots.txt, discovers every declared sitemap, recursively expands sitemap index files, and dumps every single URL it finds into a structured Apify dataset you can download as JSON, CSV, or Excel.
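The two basic steps described above, finding sitemaps in robots.txt and reading the entries of a sitemap, can be sketched in Python as follows. This is a minimal illustration of the sitemaps.org format, not the actor's own code; the function names `sitemaps_from_robots` and `parse_urlset` are hypothetical, and both functions work on text you have already downloaded.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on `Sitemap:` lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def parse_urlset(xml_text: str) -> list[dict]:
    """Turn a <urlset> document into one record per <url> entry."""
    root = ET.fromstring(xml_text)
    records = []
    for url in root.findall("sm:url", NS):
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        records.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            # The optional fields stay None when the sitemap omits them.
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": float(priority) if priority is not None else None,
        })
    return records
```

The record keys mirror the url, lastmod, changefreq, and priority fields of the actor's output; the remaining output fields are left out to keep the sketch short.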

What does Sitemap URL Extractor do?

Give it one website URL. It returns every URL that site publishes in its sitemap.xml — including URLs buried inside multi-level sitemap index files, gzipped sitemaps, and sitemaps referenced from robots.txt. Perfect for SEO audits, content migrations, site inventory, and competitor research. No API keys. No browser. No proxy required for most sites.

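Try it: paste https://apify.com into the Start URLs field, press Start, and watch the dataset fill up with every URL the site lists in its sitemaps. A typical mid-sized company site (5,000–50,000 URLs) finishes in under a minute; raise Max URLs per site above its 10,000 default (or set it to 0) to capture the larger end of that range.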

Apify platform advantages include scheduled runs (daily sitemap snapshots), API access, webhook integrations, proxy rotation when needed, and run history.

Why use Sitemap URL Extractor?

  • SEO audits — see every URL Google is supposed to index and compare against your canonical list
  • Content migration — pull your entire old site's URL list before moving to a new CMS
  • Competitor intelligence — see every public page a competitor publishes, including product catalogs and blog archives
  • Link checking — feed the output into a link checker to find every broken link on a site
  • Snapshots over time — schedule daily runs and diff URL lists to detect content changes
  • Dataset for LLM training — get a clean list of URLs to feed into a content extractor

How to use Sitemap URL Extractor

  1. Click Try for free (or Start if you're already logged in)
  2. In the Start URLs field, paste one or more website root URLs (e.g. https://example.com)
  3. Optionally set Max URLs per site to cap output size
  4. Click Start
  5. Watch the dataset populate in real time in the Output tab
  6. Download as JSON, CSV, or Excel, or hit the API endpoint directly

Input

  • Start URLs — one or more website root URLs to crawl (e.g. https://apify.com)
  • Max URLs per site — safety cap (default 10,000, use 0 for unlimited)
  • Include metadata — attach lastmod, changefreq, priority to each URL (default: yes)
  • Follow sitemap index — recursively expand nested <sitemapindex> files (default: yes)
  • Proxy configuration — optional Apify Proxy for sites that block raw server IPs
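If you start runs programmatically, the options above map onto a JSON input object. The sketch below assembles one in Python. The keys maxUrlsPerSite, includeMetadata, and followSitemapIndex are the names used in the tips on this page; the startUrls list-of-objects shape and the proxyConfiguration key are assumptions based on common Apify conventions, so check the actor's input schema before relying on them. The helper name `build_run_input` is hypothetical.

```python
import json


def build_run_input(start_urls, max_urls_per_site=10_000,
                    include_metadata=True, follow_sitemap_index=True,
                    use_apify_proxy=False):
    """Assemble a run input dict mirroring the documented defaults."""
    if max_urls_per_site < 0:
        raise ValueError("max_urls_per_site must be 0 (unlimited) or positive")
    return {
        # Assumed shape: a list of {"url": ...} objects.
        "startUrls": [{"url": u} for u in start_urls],
        "maxUrlsPerSite": max_urls_per_site,
        "includeMetadata": include_metadata,
        "followSitemapIndex": follow_sitemap_index,
        # Assumed key name for the optional proxy settings.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


if __name__ == "__main__":
    # A small test run capped at 100 URLs.
    print(json.dumps(build_run_input(["https://apify.com"], max_urls_per_site=100), indent=2))
```

The printed JSON can be pasted into the actor's input editor or sent as the run input through the Apify API.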

Output

The actor pushes one dataset item per extracted URL. You can download the dataset as JSON, CSV, HTML, or Excel.

{
  "url": "https://apify.com/apify/instagram-scraper",
  "lastmod": "2025-03-14",
  "changefreq": "daily",
  "priority": 0.8,
  "sourceWebsite": "https://apify.com",
  "sitemapUrl": "https://apify.com/sitemap.xml",
  "sitemapDepth": 0,
  "discoveredAt": "2026-04-15T18:30:00.000Z"
}

Data table

Field         | Type   | Description
--------------|--------|------------------------------------------------------------------------
url           | string | The extracted URL from the sitemap
lastmod       | string | Last modification date from the sitemap (if present)
changefreq    | string | How often the page is expected to change (daily, weekly, monthly, etc.)
priority      | number | SEO priority hint from 0.0 to 1.0
sourceWebsite | string | The root URL you started from
sitemapUrl    | string | The specific sitemap file where this URL was found
sitemapDepth  | number | Nesting depth in sitemap index (0 = root sitemap)
discoveredAt  | string | ISO timestamp of when the URL was extracted
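As an example of working with these fields, the sketch below picks out URLs whose lastmod falls on or after a given date from a list of dataset items (for instance, a downloaded JSON export loaded with `json.load`). The function name `modified_since` is hypothetical, and the handling of partial dates is an assumption: sitemaps may carry year-only or malformed lastmod values, which this sketch simply skips.

```python
from datetime import date


def modified_since(items, cutoff: date):
    """Return URLs whose lastmod date is on or after `cutoff`.

    Only the leading YYYY-MM-DD part of lastmod is compared, so full
    timestamps such as 2025-03-14T10:00:00+00:00 are accepted as well.
    """
    recent = []
    for item in items:
        lastmod = item.get("lastmod")
        if not lastmod:
            continue  # lastmod is optional in sitemaps
        try:
            day = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue  # partial (e.g. "2025") or malformed date
        if day >= cutoff:
            recent.append(item["url"])
    return recent
```

For an SEO audit this gives a quick list of recently touched pages; items without a usable lastmod are ignored rather than guessed at.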

Pricing

This actor uses Apify's pay-per-event pricing model so you only pay for what you get:

  • Actor start: $0.01 per run (covers robots.txt + sitemap fetches)
  • Per URL extracted: $0.0005 per URL added to your dataset

Example costs:

  • A small blog with 500 URLs → ~$0.26
  • A mid-sized site with 5,000 URLs → ~$2.51
  • A large catalog with 50,000 URLs → ~$25.01

Free Apify tier members get $5/month in platform credits, which covers roughly 10,000 URLs of extraction per month (9,980 URLs in a single run once the $0.01 start fee is deducted).
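The arithmetic behind these figures is one start fee per run plus a per-URL fee. A small sketch, using only the two prices listed above (the function names `estimate_cost` and `urls_within_budget` are hypothetical, and `Decimal` is used to avoid floating-point rounding on money):

```python
from decimal import Decimal

START_FEE = Decimal("0.01")      # per run, from the price list above
PER_URL_FEE = Decimal("0.0005")  # per URL added to the dataset


def estimate_cost(url_count: int, runs: int = 1) -> Decimal:
    """Cost in USD of extracting `url_count` URLs spread over `runs` runs."""
    return START_FEE * runs + PER_URL_FEE * url_count


def urls_within_budget(budget: Decimal, runs: int = 1) -> int:
    """How many URLs fit in `budget` USD after paying the start fees."""
    remaining = budget - START_FEE * runs
    return max(0, int(remaining / PER_URL_FEE))
```

For example, `estimate_cost(5000)` equals the $2.51 quoted for a mid-sized site, and splitting a monthly budget across 30 daily runs costs $0.30 in start fees before any URL is extracted.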

Tips and advanced options

  • Set maxUrlsPerSite to a safe cap during testing (e.g. 100) to verify the actor works before running unlimited
  • Disable includeMetadata if you only need URLs — this produces a much smaller dataset and faster downloads
  • Disable followSitemapIndex to get only the top-level sitemap contents (useful for homepage/landing-page inventories)
  • Enable Apify Proxy for sites that return 403 or rate-limit direct requests (government sites, some news publishers)
  • Schedule daily runs via Apify's scheduler to track how a competitor's URL list changes over time — diff the datasets to see new product launches or archived content
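Diffing two scheduled snapshots, as the last tip suggests, comes down to set operations on the url field. A minimal sketch, assuming each snapshot is a list of dataset items as shown in the Output section; the function name `diff_snapshots` is hypothetical, and the "changed" bucket is only meaningful when the runs had metadata enabled.

```python
def diff_snapshots(old_items, new_items):
    """Compare two dataset snapshots by URL and by lastmod."""
    old = {item["url"]: item.get("lastmod") for item in old_items}
    new = {item["url"]: item.get("lastmod") for item in new_items}
    return {
        # URLs present only in the newer snapshot.
        "added": sorted(new.keys() - old.keys()),
        # URLs that disappeared from the sitemap.
        "removed": sorted(old.keys() - new.keys()),
        # URLs in both snapshots whose lastmod value differs.
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }
```

New product pages would show up under "added", archived content under "removed", and edited pages under "changed" if the site maintains lastmod reliably.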

FAQ and support

Is this legal? The actor only reads publicly declared sitemap files. Sitemaps exist to be read by crawlers — by convention (and by the intent of the site owner who published them) they are meant for public consumption. Always respect the target site's Terms of Service and robots.txt disallow rules.

What about gzipped sitemaps? Fully supported. The actor auto-detects .gz URLs and Content-Encoding: gzip responses and decompresses transparently.
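If you process sitemap files yourself, a similar effect can be had in a few lines of Python. Note that this sketch uses a different detection method from the one described above: instead of looking at the URL suffix or response headers, it checks the downloaded bytes for the gzip magic number, which also works when an HTTP client has already consumed the Content-Encoding header. The function name `maybe_gunzip` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress `body` if it looks like gzip data, else return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```

Calling it on every downloaded sitemap body is harmless, since plain XML passes through untouched.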

What about nested sitemap indexes? Supported up to 5 levels deep. Most sites have at most 2 levels (index → sitemap → urls).
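Recursive expansion with a depth cap can be sketched as below. This is an illustration, not the actor's implementation: the `expand` function and its `fetch` parameter (any callable that maps a sitemap URL to its XML text) are hypothetical, and how the actor counts its 5 levels is not specified, so the sketch assumes depths 0 through 5 are allowed, with 0 being the root sitemap.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_DEPTH = 5  # assumed reading of "up to 5 levels deep"


def expand(sitemap_url, fetch, depth=0, seen=None):
    """Yield (page_url, sitemap_url, depth) for every <url> reachable from sitemap_url."""
    seen = set() if seen is None else seen
    # Stop at the depth cap and never visit the same sitemap twice (guards against cycles).
    if depth > MAX_DEPTH or sitemap_url in seen:
        return
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # An index file lists child sitemaps; descend one level into each.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                yield from expand(child, fetch, depth + 1, seen)
    else:
        # A regular sitemap lists page URLs directly.
        for loc in root.findall("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page:
                yield page, sitemap_url, depth
```

Passing `fetch` in as a parameter keeps the traversal logic independent of any HTTP library and easy to try out against a dictionary of sample documents.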

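The actor returned 0 URLs, help! The actor probably could not find a sitemap for the site, either because the site doesn't publish one or because it lives at a path the actor did not discover. If you know the sitemap's location, add that URL explicitly in the Start URLs field (e.g. https://example.com/sitemap_index.xml).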

Found a bug or missing feature? Open an issue on the Issues tab of this actor. Custom solutions available for enterprise use cases.