Pricing

Pay per usage

Sitemap Extractor (urlset + sitemap index)

Parse a sitemap.xml (urlset) or a sitemap index into a clean URL list — loc, lastmod, changefreq, priority — with automatic index detection and de-duplication. Bulk URL discovery to feed downstream scrapers and directory builds. Structured output with ok/error parity.

Rating

0.0

(0)

Developer

Tommy G

Actor stats

Last modified: 2 months ago

Sitemap Extractor (Apify Actor)

Give it any public sitemap URL (a sitemap.xml or a sitemap index) and get back a clean, de-duplicated list of URLs, each with its lastmod, changefreq, and priority, plus the child sitemaps when the file is an index. It fetches over plain HTTP and parses the XML or HTML response directly (no headless browser), so it's fast and cheap. Ideal for seeding a crawl, change monitoring, and site-coverage audits.

What it extracts

For each sitemap it returns one flat record with:

is_index — whether the file is a sitemap index (points to other sitemaps) or a URL set
urls[] — each entry as { loc, lastmod, changefreq, priority }
sitemaps[] — for an index, each child sitemap as { loc, lastmod }
urls_count, sitemaps_count

Plus control keys present on every row (ok and error alike, for clean buyer tables):

status, requested_url, final_url, http_status, redirected, found, complete, page_type, source, render_required, fields_found, error, extracted_at
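
To make the extraction fields concrete, here is a minimal sketch of how a record of this shape could be derived from raw sitemap XML. It is regex-based for brevity and is not the Actor's actual src/extract.js; the helper names (childText, blocks, extractSitemap) are hypothetical, and XML entity decoding (for example &amp;) is left out.

```javascript
// Read the text of the first <name>...</name> child inside an XML fragment.
function childText(fragment, name) {
  const re = new RegExp(`<${name}(?:\\s[^>]*)?>([\\s\\S]*?)</${name}>`, "i");
  const m = fragment.match(re);
  if (!m) return null;
  // Unwrap an optional CDATA section and trim whitespace.
  const text = m[1].replace(/^\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*$/, "$1").trim();
  return text || null;
}

// Collect every <tag>...</tag> block; "url" does not match "urlset".
function blocks(xml, tag) {
  const re = new RegExp(`<${tag}(?:\\s[^>]*)?>[\\s\\S]*?</${tag}>`, "gi");
  return xml.match(re) || [];
}

function extractSitemap(xml) {
  // Index detection: a sitemap index uses <sitemapindex> as its root element.
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const seen = new Set(); // de-duplication by loc
  const urls = [];
  const sitemaps = [];

  for (const block of blocks(xml, isIndex ? "sitemap" : "url")) {
    const loc = childText(block, "loc");
    if (!loc || seen.has(loc)) continue;
    seen.add(loc);
    const lastmod = childText(block, "lastmod");
    if (isIndex) {
      sitemaps.push({ loc, lastmod });
    } else {
      const priority = childText(block, "priority");
      urls.push({
        loc,
        lastmod,
        changefreq: childText(block, "changefreq"),
        priority: priority === null ? null : Number(priority),
      });
    }
  }

  return {
    is_index: isIndex,
    urls,
    sitemaps,
    urls_count: urls.length,
    sitemaps_count: sitemaps.length,
    found: urls.length + sitemaps.length > 0,
  };
}
```

Calling extractSitemap(xmlText) on a fetched response body yields only the extraction fields; the control keys listed above would be added by whatever code performs the fetch.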

Input

{ "startUrls": [{ "url": "https://example.com/sitemap.xml" }], "maxConcurrency": 5, "maxPages": 100 }

maxPages is capped at 200 and maxConcurrency at 20 (cost guard).
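
The caps can be pictured as a simple clamp applied to the input before the run starts. This is a sketch only: clampInput is a hypothetical helper, and treating the sample values (100 and 5) as defaults is an assumption, not something the README states.

```javascript
// Documented upper bounds (cost guard).
const LIMITS = { maxPages: 200, maxConcurrency: 20 };
// Assumed defaults, taken from the sample input above.
const DEFAULTS = { maxPages: 100, maxConcurrency: 5 };

function clampInput(input = {}) {
  // Fall back to the default for missing or invalid values, then apply the cap.
  const clamp = (value, fallback, max) => {
    const n = Number.isInteger(value) && value > 0 ? value : fallback;
    return Math.min(n, max);
  };
  return {
    startUrls: Array.isArray(input.startUrls) ? input.startUrls : [],
    maxConcurrency: clamp(input.maxConcurrency, DEFAULTS.maxConcurrency, LIMITS.maxConcurrency),
    maxPages: clamp(input.maxPages, DEFAULTS.maxPages, LIMITS.maxPages),
  };
}
```

With this sketch, a request for maxPages: 5000 would run with 200, and a missing maxConcurrency would fall back to 5.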

Output

{
  "status": "ok",
  "requested_url": "https://example.com/sitemap.xml",
  "final_url": "https://example.com/sitemap.xml",
  "http_status": 200,
  "found": true,
  "complete": true,
  "page_type": "urlset",
  "source": "xml",
  "is_index": false,
  "urls": [
    { "loc": "https://example.com/", "lastmod": "2026-05-01", "changefreq": "daily", "priority": 1 },
    { "loc": "https://example.com/blog/post", "lastmod": "2026-04-20", "changefreq": "monthly", "priority": 0.8 }
  ],
  "sitemaps": [],
  "urls_count": 2,
  "sitemaps_count": 0,
  "extracted_at": "2026-05-29T..."
}

For a sitemap index, is_index is true and the child sitemaps appear in sitemaps[]. found:false means the file held no valid <loc> entries (e.g. an HTML page or unrelated XML). Failed fetches return the same keys with status:"error" and error set.
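
A consumer can branch on these flags in a fixed order: error first, then found, then is_index. The sketch below assumes the dataset rows are already loaded as plain objects; describeRecord and nextStartUrls are hypothetical helper names. The README does not say whether the Actor follows child sitemaps itself, so the second helper only matters if a follow-up run is needed.

```javascript
// One-line summary of a dataset row, checked in order of precedence.
function describeRecord(record) {
  if (record.status === "error") return `error: ${record.error}`;
  if (!record.found) return "no valid <loc> entries";
  if (record.is_index) return `index with ${record.sitemaps_count} child sitemaps`;
  return `urlset with ${record.urls_count} URLs`;
}

// Turn the child sitemaps of every index row into a startUrls array
// shaped like the Actor's input.
function nextStartUrls(records) {
  return records
    .filter((r) => r.status === "ok" && r.is_index)
    .flatMap((r) => r.sitemaps.map((s) => ({ url: s.loc })));
}
```

Because every row carries the same keys, neither helper needs to check whether a field exists before reading it on ok rows.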

Use cases

Crawl seeding — turn a site's sitemap into a clean URL list to feed any of the other extractors next.
Change monitoring — track lastmod across runs to see which pages were updated.
Coverage audits — count and inspect every URL a site declares it wants indexed.
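
As an illustration of the change-monitoring use case, the sketch below compares the records from two runs and reports which URLs were added, updated (lastmod changed), or removed. The function names are hypothetical, and it assumes both runs are available as arrays of output records.

```javascript
// Build a Map of loc -> lastmod from the ok rows of one run.
function lastmodIndex(records) {
  const index = new Map();
  for (const record of records) {
    if (record.status !== "ok") continue; // skip failed fetches
    for (const { loc, lastmod } of record.urls || []) {
      index.set(loc, lastmod ?? null);
    }
  }
  return index;
}

// Compare two runs by URL and lastmod.
function diffRuns(previousRecords, currentRecords) {
  const before = lastmodIndex(previousRecords);
  const after = lastmodIndex(currentRecords);
  const added = [];
  const updated = [];
  const removed = [];
  for (const [loc, lastmod] of after) {
    if (!before.has(loc)) added.push(loc);
    else if (before.get(loc) !== lastmod) updated.push(loc);
  }
  for (const loc of before.keys()) {
    if (!after.has(loc)) removed.push(loc);
  }
  return { added, updated, removed };
}
```

Comparing lastmod as a string keeps the sketch simple; sites that rewrite lastmod on every build will show up as updated on every run.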

Notes / safety

Reads only the public sitemap a site already publishes for crawlers — facts-only, no PII, no raw page bodies stored.
SSRF-guarded (scheme + private/metadata IP block + redirect re-check), robots-respecting, rate-limited, cost-capped — all via the shared src/lib/actor_runner.js. Entry parsing is capped to guard against oversized files.
Handles both XML and HTML-served sitemaps, sitemap indexes, and bare <loc> fallbacks. Core logic in src/extract.js (pure, unit-tested).
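
For readers unfamiliar with the SSRF guard mentioned above, here is a minimal sketch of the scheme check and the private/metadata IP block. It is not the code in src/lib/actor_runner.js: isAllowedUrl and isPrivateIPv4 are hypothetical names, and the sketch only inspects the hostname literal. A fuller guard would also check the addresses a hostname resolves to.

```javascript
// True for loopback, private, and link-local IPv4 literals.
function isPrivateIPv4(host) {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) || // link-local, includes 169.254.169.254 (cloud metadata)
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isAllowedUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Scheme check: only plain web URLs.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  // IPv6 literals keep their brackets in hostname; this sketch rejects them outright.
  if (host.startsWith("[")) return false;
  return !isPrivateIPv4(host);
}
```

The redirect re-check amounts to running the same function on every Location target before following it, rather than only on the URL the user supplied.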

Run locally / test

npm install
npm test     # unit tests on the pure extractor (node:test)

Publish to Apify (account-holder's step)

npm install -g apify-cli && apify login && apify push

Keep it free initially; enable pricing later via the adult account-holder once it shows repeat organic usage and clears a margin gate.

Sitemap URL Extractor - Get Every URL from sitemap.xml

eliai/sitemap-url-extractor

Extract every URL from any sitemap.xml, auto-following nested sitemap index files. Input: startUrls (sitemap URL). Output: JSON records with loc, lastmod, changefreq, priority, sourceSitemap. Cheap pay-per-result: $0.02 per sitemap parsed.

Anthony Snider

Sitemap URL Extractor - XML Sitemap Scraper

benthepythondev/sitemap-url-extractor

Extract URLs from XML sitemaps and sitemap indexes. Get URL, lastmod, changefreq, priority and source sitemap.

Ben

Sitemap Extractor: Website → All URLs (sitemap.xml parser)

boxbox10/sitemap-extractor

Give it a website. Get every URL from its sitemap — loc, lastmod, changefreq, priority — as one clean record per URL. Auto-discovers sitemap.xml, robots.txt Sitemap: directives, and nested sitemap indexes. Perfect for SEO audits, crawl seeding, and URL discovery.

Marvin Eguilos

Sitemap URL Extractor

seemuapps/sitemap-extractor

Extract every URL from a website's sitemap.xml. Recursively walks nested sitemap indexes and returns loc, lastmod, changefreq, and priority for each page.

Andrew

Sitemap URL Extractor - List All URLs in a Sitemap

dltik/sitemap-url-extractor

Extract every URL from any XML sitemap, with lastmod, changefreq and priority. Resolves sitemap indexes recursively. Pass a sitemap.xml or just a site root to auto-discover its sitemaps. Pure HTTP, no browser — fast and cheap.

Walid

Sitemap URL Extractor

mikolabs/sitemap-url-extractor

Extract every URL and its metadata from any sitemap.xml in seconds. Paste one or more sitemap URLs, run the Actor, and get a clean, structured dataset with url, lastmod, changefreq, priority, and more — ready to export as CSV, JSON, or Excel.

mikolabs

Sitemap URL Extractor: Every URL, Recursive

thoob/sitemap-extractor

Reads sitemap.xml, sitemap index files, .gz compressed sitemaps, and robots.txt Sitemap directives, and returns one clean row per URL with lastmod, changefreq, and priority. Billed only per delivered URL.

Pono Data

Sitemap URL Extractor — robots.txt + sitemap.xml Crawl

v0iddo/sitemap-url-extractor

Discover every URL a site exposes via its public sitemap chain. Reads robots.txt, follows Sitemap declarations, recursively descends sitemap-index files, extracts URLs with lastmod, changefreq, priority.

vøiddo

Sitemap to URL List Extractor

scrapeworks/sitemap-to-urls

Extract every URL from any website's sitemap as clean JSON. Handles sitemap indexes (recursive) and gzipped sitemaps automatically. Includes lastmod, priority, and changefreq.

Nicolas van Arkens

Sitemap URL Extractor

crawlerbros/sitemap-url-extractor

Extract every URL from any site's sitemap.xml. Handles sitemap index files (nested sitemaps), gzipped sitemaps, and robots.txt discovery. Returns URL, lastmod, changefreq, priority, and optional image/video/alternate-language fields. No proxy, no cookies, no login.