Sitemap Extractor (urlset + sitemap index)

Pricing

Pay per usage

Parse a sitemap.xml (urlset) or a sitemap index into a clean URL list — loc, lastmod, changefreq, priority — with automatic index detection and de-duplication. Bulk URL discovery to feed downstream scrapers and directory builds. Structured output with ok/error parity.

Developer

Tommy G

Maintained by Community

Sitemap Extractor (Apify Actor)

Give it any public sitemap URL (a sitemap.xml urlset or a sitemap index) and get back a clean, de-duplicated list of URLs, each with its lastmod, changefreq, and priority, plus the child sitemaps when the file is an index. It parses the fetched XML or HTML response directly, with no headless browser, so it's fast and cheap. Ideal for seeding a crawl, change monitoring, and site-coverage audits.

What it extracts

For each sitemap it returns one flat record with:

  • is_index — whether the file is a sitemap index (points to other sitemaps) or a URL set
  • urls[] — each entry as { loc, lastmod, changefreq, priority }
  • sitemaps[] — for an index, each child sitemap as { loc, lastmod }
  • urls_count, sitemaps_count

Plus control keys present on every row (ok and error alike, for clean buyer tables):

status, requested_url, final_url, http_status, redirected, found, complete, page_type, source, render_required, fields_found, error, extracted_at.
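
As a rough illustration of how the fields above could be derived from raw sitemap text, here is a minimal sketch. It is not the actor's src/extract.js: `parseSitemap` and `textOf` are hypothetical names, the parsing is regex-based for brevity, and XML entities and CDATA sections are left undecoded.

```javascript
// Illustrative only: a regex-based parser, not the actor's src/extract.js.
// It does not decode XML entities or CDATA sections.

// Return the trimmed text of the first <tag>...</tag> in a block, or null.
function textOf(block, tag) {
  const m = block.match(new RegExp(`<${tag}[^>]*>([\\s\\S]*?)</${tag}>`, 'i'));
  return m ? m[1].trim() : null;
}

function parseSitemap(xml) {
  // A <sitemapindex> root marks an index; anything else is treated as a urlset.
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const tag = isIndex ? 'sitemap' : 'url';
  // [\s>] after the tag name keeps <url> from matching <urlset>.
  const entryRe = new RegExp(`<${tag}[\\s>][\\s\\S]*?</${tag}>`, 'gi');

  const seen = new Set();
  const entries = [];
  for (const block of xml.match(entryRe) || []) {
    const loc = textOf(block, 'loc');
    if (!loc || seen.has(loc)) continue; // skip empty and duplicate <loc> values
    seen.add(loc);
    const entry = { loc, lastmod: textOf(block, 'lastmod') };
    if (!isIndex) {
      const priority = textOf(block, 'priority');
      entry.changefreq = textOf(block, 'changefreq');
      entry.priority = priority === null ? null : Number(priority);
    }
    entries.push(entry);
  }

  return {
    is_index: isIndex,
    urls: isIndex ? [] : entries,
    sitemaps: isIndex ? entries : [],
    urls_count: isIndex ? 0 : entries.length,
    sitemaps_count: isIndex ? entries.length : 0,
  };
}
```

Missing optional tags come back as null in this sketch; whether the actor uses null or omits the key is not stated here.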

Input

{ "startUrls": [{ "url": "https://example.com/sitemap.xml" }], "maxConcurrency": 5, "maxPages": 100 }

maxPages is capped at 200 and maxConcurrency at 20 as a cost guard.
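
The caps can be pictured as a small clamping step applied to the input before a run starts. This is a sketch under assumptions: `clampInput` is a hypothetical name, the defaults are simply the values from the sample input above, and the lower bound of 1 is a guess rather than documented behavior.

```javascript
// Upper limits stated in this README.
const LIMITS = { maxPages: 200, maxConcurrency: 20 };
// Assumed defaults, copied from the sample input (not confirmed as the actor's defaults).
const DEFAULTS = { maxPages: 100, maxConcurrency: 5 };

function clampInput(input = {}) {
  const out = { startUrls: Array.isArray(input.startUrls) ? input.startUrls : [] };
  for (const key of Object.keys(LIMITS)) {
    // Non-numeric or missing values fall back to the assumed default.
    const n = Number.isFinite(input[key]) ? Math.floor(input[key]) : DEFAULTS[key];
    // Keep the value within [1, cap]; the lower bound of 1 is an assumption.
    out[key] = Math.min(Math.max(n, 1), LIMITS[key]);
  }
  return out;
}
```

With this sketch, a request for maxPages: 5000 would run with 200, and an omitted maxConcurrency would fall back to 5.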

Output — one STABLE record per URL (ok and error rows share the shape)

{
  "status": "ok",
  "requested_url": "https://example.com/sitemap.xml",
  "final_url": "https://example.com/sitemap.xml",
  "http_status": 200,
  "found": true,
  "complete": true,
  "page_type": "urlset",
  "source": "xml",
  "is_index": false,
  "urls": [
    { "loc": "https://example.com/", "lastmod": "2026-05-01", "changefreq": "daily", "priority": 1 },
    { "loc": "https://example.com/blog/post", "lastmod": "2026-04-20", "changefreq": "monthly", "priority": 0.8 }
  ],
  "sitemaps": [],
  "urls_count": 2,
  "sitemaps_count": 0,
  "extracted_at": "2026-05-29T..."
}

For a sitemap index, is_index:true and the child sitemaps appear in sitemaps[]. found:false means the file held no valid <loc> entries (e.g. an HTML page or unrelated XML). Failed fetches return the same keys with status:"error" + error.
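
Because ok and error rows share one shape, a consumer can branch on status, found, and is_index without checking which keys exist. The sketch below shows one way to sort dataset rows into page URLs, child sitemaps, and failures. `triage` is a hypothetical helper, and passing child sitemaps back in as new startUrls is an assumed workflow, not something this README specifies.

```javascript
// Sort actor output rows into three buckets using only the documented keys.
function triage(records) {
  const out = { urls: [], childSitemaps: [], failures: [] };
  for (const rec of records) {
    if (rec.status !== 'ok') {
      // Failed fetch: same keys, with the reason in `error`.
      out.failures.push({ url: rec.requested_url, reason: rec.error });
    } else if (!rec.found) {
      // Fetched fine, but the file held no valid <loc> entries.
      out.failures.push({ url: rec.requested_url, reason: 'no valid <loc> entries' });
    } else if (rec.is_index) {
      for (const s of rec.sitemaps || []) out.childSitemaps.push(s.loc);
    } else {
      for (const u of rec.urls || []) out.urls.push(u.loc);
    }
  }
  return out;
}
```

The `|| []` guards are defensive; they only matter if an error row carries null instead of an empty array.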

Use cases

  • Crawl seeding — turn a site's sitemap into a clean URL list to feed any of the other extractors next.
  • Change monitoring — track lastmod across runs to see which pages were updated.
  • Coverage audits — count and inspect every URL a site declares it wants indexed.
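
For the change-monitoring use case, comparing the urls[] arrays from two runs is enough to spot new, updated, and removed pages. The following is a minimal sketch with a hypothetical `diffLastmod` helper. It compares lastmod values as plain strings, so a site that changes only its date format would show every page as updated.

```javascript
// Compare two urls[] arrays (previous run vs. current run) by loc and lastmod.
function diffLastmod(previousUrls, currentUrls) {
  const before = new Map(previousUrls.map((u) => [u.loc, u.lastmod]));
  const added = [];
  const updated = [];
  for (const u of currentUrls) {
    if (!before.has(u.loc)) {
      added.push(u.loc); // not seen in the previous run
    } else if (before.get(u.loc) !== u.lastmod) {
      updated.push(u.loc); // same URL, different lastmod string
    }
    before.delete(u.loc); // whatever remains afterwards was removed
  }
  return { added, updated, removed: [...before.keys()] };
}
```

How the previous run's urls[] is stored between runs is left open here; any store that returns the earlier array works.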

Notes / safety

  • Reads only the public sitemap a site already publishes for crawlers — facts-only, no PII, no raw page bodies stored.
  • SSRF-guarded (scheme + private/metadata IP block + redirect re-check), robots-respecting, rate-limited, cost-capped — all via the shared src/lib/actor_runner.js. Entry parsing is capped to guard against oversized files.
  • Handles both XML and HTML-served sitemaps, sitemap indexes, and bare <loc> fallbacks. Core logic in src/extract.js (pure, unit-tested).
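
To make the SSRF guard concrete, here is a sketch of the scheme check and private/metadata address block for a single URL. It is not the code in src/lib/actor_runner.js: `checkUrl` and `isBlockedIPv4` are hypothetical names, IPv6 literals are rejected wholesale as a simplification, and hostnames are not resolved, so a DNS name that points at a private address would pass this check. A redirect re-check would amount to running the same function on each redirect target.

```javascript
// True for IPv4 literals in loopback, private, link-local, or "this network" ranges.
function isBlockedIPv4(host) {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) || // link-local, includes the 169.254.169.254 metadata address
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function checkUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: 'invalid_url' };
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { ok: false, reason: 'bad_scheme' };
  }
  const host = url.hostname;
  const blocked =
    host === 'localhost' ||
    host.endsWith('.localhost') ||
    host.startsWith('[') || // IPv6 literal: rejected outright in this sketch
    isBlockedIPv4(host);
  return blocked ? { ok: false, reason: 'private_address' } : { ok: true, reason: null };
}
```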

Run locally / test

npm install
npm test # unit tests on the pure extractor (node:test)

Publish to Apify (account-holder's step)

npm install -g apify-cli && apify login && apify push

Keep it free initially; enable pricing later via the adult account-holder once it shows repeat organic usage and clears a margin gate.