Sitemap Extractor (urlset + sitemap index)
Pricing
Pay per usage
Parse a sitemap.xml (urlset) or a sitemap index into a clean URL list — loc, lastmod, changefreq, priority — with automatic index detection and de-duplication. Bulk URL discovery to feed downstream scrapers and directory builds. Structured output with ok/error parity.
Developer
Tommy G
Maintained by Community
Sitemap Extractor (Apify Actor)
Give it any public sitemap URL (a sitemap.xml urlset or a sitemap index) and get back a clean,
de-duplicated list of URLs, each with its lastmod, changefreq, and priority, plus the
child sitemaps when the file is an index. It fetches the file over plain HTTP and parses the XML (or HTML-served) response without a headless browser, so it's
fast and cheap. Ideal for seeding a crawl, change monitoring, and site-coverage audits.
What it extracts
For each sitemap it returns one flat record with:
- is_index: whether the file is a sitemap index (points to other sitemaps) or a URL set
- urls[]: each entry as { loc, lastmod, changefreq, priority }
- sitemaps[]: for an index, each child sitemap as { loc, lastmod }
- urls_count, sitemaps_count
Plus control keys present on every row (ok and error alike, for clean buyer tables):
status, requested_url, final_url, http_status, redirected, found, complete, page_type, source, render_required, fields_found, error, extracted_at
Input
{ "startUrls": [{ "url": "https://example.com/sitemap.xml" }], "maxConcurrency": 5, "maxPages": 100 }
maxPages is capped at 200 and maxConcurrency at 20 (cost guard).
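The caps above could be enforced with a small input-normalization step. The following is a minimal sketch, not the Actor's actual code: normalizeInput and clampInt are hypothetical names, and the fallback values (100 and 5) are assumptions borrowed from the example input.

```javascript
// Hypothetical helper: clamp user input to the caps stated above (200 and 20).
// The fallbacks used for missing or invalid values are assumptions.
const LIMITS = { maxPages: 200, maxConcurrency: 20 };

function clampInt(value, fallback, max) {
  // Accept only positive integers; anything else falls back to the default.
  const n = Number.isInteger(value) && value > 0 ? value : fallback;
  return Math.min(n, max);
}

function normalizeInput(input = {}) {
  const startUrls = Array.isArray(input.startUrls) ? input.startUrls : [];
  return {
    // Keep only entries that carry a string url, in input order.
    startUrls: startUrls
      .map((entry) => (entry ? entry.url : undefined))
      .filter((url) => typeof url === "string"),
    maxPages: clampInt(input.maxPages, 100, LIMITS.maxPages),
    maxConcurrency: clampInt(input.maxConcurrency, 5, LIMITS.maxConcurrency),
  };
}
```

With this sketch, a request for maxPages: 5000 would be reduced to 200 rather than rejected, which keeps the run going while still bounding cost.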
Output — one STABLE record per URL (ok and error rows share the shape)
{
  "status": "ok",
  "requested_url": "https://example.com/sitemap.xml",
  "final_url": "https://example.com/sitemap.xml",
  "http_status": 200,
  "found": true,
  "complete": true,
  "page_type": "urlset",
  "source": "xml",
  "is_index": false,
  "urls": [
    { "loc": "https://example.com/", "lastmod": "2026-05-01", "changefreq": "daily", "priority": 1 },
    { "loc": "https://example.com/blog/post", "lastmod": "2026-04-20", "changefreq": "monthly", "priority": 0.8 }
  ],
  "sitemaps": [],
  "urls_count": 2,
  "sitemaps_count": 0,
  "extracted_at": "2026-05-29T..."
}
For a sitemap index, is_index:true and the child sitemaps appear in sitemaps[].
found:false means the file held no valid <loc> entries (e.g. an HTML page or unrelated XML).
Failed fetches return the same keys with status:"error" + error.
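To make the index detection, de-duplication, and found flag concrete, here is a rough sketch of how a sitemap body might be turned into the extraction fields. It is not the code in src/extract.js: parseSitemap, tagText, and decodeEntities are hypothetical names, the regex-based parsing is a simplification (a real XML parser may be a better fit), and the bare <loc> fallback and the entry cap are left out.

```javascript
// Minimal sketch under stated assumptions; function names are illustrative only.
const XML_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

function decodeEntities(text) {
  // Single pass, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(/&(amp|lt|gt|quot|apos);/g, (_, name) => XML_ENTITIES[name]);
}

function tagText(block, tag) {
  // Matches <tag> or <tag attr="..."> but not longer names such as <location>.
  const match = block.match(new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`, "i"));
  return match ? decodeEntities(match[1].trim()) : null;
}

function parseSitemap(xml) {
  const isIndex = /<sitemapindex[\s>]/i.test(xml);
  const entryTag = isIndex ? "sitemap" : "url";
  const blockPattern = new RegExp(`<${entryTag}(?:\\s[^>]*)?>[\\s\\S]*?</${entryTag}>`, "gi");
  const seen = new Set();
  const entries = [];
  for (const block of xml.match(blockPattern) || []) {
    const loc = tagText(block, "loc");
    if (!loc || seen.has(loc)) continue; // skip empty and duplicate <loc> values
    seen.add(loc);
    if (isIndex) {
      entries.push({ loc, lastmod: tagText(block, "lastmod") });
    } else {
      const rawPriority = tagText(block, "priority");
      const priority = rawPriority ? Number(rawPriority) : NaN;
      entries.push({
        loc,
        lastmod: tagText(block, "lastmod"),
        changefreq: tagText(block, "changefreq"),
        priority: Number.isFinite(priority) ? priority : null,
      });
    }
  }
  return {
    found: entries.length > 0, // false for HTML pages or unrelated XML
    is_index: isIndex,
    urls: isIndex ? [] : entries,
    sitemaps: isIndex ? entries : [],
    urls_count: isIndex ? 0 : entries.length,
    sitemaps_count: isIndex ? entries.length : 0,
  };
}
```

The sketch returns only the extraction fields; the control keys (status, requested_url, and so on) would be merged in by whatever code performs the fetch.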
Use cases
- Crawl seeding: turn a site's sitemap into a clean URL list to feed into any of the other extractors.
- Change monitoring: track lastmod across runs to see which pages were updated.
- Coverage audits: count and inspect every URL a site declares it wants indexed.
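As an illustration of the change-monitoring use case, the urls[] arrays from two runs of the same sitemap can be compared on loc and lastmod. This is a sketch only: diffRuns is a hypothetical helper, not part of the Actor, and it treats any difference in the lastmod string as an update rather than comparing dates.

```javascript
// Hypothetical helper: compare urls[] from a previous run and a current run.
function diffRuns(previousUrls, currentUrls) {
  const before = new Map(previousUrls.map((u) => [u.loc, u.lastmod ?? null]));
  const current = new Set(currentUrls.map((u) => u.loc));
  const added = [];
  const updated = [];
  for (const u of currentUrls) {
    if (!before.has(u.loc)) {
      added.push(u.loc); // URL did not exist in the previous run
    } else if ((u.lastmod ?? null) !== before.get(u.loc)) {
      updated.push(u.loc); // same URL, different lastmod string
    }
  }
  // URLs that were present before but are missing now.
  const removed = [...before.keys()].filter((loc) => !current.has(loc));
  return { added, updated, removed };
}
```

Because sitemaps may omit lastmod, a missing value is normalized to null on both sides so that two runs without lastmod are not reported as changed.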
Notes / safety
- Reads only the public sitemap a site already publishes for crawlers: facts only, no PII, no raw page bodies stored.
- SSRF-guarded (scheme check, private/metadata IP block, redirect re-check), robots-respecting, rate-limited, and cost-capped, all via the shared src/lib/actor_runner.js. Entry parsing is capped to guard against oversized files.
- Handles both XML and HTML-served sitemaps, sitemap indexes, and bare <loc> fallbacks. Core logic is in src/extract.js (pure, unit-tested).
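The scheme and private/metadata IP checks mentioned above might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of src/lib/actor_runner.js: isAllowedUrl and the blocked ranges are hypothetical, the list is not exhaustive, and the check covers only literal IP hosts. A fuller guard would also resolve hostnames and repeat the check on every redirect target.

```javascript
// Sketch only: validates the scheme and literal IP hosts. It does NOT resolve DNS,
// so a hostname that points at a private address would still pass.
const BLOCKED_V4 = [
  ["0.0.0.0", 8],
  ["10.0.0.0", 8],
  ["127.0.0.0", 8],
  ["169.254.0.0", 16], // link-local, includes the common cloud metadata address
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
];

function ipv4ToInt(ip) {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

function inCidr(ip, base, bits) {
  // Two addresses share a prefix when they fall into the same block of this size.
  const size = 2 ** (32 - bits);
  return Math.floor(ipv4ToInt(ip) / size) === Math.floor(ipv4ToInt(base) / size);
}

function isAllowedUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost") return false;
  if (host.startsWith("[")) return false; // conservative: reject all IPv6 literals here
  if (/^\d+\.\d+\.\d+\.\d+$/.test(host)) {
    return !BLOCKED_V4.some(([base, bits]) => inCidr(host, base, bits));
  }
  return true;
}
```

The WHATWG URL parser normalizes IPv4 hosts to dotted-decimal form, which is why a simple four-octet pattern is enough to recognize them in this sketch.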
Run locally / test
npm install
npm test # unit tests on the pure extractor (node:test)
Publish to Apify (account-holder's step)
$ npm install -g apify-cli && apify login && apify push
Keep it free initially; enable pricing later via the adult account-holder once it shows repeat organic usage and clears a margin gate.