Pricing

$0.01 / 1,000 extracted URLs

Sitemap URL Extractor: Every URL, Recursive

Reads sitemap.xml, sitemap index files, .gz compressed sitemaps, and robots.txt Sitemap directives, and returns one clean row per URL with lastmod, changefreq, and priority. Billed only per delivered URL.

Developer

Pono Data

Last modified: a month ago

Sitemap Extractor

Give it a sitemap URL, a robots.txt URL, or a site root. It returns one clean row per URL, following sitemap index files, .gz compressed sitemaps, and the Sitemap: directives in robots.txt. Every row carries the sitemap it came from, so any row is verifiable at its source.

What it does in one actor that others split across three

Finds the sitemap for you. Give a bare site root and it probes the common sitemap locations, robots.txt, and the homepage's <link rel="sitemap">, then extracts what it finds. Other actors either find the sitemap or extract one, not both.
Validates each URL (optional). Turn on status checking and every row gets an httpStatus and a live flag, checked concurrently. Filter to live URLs only for a list with no dead links.
Reads the sitemap extensions. Image, news, video, and hreflang-alternate data are pulled out per URL when the sitemap carries them, not dropped.

Plus the basics done right: index recursion, .gz, lastmod filtering, dedupe, and a sourceUrl on every row.
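
The discovery step for a bare site root can be pictured roughly as follows. This is a minimal sketch, not the actor's code: the paths in COMMON_PATHS and both function names are assumptions made for illustration. Only the Sitemap: directive format is taken from the sitemaps protocol.

```python
from urllib.parse import urljoin

# Hypothetical probe list; the actor's real list of common locations is not documented here.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz"]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named in 'Sitemap:' directives, in order, without duplicates."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; the directive name is matched case-insensitively.
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found


def candidate_locations(site_root: str) -> list[str]:
    """Build the guessed sitemap URLs to probe for a site root."""
    return [urljoin(site_root, path) for path in COMMON_PATHS]
```

A caller would try the robots.txt results first and fall back to the guessed locations, which matches the behaviour described above where a missed guess is silent.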

Input

Start URLs: sitemap URLs, robots.txt URLs, or site roots. A site root is resolved by probing the common locations, robots.txt, and the homepage link.
Max delivered URLs: cap on delivered rows (0 means no cap).
Max sitemap-index depth: how deep to follow nested index files.
Only URLs modified on/after: optional YYYY-MM-DD lastmod filter (entries with no lastmod are kept).
Include changefreq and priority: toggle the optional hint fields.
Validate URL status codes: add httpStatus and live to every row.
Deliver only live URLs: with validation on, send dead URLs to the free rejected dataset instead of delivering them.
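
The lastmod filter in the list above has one rule that is easy to miss: entries with no lastmod are kept. A minimal sketch of that rule, assuming a hypothetical helper name and assuming that a lastmod value which cannot be parsed as a full date is treated like a missing one:

```python
from datetime import date
from typing import Optional


def passes_lastmod_filter(lastmod: Optional[str], cutoff: str) -> bool:
    """Keep an entry if it has no lastmod or its date is on or after the cutoff."""
    if not lastmod:
        return True  # entries with no lastmod are kept
    try:
        # lastmod may carry a time and offset; the first 10 characters hold YYYY-MM-DD.
        modified = date.fromisoformat(lastmod.strip()[:10])
    except ValueError:
        return True  # assumption: partial or malformed dates are not filtered out
    return modified >= date.fromisoformat(cutoff)
```

The cutoff is the YYYY-MM-DD string supplied as input; comparing dates rather than strings keeps the check correct when lastmod includes a time component.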

Output

One row per URL: loc, lastmod, changefreq, priority, host, depth, the optional httpStatus and live, the image / news / video / hreflang extension fields when present, sourceUrl (the sitemap it came from), retrievedAt, confidence, dataSource.

How it works

Sitemaps are published by sites for machines to read. The actor fetches them with a declared User-Agent, decompresses .gz, follows index files up to your depth limit, dedupes by URL, and never invents a field: a value is emitted only if it is present in the sitemap. A supplied sitemap that fails to fetch or parse is recorded in the free rejected dataset; a probe of a guessed location that misses is silent. Status checks, when on, run concurrently so validation stays fast.
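
The walk described above (decompress, follow index files to a depth limit, dedupe, emit only fields that are present) can be sketched as below. It is an illustration under stated assumptions, not the actor's implementation: walk_sitemap and the injected fetch callable are hypothetical names, and the depth rule (an index at depth d is followed only while d is below max_depth) is one possible reading of the depth limit.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # sitemap protocol namespace
FIELDS = ("loc", "lastmod", "changefreq", "priority")


def walk_sitemap(url: str, fetch: Callable[[str], bytes], max_depth: int,
                 depth: int = 0, seen: Optional[Set[str]] = None) -> Iterator[dict]:
    """Yield one row per unique URL, following index files up to max_depth."""
    seen = set() if seen is None else seen
    raw = fetch(url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)

    if root.tag == NS + "sitemapindex":
        if depth >= max_depth:
            return  # depth limit reached; nested sitemaps are not followed
        for loc in root.iter(NS + "loc"):
            child = (loc.text or "").strip()
            if child:
                yield from walk_sitemap(child, fetch, max_depth, depth + 1, seen)
        return

    for entry in root.iter(NS + "url"):
        row = {"sourceUrl": url}
        for field in FIELDS:
            node = entry.find(NS + field)
            # A field is emitted only when the sitemap actually carries a value.
            if node is not None and node.text and node.text.strip():
                row[field] = node.text.strip()
        if "loc" in row and row["loc"] not in seen:
            seen.add(row["loc"])
            yield row
```

Passing fetch in as a parameter keeps the sketch independent of any HTTP client; a real caller would supply a function that sends the declared User-Agent and handles fetch errors.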

Billing

Pay per delivered URL row. Failed or unparseable sitemaps, and dead URLs when you filter to live only, cost nothing. Status validation is included at the same per-URL price.
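
As a rough worked example of the listed price, the cost of a run can be estimated from the number of delivered rows alone. The function name is hypothetical, and the sketch assumes simple proportional pricing; how the platform rounds partial thousands is not stated here.

```python
from decimal import Decimal

PRICE_PER_1000 = Decimal("0.01")  # USD per 1,000 delivered URLs, from the listed price


def estimated_cost(delivered_rows: int) -> Decimal:
    """Estimated cost in USD; rows sent to the rejected dataset are not counted."""
    return PRICE_PER_1000 * delivered_rows / 1000
```

Under that assumption, a site with 250,000 delivered URLs comes to about $2.50, and a run where every sitemap fails to parse comes to $0.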

Coverage

Global. Targets in any country are processed. The only exclusions are jurisdictions under US sanctions (Cuba, Iran, North Korea, Syria, Russia, Belarus, Venezuela, Myanmar, matched by country-code TLD); targets in those jurisdictions are written to the free rejected dataset and never billed.
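
Matching by country-code TLD amounts to looking at the last label of the hostname. A minimal sketch, assuming the ASCII ccTLDs of the jurisdictions listed above and a hypothetical function name; whether the actor also matches internationalized ccTLDs is not stated.

```python
from urllib.parse import urlsplit

# ASCII ccTLDs for Cuba, Iran, North Korea, Syria, Russia, Belarus, Venezuela, Myanmar.
EXCLUDED_CCTLDS = {"cu", "ir", "kp", "sy", "ru", "by", "ve", "mm"}


def is_excluded(url: str) -> bool:
    """True when the URL's hostname ends in one of the excluded ccTLDs."""
    host = (urlsplit(url).hostname or "").rstrip(".").lower()
    return host.rsplit(".", 1)[-1] in EXCLUDED_CCTLDS
```

A match on TLD alone says nothing about where a site is actually hosted, so a .com domain is never excluded by this check.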

Opt out

A domain owner can ask us to skip their domain at https://ponodata.com/opt-out. Suppressed domains return nothing and are never billed.

Sample output

A real run extracting URLs from a sitemap, with optional live-status validation:

URL	HTTP	live	source
https://developer.mozilla.org/en-US/	200	True	developer.mozilla…
https://developer.mozilla.org/en-US/404	200	True	developer.mozilla…
https://developer.mozilla.org/en-US/about	200	True	developer.mozilla…
https://developer.mozilla.org/en-US/advertising	200	True	developer.mozilla…

Every URL carries its sourceUrl (the sitemap it came from), for example https://developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz.
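
The HTTP and live columns in the sample could be produced by a concurrent check along these lines. This is a sketch under assumptions: validate and get_status are hypothetical names, get_status stands in for whatever HTTP call is used, and "live" is assumed here to mean a 2xx or 3xx response, which the listing does not define.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def validate(urls: Iterable[str],
             get_status: Callable[[str], Optional[int]],
             workers: int = 8) -> list[dict]:
    """Check URLs concurrently; results come back in input order."""

    def check(url: str) -> dict:
        status = get_status(url)  # None stands for "no response"
        # Assumption: a URL counts as live on any 2xx or 3xx status.
        live = status is not None and 200 <= status < 400
        return {"loc": url, "httpStatus": status, "live": live}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, urls))
```

Filtering to live URLs is then a one-line pass over the result, for example `[row for row in rows if row["live"]]`.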

Use cases

Seed a crawler or content audit with a site's full URL list, one clean row per URL with lastmod and changefreq when present.
Run an SEO coverage check: compare what a site publishes in its sitemaps against what is indexed, and spot stale lastmod dates.
Plan a content migration or archive: enumerate every page across a sitemap index, optionally filtered to URLs modified since a date.
Get a dead-link-free list: turn on status validation and deliver only live URLs, each with its httpStatus.

FAQ

Do I need to find the sitemap first? No. Give a bare site root and it probes the common locations, robots.txt, and the homepage link, then extracts what it finds.
Does it follow index files and gzipped sitemaps? Yes, it walks nested index files up to your depth limit and decompresses .gz sitemaps.
Does it crawl the pages themselves? No. It reads the sitemaps and returns the URLs they list; optional status validation only checks each URL's response code.
How am I billed? Per delivered URL row; failed or unparseable sitemaps, and dead URLs when you filter to live only, cost nothing.