Sitemap Extractor: Every URL, Recursive, Reliable

Pricing: $0.08 per 1,000 extracted URLs

Reads sitemap.xml, sitemap index files, .gz compressed sitemaps, and robots.txt Sitemap directives, and returns one clean row per URL with lastmod, changefreq, and priority. Billed only per delivered URL.

Developer: Pono Data (maintained by Community)

Sitemap Extractor

Give it a sitemap URL, a robots.txt URL, or a site root. It returns one clean row per URL, following sitemap index files, .gz compressed sitemaps, and the Sitemap: directives in robots.txt. Every row carries the sitemap it came from, so any row is verifiable at its source.

What it does that other actors split across three

  • Finds the sitemap for you. Give a bare site root and it probes the common sitemap locations, robots.txt, and the homepage's <link rel="sitemap">, then extracts what it finds. Other actors either find the sitemap or extract one, not both.
  • Validates each URL (optional). Turn on status checking and every row gets an httpStatus and a live flag, checked concurrently. Filter to live URLs only for a list with no dead links.
  • Reads the sitemap extensions. Image, news, video, and hreflang-alternate data are pulled out per URL when the sitemap carries them, not dropped.

Plus the basics done right: index recursion, .gz, lastmod filtering, dedupe, and a sourceUrl on every row.
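To illustrate the discovery step for a bare site root, here is a minimal sketch that reads the Sitemap: directives out of a robots.txt body and falls back to guessed locations. The function names and the COMMON_PATHS list are assumptions made for illustration; the actor's real probe list is not documented here, and the homepage <link rel="sitemap"> check is left out.

```python
from typing import List
from urllib.parse import urljoin

# Assumed "common locations"; hypothetical, not the actor's documented list.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz"]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named by `Sitemap:` lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(site_root: str, robots_txt: str = "") -> List[str]:
    """Candidates for a site root: robots.txt entries first, then guesses."""
    guesses = [urljoin(site_root, path) for path in COMMON_PATHS]
    ordered = sitemaps_from_robots(robots_txt) + guesses
    return list(dict.fromkeys(ordered))  # dedupe while keeping order
```

Putting the robots.txt entries first reflects that they are declared by the site, while the guessed paths may simply miss.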

Input

  • Start URLs: sitemap URLs, robots.txt URLs, or site roots. A site root is resolved by probing the common locations, robots.txt, and the homepage link.
  • Max delivered URLs: cap on delivered rows (0 means no cap).
  • Max sitemap-index depth: how deep to follow nested index files.
  • Only URLs modified on/after: optional YYYY-MM-DD lastmod filter (entries with no lastmod are kept).
  • Include changefreq and priority: toggle the optional hint fields.
  • Validate URL status codes: add httpStatus and live to every row.
  • Deliver only live URLs: with validation on, drop dead URLs to the free rejected dataset.
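The lastmod filter rule (keep entries on or after the date, and keep entries with no lastmod) can be sketched as below. The name keep_entry is hypothetical, and treating an unparseable lastmod like a missing one is an assumption, not documented behavior.

```python
from datetime import date
from typing import Optional


def keep_entry(lastmod: Optional[str], since: date) -> bool:
    """True if an entry passes an on/after filter; no lastmod means keep."""
    if not lastmod:
        return True
    try:
        # Sitemap lastmod values use W3C datetime; when a full date is
        # present, the first 10 characters are YYYY-MM-DD.
        modified = date.fromisoformat(lastmod.strip()[:10])
    except ValueError:
        # Assumption: a date that cannot be parsed is treated as missing.
        return True
    return modified >= since
```

For example, keep_entry("2023-12-31T23:59:59+00:00", date(2024, 1, 1)) is False, while an entry with no lastmod passes any filter date.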

Output

One row per URL: loc, lastmod, changefreq, priority, host, depth, the optional httpStatus and live, the image / news / video / hreflang extension fields when present, sourceUrl (the sitemap it came from), retrievedAt, confidence, dataSource.

How it works

Sitemaps are published by sites for machines to read. The actor fetches them with a declared User-Agent, decompresses .gz, follows index files up to your depth limit, dedupes by URL, and never invents a field: a value is emitted only if it is present in the sitemap. A supplied sitemap that fails to fetch or parse is recorded in the free rejected dataset; a probe of a guessed location that misses is silent. Status checks, when on, run concurrently so validation stays fast.
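The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's code: walk and the injected fetch callable are hypothetical names, the depth rule (an index at depth d is followed only while d is below max_depth) is one possible reading of the depth limit, and the User-Agent handling, extension fields, and rejected dataset are omitted.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def walk(url: str, fetch: Callable[[str], bytes], max_depth: int,
         depth: int = 0, seen: Optional[Set[str]] = None) -> Iterator[dict]:
    """Yield one row per unique URL, following sitemap index files."""
    seen = set() if seen is None else seen
    body = fetch(url)
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes
        body = gzip.decompress(body)
    root = ET.fromstring(body)

    if root.tag == NS + "sitemapindex":
        if depth >= max_depth:  # assumed meaning of the depth limit
            return
        for child in root.findall(f"{NS}sitemap/{NS}loc"):
            if child.text:
                yield from walk(child.text.strip(), fetch,
                                max_depth, depth + 1, seen)
        return

    for entry in root.findall(NS + "url"):
        loc = (entry.findtext(NS + "loc") or "").strip()
        if not loc or loc in seen:  # dedupe by URL
            continue
        seen.add(loc)
        row = {"loc": loc, "sourceUrl": url, "depth": depth}
        # Emit optional fields only when the sitemap actually carries them.
        for field in ("lastmod", "changefreq", "priority"):
            value = entry.findtext(NS + field)
            if value is not None:
                row[field] = value.strip()
        yield row
```

Passing fetch in as a parameter keeps the sketch independent of any HTTP library; in practice it would be a function that downloads the URL and returns the raw bytes.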

Billing

Pay per delivered URL row. Failed or unparseable sitemaps, and dead URLs when you filter to live only, cost nothing. Status validation is included at the same per-URL price.
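As a rough estimate of what a run costs, assuming the listed price of $0.08 per 1,000 extracted URLs and no platform rounding (the rounding rules are not stated here), the arithmetic is just delivered rows times the unit price. The function name is hypothetical.

```python
from decimal import Decimal

# Listed price: $0.08 per 1,000 delivered URL rows.
PRICE_PER_1000 = Decimal("0.08")


def estimate_cost(delivered_rows: int) -> Decimal:
    """Estimated charge in USD; rejected rows are excluded by the caller."""
    return PRICE_PER_1000 * delivered_rows / 1000
```

For instance, estimate_cost(250_000) comes to 20 dollars, and a run whose sitemaps all fail to parse delivers zero rows and costs nothing.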

Coverage

Global. Targets in any country are processed, with one exclusion: jurisdictions under US sanctions (Cuba, Iran, North Korea, Syria, Russia, Belarus, Venezuela, Myanmar), matched by country-code TLD. Targets in those jurisdictions are written to the free rejected dataset and never billed.
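Matching by country-code TLD could look like the sketch below. The ccTLD set is the standard code for each country named above; the function name and the exact matching rule (last label of the hostname) are assumptions, not the actor's documented logic.

```python
from urllib.parse import urlsplit

# ccTLDs for Cuba, Iran, North Korea, Syria, Russia, Belarus, Venezuela, Myanmar.
EXCLUDED_CCTLDS = {"cu", "ir", "kp", "sy", "ru", "by", "ve", "mm"}


def is_excluded(url: str) -> bool:
    """True if the URL's hostname ends in one of the excluded ccTLDs."""
    host = urlsplit(url).hostname or ""  # empty when the URL has no scheme
    last_label = host.rstrip(".").rpartition(".")[2]
    return last_label in EXCLUDED_CCTLDS
```

Only the final hostname label is checked, so https://ru.example.com/ is not excluded while https://shop.example.com.ve/ is.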

Opt out

A domain owner can ask us to skip their domain at https://ponodata.com/opt-out. Suppressed domains are never returned and never billed.

Sample output

A real run extracting URLs from a sitemap, with optional live-status validation:

URL | HTTP | live | source
--- | --- | --- | ---
https://developer.mozilla.org/en-US/ | 200 | True | developer.mozilla…
https://developer.mozilla.org/en-US/404 | 200 | True | developer.mozilla…
https://developer.mozilla.org/en-US/about | 200 | True | developer.mozilla…
https://developer.mozilla.org/en-US/advertising | 200 | True | developer.mozilla…

Every URL carries its sourceUrl (the sitemap it came from), for example https://developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz.

See also

More clean, pay-only-for-results data tools from Pono Data are listed in the full catalog: https://apify.com/thoob