Pricing

from $1.00 / 1,000 results

Sitemap Sniffer

Discover every sitemap file for a website. The actor reads robots.txt for Sitemap directives, probes common sitemap paths, and expands sitemap-index files one level deep. It uses plain HTTP requests; no proxy or cookies are needed.

Developer

Crawler Bros

Last modified

2 months ago

What it does

You point this actor at any website and get back a structured list of every sitemap file it could find:

/robots.txt directives — the canonical place sites declare their sitemaps.
Common sitemap paths — sitemap.xml, sitemap_index.xml, wp-sitemap.xml, post-sitemap.xml, sitemap.xml.gz, and 11 more.
Sitemap-index expansion — when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.
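The robots.txt step can be illustrated with a short sketch. This is not the actor's own code; the function name `sitemap_directives` is hypothetical, and the sketch assumes that `#` starts a comment and that sitemap URLs carry no fragment.

```python
def sitemap_directives(robots_txt: str) -> list[str]:
    """Return URLs declared by `Sitemap:` lines, in file order, without duplicates."""
    found: list[str] = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, so anything after it is dropped.
        line = raw_line.split("#", 1)[0].strip()
        # partition splits on the first ':' only, so 'https://...' stays intact.
        key, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found
```

Directives that repeat the same URL are collapsed, which keeps a later probe from fetching the same file twice.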

For each discovered sitemap, the actor reports the URL, type (sitemap / sitemap_index / txt), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.

Input

Field	Type	Default	Description
`url`	string (required)	`https://apify.com`	Root URL or bare host (e.g. `example.com`). The actor extracts the origin and probes that.
`followIndexes`	boolean	`true`	When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to.
`maxSitemaps`	integer	`50` (1–1000)	Hard cap on the number of sitemap records emitted. Probing stops once this many are discovered.
`fetchUrlCounts`	boolean	`true`	Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download.
`emitUrls`	boolean	`false`	When `true`, the actor also emits one record per URL found inside each discovered sitemap (with `lastmod`, `changefreq`, `priority`, `hreflang` when present).
`maxUrls`	integer	`10000` (1–100000)	Hard cap on per-URL records when `emitUrls: true`. Has no effect when `emitUrls: false`.
`userAgent`	string (optional)	Chrome 131	Override only if a target server filters by UA.
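As a rough illustration of how a root URL or bare host might be reduced to an origin, here is a minimal sketch. The function name `to_origin` is hypothetical, and defaulting a bare host to `https` is an assumption; the actor's actual behavior may differ.

```python
from urllib.parse import urlsplit


def to_origin(url_or_host: str) -> str:
    """Reduce a URL or bare host to scheme://host[:port]."""
    text = url_or_host.strip()
    # Assumption: a bare host such as 'example.com' is treated as https.
    if "://" not in text:
        text = "https://" + text
    parts = urlsplit(text)
    if not parts.netloc:
        raise ValueError(f"cannot extract a host from {url_or_host!r}")
    # Path, query, and fragment are discarded; only the origin is kept.
    return f"{parts.scheme}://{parts.netloc}"
```

With this sketch, `to_origin("example.com")` and `to_origin("https://example.com/blog?page=2")` both give `https://example.com`.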

Example input

{
  "url": "https://www.bbc.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true
}

Output

By default, the dataset contains one record per discovered sitemap. When `emitUrls: true`, it also contains one record per URL found inside each sitemap. The `recordType` field distinguishes the two shapes. Empty fields are omitted rather than set to null.

Sitemap record (`recordType: "sitemap"`)

{
  "recordType": "sitemap",
  "url": "https://www.bbc.com/sitemap.xml",
  "domainHost": "www.bbc.com",
  "type": "sitemap_index",
  "httpStatus": 200,
  "contentType": "application/xml",
  "byteCount": 13450,
  "urlCount": 78,
  "isCompressed": false,
  "lastmod": "2024-12-15",
  "discoveredVia": "robots.txt",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

URL record (`recordType: "url"`, only when `emitUrls: true`)

{
  "recordType": "url",
  "url": "https://www.bbc.com/news/articles/c-12345",
  "domainHost": "www.bbc.com",
  "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
  "lastmod": "2024-12-15",
  "changefreq": "hourly",
  "priority": 0.8,
  "hreflang": [{"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}],
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
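When both record shapes land in the same dataset, a consumer typically separates them first. The sketch below assumes the items have already been loaded as a list of dicts (for example from a JSON export); the helper name `split_records` is hypothetical.

```python
from collections import defaultdict


def split_records(items: list[dict]) -> tuple[list[dict], dict[str, list[str]]]:
    """Separate sitemap records from URL records, grouping URLs by their sitemap."""
    sitemaps: list[dict] = []
    urls_by_sitemap: dict[str, list[str]] = defaultdict(list)
    for item in items:
        kind = item.get("recordType")
        if kind == "sitemap":
            sitemaps.append(item)
        elif kind == "url":
            # sitemapUrl names the sitemap that contained this URL.
            urls_by_sitemap[item.get("sitemapUrl", "")].append(item["url"])
        # Anything without a known recordType is skipped.
    return sitemaps, dict(urls_by_sitemap)
```

Records without a recognized `recordType` are ignored, so unexpected rows do not break the grouping.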

Output fields

recordType — "sitemap" for sitemap-file records (always emitted), or "url" for per-URL records (only when emitUrls: true).
url — absolute URL of the sitemap (or the URL referenced inside a sitemap, when recordType: "url").
domainHost — parsed hostname of url (handy for grouping records by site when ingesting from multiple runs).
type — "sitemap" (a <urlset>), "sitemap_index" (a <sitemapindex> of child sitemaps), or "txt" (plain-text sitemap with one URL per line). Sitemap records only.
httpStatus — HTTP status code returned (200 = success). Sitemap records only.
contentType — Content-Type header value (without charset). Sitemap records only.
byteCount — response body size in bytes. Sitemap records only.
urlCount — number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
isCompressed — true when the body is gzipped (e.g. .xml.gz paths). Sitemap records only.
lastmod — first <lastmod> value found in the sitemap, or the per-URL <lastmod> when recordType: "url".
discoveredVia — "robots.txt", "common-path", or "sitemap-index" (parent index pointed here). Sitemap records only.
sitemapUrl — (URL records only) URL of the sitemap that contained this URL.
changefreq / priority / hreflang — (URL records only) standard sitemap fields when present.
scrapedAt — ISO-8601 timestamp of the discovery.
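To make the `type` and `urlCount` fields concrete, here is one way they could be derived from a response body with the standard library. This is a sketch under stated assumptions, not the actor's parser: the name `classify_sitemap` is hypothetical, any body that fails to parse as XML is treated as a plain-text sitemap, and only direct children of the root are counted.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def classify_sitemap(body: bytes) -> tuple[str, int]:
    """Return (type, entry count) for an already-decompressed sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Assumption: non-XML bodies are plain-text sitemaps, one URL per line.
        lines = body.decode("utf-8", errors="replace").splitlines()
        urls = [ln for ln in lines if ln.strip().startswith(("http://", "https://"))]
        return "txt", len(urls)
    root_name = local_name(root.tag)
    if root_name == "urlset":
        kind, child = "sitemap", "url"
    elif root_name == "sitemapindex":
        kind, child = "sitemap_index", "sitemap"
    else:
        raise ValueError(f"unexpected root element: {root_name}")
    # Count direct children only, so nested <image:loc> entries are not counted.
    return kind, sum(1 for el in root if local_name(el.tag) == child)
```

Counting `<url>` or `<sitemap>` children rather than every `<loc>` avoids inflating the count on image or video sitemaps, where each entry can carry extra `<loc>` elements.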

When to use this

Before crawling — feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
SEO audits — confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
Competitive research: measure how many URLs a site exposes across its sitemaps (for example, separate news, video, image, and page sitemaps).
Content migration — get a complete inventory of URLs declared by the source site.

FAQ

Does it need cookies, login, or a proxy? No. Sitemaps are public assets, designed to be machine-readable. The actor uses curl_cffi with a Chrome User-Agent and connects directly.

What if the site has no sitemap at all? The actor emits a single record {"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"} with a hint to check robots.txt manually. The run still completes successfully: finding no sitemaps is not treated as a failure.
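Because the run succeeds either way, a downstream script may want to check for that record explicitly. A minimal sketch, assuming the dataset items are available as a list of dicts and using the hypothetical helper name `no_sitemaps_found`:

```python
def no_sitemaps_found(items: list[dict]) -> bool:
    """True when the dataset contains the 'no sitemaps' error record."""
    return any(
        item.get("type") == "sitemap_sniffer_error"
        and item.get("reason") == "no_sitemaps_found"
        for item in items
    )
```

Matching on both `type` and `reason` keeps the check from firing on ordinary sitemap records, whose `type` is one of the sitemap kinds.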

Does it handle gzipped sitemaps? Yes. .xml.gz files are transparently decompressed in-memory before parsing.
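In-memory decompression of this kind can be sketched with the standard `gzip` module. The helper name `maybe_gunzip` is hypothetical, and detecting gzip by its two magic bytes (rather than by file extension or headers) is an assumption about one reasonable approach, not a description of the actor's internals.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> tuple[bytes, bool]:
    """Return (decoded body, was_compressed) without touching the disk."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False
```

Checking the magic bytes also covers servers that serve a gzipped file from a path without a `.gz` suffix.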

How does it handle giant sites with thousands of sitemap files? maxSitemaps (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
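The priority order and the cap can be pictured as a single ordered candidate list that is cut off at `maxSitemaps`. The sketch below is a simplification: it assumes all three sources are known up front, whereas a real run discovers index children while fetching. The function name `probe_order` is hypothetical, and `COMMON_PATHS` lists only the five paths named in this README, not the full set.

```python
from itertools import islice

# Only the paths named in this README; the actor probes more.
COMMON_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/post-sitemap.xml",
    "/sitemap.xml.gz",
]


def probe_order(origin, robots_urls, index_children, max_sitemaps=50):
    """Return up to max_sitemaps (url, discoveredVia) pairs in priority order."""
    seen = set()

    def candidates():
        sources = (
            ("robots.txt", robots_urls),
            ("common-path", [origin + path for path in COMMON_PATHS]),
            ("sitemap-index", index_children),
        )
        for via, urls in sources:
            for url in urls:
                # A URL keeps the label of the first source that produced it.
                if url not in seen:
                    seen.add(url)
                    yield url, via

    return list(islice(candidates(), max_sitemaps))
```

Because the generator is consumed lazily, lower-priority candidates are never produced once the cap is reached.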

Can I get the URLs inside each sitemap? Yes — set emitUrls: true and the actor will also push one record per URL inside each discovered sitemap, with lastmod, changefreq, priority, and hreflang when present. maxUrls caps the total (default 10,000). Use recordType: "sitemap" vs recordType: "url" to disambiguate the two record shapes.

Is it safe to run on any website? Yes. The initial probe fetches only robots.txt and 16 well-known public paths, about 17 requests in total, plus one request per sitemap-index child if followIndexes is enabled. It never touches login pages, admin paths, or API endpoints.
