Sitemap Sniffer

Pricing

from $1.00 / 1,000 results

Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and recursively unpacks sitemap-index files. HTTP-only, no proxy or cookies needed.

Rating: 5.0 (14)

Developer: Crawler Bros (maintained by Community)

Actor stats: 14 bookmarked, 2 total users, 1 monthly active user, last modified 3 days ago

Discover every sitemap file for a website — automatically. Reads robots.txt for Sitemap: directives, probes 16 common sitemap paths (Yoast, WordPress, sitemap-index, gzipped variants), and recursively unpacks sitemap-index files. HTTP-only, no proxy, no cookies, no API key.

What it does

You point this actor at any website and get back a structured list of every sitemap file it could find:

  • /robots.txt directives: the canonical place sites declare their sitemaps.
  • Common sitemap paths: sitemap.xml, sitemap_index.xml, wp-sitemap.xml, post-sitemap.xml, sitemap.xml.gz, and 11 more.
  • Sitemap-index expansion: when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.

For each discovered sitemap, the actor reports the URL, type (sitemap / sitemap_index / txt), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.
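
The first two discovery steps can be sketched in a few lines of Python. This is a minimal illustration, not the actor's actual code: the function names are hypothetical, and COMMON_PATHS holds only the five paths named above because the other 11 are not listed here.

```python
from urllib.parse import urljoin

# Only the five paths named in this README; the remaining 11 are not listed.
COMMON_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/post-sitemap.xml",
    "/sitemap.xml.gz",
]


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared in `Sitemap:` lines, in file order."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (simplification: a URL containing "#" would
        # be truncated), then split on the first colon only, because the URL
        # itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_urls(origin: str, robots_txt: str) -> list[str]:
    """robots.txt directives first, then common paths, without duplicates."""
    candidates = sitemaps_from_robots(robots_txt)
    candidates += [urljoin(origin, path) for path in COMMON_PATHS]
    return list(dict.fromkeys(candidates))  # order-preserving de-duplication
```

The directive name is matched case-insensitively, since sites write both Sitemap: and sitemap:.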

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string (required) | https://apify.com | Root URL or bare host (e.g. example.com). The actor extracts the origin and probes that. |
| followIndexes | boolean | true | When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to. |
| maxSitemaps | integer | 50 (1–1000) | Hard cap on the number of records emitted. Probing stops once this many are discovered. |
| fetchUrlCounts | boolean | true | Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download. |
| emitUrls | boolean | false | When true, the actor also emits one record per URL found inside each discovered sitemap (with lastmod, changefreq, priority, hreflang when present). |
| maxUrls | integer | 10000 (1–100000) | Hard cap on per-URL records when emitUrls: true. Has no effect when emitUrls: false. |
| userAgent | string (optional) | Chrome 131 | Override only if a target server filters by UA. |
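
The url field accepts either a full URL or a bare host. One way such input could be reduced to an origin is sketched below. The helper name is hypothetical, and defaulting a bare host to HTTPS is an assumption, since the description does not say which scheme the actor picks.

```python
from urllib.parse import urlsplit


def extract_origin(url: str) -> str:
    """Reduce a full URL or a bare host to scheme://host[:port]."""
    url = url.strip()
    # Assumption: a bare host such as "example.com" is treated as HTTPS.
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError(f"cannot determine host from {url!r}")
    # Path, query string, and fragment are discarded on purpose.
    return f"{parts.scheme}://{parts.netloc}"
```

With this sketch, both example.com and https://example.com/blog?page=2 reduce to https://example.com.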

Example input

{
  "url": "https://www.bbc.com",
  "followIndexes": true,
  "maxSitemaps": 50,
  "fetchUrlCounts": true
}

Output

By default, one record per discovered sitemap. When emitUrls: true, the dataset also contains one record per URL found inside each sitemap. The two shapes can be disambiguated by recordType. Empty fields are omitted (no nulls).

Sitemap record (recordType: "sitemap")

{
  "recordType": "sitemap",
  "url": "https://www.bbc.com/sitemap.xml",
  "domainHost": "www.bbc.com",
  "type": "sitemap_index",
  "httpStatus": 200,
  "contentType": "application/xml",
  "byteCount": 13450,
  "urlCount": 78,
  "isCompressed": false,
  "lastmod": "2024-12-15",
  "discoveredVia": "robots.txt",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

URL record (recordType: "url", only when emitUrls: true)

{
  "recordType": "url",
  "url": "https://www.bbc.com/news/articles/c-12345",
  "domainHost": "www.bbc.com",
  "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
  "lastmod": "2024-12-15",
  "changefreq": "hourly",
  "priority": 0.8,
  "hreflang": [{"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}],
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
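
When emitUrls is on, a consumer usually wants the two record shapes separated. A minimal sketch, assuming the dataset items have already been loaded as a list of dicts (the function name is hypothetical):

```python
from collections import defaultdict


def split_records(items: list[dict]) -> tuple[list[dict], dict[str, list[str]]]:
    """Separate sitemap records from URL records, grouping URLs by sitemap."""
    # .get() is used because empty fields are omitted rather than set to null,
    # and error records carry no recordType at all.
    sitemaps = [item for item in items if item.get("recordType") == "sitemap"]
    urls_by_sitemap: dict[str, list[str]] = defaultdict(list)
    for item in items:
        if item.get("recordType") == "url":
            urls_by_sitemap[item["sitemapUrl"]].append(item["url"])
    return sitemaps, dict(urls_by_sitemap)
```

Records with neither recordType value, such as the error record described in the FAQ, are ignored by this sketch.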

Output fields

  • recordType: either "sitemap" for sitemap-file records (always emitted) or "url" for per-URL records (only when emitUrls: true).
  • url: absolute URL of the sitemap (or the URL referenced inside a sitemap, when recordType: "url").
  • domainHost: parsed hostname of url (handy for grouping records by site when ingesting from multiple runs).
  • type: "sitemap" (a <urlset>), "sitemap_index" (a <sitemapindex> of child sitemaps), or "txt" (plain-text sitemap with one URL per line). Sitemap records only.
  • httpStatus: HTTP status code returned (200 = success). Sitemap records only.
  • contentType: Content-Type header value (without charset). Sitemap records only.
  • byteCount: response body size in bytes. Sitemap records only.
  • urlCount: number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
  • isCompressed: true when the body is gzipped (e.g. .xml.gz paths). Sitemap records only.
  • lastmod: first <lastmod> value found in the sitemap, or the per-URL <lastmod> when recordType: "url".
  • discoveredVia: "robots.txt", "common-path", or "sitemap-index" (parent index pointed here). Sitemap records only.
  • sitemapUrl: (URL records only) URL of the sitemap that contained this URL.
  • changefreq / priority / hreflang: (URL records only) standard sitemap fields when present.
  • scrapedAt: ISO-8601 timestamp of the discovery.
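
To make the type and urlCount fields concrete, here is one way a response body could be classified and counted using only the standard library. It is a sketch under simplifying assumptions, not the actor's parser: the function names are hypothetical, and any body that fails to parse as XML is treated as a plain-text sitemap.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify(body: bytes) -> tuple[str, int]:
    """Return (type, urlCount) for an already-decompressed sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Simplification: anything that is not XML counts as a text sitemap,
        # with one URL per non-empty line that starts with http:// or https://.
        lines = body.decode("utf-8", errors="replace").splitlines()
        urls = [ln for ln in lines if ln.strip().startswith(("http://", "https://"))]
        return "txt", len(urls)
    if local_name(root.tag) == "sitemapindex":
        kind, child_tag = "sitemap_index", "sitemap"
    else:
        kind, child_tag = "sitemap", "url"
    # Count direct children only: <url> entries in a urlset,
    # <sitemap> entries in an index.
    count = sum(1 for el in root if local_name(el.tag) == child_tag)
    return kind, count
```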

When to use this

  • Before crawling — feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
  • SEO audits — confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
  • Competitive research — measure how many URLs a site exposes, broken down by sitemap type (news / video / image / page).
  • Content migration — get a complete inventory of URLs declared by the source site.

FAQ

Does it need cookies, login, or a proxy? No. Sitemaps are public assets, designed to be machine-readable. The actor uses curl_cffi with a Chrome User-Agent and connects directly.

What if the site has no sitemap at all? The actor emits a single record {"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"} with a hint to check robots.txt manually. The run still completes successfully; finding no sitemaps is not treated as a failure.

Does it handle gzipped sitemaps? Yes. .xml.gz files are transparently decompressed in-memory before parsing.
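
In-memory decompression of this kind can be sketched with Python's gzip module. Detecting gzip by its two magic bytes rather than by file extension is a choice made for this illustration; the actor's actual detection logic is not documented here, and the function name is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_decompress(body: bytes) -> tuple[bytes, bool]:
    """Return (decoded body, is_compressed) without touching the disk."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False
```

A production version would likely also cap the decompressed size so that a small download cannot expand into an oversized body.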

How does it handle giant sites with thousands of sitemap files? maxSitemaps (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
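
The priority order and cap described above can be pictured as a de-duplicated, ordered merge that is cut off at maxSitemaps. This is a simplified sketch with hypothetical names: it caps a list of candidates up front, whereas the actor is described as stopping once enough sitemaps have actually been discovered.

```python
from itertools import chain, islice


def prioritise(robots: list[str], common: list[str], children: list[str],
               max_sitemaps: int = 50) -> list[str]:
    """Merge the three sources in priority order, drop repeats, apply the cap."""
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    ordered = dict.fromkeys(chain(robots, common, children))
    return list(islice(ordered, max_sitemaps))
```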

Can I get the URLs inside each sitemap? Yes — set emitUrls: true and the actor will also push one record per URL inside each discovered sitemap, with lastmod, changefreq, priority, and hreflang when present. maxUrls caps the total (default 10,000). Use recordType: "sitemap" vs recordType: "url" to disambiguate the two record shapes.

Is it safe to run on any website? Yes. The actor only fetches robots.txt and 16 well-known public paths, which comes to about 17 requests on the initial probe, plus one per sitemap-index child if followIndexes is enabled. No login pages, no admin paths, no API endpoints.