Sitemap Sniffer
Pricing
from $1.00 / 1,000 results
Discover every sitemap file for a website. Reads robots.txt for Sitemap directives, probes common sitemap paths, and recursively unpacks sitemap-index files. HTTP-only, no proxy or cookies needed.
Developer: Crawler Bros (maintained by Community)
Discover every sitemap file for a website — automatically. Reads robots.txt for Sitemap: directives, probes 16 common sitemap paths (Yoast, WordPress, sitemap-index, gzipped variants), and recursively unpacks sitemap-index files. HTTP-only, no proxy, no cookies, no API key.
What it does
You point this actor at any website and get back a structured list of every sitemap file it could find:
- `/robots.txt` directives: the canonical place sites declare their sitemaps.
- Common sitemap paths: `sitemap.xml`, `sitemap_index.xml`, `wp-sitemap.xml`, `post-sitemap.xml`, `sitemap.xml.gz`, and 11 more.
- Sitemap-index expansion: when an index points to child sitemaps, the actor follows it (one level deep) and emits each child too.
For each discovered sitemap, the actor reports the URL, type (sitemap / sitemap_index / txt), HTTP status, content type, byte size, URL count (parsed from XML), gzip flag, last-modified date if present, and how it was discovered.
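The first two discovery steps can be sketched roughly as follows. This is an illustration, not the actor's source: the function names are hypothetical, and `COMMON_PATHS` holds only the five paths named above, since the full list of 16 is not given here.

```python
import re
from urllib.parse import urljoin, urlsplit

# Assumption: only the five paths named in this README; the actor probes 16.
COMMON_PATHS = [
    "sitemap.xml",
    "sitemap_index.xml",
    "wp-sitemap.xml",
    "post-sitemap.xml",
    "sitemap.xml.gz",
]

# Matches "Sitemap: <url>" lines case-insensitively, one directive per line.
SITEMAP_DIRECTIVE = re.compile(
    r"^[ \t]*sitemap[ \t]*:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE
)


def origin_of(url: str) -> str:
    """Return scheme://host for a full URL or a bare host such as example.com."""
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Extract the URLs declared in Sitemap: lines, keeping first-seen order."""
    found = []
    for match in SITEMAP_DIRECTIVE.finditer(robots_txt):
        if match.group(1) not in found:
            found.append(match.group(1))
    return found


def common_path_candidates(url: str) -> list[str]:
    """Build the candidate sitemap URLs to probe under the site's origin."""
    base = origin_of(url) + "/"
    return [urljoin(base, path) for path in COMMON_PATHS]
```

Fetching each candidate and reading its status code is left out; any HTTP client would do for that step.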
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string (required) | `https://apify.com` | Root URL or bare host (e.g. `example.com`). The actor extracts the origin and probes that. |
| `followIndexes` | boolean | `true` | When a sitemap-index is found, also fetch and emit the child sitemap URLs it points to. |
| `maxSitemaps` | integer | `50` (1–1000) | Hard cap on the number of records emitted. Probing stops once this many are discovered. |
| `fetchUrlCounts` | boolean | `true` | Parse each sitemap and report the number of URLs it contains. Disable to skip the full-body download. |
| `emitUrls` | boolean | `false` | When true, the actor also emits one record per URL found inside each discovered sitemap (with lastmod, changefreq, priority, hreflang when present). |
| `maxUrls` | integer | `10000` (1–100000) | Hard cap on per-URL records when `emitUrls: true`. Has no effect when `emitUrls: false`. |
| `userAgent` | string (optional) | Chrome 131 | Override only if a target server filters by UA. |
Example input
    {
      "url": "https://www.bbc.com",
      "followIndexes": true,
      "maxSitemaps": 50,
      "fetchUrlCounts": true
    }
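If you build the input programmatically, a small helper can apply the documented defaults and ranges before the run starts. The sketch below is an assumption about how a caller might do this; `build_run_input` is a hypothetical name, and the actor performs its own validation regardless.

```python
import json


def build_run_input(url, follow_indexes=True, max_sitemaps=50,
                    fetch_url_counts=True, emit_urls=False,
                    max_urls=10000, user_agent=None):
    """Assemble an input object using the field names from the Input table."""
    if not url or not url.strip():
        raise ValueError("url is required")
    if not 1 <= max_sitemaps <= 1000:
        raise ValueError("maxSitemaps must be between 1 and 1000")
    if not 1 <= max_urls <= 100000:
        raise ValueError("maxUrls must be between 1 and 100000")

    run_input = {
        "url": url.strip(),
        "followIndexes": follow_indexes,
        "maxSitemaps": max_sitemaps,
        "fetchUrlCounts": fetch_url_counts,
        "emitUrls": emit_urls,
        "maxUrls": max_urls,
    }
    # userAgent is optional, so it is only sent when the caller overrides it.
    if user_agent:
        run_input["userAgent"] = user_agent
    return run_input


if __name__ == "__main__":
    print(json.dumps(build_run_input("https://www.bbc.com"), indent=2))
```

The resulting dictionary can be pasted into the Apify console as JSON or passed as `run_input` when calling the actor through the Apify Python client.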
Output
By default, one record per discovered sitemap. When emitUrls: true, the dataset also contains one record per URL found inside each sitemap. The two shapes can be disambiguated by recordType. Empty fields are omitted (no nulls).
Sitemap record (recordType: "sitemap")
    {
      "recordType": "sitemap",
      "url": "https://www.bbc.com/sitemap.xml",
      "domainHost": "www.bbc.com",
      "type": "sitemap_index",
      "httpStatus": 200,
      "contentType": "application/xml",
      "byteCount": 13450,
      "urlCount": 78,
      "isCompressed": false,
      "lastmod": "2024-12-15",
      "discoveredVia": "robots.txt",
      "scrapedAt": "2024-12-16T14:23:11+00:00"
    }
URL record (recordType: "url", only when emitUrls: true)
    {
      "recordType": "url",
      "url": "https://www.bbc.com/news/articles/c-12345",
      "domainHost": "www.bbc.com",
      "sitemapUrl": "https://www.bbc.com/sitemaps/news/sitemap.xml",
      "lastmod": "2024-12-15",
      "changefreq": "hourly",
      "priority": 0.8,
      "hreflang": [
        {"lang": "en-GB", "href": "https://www.bbc.com/news/articles/c-12345"}
      ],
      "scrapedAt": "2024-12-16T14:23:11+00:00"
    }
Output fields
- `recordType`: `"sitemap"` for sitemap-file records (always emitted), or `"url"` for per-URL records (only when `emitUrls: true`).
- `url`: absolute URL of the sitemap (or the URL referenced inside a sitemap, when `recordType: "url"`).
- `domainHost`: parsed hostname of `url` (handy for grouping records by site when ingesting from multiple runs).
- `type`: `"sitemap"` (a `<urlset>`), `"sitemap_index"` (a `<sitemapindex>` of child sitemaps), or `"txt"` (plain-text sitemap with one URL per line). Sitemap records only.
- `httpStatus`: HTTP status code returned (200 = success). Sitemap records only.
- `contentType`: `Content-Type` header value (without charset). Sitemap records only.
- `byteCount`: response body size in bytes. Sitemap records only.
- `urlCount`: number of URL entries found inside the sitemap (or child sitemap links inside an index). Sitemap records only.
- `isCompressed`: `true` when the body is gzipped (e.g. `.xml.gz` paths). Sitemap records only.
- `lastmod`: first `<lastmod>` value found in the sitemap, or the per-URL `<lastmod>` when `recordType: "url"`.
- `discoveredVia`: `"robots.txt"`, `"common-path"`, or `"sitemap-index"` (parent index pointed here). Sitemap records only.
- `sitemapUrl`: (URL records only) URL of the sitemap that contained this URL.
- `changefreq` / `priority` / `hreflang`: (URL records only) standard sitemap fields when present.
- `scrapedAt`: ISO-8601 timestamp of the discovery.
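Because one dataset can mix both record shapes, downstream code usually splits on `recordType` first. The following is a minimal sketch with hypothetical helper names; it assumes the items have already been loaded as dictionaries and relies on `.get()` because empty fields are omitted rather than set to null.

```python
from collections import defaultdict


def split_records(items):
    """Separate dataset items into sitemap records, URL records, and the rest."""
    sitemaps, urls, other = [], [], []
    for item in items:
        kind = item.get("recordType")
        if kind == "sitemap":
            sitemaps.append(item)
        elif kind == "url":
            urls.append(item)
        else:
            # Anything without a known recordType, e.g. an error record.
            other.append(item)
    return sitemaps, urls, other


def url_counts_by_host(sitemaps):
    """Sum urlCount per domainHost, skipping index files.

    For a sitemap_index, urlCount is the number of child sitemaps,
    so adding it to page totals would mix two different units.
    """
    totals = defaultdict(int)
    for record in sitemaps:
        if record.get("type") == "sitemap_index":
            continue
        totals[record.get("domainHost", "")] += record.get("urlCount", 0)
    return dict(totals)
```

The `url` values of the sitemap records can then be handed to a downstream crawler, and the per-host totals give a quick size estimate for each site.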
When to use this
- Before crawling — feed the discovered sitemap URLs into a downstream crawler so you scrape only what's listed instead of guessing internal links.
- SEO audits — confirm a site has a sitemap, that it points to the right pages, and that index files aren't broken.
- Competitive research — measure how many URLs a site exposes, broken down by sitemap type (news / video / image / page).
- Content migration — get a complete inventory of URLs declared by the source site.
FAQ
Does it need cookies, login, or a proxy?
No. Sitemaps are public assets, designed to be machine-readable. The actor uses curl_cffi with a Chrome User-Agent and connects directly.
What if the site has no sitemap at all?
The actor emits a single record {"type": "sitemap_sniffer_error", "reason": "no_sitemaps_found"} with a hint to check robots.txt manually. The run still completes successfully — empty datasets are not treated as failures.
Does it handle gzipped sitemaps?
Yes. .xml.gz files are transparently decompressed in-memory before parsing.
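In-memory decompression followed by type detection might look like the sketch below. It is an illustration built on the Python standard library, not the actor's code: the function names are hypothetical, and detecting gzip by its two magic bytes (rather than by file extension) is an assumption.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body: bytes) -> tuple[bytes, bool]:
    """Return (decompressed body, was_compressed) without touching disk."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body), True
    return body, False


def local_name(element) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def classify(body: bytes) -> tuple[str, int]:
    """Return (type, entry count) for an already-decompressed sitemap body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML: treat it as a plain-text sitemap, one URL per line.
        lines = body.decode("utf-8", "replace").splitlines()
        urls = [ln for ln in lines if ln.strip().startswith(("http://", "https://"))]
        return "txt", len(urls)

    if local_name(root) == "sitemapindex":
        return "sitemap_index", sum(1 for c in root if local_name(c) == "sitemap")
    if local_name(root) == "urlset":
        return "sitemap", sum(1 for c in root if local_name(c) == "url")
    raise ValueError(f"unexpected root element: {local_name(root)}")
```

Counting only the direct `<url>` or `<sitemap>` children keeps nested extension elements, such as image entries, out of the total.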
How does it handle giant sites with thousands of sitemap files?
maxSitemaps (default 50, max 1000) caps the run. The actor probes in priority order: robots.txt directives first, then the most common paths, then sitemap-index children. You'll get the most useful sitemaps first even if the cap stops the run early.
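The priority order and the cap can be pictured with a short sketch. It assumes the three sources have already been resolved into lists of sitemap URLs; `merge_with_cap` is a hypothetical name, and the real actor may interleave probing and capping differently.

```python
def merge_with_cap(from_robots, from_common_paths, from_indexes, max_sitemaps=50):
    """Merge discovered sitemap URLs in priority order and stop at the cap.

    Each result carries the discoveredVia label of the first source that
    produced the URL, so a sitemap listed in robots.txt and also found at a
    common path is reported once, as "robots.txt".
    """
    sources = [
        ("robots.txt", from_robots),
        ("common-path", from_common_paths),
        ("sitemap-index", from_indexes),
    ]
    seen = set()
    merged = []
    for via, urls in sources:
        for url in urls:
            if url in seen:
                continue
            if len(merged) >= max_sitemaps:
                return merged  # cap reached: lower-priority entries are dropped
            seen.add(url)
            merged.append({"url": url, "discoveredVia": via})
    return merged
```

With a cap of 3 and four unique URLs across the sources, the last sitemap-index child is the one that gets dropped.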
Can I get the URLs inside each sitemap?
Yes — set emitUrls: true and the actor will also push one record per URL inside each discovered sitemap, with lastmod, changefreq, priority, and hreflang when present. maxUrls caps the total (default 10,000). Use recordType: "sitemap" vs recordType: "url" to disambiguate the two record shapes.
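For readers who want to see what a per-URL record is derived from, here is a rough sketch that turns one `<urlset>` body into records of that shape. It uses the standard sitemap and xhtml namespaces; the function name is hypothetical, and `domainHost` and `scrapedAt` are left out for brevity.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
XHTML = "{http://www.w3.org/1999/xhtml}"


def url_records(xml_body: bytes, sitemap_url: str, max_urls: int = 10000):
    """Build one record per <url> entry, omitting fields that are absent."""
    root = ET.fromstring(xml_body)
    records = []
    for url_el in root.iter(f"{SM}url"):
        if len(records) >= max_urls:
            break
        loc = (url_el.findtext(f"{SM}loc") or "").strip()
        if not loc:
            continue
        record = {"recordType": "url", "url": loc, "sitemapUrl": sitemap_url}

        for field in ("lastmod", "changefreq"):
            value = (url_el.findtext(f"{SM}{field}") or "").strip()
            if value:
                record[field] = value

        priority = (url_el.findtext(f"{SM}priority") or "").strip()
        if priority:
            try:
                record["priority"] = float(priority)
            except ValueError:
                pass  # malformed priority: leave the field out

        # hreflang alternates are declared as <xhtml:link rel="alternate" ...>.
        alternates = [
            {"lang": link.get("hreflang"), "href": link.get("href")}
            for link in url_el.findall(f"{XHTML}link")
            if link.get("rel") == "alternate"
            and link.get("hreflang") and link.get("href")
        ]
        if alternates:
            record["hreflang"] = alternates
        records.append(record)
    return records
```

Entries without a `<loc>` are skipped, and `max_urls` here limits a single sitemap, whereas the actor's `maxUrls` caps the whole run.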
Is it safe to run on any website?
Yes. The actor fetches only robots.txt, the sitemaps it declares, and 16 well-known public paths. The initial probe makes roughly 17 requests (robots.txt plus the 16 common paths), plus one per sitemap-index child if followIndexes is enabled. No login pages, no admin paths, no API endpoints.