XML Sitemap Finder & Extractor API
Pricing: from $1.00 / 1,000 sitemap lookups
Find and extract all XML sitemaps for any domain. Automatically parses robots.txt, scans HTML tags, and recursively follows indexes. Perfect for SEO & web scraping.
Developer: Andok
Sitemap Finder
Discover all XML sitemaps for any website by checking common file paths, parsing robots.txt directives, and scanning HTML content. Provide one or more URLs and get a complete inventory of every sitemap, validated and classified.
Features
- Multi-source discovery — checks 15+ common sitemap paths, `robots.txt` directives, and HTML `<a>`/`<link>` tags
- Batch processing — process multiple websites in a single run with configurable concurrency
- Recursive index traversal — follows sitemap index files to discover all nested child sitemaps
- Gzip support — handles `.xml.gz` compressed sitemaps automatically
- XML validation — verifies sitemaps contain valid XML and classifies them as `index` or `urlset`
- Rich metadata — reports URL count per sitemap, last modified date, discovery source, and validation status
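The gzip and validation steps can be sketched roughly as follows. This is a simplified illustration, not the actor's actual code; the function name is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

def classify_sitemap_bytes(data: bytes) -> str:
    """Decompress .xml.gz payloads if needed, then classify the sitemap.

    Returns "index", "urlset", or "unknown" (invalid XML or unexpected root).
    """
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "unknown"
    # Drop the XML namespace prefix, e.g. "{http://...}urlset" -> "urlset".
    tag = root.tag.rsplit("}", 1)[-1]
    if tag == "sitemapindex":
        return "index"
    return "urlset" if tag == "urlset" else "unknown"
```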
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | — | Website URLs to check (e.g., `["https://example.com"]`) |
| `url` | string | — | Single URL for backward compatibility. Merged into `urls` if both are set. |
| `findAll` | boolean | `true` | Find all sitemaps or stop after the first one |
| `followIndexes` | boolean | `true` | Recursively follow sitemap index files to discover child sitemaps |
| `verify` | boolean | `true` | Verify sitemaps are valid XML and extract metadata |
| `timeout` | integer | 10 | HTTP request timeout in seconds |
| `concurrency` | integer | 3 | Maximum number of websites processed concurrently (1–20) |
Example Input
```json
{
  "urls": ["https://example.com", "https://crawlee.dev"],
  "findAll": true,
  "followIndexes": true,
  "verify": true,
  "timeout": 10,
  "concurrency": 3
}
```
Output
Results are stored in the default dataset. Each record represents a discovered sitemap:
```json
{
  "websiteUrl": "https://crawlee.dev",
  "sitemapUrl": "https://crawlee.dev/sitemap.xml",
  "type": "index",
  "urlCount": 4,
  "lastModified": "2024-12-15T10:30:00Z",
  "isValid": true,
  "source": "common-location"
}
```
| Field | Description |
|---|---|
| `websiteUrl` | The input website URL |
| `sitemapUrl` | Full URL of the discovered sitemap |
| `type` | Sitemap type: `index` (contains other sitemaps), `urlset` (contains page URLs), or `unknown` |
| `urlCount` | Number of entries in the sitemap (child sitemaps for indexes, page URLs for urlsets) |
| `lastModified` | Most recent `<lastmod>` date found in the sitemap |
| `isValid` | Whether the sitemap contains valid XML |
| `source` | How the sitemap was discovered: `common-location`, `robots.txt`, `html-content`, or `index:<parent-url>` |
| `error` | Error message if the lookup failed (only present on error records) |
When no sitemaps are found for a URL, a single record is returned with `sitemapUrl: null` and an appropriate error message.
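For a single sitemap document, the `type`, `urlCount`, and `lastModified` fields can be derived along these lines. This is a sketch assuming the standard sitemap namespace; the function name is illustrative, and note that comparing `<lastmod>` values lexicographically is only reliable when they share a consistent ISO 8601 format.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_metadata(xml_text: str) -> dict:
    """Return type, entry count, and newest <lastmod> for one sitemap."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    # Child sitemaps for indexes, page URLs for urlsets.
    entry_tag = "sitemap" if kind == "index" else "url"
    entries = root.findall(f"{{{SM_NS}}}{entry_tag}")
    lastmods = [e.text for e in root.iter(f"{{{SM_NS}}}lastmod") if e.text]
    return {
        "type": kind,
        "urlCount": len(entries),
        # ISO 8601 dates in one format sort lexicographically, newest last.
        "lastModified": max(lastmods) if lastmods else None,
    }
```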
API Usage
Call the actor via the API and retrieve results from the default dataset:
```bash
curl "https://api.apify.com/v2/acts/YOUR_USERNAME~find-sitemap-from-url/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}'
```
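The same call can be issued from Python with the standard library. This sketch only constructs the request (the actor path and token are placeholders); sending it is left commented out.

```python
import json
import urllib.parse
import urllib.request

def build_run_sync_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request for the run-sync-get-dataset-items endpoint."""
    url = (
        "https://api.apify.com/v2/acts/"
        + urllib.parse.quote(actor_id)
        + "/run-sync-get-dataset-items?token="
        + urllib.parse.quote(token)
    )
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_sync_request(
    "YOUR_USERNAME~find-sitemap-from-url",
    "YOUR_TOKEN",
    {"urls": ["https://example.com"]},
)
# urllib.request.urlopen(req) would start the run and return dataset items.
```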
Pricing
This actor uses pay-per-event (PPE) pricing:
| Event | Cost |
|---|---|
| `sitemap-lookup` | $0.001 per URL processed |
You are charged once per input URL, regardless of how many sitemaps are discovered for that URL. There are no additional platform fees beyond the per-event charge.
Use Cases
- SEO auditing — verify sitemap coverage and freshness across your sites
- Web scraping — discover all available sitemaps before crawling to plan efficient scraping
- Site monitoring — track sitemap changes, URL counts, and last modified dates over time
- Competitor analysis — map out a competitor's site structure via their sitemaps
- Migration validation — confirm sitemaps are correctly set up after a site migration
- Content indexing — find all content endpoints for search engine optimization
Discovery Methods
The actor uses three complementary discovery strategies:
- Common paths — checks 15+ well-known sitemap file locations (`/sitemap.xml`, `/wp-sitemap.xml`, `/sitemap_index.xml`, etc.)
- robots.txt — parses `Sitemap:` directives from the site's robots.txt file
- HTML scanning — searches the homepage HTML for `<a>` and `<link>` tags referencing sitemaps
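The robots.txt strategy amounts to collecting `Sitemap:` directives, which may appear anywhere in the file and are case-insensitive. A minimal sketch, with an illustrative helper name:

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Extract Sitemap: directive URLs from robots.txt content."""
    found = []
    for line in robots_txt.splitlines():
        # Directives are case-insensitive; '#' starts a comment.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            url = line.split(":", 1)[1].strip()
            if url:
                found.append(url)
    return found
```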
When `followIndexes` is enabled, any discovered sitemap index is recursively expanded to reveal all child sitemaps.
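Conceptually, the recursive expansion is a graph walk over index files. A simplified sketch with an injected fetch function (so it can be mocked) and a visited set to guard against cycles; the names are illustrative, not the actor's internals:

```python
import xml.etree.ElementTree as ET
from typing import Callable

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def expand_indexes(start_url: str, fetch: Callable[[str], str]) -> list[str]:
    """Return every sitemap reachable from start_url, following index files.

    `fetch` maps a sitemap URL to its XML text.
    """
    seen, queue, result = set(), [start_url], []
    while queue:
        url = queue.pop()
        if url in seen:  # guard against cyclic or duplicate references
            continue
        seen.add(url)
        result.append(url)
        root = ET.fromstring(fetch(url))
        if root.tag == NS + "sitemapindex":
            # Each <sitemap><loc> entry points at a child sitemap.
            for loc in root.iter(NS + "loc"):
                if loc.text:
                    queue.append(loc.text.strip())
    return result
```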