XML Sitemap Finder & Extractor API

Pricing

from $1.00 / 1,000 sitemap lookups

Find and extract all XML sitemaps for any domain. The actor parses robots.txt, scans HTML tags, and recursively follows sitemap indexes. Suited to SEO and web scraping work.

Developer

Andok

Maintained by Community

Sitemap Finder

Discover all XML sitemaps for any website by checking common file paths, parsing robots.txt directives, and scanning HTML content. Provide one or more URLs and get a complete inventory of every sitemap, validated and classified.

Features

  • Multi-source discovery — checks 15+ common sitemap paths, robots.txt directives, and HTML <a> / <link> tags
  • Batch processing — process multiple websites in a single run with configurable concurrency
  • Recursive index traversal — follows sitemap index files to discover all nested child sitemaps
  • Gzip support — handles .xml.gz compressed sitemaps automatically
  • XML validation — verifies sitemaps contain valid XML and classifies them as index or urlset
  • Rich metadata — reports URL count per sitemap, last modified date, discovery source, and validation status
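
The gzip, validation, and metadata features above can be illustrated with a minimal sketch. It is not the actor's implementation: the function name `inspect_sitemap`, the gzip magic-byte check, and the string comparison of `<lastmod>` values are all assumptions made for illustration, and only the output field names come from this README.

```python
import gzip
import zlib
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}urlset' down to 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def inspect_sitemap(body: bytes) -> dict:
    """Classify raw sitemap bytes and collect basic metadata (hypothetical helper)."""
    invalid = {"type": "unknown", "urlCount": 0, "lastModified": None, "isValid": False}

    # Assumption: detect gzip by its magic bytes (0x1f 0x8b) rather than by file name.
    if body[:2] == b"\x1f\x8b":
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            return invalid

    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return invalid

    # The root element decides the type: <sitemapindex> or <urlset>.
    kind = {"sitemapindex": "index", "urlset": "urlset"}.get(local_name(root.tag), "unknown")
    entry_tag = {"index": "sitemap", "urlset": "url"}.get(kind)
    entries = [el for el in root if local_name(el.tag) == entry_tag]

    lastmods = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "lastmod" and el.text
    ]
    return {
        "type": kind,
        "urlCount": len(entries),
        # Assumption: dates share one ISO 8601 format, so plain string max() is chronological.
        "lastModified": max(lastmods) if lastmods else None,
        "isValid": True,
    }
```

Passing the raw response body of a `.xml` or `.xml.gz` file returns a dictionary whose keys mirror the `type`, `urlCount`, `lastModified`, and `isValid` output fields.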

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | | Website URLs to check (e.g., ["https://example.com"]) |
| url | string | | Single URL for backward compatibility. Merged into urls if both are set. |
| findAll | boolean | true | Find all sitemaps or stop after the first one |
| followIndexes | boolean | true | Recursively follow sitemap index files to discover child sitemaps |
| verify | boolean | true | Verify sitemaps are valid XML and extract metadata |
| timeout | integer | 10 | HTTP request timeout in seconds |
| concurrency | integer | 3 | Max concurrent website processing (1–20) |

Example Input

{
  "urls": ["https://example.com", "https://crawlee.dev"],
  "findAll": true,
  "followIndexes": true,
  "verify": true,
  "timeout": 10,
  "concurrency": 3
}
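
The input rules listed above (the `url` field merged into `urls`, the default values, and the 1 to 20 range for `concurrency`) can be sketched as a small client-side check before starting a run. The function `normalize_input` is hypothetical. Skipping a duplicate `url` and raising on an out-of-range `concurrency` are assumptions, since the README does not say how the actor handles either case.

```python
# Defaults taken from the input table in this README.
DEFAULTS = {
    "findAll": True,
    "followIndexes": True,
    "verify": True,
    "timeout": 10,
    "concurrency": 3,
}


def normalize_input(raw: dict) -> dict:
    """Merge `url` into `urls`, apply defaults, and check the concurrency range."""
    urls = list(raw.get("urls") or [])
    single = raw.get("url")
    # Assumption: a `url` already present in `urls` is not added a second time.
    if single and single not in urls:
        urls.append(single)
    if not urls:
        raise ValueError("provide at least one URL via 'urls' or 'url'")

    options = {**DEFAULTS, **{k: v for k, v in raw.items() if k in DEFAULTS}}
    # Assumption: out-of-range values are rejected rather than clamped.
    if not 1 <= options["concurrency"] <= 20:
        raise ValueError("concurrency must be between 1 and 20")
    return {"urls": urls, **options}
```

For example, `normalize_input({"url": "https://example.com"})` yields a full input object with `urls` set to a one-element list and every option at its documented default.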

Output

Results are stored in the default dataset. Each record represents a discovered sitemap:

{
  "websiteUrl": "https://crawlee.dev",
  "sitemapUrl": "https://crawlee.dev/sitemap.xml",
  "type": "index",
  "urlCount": 4,
  "lastModified": "2024-12-15T10:30:00Z",
  "isValid": true,
  "source": "common-location"
}

| Field | Description |
| --- | --- |
| websiteUrl | The input website URL |
| sitemapUrl | Full URL of the discovered sitemap |
| type | Sitemap type: index (contains other sitemaps), urlset (contains page URLs), or unknown |
| urlCount | Number of entries in the sitemap (child sitemaps for indexes, page URLs for urlsets) |
| lastModified | Most recent <lastmod> date found in the sitemap |
| isValid | Whether the sitemap contains valid XML |
| source | How the sitemap was discovered: common-location, robots.txt, html-content, or index:<parent-url> |
| error | Error message if the lookup failed (only present on error records) |

When no sitemaps are found for a URL, a single record is returned for that URL with sitemapUrl: null and an error field describing the problem.
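
As a rough illustration of consuming these records, the sketch below groups dataset items by `websiteUrl` and separates the error records. The helper names `summarize` and `children_of` are hypothetical; only the field names and the `index:<parent-url>` source format come from the tables above.

```python
from collections import defaultdict


def summarize(records: list) -> tuple:
    """Split dataset items into sitemaps per website and errors per website."""
    found = defaultdict(list)
    errors = {}
    for rec in records:
        # Records with a null sitemapUrl or an `error` field are treated as failures.
        if rec.get("sitemapUrl") is None or "error" in rec:
            errors[rec["websiteUrl"]] = rec.get("error", "no sitemap found")
        else:
            found[rec["websiteUrl"]].append(rec)
    return dict(found), errors


def children_of(records: list, parent_url: str) -> list:
    """Return the sitemap URLs that were discovered through a given index."""
    wanted = f"index:{parent_url}"
    return [r["sitemapUrl"] for r in records if r.get("source") == wanted]
```

Calling `summarize(items)` on the dataset items gives one dictionary of valid sitemap records keyed by website and one dictionary of error messages, which is usually enough to decide which sites need a manual look.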

API Usage

Call the actor via the API and retrieve results from the default dataset:

curl "https://api.apify.com/v2/acts/YOUR_USERNAME~find-sitemap-from-url/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}'
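
The same call can be made from Python using only the standard library. This is a sketch that mirrors the curl command above: the helper names `build_request` and `find_sitemaps` are hypothetical, the actor ID and token are the same placeholders, and it assumes the endpoint returns the dataset items as a JSON array.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_sitemaps(actor_id: str, token: str, urls: list, timeout: float = 120.0) -> list:
    """Run the actor synchronously and return the dataset items."""
    request = build_request(actor_id, token, {"urls": urls})
    # Assumption: the response body is a JSON array of sitemap records.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `find_sitemaps("YOUR_USERNAME~find-sitemap-from-url", "YOUR_TOKEN", ["https://example.com"])` blocks until the run finishes, so the client timeout should be generous for large batches.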

Pricing

This actor uses pay-per-event (PPE) pricing:

| Event | Cost |
| --- | --- |
| sitemap-lookup | $0.001 per URL processed |

You are charged once per input URL, regardless of how many sitemaps are discovered for that URL. There are no additional platform fees beyond the per-event charge.
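
Because the charge depends only on the number of input URLs, the cost of a run is simple arithmetic. The sketch below uses the $0.001 rate stated above; the function name `estimate_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the result.

```python
from decimal import Decimal

# Rate from the pricing table: $0.001 per sitemap-lookup event.
PRICE_PER_LOOKUP = Decimal("0.001")


def estimate_cost(num_input_urls: int) -> Decimal:
    """Estimated charge in USD for a run with the given number of input URLs."""
    if num_input_urls < 0:
        raise ValueError("num_input_urls must not be negative")
    return PRICE_PER_LOOKUP * num_input_urls
```

With this rate, 1,000 input URLs come to $1.00, which matches the headline price, and the number of sitemaps found per site does not change the total.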

Use Cases

  • SEO auditing — verify sitemap coverage and freshness across your sites
  • Web scraping — discover all available sitemaps before crawling to plan efficient scraping
  • Site monitoring — track sitemap changes, URL counts, and last modified dates over time
  • Competitor analysis — map out a competitor's site structure via their sitemaps
  • Migration validation — confirm sitemaps are correctly set up after a site migration
  • Content indexing — find all content endpoints for search engine optimization

Discovery Methods

The actor uses three complementary discovery strategies:

  1. Common paths — checks 15+ well-known sitemap file locations (/sitemap.xml, /wp-sitemap.xml, /sitemap_index.xml, etc.)
  2. robots.txt — parses Sitemap: directives from the site's robots.txt file
  3. HTML scanning — searches the homepage HTML for <a> and <link> tags referencing sitemaps
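
The three strategies can be sketched with the standard library as follows. This is an illustration and not the actor's code: the path list contains only the three locations named above (the full list of 15+ paths is not given here), and matching `href` values that contain "sitemap" or links with `rel="sitemap"` is an assumed heuristic. All function and class names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Only the paths named in this README; the actor's full list is not documented here.
COMMON_PATHS = ["/sitemap.xml", "/wp-sitemap.xml", "/sitemap_index.xml"]


def common_candidates(base_url: str) -> list:
    """Strategy 1: build candidate URLs from well-known paths."""
    return [urljoin(base_url, path) for path in COMMON_PATHS]


def sitemaps_from_robots(robots_txt: str) -> list:
    """Strategy 2: collect the values of `Sitemap:` directives (case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


class _LinkCollector(HTMLParser):
    """Collects <a> and <link> targets that look like sitemap references."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = (attributes.get("rel") or "").lower()
        # Assumed heuristic: rel="sitemap" or "sitemap" anywhere in the href.
        if href and (rel == "sitemap" or "sitemap" in href.lower()):
            self.found.append(urljoin(self.base_url, href))


def sitemaps_from_html(base_url: str, html: str) -> list:
    """Strategy 3: scan homepage HTML for sitemap links."""
    collector = _LinkCollector(base_url)
    collector.feed(html)
    return collector.found
```

Each function returns a plain list of candidate URLs, so the results of the three strategies can be concatenated and de-duplicated before any of them is fetched and verified.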

When followIndexes is enabled, any discovered sitemap index is recursively expanded to reveal all child sitemaps.
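
Recursive expansion of sitemap indexes can be sketched as a breadth-first traversal. The version below is a simplified illustration under stated assumptions: `fetch` is a hypothetical callable that returns the sitemap body or `None`, the `max_depth` limit and the visited set are safeguards added for the example, and the starting `source` value is passed in by the caller. Only the `index:<parent-url>` source format comes from the output table.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Callable, Optional


def _local(tag: str) -> str:
    """Drop the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def child_locs(xml_bytes: bytes) -> list:
    """Return the <loc> of each <sitemap> entry if the document is a sitemap index."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return []
    if _local(root.tag) != "sitemapindex":
        return []
    locs = []
    for entry in root:
        if _local(entry.tag) != "sitemap":
            continue
        for child in entry:
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


def expand_indexes(
    start_url: str,
    fetch: Callable[[str], Optional[bytes]],
    source: str = "common-location",
    max_depth: int = 5,
) -> list:
    """Follow sitemap indexes breadth-first and record how each sitemap was found."""
    records = []
    seen = set()
    queue = deque([(start_url, source, 0)])
    while queue:
        url, found_via, depth = queue.popleft()
        if url in seen:
            continue  # guards against indexes that reference each other
        seen.add(url)
        body = fetch(url)
        if body is None:
            continue
        records.append({"sitemapUrl": url, "source": found_via})
        if depth < max_depth:
            for loc in child_locs(body):
                queue.append((loc, f"index:{url}", depth + 1))
    return records
```

Passing a dictionary's `get` method as `fetch` is a convenient way to try the traversal against canned XML without making network requests.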