Sitemap Extractor
Pricing
from $0.10 / 1,000 results
This Apify Actor extracts all URLs from a website's sitemaps and checks their status codes via lightweight HTTP requests. It provides a clean list of valid links, acting as an ideal pre-processor to ensure your larger crawling projects target only active URLs.
Developer
Apify
This Actor is designed to bridge the gap between discovery and crawling. By traversing a website's sitemap.xml structure, it compiles a comprehensive list of all published pages and verifies their status before you commit resources to a full-scale scrape.
Features
- Recursive Sitemap Discovery: Automatically detects and traverses nested sitemaps (sitemap indexes).
- Efficiency: Uses HTTP HEAD requests for URL validation. Because a HEAD response carries no body, these requests are typically faster and use less bandwidth than full GET requests.
- Proxy Support: Integrated with Apify Proxy to reduce the risk of rate limiting or blocking during the discovery phase.
- Detailed Output: Provides the final URL, the corresponding HTTP status code, and the date-time of the page's last modification.
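The HEAD-based status check described above can be sketched with the Python standard library. This is a minimal illustration, not the Actor's actual implementation; the function name `head_status` and the 10-second default timeout are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Optional


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the HTTP status code.

    Returns None when no response was received at all
    (DNS failure, refused connection, timeout).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urllib follows redirects by default, so the code reported
        # here is the one from the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses raise HTTPError but still carry a status code.
        return error.code
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        return None
```

Some servers answer HEAD differently from GET (for example with 405 Method Not Allowed), so a non-200 result from a check like this is a hint worth verifying rather than proof that a page is broken.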
How it Works
- Input: You provide one or more "Start URLs", each pointing to a domain root, a sitemap, or a sitemap index.
- Extraction: The Actor parses the XML, extracting both page URLs and links to further sitemaps.
- Validation: For every page URL found, the Actor sends an HTTP HEAD request and records the returned status code.
- Deduplication: The crawler uses unique keys to ensure that even if a URL appears in multiple sitemaps, it is only checked once.
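The extraction and deduplication steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the Actor's code: `parse_sitemap`, `collect_pages`, and the `fetch` callable (which maps a sitemap URL to its XML text) are hypothetical names, and the sketch assumes every start URL is already a sitemap or sitemap index rather than a domain root.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (pages, child_sitemaps) for one sitemap document.

    pages is a list of (url, lastmod_or_None) tuples;
    child_sitemaps is a list of nested sitemap URLs.
    """
    root = ET.fromstring(xml_text)
    pages, children = [], []
    if root.tag == NS + "sitemapindex":
        for entry in root.iter(NS + "sitemap"):
            loc = entry.findtext(NS + "loc")
            if loc and loc.strip():
                children.append(loc.strip())
    elif root.tag == NS + "urlset":
        for entry in root.iter(NS + "url"):
            loc = entry.findtext(NS + "loc")
            lastmod = entry.findtext(NS + "lastmod")
            if loc and loc.strip():
                pages.append((loc.strip(), lastmod.strip() if lastmod else None))
    return pages, children


def collect_pages(start_urls, fetch):
    """Traverse sitemaps breadth-first and return {page_url: lastmod}.

    Each sitemap is fetched once and each page URL is kept once,
    even if it is listed in several sitemaps.
    """
    seen_sitemaps = set()
    pages = {}
    queue = deque(start_urls)
    while queue:
        sitemap_url = queue.popleft()
        if sitemap_url in seen_sitemaps:
            continue
        seen_sitemaps.add(sitemap_url)
        found, children = parse_sitemap(fetch(sitemap_url))
        queue.extend(children)
        for url, lastmod in found:
            pages.setdefault(url, lastmod)  # the first occurrence wins
    return pages
```

Here the URL string itself serves as the unique key. A production crawler would likely normalize URLs before comparing them (for example, trailing slashes or letter case in the host name); that step is omitted here.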
Output
For each page URL, the Actor outputs:
| Field | Description |
|---|---|
| url | The page URL from the sitemap. |
| status | The HTTP status code returned by the HEAD request. |
| lastmod | Best-effort last-modification time (ISO 8601). See the note below. |
A note on last-modification data
The lastmod field is a single best-effort timestamp derived from two sources, in this order of preference:
- The <lastmod> tag declared for the URL in the sitemap.
- The Last-Modified HTTP header returned by the page (used only when the sitemap has no <lastmod>).
We cannot guarantee that this information is available. Both sources are optional: many sitemaps omit <lastmod> entirely, and a lot of servers don't send a Last-Modified header (this is especially common for dynamically generated pages). When neither source provides a value, lastmod is null. Even when present, the value is self-reported by the site and may not reflect the true last-modification time of the content.
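The order of preference described in this note can be illustrated with a small helper. This is a sketch under assumptions: the function name `resolve_lastmod` is hypothetical, the sitemap value is passed through unchanged (the Actor may normalize it differently), and the header is converted to UTC in ISO 8601 form.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def resolve_lastmod(sitemap_lastmod: Optional[str],
                    last_modified_header: Optional[str]) -> Optional[str]:
    """Pick one best-effort timestamp: sitemap <lastmod> first, header second."""
    # 1. Prefer the <lastmod> value declared in the sitemap.
    if sitemap_lastmod and sitemap_lastmod.strip():
        return sitemap_lastmod.strip()
    # 2. Fall back to the Last-Modified HTTP header, if it parses.
    if last_modified_header:
        try:
            parsed = parsedate_to_datetime(last_modified_header)
        except (TypeError, ValueError):
            return None  # malformed header: treat it as missing
        if parsed.tzinfo is None:
            # No usable offset in the header; assume UTC for this sketch.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).isoformat()
    # 3. Neither source is available.
    return None
```

For example, `resolve_lastmod(None, "Wed, 21 Oct 2015 07:28:00 GMT")` yields `"2015-10-21T07:28:00+00:00"`, while a URL with neither source yields `None`, matching the null case described above.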
Usage
This Actor is ideal for:
- Pre-crawling filter: Generating a "clean" list of URLs for Actors like Website Content Crawler or Web Scraper.
- SEO Audits: Quickly identifying 404 Not Found or 500 Server Error pages listed in your sitemap.
- Site Mapping: Getting a high-level overview of a site's architecture.
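As a sketch of the pre-crawling filter and SEO audit use cases, the helper below splits output items into URLs that look safe to crawl and items that need review. It relies only on the url and status fields from the Output table. The function name `split_by_status` is hypothetical, and treating only 2xx codes as "clean" is a choice made for this example.

```python
def split_by_status(items):
    """Split output items into (clean_urls, needs_review).

    clean_urls: URLs whose status is 2xx.
    needs_review: full items with any other status, or with no status.
    """
    clean_urls, needs_review = [], []
    for item in items:
        status = item.get("status")
        if status is not None and 200 <= status < 300:
            clean_urls.append(item["url"])
        else:
            needs_review.append(item)
    return clean_urls, needs_review


# Hypothetical items shaped like the Output table.
sample = [
    {"url": "https://example.com/", "status": 200, "lastmod": "2024-01-01"},
    {"url": "https://example.com/old", "status": 404, "lastmod": None},
    {"url": "https://example.com/api", "status": 500, "lastmod": None},
]
clean, review = split_by_status(sample)
print(clean)                         # URLs to hand to a downstream crawler
print([i["status"] for i in review])  # status codes worth auditing
```

The clean list can then serve as the start URLs of a larger crawl, while the review list covers the 404 and 500 entries an SEO audit would look for.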
Configuration
| Field | Description |
|---|---|
| Start URLs | A domain root URL or a list of sitemap XML URLs to start from. |
| Proxy configuration | Settings for Apify Proxies. |