Sitemap Extractor
Pricing
from $0.10 / 1,000 results
This Apify Actor extracts all URLs from a website's sitemaps and checks their status codes via lightweight HTTP requests. It provides a clean list of valid links, acting as an ideal pre-processor to ensure your larger crawling projects target only active URLs.
Developer
Apify (maintained by Apify)
This Actor is designed to bridge the gap between discovery and crawling. By traversing a website's sitemap.xml structure, it compiles a list of the pages published in the site's sitemaps and verifies each URL's HTTP status before you commit resources to a full-scale scrape.
Features
- Recursive Sitemap Discovery: Automatically detects and traverses nested sitemaps (sitemap indexes).
- Efficiency: Validates URLs with HTTP HEAD requests, which return only response headers and so are typically faster and use less bandwidth than full GET requests.
- Proxy Support: Integrated with Apify Proxy to reduce the risk of rate limiting or blocking during the discovery phase.
- Detailed Output: Provides the final URL and the corresponding HTTP status code.
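To illustrate the HEAD-based validation described above, here is a minimal sketch using only the Python standard library. It is not the Actor's own code; the function name `head_status` and the 10-second default timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Optional


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the HTTP status code.

    Returns None when no HTTP response arrives at all
    (DNS failure, refused connection, timeout, and similar).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urllib follows redirects, so this is the status of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return error.code
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None
```

Calling `head_status` on a page URL yields an integer such as 200 or 404 without downloading the response body, which is what makes this kind of check cheap compared with a full GET.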
How it Works
- Input: You provide one or more "Start URLs", each pointing to a domain root, a sitemap, or a sitemap index.
- Extraction: The Actor parses the XML, extracting both page URLs and links to further sitemaps.
- Validation: For every page URL found, the Actor sends an HTTP HEAD request and records the returned status code.
- Deduplication: The crawler uses unique keys to ensure that even if a URL appears in multiple sitemaps, it is only checked once.
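The extraction and deduplication steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's implementation: the function `collect_page_urls` and its `fetch` callback are hypothetical names, and plain in-memory sets stand in for the crawler's unique keys. The XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Set

# Namespace used by sitemap and sitemap index files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(start_url: str, fetch: Callable[[str], bytes]) -> List[str]:
    """Walk a sitemap or sitemap index; return each page URL once, in discovery order.

    `fetch` is a caller-supplied function that returns the raw XML bytes for a URL.
    """
    seen_sitemaps: Set[str] = set()
    seen_pages: Set[str] = set()
    pages: List[str] = []
    queue: List[str] = [start_url]

    while queue:
        sitemap_url = queue.pop(0)
        if sitemap_url in seen_sitemaps:
            continue  # never parse the same sitemap twice
        seen_sitemaps.add(sitemap_url)
        root = ET.fromstring(fetch(sitemap_url))

        # <sitemap><loc> entries point at nested sitemaps to traverse later.
        for loc in root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
            if loc.text:
                queue.append(loc.text.strip())

        # <url><loc> entries are page URLs; keep only the first occurrence.
        for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
            page = (loc.text or "").strip()
            if page and page not in seen_pages:
                seen_pages.add(page)
                pages.append(page)

    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, so the same function works with a real HTTP client or with canned XML in a test.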
Usage
This Actor is ideal for:
- Pre-crawling filter: Generating a "clean" list of URLs for Actors like Website Content Crawler or Web Scraper.
- SEO Audits: Quickly identifying 404 Not Found or 500 Server Error pages listed in your sitemap.
- Site Mapping: Getting a high-level overview of a site's architecture.
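For the pre-crawling filter and SEO audit cases, the results can be split into usable and problematic URLs. The sketch below assumes each result is a dictionary with `url` and `status` keys; those field names are hypothetical, since the exact output schema is not documented here, and treating only 2xx codes as "clean" is likewise a choice made for this example.

```python
from typing import Dict, Iterable, List, Tuple, Union

Result = Dict[str, Union[str, int]]


def split_by_status(items: Iterable[Result]) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Separate results into clean URLs (2xx) and problem URLs (everything else).

    Assumes hypothetical `url` and `status` keys on every item.
    """
    clean: List[str] = []
    problems: List[Tuple[str, int]] = []
    for item in items:
        url = str(item["url"])
        status = int(item["status"])
        if 200 <= status < 300:
            clean.append(url)  # safe to hand to a downstream crawler
        else:
            problems.append((url, status))  # e.g. 404 or 500 for an SEO audit
    return clean, problems


# Example with made-up results.
sample = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/old-page", "status": 404},
    {"url": "https://example.com/report", "status": 500},
]
clean_urls, problem_urls = split_by_status(sample)
```

Here `clean_urls` would feed a follow-up crawl, while `problem_urls` lists the sitemap entries that need fixing or removal.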
Configuration
| Field | Description |
|---|---|
| Start URLs | One or more URLs to start from: a domain root, a sitemap XML file, or a sitemap index. |
| Proxy configuration | Apify Proxy settings to use for the Actor's requests. |
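As a rough illustration of how the two fields might be supplied programmatically, the sketch below builds an input object as a Python dictionary. The key names `startUrls`, `proxyConfiguration`, and `useApifyProxy` follow a convention common among Apify Actors but are assumptions here, as is the helper name `build_actor_input`; check the Actor's input schema for the actual names.

```python
import json
from typing import Any, Dict, List


def build_actor_input(start_urls: List[str], use_apify_proxy: bool = True) -> Dict[str, Any]:
    """Build an input object for the Actor.

    All key names below are assumed, not taken from the Actor's schema.
    """
    return {
        # "Start URLs": a domain root, a sitemap, or a sitemap index.
        "startUrls": [{"url": url} for url in start_urls],
        # "Proxy configuration": whether to route requests through Apify Proxy.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


actor_input = build_actor_input(["https://example.com/sitemap.xml"])
actor_input_json = json.dumps(actor_input, indent=2)
```

The resulting JSON string could be pasted into the Actor's input editor or sent through an API client, provided the key names match the real schema.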