Pricing

from $4.00 / 1,000 results

🗺️ Sitemap Scraper & Analyzer

Extract thousands of URLs from complex nested XML sitemaps to audit website structure, HTTP status codes, and indexability for technical SEO.

Rating

0.0

(0)

Developer

naoki anzai

Last modified

22 days ago

Sitemap Analyzer API | Coverage, Freshness & Indexability

Bypass heavy recursive web crawlers by directly extracting all underlying pages from any website's XML sitemaps. This Sitemap Scraper & Analyzer API is a purpose-built data extraction utility designed for SEO professionals and developers who need instant access to a domain's complete URL inventory. Instead of navigating confusing menus, simply use a sitemap URL to parse complex nested index files and extract thousands of links in seconds without spinning up a headless browser.

Users frequently schedule this scraper to run daily to monitor website structure changes, feed fresh URLs into search engines like Google, or supply seed lists to other web scraping tools. It serves as the perfect initial step before running heavier article or contact details extractors. Beyond just pulling links, this tool performs optional HEAD requests to verify if pages are active, returning detailed structural data including HTTP response codes, directory depth metrics, and file types.

Whether you are conducting a large site audit, mapping a competitor's website architecture, or exporting raw results into internal dashboards, this utility delivers clean data. Pulling directly from sitemap.xml covers the pages a site has declared for search visibility, along with freshness details such as lastmod dates that help keep your index in sync.
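To make the nested-index handling described above concrete, here is a minimal sketch of how a sitemap index can be walked down to its page URLs. It is not the actor's implementation: `parse_sitemap`, `collect_urls`, and the injected `fetch` callable are hypothetical names, and the sketch caps the total number of URLs rather than applying the cap per sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, entries); kind is 'sitemapindex' or 'urlset'.

    Each entry is a dict with 'loc' and 'lastmod' (None when absent).
    """
    root = ET.fromstring(xml_text)
    # Tags look like '{namespace}urlset'; keep only the local name.
    kind = root.tag.rsplit("}", 1)[-1]
    child = "sitemap" if kind == "sitemapindex" else "url"
    entries = []
    for node in root.findall(f"sm:{child}", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            entries.append({"loc": loc, "lastmod": lastmod})
    return kind, entries


def collect_urls(url, fetch, max_urls=5000):
    """Walk a sitemap, following nested indexes, and return page URLs.

    `fetch` is any callable mapping a URL to XML text, so the sketch
    stays independent of a particular HTTP client.
    """
    urls, queue, seen = [], [url], set()
    while queue and len(urls) < max_urls:
        current = queue.pop(0)
        if current in seen:
            continue  # guard against indexes that reference each other
        seen.add(current)
        kind, entries = parse_sitemap(fetch(current))
        locs = [entry["loc"] for entry in entries]
        if kind == "sitemapindex":
            queue.extend(locs)  # child sitemaps are parsed later
        else:
            urls.extend(locs[: max_urls - len(urls)])
    return urls
```

Passing `fetch` in as a parameter keeps the parsing logic testable offline; in practice it could wrap any HTTP client that returns the response body as text.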

Store Quickstart

Start with store-input.example.json to analyze one public sitemap with a small URL cap.
If that matches your SEO workflow, switch to store-input.templates.json and pick one of:
Quickstart (Dataset) for a fast structural audit
Large Site Audit for deeper coverage and status checks
Webhook Alert for change-driven monitoring

Key Features

🗺️ Sitemap index support — Handles nested sitemap structures
📊 Structure analysis — Top directories, depth distribution, file extensions
📅 Update pattern detection — lastmod distribution, changefreq analysis
🔗 Dead link checker — Optional HEAD request sampling to find broken URLs
🏗️ Architecture insights — Understand site structure from sitemap alone
📋 Bulk processing — Analyze multiple sitemaps per run
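
The structure analysis feature can be pictured with a short sketch that derives directory, depth, and extension counts from a plain list of URLs. This is an illustration under assumptions, not the actor's code: `analyze_structure` is a hypothetical helper, depth is taken to be the number of path segments, and the `fileExtensions` key is an assumed name.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit


def analyze_structure(urls, top_n=5):
    """Summarize directory, depth, and extension counts for page URLs."""
    directories, depths, extensions = Counter(), Counter(), Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # Top-level directory is the first path segment ("/" for the root).
        directories["/" + segments[0] if segments else "/"] += 1
        # Depth here means the number of path segments (an assumption).
        depths[len(segments)] += 1
        ext = splitext(segments[-1])[1].lower() if segments else ""
        extensions[ext or "(none)"] += 1
    total = len(urls)
    top = [
        {"path": path, "count": count, "percentage": round(100 * count / total)}
        for path, count in directories.most_common(top_n)
    ]
    return {
        "topDirectories": top,
        "depthDistribution": dict(depths),
        "fileExtensions": dict(extensions),
    }
```

Because everything is computed from the URL strings alone, this kind of summary needs no page fetches, which is what makes sitemap-only analysis fast.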

Use Cases

Who	Why
SEO agencies	Technical SEO audits — sitemap completeness and structure
Content strategists	Identify content update patterns and stale pages
Web developers	Verify sitemap structure before search engine submission
Competitive analysts	Map competitor site architecture from public sitemaps

Input

Field	Type	Default	Description
sitemapUrls	array	prefilled	URLs of sitemap.xml files to analyze. Auto-discovers /sitemap.xml if you provide just a domain.
maxUrls	integer	`5000`	Maximum number of URLs to process from each sitemap.
checkStatus	boolean	`false`	Send HEAD requests to check if URLs return 200. Slower but finds dead links.
delivery	string	`"dataset"`	How to deliver results. 'dataset' saves to Apify Dataset (recommended), 'webhook' sends to a URL.
webhookUrl	string	—	Webhook URL to send results to (only used when delivery is 'webhook'). Works with Slack, Discord, or any HTTP endpoint.
concurrency	integer	`3`	Maximum number of parallel requests. Higher = faster but may trigger rate limits.
dryRun	boolean	`false`	If true, runs without saving results or sending webhooks. Useful for testing.
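
When assembling run input in code, it can help to apply the defaults from the table above and catch obvious mistakes before starting a paid run. The following is a sketch only: `build_input` is a hypothetical helper, and treating `webhookUrl` as required for webhook delivery is an assumption inferred from the field description, not documented actor behavior.

```python
# Defaults copied from the Input table above.
DEFAULTS = {
    "maxUrls": 5000,
    "checkStatus": False,
    "delivery": "dataset",
    "concurrency": 3,
    "dryRun": False,
}


def build_input(sitemap_urls, **overrides):
    """Merge overrides onto the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(DEFAULTS) - {"webhookUrl"}
    if unknown:
        raise ValueError(f"Unknown input fields: {sorted(unknown)}")
    run_input = {"sitemapUrls": list(sitemap_urls), **DEFAULTS, **overrides}
    if not run_input["sitemapUrls"]:
        raise ValueError("sitemapUrls must contain at least one entry")
    if run_input["delivery"] not in ("dataset", "webhook"):
        raise ValueError("delivery must be 'dataset' or 'webhook'")
    # Assumption: webhook delivery is pointless without a target URL.
    if run_input["delivery"] == "webhook" and not run_input.get("webhookUrl"):
        raise ValueError("webhookUrl is required when delivery is 'webhook'")
    if run_input["maxUrls"] < 1 or run_input["concurrency"] < 1:
        raise ValueError("maxUrls and concurrency must be positive")
    return run_input
```

The returned dictionary can be passed as `run_input` to the client call shown under API Usage.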

Input Example

{
  "sitemapUrls": ["https://apify.com/sitemap.xml"],
  "maxUrls": 5000,
  "checkStatus": false,
  "concurrency": 3
}

Input Examples

Example: Single domain audit

{
  "domains": [
    "example.com"
  ]
}

Example: Bulk competitor sitemap snapshot

{
  "domains": [
    "competitor1.com",
    "competitor2.com"
  ],
  "expandIndexes": true
}

Example: Lastmod-only diff for drift

{
  "domains": [
    "example.com"
  ],
  "onlyLastmodChanged": true,
  "sinceDays": 30
}

Output

Field	Type	Description
`meta`	object
`results`	array
`results[].sitemapUrl`	string (url)
`results[].finalUrl`	string (url)
`results[].status`	string
`results[].analysis`	object
`results[].error`	null
`results[].checkedAt`	timestamp

Output Example

{
  "sitemapUrl": "https://apify.com/sitemap.xml",
  "status": "ok",
  "analysis": {
    "type": "urlset",
    "totalUrls": 1247,
    "structure": {
      "topDirectories": [
        { "path": "/store", "count": 890, "percentage": 71 },
        { "path": "/blog", "count": 156, "percentage": 13 }
      ],
      "depthDistribution": { "1": 45, "2": 890, "3": 312 }
    },
    "updateFrequency": {
      "lastModRange": { "oldest": "2023-01-15", "newest": "2026-02-20" },
      "urlsWithLastmod": 1100
    }
  }
}
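
A result item shaped like the example above can be reduced to a few headline metrics for a dashboard or CI check. This is a minimal sketch assuming items always follow that shape; `summarize` and the `lastmodCoverage` and `maxDepth` names are hypothetical and not part of the actor's output.

```python
def summarize(item):
    """Return headline metrics from one result item shaped like the example."""
    analysis = item.get("analysis") or {}
    total = analysis.get("totalUrls", 0)
    with_lastmod = analysis.get("updateFrequency", {}).get("urlsWithLastmod", 0)
    depths = analysis.get("structure", {}).get("depthDistribution", {})
    return {
        "sitemapUrl": item.get("sitemapUrl"),
        "ok": item.get("status") == "ok",
        "totalUrls": total,
        # Share of URLs that declare a lastmod date, as a percentage.
        "lastmodCoverage": round(100 * with_lastmod / total, 1) if total else 0.0,
        # Deepest path level present (depth keys are strings in JSON).
        "maxDepth": max((int(d) for d in depths), default=0),
    }
```

For the example item, 1,100 of 1,247 URLs carry a lastmod value, so the coverage works out to about 88.2%.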

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~sitemap-analyzer/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "sitemapUrls": ["https://apify.com/sitemap.xml"], "maxUrls": 5000, "checkStatus": false, "concurrency": 3 }'

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/sitemap-analyzer").call(run_input={
    "sitemapUrls": ["https://apify.com/sitemap.xml"],
    "maxUrls": 5000,
    "checkStatus": False,  # Python booleans are capitalized
    "concurrency": 3,
})

# Print every item stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/sitemap-analyzer').call({
  "sitemapUrls": ["https://apify.com/sitemap.xml"],
  "maxUrls": 5000,
  "checkStatus": false,
  "concurrency": 3
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Tips & Limitations

Keep concurrency ≤ 5 when auditing production sites to avoid WAF rate-limit triggers.
Use webhook delivery for recurring cron runs — push only deltas to downstream systems.
Enable dryRun for cheap validation before committing to a paid cron schedule.
Results are dataset-first; use Apify API run-sync-get-dataset-items for instant JSON in CI pipelines.
Run a tiny URL count first, review the sample, then scale up — pay-per-event means you only pay for what you use.

FAQ

Is there a rate limit?

Built-in concurrency throttling keeps requests polite. For most public sites this actor can run 1–10 parallel requests without issues, though 5 or fewer is the safer ceiling for production sites behind a WAF.
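
As an illustration of what a concurrency cap means for the optional status checks, here is a sketch using only the Python standard library. It is not the actor's internals: `head_status` and `check_urls` are hypothetical names, and a thread pool is simply one common way to bound the number of requests in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def head_status(url, timeout=10):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    try:
        # urllib follows redirects itself, so this is the final status.
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, bad URL


def check_urls(urls, concurrency=3, check=head_status):
    """Check URLs with at most `concurrency` requests in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(check, urls))
    return dict(zip(urls, statuses))
```

Raising `concurrency` shortens the run but sends more simultaneous requests to the target host, which is the trade-off the concurrency input exposes.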

What happens when the input URL is unreachable?

The actor records an error row with the failure reason — successful URLs keep processing.

Can I schedule recurring runs?

Yes — use Apify Schedules to run this actor on a cron (hourly, daily, weekly). Combine with webhook delivery for change alerts.

Does this actor respect robots.txt?

Yes — requests use a standard User-Agent and honor site rate limits. For aggressive audits, set a higher concurrency only on your own properties.

Can I integrate with Google Sheets or Airtable?

Use webhook delivery with a Zapier/Make/n8n catcher, or call the Apify REST API from Apps Script / Airtable automations.

URL/Link Tools cluster — explore related Apify tools:

🔗 URL Health Checker — Bulk-check HTTP status codes, redirects, SSL validity, and response times for thousands of URLs.
🔗 Broken Link Checker — Crawl websites to find broken links, 404 errors, and dead URLs.
🔗 URL Unshortener — Expand bit.
🏷️ Meta Tag Analyzer — Analyze meta tags, Open Graph, Twitter Cards, JSON-LD, and hreflang for any URL.
📚 Wayback Machine Checker — Check if URLs are archived on the Wayback Machine and find closest snapshots by date.
Schema.org Validator API | JSON-LD + Microdata — Validate JSON-LD and Microdata across multiple pages, score markup quality, and flag missing or malformed Schema.
Site Governance Monitor | Robots, Sitemap & Schema — Recurring robots.
RDAP Domain Monitor API | Ownership + Expiry — Monitor domain registration data via RDAP and track expiry, registrar, nameserver, and ownership changes in structured rows.
Domain Security Audit API | SSL Expiry, DMARC, Domain Expiry — Summary-first portfolio monitor for SSL expiry, DMARC/SPF/DKIM, domain expiry/ownership, and security headers with remediation-ready outputs.

Cost

Pay Per Event:

actor-start: $0.01 (flat fee per run)
dataset-item: $0.003 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01

No subscription required — you only pay for what you use.
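
The pricing formula above is easy to turn into a budget check before scheduling recurring runs. A minimal sketch, using the two event prices listed in this section; `estimate_cost` is a hypothetical helper, and `Decimal` is used so the cents add up exactly.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.01")    # flat fee per run
DATASET_ITEM_FEE = Decimal("0.003")  # per output item


def estimate_cost(items, runs=1):
    """Estimated charge in USD for `runs` runs producing `items` items in total."""
    if items < 0 or runs < 0:
        raise ValueError("items and runs must be non-negative")
    return ACTOR_START_FEE * runs + DATASET_ITEM_FEE * items
```

Calling `estimate_cost(1000)` reproduces the $3.01 example, and `estimate_cost(30000, runs=30)` gives $90.30 for a month of daily runs at 1,000 items each.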

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.

Bug report or feature request? Open an issue on the Issues tab of this actor.
