Sitemap Audit

Get a Sitemap Health Score (0-100) for any website. Discover, parse, and validate XML sitemaps. Find 404s, redirects, canonical mismatches, noindex conflicts, hreflang issues, and missing pages, and estimate crawl budget waste.

Pricing: from $2.00 / 1,000 results
Developer: Andy Page (Maintained by Community)
Actor stats: 1 bookmarked, 2 total users, 1 monthly active user, last modified 6 days ago

Sitemap Audit: Health Score, SEO Validator & Missing Pages Finder

Audit one or many websites' XML sitemaps end-to-end. Get a Sitemap Health Score (0-100) for each domain, validate every URL for 404s, redirects, and robots.txt conflicts, detect canonical mismatches and noindex-in-sitemap conflicts, find pages missing from sitemaps, estimate crawl budget waste, and get prioritized SEO recommendations to fix everything.

Why Use This Actor?

  • Health Score — A single 0-100 metric that instantly tells you how healthy your sitemaps are, with error/warning counts and a detailed breakdown
  • Auto-discovery — Finds all sitemaps from robots.txt and 6 common sitemap paths automatically
  • Full URL validation — HEAD-checks every URL for 404s, 301/302 redirects, 5xx server errors, and robots.txt conflicts
  • Deep page inspection — Fetches a sample of pages to detect canonical tag mismatches, noindex directives on URLs in the sitemap, and hreflang inconsistencies
  • Missing pages finder — Crawls your site to find linked pages that aren't in any sitemap
  • Crawl budget analysis — Estimates how much of Google's crawl budget is wasted on broken, redirected, or blocked URLs
  • Prioritized recommendations — Get actionable fixes sorted by impact: critical → high → medium → low
  • Letter grade (A-F) — Instantly sort and filter sites by grade. Perfect for batch audits and cold outreach
  • Outreach-ready output — Pre-written executive summary and cold email snippet for every domain. Copy-paste into your outreach sequence
  • Sitemap freshness analysis — Detects stale sitemaps (e.g., "hasn't been updated in 247 days") and protocol mismatches (HTTP vs HTTPS)
  • Hreflang support — Full hreflang validation for international sites, including cross-checking XML annotations against on-page tags
  • CSV export — Flattened output for Google Sheets, Excel, or data pipelines

Features

Sitemap Health Score — 0-100 score measuring overall sitemap quality
Auto-Discovery — Finds sitemaps from robots.txt + 6 common paths
XML/Gzip Parsing — Handles urlset, sitemap index, gzipped, and image sitemaps
URL Validation — HEAD-checks every URL for status codes, redirects, and robots.txt conflicts
Duplicate Detection — Finds duplicate URLs (case-insensitive, trailing slash normalization)
Deep Page Inspection — Checks canonical tags, meta robots noindex, and hreflang on actual pages
Missing Pages Finder — Crawls the site to discover pages not in any sitemap
Crawl Budget Waste — Estimates wasted crawl budget from 404s, redirects, blocked, and noindexed URLs
Hreflang Validation — Cross-checks sitemap hreflang annotations against on-page hreflang tags
Error Classification — Categorizes issues as errors (critical) or warnings (non-critical)
Recommendations Engine — Prioritized, actionable SEO fix recommendations
URL Distribution — Shows URL breakdown by path prefix and priority value
Letter Grade (A-F) — Instant letter grade for quick sorting in spreadsheets
Executive Summary — 2-3 sentence audit summary, ready to paste into emails or reports
Outreach Snippet — Pre-written cold email paragraph for each domain
Top 3 Issues — Quick wins list with priority, count, effort, and impact
Issue Severity Counts — Critical/High/Medium/Low counts at top level for easy filtering
Sitemap Freshness — Days since last update, stale URL count, freshness verdict
Protocol Check — Detects HTTP/HTTPS mismatches in sitemap URLs
Proxy Support — Optional Apify proxy integration for rate-limited sites
CSV Export — Flattened output for spreadsheet analysis

Input Parameters

urls (string[], required) — Websites to audit (e.g., ["https://example.com", "https://test.org"]). One report per domain.
sitemapUrls (string[], default []) — Explicit sitemap URLs to audit (added to auto-discovered)
autoDiscover (boolean, default true) — Auto-discover sitemaps from robots.txt and common paths
validateUrls (boolean, default true) — HEAD-check every URL for status codes and redirects
deepInspection (boolean, default true) — Inspect pages for canonical, noindex, and hreflang issues
inspectionSampleSize (integer, default 100) — Pages to inspect (10-500, used when deepInspection is on)
crawlDepth (integer, default 3) — Crawl depth for missing pages discovery (0 = disabled)
maxPagesToCrawl (integer, default 500) — Max pages to crawl for missing pages (50-5000)
maxUrlsToValidate (integer, default 2000) — Max URLs to validate via HEAD requests per domain (100-50000). If the sitemap has more, a random sample is validated.
maxUrlsToParse (integer, default 50000) — Max URLs to collect during sitemap parsing (1000-500000). Prevents out-of-memory on enterprise-scale sitemaps.
maxConcurrency (integer, default 10) — Max concurrent HTTP requests (1-50)
outputFormat (string, default "json") — Output format: json or csv
proxyConfig (object, default Apify residential) — Proxy configuration for HTTP requests

Example Input

{
  "urls": ["https://example.com", "https://blog.example.com"],
  "validateUrls": true,
  "deepInspection": true,
  "inspectionSampleSize": 100,
  "crawlDepth": 3,
  "maxPagesToCrawl": 500,
  "maxConcurrency": 10,
  "outputFormat": "json"
}

Output Schema

Each domain produces one comprehensive audit report in the dataset (so auditing 5 domains produces 5 dataset items):

{
  "domain": "example.com",
  "url": "https://example.com",
  "auditDate": "2026-02-17T12:00:00.000Z",
  "healthScore": 78,
  "grade": "C",
  "gradeLabel": "Needs Improvement",
  "hasSitemap": true,
  "totalSitemaps": 3,
  "totalUrls": 1250,
  "executiveSummary": "example.com scored 78/100 (Grade: C) on its sitemap health audit with 1,250 URLs across its sitemaps. We found 33 errors, 37 warnings, 42 pages missing from the sitemap. This wastes approximately 340 crawl requests/month (5.2% of estimated budget).",
  "outreachSnippet": "I ran a quick technical SEO audit on example.com and found some issues with your XML sitemap that could be hurting your search rankings. Your sitemap scored 78/100 (Grade C). Top issues: remove dead URLs from sitemap, fix robots.txt conflicts, and update redirecting URLs in sitemap. This is wasting roughly 340 Google crawl requests per month on dead or misconfigured URLs. We also found 42 pages on your site that aren't in the sitemap at all.",
  "topIssues": [
    {
      "title": "Remove dead URLs from sitemap",
      "description": "25 URL(s) return 404/410 status. Remove them from the sitemap to stop wasting crawl budget.",
      "priority": "critical",
      "count": 25,
      "effort": "medium",
      "impact": "high"
    },
    {
      "title": "Fix robots.txt conflicts",
      "description": "3 URL(s) are blocked by robots.txt but included in the sitemap.",
      "priority": "critical",
      "count": 3,
      "effort": "low",
      "impact": "high"
    },
    {
      "title": "Add missing pages to sitemap",
      "description": "42 page(s) found on the site are missing from the sitemap.",
      "priority": "high",
      "count": 42,
      "effort": "medium",
      "impact": "medium"
    }
  ],
  "issueSeverityCounts": {
    "critical": 2,
    "high": 3,
    "medium": 4,
    "low": 2,
    "total": 11
  },
  "sitemapFreshness": {
    "hasLastmod": true,
    "urlsWithLastmod": 1100,
    "urlsWithoutLastmod": 150,
    "pctWithoutLastmod": 12,
    "newestLastmod": "2026-01-15T00:00:00.000Z",
    "oldestLastmod": "2024-03-01T00:00:00.000Z",
    "daysSinceLastUpdate": 33,
    "staleUrlCount": 200,
    "staleUrlPct": 16,
    "veryStaleUrlCount": 50,
    "freshnessVerdict": "Sitemap was updated 33 days ago. Consider more frequent updates."
  },
  "protocolIssues": {
    "httpUrlCount": 0,
    "httpsUrlCount": 1250,
    "hasMixedProtocol": false,
    "siteUsesHttps": true,
    "sampleHttpUrls": [],
    "verdict": "All sitemap URLs use HTTPS. ✓"
  },
  "sitemapsDiscovered": [
    "https://example.com/sitemap.xml",
    "https://example.com/post-sitemap.xml",
    "https://example.com/page-sitemap.xml"
  ],
  "sitemapDetails": [
    {
      "url": "https://example.com/sitemap.xml",
      "type": "index",
      "urlCount": 0,
      "imageCount": 0,
      "hreflangCount": 0,
      "byteLength": 1024,
      "gzipped": false,
      "error": null
    },
    {
      "url": "https://example.com/post-sitemap.xml",
      "type": "urlset",
      "urlCount": 850,
      "imageCount": 200,
      "hreflangCount": 0,
      "byteLength": 45000,
      "gzipped": false,
      "error": null
    }
  ],
  "urlValidation": {
    "total": 1250,
    "status2xx": 1180,
    "status3xx": 35,
    "status4xx": 25,
    "status5xx": 5,
    "blockedByRobots": 3,
    "duplicateCount": 2,
    "details": [
      {
        "url": "https://example.com/old-page",
        "statusCode": 404,
        "error": null
      },
      {
        "url": "https://example.com/moved-page",
        "statusCode": 301,
        "redirectUrl": "https://example.com/new-page",
        "isRedirect": true
      }
    ]
  },
  "deepInspection": {
    "sampleSize": 100,
    "totalUrls": 1250,
    "summary": {
      "canonicalMatch": 85,
      "canonicalMismatch": 8,
      "canonicalMissing": 7,
      "noindexCount": 3,
      "hreflangMissingOnPage": 0
    },
    "details": [
      {
        "url": "https://example.com/product/123",
        "canonical": {
          "status": "mismatch",
          "expected": "https://example.com/product/123",
          "found": "https://example.com/product/123?ref=home"
        },
        "noindex": { "hasNoindex": false },
        "hreflang": { "missingOnPage": [] }
      }
    ]
  },
  "missingPages": {
    "crawledCount": 500,
    "missingCount": 42,
    "pages": [
      "https://example.com/about",
      "https://example.com/contact",
      "https://example.com/blog/popular-post"
    ]
  },
  "errors": [
    {
      "type": "url_404",
      "severity": "error",
      "message": "URL returns 404",
      "url": "https://example.com/old-page",
      "statusCode": 404
    }
  ],
  "errorCount": 33,
  "warnings": [
    {
      "type": "url_redirect",
      "severity": "warning",
      "message": "URL redirects to another location",
      "url": "https://example.com/moved-page",
      "redirectUrl": "https://example.com/new-page"
    }
  ],
  "warningCount": 37,
  "crawlBudgetWaste": {
    "wastedUrls": 70,
    "wastePercentage": 5.6,
    "siteSizeTier": "medium",
    "breakdown": {
      "notFound": 25,
      "redirects": 35,
      "blocked": 3,
      "noindex": 7
    }
  },
  "urlDistribution": {
    "byPathPrefix": {
      "/blog": 400,
      "/product": 350,
      "/category": 200,
      "/page": 150,
      "/other": 150
    },
    "byPriority": {
      "1.0": 10,
      "0.8": 200,
      "0.5": 800,
      "none": 240
    }
  },
  "recommendations": [
    {
      "priority": "critical",
      "category": "broken_urls",
      "title": "Remove 25 URLs returning 404 from sitemaps",
      "description": "25 URLs in your sitemaps return HTTP 404 (Not Found). These waste crawl budget and signal poor sitemap maintenance to search engines.",
      "affectedUrls": 25
    },
    {
      "priority": "high",
      "category": "redirects",
      "title": "Update 35 redirecting URLs to final destinations",
      "description": "35 URLs redirect to other locations. Replace them with the final destination URLs to avoid wasting crawl budget on redirect chains.",
      "affectedUrls": 35
    },
    {
      "priority": "medium",
      "category": "missing_pages",
      "title": "Add 42 linked pages to your sitemap",
      "description": "42 pages are linked from your site but missing from all sitemaps. Adding them helps search engines discover and index these pages faster.",
      "affectedUrls": 42
    }
  ],
  "robotsTxtFound": true,
  "settings": {
    "validateUrls": true,
    "deepInspection": true,
    "crawlDepth": 3,
    "maxConcurrency": 10,
    "maxPagesToCrawl": 500,
    "maxUrlsToParse": 50000,
    "inspectionSampleSize": 100,
    "maxUrlsToValidate": 2000,
    "outputFormat": "json"
  },
  "elapsedMs": 45000
}

Sitemap Health Score & Grade

The Health Score is a 0-100 metric that measures how well-maintained your sitemaps are. Each score maps to a letter grade for instant readability in spreadsheets and outreach:

90-100 — A (Excellent): Sitemaps are clean and well-maintained
80-89 — B (Good): Minor issues to address
65-79 — C (Needs Improvement): Significant problems impacting crawl efficiency
45-64 — D (Poor): Serious issues, sitemaps need major cleanup
0-44 — F (Critical): Sitemaps are causing active SEO damage
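
The score-to-grade mapping above can be sketched as a small helper. This is illustrative only; the actor applies the same thresholds internally, but the function name here is made up for the example:

```javascript
// Map a 0-100 health score to the letter grade and label from the table above.
function gradeForScore(score) {
  if (score >= 90) return { grade: "A", label: "Excellent" };
  if (score >= 80) return { grade: "B", label: "Good" };
  if (score >= 65) return { grade: "C", label: "Needs Improvement" };
  if (score >= 45) return { grade: "D", label: "Poor" };
  return { grade: "F", label: "Critical" };
}

console.log(gradeForScore(78)); // → { grade: 'C', label: 'Needs Improvement' }
```

This matches the example report, where healthScore 78 yields grade "C" / "Needs Improvement".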

What Reduces the Score

URLs returning 404: -5 per URL, capped (error)
URLs returning 5xx: -5 per URL, capped (error)
URLs blocked by robots.txt: -5 per URL (error)
Canonical tag mismatches: -3 per URL (error)
Noindex pages in sitemap: -5 per URL (error)
Redirecting URLs (301/302): -2 per URL, capped (warning)
Duplicate URLs: -1 per URL (warning)
Missing lastmod metadata: -5 flat (warning)
Uniform/missing priority: -3 flat (warning)
Error rate above 10%: additional -10 penalty
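
Putting the penalty table together, the scoring logic looks roughly like the sketch below. The per-issue weights come from the table; the cap value (PENALTY_CAP) is an assumed placeholder, since the actual cap is internal to the actor:

```javascript
// Illustrative health-score sketch based on the penalty table above.
// PENALTY_CAP is an assumption, not a documented constant.
const PENALTY_CAP = 25;

function sketchHealthScore(counts) {
  const capped = (n, perUrl) => Math.min(n * perUrl, PENALTY_CAP);
  let penalty = 0;
  penalty += capped(counts.notFound, 5);       // 404s: -5 per URL (capped)
  penalty += capped(counts.serverErrors, 5);   // 5xx: -5 per URL (capped)
  penalty += counts.blockedByRobots * 5;       // robots.txt conflicts: -5 per URL
  penalty += counts.canonicalMismatch * 3;     // canonical mismatches: -3 per URL
  penalty += counts.noindex * 5;               // noindex in sitemap: -5 per URL
  penalty += capped(counts.redirects, 2);      // redirects: -2 per URL (capped)
  penalty += counts.duplicates;                // duplicates: -1 per URL
  if (counts.missingLastmod) penalty += 5;     // missing lastmod: -5 flat
  if (counts.uniformPriority) penalty += 3;    // uniform/missing priority: -3 flat
  if (counts.errorRate > 0.1) penalty += 10;   // >10% error rate: extra -10
  return Math.max(0, 100 - penalty);
}
```

A sitemap with no issues at all scores 100; penalties accumulate from there and the score floors at 0.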

Example Use Cases

1. Quick SEO Health Check

Get an instant health score for any site:

{
  "urls": ["https://example.com"]
}

2. Batch Audit Multiple Domains

Audit your site and all your competitors in a single run:

{
  "urls": [
    "https://mysite.com",
    "https://competitor1.com",
    "https://competitor2.com",
    "https://competitor3.com"
  ]
}

3. Parse-Only Audit (No Network Requests to URLs)

Fast audit that only discovers and parses sitemaps without validating URLs:

{
  "urls": ["https://example.com"],
  "validateUrls": false,
  "deepInspection": false,
  "crawlDepth": 0
}

4. Audit Specific Sitemaps

Audit specific sitemap URLs at non-standard locations:

{
  "urls": ["https://example.com"],
  "sitemapUrls": [
    "https://cdn.example.com/sitemaps/main.xml",
    "https://example.com/custom-sitemap.xml.gz"
  ],
  "autoDiscover": false
}

5. Deep Inspection for International Site

Full hreflang validation for a multi-language site:

{
  "urls": ["https://example.com"],
  "deepInspection": true,
  "inspectionSampleSize": 200,
  "crawlDepth": 0
}

6. Large Site Audit with High Throughput

Audit a large site with aggressive concurrency and generous parsing limits:

{
  "urls": ["https://large-site.com"],
  "maxConcurrency": 20,
  "maxPagesToCrawl": 2000,
  "maxUrlsToValidate": 5000,
  "maxUrlsToParse": 100000,
  "inspectionSampleSize": 300,
  "crawlDepth": 2
}

7. CSV Export for Reporting

Generate a CSV for spreadsheet analysis:

{
  "urls": ["https://example.com", "https://blog.example.com"],
  "outputFormat": "csv"
}

8. Agency Client Batch Audit

Audit all client sites in one run with minimal resource usage:

{
  "urls": [
    "https://client1.com",
    "https://client2.com",
    "https://client3.com",
    "https://client4.com",
    "https://client5.com"
  ],
  "deepInspection": false,
  "crawlDepth": 0,
  "maxConcurrency": 5
}

9. Cold Outreach Lead Qualification (SEO Agencies)

Run a batch of prospect websites, then export the CSV. Sort by grade (D/F = hottest leads), and use the outreachSnippet column directly in your email sequences:

{
  "urls": [
    "https://prospect1.com",
    "https://prospect2.com",
    "https://prospect3.com",
    "https://prospect4.com",
    "https://prospect5.com"
  ],
  "validateUrls": true,
  "deepInspection": false,
  "crawlDepth": 0,
  "maxConcurrency": 10,
  "outputFormat": "csv"
}

Outreach workflow:

  1. Load 50-500 prospect domains into the urls array
  2. Run the actor → export dataset as CSV
  3. Open in Google Sheets → sort by grade column (D/F first)
  4. Filter by issueSeverityCounts.critical > 0 for the strongest leads
  5. Copy the outreachSnippet column into your cold email tool — each snippet is a personalized paragraph citing the prospect's specific issues
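
The same triage can be done programmatically over the dataset items. The field names (grade, issueSeverityCounts.critical, outreachSnippet) come from the output schema; the ranking function itself is just a sketch:

```javascript
// Rank audited domains for outreach: keep only domains with critical issues,
// then sort worst grade first (F before D before C...).
const GRADE_RANK = { F: 0, D: 1, C: 2, B: 3, A: 4 };

function hottestLeads(items) {
  return items
    .filter((item) => (item.issueSeverityCounts?.critical ?? 0) > 0)
    .sort((a, b) => GRADE_RANK[a.grade] - GRADE_RANK[b.grade])
    .map((item) => ({
      domain: item.domain,
      grade: item.grade,
      snippet: item.outreachSnippet,
    }));
}
```

Feed this the dataset items from a batch run and the first entries are your strongest leads, each with its ready-to-paste snippet.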

Key outreach columns in the output:

grade — Letter grade A-F; sort by this to find the worst sites
hasSitemap — false means "You don't even have a sitemap", the easiest pitch
executiveSummary — 2-3 sentence summary for reports/Slack
outreachSnippet — Ready-to-paste cold email paragraph with specific issues
topIssues — Top 3 issues with titles, perfect for email bullet points
issueSeverityCounts.critical — Number of critical issues; higher means a hotter lead
sitemapFreshness.daysSinceLastUpdate — "Your sitemap hasn't been updated in X days"
sitemapFreshness.freshnessVerdict — Human-readable freshness assessment
protocolIssues.verdict — HTTP/HTTPS mismatch detection

How It Works

The actor runs a 7-phase pipeline:

  1. Discover — Fetches robots.txt, extracts Sitemap: directives, and probes 6 common sitemap paths (/sitemap.xml, /sitemap_index.xml, /sitemap.xml.gz, /sitemaps/sitemap.xml, /sitemap/sitemap.xml, /wp-sitemap.xml). Merges with any explicit sitemap URLs you provide.

  2. Parse — Fetches and parses each sitemap. Handles standard <urlset>, <sitemapindex> (recursively follows child sitemaps), gzipped sitemaps, <image:image> entries, and <xhtml:link> hreflang annotations. Uses fast-xml-parser for robust XML handling.

  3. Validate — Sends HEAD requests to every URL with configurable concurrency. Checks HTTP status codes, follows redirects, detects robots.txt conflicts using the parsed rules, and identifies duplicate URLs (case-insensitive, trailing-slash normalized).

  4. Inspect — Fetches the full HTML of a stratified sample of pages. Extracts <link rel="canonical">, <meta name="robots">, and <link rel="alternate" hreflang="..."> tags. Cross-checks canonicals against sitemap URLs, detects noindex directives, and validates hreflang consistency.

  5. Find Missing Pages — Uses Crawlee's CheerioCrawler to spider the site up to a configurable depth. Compares crawled URLs against the sitemap URL set to find pages that exist on the site but are missing from all sitemaps.

  6. Analyze — Classifies every issue as an error or warning, calculates the health score, estimates crawl budget waste, and computes URL distribution statistics.

  7. Recommend — Generates prioritized recommendations sorted by impact (critical → high → medium → low), with affected URL counts and actionable descriptions.
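
The URL normalization behind duplicate detection (step 3) and the sitemap-vs-crawl comparison (step 5) can be sketched as follows. This is a simplified version consistent with the described behavior (case-insensitive, trailing-slash normalized); the actor's exact rules may differ:

```javascript
// Normalize a URL for comparison: lowercase host and path, strip trailing slash.
function normalizeUrl(url) {
  const u = new URL(url);
  let path = u.pathname.toLowerCase();
  if (path.length > 1 && path.endsWith("/")) path = path.slice(0, -1);
  return `${u.protocol}//${u.hostname.toLowerCase()}${path}${u.search}`;
}

// Step 3: flag URLs whose normalized form was already seen.
function findDuplicates(urls) {
  const seen = new Set();
  const dupes = [];
  for (const url of urls) {
    const key = normalizeUrl(url);
    if (seen.has(key)) dupes.push(url);
    else seen.add(key);
  }
  return dupes;
}

// Step 5: crawled pages whose normalized form appears in no sitemap.
function findMissingPages(crawledUrls, sitemapUrls) {
  const inSitemap = new Set(sitemapUrls.map(normalizeUrl));
  return crawledUrls.filter((url) => !inSitemap.has(normalizeUrl(url)));
}
```

With this normalization, https://example.com/A/ and https://example.com/a count as duplicates, and a crawled page only counts as "missing" if no sitemap entry normalizes to the same key.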

Troubleshooting

No sitemaps found

  • Check robots.txt: The actor looks for Sitemap: directives in robots.txt first. If your sitemap is at a non-standard location, use the sitemapUrls input parameter.
  • WordPress: Most WordPress sites have sitemaps at /wp-sitemap.xml or /sitemap.xml — both are checked automatically.
  • Blocked by robots.txt: Some sites block the sitemap URL itself in robots.txt (uncommon but it happens).

Low health score but site seems fine

  • Redirects count: If your site recently migrated URLs, old URLs in the sitemap will be flagged as redirects (warnings). Update the sitemap with final URLs.
  • Staging content: Sometimes sitemaps include staging or draft URLs that return 404.
  • CDN issues: Some CDNs return different status codes for HEAD vs GET requests. The actor uses HEAD for efficiency.

Actor runs slowly

  • Large sitemaps: Sites with 50K+ URLs take longer. Increase maxConcurrency (up to 50) for faster validation, or reduce maxUrlsToValidate to validate a random sample instead of every URL.
  • Enterprise-scale sitemaps: Sites with 100K+ URLs across dozens of sub-sitemaps can be capped with maxUrlsToParse to prevent memory issues while still producing a useful audit.
  • Deep inspection: Reduce inspectionSampleSize if you don't need comprehensive page-level analysis.
  • Crawl depth: Set crawlDepth to 0 to skip the missing pages crawl entirely.

Running locally

Install dependencies and run:

cd actors/sitemap-audit
npm install
echo '{"urls":["https://example.com"]}' | npx apify-cli run --purge

Or run unit tests:

npm test

Limitations

  • Sequential domain processing — Multiple domains are audited one after another (not in parallel) to keep memory usage predictable. For massive batches (100+ domains), consider splitting into multiple runs.
  • HEAD request accuracy — Some servers handle HEAD differently than GET. Deep inspection (which uses GET) catches cases where HEAD returns incorrect status codes.
  • Dynamic content — Pages rendered entirely via JavaScript may not have proper canonical/noindex tags visible to the HTML parser. Consider using a browser-based crawler for heavy SPA sites.
  • Rate limiting — High concurrency on small servers may trigger rate limiting. Reduce maxConcurrency or enable proxy if you see many 429 responses.
  • Sitemap size limit — Individual sitemaps larger than 50 MB are skipped (the XML sitemap spec recommends max 50 MB uncompressed). Additionally, total URLs collected across all sub-sitemaps are capped at maxUrlsToParse (default 50,000) to prevent out-of-memory crashes on enterprise-scale sitemap indexes.
  • Missing pages accuracy — The missing pages finder only discovers pages reachable via internal links within the configured crawl depth. Orphan pages with no internal links won't be found.

FAQ

Q: How long does a typical run take? A: Depends on the site size and settings. A small site (< 1K URLs) with full validation takes 30-60 seconds. A large site (50K URLs) with maxConcurrency: 20 takes 5-15 minutes. Parse-only audits (validation/inspection/crawl disabled) complete in seconds.

Q: Do I need proxies? A: Usually not for sitemap auditing. Proxies help if the target site rate-limits or blocks datacenter IPs. The actor defaults to Apify residential proxies when running on the platform.

Q: What if the site has no sitemap? A: The actor reports totalSitemaps: 0 and generates a recommendation to create one. If crawlDepth > 0, it still crawls the site to show what pages exist that should be in a sitemap.

Q: What's the difference between errors and warnings? A: Errors are issues that directly harm SEO: 404 URLs, robots.txt conflicts, canonical mismatches, noindex pages in sitemap, server errors. Warnings are suboptimal but less damaging: redirects, duplicates, missing metadata.
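
That split can be expressed as a simple lookup. The url_404 and url_redirect type names appear in the output schema's errors/warnings arrays; the remaining type names here are illustrative guesses, not documented identifiers:

```javascript
// Error-vs-warning classification sketch. Only "url_404" and "url_redirect"
// are confirmed type names from the output schema; the rest are illustrative.
const SEVERITY = {
  url_404: "error",
  url_5xx: "error",
  blocked_by_robots: "error",
  canonical_mismatch: "error",
  noindex_in_sitemap: "error",
  url_redirect: "warning",
  duplicate_url: "warning",
  missing_lastmod: "warning",
};

function classify(issueType) {
  return SEVERITY[issueType] ?? "warning";
}
```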

Q: How is crawl budget waste calculated? A: It counts URLs that waste Googlebot's crawl budget: 404s, redirects, robots-blocked URLs, and noindexed pages in the sitemap. The waste percentage is wastedUrls / totalUrls × 100.
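
Using the breakdown fields from the output schema, that calculation is:

```javascript
// Crawl budget waste as described above: wastedUrls / totalUrls * 100.
// Field names match the crawlBudgetWaste.breakdown object in the output schema.
function crawlBudgetWaste({ notFound, redirects, blocked, noindex }, totalUrls) {
  const wastedUrls = notFound + redirects + blocked + noindex;
  const wastePercentage = Math.round((wastedUrls / totalUrls) * 1000) / 10;
  return { wastedUrls, wastePercentage };
}

// With the example report's breakdown (25 + 35 + 3 + 7 of 1,250 URLs):
console.log(crawlBudgetWaste({ notFound: 25, redirects: 35, blocked: 3, noindex: 7 }, 1250));
// → { wastedUrls: 70, wastePercentage: 5.6 }
```

This reproduces the wastedUrls: 70 / wastePercentage: 5.6 values from the example output above.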

Q: Can I audit a sitemap at a non-standard URL? A: Yes — use the sitemapUrls parameter to provide specific sitemap URLs. Set autoDiscover: false to skip the standard discovery process.

Q: Can I use this for cold outreach / lead generation? A: Yes — that's a primary use case. Load prospect domains into the urls array, export the CSV, sort by grade (D/F = hottest leads), and use the outreachSnippet column in your email sequences. Each snippet is a personalized paragraph citing the prospect's specific sitemap issues.

Q: What's the outreachSnippet field? A: A pre-written cold email paragraph customized for each domain. It mentions the specific health score, top issues, crawl budget waste, and missing pages — ready to paste into any cold email tool.

Q: Does this check hreflang tags? A: Yes. It reads hreflang annotations from the XML sitemap (<xhtml:link>) and, during deep inspection, cross-checks them against the <link rel="alternate" hreflang="..."> tags found on the actual pages.
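
The cross-check amounts to a set difference: sitemap hreflang annotations with no matching on-page tag. A minimal sketch, with object shapes mirroring the hreflang.missingOnPage field in the output schema:

```javascript
// Return sitemap hreflang annotations that have no matching
// <link rel="alternate" hreflang="..."> tag on the actual page.
function hreflangMissingOnPage(sitemapAnnotations, onPageTags) {
  const onPage = new Set(onPageTags.map((t) => `${t.hreflang}|${t.href}`));
  return sitemapAnnotations.filter((a) => !onPage.has(`${a.hreflang}|${a.href}`));
}
```

If the sitemap declares en and de alternates for a URL but the page only carries the en tag, the de annotation is reported as missing on page.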

Dataset Views

The Apify Console provides two pre-configured table views:

  1. Overview — Domain, Grade, Score, Has Sitemap, URLs, Critical/High issue counts, Errors, Warnings, Days Since Update, Budget Waste %, Executive Summary
  2. Outreach View — Domain, Grade, Score, Has Sitemap, Critical/High/Total issue counts, Freshness Verdict, Outreach Email Snippet

Switch between views in the Apify Console when viewing dataset results.

Changelog

v1.0.6 (February 2026)

  • maxUrlsToParse input — New safety cap (default 50,000) on total URLs collected during sitemap parsing. Prevents out-of-memory crashes on enterprise-scale sitemap indexes (e.g., sites with 100K+ URLs across dozens of sub-sitemaps). The actor now gracefully truncates and continues instead of crashing.
  • maxUrlsToValidate input — New cap (default 2,000) on URLs validated via HEAD requests. If a sitemap has more URLs than this limit, a random sample is validated. Prevents timeouts on very large sitemaps.
  • Improved large-site resilience — Tested against stripe.com, postman.com, shopify.com, cloudflare.com, and other enterprise-scale sitemaps without failures.

v1.0.3 (February 2026)

  • Letter grade (A-F) for instant sorting and outreach qualification
  • Executive summary — 2-3 sentence audit summary ready for emails/reports
  • Outreach snippet — Pre-written cold email paragraph per domain
  • Top 3 issues — Quick wins list with priority, count, effort, and impact
  • Issue severity counts — Critical/High/Medium/Low at top level for easy filtering
  • Sitemap freshness analysis — Days since last update, stale URL counts, freshness verdict
  • Protocol issue detection — HTTP vs HTTPS mismatch detection in sitemap URLs
  • Outreach dataset view in Apify Console optimized for agency cold outreach workflows

v1.0.0 (February 2026)

  • Initial public release
  • Sitemap Health Score (0-100) with detailed error/warning classification
  • Auto-discovery from robots.txt + 6 common paths
  • XML/gzip/sitemap index recursive parsing with fast-xml-parser
  • Full URL validation: HEAD checks, redirect detection, robots.txt conflict analysis
  • Deep page inspection: canonical tags, meta robots noindex, hreflang cross-validation
  • Missing pages finder using Crawlee CheerioCrawler
  • Crawl budget waste estimation with breakdown by issue type
  • Prioritized recommendations engine (critical → low)
  • URL distribution analysis by path prefix and priority
  • Hreflang support (XML + on-page cross-check)
  • CSV export to key-value store
  • Proxy support via Apify proxy configuration
  • State persistence for Actor migration
  • 239 unit tests across 43 test suites
  • 11 integration test scenarios

Support

  • Issues: Report bugs via GitHub issues or the Apify community forum
  • Feature requests: Contact us through Apify or open a GitHub issue
  • Enterprise: For large-scale sitemap monitoring, reach out for custom pricing

Built by A Page Ventures | Apify Store