Sitemap Generator

Generate XML sitemaps by crawling websites. Link following, robots.txt respect, configurable depth/limits. Valid XML with lastmod, changefreq, priority. URL inventory with status codes. Ideal for SEO and migrations.

Pricing

from $1.00 / 1,000 pages crawled

Developer: junipr (Maintained by Community)

Introduction

Sitemap Generator is a production-grade Apify actor that crawls any website and generates a standards-compliant XML sitemap. It discovers all accessible pages through link following, respects configurable depth limits and URL patterns, and detects existing sitemaps from robots.txt and common paths like /sitemap.xml. The actor outputs a ready-to-submit XML sitemap plus a structured page list with metadata including last modified dates, change frequency estimates, and calculated priority values.

Primary use cases:

  • SEO professionals auditing and generating sitemaps for client sites
  • Web developers building sitemaps for sites without CMS-generated ones
  • DevOps teams automating sitemap generation in CI/CD pipelines
  • Content teams verifying all pages are indexed and discoverable
  • Migration specialists mapping old site structure for redirects

Key differentiators: JavaScript rendering support via Playwright for SPAs, automatic existing sitemap detection and merging, depth-based priority estimation with inbound link boosting, lastmod detection from HTTP headers and meta tags, and auto-split at 50K URLs per sitemap protocol spec.
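As a concrete illustration of the multi-source lastmod idea, a sketch like the following (a hypothetical helper, not the actor's actual code) prefers the HTTP `Last-Modified` header and falls back to a page meta tag:

```python
from email.utils import parsedate_to_datetime

def detect_lastmod(headers, meta):
    """Pick a lastmod value: HTTP Last-Modified header first,
    then an article:modified_time meta tag (illustrative only)."""
    if "Last-Modified" in headers:
        # Convert the RFC 1123 header date to ISO 8601
        return parsedate_to_datetime(headers["Last-Modified"]).isoformat()
    return meta.get("article:modified_time")
```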

Why Use This Actor

| Feature | Sitemap Generator | Sitemap Generator (Apify) | XML Sitemap Creator | Screaming Frog |
|---|---|---|---|---|
| JS-rendered pages | Yes (Playwright) | No | No | Yes (desktop) |
| Existing sitemap detection | Yes (robots.txt + paths) | No | Partial | Yes |
| Priority estimation | Depth + link count | None | Static values | Heuristic |
| lastmod from headers | Yes (multi-source) | No | No | Yes |
| changefreq estimation | Yes (content heuristic) | No | No | No |
| Output: XML + JSON | Both | XML only | XML only | XML + CSV |
| Auto-split >50K URLs | Yes, with index | No | No | Yes |
| Canonical URL handling | Full support | No | No | Yes |
| PPE pricing | $2/1K pages | Compute-based | Compute-based | License fee |
| Zero-config | Yes | Yes | Mostly | No |

This actor handles the most common pain points with existing sitemap generators: failure on JavaScript-rendered pages, no detection of existing sitemaps, poor deduplication of query-parameterized URLs, and lack of meaningful priority values.

How to Use

Zero-Config Quick Start

Just provide a start URL and run. Everything else has sensible defaults:

```json
{
  "startUrl": "https://example.com"
}
```

The actor will crawl up to 500 pages, generate an XML sitemap, and store it in the Key-Value Store under the SITEMAP_XML key.

Step-by-Step

  1. Go to the actor's page on Apify Console
  2. Enter your website URL in the Start URL field
  3. (Optional) Adjust max pages, depth, or enable Playwright for JS-heavy sites
  4. Click Start to run the actor
  5. When complete, download the XML sitemap from the Key-Value Store tab (SITEMAP_XML)
  6. Upload the sitemap to Google Search Console or place it at your site's root

Common Configuration Recipes

Quick Audit — Default settings for a fast overview:

```json
{
  "startUrl": "https://example.com",
  "maxPages": 500,
  "crawlerType": "cheerio"
}
```

Full Site Map — Comprehensive crawl of the entire site:

```json
{
  "startUrl": "https://example.com",
  "maxPages": 50000,
  "maxDepth": 10,
  "crawlerType": "cheerio"
}
```

Blog Only — Generate sitemap for just the blog section:

```json
{
  "startUrl": "https://example.com/blog",
  "includePatterns": ["/blog/*"],
  "stayWithinPath": true
}
```

SPA Site — JavaScript-rendered single page application:

```json
{
  "startUrl": "https://app.example.com",
  "crawlerType": "playwright",
  "maxConcurrency": 5
}
```

Compare with Existing — See what your existing sitemap is missing:

```json
{
  "startUrl": "https://example.com",
  "existingSitemapAction": "compare"
}
```

Input Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Root URL to start crawling |
| maxPages | integer | 500 | Max pages to crawl (1-100,000) |
| maxDepth | integer | 5 | Link-following depth (0 = start URL only) |
| includePatterns | string[] | [] | Glob patterns for URLs to include |
| excludePatterns | string[] | file extensions | Glob patterns for URLs to exclude |
| crawlerType | string | "cheerio" | Engine: "cheerio" (fast) or "playwright" (JS) |
| includeLastmod | boolean | true | Include last modified dates |
| includeChangefreq | boolean | true | Include change frequency |
| includePriority | boolean | true | Include calculated priority |
| checkExistingSitemap | boolean | true | Detect existing sitemaps |
| existingSitemapAction | string | "merge" | merge, replace, or compare |
| respectRobotsTxt | boolean | true | Honor robots.txt directives |
| sitemapFormat | string | "xml" | Output: xml, txt, or both |
| splitAtCount | integer | 50000 | Auto-split threshold |

See the Input Schema tab for the complete list of parameters with detailed descriptions.

Output Format

XML Sitemap (Key-Value Store: SITEMAP_XML)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-15T10:30:00Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2025-12-01T08:00:00Z</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
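The sitemap file follows a fixed template, so it is straightforward to assemble from page records. A minimal sketch (illustrative only, not the actor's actual implementation) that produces output in this shape:

```python
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Assemble a sitemaps.org-compliant urlset from page dicts with
    optional lastmod, changefreq, and priority fields."""
    entries = []
    for p in pages:
        fields = [f"    <loc>{escape(p['url'])}</loc>"]
        if p.get("lastmod"):
            fields.append(f"    <lastmod>{p['lastmod']}</lastmod>")
        if p.get("changefreq"):
            fields.append(f"    <changefreq>{p['changefreq']}</changefreq>")
        if p.get("priority") is not None:
            fields.append(f"    <priority>{p['priority']:.1f}</priority>")
        entries.append("  <url>\n" + "\n".join(fields) + "\n  </url>")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>"
    )
```

Note the `escape` call: `&` and `<` in URLs must be entity-encoded for the sitemap to validate.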

Dataset Item (one per page crawled)

```json
{
  "url": "https://example.com/about",
  "statusCode": 200,
  "depth": 1,
  "title": "About Us",
  "lastModified": "2025-12-01T08:00:00.000Z",
  "changefreq": "monthly",
  "priority": 0.8,
  "contentType": "text/html",
  "responseTimeMs": 245,
  "inSitemap": true,
  "excludeReason": null
}
```

Run Summary (Key-Value Store: RUN_SUMMARY)

```json
{
  "startUrl": "https://example.com",
  "totalPagesCrawled": 347,
  "totalPagesInSitemap": 312,
  "pagesExcluded": 35,
  "pagesSkippedByRobots": 8,
  "pagesFailed": 3,
  "duplicatesRemoved": 12,
  "existingSitemapFound": true,
  "existingSitemapUrls": 290,
  "durationMs": 45200,
  "sitemapSplitCount": 1
}
```
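The summary counts are derivable from the per-page dataset items, which is handy if you post-process runs yourself. A sketch (assuming the `inSitemap` and `statusCode` fields shown above):

```python
def summarize(items):
    """Derive headline counts from per-page dataset items (illustrative)."""
    return {
        "totalPagesCrawled": len(items),
        "totalPagesInSitemap": sum(1 for i in items if i["inSitemap"]),
        "pagesExcluded": sum(1 for i in items if not i["inSitemap"]),
        "pagesFailed": sum(1 for i in items if i["statusCode"] >= 400),
    }
```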

Tips and Advanced Usage

Optimizing Crawl Speed

  • Use crawlerType: "cheerio" for static sites — it is 5-10x faster than Playwright and uses far less memory
  • Increase maxConcurrency for faster crawls on sites that handle high request rates
  • Set excludeQueryParams: true (default) to avoid crawling the same page with different query strings

URL Pattern Filtering

  • Use includePatterns to limit the sitemap to specific sections: ["/blog/*", "/products/*"]
  • Use excludePatterns to skip admin pages, API endpoints, or file downloads
  • Patterns use glob syntax: * matches anything except /, ** matches anything including /
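The two wildcard rules above can be expressed as a small glob-to-regex translation. This is a rough sketch of the described semantics, not the actor's internal matcher:

```python
import re

def glob_to_regex(pattern):
    """Translate the glob syntax described above into a regex:
    '**' matches anything including '/', '*' anything except '/'."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    # Anchor at the end so "/blog/*" does not match deeper paths
    return re.compile("".join(out) + "$")

def matches(path, patterns):
    return any(glob_to_regex(p).search(path) for p in patterns)
```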

Existing Sitemap Workflows

  • Merge (default): Combines crawled URLs with the existing sitemap for a complete picture
  • Compare: Generates a diff report showing URLs missing from your current sitemap and URLs in the sitemap that are no longer accessible
  • Replace: Ignores the existing sitemap entirely and generates a fresh one from the crawl

Submitting to Search Engines

After generating your sitemap, download it from the Key-Value Store and either upload it to Google Search Console or place it at your site's root and reference it with a `Sitemap:` line in robots.txt. Note that Google retired its `/ping` sitemap endpoint in 2023, so programmatic pings are no longer reliable; submitting through Search Console (or the robots.txt directive) is the supported route.
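Before uploading, it is worth a quick sanity check that the downloaded file is well-formed and uses the sitemap namespace. A small sketch using the standard library:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def count_sitemap_urls(xml_text):
    """Parse a sitemap and return the number of <url> entries,
    raising if the XML is malformed or the namespace is wrong."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError(f"unexpected root element: {root.tag}")
    return len(root.findall(f"{{{SITEMAP_NS}}}url"))
```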

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $2.00 per 1,000 pages crawled.

A billable event occurs when the actor successfully fetches a URL, processes it, and records the result. You are NOT charged for URLs blocked by robots.txt, duplicate URLs filtered before request, failed requests, or the initial robots.txt and existing sitemap fetches.

Cost Examples

| Scenario | Pages | Cost |
|---|---|---|
| Small blog | 50 | $0.10 |
| Business site | 200 | $0.40 |
| E-commerce | 5,000 | $10.00 |
| News site | 50,000 | $100.00 |
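The figures in the table follow directly from the $2.00-per-1,000-pages rate; a one-line helper for budgeting larger crawls (platform compute costs are extra):

```python
def estimated_cost(pages_crawled, rate_per_1k=2.00):
    """PPE cost at the stated $2.00 per 1,000 pages crawled."""
    return round(pages_crawled / 1000 * rate_per_1k, 2)
```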

Plus standard Apify platform compute costs based on memory and runtime.

FAQ

Does it handle JavaScript-rendered pages?

Yes. Set crawlerType to "playwright" to enable full browser rendering. This handles React, Next.js, Vue, Angular, and any other SPA framework. The Playwright mode uses a real Chromium browser to render pages before extracting links, so it discovers routes that only exist in client-side JavaScript.

How does it estimate priority values?

Priority is calculated from two factors: page depth (distance from the homepage) and inbound link count. The homepage always gets priority 1.0. Each additional level of depth reduces priority by 0.2, down to a minimum of 0.1. Pages with many inbound links from other pages on the site receive a boost of up to +0.2.
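The stated heuristic can be sketched as follows. The +0.02-per-link scaling is an assumption for illustration; the actor only documents "a boost of up to +0.2":

```python
def estimate_priority(depth, inbound_links=0):
    """Illustrative version of the heuristic above: homepage is 1.0,
    each depth level subtracts 0.2 (floor 0.1), inbound links add
    up to +0.2 (assumed: +0.02 per link, capped)."""
    if depth == 0:
        return 1.0
    base = max(0.1, 1.0 - 0.2 * depth)
    boost = min(0.2, 0.02 * inbound_links)
    return round(min(1.0, base + boost), 2)
```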

Can it detect my existing sitemap?

Yes. When checkExistingSitemap is enabled (the default), the actor checks robots.txt for Sitemap directives and probes common paths like /sitemap.xml and /sitemap_index.xml. It supports sitemap index files and will recursively fetch all sub-sitemaps.

What happens if my site has more than 50,000 URLs?

The sitemap protocol limits each sitemap file to 50,000 URLs. When this limit is exceeded, the actor automatically splits the output into multiple sitemap files and generates a sitemap index file (SITEMAP_INDEX_XML in the Key-Value Store) that references all the individual sitemaps.
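The splitting logic itself is simple chunking. A sketch of the idea (file names here are illustrative, not the actor's actual naming):

```python
def split_sitemaps(urls, limit=50000):
    """Chunk a URL list at the protocol's 50,000-URL ceiling.
    Returns a name->urls mapping plus the file list for a sitemap
    index, or None when no index is needed."""
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    if len(chunks) <= 1:
        return {"sitemap.xml": urls}, None
    files, index = {}, []
    for n, chunk in enumerate(chunks, 1):
        name = f"sitemap-{n}.xml"
        files[name] = chunk
        index.append(name)  # referenced from the sitemap index file
    return files, index
```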

Does it respect robots.txt?

Yes. The respectRobotsTxt option is enabled by default. The actor parses robots.txt for Disallow directives and Crawl-delay values. Disallowed paths are skipped entirely (never requested), and crawl delay is honored by reducing concurrency.
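You can reproduce the Disallow check yourself with the standard library's robots.txt parser, similar in spirit to what the actor does before enqueueing URLs:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body and check candidate URLs against it
robots = RobotFileParser()
robots.parse("""User-agent: *
Disallow: /admin/
Disallow: /api/
""".splitlines())

def allowed(url):
    """True if the wildcard user agent may fetch this URL."""
    return robots.can_fetch("*", url)
```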

Can I filter which pages are included?

Yes. Use includePatterns to specify glob patterns for URLs that should appear in the sitemap (e.g., ["/blog/*"]). Use excludePatterns to exclude specific paths or file types. Common binary file extensions are excluded by default.

How often should I regenerate my sitemap?

For most sites, weekly or monthly regeneration is sufficient. For news sites or frequently updated content, consider daily runs. You can schedule the actor on Apify to run automatically at any interval.

What's a "page crawled" for pricing purposes?

A page crawled is any unique URL that the actor successfully fetches and receives a response from (HTTP 2xx or 3xx). Pages that fail to load, URLs blocked by robots.txt, and duplicates filtered before the request is made are not counted as billable events.