Sitemap Generator
Pricing
from $1.00 / 1,000 pages crawled
Generate XML sitemaps by crawling websites: the actor follows links, respects robots.txt, and supports configurable depth and page limits. It outputs valid XML with lastmod, changefreq, and priority, plus a URL inventory with status codes. Ideal for SEO audits and site migrations.
Developer: junipr
Introduction
Sitemap Generator is a production-grade Apify actor that crawls any website and generates a standards-compliant XML sitemap. It discovers all accessible pages through link following, respects configurable depth limits and URL patterns, and detects existing sitemaps from robots.txt and common paths like /sitemap.xml. The actor outputs a ready-to-submit XML sitemap plus a structured page list with metadata including last modified dates, change frequency estimates, and calculated priority values.
Primary use cases:
- SEO professionals auditing and generating sitemaps for client sites
- Web developers building sitemaps for sites without CMS-generated ones
- DevOps teams automating sitemap generation in CI/CD pipelines
- Content teams verifying all pages are indexed and discoverable
- Migration specialists mapping old site structure for redirects
Key differentiators: JavaScript rendering support via Playwright for SPAs, automatic existing sitemap detection and merging, depth-based priority estimation with inbound link boosting, lastmod detection from HTTP headers and meta tags, and auto-split at 50K URLs per sitemap protocol spec.
Why Use This Actor
| Feature | This actor | Sitemap Generator (Apify) | XML Sitemap Creator | Screaming Frog |
|---|---|---|---|---|
| JS-rendered pages | Yes (Playwright) | No | No | Yes (desktop) |
| Existing sitemap detection | Yes (robots.txt + paths) | No | Partial | Yes |
| Priority estimation | Depth + link count | None | Static values | Heuristic |
| lastmod from headers | Yes (multi-source) | No | No | Yes |
| changefreq estimation | Yes (content heuristic) | No | No | No |
| Output: XML + JSON | Both | XML only | XML only | XML + CSV |
| Auto-split >50K URLs | Yes with index | No | No | Yes |
| Canonical URL handling | Full support | No | No | Yes |
| PPE pricing | $2/1K pages | Compute-based | Compute-based | License fee |
| Zero-config | Yes | Yes | Mostly | No |
This actor handles the most common pain points with existing sitemap generators: failure on JavaScript-rendered pages, no detection of existing sitemaps, poor deduplication of query-parameterized URLs, and lack of meaningful priority values.
How to Use
Zero-Config Quick Start
Just provide a start URL and run. Everything else has sensible defaults:
```json
{ "startUrl": "https://example.com" }
```
The actor will crawl up to 500 pages, generate an XML sitemap, and store it in the Key-Value Store under the `SITEMAP_XML` key.
Step-by-Step
- Go to the actor's page on Apify Console
- Enter your website URL in the Start URL field
- (Optional) Adjust max pages, depth, or enable Playwright for JS-heavy sites
- Click Start to run the actor
- When complete, download the XML sitemap from the Key-Value Store tab (`SITEMAP_XML`)
- Upload the sitemap to Google Search Console or place it at your site's root
Common Configuration Recipes
Quick Audit — Default settings for a fast overview:
```json
{
  "startUrl": "https://example.com",
  "maxPages": 500,
  "crawlerType": "cheerio"
}
```
Full Site Map — Comprehensive crawl of the entire site:
```json
{
  "startUrl": "https://example.com",
  "maxPages": 50000,
  "maxDepth": 10,
  "crawlerType": "cheerio"
}
```
Blog Only — Generate sitemap for just the blog section:
```json
{
  "startUrl": "https://example.com/blog",
  "includePatterns": ["/blog/*"],
  "stayWithinPath": true
}
```
SPA Site — JavaScript-rendered single page application:
```json
{
  "startUrl": "https://app.example.com",
  "crawlerType": "playwright",
  "maxConcurrency": 5
}
```
Compare with Existing — See what your existing sitemap is missing:
```json
{
  "startUrl": "https://example.com",
  "existingSitemapAction": "compare"
}
```
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | required | Root URL to start crawling |
| `maxPages` | integer | `500` | Max pages to crawl (1-100,000) |
| `maxDepth` | integer | `5` | Link-following depth (`0` = start URL only) |
| `includePatterns` | string[] | `[]` | Glob patterns for URLs to include |
| `excludePatterns` | string[] | common file extensions | Glob patterns for URLs to exclude |
| `crawlerType` | string | `"cheerio"` | Engine: `"cheerio"` (fast) or `"playwright"` (JS rendering) |
| `includeLastmod` | boolean | `true` | Include last-modified dates |
| `includeChangefreq` | boolean | `true` | Include change frequency |
| `includePriority` | boolean | `true` | Include calculated priority |
| `checkExistingSitemap` | boolean | `true` | Detect existing sitemaps |
| `existingSitemapAction` | string | `"merge"` | `merge`, `replace`, or `compare` |
| `respectRobotsTxt` | boolean | `true` | Honor robots.txt directives |
| `sitemapFormat` | string | `"xml"` | Output: `xml`, `txt`, or `both` |
| `splitAtCount` | integer | `50000` | Auto-split threshold |
See the Input Schema tab for the complete list of parameters with detailed descriptions.
Output Format
XML Sitemap (Key-Value Store: SITEMAP_XML)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-15T10:30:00Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2025-12-01T08:00:00Z</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
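For reference, a urlset document in this shape can be assembled with Python's standard library. This is a minimal sketch of the output format, not the actor's internals; the `build_sitemap` helper and its entry dicts are illustrative names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a <urlset> document from page records.

    Each entry is a dict with a "loc" key plus optional "lastmod",
    "changefreq", and "priority" keys, mirroring the fields above.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([
    {"loc": "https://example.com/", "lastmod": "2026-01-15T10:30:00Z",
     "changefreq": "daily", "priority": 1.0},
])
```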
Dataset Item (one per page crawled)
```json
{
  "url": "https://example.com/about",
  "statusCode": 200,
  "depth": 1,
  "title": "About Us",
  "lastModified": "2025-12-01T08:00:00.000Z",
  "changefreq": "monthly",
  "priority": 0.8,
  "contentType": "text/html",
  "responseTimeMs": 245,
  "inSitemap": true,
  "excludeReason": null
}
```
Run Summary (Key-Value Store: RUN_SUMMARY)
```json
{
  "startUrl": "https://example.com",
  "totalPagesCrawled": 347,
  "totalPagesInSitemap": 312,
  "pagesExcluded": 35,
  "pagesSkippedByRobots": 8,
  "pagesFailed": 3,
  "duplicatesRemoved": 12,
  "existingSitemapFound": true,
  "existingSitemapUrls": 290,
  "durationMs": 45200,
  "sitemapSplitCount": 1
}
```
Tips and Advanced Usage
Optimizing Crawl Speed
- Use `crawlerType: "cheerio"` for static sites; it is 5-10x faster than Playwright and uses far less memory
- Increase `maxConcurrency` for faster crawls on sites that handle high request rates
- Set `excludeQueryParams: true` (the default) to avoid crawling the same page with different query strings
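The query-string deduplication described above boils down to normalizing each URL before it enters the request queue. A minimal sketch (the `canonicalize` helper is illustrative, not the actor's actual code):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, strip_query=True):
    """Normalize a URL so variants of the same page dedupe to one key."""
    parts = urlsplit(url)
    query = "" if strip_query else parts.query
    # Drop the fragment too: it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

urls = [
    "https://example.com/products?sort=price",
    "https://example.com/products?sort=name",
    "https://example.com/products#reviews",
]
seen, unique = set(), []
for u in urls:
    key = canonicalize(u)
    if key not in seen:       # only the first variant is ever fetched
        seen.add(key)
        unique.append(u)
```

All three variants collapse to one canonical key, so only the first is crawled (and billed).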
URL Pattern Filtering
- Use `includePatterns` to limit the sitemap to specific sections: `["/blog/*", "/products/*"]`
- Use `excludePatterns` to skip admin pages, API endpoints, or file downloads
- Patterns use glob syntax: `*` matches anything except `/`, `**` matches anything including `/`
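To make the `*` vs `**` distinction concrete, here is a sketch of how that glob dialect maps onto regular expressions. The helper names are illustrative; the actor's own matcher may differ in detail.

```python
import re

def glob_to_regex(pattern):
    """Translate the glob dialect above to a regex:
    '**' matches anything, '*' matches anything except '/'."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + "$")

def should_include(path, include, exclude):
    """Include wins only if some include pattern matches and no exclude does."""
    if include and not any(p.match(path) for p in include):
        return False
    return not any(p.match(path) for p in exclude)

include = [glob_to_regex("/blog/*")]
exclude = [glob_to_regex("/blog/drafts/**")]
```

With these patterns, `/blog/hello` is included, while `/blog/2024/post` is not, because `*` stops at the `/`; you would need `/blog/**` to cover nested paths.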
Existing Sitemap Workflows
- Merge (default): Combines crawled URLs with the existing sitemap for a complete picture
- Compare: Generates a diff report showing URLs missing from your current sitemap and URLs in the sitemap that are no longer accessible
- Replace: Ignores the existing sitemap entirely and generates a fresh one from the crawl
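Conceptually, the compare mode is a set difference between the crawl results and the existing sitemap. A minimal sketch of that diff (field names here are illustrative, not the actor's report schema):

```python
def compare_sitemaps(crawled, existing):
    """Diff crawled URLs against an existing sitemap's URLs."""
    crawled, existing = set(crawled), set(existing)
    return {
        # pages the crawl found that the sitemap does not list
        "missing_from_sitemap": sorted(crawled - existing),
        # sitemap entries the crawl could no longer reach
        "stale_in_sitemap": sorted(existing - crawled),
        "in_both": sorted(crawled & existing),
    }

report = compare_sitemaps(
    crawled=["https://example.com/", "https://example.com/new-page"],
    existing=["https://example.com/", "https://example.com/old-page"],
)
```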
Submitting to Search Engines
After generating your sitemap, download it from the Key-Value Store and either submit it in Google Search Console or place it at your site's root and reference it from robots.txt with a Sitemap: directive. Note that Google retired its sitemap ping endpoint in 2023, so submitting through Search Console (or the Search Console API) is the reliable route.
Pricing
This actor uses Pay-Per-Event (PPE) pricing at $2.00 per 1,000 pages crawled.
A billable event occurs when the actor successfully fetches a URL, processes it, and records the result. You are NOT charged for URLs blocked by robots.txt, duplicate URLs filtered before request, failed requests, or the initial robots.txt and existing sitemap fetches.
Cost Examples
| Scenario | Pages | Cost |
|---|---|---|
| Small blog (50 pages) | 50 | $0.10 |
| Business site (200 pages) | 200 | $0.40 |
| E-commerce (5,000 pages) | 5,000 | $10.00 |
| News site (50,000 pages) | 50,000 | $100.00 |
Plus standard Apify platform compute costs based on memory and runtime.
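The per-event math behind the table is straightforward; as a quick sanity check (using the $2.00 / 1,000 pages rate stated above):

```python
def estimate_cost(pages, rate_per_1000=2.00):
    """PPE cost in dollars: $2.00 per 1,000 billable page-crawl events."""
    return round(pages / 1000 * rate_per_1000, 2)

small_blog = estimate_cost(50)       # 50 pages
ecommerce = estimate_cost(5_000)     # 5,000 pages
news_site = estimate_cost(50_000)    # 50,000 pages
```

Remember these figures exclude the platform compute charges mentioned above.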
FAQ
Does it handle JavaScript-rendered pages?
Yes. Set crawlerType to "playwright" to enable full browser rendering. This handles React, Next.js, Vue, Angular, and any other SPA framework. The Playwright mode uses a real Chromium browser to render pages before extracting links, so it discovers routes that only exist in client-side JavaScript.
How does it estimate priority values?
Priority is calculated from two factors: page depth (distance from the homepage) and inbound link count. The homepage always gets priority 1.0. Each additional level of depth reduces priority by 0.2, down to a minimum of 0.1. Pages with many inbound links from other pages on the site receive a boost of up to +0.2.
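The rules above can be sketched as a small function. This is an approximation of the described behavior; the exact shape of the inbound-link boost (linear in the code below) is an assumption, and the actor's real formula may differ.

```python
def estimate_priority(depth, inbound_links, max_inbound):
    """Depth-based priority with an inbound-link boost, per the FAQ above.

    depth: link distance from the homepage (0 = homepage)
    inbound_links: links pointing at this page from elsewhere on the site
    max_inbound: the highest inbound count seen on the site (for scaling)
    """
    if depth == 0:
        return 1.0  # homepage is always 1.0
    base = max(1.0 - 0.2 * depth, 0.1)          # -0.2 per level, floor 0.1
    boost = 0.2 * (inbound_links / max_inbound) if max_inbound else 0.0
    return round(min(base + boost, 1.0), 1)
```

For example, a depth-2 page that is the most-linked page on the site would get 0.6 base plus the full 0.2 boost.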
Can it detect my existing sitemap?
Yes. When checkExistingSitemap is enabled (the default), the actor checks robots.txt for Sitemap directives and probes common paths like /sitemap.xml and /sitemap_index.xml. It supports sitemap index files and will recursively fetch all sub-sitemaps.
What happens if my site has more than 50,000 URLs?
The sitemap protocol limits each sitemap file to 50,000 URLs. When this limit is exceeded, the actor automatically splits the output into multiple sitemap files and generates a sitemap index file (SITEMAP_INDEX_XML in the Key-Value Store) that references all the individual sitemaps.
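The splitting logic amounts to chunking the URL list and emitting one index entry per chunk. A sketch (the `sitemap-N.xml` filenames are illustrative; a tiny limit is used here so the example is easy to follow):

```python
def split_sitemap(urls, limit=50_000):
    """Chunk a URL list into sitemap-sized files plus index entries.

    The sitemap protocol caps each file at 50,000 URLs, so any overflow
    goes into additional files referenced from a sitemap index.
    """
    files = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    index = [f"sitemap-{n}.xml" for n in range(1, len(files) + 1)]
    return files, index

files, index = split_sitemap(
    [f"https://example.com/p{i}" for i in range(120)], limit=50
)
```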
Does it respect robots.txt?
Yes. The respectRobotsTxt option is enabled by default. The actor parses robots.txt for Disallow directives and Crawl-delay values. Disallowed paths are skipped entirely (never requested), and crawl delay is honored by reducing concurrency.
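Python's standard library illustrates the same Disallow and Crawl-delay handling the actor performs. A minimal sketch with an inline robots.txt:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

allowed = rp.can_fetch("*", "https://example.com/about")        # not disallowed
blocked = rp.can_fetch("*", "https://example.com/admin/login")  # under /admin/
delay = rp.crawl_delay("*")                                     # seconds between hits
```

Disallowed paths are never requested (and never billed), and a crawl delay like this one is what the actor honors by throttling concurrency.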
Can I filter which pages are included?
Yes. Use includePatterns to specify glob patterns for URLs that should appear in the sitemap (e.g., ["/blog/*"]). Use excludePatterns to exclude specific paths or file types. Common binary file extensions are excluded by default.
How often should I regenerate my sitemap?
For most sites, weekly or monthly regeneration is sufficient. For news sites or frequently updated content, consider daily runs. You can schedule the actor on Apify to run automatically at any interval.
What's a "page crawled" for pricing purposes?
A page crawled is any unique URL that the actor successfully fetches and receives a response from (HTTP 2xx or 3xx). Pages that fail to load, URLs blocked by robots.txt, and duplicates filtered before the request is made are not counted as billable events.