Sitemap Generator
Pricing
from $1.00 / 1,000 pages crawled
Generate XML sitemaps by crawling websites: the actor follows links, respects robots.txt, and supports configurable depth and page limits. It outputs valid XML with lastmod, changefreq, and priority, plus a URL inventory with status codes. Ideal for SEO audits and site migrations.
Developer: junipr
Introduction
Sitemap Generator is a production-grade Apify actor that crawls any website and generates a standards-compliant XML sitemap. It discovers all accessible pages through link following, respects configurable depth limits and URL patterns, and detects existing sitemaps from robots.txt and common paths like /sitemap.xml. The actor outputs a ready-to-submit XML sitemap plus a structured page list with metadata including last modified dates, change frequency estimates, and calculated priority values.
Primary use cases:
- SEO professionals auditing and generating sitemaps for client sites
- Web developers building sitemaps for sites without CMS-generated ones
- DevOps teams automating sitemap generation in CI/CD pipelines
- Content teams verifying all pages are indexed and discoverable
- Migration specialists mapping old site structure for redirects
Key differentiators: JavaScript rendering support via Playwright for SPAs, automatic existing sitemap detection and merging, depth-based priority estimation with inbound link boosting, lastmod detection from HTTP headers and meta tags, and auto-split at 50K URLs per sitemap protocol spec.
Why Use This Actor
| Feature | This actor | Sitemap Generator (Apify) | XML Sitemap Creator | Screaming Frog |
|---|---|---|---|---|
| JS-rendered pages | Yes (Playwright) | No | No | Yes (desktop) |
| Existing sitemap detection | Yes (robots.txt + paths) | No | Partial | Yes |
| Priority estimation | Depth + link count | None | Static values | Heuristic |
| lastmod from headers | Yes (multi-source) | No | No | Yes |
| changefreq estimation | Yes (content heuristic) | No | No | No |
| Output: XML + JSON | Both | XML only | XML only | XML + CSV |
| Auto-split >50K URLs | Yes with index | No | No | Yes |
| Canonical URL handling | Full support | No | No | Yes |
| PPE pricing | $2/1K pages | Compute-based | Compute-based | License fee |
| Zero-config | Yes | Yes | Mostly | No |
This actor handles the most common pain points with existing sitemap generators: failure on JavaScript-rendered pages, no detection of existing sitemaps, poor deduplication of query-parameterized URLs, and lack of meaningful priority values.
How to Use
Zero-Config Quick Start
Just provide a start URL and run. Everything else has sensible defaults:
```json
{ "startUrl": "https://example.com" }
```
The actor will crawl up to 500 pages, generate an XML sitemap, and store it in the Key-Value Store under the `SITEMAP_XML` key.
Step-by-Step
- Go to the actor's page on Apify Console
- Enter your website URL in the Start URL field
- (Optional) Adjust max pages, depth, or enable Playwright for JS-heavy sites
- Click Start to run the actor
- When complete, download the XML sitemap from the Key-Value Store tab (`SITEMAP_XML`)
- Upload the sitemap to Google Search Console or place it at your site's root
Common Configuration Recipes
Quick Audit — Default settings for a fast overview:
```json
{
  "startUrl": "https://example.com",
  "maxPages": 500,
  "crawlerType": "cheerio"
}
```
Full Site Map — Comprehensive crawl of the entire site:
```json
{
  "startUrl": "https://example.com",
  "maxPages": 50000,
  "maxDepth": 10,
  "crawlerType": "cheerio"
}
```
Blog Only — Generate sitemap for just the blog section:
```json
{
  "startUrl": "https://example.com/blog",
  "includePatterns": ["/blog/*"],
  "stayWithinPath": true
}
```
SPA Site — JavaScript-rendered single page application:
```json
{
  "startUrl": "https://app.example.com",
  "crawlerType": "playwright",
  "maxConcurrency": 5
}
```
Compare with Existing — See what your existing sitemap is missing:
```json
{
  "startUrl": "https://example.com",
  "existingSitemapAction": "compare"
}
```
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | required | Root URL to start crawling |
| `maxPages` | integer | `500` | Max pages to crawl (1-100,000) |
| `maxDepth` | integer | `5` | Link-following depth (`0` = start URL only) |
| `includePatterns` | string[] | `[]` | Glob patterns for URLs to include |
| `excludePatterns` | string[] | common file extensions | Glob patterns for URLs to exclude |
| `crawlerType` | string | `"cheerio"` | Engine: `"cheerio"` (fast) or `"playwright"` (JS rendering) |
| `includeLastmod` | boolean | `true` | Include last-modified dates |
| `includeChangefreq` | boolean | `true` | Include change frequency |
| `includePriority` | boolean | `true` | Include calculated priority |
| `checkExistingSitemap` | boolean | `true` | Detect existing sitemaps |
| `existingSitemapAction` | string | `"merge"` | `merge`, `replace`, or `compare` |
| `respectRobotsTxt` | boolean | `true` | Honor robots.txt directives |
| `sitemapFormat` | string | `"xml"` | Output: `xml`, `txt`, or `both` |
| `splitAtCount` | integer | `50000` | Auto-split threshold |
See the Input Schema tab for the complete list of parameters with detailed descriptions.
Output Format
XML Sitemap (Key-Value Store: SITEMAP_XML)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-15T10:30:00Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2025-12-01T08:00:00Z</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
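For reference, a urlset document in this shape can be assembled with Python's standard library. This is a minimal sketch of the output format, not the actor's internals; the `build_sitemap` helper and its entry dicts are illustrative names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a <urlset> document from page records.

    Each entry is a dict with a "loc" key plus optional "lastmod",
    "changefreq", and "priority" keys, mirroring the fields above.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([
    {"loc": "https://example.com/", "lastmod": "2026-01-15T10:30:00Z",
     "changefreq": "daily", "priority": 1.0},
])
```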
Dataset Item (one per page crawled)
```json
{
  "url": "https://example.com/about",
  "statusCode": 200,
  "depth": 1,
  "title": "About Us",
  "lastModified": "2025-12-01T08:00:00.000Z",
  "changefreq": "monthly",
  "priority": 0.8,
  "contentType": "text/html",
  "responseTimeMs": 245,
  "inSitemap": true,
  "excludeReason": null
}
```
Run Summary (Key-Value Store: RUN_SUMMARY)
```json
{
  "startUrl": "https://example.com",
  "totalPagesCrawled": 347,
  "totalPagesInSitemap": 312,
  "pagesExcluded": 35,
  "pagesSkippedByRobots": 8,
  "pagesFailed": 3,
  "duplicatesRemoved": 12,
  "existingSitemapFound": true,
  "existingSitemapUrls": 290,
  "durationMs": 45200,
  "sitemapSplitCount": 1
}
```
Tips and Advanced Usage
Optimizing Crawl Speed
- Use `crawlerType: "cheerio"` for static sites; it is 5-10x faster than Playwright and uses far less memory
- Increase `maxConcurrency` for faster crawls on sites that handle high request rates
- Set `excludeQueryParams: true` (the default) to avoid crawling the same page with different query strings
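The query-string deduplication described above boils down to normalizing each URL before it enters the request queue. A minimal sketch (the `canonicalize` helper is illustrative, not the actor's actual code):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url, strip_query=True):
    """Normalize a URL so variants of the same page dedupe to one key."""
    parts = urlsplit(url)
    query = "" if strip_query else parts.query
    # Drop the fragment too: it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

urls = [
    "https://example.com/products?sort=price",
    "https://example.com/products?sort=name",
    "https://example.com/products#reviews",
]
seen, unique = set(), []
for u in urls:
    key = canonicalize(u)
    if key not in seen:       # only the first variant is ever fetched
        seen.add(key)
        unique.append(u)
```

All three variants collapse to one canonical key, so only the first is crawled (and billed).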
URL Pattern Filtering
- Use `includePatterns` to limit the sitemap to specific sections: `["/blog/*", "/products/*"]`
- Use `excludePatterns` to skip admin pages, API endpoints, or file downloads
- Patterns use glob syntax: `*` matches anything except `/`, `**` matches anything including `/`
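To make the `*` vs `**` distinction concrete, here is a sketch of how that glob dialect maps onto regular expressions. The helper names are illustrative; the actor's own matcher may differ in detail.

```python
import re

def glob_to_regex(pattern):
    """Translate the glob dialect above to a regex:
    '**' matches anything, '*' matches anything except '/'."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + "$")

def should_include(path, include, exclude):
    """Include wins only if some include pattern matches and no exclude does."""
    if include and not any(p.match(path) for p in include):
        return False
    return not any(p.match(path) for p in exclude)

include = [glob_to_regex("/blog/*")]
exclude = [glob_to_regex("/blog/drafts/**")]
```

With these patterns, `/blog/hello` is included, while `/blog/2024/post` is not, because `*` stops at the `/`; you would need `/blog/**` to cover nested paths.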
Existing Sitemap Workflows
- Merge (default): Combines crawled URLs with the existing sitemap for a complete picture
- Compare: Generates a diff report showing URLs missing from your current sitemap and URLs in the sitemap that are no longer accessible
- Replace: Ignores the existing sitemap entirely and generates a fresh one from the crawl
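Conceptually, the compare mode is a set difference between the crawl results and the existing sitemap. A minimal sketch of that diff (field names here are illustrative, not the actor's report schema):

```python
def compare_sitemaps(crawled, existing):
    """Diff crawled URLs against an existing sitemap's URLs."""
    crawled, existing = set(crawled), set(existing)
    return {
        # pages the crawl found that the sitemap does not list
        "missing_from_sitemap": sorted(crawled - existing),
        # sitemap entries the crawl could no longer reach
        "stale_in_sitemap": sorted(existing - crawled),
        "in_both": sorted(crawled & existing),
    }

report = compare_sitemaps(
    crawled=["https://example.com/", "https://example.com/new-page"],
    existing=["https://example.com/", "https://example.com/old-page"],
)
```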
Submitting to Search Engines
After generating your sitemap, download it from the Key-Value Store and either submit it in Google Search Console or place it at your site's root and reference it from robots.txt with a Sitemap: directive. Note that Google retired its sitemap ping endpoint in 2023, so submitting through Search Console (or the Search Console API) is the reliable route.
Pricing
This actor uses Pay-Per-Event (PPE) pricing at $2.00 per 1,000 pages crawled.
A billable event occurs when the actor successfully fetches a URL, processes it, and records the result. You are NOT charged for URLs blocked by robots.txt, duplicate URLs filtered before request, failed requests, or the initial robots.txt and existing sitemap fetches.
Cost Examples
| Scenario | Pages | Cost |
|---|---|---|
| Small blog (50 pages) | 50 | $0.10 |
| Business site (200 pages) | 200 | $0.40 |
| E-commerce (5,000 pages) | 5,000 | $10.00 |
| News site (50,000 pages) | 50,000 | $100.00 |
Plus standard Apify platform compute costs based on memory and runtime.
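The per-event math behind the table is straightforward; as a quick sanity check (using the $2.00 / 1,000 pages rate stated above):

```python
def estimate_cost(pages, rate_per_1000=2.00):
    """PPE cost in dollars: $2.00 per 1,000 billable page-crawl events."""
    return round(pages / 1000 * rate_per_1000, 2)

small_blog = estimate_cost(50)       # 50 pages
ecommerce = estimate_cost(5_000)     # 5,000 pages
news_site = estimate_cost(50_000)    # 50,000 pages
```

Remember these figures exclude the platform compute charges mentioned above.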
FAQ
Does it handle JavaScript-rendered pages?
Yes. Set crawlerType to "playwright" to enable full browser rendering. This handles React, Next.js, Vue, Angular, and any other SPA framework. The Playwright mode uses a real Chromium browser to render pages before extracting links, so it discovers routes that only exist in client-side JavaScript.
How does it estimate priority values?
Priority is calculated from two factors: page depth (distance from the homepage) and inbound link count. The homepage always gets priority 1.0. Each additional level of depth reduces priority by 0.2, down to a minimum of 0.1. Pages with many inbound links from other pages on the site receive a boost of up to +0.2.
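The rules above can be sketched as a small function. This is an approximation of the described behavior; the exact shape of the inbound-link boost (linear in the code below) is an assumption, and the actor's real formula may differ.

```python
def estimate_priority(depth, inbound_links, max_inbound):
    """Depth-based priority with an inbound-link boost, per the FAQ above.

    depth: link distance from the homepage (0 = homepage)
    inbound_links: links pointing at this page from elsewhere on the site
    max_inbound: the highest inbound count seen on the site (for scaling)
    """
    if depth == 0:
        return 1.0  # homepage is always 1.0
    base = max(1.0 - 0.2 * depth, 0.1)          # -0.2 per level, floor 0.1
    boost = 0.2 * (inbound_links / max_inbound) if max_inbound else 0.0
    return round(min(base + boost, 1.0), 1)
```

For example, a depth-2 page that is the most-linked page on the site would get 0.6 base plus the full 0.2 boost.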
Can it detect my existing sitemap?
Yes. When checkExistingSitemap is enabled (the default), the actor checks robots.txt for Sitemap directives and probes common paths like /sitemap.xml and /sitemap_index.xml. It supports sitemap index files and will recursively fetch all sub-sitemaps.
What happens if my site has more than 50,000 URLs?
The sitemap protocol limits each sitemap file to 50,000 URLs. When this limit is exceeded, the actor automatically splits the output into multiple sitemap files and generates a sitemap index file (SITEMAP_INDEX_XML in the Key-Value Store) that references all the individual sitemaps.
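The splitting logic amounts to chunking the URL list and emitting one index entry per chunk. A sketch (the `sitemap-N.xml` filenames are illustrative; a tiny limit is used here so the example is easy to follow):

```python
def split_sitemap(urls, limit=50_000):
    """Chunk a URL list into sitemap-sized files plus index entries.

    The sitemap protocol caps each file at 50,000 URLs, so any overflow
    goes into additional files referenced from a sitemap index.
    """
    files = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    index = [f"sitemap-{n}.xml" for n in range(1, len(files) + 1)]
    return files, index

files, index = split_sitemap(
    [f"https://example.com/p{i}" for i in range(120)], limit=50
)
```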
Does it respect robots.txt?
Yes. The respectRobotsTxt option is enabled by default. The actor parses robots.txt for Disallow directives and Crawl-delay values. Disallowed paths are skipped entirely (never requested), and crawl delay is honored by reducing concurrency.
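Python's standard library illustrates the same Disallow and Crawl-delay handling the actor performs. A minimal sketch with an inline robots.txt:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

allowed = rp.can_fetch("*", "https://example.com/about")        # not disallowed
blocked = rp.can_fetch("*", "https://example.com/admin/login")  # under /admin/
delay = rp.crawl_delay("*")                                     # seconds between hits
```

Disallowed paths are never requested (and never billed), and a crawl delay like this one is what the actor honors by throttling concurrency.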
Can I filter which pages are included?
Yes. Use includePatterns to specify glob patterns for URLs that should appear in the sitemap (e.g., ["/blog/*"]). Use excludePatterns to exclude specific paths or file types. Common binary file extensions are excluded by default.
How often should I regenerate my sitemap?
For most sites, weekly or monthly regeneration is sufficient. For news sites or frequently updated content, consider daily runs. You can schedule the actor on Apify to run automatically at any interval.
What's a "page crawled" for pricing purposes?
A page crawled is any unique URL that the actor successfully fetches and receives a response from (HTTP 2xx or 3xx). Pages that fail to load, URLs blocked by robots.txt, and duplicates filtered before the request is made are not counted as billable events.