Sitemap Generator
Pricing
from $7.00 / 1,000 results
Rating: 5.0 (1)
Developer: Sameer Pun (Maintained by Community)
Last modified: a day ago
Sitemap Generator Actor
Python Apify Actor that crawls a single hostname and generates sitemap files:
- sitemap.xml (always an XML sitemap index)
- sitemap-00001.xml, sitemap-00002.xml, ... (50,000 URLs max per chunk)
- sitemap.html (optional)
- sitemap.txt (optional)
- sitemap-summary.json (run summary and output key references)
The Actor respects robots.txt, includes only canonical URLs, deduplicates by normalized URL, and supports regex include/exclude filters.
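The deduplication and filtering behaviour described above could be sketched roughly as follows. This is an illustration only, not the Actor's actual code: the function names are hypothetical, and the normalization rules (lowercased scheme and host, dropped fragment and default port, empty path mapped to `/`), the "exclude wins over include" ordering, and the "empty include list allows everything" rule are all assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop fragment and default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Keep the port only when it is explicit and not the scheme default.
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def is_allowed(url, include_patterns, exclude_patterns):
    """Assumed semantics: any exclude match rejects; an empty include list allows all."""
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return True


def filter_and_dedupe(urls, include_patterns=(), exclude_patterns=()):
    """Return normalized URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        norm = normalize_url(url)
        if norm in seen or not is_allowed(norm, include_patterns, exclude_patterns):
            continue
        seen.add(norm)
        result.append(norm)
    return result
```

With this sketch, `https://Example.com`, `https://example.com/#top`, and `https://example.com:443/` all collapse to a single `https://example.com/` entry.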
Input
| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Start page (http/https) |
| maxDepth | integer | 3 | Max crawl depth (startUrl is depth 0) |
| maxPages | integer | 1000 | Max fetched pages |
| concurrency | integer | 10 | Concurrent HTTP workers (1-50) |
| allowNoindex | boolean | false | If true, includes pages with noindex directives |
| sitemapSeedUrls | string[] | [] | Optional sitemap XML URLs to seed discovery (in addition to robots.txt Sitemap: entries) |
| includePatterns | string[] | [] | Regex allow-list for URLs |
| excludePatterns | string[] | [] | Regex deny-list for URLs |
| outputFormats | string[] | ["html","txt"] | Optional extra outputs (html, txt) |
| lastmodStrategy | string | headers | headers or crawl_time |
| changefreq | string | weekly | Default sitemap changefreq |
| priorityRules.defaultPriority | number | 0.5 | Default sitemap priority |
| priorityRules.rules | object[] | [] | Ordered regex overrides (pattern, optional priority, optional changefreq) |
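As a rough illustration of how the ordered `priorityRules.rules` overrides might be resolved for one URL, here is a minimal sketch. The function name `resolve_sitemap_hints` is hypothetical, and the "first matching rule wins" behaviour is an assumption inferred from the word "ordered"; the Actor may resolve overlapping rules differently.

```python
import re


def resolve_sitemap_hints(url, priority_rules, default_changefreq="weekly"):
    """Return (priority, changefreq) for a URL from a priorityRules-shaped dict."""
    priority = priority_rules.get("defaultPriority", 0.5)
    changefreq = default_changefreq
    for rule in priority_rules.get("rules", []):
        if re.search(rule["pattern"], url):
            # Both override fields are optional; keep the default when one is absent.
            priority = rule.get("priority", priority)
            changefreq = rule.get("changefreq", changefreq)
            break  # assumption: the first matching rule wins
    return priority, changefreq
```

Called with a single rule for `/docs/` (priority 0.8, changefreq `daily`), a URL such as `https://example.com/docs/intro` would get those values, while any other URL keeps the defaults.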
Run Locally (Apify)
- Put your JSON input into storage/key_value_stores/default/INPUT.json.
- Run:

      $ apify run
Example INPUT.json:
    {
      "startUrl": "https://example.com/",
      "maxDepth": 2,
      "maxPages": 500,
      "concurrency": 10,
      "allowNoindex": false,
      "includePatterns": [],
      "excludePatterns": ["/private", "/preview"],
      "outputFormats": ["html", "txt"],
      "lastmodStrategy": "headers",
      "changefreq": "weekly",
      "priorityRules": {
        "defaultPriority": 0.5,
        "rules": [
          { "pattern": "/docs/", "priority": 0.8, "changefreq": "daily" }
        ]
      }
    }
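The `lastmodStrategy` setting in the example input chooses between `headers` and `crawl_time`. One plausible way to implement that choice is sketched below. The function name `resolve_lastmod` is hypothetical, and the use of the `Last-Modified` response header, the fallback to crawl time when the header is missing or unparsable, and the UTC output format are assumptions rather than documented behaviour.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

W3C_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def resolve_lastmod(headers, crawl_time, strategy="headers"):
    """Return a UTC timestamp string for <lastmod>.

    `headers` is a plain dict of response headers; `crawl_time` should be a
    timezone-aware datetime (a naive one would be read as local time).
    """
    if strategy == "headers":
        raw = headers.get("Last-Modified")
        if raw:
            try:
                parsed = parsedate_to_datetime(raw)
            except (TypeError, ValueError):
                parsed = None  # unparsable header: fall through to crawl time
            if parsed is not None:
                if parsed.tzinfo is None:
                    parsed = parsed.replace(tzinfo=timezone.utc)
                return parsed.astimezone(timezone.utc).strftime(W3C_FORMAT)
    return crawl_time.astimezone(timezone.utc).strftime(W3C_FORMAT)
```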
Run Locally (CLI)
The module also supports direct CLI flags:
    python -m src \
      --start-url https://example.com/ \
      --max-depth 2 \
      --max-pages 500 \
      --concurrency 10 \
      --allow-noindex \
      --sitemap-seed-url https://example.com/sitemap.xml \
      --exclude-pattern /private \
      --output-format html \
      --output-format txt \
      --lastmod-strategy headers \
      --changefreq weekly
Priority rules can be passed on the CLI as a JSON string or as a file path:

    $ python -m src --start-url https://example.com/ --priority-rules-json priority-rules.json
CLI flags take precedence over Actor input: CLI > INPUT JSON > defaults.
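The precedence order can be pictured as a layered dictionary merge. The sketch below is illustrative only: `merge_config` is a hypothetical helper, and it assumes that CLI flags the user did not pass are represented as `None` (as `argparse` does when a flag has no default), so they never overwrite lower layers.

```python
DEFAULTS = {"maxDepth": 3, "maxPages": 1000, "concurrency": 10}


def merge_config(defaults, actor_input, cli_args):
    """Layer settings so that CLI > INPUT JSON > defaults."""
    merged = dict(defaults)
    merged.update(actor_input)
    # Assumption: a value of None means "flag not given on the command line".
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged
```

For example, with `maxPages` set to 500 in INPUT.json and `--max-pages 50` on the command line, the merged value would be 50, while `concurrency` stays at its default.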
Output Locations
- Dataset: one item per included canonical URL (url, lastmod, changefreq, priority, depth, sourceUrl, statusCode, discoveredAt)
- Key-value store records:
  - sitemap.xml
  - sitemap-00001.xml, sitemap-00002.xml, ...
  - sitemap.html (if enabled)
  - sitemap.txt (if enabled)
  - sitemap-summary.json
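To make the relationship between `sitemap.xml` and the numbered chunk files concrete, here is a minimal sketch of splitting a URL list into chunks of at most 50,000 entries and building an index that points at them. The function names and the `base_url` parameter (where the chunk files would be hosted) are hypothetical; how the Actor actually derives the `<loc>` values in its index is not stated here.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_CHUNK = 50_000
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, size=MAX_URLS_PER_CHUNK):
    """Split the URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def chunk_name(index):
    """Chunk files are numbered from 1 with five digits, e.g. sitemap-00001.xml."""
    return f"sitemap-{index:05d}.xml"


def build_urlset(urls):
    """Render one chunk as a <urlset> document (loc only, for brevity)."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<urlset xmlns="{SITEMAP_NS}">']
    lines += [f"  <url><loc>{escape(url)}</loc></url>" for url in urls]
    lines.append("</urlset>")
    return "\n".join(lines)


def build_index(base_url, chunk_count):
    """Render the sitemap index; `base_url` is an assumed hosting location."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<sitemapindex xmlns="{SITEMAP_NS}">']
    for i in range(1, chunk_count + 1):
        loc = escape(base_url.rstrip("/") + "/" + chunk_name(i))
        lines.append(f"  <sitemap><loc>{loc}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)
```

Because the index is always written, a site with fewer than 50,000 URLs would still produce a `sitemap.xml` index that references a single `sitemap-00001.xml`.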
Run Tests
    $ python -m unittest discover -s tests -p "test_*.py"

The integration fixture site is under fixtures/site/.