Sitemap Generator

Pricing: from $7.00 / 1,000 results
Rating: 5.0 (1)
Developer: Sameer Pun (Maintained by Community)
Actor stats: 1 bookmarked, 1 total user, 0 monthly active users, last modified a day ago

Sitemap Generator Actor

Python Apify Actor that crawls a single hostname and generates sitemap files:

  • sitemap.xml (always an XML sitemap index)
  • sitemap-00001.xml, sitemap-00002.xml, ... (50,000 URLs max per chunk)
  • sitemap.html (optional)
  • sitemap.txt (optional)
  • sitemap-summary.json (run summary and output key references)
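
The index-plus-chunks layout listed above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the helper names (`chunk_urls`, `chunk_name`, `build_index`) and the `base_url` parameter (where the chunk files are assumed to be hosted) are hypothetical.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_CHUNK = 50_000  # per-chunk limit stated in the list above


def chunk_urls(urls, size=MAX_URLS_PER_CHUNK):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def chunk_name(index):
    """Map a 1-based chunk number to a file name such as sitemap-00001.xml."""
    return f"sitemap-{index:05d}.xml"


def build_index(base_url, chunk_count):
    """Render a sitemap index that points at every chunk file.

    `base_url` is an assumed public prefix for the chunk files.
    """
    entries = "".join(
        f"  <sitemap><loc>{escape(base_url + chunk_name(i))}</loc></sitemap>\n"
        for i in range(1, chunk_count + 1)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )
```

Because sitemap.xml is always an index, even a small site with a single chunk would get an index containing one entry under this scheme.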

The Actor respects robots.txt, includes only canonical URLs, deduplicates by normalized URL, and supports regex include/exclude filters.
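
As a rough illustration of deduplication by normalized URL combined with regex filters, the sketch below makes several assumptions that the description above does not confirm: normalization lowercases the scheme and host and drops the fragment, the deny-list is checked before the allow-list, an empty allow-list admits everything, and patterns are matched with `re.search`. All function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # fragments are client-side only, so they are discarded
    ))


def is_allowed(url, include_patterns, exclude_patterns):
    """Assumed semantics: deny-list wins; an empty allow-list admits everything."""
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if not include_patterns:
        return True
    return any(re.search(p, url) for p in include_patterns)


def filter_and_dedupe(urls, include_patterns=(), exclude_patterns=()):
    """Return normalized URLs in first-seen order, without duplicates."""
    seen = set()
    kept = []
    for url in urls:
        normalized = normalize_url(url)
        if normalized in seen:
            continue
        if not is_allowed(normalized, include_patterns, exclude_patterns):
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

With `excludePatterns` set to `["/private"]`, the inputs `https://Example.com` and `https://example.com/#top` would collapse into the single entry `https://example.com/`, and anything under `/private` would be dropped.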

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | required | Start page (http/https) |
| maxDepth | integer | 3 | Max crawl depth (startUrl is depth 0) |
| maxPages | integer | 1000 | Max fetched pages |
| concurrency | integer | 10 | Concurrent HTTP workers (1-50) |
| allowNoindex | boolean | false | If true, includes pages with noindex directives |
| sitemapSeedUrls | string[] | [] | Optional sitemap XML URLs to seed discovery (in addition to robots.txt Sitemap: entries) |
| includePatterns | string[] | [] | Regex allow-list for URLs |
| excludePatterns | string[] | [] | Regex deny-list for URLs |
| outputFormats | string[] | ["html","txt"] | Optional extra outputs (html, txt) |
| lastmodStrategy | string | headers | headers or crawl_time |
| changefreq | string | weekly | Default sitemap changefreq |
| priorityRules.defaultPriority | number | 0.5 | Default sitemap priority |
| priorityRules.rules | object[] | [] | Ordered regex overrides (pattern, optional priority, optional changefreq) |
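
One plausible reading of the two `lastmodStrategy` values is sketched below. It assumes that `headers` means the HTTP `Last-Modified` response header and that a missing or unparsable header falls back to the crawl time; neither detail is confirmed by the table. The function name `resolve_lastmod` and the plain-dict `headers` argument are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

W3C_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def resolve_lastmod(strategy, headers, crawl_time):
    """Return a UTC timestamp string suitable for a sitemap <lastmod> value."""
    if strategy == "headers":
        raw = headers.get("Last-Modified")  # assumes exact-case header keys
        if raw:
            try:
                parsed = parsedate_to_datetime(raw)
            except (TypeError, ValueError):
                parsed = None  # unparsable header: fall through to crawl time
            if parsed is not None:
                if parsed.tzinfo is None:
                    parsed = parsed.replace(tzinfo=timezone.utc)
                return parsed.astimezone(timezone.utc).strftime(W3C_FORMAT)
    # "crawl_time" strategy, or the assumed fallback when no usable header exists
    return crawl_time.astimezone(timezone.utc).strftime(W3C_FORMAT)
```

For example, a response carrying `Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT` would yield `2015-10-21T07:28:00Z` under the `headers` strategy.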

Run Locally (Apify)

  1. Put your JSON input into storage/key_value_stores/default/INPUT.json.
  2. Run:
$ apify run

Example INPUT.json:

{
  "startUrl": "https://example.com/",
  "maxDepth": 2,
  "maxPages": 500,
  "concurrency": 10,
  "allowNoindex": false,
  "includePatterns": [],
  "excludePatterns": ["/private", "/preview"],
  "outputFormats": ["html", "txt"],
  "lastmodStrategy": "headers",
  "changefreq": "weekly",
  "priorityRules": {
    "defaultPriority": 0.5,
    "rules": [
      { "pattern": "/docs/", "priority": 0.8, "changefreq": "daily" }
    ]
  }
}
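
To show how ordered `priorityRules` like the ones in this example might be applied to a URL, here is a minimal sketch. It assumes the first matching rule wins and that patterns are tested with `re.search`; the Actor's real matching semantics may differ. `resolve_priority` is a hypothetical name.

```python
import re


def resolve_priority(url, priority_rules, default_changefreq="weekly"):
    """Return (priority, changefreq) for a URL; the first matching rule wins."""
    priority = priority_rules.get("defaultPriority", 0.5)
    changefreq = default_changefreq
    for rule in priority_rules.get("rules", []):
        if re.search(rule["pattern"], url):
            # priority and changefreq are both optional within a rule
            priority = rule.get("priority", priority)
            changefreq = rule.get("changefreq", changefreq)
            break
    return priority, changefreq
```

With the example input above, a URL under `/docs/` would resolve to priority 0.8 and changefreq `daily`, while any other URL would keep 0.5 and `weekly`.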

Run Locally (CLI)

The module also supports direct CLI flags:

python -m src \
  --start-url https://example.com/ \
  --max-depth 2 \
  --max-pages 500 \
  --concurrency 10 \
  --allow-noindex \
  --sitemap-seed-url https://example.com/sitemap.xml \
  --exclude-pattern /private \
  --output-format html \
  --output-format txt \
  --lastmod-strategy headers \
  --changefreq weekly

Priority rules can be passed on the CLI as either a JSON string or a file path:

$ python -m src --start-url https://example.com/ --priority-rules-json priority-rules.json

CLI flags take precedence over Actor input: CLI > INPUT JSON > defaults.
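
The precedence order can be pictured as a layered merge. The sketch below is illustrative only: it assumes unset CLI flags arrive as `None` (as they would from `argparse` with no default), and `merge_settings` and the trimmed `DEFAULTS` mapping are hypothetical.

```python
# A subset of the documented defaults, for illustration only.
DEFAULTS = {"maxDepth": 3, "maxPages": 1000, "concurrency": 10}


def merge_settings(cli_values, input_json, defaults=DEFAULTS):
    """Layer settings so that CLI > INPUT JSON > defaults."""
    merged = dict(defaults)
    merged.update(input_json)
    # Only CLI flags that were actually provided override lower layers.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged
```

For instance, passing `--max-depth 2` on the CLI alongside an INPUT.json that sets `maxPages` to 500 would give `maxDepth` 2, `maxPages` 500, and the default `concurrency` of 10.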

Output Locations

  • Dataset: one item per included canonical URL (url, lastmod, changefreq, priority, depth, sourceUrl, statusCode, discoveredAt)
  • Key-value store records:
    • sitemap.xml
    • sitemap-00001.xml, sitemap-00002.xml, ...
    • sitemap.html (if enabled)
    • sitemap.txt (if enabled)
    • sitemap-summary.json
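
To connect the dataset items to the XML chunks, the following sketch turns one item into a `<url>` entry. It assumes that only `url`, `lastmod`, `changefreq`, and `priority` appear in the sitemap and that the remaining fields (depth, sourceUrl, statusCode, discoveredAt) are diagnostic; the Actor may serialize entries differently. `item_to_url_element` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def item_to_url_element(item):
    """Build a sitemap <url> element from a dataset item (a plain dict)."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = item["url"]
    for field in ("lastmod", "changefreq", "priority"):
        value = item.get(field)
        if value is not None:  # skip optional fields that are absent
            ET.SubElement(url, field).text = str(value)
    return url


item = {
    "url": "https://example.com/docs/",
    "lastmod": "2024-01-02",
    "changefreq": "daily",
    "priority": 0.8,
    "depth": 1,
}
xml_text = ET.tostring(item_to_url_element(item), encoding="unicode")
```

ElementTree escapes characters such as `&` in the URL, which is why the element is built programmatically here instead of with string formatting.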

Run Tests

$ python -m unittest discover -s tests -p "test_*.py"

The integration fixture site is under fixtures/site/.