Sitemap to URL Crawler — Extract Sitemap.xml URLs for RAG

Pricing: from $0.50 / 1,000 results · Developer: Logiover (Maintained by Community)
Extract every public URL from any website's sitemap.xml — recursively, instantly, and at scale. Handles nested sitemap indexes automatically. The fastest and most cost-efficient way to build complete URL lists for RAG pipelines, LLM training datasets, SEO audits, and content inventories. Zero configuration required.


What Is This Actor?

A sitemap is an XML file that lists every page a website wants search engines to discover. Almost every modern website exposes one at /sitemap.xml. This actor reads those files, follows nested sitemap indexes to any depth, and returns a clean, structured list of every URL — along with metadata like last modification date, change frequency, and priority score.

Built for:

  • 🤖 RAG pipelines — feed a complete site URL list into your retrieval-augmented generation system
  • 🧠 LLM training & fine-tuning — collect source URLs for web content datasets
  • 🔍 SEO audits — inventory every indexed page on a domain
  • 📊 Content analysis — understand site structure and publishing cadence
  • 🕷️ Pre-crawl URL discovery — use sitemap URLs as seeds for deeper crawlers
  • 📦 Data archival — snapshot all public URLs of a website at a point in time

Features

  • Recursive sitemap index traversal — automatically follows <sitemap> index files to any depth, no matter how many levels of nesting
  • Auto-detection — paste a plain domain URL (e.g. https://example.com) and the actor automatically appends /sitemap.xml
  • Rich metadata extraction — captures lastmod, changefreq, and priority fields alongside every URL
  • Source tracking — each URL is tagged with the sitemap file it came from
  • Configurable URL cap — set a maxUrls limit to control run cost and size
  • Proxy support — built-in Apify Proxy integration to handle IP blocks on large or protected sites
  • Blazing fast — concurrent fetching with up to 10 parallel requests; no browser, no JavaScript rendering overhead
  • Zero config — works out of the box with a single URL input
  • Export-ready — results available in JSON, CSV, and Excel via Apify Dataset

Output Data

Each record in the output dataset represents a single URL discovered from the sitemap.

Field           Type             Description
url             string           The full page URL extracted from <loc>
lastmod         string | null    Last modification date from <lastmod> (ISO 8601 or as provided)
changefreq      string | null    Suggested update frequency: always, hourly, daily, weekly, monthly, yearly, never
priority        number | null    Crawl priority score from 0.0 to 1.0
sourceSitemap   string           URL of the sitemap file this entry was found in

Sample Output Record

{
  "url": "https://apify.com/blog/what-is-web-scraping",
  "lastmod": "2025-04-18",
  "changefreq": "monthly",
  "priority": 0.8,
  "sourceSitemap": "https://apify.com/blog-sitemap.xml"
}

Sample Output — Sitemap Index Site

For a large site like nytimes.com, the actor might traverse:

/sitemap.xml
├── /sitemaps/sitemap-articles-2025.xml → 1,200 URLs
├── /sitemaps/sitemap-sections.xml → 340 URLs
└── /sitemaps/sitemap-authors.xml → 89 URLs

All URLs from all levels are merged into a single flat dataset.


Input Configuration

startUrls · array · required

A list of starting URLs. You can provide:

  1. A plain domain — the actor automatically appends /sitemap.xml

    https://www.nytimes.com
    → tries https://www.nytimes.com/sitemap.xml
  2. A direct sitemap URL — the actor fetches it immediately

    https://apify.com/sitemap.xml
  3. A sitemap index URL — the actor recursively traverses all child sitemaps

    https://www.shopify.com/sitemap_index.xml

Multiple entries are supported — you can batch multiple domains in a single run.

Default prefill: https://apify.com/sitemap.xml


maxUrls · integer · default: 10000

The maximum number of URLs to extract across all sitemaps. The actor stops collecting new URLs once this limit is reached, even if more sitemaps remain unprocessed.

Set a higher value (or remove the cap) for full-site inventories. For quick discovery runs, a lower value saves cost.


proxyConfiguration · object · default: Apify Proxy enabled

Controls the proxy used for HTTP requests. Using a proxy is recommended to:

  • Avoid IP-based rate limiting on high-traffic domains
  • Access sitemaps on sites that block datacenter IPs

Default: Apify Proxy is enabled automatically.

You can configure it to use specific proxy groups, residential proxies, or disable it entirely for public sites that don't require one.

{ "useApifyProxy": true }
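
For example, to route requests through Apify's residential proxy pool instead of the default datacenter IPs (`apifyProxyGroups` is the standard Apify proxy configuration field; the group shown is one example):

```
{
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}
```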

Usage Examples

Example 1 — Scrape a full website by domain

{
  "startUrls": [{ "url": "https://www.vercel.com" }],
  "maxUrls": 5000,
  "proxyConfiguration": { "useApifyProxy": true }
}

The actor auto-detects https://www.vercel.com/sitemap.xml and extracts all URLs.


Example 2 — Scrape a direct sitemap URL

{
  "startUrls": [{ "url": "https://apify.com/sitemap.xml" }],
  "maxUrls": 10000,
  "proxyConfiguration": { "useApifyProxy": true }
}

Example 3 — Multiple domains in one run

{
  "startUrls": [
    { "url": "https://www.shopify.com/sitemap_index.xml" },
    { "url": "https://stripe.com/sitemap.xml" },
    { "url": "https://www.notion.so" }
  ],
  "maxUrls": 20000,
  "proxyConfiguration": { "useApifyProxy": true }
}

All three sitemaps are fetched concurrently and merged into a single dataset.


Example 4 — RAG pipeline seed (limited URL sample)

{
  "startUrls": [{ "url": "https://docs.anthropic.com" }],
  "maxUrls": 500,
  "proxyConfiguration": { "useApifyProxy": false }
}

Grab the first 500 documentation URLs to feed into a retrieval system, without needing a proxy for public docs sites.


How It Works

The actor follows a simple but robust two-phase logic for every URL it encounters:

Phase 1 — Input Normalization

For each entry in startUrls:

  • If the URL ends in .xml or contains sitemap → fetch it directly
  • Otherwise → strip the trailing slash and append /sitemap.xml
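
The normalization rule above can be sketched as a small function (`normalizeStartUrl` is a hypothetical helper name for illustration, not the actor's internal API):

```javascript
// Phase 1 sketch: decide whether an input URL is already a sitemap
// reference or a plain domain that needs /sitemap.xml appended.
function normalizeStartUrl(url) {
  const trimmed = url.trim();
  // Direct sitemap reference: fetch it as-is.
  if (trimmed.endsWith('.xml') || trimmed.includes('sitemap')) {
    return trimmed;
  }
  // Plain domain: strip trailing slashes, then append /sitemap.xml.
  return trimmed.replace(/\/+$/, '') + '/sitemap.xml';
}

console.log(normalizeStartUrl('https://www.nytimes.com'));
// → "https://www.nytimes.com/sitemap.xml"
```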

Phase 2 — Recursive XML Parsing

Each fetched XML file is parsed with cheerio in XML mode:

If the file is a Sitemap Index (<sitemapindex> with <sitemap><loc> entries):

  • All child sitemap URLs are extracted and enqueued for processing
  • The actor follows them recursively with up to 10 concurrent workers

If the file is a URL Set (<urlset> with <url><loc> entries):

  • All <url> entries are parsed
  • loc, lastmod, changefreq, and priority fields are extracted
  • Records are pushed to the Apify Dataset in batches

The process continues until all sitemaps are processed or maxUrls is reached.

Input URL
Normalize URL (auto-append /sitemap.xml if needed)
Fetch XML
├── Sitemap Index? ──► Enqueue all <sitemap><loc> URLs ──► Recurse
└── URL Set? ──► Extract <url> entries ──► Push to Dataset
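
The branch above can be sketched without the actor's cheerio dependency. This regex-based approximation is illustrative only; it is not robust against CDATA sections or namespaced tags, which the real cheerio-based parser handles:

```javascript
// Phase 2 sketch: classify a fetched XML file as a sitemap index or a
// URL set, and extract the corresponding entries.
function parseSitemapXml(xml) {
  if (/<sitemapindex[\s>]/.test(xml)) {
    // Sitemap index: collect child sitemap URLs to enqueue for recursion.
    const childSitemaps = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)]
      .map(m => m[1]);
    return { type: 'index', childSitemaps };
  }
  // URL set: extract each <url> entry with its optional metadata fields.
  const urls = [...xml.matchAll(/<url>([\s\S]*?)<\/url>/g)].map(([, entry]) => {
    const field = (tag) => {
      const m = entry.match(new RegExp(`<${tag}>\\s*([^<]+?)\\s*</${tag}>`));
      return m ? m[1] : null;
    };
    const priority = field('priority');
    return {
      url: field('loc'),
      lastmod: field('lastmod'),
      changefreq: field('changefreq'),
      priority: priority === null ? null : Number(priority),
    };
  });
  return { type: 'urlset', urls };
}
```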

Performance

Scenario                             Speed            Notes
Small site (< 1,000 URLs)            Seconds          Single sitemap file
Medium site (1,000–50,000 URLs)      1–3 minutes      Sitemap index with several children
Large site (50,000–500,000 URLs)     5–20 minutes     Deep nested index, many files
Very large site (500,000+ URLs)      20–60+ minutes   Use maxUrls to cap if needed

Concurrency: Up to 10 sitemap files fetched in parallel.
No browser overhead: Pure HTTP + XML parsing — no Chromium, no JavaScript rendering. This keeps memory usage low and speed high.
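
In the actor itself, bounded concurrency is handled by Crawlee; the pattern of running up to N fetches in parallel can be sketched with plain promises (`runLimited` is a hypothetical helper for illustration):

```javascript
// Run async tasks with at most `limit` in flight at once, preserving
// result order. Each task is a zero-argument async function.
async function runLimited(tasks, limit = 10) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```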


Export Formats

Once the actor run completes, download your results from the Apify Dataset in:

  • JSON — full structured records including all metadata fields
  • CSV — flat table, ready for Excel, Google Sheets, or pandas
  • Excel (.xlsx) — native spreadsheet format
  • JSONL — one record per line, ideal for streaming into LLM pipelines
  • XML — structured markup format

Navigate to Storage → Dataset → Export in the Apify Console to download.
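
For LLM pipelines, the JSONL export can be reduced to a plain URL list with a few lines of post-processing (the sample lines below follow the output schema documented above):

```javascript
// Turn an exported JSONL string (one JSON record per line) into a
// flat array of URLs, skipping blank lines.
function urlsFromJsonl(jsonl) {
  return jsonl
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line).url);
}

const sample = '{"url":"https://example.com/a"}\n{"url":"https://example.com/b"}\n';
console.log(urlsFromJsonl(sample));
// → ["https://example.com/a", "https://example.com/b"]
```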


Common Use Cases in Detail

RAG & LLM Pipelines

This actor is the ideal first step in any web-based AI data pipeline:

  1. Run this actor to get a complete URL list from your target site
  2. Feed those URLs into a content scraper (e.g. Apify Website Content Crawler)
  3. Chunk and embed the extracted text into your vector database
  4. Query with your RAG system

Using sitemap URLs instead of crawling guarantees you only collect pages the site owner has explicitly published — no 404s, no internal admin pages, no duplicates.

SEO Audits

Export the full URL list to CSV and cross-reference with:

  • Google Search Console for indexation coverage
  • Your analytics tool (GA4, Plausible) for traffic gaps
  • A Screaming Frog crawl for technical issues

The lastmod and priority fields give you additional signal about which pages the site considers most important.

Competitive Research

Sitemaps reveal a competitor's full content architecture — blog posts, product pages, landing pages, documentation — without any guesswork. Run this actor on any public domain to map their content strategy.


Limitations

  • Sitemap must be publicly accessible. Password-protected or robots.txt-blocked sitemaps cannot be fetched; authentication credentials are not supported.
  • Not all websites have a sitemap. The actor will fail gracefully if /sitemap.xml returns a 404.
  • Some sitemaps are dynamically generated. Sitemaps rendered by JavaScript (not returned as raw XML) are not supported; the actor only parses static XML responses.
  • lastmod, changefreq, and priority are optional fields in the sitemap spec. Many sites omit them; those fields will be null in the output.
  • Auto-detection only tries /sitemap.xml. Some sites place their sitemap at a custom path declared in robots.txt (e.g. /sitemap_index.xml, /sitemaps/main.xml). In those cases, provide the direct sitemap URL as input.
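
A simple workaround for the last limitation: custom sitemap locations are usually declared via the standard Sitemap: directive in robots.txt, which can be extracted and passed to the actor as direct startUrls (`sitemapsFromRobotsTxt` is a hypothetical helper for illustration):

```javascript
// Extract the URLs declared by "Sitemap:" directives in a robots.txt
// body. The directive name is case-insensitive per the robots.txt spec.
function sitemapsFromRobotsTxt(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^sitemap:/i.test(line))
    .map(line => line.slice('sitemap:'.length).trim());
}

const robots = 'User-agent: *\nDisallow:\nSitemap: https://example.com/sitemaps/main.xml';
console.log(sitemapsFromRobotsTxt(robots));
// → ["https://example.com/sitemaps/main.xml"]
```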

Frequently Asked Questions

Q: What if the website doesn't have a sitemap?
The actor will log an error for that URL and move on. No data will be output for that domain. You can verify manually by visiting https://yourdomain.com/sitemap.xml in your browser.

Q: Can I scrape multiple websites in one run?
Yes. Add multiple entries to startUrls. All results are combined into a single dataset, with each record tagged with its sourceSitemap URL so you can distinguish them.

Q: How deep does recursive traversal go?
Unlimited. The actor follows sitemap index files at any nesting depth until it reaches leaf URL sets or hits the maxUrls limit.

Q: What does maxUrls limit exactly?
It limits the total number of <url> entries extracted and saved to the dataset. It does not limit the number of sitemap XML files fetched.

Q: Do I need a proxy?
For most public websites, no proxy is needed. However, large or popular sites (major news outlets, e-commerce platforms) may rate-limit repeated requests. Apify Proxy is enabled by default and recommended for robustness.

Q: Can I use this to feed a LangChain or LlamaIndex pipeline?
Yes — export the dataset as JSON or JSONL and use the URL list as input to any document loader that accepts a list of URLs. This actor is especially useful as the URL discovery step before running a content scraper.

Q: What happens if a sub-sitemap returns an error?
The failedRequestHandler logs the error and the actor continues with the remaining sitemaps. One failed sub-sitemap does not abort the entire run.

Q: Is the output deduplicated?
Individual sitemap files are not re-fetched (Crawlee's request deduplication prevents this). However, if a URL appears in multiple sitemap files on the same site, it may appear more than once in the output. Post-process with a simple unique by url step if needed.
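
That unique-by-url post-processing step is a one-liner over the exported records:

```javascript
// Keep only the first record seen for each URL.
function dedupeByUrl(records) {
  const seen = new Set();
  return records.filter(rec => {
    if (seen.has(rec.url)) return false;
    seen.add(rec.url);
    return true;
  });
}
```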


Technical Details

Property             Value
Runtime              Node.js 20 (ES Modules)
Framework            Apify SDK v3 + Crawlee
HTTP client          got-scraping (browser-like headers, proxy support)
XML parser           cheerio (XML mode)
Max concurrency      10 parallel sitemap fetches
Request timeout      30 seconds per sitemap file
Max crawl requests   500 XML files per run (configurable)
Memory footprint     Very low — no browser, pure HTTP

Changelog

v1.0.0

  • Initial release
  • Recursive sitemap index traversal
  • Auto-detection of sitemap URL from plain domain input
  • Extraction of url, lastmod, changefreq, priority, and sourceSitemap
  • maxUrls limit with early termination
  • Apify Proxy integration
  • JSON, CSV, and Excel export via Apify Dataset

Support

If you encounter issues — sitemaps returning unexpected formats, auto-detection failing, or proxy errors — please open a support ticket via the Apify Console or reach out through the Apify community forum. Include the target URL and the actor run ID in your report.