Sitemap to URL Crawler — Extract Sitemap.xml URLs for RAG
Pricing
from $0.50 / 1,000 results
Developer
Logiover
Maintained by Community
Sitemap to URL Crawler — RAG & AI Data Feeder

Extract every public URL from any website's sitemap.xml — recursively, instantly, and at scale. Handles nested sitemap indexes automatically. The fastest and most cost-efficient way to build complete URL lists for RAG pipelines, LLM training datasets, SEO audits, and content inventories. Zero configuration required.
What Is This Actor?
A sitemap is an XML file that lists every page a website wants search engines to discover. Almost every modern website exposes one at /sitemap.xml. This actor reads those files, follows nested sitemap indexes to any depth, and returns a clean, structured list of every URL — along with metadata like last modification date, change frequency, and priority score.
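For reference, a minimal `urlset` sitemap has this shape (the URL and field values below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post</loc>
    <lastmod>2025-04-18</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

A sitemap index looks the same, except the root element is `<sitemapindex>` and each `<sitemap><loc>` entry points to another sitemap file rather than a page.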
Built for:
- 🤖 RAG pipelines — feed a complete site URL list into your retrieval-augmented generation system
- 🧠 LLM training & fine-tuning — collect source URLs for web content datasets
- 🔍 SEO audits — inventory every indexed page on a domain
- 📊 Content analysis — understand site structure and publishing cadence
- 🕷️ Pre-crawl URL discovery — use sitemap URLs as seeds for deeper crawlers
- 📦 Data archival — snapshot all public URLs of a website at a point in time
Features
- Recursive sitemap index traversal — automatically follows `<sitemap>` index files to any depth, no matter how many levels of nesting
- Auto-detection — paste a plain domain URL (e.g. `https://example.com`) and the actor automatically appends `/sitemap.xml`
- Rich metadata extraction — captures `lastmod`, `changefreq`, and `priority` fields alongside every URL
- Source tracking — each URL is tagged with the sitemap file it came from
- Configurable URL cap — set a `maxUrls` limit to control run cost and size
- Proxy support — built-in Apify Proxy integration to handle IP blocks on large or protected sites
- Blazing fast — concurrent fetching with up to 10 parallel requests; no browser, no JavaScript rendering overhead
- Zero config — works out of the box with a single URL input
- Export-ready — results available in JSON, CSV, and Excel via Apify Dataset
Output Data
Each record in the output dataset represents a single URL discovered from the sitemap.
| Field | Type | Description |
|---|---|---|
| `url` | string | The full page URL extracted from `<loc>` |
| `lastmod` | string \| null | Last modification date from `<lastmod>` (ISO 8601 or as provided) |
| `changefreq` | string \| null | Suggested update frequency: `always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, `never` |
| `priority` | number \| null | Crawl priority score from 0.0 to 1.0 |
| `sourceSitemap` | string | URL of the sitemap file this entry was found in |
Sample Output Record
```json
{
  "url": "https://apify.com/blog/what-is-web-scraping",
  "lastmod": "2025-04-18",
  "changefreq": "monthly",
  "priority": 0.8,
  "sourceSitemap": "https://apify.com/blog-sitemap.xml"
}
```
Sample Output — Sitemap Index Site
For a large site like nytimes.com, the actor might traverse:
```
/sitemap.xml
├── /sitemaps/sitemap-articles-2025.xml → 1,200 URLs
├── /sitemaps/sitemap-sections.xml → 340 URLs
└── /sitemaps/sitemap-authors.xml → 89 URLs
```
All URLs from all levels are merged into a single flat dataset.
Input Configuration
startUrls · array · required
A list of starting URLs. You can provide:
- A plain domain — the actor automatically appends `/sitemap.xml` (e.g. `https://www.nytimes.com` → tries `https://www.nytimes.com/sitemap.xml`)
- A direct sitemap URL — the actor fetches it immediately (e.g. `https://apify.com/sitemap.xml`)
- A sitemap index URL — the actor recursively traverses all child sitemaps (e.g. `https://www.shopify.com/sitemap_index.xml`)
Multiple entries are supported — you can batch multiple domains in a single run.
Default prefill: https://apify.com/sitemap.xml
maxUrls · integer · default: 10000
The maximum number of URLs to extract across all sitemaps. The actor stops collecting new URLs once this limit is reached, even if more sitemaps remain unprocessed.
Set a higher value (or remove the cap) for full-site inventories. For quick discovery runs, a lower value saves cost.
proxyConfiguration · object · default: Apify Proxy enabled
Controls the proxy used for HTTP requests. Using a proxy is recommended to:
- Avoid IP-based rate limiting on high-traffic domains
- Access sitemaps on sites that block datacenter IPs
Default: Apify Proxy is enabled automatically.
You can configure it to use specific proxy groups, residential proxies, or disable it entirely for public sites that don't require one.
{ "useApifyProxy": true }
Usage Examples
Example 1 — Scrape a full website by domain
```json
{
  "startUrls": [{ "url": "https://www.vercel.com" }],
  "maxUrls": 5000,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
The actor auto-detects https://www.vercel.com/sitemap.xml and extracts all URLs.
Example 2 — Direct sitemap XML link
```json
{
  "startUrls": [{ "url": "https://apify.com/sitemap.xml" }],
  "maxUrls": 10000,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
Example 3 — Multiple domains in one run
```json
{
  "startUrls": [
    { "url": "https://www.shopify.com/sitemap_index.xml" },
    { "url": "https://stripe.com/sitemap.xml" },
    { "url": "https://www.notion.so" }
  ],
  "maxUrls": 20000,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
All three sitemaps are fetched concurrently and merged into a single dataset.
Example 4 — RAG pipeline seed (limited URL sample)
```json
{
  "startUrls": [{ "url": "https://docs.anthropic.com" }],
  "maxUrls": 500,
  "proxyConfiguration": { "useApifyProxy": false }
}
```
Grab the first 500 documentation URLs to feed into a retrieval system, without needing a proxy for public docs sites.
How It Works
The actor follows a simple but robust two-phase logic for every URL it encounters:
Phase 1 — Input Normalization
For each entry in startUrls:
- If the URL ends in `.xml` or contains `sitemap` → fetch it directly
- Otherwise → strip the trailing slash and append `/sitemap.xml`
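The normalization rule above can be sketched in a few lines of Python (the actor itself is implemented in Node.js; this is just an illustrative equivalent):

```python
def normalize_start_url(url: str) -> str:
    """Mimic the actor's Phase 1 normalization (illustrative sketch)."""
    lowered = url.lower()
    # Direct sitemap URLs are fetched as-is
    if lowered.endswith(".xml") or "sitemap" in lowered:
        return url
    # Plain domains get /sitemap.xml appended after stripping the trailing slash
    return url.rstrip("/") + "/sitemap.xml"

print(normalize_start_url("https://www.nytimes.com/"))      # https://www.nytimes.com/sitemap.xml
print(normalize_start_url("https://apify.com/sitemap.xml"))  # https://apify.com/sitemap.xml
```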
Phase 2 — Recursive XML Parsing
Each fetched XML file is parsed with cheerio in XML mode:
If the file is a Sitemap Index (<sitemapindex> with <sitemap><loc> entries):
- All child sitemap URLs are extracted and enqueued for processing
- The actor follows them recursively with up to 10 concurrent workers
If the file is a URL Set (<urlset> with <url><loc> entries):
- All `<url>` entries are parsed; `loc`, `lastmod`, `changefreq`, and `priority` fields are extracted
- Records are pushed to the Apify Dataset in batches
The process continues until all sitemaps are processed or maxUrls is reached.
```
Input URL
    │
    ▼
Normalize URL (auto-append /sitemap.xml if needed)
    │
    ▼
Fetch XML
    │
    ├── Sitemap Index? ──► Enqueue all <sitemap><loc> URLs ──► Recurse
    │
    └── URL Set? ──► Extract <url> entries ──► Push to Dataset
```
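The Phase 2 branching can be sketched in Python with the standard library's XML parser (the actor itself uses cheerio in Node.js; the sample XML below is a placeholder):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str):
    """Classify a fetched sitemap file and extract its contents."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        # Sitemap index: return child sitemap URLs to enqueue
        return "index", [loc.text for loc in root.iter(NS + "loc")]
    # URL set: return one record per <url> entry
    records = []
    for url_el in root.iter(NS + "url"):
        rec = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            el = url_el.find(NS + field)
            rec[field] = el.text if el is not None else None
        records.append(rec)
    return "urlset", records

urlset_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page</loc><lastmod>2025-04-18</lastmod></url>
</urlset>"""
print(parse_sitemap(urlset_xml))
```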
Performance
| Scenario | Speed | Notes |
|---|---|---|
| Small site (< 1,000 URLs) | Seconds | Single sitemap file |
| Medium site (1,000–50,000 URLs) | 1–3 minutes | Sitemap index with several children |
| Large site (50,000–500,000 URLs) | 5–20 minutes | Deep nested index, many files |
| Very large site (500,000+ URLs) | 20–60+ minutes | Use maxUrls to cap if needed |
Concurrency: Up to 10 sitemap files fetched in parallel.
No browser overhead: Pure HTTP + XML parsing — no Chromium, no JavaScript rendering. This keeps memory usage low and speed high.
Export Formats
Once the actor run completes, download your results from the Apify Dataset in:
- JSON — full structured records including all metadata fields
- CSV — flat table, ready for Excel, Google Sheets, or pandas
- Excel (.xlsx) — native spreadsheet format
- JSONL — one record per line, ideal for streaming into LLM pipelines
- XML — structured markup format
Navigate to Storage → Dataset → Export in the Apify Console to download.
Common Use Cases in Detail
RAG & LLM Pipelines
This actor is the ideal first step in any web-based AI data pipeline:
- Run this actor to get a complete URL list from your target site
- Feed those URLs into a content scraper (e.g. Apify Website Content Crawler)
- Chunk and embed the extracted text into your vector database
- Query with your RAG system
Using sitemap URLs instead of crawling guarantees you only collect pages the site owner has explicitly published — no 404s, no internal admin pages, no duplicates.
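The handoff from step 1 to step 2 can be as simple as parsing a JSONL export of this actor's dataset into a plain URL list for whatever document loader you use (the `sample_jsonl` content below is a placeholder standing in for a real export):

```python
import json

# Placeholder standing in for a real JSONL export (Storage → Dataset → Export)
sample_jsonl = "\n".join([
    json.dumps({"url": "https://example.com/docs/a", "lastmod": "2025-01-02"}),
    json.dumps({"url": "https://example.com/docs/b", "lastmod": None}),
])

def urls_from_jsonl(text: str) -> list[str]:
    """Extract the URL list to hand to a content scraper or RAG loader."""
    return [json.loads(line)["url"] for line in text.splitlines() if line.strip()]

print(urls_from_jsonl(sample_jsonl))
```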
SEO Audits
Export the full URL list to CSV and cross-reference with:
- Google Search Console for indexation coverage
- Your analytics tool (GA4, Plausible) for traffic gaps
- A Screaming Frog crawl for technical issues
The lastmod and priority fields give you additional signal about which pages the site considers most important.
Competitive Research
Sitemaps reveal a competitor's full content architecture — blog posts, product pages, landing pages, documentation — without any guesswork. Run this actor on any public domain to map their content strategy.
Limitations
- Sitemap must be publicly accessible. Password-protected or robots.txt-blocked sitemaps cannot be fetched without credentials (not supported).
- Not all websites have a sitemap. The actor will fail gracefully if `/sitemap.xml` returns a 404.
- Some sitemaps are dynamically generated. Sitemaps rendered by JavaScript (not returned as raw XML) are not supported; the actor only parses static XML responses.
- `lastmod`, `changefreq`, and `priority` are optional fields in the sitemap spec. Many sites omit them; those fields will be `null` in the output.
- Auto-detection only tries `/sitemap.xml`. Some sites place their sitemap at a custom path declared in `robots.txt` (e.g. `/sitemap_index.xml`, `/sitemaps/main.xml`). In those cases, provide the direct sitemap URL as input.
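When auto-detection misses a custom path, you can usually find the declared sitemap location yourself by reading the site's `robots.txt`. A small Python sketch for extracting `Sitemap:` directives (the sample `robots.txt` content is a placeholder):

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect Sitemap: directive URLs from a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

sample_robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemaps/main.xml"
print(sitemaps_from_robots(sample_robots))
```

Feed the discovered URLs into `startUrls` as direct sitemap links.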
Frequently Asked Questions
Q: What if the website doesn't have a sitemap?
The actor will log an error for that URL and move on. No data will be output for that domain. You can verify manually by visiting https://yourdomain.com/sitemap.xml in your browser.
Q: Can I scrape multiple websites in one run?
Yes. Add multiple entries to startUrls. All results are combined into a single dataset, with each record tagged with its sourceSitemap URL so you can distinguish them.
Q: How deep does recursive traversal go?
Unlimited. The actor follows sitemap index files at any nesting depth until it reaches leaf URL sets or hits the maxUrls limit.
Q: What exactly does maxUrls limit?
It limits the total number of <url> entries extracted and saved to the dataset. It does not limit the number of sitemap XML files fetched.
Q: Do I need a proxy?
For most public websites, no proxy is needed. However, large or popular sites (major news outlets, e-commerce platforms) may rate-limit repeated requests. Apify Proxy is enabled by default and recommended for robustness.
Q: Can I use this to feed a LangChain or LlamaIndex pipeline?
Yes — export the dataset as JSON or JSONL and use the URL list as input to any document loader that accepts a list of URLs. This actor is especially useful as the URL discovery step before running a content scraper.
Q: What happens if a sub-sitemap returns an error?
The failedRequestHandler logs the error and the actor continues with the remaining sitemaps. One failed sub-sitemap does not abort the entire run.
Q: Is the output deduplicated?
Individual sitemap files are not re-fetched (Crawlee's request deduplication prevents this). However, if a URL appears in multiple sitemap files on the same site, it may appear more than once in the output. Post-process with a simple unique by url step if needed.
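A minimal post-processing sketch of that "unique by url" step in Python (record contents are placeholders):

```python
def dedupe_by_url(records: list[dict]) -> list[dict]:
    """Keep the first record for each URL, preserving input order."""
    seen = set()
    out = []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            out.append(rec)
    return out

records = [
    {"url": "https://a.com/1", "sourceSitemap": "s1.xml"},
    {"url": "https://a.com/1", "sourceSitemap": "s2.xml"},  # duplicate URL
    {"url": "https://a.com/2", "sourceSitemap": "s1.xml"},
]
print(dedupe_by_url(records))
```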
Technical Details
| Property | Value |
|---|---|
| Runtime | Node.js 20 (ES Modules) |
| Framework | Apify SDK v3 + Crawlee |
| HTTP client | got-scraping (browser-like headers, proxy support) |
| XML parser | cheerio (XML mode) |
| Max concurrency | 10 parallel sitemap fetches |
| Request timeout | 30 seconds per sitemap file |
| Max crawl requests | 500 XML files per run (configurable) |
| Memory footprint | Very low — no browser, pure HTTP |
Changelog
v1.0.0
- Initial release
- Recursive sitemap index traversal
- Auto-detection of sitemap URL from plain domain input
- Extraction of `url`, `lastmod`, `changefreq`, `priority`, and `sourceSitemap`
- `maxUrls` limit with early termination
- Apify Proxy integration
- JSON, CSV, and Excel export via Apify Dataset
Support
If you encounter issues — sitemaps returning unexpected formats, auto-detection failing, or proxy errors — please open a support ticket via the Apify Console or reach out through the Apify community forum. Include the target URL and the actor run ID in your report.