Sitemap to URL Crawler — Extract Sitemap.xml URLs for RAG

Pricing: from $0.50 / 1,000 results · Developer: Logiover (Maintained by Community)
Extract every public URL from any website's sitemap.xml — recursively, instantly, and at scale. Handles nested sitemap indexes automatically. The fastest and most cost-efficient way to build complete URL lists for RAG pipelines, LLM training datasets, SEO audits, and content inventories. Zero configuration required.


What Is This Actor?

A sitemap is an XML file that lists every page a website wants search engines to discover. Almost every modern website exposes one at /sitemap.xml. This actor reads those files, follows nested sitemap indexes to any depth, and returns a clean, structured list of every URL — along with metadata like last modification date, change frequency, and priority score.

Built for:

  • 🤖 RAG pipelines — feed a complete site URL list into your retrieval-augmented generation system
  • 🧠 LLM training & fine-tuning — collect source URLs for web content datasets
  • 🔍 SEO audits — inventory every indexed page on a domain
  • 📊 Content analysis — understand site structure and publishing cadence
  • 🕷️ Pre-crawl URL discovery — use sitemap URLs as seeds for deeper crawlers
  • 📦 Data archival — snapshot all public URLs of a website at a point in time

Features

  • Recursive sitemap index traversal — automatically follows <sitemap> index files to any depth, no matter how many levels of nesting
  • Auto-detection — paste a plain domain URL (e.g. https://example.com) and the actor automatically appends /sitemap.xml
  • Rich metadata extraction — captures lastmod, changefreq, and priority fields alongside every URL
  • Source tracking — each URL is tagged with the sitemap file it came from
  • Configurable URL cap — set a maxUrls limit to control run cost and size
  • Proxy support — built-in Apify Proxy integration to handle IP blocks on large or protected sites
  • Blazing fast — concurrent fetching with up to 10 parallel requests; no browser, no JavaScript rendering overhead
  • Zero config — works out of the box with a single URL input
  • Export-ready — results available in JSON, CSV, and Excel via Apify Dataset

Output Data

Each record in the output dataset represents a single URL discovered from the sitemap.

Field           Type             Description
url             string           The full page URL extracted from <loc>
lastmod         string | null    Last modification date from <lastmod> (ISO 8601 or as provided)
changefreq      string | null    Suggested update frequency: always, hourly, daily, weekly, monthly, yearly, never
priority        number | null    Crawl priority score from 0.0 to 1.0
sourceSitemap   string           URL of the sitemap file this entry was found in

Sample Output Record

{
  "url": "https://apify.com/blog/what-is-web-scraping",
  "lastmod": "2025-04-18",
  "changefreq": "monthly",
  "priority": 0.8,
  "sourceSitemap": "https://apify.com/blog-sitemap.xml"
}

Sample Output — Sitemap Index Site

For a large site like nytimes.com, the actor might traverse:

/sitemap.xml
├── /sitemaps/sitemap-articles-2025.xml → 1,200 URLs
├── /sitemaps/sitemap-sections.xml → 340 URLs
└── /sitemaps/sitemap-authors.xml → 89 URLs

All URLs from all levels are merged into a single flat dataset.


Input Configuration

startUrls · array · required

A list of starting URLs. You can provide:

  1. A plain domain — the actor automatically appends /sitemap.xml

    https://www.nytimes.com
    → tries https://www.nytimes.com/sitemap.xml
  2. A direct sitemap URL — the actor fetches it immediately

    https://apify.com/sitemap.xml
  3. A sitemap index URL — the actor recursively traverses all child sitemaps

    https://www.shopify.com/sitemap_index.xml

Multiple entries are supported — you can batch multiple domains in a single run.

Default prefill: https://apify.com/sitemap.xml


maxUrls · integer · default: 10000

The maximum number of URLs to extract across all sitemaps. The actor stops collecting new URLs once this limit is reached, even if more sitemaps remain unprocessed.

Set a higher value (or remove the cap) for full-site inventories. For quick discovery runs, a lower value saves cost.


proxyConfiguration · object · default: Apify Proxy enabled

Controls the proxy used for HTTP requests. Using a proxy is recommended to:

  • Avoid IP-based rate limiting on high-traffic domains
  • Access sitemaps on sites that block datacenter IPs

Default: Apify Proxy is enabled automatically.

You can configure it to use specific proxy groups, residential proxies, or disable it entirely for public sites that don't require one.

{ "useApifyProxy": true }
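
For example, to route requests through Apify's residential proxy pool instead of the default datacenter IPs (`apifyProxyGroups` is the standard Apify proxy configuration field; the group shown is one example):

```
{
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}
```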

Usage Examples

Example 1 — Scrape a full website by domain

{
  "startUrls": [{ "url": "https://www.vercel.com" }],
  "maxUrls": 5000,
  "proxyConfiguration": { "useApifyProxy": true }
}

The actor auto-detects https://www.vercel.com/sitemap.xml and extracts all URLs.


Example 2 — Scrape a direct sitemap URL

{
  "startUrls": [{ "url": "https://apify.com/sitemap.xml" }],
  "maxUrls": 10000,
  "proxyConfiguration": { "useApifyProxy": true }
}

Example 3 — Multiple domains in one run

{
  "startUrls": [
    { "url": "https://www.shopify.com/sitemap_index.xml" },
    { "url": "https://stripe.com/sitemap.xml" },
    { "url": "https://www.notion.so" }
  ],
  "maxUrls": 20000,
  "proxyConfiguration": { "useApifyProxy": true }
}

All three sitemaps are fetched concurrently and merged into a single dataset.


Example 4 — RAG pipeline seed (limited URL sample)

{
  "startUrls": [{ "url": "https://docs.anthropic.com" }],
  "maxUrls": 500,
  "proxyConfiguration": { "useApifyProxy": false }
}

Grab the first 500 documentation URLs to feed into a retrieval system, without needing a proxy for public docs sites.


How It Works

The actor follows a simple but robust two-phase logic for every URL it encounters:

Phase 1 — Input Normalization

For each entry in startUrls:

  • If the URL ends in .xml or contains sitemap → fetch it directly
  • Otherwise → strip the trailing slash and append /sitemap.xml
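
The normalization rule above can be sketched as a small function (`normalizeStartUrl` is a hypothetical helper name for illustration, not the actor's internal API):

```javascript
// Phase 1 sketch: decide whether an input URL is already a sitemap
// reference or a plain domain that needs /sitemap.xml appended.
function normalizeStartUrl(url) {
  const trimmed = url.trim();
  // Direct sitemap reference: fetch it as-is.
  if (trimmed.endsWith('.xml') || trimmed.includes('sitemap')) {
    return trimmed;
  }
  // Plain domain: strip trailing slashes, then append /sitemap.xml.
  return trimmed.replace(/\/+$/, '') + '/sitemap.xml';
}

console.log(normalizeStartUrl('https://www.nytimes.com'));
// → "https://www.nytimes.com/sitemap.xml"
```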

Phase 2 — Recursive XML Parsing

Each fetched XML file is parsed with cheerio in XML mode:

If the file is a Sitemap Index (<sitemapindex> with <sitemap><loc> entries):

  • All child sitemap URLs are extracted and enqueued for processing
  • The actor follows them recursively with up to 10 concurrent workers

If the file is a URL Set (<urlset> with <url><loc> entries):

  • All <url> entries are parsed
  • loc, lastmod, changefreq, and priority fields are extracted
  • Records are pushed to the Apify Dataset in batches

The process continues until all sitemaps are processed or maxUrls is reached.

Input URL
Normalize URL (auto-append /sitemap.xml if needed)
Fetch XML
├── Sitemap Index? ──► Enqueue all <sitemap><loc> URLs ──► Recurse
└── URL Set? ──► Extract <url> entries ──► Push to Dataset
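
The branch above can be sketched without the actor's cheerio dependency. This regex-based approximation is illustrative only; it is not robust against CDATA sections or namespaced tags, which the real cheerio-based parser handles:

```javascript
// Phase 2 sketch: classify a fetched XML file as a sitemap index or a
// URL set, and extract the corresponding entries.
function parseSitemapXml(xml) {
  if (/<sitemapindex[\s>]/.test(xml)) {
    // Sitemap index: collect child sitemap URLs to enqueue for recursion.
    const childSitemaps = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)]
      .map(m => m[1]);
    return { type: 'index', childSitemaps };
  }
  // URL set: extract each <url> entry with its optional metadata fields.
  const urls = [...xml.matchAll(/<url>([\s\S]*?)<\/url>/g)].map(([, entry]) => {
    const field = (tag) => {
      const m = entry.match(new RegExp(`<${tag}>\\s*([^<]+?)\\s*</${tag}>`));
      return m ? m[1] : null;
    };
    const priority = field('priority');
    return {
      url: field('loc'),
      lastmod: field('lastmod'),
      changefreq: field('changefreq'),
      priority: priority === null ? null : Number(priority),
    };
  });
  return { type: 'urlset', urls };
}
```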

Performance

Scenario                             Speed            Notes
Small site (< 1,000 URLs)            Seconds          Single sitemap file
Medium site (1,000–50,000 URLs)      1–3 minutes      Sitemap index with several children
Large site (50,000–500,000 URLs)     5–20 minutes     Deep nested index, many files
Very large site (500,000+ URLs)      20–60+ minutes   Use maxUrls to cap if needed

Concurrency: Up to 10 sitemap files fetched in parallel.
No browser overhead: Pure HTTP + XML parsing — no Chromium, no JavaScript rendering. This keeps memory usage low and speed high.
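
In the actor itself, bounded concurrency is handled by Crawlee; the pattern of running up to N fetches in parallel can be sketched with plain promises (`runLimited` is a hypothetical helper for illustration):

```javascript
// Run async tasks with at most `limit` in flight at once, preserving
// result order. Each task is a zero-argument async function.
async function runLimited(tasks, limit = 10) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```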


Export Formats

Once the actor run completes, download your results from the Apify Dataset in:

  • JSON — full structured records including all metadata fields
  • CSV — flat table, ready for Excel, Google Sheets, or pandas
  • Excel (.xlsx) — native spreadsheet format
  • JSONL — one record per line, ideal for streaming into LLM pipelines
  • XML — structured markup format

Navigate to Storage → Dataset → Export in the Apify Console to download.
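
For LLM pipelines, the JSONL export can be reduced to a plain URL list with a few lines of post-processing (the sample lines below follow the output schema documented above):

```javascript
// Turn an exported JSONL string (one JSON record per line) into a
// flat array of URLs, skipping blank lines.
function urlsFromJsonl(jsonl) {
  return jsonl
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line).url);
}

const sample = '{"url":"https://example.com/a"}\n{"url":"https://example.com/b"}\n';
console.log(urlsFromJsonl(sample));
// → ["https://example.com/a", "https://example.com/b"]
```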


Common Use Cases in Detail

RAG & LLM Pipelines

This actor is the ideal first step in any web-based AI data pipeline:

  1. Run this actor to get a complete URL list from your target site
  2. Feed those URLs into a content scraper (e.g. Apify Website Content Crawler)
  3. Chunk and embed the extracted text into your vector database
  4. Query with your RAG system

Using sitemap URLs instead of crawling guarantees you only collect pages the site owner has explicitly published — no 404s, no internal admin pages, no duplicates.

SEO Audits

Export the full URL list to CSV and cross-reference with:

  • Google Search Console for indexation coverage
  • Your analytics tool (GA4, Plausible) for traffic gaps
  • A Screaming Frog crawl for technical issues

The lastmod and priority fields give you additional signal about which pages the site considers most important.

Competitive Research

Sitemaps reveal a competitor's full content architecture — blog posts, product pages, landing pages, documentation — without any guesswork. Run this actor on any public domain to map their content strategy.


Limitations

  • Sitemap must be publicly accessible. Password-protected or robots.txt-blocked sitemaps cannot be fetched; authentication credentials are not supported.
  • Not all websites have a sitemap. The actor will fail gracefully if /sitemap.xml returns a 404.
  • Some sitemaps are dynamically generated. Sitemaps rendered by JavaScript (not returned as raw XML) are not supported; the actor only parses static XML responses.
  • lastmod, changefreq, and priority are optional fields in the sitemap spec. Many sites omit them; those fields will be null in the output.
  • Auto-detection only tries /sitemap.xml. Some sites place their sitemap at a custom path declared in robots.txt (e.g. /sitemap_index.xml, /sitemaps/main.xml). In those cases, provide the direct sitemap URL as input.
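
A simple workaround for the last limitation: custom sitemap locations are usually declared via the standard Sitemap: directive in robots.txt, which can be extracted and passed to the actor as direct startUrls (`sitemapsFromRobotsTxt` is a hypothetical helper for illustration):

```javascript
// Extract the URLs declared by "Sitemap:" directives in a robots.txt
// body. The directive name is case-insensitive per the robots.txt spec.
function sitemapsFromRobotsTxt(robotsTxt) {
  return robotsTxt
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^sitemap:/i.test(line))
    .map(line => line.slice('sitemap:'.length).trim());
}

const robots = 'User-agent: *\nDisallow:\nSitemap: https://example.com/sitemaps/main.xml';
console.log(sitemapsFromRobotsTxt(robots));
// → ["https://example.com/sitemaps/main.xml"]
```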

Frequently Asked Questions

Q: What if the website doesn't have a sitemap?
The actor will log an error for that URL and move on. No data will be output for that domain. You can verify manually by visiting https://yourdomain.com/sitemap.xml in your browser.

Q: Can I scrape multiple websites in one run?
Yes. Add multiple entries to startUrls. All results are combined into a single dataset, with each record tagged with its sourceSitemap URL so you can distinguish them.

Q: How deep does recursive traversal go?
Unlimited. The actor follows sitemap index files at any nesting depth until it reaches leaf URL sets or hits the maxUrls limit.

Q: What does maxUrls limit exactly?
It limits the total number of <url> entries extracted and saved to the dataset. It does not limit the number of sitemap XML files fetched.

Q: Do I need a proxy?
For most public websites, no proxy is needed. However, large or popular sites (major news outlets, e-commerce platforms) may rate-limit repeated requests. Apify Proxy is enabled by default and recommended for robustness.

Q: Can I use this to feed a LangChain or LlamaIndex pipeline?
Yes — export the dataset as JSON or JSONL and use the URL list as input to any document loader that accepts a list of URLs. This actor is especially useful as the URL discovery step before running a content scraper.

Q: What happens if a sub-sitemap returns an error?
The failedRequestHandler logs the error and the actor continues with the remaining sitemaps. One failed sub-sitemap does not abort the entire run.

Q: Is the output deduplicated?
Individual sitemap files are not re-fetched (Crawlee's request deduplication prevents this). However, if a URL appears in multiple sitemap files on the same site, it may appear more than once in the output. Post-process with a simple unique by url step if needed.
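
That unique-by-url post-processing step is a one-liner over the exported records:

```javascript
// Keep only the first record seen for each URL.
function dedupeByUrl(records) {
  const seen = new Set();
  return records.filter(rec => {
    if (seen.has(rec.url)) return false;
    seen.add(rec.url);
    return true;
  });
}
```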


Technical Details

Property             Value
Runtime              Node.js 20 (ES Modules)
Framework            Apify SDK v3 + Crawlee
HTTP client          got-scraping (browser-like headers, proxy support)
XML parser           cheerio (XML mode)
Max concurrency      10 parallel sitemap fetches
Request timeout      30 seconds per sitemap file
Max crawl requests   500 XML files per run (configurable)
Memory footprint     Very low — no browser, pure HTTP

Changelog

v1.0.0

  • Initial release
  • Recursive sitemap index traversal
  • Auto-detection of sitemap URL from plain domain input
  • Extraction of url, lastmod, changefreq, priority, and sourceSitemap
  • maxUrls limit with early termination
  • Apify Proxy integration
  • JSON, CSV, and Excel export via Apify Dataset

Support

If you encounter issues — sitemaps returning unexpected formats, auto-detection failing, or proxy errors — please open a support ticket via the Apify Console or reach out through the Apify community forum. Include the target URL and the actor run ID in your report.