Sitemap to URL Crawler: RAG & AI Data Feeder
Pricing
from $0.50 / 1,000 results
Instantly extract all public URLs from any website's sitemap.xml recursively. Handles nested sitemap indexes automatically. The fastest & cheapest way to build URL lists for RAG pipelines, LLM training, and SEO audits. Zero-config & blazing fast.
Developer: Logiover Data
🗺️ Sitemap to URL Generator — Extract All URLs from sitemap.xml (RAG, SEO, AI Agents)
Instantly extract all public URLs from any website sitemap.
The perfect first step for RAG pipelines, LLM training datasets, AI browsing agents, and SEO audits.
Before you can scrape content, you need a complete list of pages. This Actor recursively crawls sitemap.xml (including nested Sitemap Index files) and produces a clean, structured URL dataset with metadata like lastmod, changefreq, and priority.
✅ What this Actor does (in one sentence)
Given a domain or sitemap URL, it discovers and crawls all sitemaps, extracts every public page URL, and outputs a deduplicated URL list with SEO metadata.
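The recursive step can be illustrated with a minimal sketch. This is not the Actor's source code: the function name `crawl_sitemap` and the injected `fetch` callable are hypothetical, and the only thing taken from the sitemap protocol itself is the XML namespace and the `sitemapindex` / `urlset` element names.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def crawl_sitemap(
    sitemap_url: str,
    fetch: Callable[[str], bytes],  # hypothetical: returns the raw XML for a URL
    visited: Optional[Set[str]] = None,
    emitted: Optional[Set[str]] = None,
) -> Iterator[Dict[str, object]]:
    """Yield one record per <url> entry, following nested sitemap indexes."""
    visited = set() if visited is None else visited
    emitted = set() if emitted is None else emitted
    if sitemap_url in visited:  # guard against indexes that reference each other
        return
    visited.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps: recurse into each of them.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text and loc.text.strip():
                yield from crawl_sitemap(loc.text.strip(), fetch, visited, emitted)
        return

    # Otherwise treat the document as a <urlset> of page entries.
    for node in root.findall("sm:url", NS):
        url = (node.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        if not url or url in emitted:  # skip empty entries and duplicates
            continue
        emitted.add(url)
        priority = node.findtext("sm:priority", namespaces=NS)
        yield {
            "url": url,
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
            "changefreq": node.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority else None,
        }
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library; in practice it could wrap a plain HTTP GET, optionally routed through a proxy.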
🚀 Why use this Actor?
| Use Case | How it helps |
|---|---|
| RAG Pipelines | Generate a complete URL list to feed into “Website to Markdown” or your vector DB pipeline. |
| AI Agents | Give your agent a full map of a domain so it can browse and retrieve relevant pages. |
| SEO Audits | Extract sitemap metadata (last modified date, change frequency, priority) at scale. |
| Competitor Analysis | See every blog post, product page, category page, or landing page a competitor publishes. |
| Content Monitoring | Detect new pages by comparing lastmod across runs. |
| Site Migration / QA | Validate sitemap coverage and identify missing sections quickly. |
✨ Features
- Recursive Sitemap Crawling
  Automatically handles Sitemap Index files and nested sitemap references.
- Smart Auto-Discovery
  Paste a domain like https://example.com and it will try https://example.com/sitemap.xml.
- SEO Metadata Extraction
  Extracts: lastmod, changefreq, priority (when available).
- Blazing Fast & Cost Efficient
  Uses raw HTTP requests (no browser overhead).
- Proxy Ready
  Supports Apify Proxy to reduce IP blocks and request throttling.
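As a rough illustration of the auto-discovery feature, the sketch below maps a start URL to a sitemap entry point. The helper name `sitemap_entry_point` is hypothetical, and the rule it applies (keep URLs that already end in `.xml`, otherwise use `/sitemap.xml` on the same host) is an assumption based on the description above, not the Actor's confirmed logic.

```python
from urllib.parse import urlsplit, urlunsplit


def sitemap_entry_point(start_url: str) -> str:
    """Return the sitemap URL to try first for a given start URL.

    Assumed rule: a URL whose path already ends in .xml is used as-is;
    anything else falls back to /sitemap.xml on the same scheme and host.
    """
    cleaned = start_url.strip()
    parts = urlsplit(cleaned)
    if parts.path.lower().endswith(".xml"):
        return cleaned
    return urlunsplit((parts.scheme, parts.netloc, "/sitemap.xml", "", ""))


print(sitemap_entry_point("https://example.com"))
print(sitemap_entry_point("https://example.com/sitemap_index.xml"))
```

The sketch expects a full URL including the scheme; a bare host name such as `example.com` would need to be prefixed with `https://` first.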
🧠 Common sitemap formats supported
This Actor is designed to handle typical sitemap patterns used by:
- WordPress / WooCommerce
- Shopify stores
- Headless CMS sites
- News publishers
- Large e-commerce catalogs with sitemap indexes
- Multi-sitemap architectures (sitemap-posts.xml, sitemap-products.xml, etc.)
🛠 How to Use
- Add one or more Start URLs:
  - a domain: https://www.nytimes.com
  - or a direct sitemap: https://example.com/sitemap.xml
- Set Max URLs (optional) to cap output size on huge sites.
- (Recommended) Enable Proxy Configuration to avoid throttling.
- Run the Actor and export results as JSON/CSV/Excel.
⚙️ Input Configuration
This Actor uses the following input schema in Apify:
startUrls (required)
A list of domains or sitemap links:
- https://example.com
- https://example.com/sitemap.xml
- https://example.com/sitemap_index.xml
maxUrls (optional)
Limit the number of extracted URLs (default: 10000).
proxyConfiguration (optional, recommended)
Use Apify Proxy to reduce blocks and increase reliability.
Example Input (JSON)
{
  "startUrls": [{ "url": "https://example.com" }],
  "maxUrls": 10000,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
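For programmatic runs, the same input can be built and submitted from Python. The sketch below assumes the official `apify-client` package is installed and that you supply your own API token and the Actor's ID; the helper names `build_run_input` and `run_and_collect` are hypothetical, and the Actor ID is left as a parameter because it is not stated in this README.

```python
from typing import Dict, Iterable, List


def build_run_input(
    start_urls: Iterable[str],
    max_urls: int = 10000,
    use_proxy: bool = True,
) -> Dict[str, object]:
    """Build an input object with the same shape as the example input."""
    urls = [{"url": u} for u in start_urls]
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    return {
        "startUrls": urls,
        "maxUrls": max_urls,
        "proxyConfiguration": {"useApifyProxy": use_proxy},
    }


def run_and_collect(token: str, actor_id: str, run_input: Dict[str, object]) -> List[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then look like `run_and_collect(token, actor_id, build_run_input(["https://example.com"]))`, with `token` and `actor_id` taken from your own Apify account and the Actor's store page.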
📦 Output (Dataset)
The Actor produces a dataset called “Sitemap URLs” with a clean schema:
url — Page URL extracted from sitemap
lastmod — Last modified date (if provided)
changefreq — Change frequency (if provided)
priority — Sitemap priority score (if provided)
Output Example (JSON)
{
  "url": "https://example.com/blog/ai-trends-2026",
  "lastmod": "2026-01-10",
  "changefreq": "weekly",
  "priority": 0.8
}
📊 Dataset View (URL List)
This Actor includes a dataset table view in Apify with:
URL (clickable link)
Last Modified (date)
Change Freq (text)
Priority (number)
This is ideal for quick validation before exporting or passing results into the next pipeline step.
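The same quick validation can be scripted on exported records. The sketch below is one possible check, not part of the Actor: `validate_record` is a hypothetical helper, the allowed `changefreq` values and the 0.0 to 1.0 priority range come from the sitemaps.org protocol, and treating empty strings as "not provided" is an assumption about how missing fields are exported.

```python
from typing import List
from urllib.parse import urlsplit

# Values permitted for <changefreq> by the sitemaps.org protocol.
ALLOWED_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def validate_record(record: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []

    parts = urlsplit(str(record.get("url") or ""))
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("url is missing or is not an absolute http(s) URL")

    changefreq = record.get("changefreq")
    if changefreq and str(changefreq).lower() not in ALLOWED_CHANGEFREQ:
        problems.append(f"unexpected changefreq: {changefreq!r}")

    priority = record.get("priority")
    if priority not in (None, ""):
        try:
            value = float(priority)
        except (TypeError, ValueError):
            problems.append(f"priority is not a number: {priority!r}")
        else:
            if not 0.0 <= value <= 1.0:
                problems.append(f"priority outside 0.0-1.0: {value}")

    return problems
```

Records that only lack the optional fields pass cleanly, since lastmod, changefreq, and priority are not required by the sitemap protocol.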
🔗 Recommended pipeline (SEO + AI data factory)
This Actor is the discovery layer for your AI content ingestion pipeline:
Sitemap to URL Generator → get all pages
Website to Markdown (RAG Ready) → convert pages to clean Markdown
Vector DB ingestion (Pinecone / Qdrant / Weaviate / OpenAI Vector Store)
AI Agent / RAG app → retrieval + reasoning
This modular workflow is more reliable and cheaper than crawling blindly.
🔥 Pro Tips (maximize coverage + minimize cost)
- Start with the domain, not a guessed sitemap path
  If you only know the domain, input https://example.com. The Actor will try https://example.com/sitemap.xml as the entry point.
- Use maxUrls to control cost
  Large e-commerce sites can have hundreds of thousands of URLs. Start with:
  - maxUrls = 5000 for a first scan
  - increase gradually once your pipeline is stable
- Use metadata for incremental updates
  Many sites update lastmod. You can:
  - run daily
  - filter pages updated after your last run
  - scrape only the new/changed pages downstream
- Segment by sitemap source (optional workflow)
  If your output includes a sourceSitemap field (currently a planned addition, see Roadmap), you can track which sitemap produced each URL, enabling category-specific scrapes (posts vs products vs pages).
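The incremental-update tip can be sketched as a filter over the output records. The helper names `parse_lastmod` and `changed_since` are hypothetical. The sketch assumes lastmod values are either a plain date or a full ISO 8601 timestamp, treats values without a timezone as UTC, and by default keeps records with no usable lastmod so that nothing is silently dropped.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


def parse_lastmod(value: Optional[str]) -> Optional[datetime]:
    """Parse a lastmod string into an aware UTC datetime, or None if unusable."""
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):  # older Python versions do not accept a trailing "Z"
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:  # assumption: values without a timezone are UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def changed_since(
    records: Iterable[dict],
    last_run: datetime,  # must be timezone-aware
    include_undated: bool = True,
) -> Iterator[dict]:
    """Yield records modified after last_run, optionally keeping undated ones."""
    for record in records:
        modified = parse_lastmod(record.get("lastmod"))
        if modified is None:
            if include_undated:
                yield record
        elif modified > last_run:
            yield record
```

Keeping undated records is the conservative choice for a first pipeline; set `include_undated=False` once you trust that the target site maintains lastmod reliably.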
🧯 Troubleshooting
I get blocked / throttled
Enable proxyConfiguration.useApifyProxy = true
Reduce total extracted URLs per run with maxUrls
Run in batches and merge results
The sitemap is missing URLs I can see on the site
Not all sites list every page in sitemaps. This is common. For full discovery, use a hybrid strategy:
sitemap URLs (high precision)
plus internal link crawl (high recall) via a crawler actor
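Merging the two sources of the hybrid strategy could look like the sketch below. The function names are hypothetical, and the normalization rules (lowercase host, drop the fragment, strip a trailing slash) are assumptions that suit many sites but not all; adjust them if the target site treats /page and /page/ as different resources.

```python
from typing import Dict, Iterable
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_url_sources(
    sitemap_urls: Iterable[str],
    crawled_urls: Iterable[str],
) -> Dict[str, str]:
    """Return {normalized_url: source}, preferring the sitemap when both list a URL."""
    merged: Dict[str, str] = {}
    for source, urls in (("sitemap", sitemap_urls), ("crawl", crawled_urls)):
        for url in urls:
            merged.setdefault(normalize_url(url), source)
    return merged
```

URLs tagged "crawl" in the result are the pages the sitemap missed, which is also a quick way to measure sitemap coverage.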
Some fields are empty (lastmod, priority, changefreq)
These fields are optional in the sitemap spec. Many sites omit them.
💰 Pricing & Efficiency
This Actor is optimized for speed and low compute usage:
No browser automation
Minimal overhead
High throughput on large sitemap indexes
🎯 SEO Keywords (what this Actor is built for)
If you are searching for:
sitemap scraper
sitemap.xml URL extractor
extract URLs from sitemap
sitemap index crawler
SEO audit sitemap
RAG pipeline URL discovery
LLM training URL list
competitor sitemap analysis
website URL generator
This Actor is designed to be the fastest, cleanest solution on Apify for that job.
🗺 Roadmap
Planned improvements:
robots.txt sitemap discovery support
optional sourceSitemap field per URL
built-in dedup + canonicalization controls
delta mode (only new/updated URLs since last run)
Support & Feedback
If you have a feature request or run into an issue, open a ticket on the Actor page. Include:
your start URL
a sample sitemap URL
the approximate size (URL count) you expect