Sitemap to URL Crawler: RAG & AI Data Feeder

Pricing

from $0.50 / 1,000 results

Instantly and recursively extract all public URLs from any website's sitemap.xml. Nested sitemap indexes are handled automatically. A fast, low-cost way to build URL lists for RAG pipelines, LLM training, and SEO audits. Zero-config and blazing fast.


Developer

Logiover Data

Maintained by Community

🗺️ Sitemap to URL Generator — Extract All URLs from sitemap.xml (RAG, SEO, AI Agents)

Instantly extract all public URLs from any website sitemap.
The perfect first step for RAG pipelines, LLM training datasets, AI browsing agents, and SEO audits.

Before you can scrape content, you need a complete list of pages. This Actor recursively crawls sitemap.xml (including nested Sitemap Index files) and produces a clean, structured URL dataset with metadata like lastmod, changefreq, and priority.


✅ What this Actor does (in one sentence)

Given a domain or sitemap URL, it discovers and crawls all sitemaps, extracts every public page URL, and outputs a deduplicated URL list with SEO metadata.


🚀 Why use this Actor?

| Use Case | How it helps |
| --- | --- |
| RAG Pipelines | Generate a complete URL list to feed into “Website to Markdown” or your vector DB pipeline. |
| AI Agents | Give your agent a full map of a domain so it can browse and retrieve relevant pages. |
| SEO Audits | Extract sitemap metadata (last modified date, change frequency, priority) at scale. |
| Competitor Analysis | See every blog post, product page, category page, or landing page a competitor publishes. |
| Content Monitoring | Detect new pages by comparing lastmod across runs. |
| Site Migration / QA | Validate sitemap coverage and identify missing sections quickly. |
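
For the content monitoring use case, comparing two runs can be sketched in a few lines of Python. This is a minimal illustration, not part of the Actor: it assumes you have exported two datasets as lists of dicts with the url and lastmod fields described in the Output section, and the function name diff_runs is hypothetical.

```python
from typing import List, Tuple


def diff_runs(previous: List[dict], current: List[dict]) -> Tuple[List[str], List[str]]:
    """Return (new_urls, changed_urls) between two exported dataset snapshots."""
    # Index the previous run by URL so lookups are O(1).
    prev_lastmod = {item["url"]: item.get("lastmod") for item in previous}

    new_urls: List[str] = []
    changed_urls: List[str] = []
    for item in current:
        url = item["url"]
        if url not in prev_lastmod:
            new_urls.append(url)  # URL did not exist in the previous run
        elif item.get("lastmod") != prev_lastmod[url]:
            changed_urls.append(url)  # same URL, different lastmod value
    return new_urls, changed_urls
```

Sites that omit lastmod will never show up as changed with this approach, so treat the second list as a lower bound.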

✨ Features

  • Recursive Sitemap Crawling
    Automatically handles Sitemap Index files and nested sitemap references.
  • Smart Auto-Discovery
    Paste a domain like https://example.com and it will try https://example.com/sitemap.xml.
  • SEO Metadata Extraction
    Extracts: lastmod, changefreq, priority (when available).
  • Blazing Fast & Cost Efficient
    Uses raw HTTP requests (no browser overhead).
  • Proxy Ready
    Supports Apify Proxy to reduce IP blocks and request throttling.
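
To make the recursive crawling idea concrete, here is a rough sketch of how a sitemap index can be walked with the Python standard library. It is an illustration of the general technique, not the Actor's actual source code. The fetch callable is an assumption (any function that returns the raw XML bytes for a URL), and the sketch only guards against visiting the same sitemap twice; it does not deduplicate page URLs.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
) -> Iterator[dict]:
    """Yield one record per <url> entry, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return  # avoid loops when sitemaps reference each other
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from iter_sitemap_urls(loc.text.strip(), fetch, seen)
        return

    # A regular <urlset>: emit each page with its optional metadata.
    for node in root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue
        priority = node.findtext("sm:priority", namespaces=NS)
        yield {
            "url": loc,
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
            "changefreq": node.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority else None,
        }
```

Passing fetch as a parameter keeps the parsing logic independent of the HTTP client, so the same function works with a plain HTTP library, a proxy-aware client, or an in-memory dictionary in tests.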

🧠 Common sitemap formats supported

This Actor is designed to handle typical sitemap patterns used by:

  • WordPress / WooCommerce
  • Shopify stores
  • Headless CMS sites
  • News publishers
  • Large e-commerce catalogs with sitemap indexes
  • Multi-sitemap architectures (sitemap-posts.xml, sitemap-products.xml, etc.)

🛠 How to Use

  1. Add one or more Start URLs:
    • a domain: https://www.nytimes.com
    • or a direct sitemap: https://example.com/sitemap.xml
  2. Set Max URLs (optional) to cap output size on huge sites.
  3. (Recommended) Enable Proxy Configuration to avoid throttling.
  4. Run the Actor and export results as JSON/CSV/Excel.

⚙️ Input Configuration

This Actor uses the following input schema in Apify:

startUrls (required)

A list of domains or sitemap links:

  • https://example.com
  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
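
The way a start URL maps to a sitemap entry point can be sketched as follows. This is an assumption about reasonable behavior, not the Actor's confirmed logic: the rule "treat anything ending in .xml or .xml.gz as a direct sitemap link" and the function name sitemap_entry_point are both hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def sitemap_entry_point(start_url: str) -> str:
    """Map a start URL to the sitemap URL that would be fetched first."""
    parts = urlsplit(start_url)
    # Assumed rule: a path ending in .xml or .xml.gz is already a sitemap link.
    if parts.path.lower().endswith((".xml", ".xml.gz")):
        return start_url
    # Otherwise treat the input as a domain and try the conventional location.
    return urlunsplit((parts.scheme, parts.netloc, "/sitemap.xml", "", ""))
```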

maxUrls (optional)

Limit the number of extracted URLs (default: 10000).

proxyConfiguration (optional)

Use Apify Proxy to reduce blocks and increase reliability.

Example Input (JSON)

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "maxUrls": 10000,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
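
If you prefer to start runs from code, the same input can be built and submitted with the official apify-client package for Python. The sketch below is a hedged example: build_run_input and run_actor are hypothetical helper names, and the Actor ID is deliberately left as a parameter because it is not stated in this README. Copy it from the Actor's page in Apify Console.

```python
from typing import Iterable, List


def build_run_input(start_urls: Iterable[str], max_urls: int = 10000, use_proxy: bool = True) -> dict:
    """Build the input object shown in the JSON example above."""
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxUrls": max_urls,
        "proxyConfiguration": {"useApifyProxy": use_proxy},
    }


def run_actor(token: str, actor_id: str, run_input: dict) -> List[dict]:
    """Start the Actor, wait for it to finish, and return all dataset items."""
    # Imported here so build_run_input works without the dependency installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like run_actor(token, actor_id, build_run_input(["https://example.com"], max_urls=5000)), where token is your Apify API token and actor_id is the ID of this Actor.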

📦 Output (Dataset)

The Actor produces a dataset called “Sitemap URLs” with a clean schema:

url — Page URL extracted from sitemap

lastmod — Last modified date (if provided)

changefreq — Change frequency (if provided)

priority — Sitemap priority score (if provided)

Output Example (JSON)

{
  "url": "https://example.com/blog/ai-trends-2026",
  "lastmod": "2026-01-10",
  "changefreq": "weekly",
  "priority": 0.8
}
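
Because lastmod, changefreq, and priority are only present when the sitemap provides them, downstream code should treat them as optional. One possible way to model a dataset item in Python is shown below; the SitemapUrl class is a hypothetical helper for your own pipeline, not something the Actor ships.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SitemapUrl:
    """One dataset item; only url is guaranteed to be present."""

    url: str
    lastmod: Optional[str] = None
    changefreq: Optional[str] = None
    priority: Optional[float] = None

    @classmethod
    def from_item(cls, item: dict) -> "SitemapUrl":
        raw_priority = item.get("priority")
        return cls(
            url=item["url"],
            # Normalize missing or empty values to None.
            lastmod=item.get("lastmod") or None,
            changefreq=item.get("changefreq") or None,
            priority=float(raw_priority) if raw_priority not in (None, "") else None,
        )
```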

📊 Dataset View (URL List)

This Actor includes a dataset table view in Apify with:

  • URL (clickable link)
  • Last Modified (date)
  • Change Freq (text)
  • Priority (number)

This is ideal for quick validation before exporting or passing results into the next pipeline step.

🔗 Recommended pipeline (SEO + AI data factory)

This Actor is the discovery layer for your AI content ingestion pipeline:

  1. Sitemap to URL Generator → get all pages
  2. Website to Markdown (RAG Ready) → convert pages to clean Markdown
  3. Vector DB ingestion (Pinecone / Qdrant / Weaviate / OpenAI Vector Store)
  4. AI Agent / RAG app → retrieval + reasoning

This modular workflow is more reliable and cheaper than crawling blindly.

🔥 Pro Tips (maximize coverage + minimize cost)

  1. Start with the domain, not a guessed sitemap path

If you only know the domain, input https://example.com. The Actor will attempt to auto-discover the sitemap entry point.

  2. Use maxUrls to control cost

Large e-commerce sites can have hundreds of thousands of URLs. Start with maxUrls = 5000 for a first scan, then increase it gradually once your pipeline is stable.

  3. Use metadata for incremental updates

Many sites keep lastmod up to date. You can:

  • run daily
  • filter pages updated after your last run
  • scrape only the new/changed pages downstream

  4. Segment by sitemap source (optional workflow)

If the output includes a sourceSitemap field (listed on the roadmap as an optional addition), you can track which sitemap produced each URL and run category-specific scrapes (posts vs. products vs. pages).
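
The incremental-update tip above can be sketched as a small filter over the exported dataset. This is a minimal example under stated assumptions: lastmod values are ISO-style dates or datetimes, date-only values are treated as midnight UTC, and values that cannot be parsed are treated as undated. The function names parse_lastmod and changed_since are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def parse_lastmod(value: Optional[str]) -> Optional[datetime]:
    """Parse a lastmod string into an aware UTC datetime, or None if unusable."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so rewrite it.
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def changed_since(items: Iterable[dict], last_run: datetime, keep_undated: bool = True) -> List[dict]:
    """Keep items modified after last_run (which must be timezone-aware)."""
    selected = []
    for item in items:
        modified = parse_lastmod(item.get("lastmod"))
        if modified is None:
            # No usable lastmod: we cannot tell, so optionally keep it to be safe.
            if keep_undated:
                selected.append(item)
        elif modified > last_run:
            selected.append(item)
    return selected
```

Keeping undated items by default trades some extra downstream scraping for not missing changes on sites that omit lastmod.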

🧯 Troubleshooting

I get blocked / throttled

  • Set proxyConfiguration.useApifyProxy to true
  • Reduce total extracted URLs per run with maxUrls
  • Run in batches and merge results
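
If you split a large site into several runs, the results can be merged afterwards with a short script. The sketch below assumes each batch is a list of dataset items keyed by url; merge_batches is a hypothetical helper, and the rule "prefer the record that has a lastmod" is one reasonable choice among several.

```python
from typing import Dict, Iterable, List


def merge_batches(*batches: Iterable[dict]) -> List[dict]:
    """Merge several runs into one list with a single record per URL."""
    merged: Dict[str, dict] = {}
    for batch in batches:
        for item in batch:
            url = item["url"]
            existing = merged.get(url)
            # Keep the first record seen, unless it lacks lastmod and the new one has it.
            if existing is None or (not existing.get("lastmod") and item.get("lastmod")):
                merged[url] = item
    return list(merged.values())
```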

The sitemap is missing URLs I can see on the site

Not all sites list every page in their sitemaps. This is common. For full discovery, use a hybrid strategy:

  • sitemap URLs (high precision)
  • plus an internal link crawl (high recall) via a crawler Actor
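
Combining the two sources is mostly a set operation once URLs are normalized. The following sketch assumes you already have both URL lists; the normalization rules (lowercase host, drop the fragment, ignore a trailing slash) and the function names are illustrative assumptions, and stricter or looser rules may suit your site better.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"  # treat /about and /about/ as the same page
    # Drop the fragment; keep the query string because it can change the content.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def combine_sources(sitemap_urls: Iterable[str], crawled_urls: Iterable[str]) -> Dict[str, List[str]]:
    """Return the union of both sources and the pages the sitemap does not list."""
    sitemap_set = {normalize(u) for u in sitemap_urls}
    crawled_set = {normalize(u) for u in crawled_urls}
    return {
        "all": sorted(sitemap_set | crawled_set),
        "missing_from_sitemap": sorted(crawled_set - sitemap_set),
    }
```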

Some fields are empty (lastmod, priority, changefreq)

These fields are optional in the sitemap spec. Many sites omit them.

💰 Pricing & Efficiency

This Actor is optimized for speed and low compute usage:

  • No browser automation
  • Minimal overhead
  • High throughput on large sitemap indexes

🎯 SEO Keywords (what this Actor is built for)

If you are searching for any of the following, this Actor is designed to be the fastest, cleanest solution on Apify for the job:

  • sitemap scraper
  • sitemap.xml URL extractor
  • extract URLs from sitemap
  • sitemap index crawler
  • SEO audit sitemap
  • RAG pipeline URL discovery
  • LLM training URL list
  • competitor sitemap analysis
  • website URL generator

🗺 Roadmap

Planned improvements:

  • robots.txt sitemap discovery support
  • optional sourceSitemap field per URL
  • built-in dedup + canonicalization controls
  • delta mode (only new/updated URLs since last run)
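
Until robots.txt discovery is available in the Actor, you can find sitemap links yourself and pass them as Start URLs. A minimal sketch using Python's standard urllib.robotparser module (its site_maps() method requires Python 3.8 or newer) follows. It assumes you have already downloaded the robots.txt content as text, and sitemaps_from_robots is a hypothetical helper name.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the sitemap URLs declared by 'Sitemap:' lines in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []
```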

Support & Feedback

If you have a feature request or run into an issue, open a ticket on the Actor page. Include:

  • your start URL
  • a sample sitemap URL
  • the approximate size (URL count) you expect