Sitemap Finder & URL Extractor · Crawl Any XML Sitemap

Pricing

from $0.99 / 1,000 results

Find and crawl XML sitemaps from any website. Follows sitemap indexes, handles gzip, and exports every page URL with source file and lastmod into a clean dataset. No config needed.


Developer

Corentin Robert

Maintained by Community

Actor stats

0 bookmarked · 2 total users · 1 monthly active user · last modified 2 days ago

Turn a list of websites into a full URL inventory from their XML sitemaps—no manual sitemap.xml links, no knobs to tune. The actor discovers sitemaps for you, walks sitemap indexes, handles gzip, and writes a tidy dataset: every page URL, the source sitemap file, and lastmod when the site publishes it.

Built for technical SEO audits, migrations, content inventories, and any workflow where you need “all public URLs the site exposes in sitemaps.”
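The gzip handling mentioned above can be sketched in a few lines (an illustrative sketch, not the actor's actual code; the function name and magic-byte check are assumptions):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def decode_sitemap_body(raw: bytes) -> str:
    """Return sitemap XML as text, transparently decompressing .xml.gz bodies."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")
```

The same code path then serves both plain `sitemap.xml` and compressed `sitemap.xml.gz` responses.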


Support & contact

Need help? If something’s unclear, a run fails in an unexpected way, or you’d like a small customization, you can email Corentin Robert (corentin@outreacher.fr).


What you get

  • One row per URL in the default dataset, ready to export (JSON, CSV, API).
  • Automatic discovery from each site’s origin: robots.txt Sitemap: directives plus common paths (/sitemap.xml, WordPress-style paths, etc.).
  • Nested sitemap indexes followed for you (up to 10 levels deep—fixed in code).
  • Live progress in RUN_LOG, plus a per-site summary at the end (URL count and sitemap fetch count per input origin).
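The discovery step above can be approximated like this (a hedged sketch; the actor's real candidate-path list and ordering are not public, so `CANDIDATE_PATHS` is an illustrative subset):

```python
from urllib.parse import urlsplit

# Illustrative subset only; the actor ships its own built-in list.
CANDIDATE_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/wp-sitemap.xml"]

def origin_of(start_url: str) -> str:
    """Reduce any page URL on a site to its origin (scheme + host)."""
    parts = urlsplit(start_url)
    return f"{parts.scheme}://{parts.netloc}"

def sitemap_candidates(start_url: str, robots_txt: str) -> list[str]:
    """Entry sitemap URLs: robots.txt 'Sitemap:' directives first, then common paths."""
    origin = origin_of(start_url)
    found = [line.split(":", 1)[1].strip()
             for line in robots_txt.splitlines()
             if line.lower().startswith("sitemap:")]
    return found + [origin + p for p in CANDIDATE_PATHS]
```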

How it works (short)

  1. You add one start URL per site (any page on that domain—the actor only uses the origin).
  2. It resolves entry sitemap URLs, then fetches each file (HTTP 200 only).
  3. It understands both urlset (URLs inline) and sitemapindex (pointers to more sitemaps), then flattens everything into page URLs.

Many sites use a single urlset; larger sites often ship a root sitemapindex that fans out to multiple urlset files—both patterns are supported.
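The two formats differ only in their root element, so telling them apart is cheap. A minimal parse using Python's stdlib (not the actor's implementation) might look like:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('urlset', page_urls) or ('sitemapindex', child_sitemap_urls)."""
    root = ET.fromstring(xml_text)
    kind = root.tag.removeprefix(NS)  # 'urlset' or 'sitemapindex'
    locs = [el.text.strip() for el in root.iter(f"{NS}loc") if el.text]
    return kind, locs
```

A `urlset` result yields page URLs directly; a `sitemapindex` result yields more sitemap URLs to fetch and parse in turn.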


Quick start

  1. In the Apify console, open Site URLs and paste one URL per site (bulk paste or file import works).
  2. Start the run. That’s it—timeouts, pacing, and headers are fixed defaults (see below).

Input

Field       What to enter
startUrls   One URL per website (homepage, blog article, anything on that domain). Same domain twice → one logical “site” in the summary.

Output (dataset)

Field              Meaning
url                Page URL from <loc>
sourceSitemapUrl   The urlset file this row came from
lastmod            From the sitemap if present, otherwise null
discoveredFrom     robots (from robots.txt), candidate (default paths), or nested (via an index)
fetchedAt          ISO timestamp when the row was written
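Put together, a typical dataset row (with illustrative values) looks like:

```json
{
  "url": "https://example.com/blog/hello-world",
  "sourceSitemapUrl": "https://example.com/post-sitemap.xml",
  "lastmod": "2024-11-02T08:15:00+00:00",
  "discoveredFrom": "robots",
  "fetchedAt": "2025-01-10T12:34:56.789Z"
}
```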

Run log

During the run, RUN_LOG in the key-value store stays compact—for example:

12,400 urls · 15 sitemaps

Failed fetches appear as ERR: lines (timeout, non-200, or body that isn’t XML)—without dumping raw responses.

When the run finishes, you’ll see Per input site (origin): with a line per distinct origin (URLs collected + sitemaps fetched). The same breakdown is emitted to Apify’s Log tab via log.info.


Built-in defaults (not shown in the UI)

Topic                            Value
HTTP timeout                     60 s per request
Pause between sitemap requests   100 ms
Extra sitemap paths              None beyond the built-in discovery list
robots.txt                       Read during discovery
User-Agent                       SitemapDetector/1.0 (+https://apify.com)

Limits & good citizenship

  • Index depth is capped at 10 nested sitemapindex levels (by design).
  • Endpoints that return non-XML (e.g. JSON “sitemap APIs”) are skipped—you’ll see not XML in the log.
  • Very large sitemaps mean longer runs and higher memory use on Apify; that’s expected.
  • Use this on sites you’re allowed to query; it only requests public sitemap URLs.
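The depth cap from the first bullet can be pictured as a simple guard in the index traversal (a sketch under assumed names; `fetch_and_parse` is a hypothetical callback, not part of the actor):

```python
MAX_INDEX_DEPTH = 10  # matches the fixed cap described above

def collect_urls(sitemap_url, fetch_and_parse, depth=0):
    """Recursively flatten a sitemapindex tree into page URLs, stopping at the cap."""
    if depth > MAX_INDEX_DEPTH:
        return []  # deeper nesting is ignored by design
    kind, locs = fetch_and_parse(sitemap_url)
    if kind == "urlset":
        return locs
    urls = []
    for child in locs:  # sitemapindex: each loc points at another sitemap
        urls += collect_urls(child, fetch_and_parse, depth + 1)
    return urls
```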

Example input

One site

{
  "startUrls": [{ "url": "https://example.com/" }]
}

Several sites in one run

{
  "startUrls": [
    { "url": "https://apify.com/" },
    { "url": "https://vercel.com/docs" }
  ]
}

Testing & deploy

Smoke tests (parser + live fetches, no full Actor container):

$ npm run test:smoke

Uses Apify, Doctolib, and Vercel sitemaps as fixtures (the script hits sitemap.xml URLs directly; the Actor itself only needs start URLs).

Run the Actor locally

$ apify run --input-file tests/manual-runs/smoke-apify-actor.json

More samples: tests/manual-runs/.

Publish — bump .actor/actor.json version to MAJOR.MINOR (e.g. 1.7), not npm-style 1.0.13, then:

$ apify push