Sitemap Scraper
Pricing
from $5.99 / 1,000 results
Sitemap Scraper extracts URLs, page metadata, update dates, images, and structured sitemap data from XML sitemaps. Ideal for SEO audits, website analysis, content discovery, indexing validation, competitor research, and large-scale web data collection.
Rating
0.0
(0)
Developer
ScrapeVanta
Maintained by Community
Actor stats: 0 bookmarks, 2 total users, 1 monthly active user; last modified 2 days ago
Sitemap Scraper ⚡
If you need to extract URLs from a website sitemap, manually downloading and parsing XML files means tedious copy-paste work. Sitemap Scraper automatically crawls sitemap URLs (including sitemap indexes) and saves every discovered URL to an Apify dataset. It's a practical sitemap parser for SEO audits, link extraction, and automated sitemap crawling, built for SEO specialists, data analysts, and researchers who need URL lists quickly and at scale. In one run, you can process multiple sitemap URLs and get structured results without writing any code.
See the Data: Sample Output
Here's a real record from a single run:
{"url": "https://blog.apify.com/sitemap-articles.xml","lastMod": "2026-05-28"}
The actor writes extracted URL records to the dataset as objects like the one above.
| Field | Type | What It Tells You |
|---|---|---|
| url | string | The discovered URL extracted from the sitemap or sitemap index. |
| lastMod | string or null | The lastmod date (first 10 characters) when present; useful for freshness checks in your SEO sitemap scraper workflow. |
| status | string or null | Signals whether the run produced data successfully for the parsed sitemap content (when available in your dataset export). |
| error_message | string or null | Captures any parsing or fetching issues you may see reflected in dataset-related reporting (when available in your dataset export). |
| baseUrl | string or null | Helps you trace which input sitemap URL the extracted items came from (when included in your dataset export view). |
| source | string or null | Indicates the extraction context in your export tooling (when included in your dataset export view). |
| timestamp | string or null | When present in your export, shows when the record was produced. |
| runId | string or null | Useful for tracking records back to a specific actor run (when present in your export tooling). |
| taskId | string or null | Helps correlate records with a particular processing unit (when present in your export tooling). |
| success | boolean or null | Indicates whether the extraction for a sitemap succeeded (when present in your export tooling). |
| warning | string or null | Any warning text captured by your export layer (when present). |
| raw | object or null | If your export includes raw XML-derived fields, this may contain them (when present). |
Export your full dataset as JSON, CSV, or Excel from the Apify dashboard.
Setting It Up
Drop this into your input.json and you're ready to go:
{"startUrls": [{ "url": "https://blog.apify.com/sitemap.xml" }]}
| Parameter | Required | What It Does |
|---|---|---|
| startUrls | ✅ | A list of sitemap URLs to crawl (it can include a sitemap index, which the actor will handle recursively). |
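If your sitemap URLs live in a spreadsheet or a text file, the input can be generated rather than typed by hand. The following is a minimal sketch; `build_input` is a hypothetical helper name, and the only thing taken from the actor's documented input is the `startUrls` list of `{"url": ...}` objects.

```python
import json

def build_input(sitemap_urls):
    """Wrap plain URL strings in the {"url": ...} objects that startUrls expects."""
    cleaned = [u.strip() for u in sitemap_urls if u and u.strip()]
    if not cleaned:
        raise ValueError("startUrls needs at least one sitemap URL")
    return {"startUrls": [{"url": u} for u in cleaned]}

# Example: one sitemap URL, serialized the same way as input.json
actor_input = build_input(["https://blog.apify.com/sitemap.xml"])
print(json.dumps(actor_input, indent=2))
```

Blank entries are dropped and an empty list is rejected early, so a malformed source file fails before a run is started.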
What It Does
Sitemap Scraper downloads sitemap XML content from your provided sitemap URLs, parses it, and saves extracted URLs into a dataset.
Extract at Scale with Sitemap Scraper
Provide one or more sitemap URLs in startUrls, and the actor processes each one to extract all url entries. If a sitemap is actually a sitemap index, it automatically fetches and parses the nested sitemaps as well—so you get a complete URL list instead of partial results.
Works for Sitemap Indexes (Not Just Simple Sitemaps)
Many websites publish sitemap indexes that point to multiple sub-sitemaps. Sitemap Scraper is designed to recognize that structure and continue crawling through sub-sitemaps until it reaches regular urlset content.
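To make the index-following behavior concrete, here is a rough sketch of recursive sitemap traversal using Python's standard `xml.etree.ElementTree`. It is not the actor's source code: `collect_urls`, `local_name`, `child_text`, and the injected `fetch` callable are all hypothetical names, and the in-memory `PAGES` dictionary stands in for real HTTP responses.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the XML namespace: '{ns}urlset' becomes 'urlset'."""
    return tag.rsplit("}", 1)[-1]

def child_text(element, name):
    """Return the stripped text of the first child called `name`, or None."""
    for child in element:
        if local_name(child.tag) == name and child.text:
            return child.text.strip()
    return None

def collect_urls(sitemap_url, fetch, seen=None):
    """Yield (loc, lastmod) pairs, following sitemap indexes recursively.

    `fetch` is a hypothetical callable that returns the XML text for a URL.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    kind = local_name(root.tag)
    if kind == "sitemapindex":
        for entry in root:
            loc = child_text(entry, "loc")
            if local_name(entry.tag) == "sitemap" and loc:
                yield from collect_urls(loc, fetch, seen)
    elif kind == "urlset":
        for entry in root:
            loc = child_text(entry, "loc")
            if local_name(entry.tag) == "url" and loc:
                yield loc, child_text(entry, "lastmod")
    else:
        print(f"warning: unrecognized root tag <{kind}> in {sitemap_url}")

NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
PAGES = {  # made-up sitemap documents used instead of network calls
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {NS}><sitemap>"
        "<loc>https://example.com/posts.xml</loc>"
        "</sitemap></sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        f"<urlset {NS}>"
        "<url><loc>https://example.com/a</loc>"
        "<lastmod>2026-05-28T10:00:00+00:00</lastmod></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    ),
}

print(list(collect_urls("https://example.com/sitemap.xml", PAGES.__getitem__)))
```

The `seen` set is an added safety measure in this sketch so that two indexes pointing at each other cannot cause an endless loop; whether the actor does the same is not stated.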
Clean URL Output for SEO Audits
The dataset records include url and lastMod. The lastMod value is taken from the sitemap's lastmod element and truncated to its first 10 characters when present, which keeps the YYYY-MM-DD date and drops any time component. That makes the output well suited to SEO audits, freshness checks, and link inventories.
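The truncation described above is small enough to show directly. This is an illustrative sketch, with `to_record` as a hypothetical helper name; only the `url` and `lastMod` field names and the 10-character rule come from the description.

```python
def to_record(loc, lastmod):
    """Build a {url, lastMod} record, keeping the first 10 characters of lastmod."""
    return {"url": loc, "lastMod": lastmod[:10] if lastmod else None}

print(to_record("https://example.com/a", "2026-05-28T10:00:00+00:00"))
# {'url': 'https://example.com/a', 'lastMod': '2026-05-28'}
print(to_record("https://example.com/b", None))
# {'url': 'https://example.com/b', 'lastMod': None}
```

A lastmod that is already a plain date passes through unchanged, and a missing one becomes null in the exported record.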
Resilient Fetching with Retries
When fetching sitemap XML, the actor uses a maximum of 3 retries and includes error handling for common HTTP failure modes. It’s built to be dependable across real-world public web data.
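For readers who want to reproduce similar behavior in their own scripts, a retry loop along these lines is one option. This is a sketch under assumptions, not the actor's code: the limit of 3 retries comes from the text, while the exponential delays, the choice to skip retries on most 4xx responses, and the names `fetch_with_retries`, `opener`, and `sleep` are illustrative.

```python
import time
import urllib.error
import urllib.request

MAX_RETRIES = 3  # matches the retry limit described above

def fetch_with_retries(url, opener=urllib.request.urlopen,
                       sleep=time.sleep, base_delay=1.0):
    """Fetch a URL, retrying up to MAX_RETRIES times after the first failure."""
    last_error = None
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus the retries
        try:
            with opener(url, timeout=30) as response:
                return response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as exc:
            # Assumption: 4xx responses other than 429 will not improve on retry
            if 400 <= exc.code < 500 and exc.code != 429:
                raise
            last_error = exc
        except OSError as exc:  # covers URLError, timeouts, connection resets
            last_error = exc
        if attempt < MAX_RETRIES:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, then 4s
    raise RuntimeError(
        f"giving up on {url} after {MAX_RETRIES} retries"
    ) from last_error
```

The `opener` and `sleep` parameters exist so the function can be exercised with fakes; in normal use you would call `fetch_with_retries("https://example.com/sitemap.xml")` and let the defaults apply.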
URL Extraction from Public Web Data
Sitemap Scraper focuses on publicly available sitemap XML content. It’s a straightforward sitemap scraper tool when you need automated sitemap crawling without manual downloads and parsing.
Overall, Sitemap Scraper turns sitemap URL extraction into a one-click dataset you can export and analyze immediately.
Why Sitemap Scraper?
There are plenty of ways to pull data from sitemap XML files—here’s why Sitemap Scraper stands out.
Handles Sitemap Indexes Automatically
Instead of stopping at the first sitemap file, this tool continues into nested sitemap indexes. That means fewer gaps in your sitemap URL extraction results when you’re building a blog sitemap scraper or doing SEO sitemap auditing.
Ready-to-Analyze Output
The actor saves structured records to a dataset, including url and lastMod. This makes it easy to plug extracted URLs into downstream workflows, whether you're comparing sitemap tools or assembling an SEO audit.
Built for Bulk, Not Busywork
You can supply multiple sitemap URLs in startUrls, then let the actor do the heavy lifting. For teams extracting URLs from many sitemaps at once, this removes hours of manual parsing and keeps the process repeatable.
Real-World Use Cases
Here's how different teams put Sitemap Scraper to work:
SEO Teams
When an SEO audit needs a complete inventory of pages, you can run Sitemap Scraper with your site’s main sitemap URL(s) and get a clean dataset of discovered URLs. The lastMod field helps you spot freshness patterns quickly for sitemap-based crawling and auditing.
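As one example of a freshness check, the sketch below lists URLs whose lastMod is older than a chosen age. The function name `stale_urls` and the 180-day threshold are assumptions for illustration; the record shape matches the sample output (url plus an optional lastMod date string).

```python
from datetime import date, timedelta

def stale_urls(records, today, max_age_days=180):
    """Return URLs whose lastMod date is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for record in records:
        last_mod = record.get("lastMod")
        if not last_mod:
            continue  # the sitemap reported no date for this URL
        try:
            modified = date.fromisoformat(last_mod)
        except ValueError:
            continue  # skip values that are not a full YYYY-MM-DD date
        if modified < cutoff:
            stale.append(record["url"])
    return stale

records = [
    {"url": "https://example.com/old", "lastMod": "2025-01-01"},
    {"url": "https://example.com/new", "lastMod": "2026-05-28"},
    {"url": "https://example.com/undated", "lastMod": None},
]
print(stale_urls(records, today=date(2026, 6, 1)))
# ['https://example.com/old']
```

Passing `today` explicitly keeps the result reproducible; in a scheduled job you would typically pass `date.today()`.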
Content & Publishing Ops
When you manage a blog or publication, you often want visibility into which sections are present and how often they update. Use Sitemap Scraper to extract URLs from the full sitemap structure (including indexes) and keep your internal content lists aligned.
Data Analysts
If you’re correlating URLs with performance metrics, you need a reliable baseline list of links. Sitemap Scraper gives you an export-friendly dataset that you can join with analytics data—no custom sitemap parser needed.
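A join of this kind could look like the sketch below. The `metrics_by_url` mapping and its `pageviews` field are made-up stand-ins for whatever your analytics export contains; `join_with_metrics` is a hypothetical helper name.

```python
def join_with_metrics(records, metrics_by_url):
    """Left-join sitemap records with a {url: metrics} mapping.

    Every sitemap URL is kept; URLs missing from analytics get pageviews=None.
    """
    joined = []
    for record in records:
        metrics = metrics_by_url.get(record["url"])
        joined.append({
            **record,
            "pageviews": metrics["pageviews"] if metrics else None,
            "in_analytics": metrics is not None,
        })
    return joined

sitemap_records = [
    {"url": "https://example.com/a", "lastMod": "2026-05-28"},
    {"url": "https://example.com/b", "lastMod": None},
]
analytics = {"https://example.com/a": {"pageviews": 1200}}  # hypothetical data

for row in join_with_metrics(sitemap_records, analytics):
    print(row)
```

Rows where `in_analytics` is false are often the interesting ones: pages listed in the sitemap that received no recorded traffic.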
Automation & Developer Workflows
When you want to schedule automated sitemap crawling, you can integrate the actor into your pipeline and treat the output as a consistent source of truth. The dataset output works well as input to ETL jobs, monitoring scripts, and regular SEO refresh cycles.
How to Run It
No code required. Here's how to get your first results in under 5 minutes:
- Open the actor on Apify: go to the actor page on console.apify.com.
- Enter your inputs: add your sitemap URLs under startUrls (each item should contain a url).
- Configure proxy settings (optional): if your setup requires it, enable the provided proxy configuration options for better reliability.
- Start the run and watch the live log: track sitemap fetching progress as it processes each start URL.
- Open the Dataset tab: extracted url (and lastMod when available) records appear as they're pushed.
- Export in your preferred format: download from the Apify dataset tab as JSON, CSV, or Excel.
The whole setup takes under 5 minutes — results start appearing within seconds of launch.
Export & Integration Options
Once your data is collected, Sitemap Scraper fits directly into your existing workflow.
You can export results from the Apify dataset tab as JSON, CSV, or Excel for quick sharing and analysis. If you’re building a dashboard or running scripts, JSON is a convenient format for programmatic ingestion.
For integrations, you can use Apify’s API access to pull results into your systems, or connect to automation tools like Zapier / Make to push extracted URLs into your next step. You can also schedule runs so automated sitemap crawling happens regularly, without manual effort.
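For API access, one possible approach is Apify's synchronous run endpoint, which starts an actor and returns its dataset items in a single request. The sketch below uses only the Python standard library. The actor ID shown in the usage note is a placeholder (this page does not state the real one), and the helper names `build_run_request` and `run_actor` are hypothetical. The synchronous endpoint suits short runs; longer runs would usually be started asynchronously and polled.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_run_request(actor_id, token, sitemap_urls):
    """Prepare a POST that runs the actor and returns its dataset items."""
    endpoint = (
        f"{API_BASE}/acts/{urllib.parse.quote(actor_id, safe='~')}"
        "/run-sync-get-dataset-items"
    )
    payload = {"startUrls": [{"url": u} for u in sitemap_urls]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

def run_actor(actor_id, token, sitemap_urls):
    """Send the request and decode the JSON list of records."""
    request = build_run_request(actor_id, token, sitemap_urls)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)
```

A call might look like `run_actor("<username>~<actor-name>", "<YOUR_API_TOKEN>", ["https://blog.apify.com/sitemap.xml"])`, with both placeholders replaced by the values from your Apify account and the actor page.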
Pricing
Sitemap Scraper runs on Apify, which includes a free tier with no credit card needed to start. The free tier includes $5 in platform credits on sign-up, which is typically enough for several real test runs. After that, usage is pay-as-you-go: the actor listing quotes a price from $5.99 per 1,000 results, so you only spend when you execute the actor. For heavier workloads and ongoing monitoring, check Apify's plans and pricing on the pricing page.
Start free at apify.com — scale up when you need to.
Reliability & Limitations
| What We Handle | How |
|---|---|
| Retries for fetching sitemaps | Up to 3 retries with error handling and backoff logic |
| Redirects | Follow redirects enabled for sitemap fetching |
| Sitemap indexes | Recursively parses sitemap indexes until it reaches URL sets |
| Unknown or unexpected XML | Logs warnings when the root tag is not recognized |
| Partial failures | If a sitemap fails to fetch, processing continues for other provided start URLs |
Limitations: Sitemap Scraper works with publicly accessible sitemap XML content. It does not bypass authentication or process private, login-gated, or otherwise restricted resources. If a sitemap returns malformed XML or is inaccessible due to server-side restrictions, you may see missing outputs for those specific inputs.
For enterprise-scale needs or custom configurations, reach out and we'll help.
Frequently Asked Questions
Is there a free plan?
Yes, Apify offers a free tier so you can run Sitemap Scraper and test the output before scaling up.
Do I need to log in or create an account on Apify to use this?
Yes. Running Sitemap Scraper from the Apify interface requires an Apify account, and the free tier is enough to start. To trigger the actor via the Apify API, you authenticate with your account's API token.
How accurate is the extracted data?
The actor extracts URLs that are present in the sitemap XML you provide. It parses urlset entries and handles sitemap indexes recursively, so accuracy depends on the sitemap content published by the website owner.
How many results can I get per run?
The actor's input schema does not define a result limit. The number of records you get depends on how many URLs are contained in the sitemap(s) referenced by your startUrls.
How fresh is the data?
Freshness depends on when the website updates its sitemap and the time you run Sitemap Scraper. The optional lastMod field helps you understand the sitemap’s own reported update date.
Is this legal? Does it comply with GDPR / CCPA?
Sitemap Scraper focuses on publicly available data inside sitemap XML files that can be accessed without special credentials. You’re responsible for ensuring your use complies with GDPR, CCPA, and applicable laws.
Can I export to Google Sheets or Excel?
Yes. You can export from the Apify dataset tab as JSON, CSV, or Excel, then move the data into Google Sheets or Excel workflows.
Can I schedule this to run automatically?
Yes. You can schedule actor runs on Apify for automated sitemap crawling so your URL lists stay up to date.
Can I access results via the API?
Yes. You can trigger runs and retrieve results programmatically via the Apify API.
What happens when the actor encounters an error?
When sitemap fetching or parsing fails, the actor logs errors and warnings and continues processing other provided inputs. For specific sitemap URLs that can’t be fetched after retries, you may see fewer or no extracted records for those inputs.
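The continue-on-failure pattern can be sketched as follows. This illustrates the idea rather than the actor's implementation: `process_start_urls` and the injected `extract` callable are hypothetical names, and `extract` stands in for whatever fetches and parses a single sitemap.

```python
def process_start_urls(start_urls, extract):
    """Run `extract` for each start URL; a failure on one does not stop the rest."""
    records, failures = [], {}
    for item in start_urls:
        url = item["url"]
        try:
            records.extend(extract(url))
        except Exception as exc:  # broad on purpose: log the error and move on
            failures[url] = str(exc)
            print(f"error: {url}: {exc}")
    return records, failures

def fake_extract(url):
    """Stand-in extractor: one made-up sitemap fails, the other succeeds."""
    if "broken" in url:
        raise ValueError("malformed XML")
    return [{"url": url.replace("sitemap.xml", "page"), "lastMod": None}]

records, failures = process_start_urls(
    [{"url": "https://broken.example/sitemap.xml"},
     {"url": "https://ok.example/sitemap.xml"}],
    fake_extract,
)
print(len(records), failures)
```

Returning the failures alongside the records makes it easy to see which inputs produced no output, which mirrors what you would look for in the run log.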
Get Help & Use Responsibly
Got a question about Sitemap Scraper or a feature you'd like added? Reach out at dataforleads@gmail.com. We’re happy to help with setup questions and are open to ideas like adding richer dataset metadata or supporting additional export-friendly structures.
Sitemap Scraper works with publicly available data from sitemap XML files. It does not access private accounts, login-gated pages, or password-protected content. You’re responsible for compliance with GDPR, CCPA, and any relevant platform terms. For data-removal requests, contact dataforleads@gmail.com. Use responsibly, ethically, and only for lawful purposes.