Sitemap URL Finder
Pricing
from $0.05 / 1,000 results
Find and export URLs from any website’s robots.txt and sitemaps. Enter a domain or website URL, optionally filter matching URLs by text, and get clean dataset rows with the URL, domain, path, source sitemap, and match details.
Developer
Inus Grobler
Maintained by Community
Sitemap URL Finder extracts URLs from website sitemaps and robots.txt for SEO teams, data teams, QA teams, and crawler builders who need a clean URL inventory before a larger crawl.
At a glance: input examples, output examples, use cases, limitations, troubleshooting, and pricing/cost guidance are included below for small URL inventory checks and recurring sitemap monitoring.
Enter one or more domains or website URLs. The Actor checks robots.txt, discovers sitemap files, follows sitemap indexes, reads XML, plain-text, and gzip sitemaps, removes duplicate URLs, and saves ready-to-export rows in the dataset.
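The discovery steps above can be sketched with the Python standard library. This is an illustrative sketch, not the Actor's own source: the helper names (sitemaps_from_robots, parse_sitemap, dedupe, local_name) are hypothetical, and detecting gzip by its magic bytes and plain-text sitemaps by the absence of a leading "<" are assumptions.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def dedupe(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named in 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def parse_sitemap(body: bytes) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) for one sitemap file."""
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes (assumed detection rule)
        body = gzip.decompress(body)
    body = body.lstrip()
    if not body.startswith(b"<"):
        # Plain-text sitemap: one URL per line.
        lines = body.decode("utf-8", errors="replace").splitlines()
        return dedupe(ln.strip() for ln in lines if ln.strip()), []
    root = ET.fromstring(body)
    # Only <loc> elements directly under <url> or <sitemap> entries count.
    locs = [
        child.text.strip()
        for entry in root
        for child in entry
        if local_name(child.tag) == "loc" and child.text
    ]
    if local_name(root.tag) == "sitemapindex":
        return [], dedupe(locs)
    return dedupe(locs), []
```

A caller would fetch robots.txt, pass its text to sitemaps_from_robots, then fetch each sitemap and feed the raw response bytes to parse_sitemap, following any child sitemaps it returns.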
Use Cases
- Build a URL inventory for SEO audits, migrations, QA checks, or crawl planning.
- Find product, category, blog, documentation, listing, or support URLs before scraping page details.
- Export sitemap URLs with the source sitemap attached for downstream workflows.
- Filter sitemap results to one section, such as /products/, /blog/, /docs/, or /store/.
- Prepare URL lists for content crawlers, monitoring, enrichment, RAG ingestion, or lead workflows.
What Data You Get
Each dataset row contains:
- url: URL found in a sitemap.
- domain: hostname of the found URL.
- path: path part of the URL.
- sourceUrl: sitemap or robots-discovered file where the URL was found.
- sourceDomain: hostname of the source sitemap.
- filterType: all, contains, or regex.
- filterValue: text or regular expression used for matching.
- matchedRegex: regular expression used, when provided through the API.
- lastmod: optional last modified value from the sitemap entry.
- changefreq: optional change frequency value from the sitemap entry.
- priority: optional priority value from the sitemap entry.
- matchedAt: UTC timestamp when the URL was saved.
The run output also links to a summary record with processed sitemap counts, discovered URL counts, saved result count, filter details, and failed request count.
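To show how the url, domain, path, sourceUrl, and sourceDomain fields relate to one another, here is a minimal sketch that derives a row from a found URL with urllib.parse. The build_row function is hypothetical and only approximates the documented row shape; the optional sitemap metadata fields are left out.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def build_row(url, source_url, filter_type="all", filter_value=""):
    """Assemble a dataset-style row for one URL found in a sitemap."""
    parts = urlsplit(url)
    # UTC timestamp with millisecond precision and a trailing 'Z'.
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "url": url,
        "domain": parts.hostname,
        "path": parts.path,
        "sourceUrl": source_url,
        "sourceDomain": urlsplit(source_url).hostname,
        "filterType": filter_type,
        "filterValue": filter_value,
        "matchedAt": now.replace("+00:00", "Z"),
    }


row = build_row(
    "https://docs.apify.com/platform/actors",
    "https://docs.apify.com/sitemap_base.xml",
    filter_type="contains",
    filter_value="/platform/",
)
print(row["domain"], row["path"])
```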
Input
Use websites for normal runs. You can enter domains, homepages, or site sections; the Actor normalizes each value to the website origin and discovers common sitemap locations automatically.
    {
      "websites": [{"url": "https://docs.apify.com"}],
      "includeUrlText": "/platform/",
      "maxResults": 25
    }
Main Settings
- websites: Website homepages or domains to scan. The Actor checks robots.txt and /sitemap.xml for each website.
- includeUrlText: Optional text that found URLs must contain. Leave it empty to save every sitemap URL.
- maxResults: Maximum number of URL rows to save.
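The normalization of each websites value to its origin can be pictured roughly as follows. This is a sketch under assumptions: the website_origin and discovery_urls helpers are hypothetical, bare domains are assumed to default to HTTPS, and only the two locations named above (robots.txt and /sitemap.xml) are generated.

```python
from urllib.parse import urlsplit


def website_origin(value: str) -> str:
    """Reduce a domain, homepage, or section URL to scheme://host."""
    value = value.strip()
    if "://" not in value:
        value = "https://" + value  # assumption: bare domains use HTTPS
    parts = urlsplit(value)
    return f"{parts.scheme}://{parts.netloc}".lower()


def discovery_urls(value: str) -> list[str]:
    """List the files to check first for one website input."""
    origin = website_origin(value)
    return [origin + "/robots.txt", origin + "/sitemap.xml"]


print(discovery_urls("docs.apify.com/platform/"))
```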
Optional API Settings
- maxRequestsPerCrawl: Safety cap for robots.txt and sitemap files fetched in one run.
- targetUrlRegex: API-only regular expression filter. It takes precedence over includeUrlText.
- websiteUrls and startUrls: Legacy/API aliases for existing integrations.
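The precedence of targetUrlRegex over includeUrlText, and the resulting filterType values (all, contains, regex), can be illustrated with a small sketch. The build_filter function is hypothetical; whether the Actor applies the regex as a search anywhere in the URL and whether the text match is case-sensitive are assumptions made here.

```python
import re


def build_filter(include_url_text: str = "", target_url_regex: str = ""):
    """Return (filter_type, filter_value, predicate) for the given settings."""
    if target_url_regex:
        # The regex wins whenever it is set, even if text is also given.
        pattern = re.compile(target_url_regex)
        return "regex", target_url_regex, lambda url: pattern.search(url) is not None
    if include_url_text:
        # Assumption: plain, case-sensitive substring match.
        return "contains", include_url_text, lambda url: include_url_text in url
    return "all", "", lambda url: True


filter_type, filter_value, matches = build_filter(
    include_url_text="/platform/",
    target_url_regex=r"/docs/\d+$",
)
print(filter_type, matches("https://example.com/docs/12"))
```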
Example Output
    {
      "url": "https://docs.apify.com/platform/actors",
      "domain": "docs.apify.com",
      "path": "/platform/actors",
      "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
      "sourceDomain": "docs.apify.com",
      "lastmod": "2026-06-10",
      "changefreq": "weekly",
      "priority": "0.8",
      "filterType": "contains",
      "filterValue": "/platform/",
      "matchedAt": "2026-06-11T13:55:51.865Z"
    }
How To Run
- Open the Actor in Apify Console.
- Add one or more websites in the Input tab.
- Optionally set URL contains if you only want one site section.
- Set Max results to control dataset size and cost.
- Start the run and open the Dataset tab when it finishes.
Results are pushed to the dataset while the Actor runs, so partial results can still be useful if a long run is stopped or times out.
Exporting Results
After a run, open the Dataset tab and export results as JSON, CSV, Excel, XML, RSS, or HTML. API users can read the default dataset from the run response.
Python API Example
    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")

    # Start the Actor and wait for the run to finish.
    run = client.actor("thescrapelab/sitemap-target-url-extractor").call(
        run_input={
            "websites": [{"url": "https://docs.apify.com"}],
            "includeUrlText": "/platform/",
            "maxResults": 25,
        }
    )
    if run is None:
        raise RuntimeError("Actor run failed")

    # Read the saved rows from the run's default dataset.
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    for item in items:
        print(item["url"], item["sourceUrl"])
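If you prefer to build a CSV locally from rows fetched through the API, a minimal sketch with the standard csv module could look like this. The rows_to_csv helper and the chosen column list are hypothetical, and the sample rows stand in for the items returned by the dataset call; optional fields such as lastmod are written as empty cells when a row lacks them.

```python
import csv
import io

# Hypothetical column selection; any subset of the row fields works.
FIELDS = ["url", "domain", "path", "sourceUrl", "lastmod"]


def rows_to_csv(items, fields=FIELDS) -> str:
    """Serialize dataset rows to CSV text, tolerating missing optional fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


sample_items = [
    {
        "url": "https://docs.apify.com/platform/actors",
        "domain": "docs.apify.com",
        "path": "/platform/actors",
        "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
        "lastmod": "2026-06-10",
        "filterType": "contains",
    },
    {
        "url": "https://docs.apify.com/platform/storage",
        "domain": "docs.apify.com",
        "path": "/platform/storage",
        "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
    },
]
csv_text = rows_to_csv(sample_items)
print(csv_text)
```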
Limits And Caveats
- The Actor reads sitemap files; it does not crawl every HTML page to discover links.
- Some websites do not publish complete or valid sitemaps.
- Password-protected, blocked, or private sitemaps may return no results.
- Very large sites can contain many sitemap indexes. Use maxResults and maxRequestsPerCrawl to keep runs predictable.
- Sitemap metadata is included only when the website provides it. Image, video, and alternate-language sitemap extensions are not currently included.
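To see why the two caps keep large runs predictable, consider this sketch of a breadth-first walk over nested sitemap indexes that stops at either limit. It is an illustration only, not the Actor's code: collect_urls is a hypothetical function, and fetch is a caller-supplied stand-in that returns (page_urls, child_sitemaps) or None when a file cannot be read.

```python
from collections import deque


def collect_urls(start_sitemaps, fetch, max_requests=100, max_results=1000):
    """Walk sitemaps breadth-first, bounded by request and result caps."""
    queue = deque(start_sitemaps)
    seen_sitemaps = set(start_sitemaps)
    seen_urls = set()
    results = []  # (url, source_sitemap) pairs
    requests_made = 0
    while queue and requests_made < max_requests and len(results) < max_results:
        sitemap = queue.popleft()
        requests_made += 1
        parsed = fetch(sitemap)
        if parsed is None:  # unreadable or missing file: skip it
            continue
        pages, children = parsed
        for child in children:
            if child not in seen_sitemaps:  # avoid loops between indexes
                seen_sitemaps.add(child)
                queue.append(child)
        for page in pages:
            if page not in seen_urls:  # drop duplicate URLs
                seen_urls.add(page)
                results.append((page, sitemap))
                if len(results) >= max_results:
                    break
    return results


# In-memory stand-in for real HTTP fetching and parsing.
fake_site = {
    "index.xml": ([], ["a.xml", "b.xml"]),
    "a.xml": (["u1", "u2"], []),
    "b.xml": (["u2", "u3"], ["index.xml"]),
}
print(collect_urls(["index.xml"], fake_site.get, max_results=2))
```

With both caps in place, the worst-case work is bounded by max_requests fetches regardless of how many indexes the site publishes.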
Troubleshooting
No results were found. The website may not publish sitemap URLs, or your URL contains filter may be too narrow. Try leaving the filter empty.
The run finished quickly with failed request counts. Some sitemap URLs returned permanent errors such as 404. The Actor skips those instead of wasting retries.
The output has fewer rows than expected. Check maxResults, maxRequestsPerCrawl, and any filter value. Also confirm the website sitemap actually lists the URLs you expect.
The run is slow on a large website. Keep the default 256 MB memory for most runs, raise maxResults gradually, and use maxRequestsPerCrawl to keep very large sitemap indexes predictable.
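The skip-instead-of-retry behaviour for permanent errors mentioned above can be sketched as a small decision function. The exact set of status codes the Actor treats as permanent is not documented here, so PERMANENT_STATUSES, next_action, and the retry limit are all assumptions for illustration.

```python
# Assumed set of "permanent" HTTP statuses; only 404 is named in the text.
PERMANENT_STATUSES = {400, 401, 403, 404, 410}


def next_action(status: int, attempt: int, max_attempts: int = 3) -> str:
    """Decide what to do with a sitemap response: parse, retry, or skip."""
    if 200 <= status < 300:
        return "parse"
    if status in PERMANENT_STATUSES:
        return "skip"  # retrying will not change the outcome
    if attempt < max_attempts:
        return "retry"  # e.g. 429 or 5xx may succeed later
    return "skip"


for status in (200, 404, 503):
    print(status, next_action(status, attempt=1))
```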
Pricing
The recommended pricing model is pay per result with a very small Actor start event. This keeps small tests inexpensive and makes larger runs scale with the number of useful URLs returned. Platform usage is low because the Actor uses lightweight HTTP requests instead of a browser and defaults to the 256 MB memory tier.
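For rough budgeting, the pay-per-result model can be turned into simple arithmetic. The sketch below uses the listed starting price of $0.05 per 1,000 results; the estimate_cost function is hypothetical, the Actor start fee is left as a parameter because its amount is not stated here, and platform usage charges are not included.

```python
def estimate_cost(results: int, price_per_1000: float = 0.05, start_fee: float = 0.0) -> float:
    """Estimate run cost in USD from the number of saved result rows."""
    return start_fee + results / 1000 * price_per_1000


for n in (25, 10_000, 100_000):
    print(f"{n} results: about ${estimate_cost(n):.4f}")
```

Setting maxResults to the largest row count you are willing to pay for gives an upper bound on the result-based part of the cost.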
FAQ
Can this extract all URLs from a sitemap?
Yes. Leave URL contains empty and set Max results high enough for the website.
Can it find sitemap URLs from robots.txt?
Yes. The Actor checks robots.txt, follows sitemap directives, and also tries /sitemap.xml.
Can it parse sitemap indexes?
Yes. It follows nested sitemap indexes until the request or result limits are reached.
Does it support gzip sitemaps?
Yes. Gzip-compressed sitemap responses are decompressed before parsing.
Can I filter only product or blog URLs?
Yes. Use URL contains with a path fragment such as /products/, /blog/, /category/, or /docs/.
Is this a full website crawler?
No. It extracts URLs listed in sitemaps. Use a full web crawler if you need to discover links from page HTML.