Sitemap URL Finder
Pricing
from $0.05 / 1,000 results
Find and export URLs from any website’s robots.txt and sitemaps. Enter a domain or website URL, optionally filter matching URLs by text, and get clean dataset rows with the URL, domain, path, source sitemap, and match details.
Developer
Inus Grobler
Maintained by Community
Find URLs from websites without writing crawler code. The Actor starts from website homepages or domains, automatically checks robots.txt and /sitemap.xml, follows sitemap indexes, reads XML and plain text sitemaps, removes duplicate URLs, and saves clean URL inventory rows to the dataset.
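The discovery steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the Actor's actual implementation: the function names (`sitemaps_from_robots`, `parse_sitemap`, `collect_urls`) and the `fetch` callable are hypothetical, and the sketch skips gzip handling, retries, and the `/sitemap.xml` fallback.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def _local_name(tag):
    # Drop an XML namespace prefix such as '{http://www.sitemaps.org/...}loc'.
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(body):
    """Split a sitemap body into (child_sitemaps, page_urls).

    Handles <sitemapindex>, <urlset>, and plain text with one URL per line.
    """
    text = body.strip()
    if not text.startswith("<"):
        return [], [ln.strip() for ln in text.splitlines() if ln.strip()]
    root = ET.fromstring(text)
    locs = []
    for entry in root:  # <sitemap> or <url> entries
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text and child.text.strip():
                locs.append(child.text.strip())
    if _local_name(root.tag) == "sitemapindex":
        return locs, []
    return [], locs


def collect_urls(sitemap_urls, fetch, limit=25):
    """Walk sitemaps breadth-first and map each unique URL to its source sitemap.

    'fetch' is any callable that takes a sitemap URL and returns its body text.
    """
    queue, seen_maps, urls = list(sitemap_urls), set(), {}
    while queue and len(urls) < limit:
        current = queue.pop(0)
        if current in seen_maps:
            continue
        seen_maps.add(current)
        children, pages = parse_sitemap(fetch(current))
        queue.extend(children)
        for page in pages:
            if len(urls) >= limit:
                break
            urls.setdefault(page, current)  # first source wins, duplicates dropped
    return urls
```

Passing `fetch` as a parameter keeps the sketch free of network code, so it can be tried with an in-memory dictionary of sitemap bodies before wiring in a real HTTP client.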
Use it for:
- collecting product, category, blog, documentation, or listing URLs before a crawl,
- building URL inventories for SEO, QA, enrichment, or lead workflows,
- finding URLs that contain a simple section path such as /products/, /blog/, or /platform/,
- exporting sitemap URLs with the source sitemap attached.
Input
The normal input has two required choices, plus an optional URL text filter.
{
  "websites": [{"url": "https://docs.apify.com"}],
  "maxResults": 25
}
Fields
- websites: One or more website homepages or domains. The Actor discovers robots.txt and /sitemap.xml automatically.
- includeUrlText: Optional text that found URLs must contain. Leave it empty to save every sitemap URL.
- maxResults: Maximum number of URLs to save. The default example uses 25 so scheduled smoke runs finish quickly.
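A filtered input can be assembled programmatically from the three fields above. The helper below is a small sketch: `build_run_input` is a hypothetical name, and the validation rules are assumptions added for illustration rather than checks the Actor is documented to perform.

```python
import json


def build_run_input(sites, include_url_text="", max_results=25):
    """Build an input object using the websites, includeUrlText, and maxResults fields."""
    if not sites:
        raise ValueError("at least one website is required")
    if max_results < 1:
        raise ValueError("max_results must be a positive integer")
    run_input = {
        "websites": [{"url": site} for site in sites],
        "maxResults": max_results,
    }
    # Omit the filter entirely when it is empty, so every sitemap URL is saved.
    if include_url_text:
        run_input["includeUrlText"] = include_url_text
    return run_input


print(json.dumps(build_run_input(["https://docs.apify.com"], "/platform/"), indent=2))
```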
Advanced API users can still pass targetUrlRegex, maxRequestsPerCrawl, proxyConfiguration, or legacy startUrls, but they are not needed for a normal run. Concurrency is fixed internally at a low value for the 256 MB memory tier.
Output
The Actor saves useful URL rows to the default dataset.
{
  "url": "https://docs.apify.com/platform/actors",
  "domain": "docs.apify.com",
  "path": "/platform/actors",
  "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
  "sourceDomain": "docs.apify.com",
  "filterType": "contains",
  "filterValue": "/platform/",
  "matchedAt": "2026-05-17T08:50:04.000Z"
}
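To make the row fields concrete, here is one way such a row could be derived from a URL and the sitemap it came from. This is an illustrative sketch only: `make_row` is a hypothetical helper, the substring check is an assumption about how a "contains" filter might behave, and the row shape for unfiltered runs is not shown in this page, so the sketch does not model it.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def make_row(url, source_url, filter_value, now=None):
    """Return a dataset-style row, or None when the URL does not contain the filter text."""
    if filter_value not in url:
        return None
    # Normalize the timestamp to UTC so the trailing 'Z' is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    parts = urlsplit(url)
    return {
        "url": url,
        "domain": parts.hostname or "",
        "path": parts.path or "/",
        "sourceUrl": source_url,
        "sourceDomain": urlsplit(source_url).hostname or "",
        "filterType": "contains",
        "filterValue": filter_value,
        # ISO 8601 with millisecond precision, matching the sample row's format.
        "matchedAt": now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z",
    }
```

Calling `make_row("https://docs.apify.com/platform/actors", "https://docs.apify.com/sitemap_base.xml", "/platform/")` produces a row with the same keys as the sample, with `matchedAt` set to the current time.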
The run summary is stored in the key-value store as OUTPUT. It includes the filter used, number of matched URLs, processed sitemap counts, discovered URL counts, and failed request count.
Notes
- Duplicate URLs are filtered with a compact memory-safe deduper.
- Leave includeUrlText empty when you want the full sitemap URL list.
- The Actor defaults to a cheap 256 MB memory tier.
- Fixed concurrency keeps large gzip sitemap runs inside the cheap memory tier.
- The bundled example input is intentionally small and unfiltered so daily checks return results quickly.
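The notes above do not say how the deduper works. One common way to keep duplicate tracking compact is to store a short fixed-size hash of each URL instead of the full string, as sketched below. `CompactDeduper` is a hypothetical class and this technique is an assumption, not a description of the Actor's internals.

```python
import hashlib


class CompactDeduper:
    """Track seen URLs by an 8-byte digest rather than the full URL string."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Return True if the URL is new, False if it was already seen."""
        digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

    def __len__(self):
        return len(self._seen)


deduper = CompactDeduper()
candidates = [
    "https://example.com/blog/1",
    "https://example.com/blog/2",
    "https://example.com/blog/1",
]
unique = [u for u in candidates if deduper.add(u)]
print(unique)
```

The trade-off is that per-URL memory no longer grows with URL length, at the cost of a very small chance that two different URLs share a digest and one is dropped.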
Python API
from apify_client import ApifyClient

TOKEN = "YOUR_APIFY_TOKEN"
ACTOR_ID = "TheScrapeLab/sitemap-target-url-extractor"

apify_client = ApifyClient(TOKEN)
actor_client = apify_client.actor(ACTOR_ID)

run_input = {
    "websites": [{"url": "https://docs.apify.com"}],
    "maxResults": 25,
}

# Start the Actor and wait for the run to finish.
call_result = actor_client.call(run_input=run_input)
if call_result is None:
    raise RuntimeError("Actor run failed")

# Read the saved URL rows from the run's default dataset.
dataset_client = apify_client.dataset(call_result.default_dataset_id)
items = dataset_client.list_items().items
for item in items:
    print(item["url"], item["sourceUrl"])