Sitemap URL Finder

Pricing

from $0.05 / 1,000 results

Find and export URLs from any website’s robots.txt and sitemaps. Enter a domain or website URL, optionally filter matching URLs by text, and get clean dataset rows with the URL, domain, path, source sitemap, and match details.

Developer

Inus Grobler

Maintained by Community

Find URLs from websites without writing crawler code. The Actor starts from website homepages or domains, automatically checks robots.txt and /sitemap.xml, follows sitemap indexes, reads XML and plain text sitemaps, removes duplicate URLs, and saves clean URL inventory rows to the dataset.

Use it for:

  • collecting product, category, blog, documentation, or listing URLs before a crawl,
  • building URL inventories for SEO, QA, enrichment, or lead workflows,
  • finding URLs that contain a simple section path such as /products/, /blog/, or /platform/,
  • exporting sitemap URLs with the source sitemap attached.
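
The discovery steps described above (robots.txt, sitemap indexes, XML and plain text sitemaps) can be illustrated with a small standard-library sketch. It is not the Actor's code: the function names sitemaps_from_robots and parse_sitemap are hypothetical, and the sketch assumes sitemap bodies are already downloaded and decompressed.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in `Sitemap:` directives of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def _local_name(tag: str) -> str:
    """Drop the XML namespace prefix, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(body: str) -> tuple[str, list[str]]:
    """Classify a sitemap body and return (kind, locations).

    kind is 'index' for a <sitemapindex> (locations are child sitemaps),
    'urlset' for a <urlset>, and 'text' for a plain text sitemap with
    one URL per line.
    """
    stripped = body.lstrip()
    if not stripped.startswith("<"):
        return "text", [ln.strip() for ln in body.splitlines() if ln.strip()]

    root = ET.fromstring(stripped)
    kind = "index" if _local_name(root.tag) == "sitemapindex" else "urlset"
    locations = []
    for entry in root:  # <sitemap> or <url> elements
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text:
                locations.append(child.text.strip())
    return kind, locations
```

A crawler built on this would fetch each 'index' location in turn and collect the locations of every 'urlset' or 'text' sitemap it reaches.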

Input

A normal run needs two fields, websites and maxResults, plus an optional URL text filter, includeUrlText.

{
    "websites": [
        {
            "url": "https://docs.apify.com"
        }
    ],
    "maxResults": 25
}

Fields

  • websites: One or more website homepages or domains. The Actor discovers robots.txt and /sitemap.xml automatically.
  • includeUrlText: Optional text that found URLs must contain. Leave it empty to save every sitemap URL.
  • maxResults: Maximum number of URLs to save. The default example uses 25 so scheduled smoke runs finish quickly.
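
To make the interaction between includeUrlText and maxResults concrete, here is a minimal sketch of a "contains" filter followed by a result cap. The helper select_urls is hypothetical and only approximates the documented behavior; in particular, treating the match as case-sensitive is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator


def select_urls(
    urls: Iterable[str],
    include_url_text: str = "",
    max_results: int = 25,
) -> Iterator[str]:
    """Yield at most max_results URLs that contain include_url_text.

    An empty include_url_text matches every URL, mirroring the
    "leave it empty to save every sitemap URL" behavior.
    Assumption: the comparison is a case-sensitive substring test.
    """
    matches = (url for url in urls if include_url_text in url)
    return islice(matches, max_results)
```

Because the cap is applied after filtering, a narrow filter with a small maxResults still returns only matching URLs, never a mix.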

Advanced API users can still pass targetUrlRegex, maxRequestsPerCrawl, proxyConfiguration, or legacy startUrls, but they are not needed for a normal run. Concurrency is fixed internally at a low value for the 256 MB memory tier.

Output

The Actor saves each matched URL as one row in the default dataset. The sample row below comes from a run filtered on "/platform/".

{
    "url": "https://docs.apify.com/platform/actors",
    "domain": "docs.apify.com",
    "path": "/platform/actors",
    "sourceUrl": "https://docs.apify.com/sitemap_base.xml",
    "sourceDomain": "docs.apify.com",
    "filterType": "contains",
    "filterValue": "/platform/",
    "matchedAt": "2026-05-17T08:50:04.000Z"
}
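
The domain, path, and sourceDomain fields in the row above appear to be derived from url and sourceUrl. The sketch below recomputes them with the standard library and parses matchedAt into a timezone-aware datetime, which can help when validating or post-processing exported rows. The helpers derive_fields and parse_matched_at are hypothetical, and the derivation rules are an assumption based on the sample row.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit


def derive_fields(url: str, source_url: str) -> dict:
    """Recompute the host and path fields from a URL and its source sitemap."""
    parts = urlsplit(url)
    return {
        "domain": parts.hostname,
        "path": parts.path,
        "sourceDomain": urlsplit(source_url).hostname,
    }


def parse_matched_at(value: str) -> datetime:
    """Parse a timestamp such as '2026-05-17T08:50:04.000Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```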

The run summary is stored in the key-value store under the key OUTPUT. It includes the filter used, the number of matched URLs, processed sitemap counts, discovered URL counts, and the failed request count.

Notes

  • Duplicate URLs are filtered with a compact memory-safe deduper.
  • Leave includeUrlText empty when you want the full sitemap URL list.
  • The Actor defaults to a cheap 256 MB memory tier.
  • Fixed concurrency keeps large gzip sitemap runs inside the cheap memory tier.
  • The bundled example input is intentionally small and unfiltered so daily checks return results quickly.
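
The notes above do not say how the deduper works. One common way to keep deduplication memory small, shown here purely as an illustration and not as the Actor's implementation, is to remember a short fixed-size hash of each URL instead of the full string. The class CompactDeduper and the helper unique are hypothetical names.

```python
import hashlib
from typing import Iterable, Iterator


class CompactDeduper:
    """Track seen URLs by storing short digests rather than full strings."""

    def __init__(self, digest_size: int = 8) -> None:
        # digest_size is in bytes; blake2b accepts values from 1 to 64.
        self._digest_size = digest_size
        self._seen: set[bytes] = set()

    def add(self, url: str) -> bool:
        """Record the URL and return True if it was not seen before."""
        digest = hashlib.blake2b(
            url.encode("utf-8"), digest_size=self._digest_size
        ).digest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


def unique(urls: Iterable[str]) -> Iterator[str]:
    """Yield each URL once, in first-seen order."""
    deduper = CompactDeduper()
    for url in urls:
        if deduper.add(url):
            yield url
```

The trade-off is a very small chance that two different URLs share a digest and one is dropped; a larger digest_size lowers that chance at the cost of more memory per URL.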

Python API

from apify_client import ApifyClient

TOKEN = "YOUR_APIFY_TOKEN"
ACTOR_ID = "TheScrapeLab/sitemap-target-url-extractor"

apify_client = ApifyClient(TOKEN)
actor_client = apify_client.actor(ACTOR_ID)

run_input = {
    "websites": [{"url": "https://docs.apify.com"}],
    "maxResults": 25,
}

# Start the Actor and wait for the run to finish.
call_result = actor_client.call(run_input=run_input)
if call_result is None:
    raise RuntimeError("Actor run failed")

# Read the saved URL rows from the run's default dataset.
dataset_client = apify_client.dataset(call_result.default_dataset_id)
items = dataset_client.list_items().items
for item in items:
    print(item["url"], item["sourceUrl"])