Public Sitemap & Crawl Scope Planner

Pricing

from $5.00 / 1,000 useful crawl scope results

Turn public robots.txt and sitemap XML into crawl-scope briefs with URL inventory, path groups, seed URLs, freshness, diagnostics, and useful-result pricing.

Developer

jack su

Maintained by Community

Last modified: 8 days ago

Turn public robots.txt and sitemap XML into a bounded, AI-ready site index: sitemap URLs, sampled declared pages, page-type counts, path groups, recommended seed URLs, lastmod freshness, robots rules, source evidence, confidence, and diagnostics.

This Actor is for teams working on SEO, RAG preparation, AI-agent browsing, monitoring, and content operations that need a predictable URL inventory before deciding whether to crawl or process a website.

What It Returns

  • Sanitized site origin and robots URL
  • Same-site sitemap files read
  • Declared sitemap URL count
  • Sampled public URLs with lastmod, changefreq, priority, and page type
  • Page type counts
  • Path groups and recommended seed URLs
  • Latest lastmod value
  • Freshness summary
  • Robots user agents, allow/disallow samples, crawl delay, and rule count
  • Site index hash
  • Evidence URLs
  • Confidence, completeness, missing fields, diagnostics, and readable errors
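As a rough illustration of how a few of these fields could be derived, the sketch below uses only the Python standard library. It is not the Actor's implementation: the `summarize` function, the output key names, the first-path-segment grouping, and the "first URL per group" seed heuristic are all assumptions made for this example, and the inputs are inline sample strings rather than fetched resources.

```python
from collections import Counter
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample inputs (made up for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/blog/b</loc><lastmod>2024-06-15</lastmod></url>
  <url><loc>https://example.com/docs/start</loc></url>
</urlset>
"""


def summarize(robots_txt: str, sitemap_xml: str) -> dict:
    """Build a small site-index summary (hypothetical field names)."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    path_groups = Counter()
    seed_urls = {}
    lastmods = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # Assumed grouping rule: first path segment, "(root)" for "/".
        group = urlsplit(loc).path.strip("/").split("/")[0] or "(root)"
        path_groups[group] += 1
        # Assumed seed heuristic: first URL seen in each group.
        seed_urls.setdefault(group, loc)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if lastmod:
            lastmods.append(lastmod.strip())

    return {
        "sitemaps": robots.site_maps() or [],
        "crawl_delay": robots.crawl_delay("*"),
        "declared_url_count": sum(path_groups.values()),
        "path_groups": dict(path_groups),
        "seed_urls": seed_urls,
        # String max is only valid when all lastmod values share one ISO format.
        "latest_lastmod": max(lastmods) if lastmods else None,
    }


summary = summarize(ROBOTS_TXT, SITEMAP_XML)
print(summary)
```

Grouping by the first path segment is one simple choice; a real planner may group more finely or classify page types, which this sketch does not attempt.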

Pricing Design

The intended pay-per-event setup is:

  • apify-actor-start: a tiny run-start fee
  • useful-crawl-scope-result: charged only for useful public crawl-scope records with at least one same-site URL entry
  • no apify-default-dataset-item event (no per-dataset-item charge)

Robots-only, missing-sitemap, duplicate, private-network, invalid-input, and failed records should not trigger the useful-crawl-scope-result charge.
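The charging rule above can be sketched as a small predicate. The event names come from the list above; the `ScopeRecord` shape, the status labels, and the `plan_charges` helper are hypothetical and exist only for this example. The cost estimate uses the listed starting price of $5.00 per 1,000 useful results and leaves out the run-start fee, whose amount is not stated here.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical status labels for the non-chargeable cases listed above.
NON_CHARGEABLE = {
    "robots_only",
    "missing_sitemap",
    "duplicate",
    "private_network",
    "invalid_input",
    "failed",
}


@dataclass(frozen=True)
class ScopeRecord:
    site: str
    status: str               # "ok" or one of NON_CHARGEABLE (assumed labels)
    same_site_url_count: int  # same-site URL entries found in sitemaps


def is_useful(record: ScopeRecord) -> bool:
    """A record is chargeable only if it is not excluded and has >= 1 URL."""
    return record.status not in NON_CHARGEABLE and record.same_site_url_count >= 1


def plan_charges(records: list[ScopeRecord]) -> Counter:
    """Count events for one run: one start event plus useful results."""
    events = Counter({"apify-actor-start": 1})
    events["useful-crawl-scope-result"] = sum(1 for r in records if is_useful(r))
    return events


records = [
    ScopeRecord("https://example.com", "ok", 120),
    ScopeRecord("https://example.org", "robots_only", 0),
    ScopeRecord("https://example.com", "duplicate", 120),
    ScopeRecord("https://example.net", "ok", 0),  # no URL entries: not charged
]
charges = plan_charges(records)

# Listed starting price: $5.00 per 1,000 useful results (start fee excluded).
estimated_usd = charges["useful-crawl-scope-result"] * 5.00 / 1000
print(charges, estimated_usd)
```

In this sample run only the first record qualifies, so a single useful-crawl-scope-result event would be counted alongside the start event.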

Good Fits

  • Planning crawl scope before a crawl
  • Feeding RAG pipelines with sitemap URL candidates
  • Checking whether a website exposes usable sitemap metadata
  • SEO/content inventory triage
  • Building agent-friendly site maps without rendering pages

Boundaries

This Actor does not crawl arbitrary pages, log in, use cookies, render JavaScript, take screenshots, scrape search engines, scrape social platforms, or enrich private persons. It reads only public same-site robots.txt and sitemap resources. Credentials, query parameters, fragments, private-network addresses, localhost, .local, external sitemap files, and external page URLs are rejected, skipped, or safely redacted.
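A minimal sketch of this kind of URL filtering, using only the Python standard library, might look like the following. The `sanitize_same_site` function is hypothetical, and which inputs get rejected versus redacted is an assumption of this example: here credentials, non-public hosts, and external hosts are rejected, while query parameters and fragments are stripped. The check looks only at the literal host and does no DNS resolution, so it would not catch a public hostname that resolves to a private address.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def sanitize_same_site(url: str, site_host: str) -> Optional[str]:
    """Return a sanitized same-site URL, or None if the URL is rejected."""
    try:
        parts = urlsplit(url)
        port = parts.port
    except ValueError:
        return None  # malformed host or port

    if parts.scheme not in ("http", "https"):
        return None
    if parts.username or parts.password:
        return None  # embedded credentials

    host = (parts.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(".local"):
        return None

    # Reject literal private, loopback, and link-local IP addresses.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, treat as a hostname
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        return None

    if host != site_host.lower():
        return None  # external host

    # Drop query and fragment; keep scheme, host, optional port, and path.
    netloc = host if port is None else f"{host}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", "", ""))


print(sanitize_same_site("https://example.com/blog?utm=1#top", "example.com"))
print(sanitize_same_site("https://user:pw@example.com/", "example.com"))
```

Exact host matching is the strictest reading of "same-site"; whether subdomains or a `www.` variant should count is left open here.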