Site Governance Monitor | Robots, Sitemap & Schema
Recurring robots.txt monitor, sitemap monitor, schema validator monitor, and release QA site monitor for homepage/pricing/docs drift, with one monitored domain summary per checked domain.

Pricing: Pay per usage

Rating: 0.0 (0 reviews)

Developer: 太郎 山田 (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago

Catch robots.txt, sitemap, schema, and homepage/pricing/docs governance drift in one run.

This actor turns abstract "AI discoverability governance" into four concrete recurring checks buyers can immediately understand:

  • robots.txt monitor — watch for missing robots.txt files, AI crawler allow/block rule changes, and policy drift
  • sitemap monitor — catch missing, stale, or undeclared XML sitemaps before discoverability drops
  • schema validator monitor — validate JSON-LD / Microdata on homepage, pricing, docs, and other key templates
  • release QA site monitor — compare snapshots over time so launches and template edits do not quietly break site governance

This is not a generic website audit. It is a summary-first site-governance monitor that keeps one monitored domain summary per checked domain, even when that domain has multiple alerts, warnings, and drift signals. That same monitored-domain summary remains the store-facing pricing unit.
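As a hedged sketch of the first check, a minimal robots.txt AI-crawler classifier could look like the following. The function names and the `blocked` / `allowed` / `unspecified` labels are illustrative assumptions, not the actor's real internals:

```javascript
// Parse robots.txt into user-agent groups (consecutive User-agent lines share
// the rules that follow; a rule line closes the current agent run).
function parseRobots(robotsTxt) {
  const groups = new Map(); // lowercased user-agent -> [{ rule, path }]
  let currentAgents = [];
  let sawRule = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim();
    const m = line.match(/^([a-z-]+):\s*(.*)$/i);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === "user-agent") {
      if (sawRule) { currentAgents = []; sawRule = false; }
      currentAgents.push(value.toLowerCase());
      if (!groups.has(value.toLowerCase())) groups.set(value.toLowerCase(), []);
    } else if (field === "allow" || field === "disallow") {
      sawRule = true;
      for (const agent of currentAgents) groups.get(agent).push({ rule: field, path: value });
    }
  }
  return groups;
}

// Classify one bot's posture: its own group wins, else the "*" group,
// else the file says nothing about it.
function classifyAiPosture(robotsTxt, botName) {
  const groups = parseRobots(robotsTxt);
  const rules = groups.get(botName.toLowerCase()) ?? groups.get("*");
  if (!rules || rules.length === 0) return "unspecified";
  const blockedAll = rules.some((r) => r.rule === "disallow" && r.path === "/");
  return blockedAll ? "blocked" : "allowed";
}
```

A missing robots.txt (empty string, or a fetch 404 upstream) naturally falls into `unspecified`, which is exactly the "rules are not explicit" alert this monitor is meant to raise.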

Who this actor is for

  • Agencies that need one recurring summary row and action queue per client site
  • Platform teams that need the main site, docs, support, developer, and status properties to stay aligned
  • Release QA teams that need a lightweight pre/post-release check for homepage, pricing, and docs templates
  • SEO, content, and discoverability owners that want concrete monitoring for robots.txt, sitemaps, and schema markup

First successful run

Start with one real site and the three paths most buyers recognize first:

```json
{
  "domains": ["vercel.com"],
  "samplePaths": ["/", "/pricing", "/docs"],
  "delivery": "dataset",
  "snapshotKey": "site-governance-homepage-pricing-docs",
  "checkAiBots": true,
  "checkSchema": true,
  "checkSitemap": true
}
```

That single run gives you one monitored-domain summary that answers four practical questions:

  • Is robots.txt present and are AI crawler rules explicit?
  • Is the sitemap reachable, fresh, and declared in robots.txt?
  • Do homepage, pricing, and docs pages still publish valid schema markup?
  • Did anything drift since the last release or weekly checkpoint?

Store quickstart

  • Start with store-input.example.json for a concrete first run against vercel.com across /, /pricing, and /docs.
  • When that matches your workflow, switch to store-input.templates.json and choose one of:
    • Quickstart: Homepage, Pricing & Docs
    • Agency Portfolio Site Monitor
    • Release QA Site Monitor
    • Platform Site Governance Watch
    • Robots.txt + Sitemap + Schema Monitor

Dataset delivery is the best first proof. Webhook delivery becomes the next step once you want the action-needed queue in release QA, platform ops, or agency reporting.

What this actor does

For each domain or web property, the actor combines three machine-readable monitors plus drift detection into one monitored-domain summary:

  • Robots.txt monitor: parses robots.txt, evaluates known AI crawler groups, classifies posture, and flags policy drift
  • Sitemap monitor: discovers sitemap surfaces, expands sitemap indexes, evaluates freshness and lastmod coverage, and flags missing or stale inventories
  • Schema validator monitor: samples the supplied samplePaths[], extracts JSON-LD and Microdata, validates the markup, and detects coverage regressions
  • Release QA / site-governance drift detection: compares the current run with prior snapshots so post-release changes are easy to spot
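As one concrete example, the freshness part of the sitemap monitor reduces to date math over the `lastmod` values extracted from `<url>` elements. A minimal sketch, where `maxAgeDays` and the returned field names are assumptions rather than the actor's published schema:

```javascript
// Hypothetical freshness summary over entries extracted from a sitemap.
// entries: [{ loc, lastmod? }]; maxAgeDays is an illustrative threshold.
function sitemapFreshness(entries, { now = new Date(), maxAgeDays = 90 } = {}) {
  const withLastmod = entries.filter((e) => e.lastmod);
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  const stale = withLastmod.filter((e) => new Date(e.lastmod).getTime() < cutoff);
  return {
    total: entries.length,
    lastmodCoverage: entries.length ? withLastmod.length / entries.length : 0,
    staleCount: stale.length,
  };
}
```

Low `lastmodCoverage` and a rising `staleCount` are the kind of signals that would feed the "missing or stale inventories" flag described above.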

The actor then turns those signals into:

  • one governanceScore
  • one ranked alerts[] list
  • one set of recommendedActions
  • one portfolio-style executiveSummary and actionNeededDigest

That monitored-domain summary is the flagship output and billing-safe unit. Multiple alerts, warnings, changes, and component details stay nested under that single per-domain summary.
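The roll-up into a single `governanceScore` can be pictured as a weighted blend of per-component scores mapped onto a letter grade. The weights and cutoffs below are illustrative assumptions, not the actor's published formula:

```javascript
// Illustrative scoring roll-up. Weights and grade cutoffs are assumptions.
// components: per-monitor scores on a 0-100 scale, e.g. { robots, sitemap, schema }.
function governanceScore(components) {
  const weights = { robots: 0.3, sitemap: 0.3, schema: 0.4 };
  const total = Math.round(
    Object.entries(weights).reduce((sum, [key, w]) => sum + w * (components[key] ?? 0), 0)
  );
  const grade =
    total >= 90 ? "A" : total >= 75 ? "B" : total >= 60 ? "C" : total >= 50 ? "D" : "F";
  return { total, grade };
}
```

The point of the single score is ranking: a portfolio sweep can sort domains by `total` and surface the worst grades first, regardless of how many raw alerts each domain carries.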

Flagship recurring templates

| Template | Best for | What it sharpens |
| --- | --- | --- |
| Quickstart: Homepage, Pricing & Docs | First success / solution engineers | Proves homepage, pricing, and docs drift detection in one summary row |
| Agency Portfolio Site Monitor | Agencies / consultancies | Recurring multi-client sweeps with one summary per client site |
| Release QA Site Monitor | Release QA / web teams | Pre/post-release checks on homepage, pricing, docs, and product templates |
| Platform Site Governance Watch | Platform governance / web ops | Action-needed webhook for main site, docs, support, status, and developer properties |
| Robots.txt + Sitemap + Schema Monitor | SEO / discoverability owners | Ongoing monitoring for the three machine-readable surfaces buyers care about |

Why this is better than separate utilities

Running robotstxt-ai-checker, sitemap-analyzer, and structured-data-validator separately creates operational noise:

  • three actors
  • three schedules
  • three payloads to reconcile
  • no shared governance score
  • no single action-needed queue

site-governance-monitor is the flagship combined lane:

  • one recurring task
  • one schedule
  • one dataset or webhook payload
  • one governance score per domain
  • one ranked list of domains that need attention first

That makes it a better fit for agencies, portfolio operators, platform governance owners, release QA teams, and discoverability leads that want one recurring signal instead of a bundle of disconnected utilities.

Input example

```json
{
  "domains": ["vercel.com"],
  "samplePaths": ["/", "/pricing", "/docs"],
  "delivery": "dataset",
  "snapshotKey": "site-governance-homepage-pricing-docs",
  "checkAiBots": true,
  "checkSchema": true,
  "checkSitemap": true,
  "concurrency": 1,
  "batchDelayMs": 250,
  "requestTimeoutSecs": 15,
  "maxSitemapUrls": 5000
}
```
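The politeness knobs (`concurrency`, `batchDelayMs`, `requestTimeoutSecs`) could be honored with a loop like the one below. `runThrottled` and `fetchWithTimeout` are hypothetical helpers sketched for illustration, not the actor's actual internals:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Check domains in batches of `concurrency`, pausing `batchDelayMs` between batches.
async function runThrottled(domains, checkDomain, { concurrency = 1, batchDelayMs = 250 } = {}) {
  const summaries = [];
  for (let i = 0; i < domains.length; i += concurrency) {
    const batch = domains.slice(i, i + concurrency);
    summaries.push(...(await Promise.all(batch.map(checkDomain))));
    if (i + concurrency < domains.length) await sleep(batchDelayMs); // polite gap between batches
  }
  return summaries;
}

// requestTimeoutSecs maps naturally onto AbortSignal.timeout per request:
const fetchWithTimeout = (url, requestTimeoutSecs = 15) =>
  fetch(url, { signal: AbortSignal.timeout(requestTimeoutSecs * 1000) });
```

With the defaults above (`concurrency: 1`), domains are checked strictly one at a time, which is the safest profile for client sites you do not control.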

Output example

```json
{
  "domain": "client-release.example",
  "status": "changed",
  "severity": "high",
  "alertCount": 2,
  "brief": "2 alert(s): No reachable XML sitemap was found for this domain.",
  "governanceScore": {
    "total": 46,
    "grade": "F"
  },
  "recommendedActions": [
    "Publish a reachable XML sitemap for the domain and keep it updated.",
    "Publish a robots.txt file so the robots.txt monitor can confirm which AI crawlers you allow or block."
  ]
}
```

A fuller payload is available in sample-output.example.json. When samplePaths includes /, /pricing, and /docs, the full output also shows which release-sensitive pages lost schema coverage, plus portfolio-level executiveSummary and actionNeededDigest fields for webhook delivery.
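When reading rows back from the dataset, the per-domain summaries fold naturally into a ranked action queue. The severity ordering below is an assumption about the severity values shown in the example output:

```javascript
// Hypothetical post-processing of dataset rows into a ranked action queue.
// Lower rank = more urgent; unknown severities sort last.
const SEVERITY_RANK = { high: 0, medium: 1, low: 2, none: 3 };

function actionNeededQueue(rows) {
  return rows
    .filter((row) => (row.alertCount ?? 0) > 0)
    .sort(
      (a, b) =>
        (SEVERITY_RANK[a.severity] ?? 9) - (SEVERITY_RANK[b.severity] ?? 9) ||
        b.alertCount - a.alertCount
    )
    .map((row) => `${row.domain}: ${row.brief}`);
  }
```

Clean domains drop out of the queue entirely, which keeps a recurring agency or release-QA report focused on what changed.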

Delivery modes

  • dataset: saves one monitored-domain summary row per checked domain to the actor dataset
  • webhook: sends the full governance payload (meta, alerts, results) to your webhook URL, with one monitored-domain summary in results[] for each checked domain

Dataset delivery is best for first proof, recurring QA evidence, and agency reporting. Webhook delivery is best when you want platform or release teams to work from an action-needed queue.
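On the receiving end, a webhook consumer only needs the `results[]` array described above. A minimal sketch, assuming just `results[]` with `domain` and `alertCount` fields; the rest of the payload (`meta`, `alerts`) is not modeled here, and you would wrap this in whatever HTTP handler your stack already uses:

```javascript
// Minimal webhook-payload consumer: count checked domains and
// extract the ones that need attention.
function summarizeWebhook(payload) {
  const results = payload.results ?? [];
  const needsAction = results.filter((row) => (row.alertCount ?? 0) > 0);
  return {
    checked: results.length,
    needsActionDomains: needsAction.map((row) => row.domain),
  };
}
```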

| Workflow | Why |
| --- | --- |
| Homepage / pricing / docs release QA | Catch schema, robots.txt, and sitemap drift on the pages buyers check first |
| Agency portfolio site monitor | Catch robots.txt, sitemap, and schema drift across multiple clients or brands |
| Platform site governance watch | Keep docs, support, developer, and status properties aligned with the main site |
| Robots.txt + sitemap + schema monitoring | Track whether crawlability and machine-readable discoverability stay intentional over time |

Cost profile

Store pricing is aligned to the monitored-domain summary, not raw alert or event volume. One checked domain produces one summary unit, regardless of how many underlying governance findings are attached to it.

The actor uses built-in Node.js networking and public site surfaces. That keeps maintenance cost low and avoids browser or proxy requirements for the core checks.

Commercial ops

Set up .env first:

$ cp -n .env.example .env

Configure the Apify task and schedule when you are ready for cloud ops:

$ npm run apify:cloud:setup

Local validation for this repository version:

$ npm test

Related standalone actors:

  • robotstxt-ai-checker — standalone robots.txt monitor
  • sitemap-analyzer — standalone sitemap monitor
  • structured-data-validator — standalone schema validator monitor
  • domain-trust-monitor — broader bundled monitor for domain trust posture