Pricing

from $9.00 / 1,000 results

Site Governance Page Scraper

Track deployment regressions and structural drift across your website. Automate robots.txt, sitemap, and schema checks immediately after code pushes.

Rating: 0.0 (0)

Developer: naoki anzai

Last modified: 14 days ago

Site Governance Monitor | Robots, Sitemap & Schema

Protect your digital infrastructure from accidental deployment regressions with a dedicated site governance monitor. Engineering teams and QA testers rely on this automated web scraper to inspect pages immediately after a code push, ensuring that new releases don't break essential SEO and compliance settings. Instead of manually checking production environments, you can schedule this browser-based tool to run automatically and track unwanted drift across your homepage, documentation, and pricing pages.

When integrated into your testing workflow, the scraper extracts critical details to verify that search engine bots from Google and other platforms can safely crawl your website. It actively monitors changes to robots.txt rules, sitemap availability, and structured data schema configurations. Whether checking a handful of URLs or scraping hundreds of pages, this tool acts as a safeguard against silent failures that impact web visibility and user experience.

The monitor generates structured data specifically tailored for developers and QA workflows. You can easily access concrete metrics like deployment_drift_percentage, robots_txt_diff, and schema_validation_errors. By establishing a continuous monitoring schedule, teams can proactively identify broken elements, missing schema, and faulty sitemaps before they affect organic search results.

Store Quickstart

Start with store-input.example.json for a concrete homepage-first run against vercel.com.
When that matches your workflow, switch to store-input.templates.json and choose one of:
- Quickstart: Homepage Governance Check (Starter Baseline)
- Agency Portfolio Site Monitor (Advanced Recurring)
- Release QA Site Monitor (Schema Regression Watch)
- Platform Site Governance Watch (Advanced Delivery)
- Robots.txt + Sitemap + Schema Monitor (Recurring Discoverability)

Key Features

🔗 URL-first workflow — Bulk-process thousands of URLs per run with parallel fetching
📊 Structured output — Every URL returns consistent, dataset-ready rows for downstream use
🛡️ Rate-limit aware — Exponential backoff and concurrency throttling keep you off block lists
📡 Webhook delivery — Push results to Slack, Discord, or any HTTP endpoint for real-time alerts
💰 No external APIs — Reads public data — zero API-key costs, zero vendor lock-in
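
As an illustration of the exponential backoff mentioned in the feature list, here is a minimal sketch of a capped doubling schedule. The function name, the 0.25-second base, and the 15-second cap are assumptions made for this example; the actor's real retry policy is not documented on this page.

```python
def backoff_delays(retries: int, base_secs: float = 0.25, cap_secs: float = 15.0) -> list[float]:
    """Return the wait before each retry: the base delay doubled per attempt, capped."""
    # base_secs and cap_secs are illustrative values, not the actor's settings.
    return [min(cap_secs, base_secs * 2 ** attempt) for attempt in range(retries)]


delays = backoff_delays(4)
print(delays)
```

Production schedules usually add random jitter to each delay so that parallel workers do not retry in lockstep; it is left out here to keep the schedule easy to read.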

Use Cases

Who	Why
Developers	Automate recurring data fetches without building custom scrapers
Data teams	Pipe structured output into analytics warehouses
Ops teams	Monitor changes via webhook alerts
Product managers	Track competitor/market signals without engineering time

Input

Field	Type	Default	Description
domains	array	prefilled	Starter quickstart: begin with 1-3 sites for a lightweight first success. Homepage-first runs stay intentionally small,
samplePaths	array	prefilled	Path-only routes to validate on every domain. Keep the starter quickstart homepage-first with ["/"], then add /pricing a
delivery	string	`"dataset"`	Starter path: dataset keeps the first run low-friction and still writes the full summary-first payload to OUTPUT. Advanc
webhookUrl	string	—	Advanced delivery only: required when delivery is webhook. Must be a valid http(s) URL. The payload includes the executi
snapshotKey	string	`"site-governance-monitor-snapshots"`	Keep this stable when you move from the homepage-first quickstart to recurring release-QA, portfolio, or webhook workflo
checkAiBots	boolean	`true`	Monitor robots.txt for missing files, AI crawler allow/block rules, and drift after releases.
checkSchema	boolean	`true`	Validate JSON-LD and Microdata on homepage, pricing, docs, and other release-sensitive templates.
checkSitemap	boolean	`true`	Monitor sitemap.xml reachability, freshness, robots.txt declarations, and URL inventory drift.

Input Example

{
  "domains": ["vercel.com"],
  "samplePaths": ["/"],
  "delivery": "dataset",
  "snapshotKey": "site-governance-homepage-quickstart",
  "checkAiBots": true,
  "checkSchema": true,
  "checkSitemap": true,
  "concurrency": 1,
  "batchDelayMs": 250,
  "requestTimeoutSecs": 15,
  "maxSitemapUrls": 5000
}
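
Before submitting a run, it can help to check the input locally. The sketch below covers only rules stated in the input table: `domains` is an array, `samplePaths` holds path-only routes, and `webhookUrl` must be a valid http(s) URL when `delivery` is `webhook`. The helper name `validate_input` is hypothetical, and treating "path-only" as "starts with /" is an assumption based on the documented examples.

```python
from urllib.parse import urlparse


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in a run input; an empty list means none found."""
    problems = []

    domains = run_input.get("domains")
    if not isinstance(domains, list) or not domains:
        problems.append("domains must be a non-empty array")

    for path in run_input.get("samplePaths", []):
        # Assumption: a path-only route starts with "/", as in "/" or "/pricing".
        if not isinstance(path, str) or not path.startswith("/"):
            problems.append(f"samplePaths entry {path!r} is not a path-only route")

    # "dataset" is the documented default for delivery.
    if run_input.get("delivery", "dataset") == "webhook":
        parsed = urlparse(run_input.get("webhookUrl") or "")
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("webhookUrl must be a valid http(s) URL when delivery is webhook")

    return problems


quickstart = {"domains": ["vercel.com"], "samplePaths": ["/"], "delivery": "dataset"}
print(validate_input(quickstart))
```

The actor performs its own validation on the platform; a local check like this only catches obvious mistakes before a paid run starts.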

Input Examples

Example: Single vendor privacy

{
  "urls": [
    "https://vendor.com/privacy"
  ]
}

Example: Bulk vendor portfolio

{
  "urls": [
    "https://vendor1.com/privacy",
    "https://vendor1.com/terms",
    "https://vendor2.com/privacy"
  ]
}

Example: Recurring change detection

{
  "urls": [
    "https://vendor.com/privacy"
  ],
  "snapshotKey": "vendor-privacy",
  "emitChangedOnly": true
}

Output

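Field	Type	Description
`meta`	object	Run-level summary: executiveSummary, runProfile, upgradeSuggestions, and nextWorkflow
`alerts`	array	Detected issues, one entry per alert
`results`	array	Per-domain entries with status, severity, brief, and recommendedActions
`alerts[].domain`	string	Domain the alert applies to
`alerts[].severity`	string	Alert severity, e.g. "high"
`alerts[].component`	string	Check that raised the alert, e.g. "sitemapHealth"
`alerts[].type`	string	Machine-readable alert code, e.g. "sitemap_missing"
`alerts[].message`	string	Human-readable description of the issue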

Output Example

{
  "meta": {
    "executiveSummary": {
      "overallStatus": "attention_needed",
      "recommendedCadence": "daily"
    },
    "runProfile": {
      "tier": "starter",
      "label": "Starter first-success path"
    },
    "upgradeSuggestions": [
      {
        "type": "webhook",
        "templateId": "action_needed_webhook",
        "title": "Route action-needed domains to your endpoint"
      }
    ],
    "nextWorkflow": {
      "type": "same_actor_template",
      "id": "action_needed_webhook",
      "title": "Next best step: Action-Needed Webhook Handoff"
    }
  },
  "alerts": [
    {
      "domain": "client-release.example",
      "severity": "high",
      "component": "sitemapHealth",
      "type": "sitemap_missing",
      "message": "No reachable XML sitemap was found for this domain."
    }
  ],
  "results": [
    {
      "domain": "client-release.example",
      "status": "changed",
      "severity": "high",
      "brief": "3 alert(s): No reachable XML sitemap was found for this domain.",
      "recommendedActions": [
        "Publish a reachable XML sitemap for the domain and keep it updated.",
        "Publish a robots.txt file so the robots.txt monitor can confirm which AI crawlers you allow or block."
      ]
    }
  ]
}
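
A payload shaped like the example above can gate a release pipeline. The following is a minimal sketch that filters alerts by severity and counts them per component. The helper names and the severity ordering are assumptions: only "high" appears in the example, so "low" and "medium" are hypothetical labels.

```python
from collections import Counter

# Assumed ordering of severity labels; only "high" appears in the documented example.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def alerts_at_or_above(payload: dict, threshold: str = "high") -> list[dict]:
    """Return alerts whose severity is at or above the threshold."""
    floor = SEVERITY_RANK[threshold]
    # Alerts with an unrecognized severity rank as -1 and are excluded.
    return [
        alert for alert in payload.get("alerts", [])
        if SEVERITY_RANK.get(alert.get("severity"), -1) >= floor
    ]


def count_by_component(payload: dict) -> Counter:
    """Count alerts per component, e.g. sitemapHealth."""
    return Counter(alert.get("component") for alert in payload.get("alerts", []))


sample = {
    "alerts": [
        {
            "domain": "client-release.example",
            "severity": "high",
            "component": "sitemapHealth",
            "type": "sitemap_missing",
            "message": "No reachable XML sitemap was found for this domain.",
        }
    ]
}

blocking = alerts_at_or_above(sample)
for alert in blocking:
    print(f"[{alert['severity']}] {alert['domain']}: {alert['message']}")
```

In a CI job, a non-empty `blocking` list could be turned into a non-zero exit code so that the deployment step fails visibly.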

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~site-governance-monitor/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "domains": ["vercel.com"], "samplePaths": ["/"], "delivery": "dataset", "snapshotKey": "site-governance-homepage-quickstart", "checkAiBots": true, "checkSchema": true, "checkSitemap": true, "concurrency": 1, "batchDelayMs": 250, "requestTimeoutSecs": 15, "maxSitemapUrls": 5000 }'

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/site-governance-monitor").call(run_input={
    "domains": ["vercel.com"],
    "samplePaths": ["/"],
    "delivery": "dataset",
    "snapshotKey": "site-governance-homepage-quickstart",
    "checkAiBots": True,
    "checkSchema": True,
    "checkSitemap": True,
    "concurrency": 1,
    "batchDelayMs": 250,
    "requestTimeoutSecs": 15,
    "maxSitemapUrls": 5000
})

# Print every item written to the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/site-governance-monitor').call({
  "domains": ["vercel.com"],
  "samplePaths": ["/"],
  "delivery": "dataset",
  "snapshotKey": "site-governance-homepage-quickstart",
  "checkAiBots": true,
  "checkSchema": true,
  "checkSitemap": true,
  "concurrency": 1,
  "batchDelayMs": 250,
  "requestTimeoutSecs": 15,
  "maxSitemapUrls": 5000
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Tips & Limitations

Keep concurrency ≤ 5 when auditing production sites to avoid WAF rate-limit triggers.
Use webhook delivery for recurring cron runs — push only deltas to downstream systems.
Enable dryRun for cheap validation before committing to a paid cron schedule.
Results are dataset-first; use Apify API run-sync-get-dataset-items for instant JSON in CI pipelines.
Run a tiny URL count first, review the sample, then scale up — pay-per-event means you only pay for what you use.

FAQ

Is there a rate limit?

Built-in concurrency throttling keeps requests polite. For most public websites this actor can run 1–10 parallel requests without issues; on production sites behind a WAF, keep concurrency at 5 or below.

What happens when the input URL is unreachable?

The actor records an error row with the failure reason — successful URLs keep processing.

Can I schedule recurring runs?

Yes — use Apify Schedules to run this actor on a cron (hourly, daily, weekly). Combine with webhook delivery for change alerts.

Does this actor respect robots.txt?

Yes — requests use a standard User-Agent and honor site rate limits. For aggressive audits, set a higher concurrency only on your own properties.

Can I integrate with Google Sheets or Airtable?

Use webhook delivery with a Zapier/Make/n8n catcher, or call the Apify REST API from Apps Script / Airtable automations.

URL/Link Tools cluster — explore related Apify tools:

🔗 URL Health Checker — Bulk-check HTTP status codes, redirects, SSL validity, and response times for thousands of URLs.
🔗 Broken Link Checker — Crawl websites to find broken links, 404 errors, and dead URLs.
🔗 URL Unshortener — Expand bit.
🏷️ Meta Tag Analyzer — Analyze meta tags, Open Graph, Twitter Cards, JSON-LD, and hreflang for any URL.
📚 Wayback Machine Checker — Check if URLs are archived on the Wayback Machine and find closest snapshots by date.
Sitemap Analyzer API | sitemap.xml SEO Audit — Analyze sitemap.
Schema.org Validator API | JSON-LD + Microdata — Validate JSON-LD and Microdata across multiple pages, score markup quality, and flag missing or malformed Schema.
RDAP Domain Monitor API | Ownership + Expiry — Monitor domain registration data via RDAP and track expiry, registrar, nameserver, and ownership changes in structured rows.
Domain Security Audit API | SSL Expiry, DMARC, Domain Expiry — Summary-first portfolio monitor for SSL expiry, DMARC/SPF/DKIM, domain expiry/ownership, and security headers with remediation-ready outputs.

Cost

Pay Per Event:

actor-start: $0.01 (flat fee per run)
dataset-item: $0.003 per output item

Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01

No subscription required — you only pay for what you use.
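
The pricing formula above can be sketched as a small estimator for budgeting scheduled runs. The two fees come from the list above; the function name and the `runs` parameter are hypothetical additions for this example.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.01")    # flat fee per run
DATASET_ITEM_FEE = Decimal("0.003")  # fee per output item


def estimate_cost(items_per_run: int, runs: int = 1) -> Decimal:
    """Estimate the pay-per-event cost in USD for a number of identical runs."""
    if items_per_run < 0 or runs < 0:
        raise ValueError("items_per_run and runs must be non-negative")
    return runs * (ACTOR_START_FEE + items_per_run * DATASET_ITEM_FEE)


single_run = estimate_cost(1000)              # the worked example above
daily_for_a_month = estimate_cost(3, runs=30)
print(single_run, daily_for_a_month)
```

`Decimal` is used instead of floats so that fractions of a cent add up exactly.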

⭐ Was this helpful?

If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.

Bug report or feature request? Open an issue on the Issues tab of this actor.
