Robots.txt Monitor

Stateful robots.txt monitoring Actor with baseline awareness, diff-based detection, and severity-classified alerts.

This Actor is designed for monitoring, not validation or SEO auditing.
It reports only meaningful changes over time and avoids noisy false positives.


Snapshot Contract

This Actor uses a versioned, stable snapshot schema.

  • Snapshot version: v1
  • Schema changes require explicit migration
  • Downstream consumers may rely on field names and severity semantics

What this Actor monitors

  • robots.txt availability (HTTP reachability)
  • User-agent rule changes
  • Allow / Disallow directive changes
  • Crawl-delay and request-rate changes
  • Sitemap directive changes
  • Formatting-only edits (comments / whitespace)
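
For reference, the directives above appear in a robots.txt file roughly like this (the host and paths are placeholders):

User-agent: *
Allow: /public/
Disallow: /private/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml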

The Actor stores a baseline snapshot on first run and compares all subsequent runs against it.


Alert Semantics (Severity Contract)

This Actor follows a strict severity contract.

Each severity level has a clear operational meaning so you can safely wire alerts without alert fatigue.

Severity levels

🔴 Critical

Meaning: Access restriction or loss of reachability.

You should act immediately if this affects your crawl policy or availability.

Triggered when:

  • robots.txt becomes unreachable (HTTP error or network failure)
  • New Disallow rules are added under User-agent: *
  • Existing Allow rules are removed and crawling becomes more restrictive

Critical alerts are intentionally rare.


🟠 Warning

Meaning: Policy change requiring review.

Review when convenient.

Triggered when:

  • Disallow rules added for specific (non-global) user-agents
  • User-agent blocks are removed
  • Crawl-delay or request-rate values change
  • Sitemap directives are removed
  • All sitemap directives disappear

Warnings indicate policy changes, not outages.


🔵 Info

Meaning: Non-blocking or informational change.

No action required.

Triggered when:

  • robots.txt recovers after being unreachable
  • New user-agent blocks are added
  • New sitemap directives are added
  • Formatting-only changes (comments, whitespace, ordering)

Info events exist for traceability and audits.


Examples

Example 1 — robots.txt becomes unreachable

{
  "type": "robots_txt_unreachable",
  "severity": "critical",
  "description": "robots.txt became unreachable"
}

Example 2 — New global disallow added

User-agent: *
Disallow: /private/

{
  "type": "disallow_added",
  "severity": "critical",
  "description": "Disallow added for *: /private/"
}

Example 3 — Crawl-delay changed

{
  "type": "crawl_delay_changed",
  "severity": "warning",
  "description": "Crawl-delay changed for googlebot"
}

Example 4 — Sitemap removed

{
  "type": "sitemap_removed",
  "severity": "warning",
  "description": "Sitemap removed: https://example.com/sitemap.xml"
}

Example 5 — robots.txt formatting-only change

{
  "type": "formatting_only",
  "severity": "info",
  "description": "Formatting-only changes detected"
}

First Run (Baseline)

On the first execution:

  • robots.txt is fetched
  • A normalized snapshot is stored
  • No diff or alerts are emitted
  • unchanged is null

This behavior is intentional. Monitoring begins on the second run onward.
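
As a rough illustration, using the OUTPUT fields described under Output Contract below, a first run produces something like the following (whether baseline is a flag or a richer object is defined by the Actor; a flag is assumed here):

{
  "baseline": true,
  "unchanged": null,
  "summary": { "critical": 0, "warning": 0, "info": 0 },
  "changes": []
}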


Output Contract

Each run produces:

  • One snapshot stored in a KV store (per monitored site)
  • One dataset row summarizing the run
  • A structured OUTPUT object containing:
    • baseline
    • unchanged
    • summary (critical / warning / info counts)
    • changes[]

This makes the Actor safe for:

  • Scheduling
  • Webhooks
  • Alert automation
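
Putting these fields together, a run that detects one policy change might emit an OUTPUT along these lines (values are illustrative; each change object follows the event format shown in the Examples above):

{
  "baseline": false,
  "unchanged": false,
  "summary": { "critical": 0, "warning": 1, "info": 0 },
  "changes": [
    {
      "type": "crawl_delay_changed",
      "severity": "warning",
      "description": "Crawl-delay changed for googlebot"
    }
  ]
}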

Fetch Failure Semantics

  • httpStatus = 0 indicates a network error or timeout
  • Fetch timeouts are treated as unreachable
  • Output is still produced even on failure
  • Snapshots are still stored for continuity
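
For example, a run whose fetch times out still emits a complete OUTPUT, roughly like this (illustrative; the change object matches Example 1 above):

{
  "baseline": false,
  "unchanged": false,
  "summary": { "critical": 1, "warning": 0, "info": 0 },
  "changes": [
    {
      "type": "robots_txt_unreachable",
      "severity": "critical",
      "description": "robots.txt became unreachable"
    }
  ]
}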

Deliberately Ignored Changes

The following do NOT trigger rule-level alerts:

  • Comment-only changes
  • Whitespace differences
  • Line reordering
  • Unknown or unsupported directives

These may still appear as formatting_only info events.
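
For instance, consider an edit that only rewrites a comment while leaving every rule intact (placeholder content):

Before:

# robots policy, last reviewed 2023
User-agent: *
Disallow: /private/

After:

# robots policy, last reviewed 2024
User-agent: *
Disallow: /private/

This edit produces at most a single formatting_only info event and never a rule-level alert.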


Design Philosophy

  • Stateful, not stateless
  • Monitoring, not auditing
  • Low noise over high sensitivity
  • Safe to run indefinitely
  • Clear alert meaning

If you wire alerts (see the sketch below):

  • Page on critical
  • Notify on warning
  • Log info

Recommended usage:

  • Run daily or hourly
  • Combine with sitemap and URL monitors
  • Use Apify webhooks for alerting
  • Treat robots.txt as a policy signal, not a static file
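
If you want a starting point for alert routing, here is a minimal TypeScript sketch. It assumes you have already retrieved a run's OUTPUT object (for example from the run's default key-value store); the RobotsMonitorOutput interface is inferred from the Output Contract above, and pageOnCall / notifyChannel are hypothetical placeholders for your own integrations.

interface ChangeEvent {
  type: string;
  severity: "critical" | "warning" | "info";
  description: string;
}

interface RobotsMonitorOutput {
  baseline: boolean;
  unchanged: boolean | null;
  summary: { critical: number; warning: number; info: number };
  changes: ChangeEvent[];
}

// Route each detected change according to the severity contract:
// page on critical, notify on warning, log info.
function routeAlerts(output: RobotsMonitorOutput): void {
  if (output.baseline || output.unchanged) return; // baseline run or no changes

  for (const change of output.changes) {
    switch (change.severity) {
      case "critical":
        pageOnCall(change.description); // hypothetical pager integration
        break;
      case "warning":
        notifyChannel(change.description); // hypothetical chat or email notifier
        break;
      default:
        console.log(`[info] ${change.description}`);
    }
  }
}

// Placeholder integrations; replace with your own alerting stack.
function pageOnCall(message: string): void {
  console.error(`[PAGE] ${message}`);
}

function notifyChannel(message: string): void {
  console.warn(`[NOTIFY] ${message}`);
}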