Robots.txt Monitor

Pricing

Pay per event

Stateful robots.txt monitoring with baseline awareness and severity-classified alerts. Detects meaningful policy changes over time — not noisy diffs.

Developer

DatawinderLabs

Maintained by Community

17 days ago

Last modified

Stateful robots.txt monitoring Actor with baseline awareness, diff-based detection, and severity-classified alerts.

This Actor is designed for monitoring, not validation or SEO auditing.
It reports only meaningful changes over time and avoids noisy false positives.

This Actor is stateful. Alerts are emitted only after a baseline snapshot exists (from the second run onward).


Snapshot Contract

This Actor uses a versioned, stable snapshot schema.

  • Snapshot version: v1
  • Schema changes require explicit migration
  • Downstream consumers may rely on field names and severity semantics

What this Actor monitors

  • robots.txt availability (HTTP reachability)
  • User-agent rule changes
  • Allow / Disallow directive changes
  • Crawl-delay and request-rate changes
  • Sitemap directive changes
  • Formatting-only edits (comments / whitespace)

The Actor stores a baseline snapshot on first run and compares all subsequent runs against it.


Alert Semantics (Severity Contract)

This Actor follows a strict severity contract.

Each severity level has a clear operational meaning so you can safely wire alerts without alert fatigue.

Severity levels

🔴 Critical

Meaning: Access restriction or loss of reachability.

You should act immediately if this affects your crawl policy or availability.

Triggered when:

  • robots.txt becomes unreachable (HTTP error or network failure)
  • New Disallow rules are added under User-agent: *
  • Existing Allow rules are removed and crawling becomes more restrictive

Critical alerts are intentionally rare.


🟠 Warning

Meaning: Policy change requiring review.

Review when convenient.

Triggered when:

  • Disallow rules added for specific (non-global) user-agents
  • User-agent blocks are removed
  • Crawl-delay or request-rate values change
  • Sitemap directives are removed
  • All sitemap directives disappear

Warnings indicate policy changes, not outages.


🔵 Info

Meaning: Non-blocking or informational change.

No action required.

Triggered when:

  • robots.txt recovers after being unreachable
  • New user-agent blocks are added
  • New sitemap directives are added
  • Formatting-only changes (comments, whitespace, ordering)

Info events exist for traceability and audits.
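The severity rules for added Disallow directives can be sketched as a small diff. This is an illustration under stated assumptions, not the Actor's actual implementation: the function name `classify_disallow_additions` and the dict-of-sets rule representation are hypothetical. Only the `disallow_added` type and the severity values come from this document.

```python
def classify_disallow_additions(baseline, current):
    """Emit one event per Disallow path added since the baseline.

    Both arguments map a lowercase user-agent to a set of Disallow paths
    (a hypothetical representation chosen for this sketch).
    """
    events = []
    # Only compare user-agent blocks present in both snapshots; newly added
    # blocks are a separate, info-level change in the severity contract.
    for agent in sorted(current.keys() & baseline.keys()):
        for path in sorted(current[agent] - baseline[agent]):
            # Global rules (User-agent: *) are critical, agent-specific ones are warnings.
            severity = "critical" if agent == "*" else "warning"
            events.append({
                "type": "disallow_added",
                "severity": severity,
                "description": f"Disallow added for {agent}: {path}",
            })
    return events


baseline = {"*": {"/admin/"}, "googlebot": set()}
current = {"*": {"/admin/", "/private/"}, "googlebot": {"/tmp/"}}
events = classify_disallow_additions(baseline, current)
```

Keeping the global check as a single comparison against `*` mirrors the contract: the same directive is critical or a warning depending only on which user-agent block it lands in.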


Examples

Example 1 — robots.txt becomes unreachable

{
"type": "robots_txt_unreachable",
"severity": "critical",
"description": "robots.txt became unreachable"
}

Example 2 — New global disallow added

User-agent: *
Disallow: /private/

{
"type": "disallow_added",
"severity": "critical",
"description": "Disallow added for *: /private/"
}

Example 3 — Crawl-delay changed

{
"type": "crawl_delay_changed",
"severity": "warning",
"description": "Crawl-delay changed for googlebot"
}

Example 4 — Sitemap removed

{
"type": "sitemap_removed",
"severity": "warning",
"description": "Sitemap removed: https://example.com/sitemap.xml"
}

Example 5 — robots.txt formatting-only change

{
"type": "formatting_only",
"severity": "info",
"description": "Formatting-only changes detected"
}
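Events shaped like the examples above can be rolled up into the critical / warning / info counts that the summary reports. The following is a minimal sketch; `summarize` is a hypothetical helper, and it assumes each change carries a `severity` field with one of the three documented values.

```python
from collections import Counter


def summarize(changes):
    """Count change events per severity level."""
    counts = Counter(change["severity"] for change in changes)
    # Always report all three levels so consumers never hit a missing key.
    return {level: counts.get(level, 0) for level in ("critical", "warning", "info")}


changes = [
    {"type": "disallow_added", "severity": "critical",
     "description": "Disallow added for *: /private/"},
    {"type": "sitemap_removed", "severity": "warning",
     "description": "Sitemap removed: https://example.com/sitemap.xml"},
    {"type": "formatting_only", "severity": "info",
     "description": "Formatting-only changes detected"},
]
summary = summarize(changes)
```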

First Run (Baseline)

On the first execution:

  • robots.txt is fetched
  • A normalized snapshot is stored
  • No diff or alerts are emitted
  • unchanged is null

This behavior is intentional. Monitoring begins on the second run onward.
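A consumer can tell a baseline run from a monitoring run by checking `unchanged`. The sketch below assumes the OUTPUT object has been decoded from JSON into a dict, and that `unchanged` is a boolean on every run after the first. The second assumption is not stated in this document, and the helper names are hypothetical.

```python
import json


def is_baseline_run(output):
    """True on the first run, where `unchanged` is null (None once decoded)."""
    return output.get("unchanged") is None


def should_evaluate_alerts(output):
    """Only look at changes[] when a baseline exists and something changed."""
    return not is_baseline_run(output) and output["unchanged"] is False


first = json.loads('{"unchanged": null, "changes": []}')
later = json.loads(
    '{"unchanged": false, "changes": [{"type": "sitemap_removed", "severity": "warning"}]}'
)
```

Checking for `None` explicitly, rather than relying on truthiness, keeps the baseline run distinct from a later run where `unchanged` is `false`.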


Output Contract

Each run produces:

  • One snapshot stored in a KV store (per monitored site)
  • One dataset row summarizing the run
  • A structured OUTPUT object containing:
    • baseline
    • unchanged
    • summary (critical / warning / info counts)
    • changes[]

This makes the Actor safe for:

  • Scheduling
  • Webhooks
  • Alert automation
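Downstream code can check the four documented OUTPUT fields before acting on a run. This is a sketch only: the `RunOutput` class and `parse_output` function are hypothetical, and because this document does not specify the type of `baseline`, it is typed as `Any` here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

REQUIRED_FIELDS = {"baseline", "unchanged", "summary", "changes"}


@dataclass
class RunOutput:
    baseline: Any              # type not specified in the source document
    unchanged: Optional[bool]  # None on the first (baseline) run
    summary: dict              # critical / warning / info counts
    changes: list = field(default_factory=list)


def parse_output(raw: dict) -> RunOutput:
    """Fail loudly if a documented field is missing from the OUTPUT object."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"OUTPUT is missing fields: {sorted(missing)}")
    return RunOutput(raw["baseline"], raw["unchanged"], raw["summary"], list(raw["changes"]))


parsed = parse_output({
    "baseline": None,
    "unchanged": False,
    "summary": {"critical": 0, "warning": 1, "info": 0},
    "changes": [{"type": "sitemap_removed", "severity": "warning"}],
})
```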

Fetch Failure Semantics

  • httpStatus = 0 indicates a network error or timeout
  • Fetch timeouts are treated as unreachable
  • Output is still produced even on failure
  • Snapshots are still stored for continuity
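The reachability rules can be expressed as a transition between two consecutive fetch results. The sketch below treats `httpStatus = 0` as unreachable, as documented. Treating 4xx and 5xx responses as unreachable is an assumption about what "HTTP error" means here, and the `robots_txt_recovered` type name is hypothetical (this document names only `robots_txt_unreachable`).

```python
def is_unreachable(http_status: int) -> bool:
    """0 means network error or timeout; 4xx/5xx handling is an assumption."""
    return http_status == 0 or http_status >= 400


def reachability_event(prev_status: int, curr_status: int):
    """Return an event only when reachability flips between two runs."""
    was_down = is_unreachable(prev_status)
    is_down = is_unreachable(curr_status)
    if not was_down and is_down:
        return {"type": "robots_txt_unreachable", "severity": "critical"}
    if was_down and not is_down:
        # Hypothetical type name for the documented info-level recovery event.
        return {"type": "robots_txt_recovered", "severity": "info"}
    return None
```

Emitting an event only on the transition, not on every failed run, is one way to keep a prolonged outage from producing a critical alert on each scheduled run.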

Deliberately Ignored Changes

The following do NOT trigger rule-level alerts:

  • Comment-only changes
  • Whitespace differences
  • Line reordering
  • Unknown or unsupported directives

These may still appear as formatting_only info events.
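One way to separate formatting-only edits from rule changes is to compare normalized forms of the two files. The sketch below is not the Actor's normalization. It handles comments and whitespace only; ignoring line reordering would additionally require group-aware parsing of user-agent blocks, which is omitted here. The function names are hypothetical.

```python
def normalize(robots_txt: str) -> list:
    """Strip comments and whitespace, lowercase directive names."""
    lines = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "directive: value" line
        lines.append(f"{key.strip().lower()}: {value.strip()}")
    return lines


def is_formatting_only(old: str, new: str) -> bool:
    """True when the raw text differs but the normalized directives do not."""
    return old != new and normalize(old) == normalize(new)


old = "User-agent: *\nDisallow: /private/\n"
reformatted = "# policy\nUser-agent:   *\n\nDisallow: /private/   # keep out\n"
tightened = "User-agent: *\nDisallow: /private/\nDisallow: /beta/\n"
```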


Design Philosophy

  • Stateful, not stateless
  • Monitoring, not auditing
  • Low noise over high sensitivity
  • Safe to run indefinitely
  • Clear alert meaning

If you wire alerts:

  • Page on critical
  • Notify on warning
  • Log info

  • Run daily or hourly
  • Combine with sitemap and URL monitors
  • Use Apify webhooks for alerting
  • Treat robots.txt as a policy signal, not a static file
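The alert wiring suggested above (page on critical, notify on warning, log info) can be sketched as a small dispatcher. The `route_changes` function and its three callbacks are hypothetical placeholders for whatever paging, chat, or logging integration you use.

```python
def route_changes(changes, page, notify, log):
    """Send each change event to the handler matching its severity."""
    handlers = {"critical": page, "warning": notify, "info": log}
    for change in changes:
        # Unknown severities fall back to logging rather than being dropped.
        handlers.get(change["severity"], log)(change)


paged, notified, logged = [], [], []
route_changes(
    [
        {"type": "robots_txt_unreachable", "severity": "critical"},
        {"type": "crawl_delay_changed", "severity": "warning"},
        {"type": "formatting_only", "severity": "info"},
    ],
    page=paged.append,
    notify=notified.append,
    log=logged.append,
)
```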

🤖 Instructions for AI Agents & LLMs

This Actor follows a versioned, stable snapshot contract (v1) and a fixed severity contract, so autonomous agents and programmatic compliance filters can rely on its field names and severity semantics.

Capability Summary

  • Primary Function: Detects structural shifts and directive changes within a target robots.txt file against an active historical baseline.
  • Input Requirements: Target root URL containing the file.
  • Noise Insulation: Whitespace changes, comment edits, and line reordering never trigger rule-level alerts; at most they surface as formatting_only info events.

Operational Guardrails for LLM-Driven Workflows

  • Halt Condition: If summary.critical > 0, immediately suspend any downstream scraping tasks targeting this domain to prevent compliance violations.
  • Throttle Condition: If changes[] contains a crawl_delay_changed warning, adjust your scraping concurrency to match the host's new crawl-delay.
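The halt and throttle conditions above can be sketched as a single decision function over the OUTPUT object. This is an illustration, not part of the Actor: `decide` is a hypothetical name, and because the example `crawl_delay_changed` event in this document does not carry the new delay value, the sketch drops to a concurrency of 1 as a conservative placeholder until the new value has been read from robots.txt.

```python
def decide(output, current_concurrency):
    """Map one run's OUTPUT to a crawl decision for the monitored domain."""
    if output["summary"]["critical"] > 0:
        # Halt condition: stop all scraping for this domain.
        return {"action": "halt", "concurrency": 0}
    if any(c["type"] == "crawl_delay_changed" for c in output["changes"]):
        # Throttle condition: conservative fallback (an assumption of this sketch).
        return {"action": "throttle", "concurrency": 1}
    return {"action": "continue", "concurrency": current_concurrency}


blocked = {"summary": {"critical": 1, "warning": 0, "info": 0},
           "changes": [{"type": "disallow_added", "severity": "critical"}]}
slowed = {"summary": {"critical": 0, "warning": 1, "info": 0},
          "changes": [{"type": "crawl_delay_changed", "severity": "warning"}]}
quiet = {"summary": {"critical": 0, "warning": 0, "info": 0}, "changes": []}
```

Checking the halt condition first means a run that contains both a critical change and a crawl-delay change always halts.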

Efficiency Profile

  • Task Cost: Fixed $0.025 per snapshot run.
  • Integration: Run this Actor upstream of your web crawling or RAG data preparation steps so that robots.txt changes act as a policy signal before crawling starts.
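At the fixed per-run price stated above, the cost of a schedule is simple arithmetic. The sketch below assumes one monitored site, one snapshot per run, and a 30-day month; `monthly_cost` is a hypothetical helper, and platform charges other than the per-snapshot price are not considered.

```python
COST_PER_RUN_USD = 0.025  # fixed price per snapshot run, from this document


def monthly_cost(runs_per_day: int, days: int = 30) -> float:
    """Estimated monthly cost in USD for one monitored site."""
    return round(runs_per_day * days * COST_PER_RUN_USD, 2)


daily = monthly_cost(1)    # one run per day
hourly = monthly_cost(24)  # one run per hour
```

Under these assumptions a daily schedule costs about $0.75 per month and an hourly schedule about $18 per month.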