Robots.txt Monitor
Pricing
Pay per event
Stateful robots.txt monitoring with baseline awareness and severity-classified alerts. Detects meaningful policy changes over time — not noisy diffs.
Developer
DatawinderLabs
Maintained by Community
Last modified: 17 days ago
Stateful robots.txt monitoring Actor with baseline awareness, diff-based detection, and severity-classified alerts.
This Actor is designed for monitoring, not validation or SEO auditing.
It reports only meaningful changes over time and avoids noisy false positives.
This Actor is stateful. Alerts are emitted only after a baseline snapshot exists (from the second run onward).
Snapshot Contract
This Actor uses a versioned, stable snapshot schema.
- Snapshot version: v1
- Schema changes require explicit migration
- Downstream consumers may rely on field names and severity semantics
What this Actor monitors
- robots.txt availability (HTTP reachability)
- User-agent rule changes
- Allow / Disallow directive changes
- Crawl-delay and request-rate changes
- Sitemap directive changes
- Formatting-only edits (comments / whitespace)
The Actor stores a baseline snapshot on first run and compares all subsequent runs against it.
Alert Semantics (Severity Contract)
This Actor follows a strict severity contract.
Each severity level has a clear operational meaning so you can safely wire alerts without alert fatigue.
Severity levels
🔴 Critical
Meaning: Access restriction or loss of reachability.
You should act immediately if this affects your crawl policy or availability.
Triggered when:
- robots.txt becomes unreachable (HTTP error or network failure)
- New Disallow rules are added under User-agent: *
- Existing Allow rules are removed and crawling becomes more restrictive
Critical alerts are intentionally rare.
🟠 Warning
Meaning: Policy change requiring review.
Review when convenient.
Triggered when:
- Disallow rules added for specific (non-global) user-agents
- User-agent blocks are removed
- Crawl-delay or request-rate values change
- Sitemap directives are removed
- All sitemap directives disappear
Warnings indicate policy changes, not outages.
🔵 Info
Meaning: Non-blocking or informational change.
No action required.
Triggered when:
- robots.txt recovers after being unreachable
- New user-agent blocks are added
- New sitemap directives are added
- Formatting-only changes (comments, whitespace, ordering)
Info events exist for traceability and audits.
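The Disallow part of the severity contract can be sketched as follows. This is a minimal illustration, not the Actor's implementation: the function name `classify_disallow_changes` and the input shape (a mapping from lowercased user-agent to a set of Disallow paths) are assumptions. Only the event fields and the critical / warning split come from the contract above.

```python
from typing import Dict, List, Set


def classify_disallow_changes(
    old: Dict[str, Set[str]], new: Dict[str, Set[str]]
) -> List[dict]:
    """Emit one event per Disallow path added to an existing user-agent block.

    Both arguments are a hypothetical shape: lowercased user-agent mapped to
    its set of Disallow paths.
    """
    events = []
    for agent in sorted(new):
        if agent not in old:
            # A brand-new user-agent block is an info-level change in the
            # contract above; this sketch does not handle it.
            continue
        for path in sorted(new[agent] - old[agent]):
            # Global rules ("*") are critical; agent-specific rules are warnings.
            severity = "critical" if agent == "*" else "warning"
            events.append({
                "type": "disallow_added",
                "severity": severity,
                "description": f"Disallow added for {agent}: {path}",
            })
    return events


old = {"*": {"/admin/"}, "googlebot": set()}
new = {"*": {"/admin/", "/private/"}, "googlebot": {"/tmp/"}}
events = classify_disallow_changes(old, new)
```

With these inputs, the global `/private/` rule yields a critical event and the googlebot-only `/tmp/` rule yields a warning.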
Examples
Example 1 — robots.txt becomes unreachable
{"type": "robots_txt_unreachable","severity": "critical","description": "robots.txt became unreachable"}
Example 2 — New global disallow added
User-agent: *
Disallow: /private/
{"type": "disallow_added","severity": "critical","description": "Disallow added for *: /private/"}
Example 3 — Crawl-delay changed
{"type": "crawl_delay_changed","severity": "warning","description": "Crawl-delay changed for googlebot"}
Example 4 — Sitemap removed
{"type": "sitemap_removed","severity": "warning","description": "Sitemap removed: https://example.com/sitemap.xml"}
Example 5 — robots.txt formatting-only change
{"type": "formatting_only","severity": "info","description": "Formatting-only changes detected"}
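As a rough illustration of how a consumer might work with these events, the sketch below parses the five example payloads and tallies them by severity. The helper name `summarize` is hypothetical, and whether the Actor computes its own summary counts this way is an assumption.

```python
import json
from collections import Counter

# The five example events shown above, as raw JSON strings.
RAW_EVENTS = [
    '{"type": "robots_txt_unreachable","severity": "critical","description": "robots.txt became unreachable"}',
    '{"type": "disallow_added","severity": "critical","description": "Disallow added for *: /private/"}',
    '{"type": "crawl_delay_changed","severity": "warning","description": "Crawl-delay changed for googlebot"}',
    '{"type": "sitemap_removed","severity": "warning","description": "Sitemap removed: https://example.com/sitemap.xml"}',
    '{"type": "formatting_only","severity": "info","description": "Formatting-only changes detected"}',
]


def summarize(changes):
    """Count events per severity level, always reporting all three levels."""
    counts = Counter(change["severity"] for change in changes)
    return {level: counts.get(level, 0) for level in ("critical", "warning", "info")}


changes = [json.loads(raw) for raw in RAW_EVENTS]
summary = summarize(changes)
print(summary)  # {'critical': 2, 'warning': 2, 'info': 1}
```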
First Run (Baseline)
On the first execution:
- robots.txt is fetched
- A normalized snapshot is stored
- No diff or alerts are emitted
- unchanged is null
This behavior is intentional. Monitoring begins on the second run onward.
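The baseline behavior can be pictured with a small stateful sketch. Everything here is hypothetical: `InMemoryStore` stands in for a real key-value store, `run_once` is an illustrative name, and treating `baseline` as a boolean flag is an assumption. The sketch also keeps the first snapshot as the comparison point, without claiming that this is how the Actor manages its baseline.

```python
from typing import Any, Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for a persistent key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


def run_once(store: InMemoryStore, site: str, snapshot: Any) -> Dict[str, Any]:
    """Store a baseline on the first run; compare against it afterwards."""
    previous = store.get(site)
    if previous is None:
        # First run: nothing to compare against, so no diff is produced.
        store.set(site, snapshot)
        return {"baseline": True, "unchanged": None}
    # Later runs: report whether the snapshot still matches the baseline.
    return {"baseline": False, "unchanged": previous == snapshot}


store = InMemoryStore()
first = run_once(store, "example.com", {"*": ["/private/"]})
second = run_once(store, "example.com", {"*": ["/private/"]})
third = run_once(store, "example.com", {"*": []})
```

Here `first` reports `unchanged` as `None`, `second` reports `True`, and `third` reports `False`.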
Output Contract
Each run produces:
- One snapshot stored in a KV store (per monitored site)
- One dataset row summarizing the run
- A structured OUTPUT object containing:
  - baseline
  - unchanged
  - summary (critical / warning / info counts)
  - changes[]
This makes the Actor safe for:
- Scheduling
- Webhooks
- Alert automation
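A downstream consumer might validate the OUTPUT object before acting on it. The sketch below is one way to do that. The four field names come from the contract above, but the value types (for example, `baseline` as a boolean) and the names `RunOutput` and `parse_output` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

REQUIRED_KEYS = ("baseline", "unchanged", "summary", "changes")
SEVERITIES = ("critical", "warning", "info")


@dataclass
class RunOutput:
    baseline: Any                   # exact type is not stated in the contract
    unchanged: Optional[bool]       # null (None) on the first run
    summary: Dict[str, int]
    changes: List[Dict[str, Any]]


def parse_output(raw: Dict[str, Any]) -> RunOutput:
    """Check that all contract fields exist and fill in missing counts."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"OUTPUT is missing keys: {missing}")
    summary = {level: int(raw["summary"].get(level, 0)) for level in SEVERITIES}
    return RunOutput(raw["baseline"], raw["unchanged"], summary, list(raw["changes"]))


# Hypothetical payload shaped after the fields listed above.
sample = {
    "baseline": False,
    "unchanged": False,
    "summary": {"critical": 1},
    "changes": [{"type": "disallow_added", "severity": "critical",
                 "description": "Disallow added for *: /private/"}],
}
parsed = parse_output(sample)
```

Failing loudly on a missing key keeps a scheduler or webhook handler from silently treating a malformed run as "no changes".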
Fetch Failure Semantics
- httpStatus = 0 indicates a network error or timeout
- Fetch timeouts are treated as unreachable
- Output is still produced even on failure
- Snapshots are still stored for continuity
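One way to reproduce these semantics with the Python standard library is sketched below. The function name `fetch_robots`, the `reachable` and `body` fields, and the injectable `opener` parameter are all assumptions for illustration. Only the `httpStatus = 0` convention comes from the text above.

```python
import urllib.error
import urllib.request


def fetch_robots(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch robots.txt and always return a result, even on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return {"httpStatus": response.status, "reachable": True, "body": body}
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx / 5xx).
        return {"httpStatus": err.code, "reachable": False, "body": None}
    except OSError:
        # URLError and socket timeouts are OSError subclasses: DNS failure,
        # refused connection, or timeout. No HTTP status exists, so report 0.
        return {"httpStatus": 0, "reachable": False, "body": None}
```

Calling `fetch_robots("https://example.com/robots.txt")` performs a real request. Passing a custom `opener` makes the failure paths easy to exercise without network access.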
Deliberately Ignored Changes
The following do NOT trigger rule-level alerts:
- Comment-only changes
- Whitespace differences
- Line reordering
- Unknown or unsupported directives
These may still appear as formatting_only info events.
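A normalization step along the following lines would make comments, whitespace, line order within a group, and unknown directives invisible to the diff. This is a sketch under stated assumptions: the function names, the set of recognized directives, and the grouping logic are illustrative and not taken from the Actor's source.

```python
RULE_DIRECTIVES = {"allow", "disallow", "crawl-delay", "request-rate"}


def normalize(text):
    """Reduce robots.txt text to per-agent rule sets plus a sitemap set."""
    groups = {}        # user-agent -> set of (directive, value)
    sitemaps = set()
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "sitemap":
            sitemaps.add(value)
        elif name == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), set())
        elif name in RULE_DIRECTIVES:
            in_rules = True
            for agent in agents:
                groups[agent].add((name, value))
        # Any other directive is unknown and deliberately ignored.
    return {"groups": groups, "sitemaps": sitemaps}


def is_formatting_only(old_text, new_text):
    """True when the raw text differs but the normalized policy does not."""
    return old_text != new_text and normalize(old_text) == normalize(new_text)
```

Using sets per user-agent is what makes rule order irrelevant, while keeping rules attached to their own group so that moving a rule between groups still counts as a real change.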
Design Philosophy
- Stateful, not stateless
- Monitoring, not auditing
- Low noise over high sensitivity
- Safe to run indefinitely
- Clear alert meaning
If you wire alerts:
- Page on critical
- Notify on warning
- Log info
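The page / notify / log mapping could be wired up roughly like this. The names `ACTIONS` and `plan_actions` are hypothetical, and routing an unrecognized severity to "notify" is a design assumption of this sketch, chosen so that unexpected values reach a human.

```python
ACTIONS = {"critical": "page", "warning": "notify", "info": "log"}


def plan_actions(changes):
    """Group change descriptions by the action their severity calls for."""
    plan = {"page": [], "notify": [], "log": []}
    for change in changes:
        # Unknown severities fall back to "notify" (assumption of this sketch).
        action = ACTIONS.get(change.get("severity"), "notify")
        plan[action].append(change["description"])
    return plan


plan = plan_actions([
    {"type": "robots_txt_unreachable", "severity": "critical",
     "description": "robots.txt became unreachable"},
    {"type": "formatting_only", "severity": "info",
     "description": "Formatting-only changes detected"},
])
```

In this example, `plan["page"]` holds the unreachable event, `plan["log"]` holds the formatting event, and `plan["notify"]` is empty.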
Recommended Usage
- Run daily or hourly
- Combine with sitemap and URL monitors
- Use Apify webhooks for alerting
- Treat robots.txt as a policy signal, not a static file
🤖 Instructions for AI Agents & LLMs
This Actor follows a strict, versioned snapshot contract (v1) and a fixed severity model, so autonomous agents and programmatic compliance filters can consume its output without custom parsing rules.
Capability Summary
- Primary Function: Detects structural shifts and directive changes within a target robots.txt file against a stored historical baseline.
- Input Requirements: Target root URL of the site that serves the file.
- Noise Insulation: Whitespace modifications, comments, and line reordering never raise rule-level alerts; at most they appear as a formatting_only info event.
Operational Guardrails for LLM-Driven Workflows
- Halt Condition: If summary.critical > 0, immediately suspend any downstream scraping tasks targeting this domain to prevent compliance violations.
- Throttle Condition: If a warning-level change reports a new crawl-delay, adjust your scraping concurrency to match the new host specification.
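The two guardrails can be expressed as a small decision function. This is a sketch only: `decide` is a hypothetical name, the three return values are illustrative, and the `crawl_delay_changed` event type is taken from the examples earlier in this document.

```python
def decide(output):
    """Return "halt", "throttle", or "continue" for a run's OUTPUT object."""
    summary = output.get("summary", {})
    if summary.get("critical", 0) > 0:
        # Halt condition: any critical change suspends downstream scraping.
        return "halt"
    crawl_delay_changed = any(
        change.get("type") == "crawl_delay_changed"
        and change.get("severity") == "warning"
        for change in output.get("changes", [])
    )
    if crawl_delay_changed:
        # Throttle condition: re-read the crawl-delay before continuing.
        return "throttle"
    return "continue"
```

Checking the halt condition first means a run that contains both a critical change and a crawl-delay change still stops the crawl.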
Efficiency Profile
- Task Cost: Fixed $0.025 per snapshot run.
- Integration Lookup: Wire this Actor upstream of your web crawling or RAG data preparation nodes to use robots.txt as a real-time policy signal.
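For budgeting, the per-run figure translates into a monthly estimate as sketched below. The 30-day month, the one-run-per-site assumption, and the helper name `monthly_cost` are illustrative; actual billing follows the Actor's pay-per-event pricing.

```python
from decimal import Decimal

COST_PER_RUN = Decimal("0.025")  # USD per snapshot run, as quoted above


def monthly_cost(runs_per_day, sites=1, days=30):
    """Estimate cost assuming one snapshot run per site per scheduled run."""
    return COST_PER_RUN * runs_per_day * sites * days


daily_schedule = monthly_cost(1)    # one run per day for one site
hourly_schedule = monthly_cost(24)  # one run per hour for one site
```

Under these assumptions, a daily schedule comes to 0.75 USD per site per month and an hourly schedule to 18 USD.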