🤖 robots.txt AI Bot Checker
Pricing
from $11.00 / 1,000 results
Audit robots.txt files across thousands of websites to detect specific crawl policies, disallowed paths, and user-agents for GPTBot and ClaudeBot.
robots.txt AI Checker | GPTBot, ClaudeBot & AI Crawl Rules
Track how publishers and other websites handle AI crawlers, LLM training bots, and search indexing tools with this specialized web scraper. As data collection for generative AI models becomes increasingly contentious, tracking policy shifts across target domains is critical for maintaining compliance and understanding the evolving web ecosystem. This tool automatically fetches and parses robots.txt files, extracting detailed bot policies for agents such as GPTBot, ClaudeBot, and Google-Extended.

Use it to audit thousands of URLs effortlessly, replacing manual website checks with automated, scheduled runs. Set up daily or weekly monitoring workflows to detect immediately when a domain updates its scraping rules or imposes new restrictions on AI data collection.

The system extracts structured data directly from the raw text, returning concrete details about which bots are explicitly allowed or disallowed. Output fields include the exact user-agent string, the restricted directory paths, and crawl-delay directives. By identifying exactly what changed since your last run, you can build web datasets with confidence, respect publisher boundaries, and integrate compliance checks directly into your broader data engineering pipelines.
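To get a feel for the rules this actor evaluates, here is a minimal local sketch using Python's standard-library `urllib.robotparser` to check a GPTBot directive against a sample robots.txt. The rules and URLs below are made up for illustration; the actor itself does the fetching and parsing for you at scale.

```python
from urllib.robotparser import RobotFileParser

# A toy robots.txt: GPTBot is kept out of /private/, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/public"))        # True
```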
Store Quickstart
- Start with `store-input.example.json`. It uses `demoMode=true`, so the first Store run is safe, cheap, and easy to understand.
- If the compact output is useful, switch to `store-input.templates.json` and pick one of:
  - **Demo Quickstart** for a trial run
  - **Production Monitor** for recurring dataset snapshots
  - **Webhook Alert** for policy-change notifications
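A minimal demo-mode input along those lines might look like the following (field names follow the Input table below; `example.com` is a placeholder):

```json
{
  "domains": ["example.com"],
  "demoMode": true,
  "dryRun": true
}
```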
Key Features
- 🛡️ Compliance-first — Produces audit-ready reports mapping findings to standards (WCAG, GDPR, SOC2)
- 🔒 Non-invasive scanning — Uses only observable public signals — no intrusive probing
- 📊 Severity-scored output — Each finding rated for criticality with remediation guidance
- 📡 Delta-alerting — Flag new findings since last run via webhook delivery
- 📋 Evidence export — Raw headers/responses captured for compliance documentation
Use Cases
| Who | Why |
|---|---|
| Developers | Automate recurring data fetches without building custom scrapers |
| Data teams | Pipe structured output into analytics warehouses |
| Ops teams | Monitor changes via webhook alerts |
| Product managers | Track competitor/market signals without engineering time |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| domains | array | prefilled | List of domains to analyze robots.txt for AI crawler policies. Max 500. |
| delivery | string | "dataset" | How to deliver results. 'dataset' saves to an Apify Dataset, 'webhook' sends to a URL. In demoMode, delivery is always 'dataset'. |
| webhookUrl | string | — | Webhook URL to send results to (only used when delivery is 'webhook'). Works with Slack, Discord, or any HTTP endpoint. |
| snapshotKey | string | "robotstxt-snapshots" | Key name for storing snapshots (used for change detection between runs). |
| concurrency | integer | 5 | Maximum number of parallel requests. Higher = faster but may trigger rate limits. |
| dryRun | boolean | false | If true, runs without saving results or sending webhooks. Useful for testing. |
| demoMode | boolean | false | If true, checks only 1 domain, returns compact policy fields, and disables webhook/snapshot writes. |
Input Example
```json
{
  "domains": ["google.com", "github.com", "nytimes.com", "openai.com"],
  "delivery": "dataset",
  "snapshotKey": "robotstxt-snapshots",
  "concurrency": 5,
  "dryRun": false,
  "demoMode": false
}
```
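If you choose `delivery: "webhook"`, you need an HTTP endpoint to receive the POST. Below is a minimal stdlib sketch; the payload fields (`results`, `domain`, `aiPolicies`, `blocked`) are assumed from this actor's output example, not a formal schema, so adapt them to what your runs actually deliver.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize_payload(payload):
    """Count checked domains and list those with any fully blocked AI crawler.

    Assumes each result row carries `domain` and `aiPolicies` with a
    `blocked` flag, as in the output example; the real payload may differ.
    """
    results = payload.get("results", [])
    blocked = [
        r["domain"]
        for r in results
        if any(p.get("blocked") for p in r.get("aiPolicies", []))
    ]
    return {"checked": len(results), "blockedDomains": blocked}

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(summarize_payload(payload))  # swap in your own alerting logic
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Point the actor's webhookUrl at http://<your-host>:8080/
    HTTPServer(("", 8080), WebhookHandler).serve_forever()
```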
Output
| Field | Type | Description |
|---|---|---|
| meta | object | Run-level metadata: generation time, totals, and demo-mode limits. |
| results | array | One entry per checked domain. |
| results[].domain | string | Domain that was checked. |
| results[].status | string | Check outcome (e.g. "ok"). |
| results[].summary | object | Counts of blocked, partially blocked, allowed, and changed crawlers. |
| results[].aiPolicies | array | Per-crawler policy details (crawler, company, blocked/partial/allowed flags). |
| results[].changes | array | Policy changes detected since the previous snapshot. |
| results[].checkedAt | timestamp | When the domain was checked. |
| results[].demoApplied | boolean | Whether demo-mode limits were applied to this result. |
| results[].detailsMasked | boolean | Whether full policy details were masked (demo mode). |
| results[].error | string \| null | Error message if the check failed; null on success. |
Output Example
```json
{
  "meta": {
    "generatedAt": "2026-02-22T17:50:20.909Z",
    "totals": {
      "total": 1,
      "requestedDomains": 2,
      "processedDomains": 1,
      "withRobotsTxt": 1,
      "noRobotsTxt": 0,
      "invalidDomains": 0,
      "blockingAi": 0,
      "errors": 0
    },
    "demoApplied": true,
    "limits": {
      "maxDomains": 1,
      "compactPolicies": true,
      "webhookEnabled": false,
      "snapshotWriteEnabled": false
    },
    "upgradeHint": "Demo mode checks 1 domain, disables webhook delivery, and returns a compact policy view. Set demoMode=false to unlock bulk checks and full policy details."
  },
  "results": [
    {
      "domain": "openai.com",
      "status": "ok",
      "summary": {
        "totalCrawlers": 16,
        "blocked": 0,
        "partialBlock": 16,
        "allowed": 0,
        "changed": 0
      },
      "aiPolicies": [
        {
          "crawler": "GPTBot",
          "company": "OpenAI",
          "blocked": false,
          "partialBlock": true,
          "allowed": false
        }
      ]
    }
  ]
}
```
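Once a run finishes, the `aiPolicies` rows can be folded into a per-crawler label for reporting. A small sketch, assuming the field names shown in the output example above:

```python
def classify_policies(result):
    """Map each AI crawler in one result row to a policy label.

    Field names (`aiPolicies`, `crawler`, `blocked`, `partialBlock`)
    follow the output example; treat them as illustrative.
    """
    labels = {}
    for policy in result.get("aiPolicies", []):
        if policy.get("blocked"):
            labels[policy["crawler"]] = "blocked"
        elif policy.get("partialBlock"):
            labels[policy["crawler"]] = "partial"
        else:
            labels[policy["crawler"]] = "allowed"
    return labels

# Applied to the example row above, GPTBot comes out as "partial".
print(classify_policies({
    "aiPolicies": [
        {"crawler": "GPTBot", "blocked": False, "partialBlock": True, "allowed": False}
    ]
}))  # {'GPTBot': 'partial'}
```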
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console → Settings → Integrations.
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~robotstxt-ai-checker/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "domains": ["google.com", "github.com", "nytimes.com", "openai.com"],
    "delivery": "dataset",
    "snapshotKey": "robotstxt-snapshots",
    "concurrency": 5,
    "dryRun": false,
    "demoMode": false
  }'
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("taroyamada/robotstxt-ai-checker").call(run_input={
    "domains": ["google.com", "github.com", "nytimes.com", "openai.com"],
    "delivery": "dataset",
    "snapshotKey": "robotstxt-snapshots",
    "concurrency": 5,
    "dryRun": False,
    "demoMode": False,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
JavaScript / Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/robotstxt-ai-checker').call({
    domains: ['google.com', 'github.com', 'nytimes.com', 'openai.com'],
    delivery: 'dataset',
    snapshotKey: 'robotstxt-snapshots',
    concurrency: 5,
    dryRun: false,
    demoMode: false,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Tips & Limitations
- Schedule weekly runs against your production domains to catch config drift.
- Use webhook delivery to pipe findings into your SIEM (Splunk, Datadog, Elastic).
- For CI integration, block releases on `critical`-severity findings using exit codes.
- Combine with `ssl-certificate-monitor` for layered cert + headers coverage.
- Findings include links to official remediation docs — share with dev teams via the webhook payload.
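The CI tip above can be sketched as a small gate script. The `changes` and `aiPolicies` field names mirror this actor's output example; tighten or loosen the failing condition to match your own release policy.

```python
import json
import sys

def gate(items):
    """Return 1 if any domain has detected changes or a fully blocked
    AI crawler, else 0. Field names are assumed from the output example."""
    failing = [
        item["domain"]
        for item in items
        if item.get("changes")
        or any(p.get("blocked") for p in item.get("aiPolicies", []))
    ]
    if failing:
        print("Blocking release; policy findings on:", ", ".join(failing))
        return 1
    return 0

if __name__ == "__main__":
    # Pipe the dataset items (a JSON array) into this script in CI.
    sys.exit(gate(json.load(sys.stdin)))
```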
FAQ
Is running this against a third-party site legal?
Passive public-header scanning is generally permitted, but follow your own compliance policies. Only scan sites you have authorization for.
How often should I scan?
Weekly for production domains; daily if you have high config-change velocity.
Can I export to a compliance tool?
Use webhook delivery or Dataset API — formats map well to Drata, Vanta, OneTrust import templates.
Is this a penetration test?
No — this actor performs passive compliance scanning only. No exploitation, fuzzing, or auth bypass.
Does this qualify as a SOC2 control?
This actor produces evidence artifacts suitable for SOC2 CC7.1 (continuous monitoring). It is not itself a SOC2 certification.
Related Actors
Security & Compliance cluster — explore related Apify tools:
- Privacy & Cookie Compliance Scanner | GDPR / CCPA Banner Audit — Scan public privacy pages and cookie banners for GDPR/CCPA compliance signals.
- Security Headers Checker API | OWASP Audit — Bulk-audit websites for OWASP security headers, grade each response, and monitor header changes between runs.
- SSL Certificate Monitor API | Expiry + Issuer Changes — Check SSL/TLS certificates in bulk, detect expiry and issuer changes, and emit alert-ready rows for ops and SEO teams.
- DNS / SPF / DKIM / DMARC Audit API — Bulk-audit domains for SPF, DKIM, DMARC, MX, and email-auth posture with grades and fix-ready recommendations.
- Data Breach Disclosure Monitor | HIPAA Breach Watch — Monitor the HHS OCR Breach Portal for new HIPAA data breach disclosures.
- WCAG Accessibility Checker API | ADA & EAA Compliance Audit — Audit websites for WCAG 2.
- 📜 Open-Source License & Dependency Audit API — Audit npm packages for license risk, dependency depth, maintainer activity, and compliance posture.
- Trust Center & Subprocessor Monitor API — Monitor vendor trust centers, subprocessor lists, DPA updates, and security posture changes.
Cost
Pay Per Event:
- `actor-start`: $0.01 (flat fee per run)
- `dataset-item`: $0.003 per output item
Example: 1,000 items = $0.01 + (1,000 × $0.003) = $3.01
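The arithmetic above can be checked with a one-liner; the rates are taken from the event prices listed here.

```python
def run_cost(items, actor_start=0.01, per_item=0.003):
    # Pay-per-event total: flat start fee plus a per-item charge.
    return round(actor_start + items * per_item, 2)

print(run_cost(1000))  # 3.01
```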
No subscription required — you only pay for what you use.
⭐ Was this helpful?
If this actor saved you time, please leave a ★ rating on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.
Bug report or feature request? Open an issue on the Issues tab of this actor.