Robots.txt Audit Actor (AI Crawler Edition)

Audit robots.txt files for AI crawler management. Get an AI Readiness Score (0-100) for every domain, see which AI systems — ChatGPT, Claude, Perplexity, Gemini, and 60+ others — can access your content, detect syntax errors, flag security concerns, and get actionable recommendations to maximize your AI search visibility.

Why Use This Actor?

  • AI Readiness Score — A single 0-100 headline metric that instantly tells you how AI-discoverable your site is, with a letter grade (A-F) and detailed breakdown
  • 60+ AI crawlers tracked — The most comprehensive AI crawler database available, covering AI Agents, AI Assistants, AI Search, and AI Training bots
  • Instant visibility — Know in seconds if ChatGPT, Claude, Perplexity, Brave, or Gemini can cite your content
  • Subdomain scanning — Discover robots.txt blind spots on www, blog, shop, api, docs, and 8 other common subdomains
  • Batch auditing — Audit hundreds of domains in a single run with configurable concurrency
  • Actionable recommendations — Get prioritized fixes with ready-to-paste robots.txt snippets
  • Security scanning — Detect sensitive path disclosures that could aid attackers
  • Competitor comparison — See how your AI crawler strategy compares to competitors

Features

| Feature | Description |
| --- | --- |
| AI Readiness Score | 0-100 score with A-F grade measuring AI discoverability |
| AI Crawler Detection | Analyzes 60+ AI crawlers across AI Agent, AI Assistant, AI Search, AI Data Scraper, and AI Training categories |
| Subdomain Scanning | Checks 13 common subdomains for separate robots.txt files |
| Search Engine Analysis | Checks access for Googlebot, Bingbot, DuckDuckBot, Yandex, Baidu, and more |
| Syntax Validation | Detects typos, malformed lines, non-standard directives, and orphaned rules |
| Security Audit | Flags sensitive paths disclosed in robots.txt (admin panels, .env, .git, etc.) |
| Sitemap Validation | Verifies declared sitemaps and discovers undeclared sitemaps at common locations |
| Strategy Classification | Classifies your AI posture as open, restrictive, mixed, or undefined |
| Competitor Comparison | Side-by-side AI crawler strategy comparison across domains |
| Proxy Support | Optional Apify proxy integration for fetching from restricted networks |
| CSV Export | Flattened output for spreadsheet analysis |

AI Crawlers Tracked (60+)

AI Agents (5)

| Crawler | Company | Importance |
| --- | --- | --- |
| ChatGPT-Agent | OpenAI | High |
| GoogleAgent-Mariner | Google | Medium |
| NovaAct | Amazon | Medium |
| AmazonBuyForMe | Amazon | Low |
| Manus-User | Butterfly Effect | Low |

AI Assistants (10)

| Crawler | Company | Importance |
| --- | --- | --- |
| Gemini-Deep-Research | Google | High |
| Google-NotebookLM | Google | Medium |
| MistralAI-User | Mistral | Medium |
| PhindBot | Phind | Low |
| Amzn-User | Amazon | Low |
| kagi-fetcher | Kagi | Low |
| Ai2Bot-DeepResearchEval | AI2 | Low |
| Devin | Cognition | Low |
| TavilyBot | Tavily | Low |
| LinerBot | Liner | Low |

AI Search (20)

| Crawler | Company | Importance |
| --- | --- | --- |
| GPTBot | OpenAI | Critical |
| ChatGPT-User | OpenAI | Critical |
| OAI-SearchBot | OpenAI | Critical |
| ClaudeBot | Anthropic | High |
| Claude-SearchBot | Anthropic | High |
| Claude-User | Anthropic | High |
| Claude-Web | Anthropic | High |
| PerplexityBot | Perplexity AI | High |
| Perplexity-User | Perplexity AI | High |
| Bravebot | Brave | Medium |
| AzureAI-SearchBot | Microsoft | Medium |
| DuckAssistBot | DuckDuckGo | Medium |
| Amazonbot | Amazon | Medium |
| meta-webindexer | Meta | Medium |
| FacebookBot | Meta | Low |
| Amzn-SearchBot | Amazon | Low |
| YouBot | You.com | Low |
| PetalBot | Huawei | Low |
| Cloudflare-AutoRAG | Cloudflare | Low |
| AddSearchBot / Anomura / atlassian-bot / Channel3Bot / LinkupBot / ZanistaBot | Various | Low |

AI Data Scrapers (9)

| Crawler | Company | Importance |
| --- | --- | --- |
| GoogleOther | Google | Medium |
| imageSpider | ByteDance | Low |
| cohere-training-data-crawler | Cohere | Low |
| ChatGLM-Spider | Zhipu AI | Low |
| PanguBot | Huawei | Low |
| Timpibot | Timpi | Low |
| webzio-extended | Webz.io | Low |
| Kangaroo Bot | Kangaroo LLM | Low |
| VelenPublicWebCrawler | Velen/Hunter | Low |

AI Training (10)

| Crawler | Company | Importance |
| --- | --- | --- |
| Google-Extended | Google | High |
| anthropic-ai | Anthropic | High |
| Google-CloudVertexBot | Google | Medium |
| Applebot-Extended | Apple | Medium |
| CCBot | Common Crawl | Medium |
| Bytespider | ByteDance/TikTok | Low |
| meta-externalagent | Meta | Low |
| Meta-ExternalFetcher | Meta | Low |
| Diffbot | Diffbot | Low |
| Omgilibot / cohere-ai / AI2Bot | Various | Low |

Plus 7 search engine crawlers, 7 SEO tool crawlers, and 8 social media crawlers.

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| domains | string[] | Yes | — | Domains to audit (e.g., ["example.com", "github.com"]) |
| checkCompetitors | string[] | No | [] | Competitor domains for side-by-side comparison |
| includeSitemap | boolean | No | true | Validate sitemap URLs declared in robots.txt |
| scanSubdomains | boolean | No | false | Scan 13 common subdomains (www, blog, shop, api, dev, staging, app, docs, help, support, cdn, mail, admin) for separate robots.txt files |
| maxConcurrency | integer | No | 10 | Max domains to process in parallel (1-50) |
| outputFormat | string | No | "json" | Output format: json or csv |
| proxyConfig | object | No | Apify residential | Proxy configuration for fetching robots.txt files |

Example Input

{
  "domains": ["nytimes.com", "github.com", "shopify.com"],
  "checkCompetitors": ["medium.com"],
  "includeSitemap": true,
  "scanSubdomains": false,
  "maxConcurrency": 10,
  "outputFormat": "json"
}
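
If robots.txt fetches are blocked or rate-limited from your network, you can also pass a proxy configuration (the proxyConfig parameter above). The exact object shape is defined by this actor's input schema; the example below assumes the standard Apify proxy-configuration object and is illustrative only:

{
  "domains": ["example.com"],
  "proxyConfig": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}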

Output Schema

Each domain produces a comprehensive audit report:

{
  "summary": {
    "domain": "nytimes.com",
    "robotsTxtExists": true,
    "robotsTxtUrl": "https://nytimes.com/robots.txt",
    "httpStatus": 200,
    "lastModified": "2026-01-15T00:00:00Z",
    "syntaxValid": true,
    "totalRules": 12,
    "responseTimeMs": 245,
    "aiCrawlerStatus": {
      "allowedAI": 5,
      "blockedAI": 8,
      "unspecifiedAI": 48,
      "defaultPolicy": "allow"
    },
    "aiReadinessScore": 72,
    "aiReadinessGrade": "B"
  },
  "aiReadinessScore": {
    "score": 72,
    "grade": "B",
    "breakdown": [
      { "factor": "Critical AI crawlers", "points": 35, "maxPoints": 35, "detail": "GPTBot: allowed (+12); ChatGPT-User: allowed (+12); OAI-SearchBot: allowed (+12)" },
      { "factor": "High-importance AI crawlers", "points": 15, "maxPoints": 20, "detail": "ClaudeBot: allowed (+5); Claude-SearchBot: allowed (+5); PerplexityBot: allowed (+5); Perplexity-User: BLOCKED (+0)" },
      { "factor": "Sitemap declared", "points": 10, "maxPoints": 10, "detail": "2 sitemap(s) declared and accessible" },
      { "factor": "Syntax quality", "points": 7, "maxPoints": 10, "detail": "1 high-severity syntax issue(s) found" },
      { "factor": "Search engines accessible", "points": 10, "maxPoints": 10, "detail": "All critical search engines allowed" },
      { "factor": "Wildcard policy", "points": 10, "maxPoints": 10, "detail": "Wildcard allows crawling (or no wildcard rule)" },
      { "factor": "Security posture", "points": 5, "maxPoints": 5, "detail": "No high-severity security path disclosures" }
    ]
  },
  "aiCrawlers": {
    "allowed": [
      {
        "bot": "GPTBot",
        "company": "OpenAI",
        "purpose": "ChatGPT responses & training",
        "status": "explicitly_allowed",
        "impact": "Your content CAN be accessed by OpenAI."
      }
    ],
    "blocked": [
      {
        "bot": "CCBot",
        "company": "Common Crawl",
        "status": "explicitly_blocked",
        "impact": "Your content CANNOT be accessed by Common Crawl."
      }
    ],
    "unspecified": [
      {
        "bot": "Bravebot",
        "company": "Brave",
        "status": "not_mentioned",
        "defaultBehavior": "allowed"
      }
    ]
  },
  "searchEngines": {
    "Googlebot": { "status": "allowed", "restrictions": [], "crawlDelay": null },
    "Bingbot": { "status": "partially_restricted", "restrictions": ["/admin"], "crawlDelay": null }
  },
  "syntaxIssues": [],
  "securityConcerns": [],
  "sitemaps": [
    { "url": "https://nytimes.com/sitemap.xml", "accessible": true, "httpStatus": 200, "source": "robots_txt" }
  ],
  "recommendations": [
    {
      "priority": "high",
      "category": "ai_visibility",
      "title": "Unblock Perplexity-User for AI search visibility",
      "description": "Perplexity-User is blocked.",
      "implementation": "User-agent: Perplexity-User\nAllow: /"
    }
  ],
  "aiStrategyInsights": {
    "currentPosture": "mixed",
    "description": "Mixed strategy: Allowing AI search but blocking AI training.",
    "suggestedStrategy": "Good approach. Review periodically as new AI crawlers emerge."
  },
  "competitorComparison": [],
  "subdomainAudits": []
}

AI Readiness Score

The AI Readiness Score is a 0-100 metric that measures how discoverable your site is to AI systems. It provides a single, comparable number with a letter grade (A-F) and a detailed breakdown showing exactly where points are gained or lost.

Scoring Breakdown

| Factor | Max Points | How It Works |
| --- | --- | --- |
| Critical AI crawlers allowed | 35 | +12 each for GPTBot, ChatGPT-User, OAI-SearchBot |
| High-importance AI crawlers | 20 | +5 each for ClaudeBot, Claude-SearchBot, PerplexityBot, Perplexity-User |
| Sitemap declared | 10 | 10 if declared in robots.txt, 5 if found at /sitemap.xml, 0 if none |
| No syntax errors | 10 | 10 if clean, -3 per high-severity issue |
| Search engines not blocked | 10 | 10 if Googlebot + Bingbot allowed, -5 per blocked |
| No wildcard Disallow: / | 10 | 10 if wildcard allows, 0 if wildcard blocks root |
| No security path disclosures | 5 | 5 minus 1 per high-severity concern |

Grades

| Grade | Score Range | Meaning |
| --- | --- | --- |
| A | 90-100 | Excellent AI discoverability |
| B | 75-89 | Good, minor improvements possible |
| C | 55-74 | Fair, significant gaps in AI access |
| D | 35-54 | Poor, most AI crawlers blocked or no robots.txt |
| F | 0-34 | Critical, site is largely invisible to AI |

Special Cases

  • No robots.txt (404): Score = 50 (D) — all crawlers allowed by default but no intentional control
  • Access restricted (403/401): Score = 40 (D) — crawlers default to allow-all but situation is abnormal
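
Putting the factor table, grade bands, and special cases above together, the following JavaScript sketch shows how a score and grade could be derived from the breakdown array in the output. It is illustrative only, not the actor's exact implementation, and the clamping of the total to 0-100 is an assumption:

// Map a 0-100 score to a letter grade using the bands in the Grades table.
function scoreToGrade(score) {
  if (score >= 90) return 'A';
  if (score >= 75) return 'B';
  if (score >= 55) return 'C';
  if (score >= 35) return 'D';
  return 'F';
}

// Total the awarded points from the "breakdown" array (see Output Schema above).
function aiReadiness(breakdown, httpStatus) {
  // Special cases from the docs: missing robots.txt scores 50, restricted access scores 40.
  if (httpStatus === 404) return { score: 50, grade: scoreToGrade(50) };
  if (httpStatus === 403 || httpStatus === 401) return { score: 40, grade: scoreToGrade(40) };

  const total = breakdown.reduce((sum, factor) => sum + factor.points, 0);
  const score = Math.max(0, Math.min(100, total)); // clamping is an assumption
  return { score, grade: scoreToGrade(score) };
}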

Example Use Cases

1. Agency Client Audit

Audit all client domains for AI visibility issues:

{
  "domains": [
    "client1.com",
    "client2.com",
    "client3.com"
  ],
  "includeSitemap": true,
  "maxConcurrency": 10
}

2. Competitive Analysis

Compare your AI crawler strategy against competitors:

{
  "domains": ["mycompany.com"],
  "checkCompetitors": ["competitor1.com", "competitor2.com", "competitor3.com"],
  "includeSitemap": false
}

3. Enterprise Bulk Scan

Scan hundreds of domains for a portfolio audit:

{
  "domains": ["site1.com", "site2.com", "...up to 1000 domains"],
  "maxConcurrency": 25,
  "outputFormat": "csv"
}

4. Subdomain Audit

Scan a domain and its common subdomains (www, blog, shop, api, docs, etc.):

{
  "domains": ["example.com"],
  "scanSubdomains": true,
  "includeSitemap": true
}

5. Quick Single-Domain Check

Fast check for one domain:

{
  "domains": ["example.com"]
}

Output Formats

JSON (Default)

Standard Apify dataset output. Each domain produces one JSON object in the dataset. Use the Overview dataset view for a quick summary table including the AI Readiness Score.

CSV

Flattened audit data saved to the key-value store as output-csv. Ideal for importing into Google Sheets or Excel for reporting.
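
To pull the CSV into your own tooling, you can read the output-csv record from the run's default key-value store with the Apify JavaScript client. This is a hedged sketch: the record key output-csv comes from this README, while the client calls and run-ID placeholder are standard apify-client usage rather than anything specific to this actor:

import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Look up the finished run (YOUR_RUN_ID is a placeholder) and its default key-value store.
const run = await client.run('YOUR_RUN_ID').get();
const record = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('output-csv');

// record.value holds the CSV text; save it for import into Google Sheets or Excel.
writeFileSync('robots-audit.csv', record.value);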

How It Works

  1. Fetch — Retrieves /robots.txt via HTTPS (falls back to HTTP). Handles 404, 403, 5xx, redirects, and HTML catch-all responses. Retries transient failures.
  2. Parse — Extracts User-agent groups, Allow/Disallow rules, Crawl-delay, Sitemap declarations, and Host directives (a simplified parsing sketch follows this list).
  3. Categorize — Maps every user-agent to our database of 80+ known crawlers across AI, search engines, SEO tools, and social media bots.
  4. Analyze — Determines access status for each of 60+ AI crawlers and 7 search engines. Identifies security concerns from sensitive path disclosures.
  5. Validate — Checks syntax for typos, malformed lines, orphaned directives, and non-standard extensions.
  6. Recommend — Generates prioritized, actionable recommendations with copy-paste robots.txt snippets.
  7. Score — Calculates the AI Readiness Score (0-100) with a detailed factor-by-factor breakdown.
  8. Subdomain Scan — Optionally checks 13 common subdomains for separate robots.txt files and runs the full audit pipeline on each.
  9. Compare — If competitors are provided, adds side-by-side AI strategy comparison.
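
As referenced in step 2, here is a simplified JavaScript sketch of the parsing stage. It is an illustration only, not the actor's actual parser, which also handles Crawl-delay, Host, typo detection, and orphaned-rule reporting:

// Group Allow/Disallow rules under their User-agent lines and collect Sitemap URLs.
function parseRobotsTxt(text) {
  const groups = [];   // [{ userAgents: [...], rules: [{ directive, path }] }]
  const sitemaps = [];
  let current = null;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim();   // strip comments and whitespace
    if (!line) continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;                    // malformed line (flagged by syntax validation)
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one rule group.
      if (!current || current.rules.length > 0) {
        current = { userAgents: [], rules: [] };
        groups.push(current);
      }
      current.userAgents.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      // An Allow/Disallow before any User-agent would be an orphaned rule.
      if (current) current.rules.push({ directive: field, path: value });
    } else if (field === 'sitemap') {
      sitemaps.push(value);                      // Sitemap declarations apply globally
    }
  }
  return { groups, sitemaps };
}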

Troubleshooting

Empty results or fetch errors

  • DNS failures: The domain may not exist or may be unreachable. Check the interpretation field in the output for details.
  • Timeouts: Some servers are slow. The actor retries transient failures automatically (2 retries with exponential backoff).
  • No robots.txt (404): This is a valid result — it means all crawlers are allowed by default. The actor still generates recommendations and an AI Readiness Score of 50.

All AI crawlers show as "unspecified"

This means the robots.txt has no AI-specific rules. All AI crawlers fall through to the wildcard (*) or default allow policy. The recommendations will suggest adding explicit rules.
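
If you want explicit, intentional rules instead of relying on the wildcard fall-through, a starting point looks like the snippet below. The user-agent names come from the tables above; which bots you allow or block is entirely a policy choice, so treat this as an illustration rather than a recommendation:

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Disallow: /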

Running locally

Install dependencies and run:

cd actors/robots-txt-audit
npm install
echo '{"domains":["example.com"]}' | npx apify-cli run --purge

Or run tests:

npm test

Limitations

  • robots.txt only — This actor analyzes /robots.txt files. It does not crawl pages, check meta robots tags, or verify X-Robots-Tag headers.
  • Static analysis — The audit reflects the robots.txt content at fetch time. It does not monitor changes over time (planned for V2).
  • No Wayback Machine integration — Historical robots.txt analysis is planned for V2.
  • Standard compliance — The parser follows the Google robots.txt specification. Non-standard extensions (Crawl-delay, Host) are detected but flagged as non-standard.
  • Response size limit — robots.txt files larger than 1 MB are skipped to prevent memory issues (this is extremely rare).

FAQ

Q: How long does a typical run take? A: About 5-10 seconds for 10 domains, and about 30-60 seconds for 100 domains at concurrency 10. Runs are fast because the actor only fetches small text files; enabling subdomain scanning increases run time proportionally.

Q: Do I need proxies? A: Usually not — robots.txt files are public and lightweight. Proxies are available if you're scanning from a datacenter IP that gets rate-limited.

Q: What if a domain has no robots.txt? A: The actor reports robotsTxtExists: false with the interpretation "all crawlers are allowed by default", generates a recommendation to create one, and assigns an AI Readiness Score of 50 (Grade D).

Q: How often is the AI crawler database updated? A: We track 60+ AI crawlers as of February 2026, sourced from Dark Visitors and our own research. The database is updated with each actor release as new AI crawlers emerge.

Q: What is the AI Readiness Score? A: It's a 0-100 metric that measures how AI-discoverable your site is. A score of 90+ (Grade A) means your site is fully optimized for AI search engines. Below 35 (Grade F) means most AI systems can't access your content.

Q: Can I use this to generate a robots.txt file? A: The recommendations include ready-to-paste robots.txt snippets. A full robots.txt generator is planned for V2.

Changelog

v1.1.0 (February 2026)

  • AI Readiness Score: New 0-100 headline metric with A-F grades and detailed 7-factor breakdown
  • 60+ AI crawlers: Expanded from 26 to 61 AI crawlers, now covering AI Agents (ChatGPT-Agent, NovaAct, Manus), AI Assistants (Gemini-Deep-Research, Devin, MistralAI-User), AI Data Scrapers (GoogleOther, ChatGLM-Spider), and AI Search (Bravebot, AzureAI-SearchBot, meta-webindexer)
  • Subdomain scanning: Optionally scan 13 common subdomains for separate robots.txt files
  • Expanded non-AI crawlers: Added rogerbot, Screaming Frog, SiteAuditBot (SEO) and Discordbot, WhatsApp, TelegramBot, Pinterestbot (social)
  • 267 unit tests

v1.0.1 (February 2026)

  • Added 7 AI crawlers (ClaudeBot, Claude-SearchBot, Claude-User, Google-CloudVertexBot, DuckAssistBot, Perplexity-User, Meta-ExternalFetcher)
  • 403/401 HTTP status handling with specific recommendations
  • Sitemap auto-discovery at /sitemap.xml when not declared in robots.txt
  • Crawl-delay detection and wildcard crawl-delay warnings
  • Output deduplication (removed knownBots redundancy, cleaned up wildcard-inherited search engine restrictions)
  • 218 unit tests

v1.0.0 (February 2026)

  • Initial public release
  • 19 AI crawlers tracked
  • 7 search engine crawlers, 4 SEO tool crawlers, 3 social media crawlers
  • Syntax validation with typo detection
  • Security concern identification
  • Prioritized recommendations with implementation snippets
  • Competitor comparison
  • Sitemap accessibility validation
  • CSV export
  • Proxy support via Apify proxy configuration
  • Retry logic with exponential backoff
  • 184 unit tests

Support

  • Issues: Report bugs via GitHub issues or the Apify community forum
  • Feature requests: Contact us through Apify or open a GitHub issue
  • Enterprise: For bulk scanning (10K+ domains/month), reach out for custom pricing

Built by A Page Ventures | Apify Store