Robots.txt Audit Actor (AI Crawler Edition)
Audit robots.txt files for AI crawler management. Get an AI Readiness Score (0-100) for every domain, see which AI systems — ChatGPT, Claude, Perplexity, Gemini, and 60+ others — can access your content, detect syntax errors, flag security concerns, and get actionable recommendations to maximize your AI search visibility.
Why Use This Actor?
- AI Readiness Score — A single 0-100 headline metric that instantly tells you how AI-discoverable your site is, with a letter grade (A-F) and detailed breakdown
- 60+ AI crawlers tracked — The most comprehensive AI crawler database available, covering AI Agents, AI Assistants, AI Search, and AI Training bots
- Instant visibility — Know in seconds if ChatGPT, Claude, Perplexity, Brave, or Gemini can cite your content
- Subdomain scanning — Discover robots.txt blind spots on www, blog, shop, api, docs, and 8 other common subdomains
- Batch auditing — Audit hundreds of domains in a single run with configurable concurrency
- Actionable recommendations — Get prioritized fixes with ready-to-paste robots.txt snippets
- Security scanning — Detect sensitive path disclosures that actually help attackers
- Competitor comparison — See how your AI crawler strategy compares to competitors
Features
| Feature | Description |
|---|---|
| AI Readiness Score | 0-100 score with A-F grade measuring AI discoverability |
| AI Crawler Detection | Analyzes 60+ AI crawlers across AI Agent, AI Assistant, AI Search, AI Data Scraper, and AI Training categories |
| Subdomain Scanning | Checks 13 common subdomains for separate robots.txt files |
| Search Engine Analysis | Checks access for Googlebot, Bingbot, DuckDuckBot, Yandex, Baidu, and more |
| Syntax Validation | Detects typos, malformed lines, non-standard directives, and orphaned rules |
| Security Audit | Flags sensitive paths disclosed in robots.txt (admin panels, .env, .git, etc.) |
| Sitemap Validation | Verifies declared sitemaps and discovers undeclared sitemaps at common locations |
| Strategy Classification | Classifies your AI posture as open, restrictive, mixed, or undefined |
| Competitor Comparison | Side-by-side AI crawler strategy comparison across domains |
| Proxy Support | Optional Apify proxy integration for fetching from restricted networks |
| CSV Export | Flattened output for spreadsheet analysis |
AI Crawlers Tracked (60+)
AI Agents (5)
| Crawler | Company | Importance |
|---|---|---|
| ChatGPT-Agent | OpenAI | High |
| GoogleAgent-Mariner | Google | Medium |
| NovaAct | Amazon | Medium |
| AmazonBuyForMe | Amazon | Low |
| Manus-User | Butterfly Effect | Low |
AI Assistants (10)
| Crawler | Company | Importance |
|---|---|---|
| Gemini-Deep-Research | Google | High |
| Google-NotebookLM | Google | Medium |
| MistralAI-User | Mistral | Medium |
| PhindBot | Phind | Low |
| Amzn-User | Amazon | Low |
| kagi-fetcher | Kagi | Low |
| Ai2Bot-DeepResearchEval | AI2 | Low |
| Devin | Cognition | Low |
| TavilyBot | Tavily | Low |
| LinerBot | Liner | Low |
AI Search (20)
| Crawler | Company | Importance |
|---|---|---|
| GPTBot | OpenAI | Critical |
| ChatGPT-User | OpenAI | Critical |
| OAI-SearchBot | OpenAI | Critical |
| ClaudeBot | Anthropic | High |
| Claude-SearchBot | Anthropic | High |
| Claude-User | Anthropic | High |
| Claude-Web | Anthropic | High |
| PerplexityBot | Perplexity AI | High |
| Perplexity-User | Perplexity AI | High |
| Bravebot | Brave | Medium |
| AzureAI-SearchBot | Microsoft | Medium |
| DuckAssistBot | DuckDuckGo | Medium |
| Amazonbot | Amazon | Medium |
| meta-webindexer | Meta | Medium |
| FacebookBot | Meta | Low |
| Amzn-SearchBot | Amazon | Low |
| YouBot | You.com | Low |
| PetalBot | Huawei | Low |
| Cloudflare-AutoRAG | Cloudflare | Low |
| AddSearchBot / Anomura / atlassian-bot / Channel3Bot / LinkupBot / ZanistaBot | Various | Low |
AI Data Scrapers (9)
| Crawler | Company | Importance |
|---|---|---|
| GoogleOther | Google | Medium |
| imageSpider | ByteDance | Low |
| cohere-training-data-crawler | Cohere | Low |
| ChatGLM-Spider | Zhipu AI | Low |
| PanguBot | Huawei | Low |
| Timpibot | Timpi | Low |
| webzio-extended | Webz.io | Low |
| Kangaroo Bot | Kangaroo LLM | Low |
| VelenPublicWebCrawler | Velen/Hunter | Low |
AI Training (10)
| Crawler | Company | Importance |
|---|---|---|
| Google-Extended | Google | High |
| anthropic-ai | Anthropic | High |
| Google-CloudVertexBot | Google | Medium |
| Applebot-Extended | Apple | Medium |
| CCBot | Common Crawl | Medium |
| Bytespider | ByteDance/TikTok | Low |
| meta-externalagent | Meta | Low |
| Meta-ExternalFetcher | Meta | Low |
| Diffbot | Diffbot | Low |
| Omgilibot / cohere-ai / AI2Bot | Various | Low |
Plus 7 search engine crawlers, 7 SEO tool crawlers, and 8 social media crawlers.
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `domains` | string[] | Yes | — | Domains to audit (e.g., `["example.com", "github.com"]`) |
| `checkCompetitors` | string[] | No | `[]` | Competitor domains for side-by-side comparison |
| `includeSitemap` | boolean | No | `true` | Validate sitemap URLs declared in robots.txt |
| `scanSubdomains` | boolean | No | `false` | Scan 13 common subdomains (www, blog, shop, api, dev, staging, app, docs, help, support, cdn, mail, admin) for separate robots.txt files |
| `maxConcurrency` | integer | No | `10` | Max domains to process in parallel (1-50) |
| `outputFormat` | string | No | `"json"` | Output format: `json` or `csv` |
| `proxyConfig` | object | No | Apify residential | Proxy configuration for fetching robots.txt files |
Example Input
{"domains": ["nytimes.com", "github.com", "shopify.com"],"checkCompetitors": ["medium.com"],"includeSitemap": true,"scanSubdomains": false,"maxConcurrency": 10,"outputFormat": "json"}
Output Schema
Each domain produces a comprehensive audit report:
{"summary": {"domain": "nytimes.com","robotsTxtExists": true,"robotsTxtUrl": "https://nytimes.com/robots.txt","httpStatus": 200,"lastModified": "2026-01-15T00:00:00Z","syntaxValid": true,"totalRules": 12,"responseTimeMs": 245,"aiCrawlerStatus": {"allowedAI": 5,"blockedAI": 8,"unspecifiedAI": 48,"defaultPolicy": "allow"},"aiReadinessScore": 72,"aiReadinessGrade": "B"},"aiReadinessScore": {"score": 72,"grade": "B","breakdown": [{ "factor": "Critical AI crawlers", "points": 35, "maxPoints": 35, "detail": "GPTBot: allowed (+12); ChatGPT-User: allowed (+12); OAI-SearchBot: allowed (+12)" },{ "factor": "High-importance AI crawlers", "points": 15, "maxPoints": 20, "detail": "ClaudeBot: allowed (+5); Claude-SearchBot: allowed (+5); PerplexityBot: allowed (+5); Perplexity-User: BLOCKED (+0)" },{ "factor": "Sitemap declared", "points": 10, "maxPoints": 10, "detail": "2 sitemap(s) declared and accessible" },{ "factor": "Syntax quality", "points": 7, "maxPoints": 10, "detail": "1 high-severity syntax issue(s) found" },{ "factor": "Search engines accessible", "points": 10, "maxPoints": 10, "detail": "All critical search engines allowed" },{ "factor": "Wildcard policy", "points": 10, "maxPoints": 10, "detail": "Wildcard allows crawling (or no wildcard rule)" },{ "factor": "Security posture", "points": 5, "maxPoints": 5, "detail": "No high-severity security path disclosures" }]},"aiCrawlers": {"allowed": [{"bot": "GPTBot","company": "OpenAI","purpose": "ChatGPT responses & training","status": "explicitly_allowed","impact": "Your content CAN be accessed by OpenAI."}],"blocked": [{"bot": "CCBot","company": "Common Crawl","status": "explicitly_blocked","impact": "Your content CANNOT be accessed by Common Crawl."}],"unspecified": [{"bot": "Bravebot","company": "Brave","status": "not_mentioned","defaultBehavior": "allowed"}]},"searchEngines": {"Googlebot": { "status": "allowed", "restrictions": [], "crawlDelay": null },"Bingbot": { "status": "partially_restricted", "restrictions": ["/admin"], "crawlDelay": null }},"syntaxIssues": [],"securityConcerns": [],"sitemaps": [{ "url": "https://nytimes.com/sitemap.xml", "accessible": true, "httpStatus": 200, "source": "robots_txt" }],"recommendations": [{"priority": "high","category": "ai_visibility","title": "Unblock Perplexity-User for AI search visibility","description": "Perplexity-User is blocked.","implementation": "User-agent: Perplexity-User\nAllow: /"}],"aiStrategyInsights": {"currentPosture": "mixed","description": "Mixed strategy: Allowing AI search but blocking AI training.","suggestedStrategy": "Good approach. Review periodically as new AI crawlers emerge."},"competitorComparison": [],"subdomainAudits": []}
AI Readiness Score
The AI Readiness Score is a 0-100 metric that measures how discoverable your site is to AI systems. It provides a single, comparable number with a letter grade (A-F) and a detailed breakdown showing exactly where points are gained or lost.
Scoring Breakdown
| Factor | Max Points | How It Works |
|---|---|---|
| Critical AI crawlers allowed | 35 | +12 each for GPTBot, ChatGPT-User, OAI-SearchBot |
| High-importance AI crawlers | 20 | +5 each for ClaudeBot, Claude-SearchBot, PerplexityBot, Perplexity-User |
| Sitemap declared | 10 | 10 if declared in robots.txt, 5 if found at /sitemap.xml, 0 if none |
| No syntax errors | 10 | 10 if clean, -3 per high-severity issue |
| Search engines not blocked | 10 | 10 if Googlebot + Bingbot allowed, -5 per blocked |
| No wildcard Disallow: / | 10 | 10 if wildcard allows, 0 if wildcard blocks root |
| No security path disclosures | 5 | 5 minus 1 per high-severity concern |
Grades
| Grade | Score Range | Meaning |
|---|---|---|
| A | 90-100 | Excellent AI discoverability |
| B | 75-89 | Good, minor improvements possible |
| C | 55-74 | Fair, significant gaps in AI access |
| D | 35-54 | Poor, most AI crawlers blocked or no robots.txt |
| F | 0-34 | Critical, site is largely invisible to AI |
Special Cases
- No robots.txt (404): Score = 50 (D) — all crawlers allowed by default but no intentional control
- Access restricted (403/401): Score = 40 (D) — crawlers default to allow-all but situation is abnormal
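The sketch below shows how the factor points and grade bands above fit together, assuming the score is simply the sum of the factor points and the cutoffs match the Grades table; the actor's internal scoring may differ in its details.

```typescript
// Assumption: the score is the plain sum of factor points (max 100 total).
interface ScoreFactor {
  factor: string;
  points: number;
  maxPoints: number;
}

// Grade bands taken from the Grades table above.
function toGrade(score: number): 'A' | 'B' | 'C' | 'D' | 'F' {
  if (score >= 90) return 'A';
  if (score >= 75) return 'B';
  if (score >= 55) return 'C';
  if (score >= 35) return 'D';
  return 'F';
}

function aiReadiness(breakdown: ScoreFactor[]) {
  const score = breakdown.reduce((sum, f) => sum + f.points, 0);
  return { score, grade: toGrade(score) };
}

// Special cases described above map to fixed scores:
console.log(toGrade(50)); // "D" -- no robots.txt (404)
console.log(toGrade(40)); // "D" -- access restricted (403/401)
```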
Example Use Cases
1. Agency Client Audit
Audit all client domains for AI visibility issues:
{"domains": ["client1.com","client2.com","client3.com"],"includeSitemap": true,"maxConcurrency": 10}
2. Competitive Analysis
Compare your AI crawler strategy against competitors:
{"domains": ["mycompany.com"],"checkCompetitors": ["competitor1.com", "competitor2.com", "competitor3.com"],"includeSitemap": false}
3. Enterprise Bulk Scan
Scan hundreds of domains for a portfolio audit:
{"domains": ["site1.com", "site2.com", "...up to 1000 domains"],"maxConcurrency": 25,"outputFormat": "csv"}
4. Subdomain Audit
Scan a domain and its common subdomains (www, blog, shop, api, docs, etc.):
{"domains": ["example.com"],"scanSubdomains": true,"includeSitemap": true}
5. Quick Single-Domain Check
Fast check for one domain:
{"domains": ["example.com"]}
Output Formats
JSON (Default)
Standard Apify dataset output. Each domain produces one JSON object in the dataset. Use the Overview dataset view for a quick summary table including the AI Readiness Score.
CSV
Flattened audit data saved to the key-value store as `output-csv`. Ideal for importing into Google Sheets or Excel for reporting.
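A minimal sketch of pulling that CSV down with `apify-client`; it assumes the run was started with `outputFormat: "csv"` and that you have the run's default key-value store ID (`run.defaultKeyValueStoreId`).

```typescript
import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function downloadCsv(keyValueStoreId: string) {
  // The actor stores the flattened CSV under the key "output-csv".
  const record = await client.keyValueStore(keyValueStoreId).getRecord('output-csv');
  if (!record) throw new Error('No output-csv record found for this run');

  writeFileSync('robots-audit.csv', String(record.value));
}
```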
How It Works
- Fetch — Retrieves `/robots.txt` via HTTPS (falls back to HTTP). Handles 404, 403, 5xx, redirects, and HTML catch-all responses. Retries transient failures.
- Parse — Extracts User-agent groups, Allow/Disallow rules, Crawl-delay, Sitemap declarations, and Host directives.
- Categorize — Maps every user-agent to our database of 80+ known crawlers across AI, search engines, SEO tools, and social media bots.
- Analyze — Determines access status for each of 60+ AI crawlers and 7 search engines. Identifies security concerns from sensitive path disclosures.
- Validate — Checks syntax for typos, malformed lines, orphaned directives, and non-standard extensions.
- Recommend — Generates prioritized, actionable recommendations with copy-paste robots.txt snippets.
- Score — Calculates the AI Readiness Score (0-100) with a detailed factor-by-factor breakdown.
- Subdomain Scan — Optionally checks 13 common subdomains for separate robots.txt files and runs the full audit pipeline on each.
- Compare — If competitors are provided, adds side-by-side AI strategy comparison.
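To make the Parse and Analyze steps concrete, here is a simplified sketch of how a crawler's access status can be resolved from parsed user-agent groups. The types and function are illustrative only, not the actor's internal API; real robots.txt matching also handles longest-match path precedence, which is omitted here.

```typescript
// Illustrative data model for parsed robots.txt rule groups.
interface RuleGroup {
  userAgents: string[]; // e.g. ['GPTBot'] or ['*']
  allow: string[];      // path prefixes
  disallow: string[];   // path prefixes
}

type AccessStatus = 'explicitly_allowed' | 'explicitly_blocked' | 'not_mentioned';

function resolveStatus(bot: string, groups: RuleGroup[]): AccessStatus {
  const target = bot.toLowerCase();
  const specific = groups.find(g =>
    g.userAgents.some(ua => ua.toLowerCase() === target),
  );

  // No bot-specific group: the bot inherits the wildcard (*) or default-allow policy.
  if (!specific) return 'not_mentioned';

  // A blanket "Disallow: /" without a counteracting "Allow: /" blocks the bot;
  // anything more granular is treated as allowed in this simplified sketch.
  const blocksRoot = specific.disallow.includes('/') && !specific.allow.includes('/');
  return blocksRoot ? 'explicitly_blocked' : 'explicitly_allowed';
}
```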
Troubleshooting
Empty results or fetch errors
- DNS failures: The domain may not exist or may be unreachable. Check the `interpretation` field in the output for details.
- Timeouts: Some servers are slow. The actor retries transient failures automatically (2 retries with exponential backoff).
- No robots.txt (404): This is a valid result — it means all crawlers are allowed by default. The actor still generates recommendations and an AI Readiness Score of 50.
All AI crawlers show as "unspecified"
This means the robots.txt has no AI-specific rules. All AI crawlers fall through to the wildcard (*) or default allow policy. The recommendations will suggest adding explicit rules.
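For example, rules like the following (modelled on the snippets the recommendations generate) make the policy explicit instead of inherited; which bots to allow or block depends on your own strategy:

```txt
# Explicitly allow AI search crawlers
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Explicitly block an AI training crawler
User-agent: CCBot
Disallow: /
```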
Running locally
Install dependencies and run:
```bash
cd actors/robots-txt-audit
npm install
echo '{"domains":["example.com"]}' | npx apify-cli run --purge
```
Or run tests:
```bash
npm test
```
Limitations
- robots.txt only — This actor analyzes `/robots.txt` files. It does not crawl pages, check meta robots tags, or verify X-Robots-Tag headers.
- Static analysis — The audit reflects the robots.txt content at fetch time. It does not monitor changes over time (planned for V2).
- No Wayback Machine integration — Historical robots.txt analysis is planned for V2.
- Standard compliance — The parser follows the Google robots.txt specification. Non-standard extensions (Crawl-delay, Host) are detected but flagged as non-standard.
- Response size limit — robots.txt files larger than 1 MB are skipped to prevent memory issues (this is extremely rare).
FAQ
Q: How long does a typical run take?
A: About 5-10 seconds for 10 domains, and about 30-60 seconds for 100 domains at concurrency 10. Runs are fast because the actor only fetches small text files. Subdomain scanning increases run time proportionally.
Q: Do I need proxies?
A: Usually not — robots.txt files are public and lightweight. Proxies are available if you're scanning from a datacenter IP that gets rate-limited.
Q: What if a domain has no robots.txt?
A: The actor reports `robotsTxtExists: false` with the interpretation "all crawlers are allowed by default", generates a recommendation to create one, and assigns an AI Readiness Score of 50 (Grade D).
Q: How often is the AI crawler database updated?
A: We track 60+ AI crawlers as of February 2026, sourced from Dark Visitors and our own research. The database is updated with each actor release as new AI crawlers emerge.
Q: What is the AI Readiness Score?
A: It's a 0-100 metric that measures how AI-discoverable your site is. A score of 90+ (Grade A) means your site is fully optimized for AI search engines. Below 35 (Grade F) means most AI systems can't access your content.
Q: Can I use this to generate a robots.txt file?
A: The recommendations include ready-to-paste robots.txt snippets. A full robots.txt generator is planned for V2.
Changelog
v1.1.0 (February 2026)
- AI Readiness Score: New 0-100 headline metric with A-F grades and detailed 7-factor breakdown
- 60+ AI crawlers: Expanded from 26 to 61 AI crawlers, now covering AI Agents (ChatGPT-Agent, NovaAct, Manus), AI Assistants (Gemini-Deep-Research, Devin, MistralAI-User), AI Data Scrapers (GoogleOther, ChatGLM-Spider), and AI Search (Bravebot, AzureAI-SearchBot, meta-webindexer)
- Subdomain scanning: Optionally scan 13 common subdomains for separate robots.txt files
- Expanded non-AI crawlers: Added rogerbot, Screaming Frog, SiteAuditBot (SEO) and Discordbot, WhatsApp, TelegramBot, Pinterestbot (social)
- 267 unit tests
v1.0.1 (February 2026)
- Added 7 AI crawlers (ClaudeBot, Claude-SearchBot, Claude-User, Google-CloudVertexBot, DuckAssistBot, Perplexity-User, Meta-ExternalFetcher)
- 403/401 HTTP status handling with specific recommendations
- Sitemap auto-discovery at /sitemap.xml when not declared in robots.txt
- Crawl-delay detection and wildcard crawl-delay warnings
- Output deduplication (removed knownBots redundancy, cleaned up wildcard-inherited search engine restrictions)
- 218 unit tests
v1.0.0 (February 2026)
- Initial public release
- 19 AI crawlers tracked
- 7 search engine crawlers, 4 SEO tool crawlers, 3 social media crawlers
- Syntax validation with typo detection
- Security concern identification
- Prioritized recommendations with implementation snippets
- Competitor comparison
- Sitemap accessibility validation
- CSV export
- Proxy support via Apify proxy configuration
- Retry logic with exponential backoff
- 184 unit tests
Support
- Issues: Report bugs via GitHub issues or the Apify community forum
- Feature requests: Contact us through Apify or open a GitHub issue
- Enterprise: For bulk scanning (10K+ domains/month), reach out for custom pricing
Built by A Page Ventures | Apify Store