# Robots.txt Analyzer: Parse and Validate Crawl Rules for Any Website
Robots.txt Analyzer fetches and parses robots.txt files for any website. Give it one domain or a list of hundreds and get back every directive: blocked paths, allowed paths, crawl delays, and sitemap URLs, organized by user agent. Most tools check robots.txt for a single site; this actor handles bulk analysis, so you can audit hundreds of domains in a single run.
## Use cases
- SEO auditing: check which pages Googlebot or Bingbot can access before pushing new content live
- Technical SEO review: audit robots.txt across dozens of client domains without opening each file manually
- Bot access testing: verify whether a specific URL path is blocked for any crawler before deployment
- Competitive analysis: compare robots.txt configurations across competitor domains to see what they protect from indexing
- Site monitoring: schedule regular runs to catch unexpected changes that could block search engine crawlers
- QA validation: confirm that robots.txt deployments match intended crawl rules after each release
## Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | | Single website URL to analyze |
| `urls` | array | | List of website URLs for bulk analysis. One URL per line. |
| `userAgent` | string | `*` | Crawler user agent to check rules for (e.g. `Googlebot`, `Bingbot`, `*`) |
| `checkPath` | string | | Specific URL path to check (e.g. `/admin/`) |
| `maxUrls` | integer | `100` | Maximum number of URLs to process per run |
| `timeoutSecs` | integer | `300` | Overall actor timeout in seconds |
| `requestTimeoutSecs` | integer | `30` | Per-request timeout in seconds |
| `proxyConfiguration` | object | Datacenter (Anywhere) | Optional. Proxy type and location for requests. Supports Datacenter, Residential, Special, and custom proxies. |
### Example input

```json
{
  "urls": ["https://apify.com", "https://news.ycombinator.com"],
  "userAgent": "Googlebot",
  "checkPath": "/admin/",
  "maxUrls": 100,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
## What data does this actor extract?
The actor stores one result per URL in the Apify dataset. Each entry contains:
```json
{
  "url": "https://apify.com",
  "robotsTxtUrl": "https://apify.com/robots.txt",
  "httpStatus": 200,
  "isAccessible": true,
  "rawContent": "User-agent: *\nDisallow: /api/\nSitemap: https://apify.com/sitemap.xml",
  "userAgentsFound": ["*", "Googlebot"],
  "sitemapUrls": ["https://apify.com/sitemap.xml"],
  "crawlDelay": null,
  "disallowedPaths": ["/api/"],
  "allowedPaths": [],
  "checkedUserAgent": "Googlebot",
  "checkedPath": "/admin/",
  "isPathBlocked": true,
  "matchingRule": "Disallow: /admin/",
  "error": null,
  "scrapedAt": "2025-03-08T12:00:00+00:00"
}
```
| Field | Type | Description |
|---|---|---|
| `url` | string | Original website URL |
| `robotsTxtUrl` | string | URL of the fetched robots.txt file |
| `httpStatus` | integer | HTTP status code returned for the robots.txt request |
| `isAccessible` | boolean | Whether robots.txt was found and returned HTTP 200 |
| `rawContent` | string | Full raw text of the robots.txt file |
| `userAgentsFound` | array | All user agents declared in the file |
| `sitemapUrls` | array | Sitemap URLs declared in the file |
| `crawlDelay` | number | Crawl delay in seconds for the checked user agent, if declared |
| `disallowedPaths` | array | Paths disallowed for the checked user agent |
| `allowedPaths` | array | Paths explicitly allowed for the checked user agent |
| `checkedUserAgent` | string | User agent checked against the robots.txt rules |
| `checkedPath` | string | Specific path checked for access, if provided |
| `isPathBlocked` | boolean | Whether the checked path is blocked. `null` if no path was provided. |
| `matchingRule` | string | The specific rule that determined the access result |
| `error` | string | Error message if the fetch failed |
| `scrapedAt` | string | ISO 8601 timestamp of the analysis |
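Because each dataset entry is plain JSON, post-processing is straightforward. Here is a minimal sketch (using the field names above, with an inline sample standing in for a real export) that lists every URL where the checked path was blocked:

```python
import json

# Sample of dataset items as exported from a run (JSON format).
items_json = """
[
  {"url": "https://example.com", "checkedPath": "/admin/",
   "isPathBlocked": true, "matchingRule": "Disallow: /admin/", "error": null},
  {"url": "https://example.org", "checkedPath": "/admin/",
   "isPathBlocked": false, "matchingRule": null, "error": null}
]
"""

def blocked_urls(items):
    """Return (url, rule) pairs for entries where the checked path was blocked."""
    return [
        (item["url"], item["matchingRule"])
        for item in items
        if item.get("isPathBlocked") and not item.get("error")
    ]

items = json.loads(items_json)
print(blocked_urls(items))  # [('https://example.com', 'Disallow: /admin/')]
```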
## How it works

- The actor reads the input URL or list of URLs
- For each domain, it builds the robots.txt URL by appending `/robots.txt` to the root
- It fetches the file with an HTTP GET request
- The parser groups directives by user agent, reading each line in order
- It matches the configured user agent against the parsed groups, checking for an exact match first and falling back to the wildcard `*` group
- If a check path is provided, it applies the longest-match rule to determine access
- All results are pushed to the Apify dataset
## Integrations
Connect Robots.txt Analyzer with other apps and services using Apify integrations. You can pipe results to Google Sheets, Airtable, or trigger Slack alerts via Make or Zapier whenever a path becomes blocked. You can also use webhooks to act on results as soon as a run finishes.
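For the monitoring use case, a webhook consumer can diff the latest run against the previous one and alert only on newly blocked paths. A minimal sketch (the function and its inputs are illustrative, using the dataset fields described above):

```python
def newly_blocked(previous, current):
    """Compare two runs' results and return (url, path) pairs that
    were accessible in the previous run but are blocked now."""
    was_blocked = {
        (r["url"], r["checkedPath"]): r["isPathBlocked"] for r in previous
    }
    return [
        (r["url"], r["checkedPath"])
        for r in current
        if r["isPathBlocked"] and not was_blocked.get((r["url"], r["checkedPath"]))
    ]

prev = [{"url": "https://example.com", "checkedPath": "/admin/", "isPathBlocked": False}]
curr = [{"url": "https://example.com", "checkedPath": "/admin/", "isPathBlocked": True}]
print(newly_blocked(prev, curr))  # [('https://example.com', '/admin/')]
```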
## FAQ

### Does this actor handle robots.txt files with multiple user-agent groups?

Yes. The parser reads every user-agent block and applies the correct rules for the configured user agent, with automatic fallback to the wildcard `*` group when no exact match is found.

### What happens if a site has no robots.txt file?

The actor records an HTTP 404 status and sets `isAccessible` to false. A missing robots.txt means no restrictions, so `isPathBlocked` is set to false when a check path is provided.

### How many URLs can I process per run?

Up to 1,000 per run, controlled by the `maxUrls` input. The default is 100 to avoid accidental large runs on first use.

### Can this actor check if Googlebot can access a specific page?

Yes. Set `userAgent` to `Googlebot` and `checkPath` to the path you want to check. The output includes `isPathBlocked` and `matchingRule`, showing exactly which directive made the decision.

### Does it handle robots.txt with wildcard path patterns like * and $?

The actor handles standard robots.txt directives: `Disallow`, `Allow`, `Crawl-delay`, and `Sitemap`. Wildcard characters within path patterns (`*` and `$` mid-path) are not currently supported; only prefix matching is applied.
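Python's standard-library `urllib.robotparser` also applies prefix-only matching without wildcard support, so it can serve as a quick local cross-check of a result:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Disallow: /admin/
"""

# parse() accepts the file's lines directly, so no network fetch is needed.
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/admin/settings"))  # False
print(rp.can_fetch("Googlebot", "/blog/"))           # True
```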
Use Robots.txt Analyzer for single-site spot checks or scheduled bulk audits across hundreds of domains. Export to Google Sheets and plug into your existing SEO workflow through the Apify platform.
