robots.txt Parser & URL Tester
Pricing
from $1.00 / 1,000 results
Fetch and parse robots.txt for any site: user-agent rules, crawl-delay, and declared sitemaps. Optionally test whether specific URLs are allowed for a given user-agent, using correct longest-match rules.
Developer
Nicolas van Arkens
Maintained by Community
robots.txt Parser & URL Tester
Fetch and parse robots.txt for any site and get a clean, structured breakdown: per-user-agent allow/disallow rules, crawl-delay, and every declared sitemap. Optionally test whether specific URLs are allowed or blocked for a chosen crawler, using correct longest-match precedence.
Built for SEO audits, crawler and bot development, compliance checks, and anyone who needs to know what a site permits before crawling it.
Why use it
- Structured rules: allow/disallow lists per user-agent, not raw text
- User-agent aware: see the rules that actually apply to Googlebot, bingbot, or `*`
- URL allow/deny testing: check exact paths against the rules with proper `*` wildcard, `$` anchor, and longest-match logic
- Crawl-delay: extracted per user-agent
- Sitemaps: every sitemap the site declares, ready to feed into a sitemap extractor
- Batch: check many sites at once
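To make these fields concrete, here is a minimal sketch of extracting crawl-delay and declared sitemaps with Python's standard-library `urllib.robotparser`. It is not this actor's implementation, and the `ROBOTS_TXT` sample is made up for illustration; `site_maps()` assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("*")   # crawl-delay for the wildcard group
sitemaps = rp.site_maps()     # list of declared sitemap URLs, or None if there are none
blocked = not rp.can_fetch("*", "https://example.com/private/secret")

print(delay, sitemaps, blocked)
```

The standard-library parser is enough for simple prefix rules like the one above; `*` wildcards, `$` anchors, and longest-match precedence need separate handling.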
Use cases
- SEO audits: verify a site isn't accidentally blocking important pages
- Crawler development: respect robots.txt correctly before scraping
- Compliance: confirm what a site permits for your user-agent
- Sitemap discovery: pull declared sitemaps to drive further crawling
- Monitoring: track robots.txt changes over time
Input
| Field | Description |
|---|---|
| Sites | List of sites/URLs; robots.txt is fetched at each root. |
| User-agent | Which crawler's rules to apply (e.g. Googlebot, or *). |
| Test paths | Optional paths/URLs to test for allowed/blocked. |
Output
{
  "site": "https://example.com",
  "robotsUrl": "https://example.com/robots.txt",
  "success": true,
  "userAgentChecked": "*",
  "sitemaps": ["https://example.com/sitemap.xml"],
  "userAgentsDeclared": ["*", "googlebot", "badbot"],
  "appliedGroupDisallow": ["/private/", "/tmp/"],
  "appliedGroupAllow": ["/private/public-page"],
  "crawlDelay": 10,
  "testResults": [
    { "path": "/private/secret", "allowed": false },
    { "path": "/private/public-page", "allowed": true }
  ]
}
Export to JSON, CSV, or Excel, or pull via the Apify API.
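Once exported, a record like the one above can be post-processed with a few lines of code. The sketch below assumes the field names shown in the sample output (`testResults`, `path`, `allowed`, `sitemaps`); the helper names `blocked_paths` and `declared_sitemaps` are hypothetical and not part of the actor.

```python
import json

# One exported record, abbreviated from the sample output in this README.
RECORD_JSON = """
{
  "site": "https://example.com",
  "success": true,
  "sitemaps": ["https://example.com/sitemap.xml"],
  "crawlDelay": 10,
  "testResults": [
    {"path": "/private/secret", "allowed": false},
    {"path": "/private/public-page", "allowed": true}
  ]
}
"""

def blocked_paths(record: dict) -> list[str]:
    """Return the tested paths that the applied rules block."""
    return [r["path"] for r in record.get("testResults", []) if not r["allowed"]]

def declared_sitemaps(record: dict) -> list[str]:
    """Return declared sitemap URLs, or an empty list if the field is missing."""
    return list(record.get("sitemaps") or [])

record = json.loads(RECORD_JSON)
print(blocked_paths(record))
print(declared_sitemaps(record))
```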
Notes
- Implements standard robots.txt semantics: longest-match wins between Allow and Disallow, with `*` wildcards and `$` end-anchors (per Google's specification).
- A site with no robots.txt (404) is reported as such; by convention, that means all crawling is allowed.
- Independent tool. Always honor robots.txt in your own crawling.
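The longest-match rule described in these notes can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the actor's own code: the function names `pattern_to_regex` and `is_allowed` are hypothetical, pattern length is used as the measure of specificity, ties go to Allow (the least restrictive rule), and empty patterns match nothing.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments and join them with ".*" where the pattern had "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(path: str, allow: list[str], disallow: list[str]) -> bool:
    """Apply longest-match precedence between Allow and Disallow patterns."""
    best_len = -1
    best_allowed = True  # no matching rule means the path is allowed
    for allowed, patterns in ((True, allow), (False, disallow)):
        for pattern in patterns:
            if not pattern:
                continue  # an empty pattern matches nothing
            # re.match anchors at the start of the path, as robots.txt rules do.
            if pattern_to_regex(pattern).match(path):
                length = len(pattern)
                # Longer pattern wins; on a tie, Allow wins.
                if length > best_len or (length == best_len and allowed):
                    best_len, best_allowed = length, allowed
    return best_allowed

allow = ["/private/public-page"]
disallow = ["/private/", "/tmp/"]
print(is_allowed("/private/secret", allow, disallow))       # False
print(is_allowed("/private/public-page", allow, disallow))  # True
```

With the sample rules from the Output section, `/private/secret` is blocked by `/private/`, while `/private/public-page` is allowed because the Allow pattern is the longer match.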