robots.txt Parser & URL Tester

Pricing

from $1.00 / 1,000 results

Fetch and parse robots.txt for any site: user-agent rules, crawl-delay, and declared sitemaps. Optionally test whether specific URLs are allowed for a given user-agent, using correct longest-match rules.

Developer

Nicolas van Arkens

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 days ago

robots.txt Parser & URL Tester 🤖

Fetch and parse robots.txt for any site and get a clean, structured breakdown: per-user-agent allow/disallow rules, crawl-delay, and every declared sitemap. Optionally test whether specific URLs are allowed or blocked for a chosen crawler, using correct longest-match precedence.

Built for SEO audits, crawler and bot development, compliance checks, and anyone who needs to know what a site permits before crawling it.

Why use it

  • 📋 Structured rules: allow/disallow lists per user-agent, not raw text
  • 🤖 User-agent aware: see the rules that actually apply to Googlebot, bingbot, or *
  • ✅ URL allow/deny testing: check exact paths against the rules with proper * wildcard, $ anchor, and longest-match logic
  • 🐌 Crawl-delay: extracted per user-agent
  • 🗺️ Sitemaps: every sitemap the site declares, ready to feed into a sitemap extractor
  • 🌐 Batch: check many sites at once
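
To make the "structured rules" idea concrete, here is a minimal Python sketch of how robots.txt text could be split into per-user-agent groups, crawl-delay values, and sitemaps. It is an illustration only, not this Actor's code: the function name `parse_robots` and the shape of the returned dictionary are assumptions made for the example.

```python
def parse_robots(text: str) -> dict:
    """Split robots.txt text into per-user-agent groups plus declared sitemaps.

    Hypothetical helper for illustration; the returned structure is an assumption.
    """
    groups = {}        # lowercased user-agent -> {"allow": [], "disallow": [], "crawl_delay": None}
    sitemaps = []
    current = []       # user-agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "sitemap":
            sitemaps.append(value)            # sitemaps are not tied to a group
        elif field == "user-agent":
            if in_rules:                      # a rule line closed the previous group
                current, in_rules = [], False
            agent = value.lower()
            current.append(agent)
            groups.setdefault(agent, {"allow": [], "disallow": [], "crawl_delay": None})
        elif field in ("allow", "disallow") and current:
            in_rules = True
            if value:                         # an empty Disallow blocks nothing
                for agent in current:
                    groups[agent][field].append(value)
        elif field == "crawl-delay" and current:
            in_rules = True
            try:
                delay = float(value)
            except ValueError:
                continue                      # ignore non-numeric delays
            for agent in current:
                groups[agent]["crawl_delay"] = delay

    return {"groups": groups, "sitemaps": sitemaps}


SAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-page
Crawl-delay: 10

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parsed = parse_robots(SAMPLE)
print(parsed["groups"]["*"])
print(parsed["sitemaps"])
```

Consecutive User-agent lines share one group, and repeated groups for the same user-agent are merged, which is how the sketch keeps one rule list per crawler.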

Use cases

  • SEO audits: verify a site isn't accidentally blocking important pages
  • Crawler development: respect robots.txt correctly before scraping
  • Compliance: confirm what a site permits for your user-agent
  • Sitemap discovery: pull declared sitemaps to drive further crawling
  • Monitoring: track robots.txt changes over time

Input

| Field | Description |
| --- | --- |
| Sites | List of sites/URLs; robots.txt is fetched at each root. |
| User-agent | Which crawler's rules to apply (e.g. Googlebot, or *). |
| Test paths | Optional paths/URLs to test for allowed/blocked. |

Output

{
  "site": "https://example.com",
  "robotsUrl": "https://example.com/robots.txt",
  "success": true,
  "userAgentChecked": "*",
  "sitemaps": ["https://example.com/sitemap.xml"],
  "userAgentsDeclared": ["*", "googlebot", "badbot"],
  "appliedGroupDisallow": ["/private/", "/tmp/"],
  "appliedGroupAllow": ["/private/public-page"],
  "crawlDelay": 10,
  "testResults": [
    { "path": "/private/secret", "allowed": false },
    { "path": "/private/public-page", "allowed": true }
  ]
}
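
As a rough illustration of consuming one such record, the Python sketch below loads the sample above and lists the tested paths that came back blocked. The field names are taken from the sample output; the helper `blocked_paths` is a hypothetical name introduced for this example.

```python
import json

# The sample record shown above, embedded so the example is self-contained.
record = json.loads("""
{
  "site": "https://example.com",
  "robotsUrl": "https://example.com/robots.txt",
  "success": true,
  "userAgentChecked": "*",
  "sitemaps": ["https://example.com/sitemap.xml"],
  "userAgentsDeclared": ["*", "googlebot", "badbot"],
  "appliedGroupDisallow": ["/private/", "/tmp/"],
  "appliedGroupAllow": ["/private/public-page"],
  "crawlDelay": 10,
  "testResults": [
    { "path": "/private/secret", "allowed": false },
    { "path": "/private/public-page", "allowed": true }
  ]
}
""")


def blocked_paths(item: dict) -> list[str]:
    """Return the tested paths that the applied rules disallow (hypothetical helper)."""
    return [result["path"] for result in item.get("testResults", []) if not result["allowed"]]


if record["success"]:
    print("Blocked:", blocked_paths(record))
    print("Crawl delay (seconds):", record.get("crawlDelay"))
    print("Sitemaps:", record["sitemaps"])
```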

Export to JSON, CSV, or Excel, or pull via the Apify API.

Notes

  • Implements standard robots.txt semantics: longest-match wins between Allow and Disallow, with * wildcards and $ end-anchors (per Google's specification).
  • A site with no robots.txt (404) is reported as such; by convention, that means all crawling is allowed.
  • Independent tool. Always honor robots.txt in your own crawling.
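
The longest-match rule in the first note can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Actor's implementation: `is_allowed` and `_pattern_to_regex` are hypothetical names, rule length is measured on the raw pattern, ties go to Allow, and percent-encoding normalization is left out.

```python
import re


def _pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    anchored = pattern.endswith("$")          # trailing $ means "path must end here"
    if anchored:
        pattern = pattern[:-1]
    # * matches any run of characters; everything else is literal
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, allow: list[str], disallow: list[str]) -> bool:
    """Apply longest-match precedence; a path that matches no rule is allowed."""
    best_len, best_allowed = -1, True
    for allowed, patterns in ((True, allow), (False, disallow)):
        for pattern in patterns:
            if not pattern:
                continue                      # empty rules match nothing
            if _pattern_to_regex(pattern).match(path):
                length = len(pattern)
                # longer rule wins; on equal length the Allow rule wins
                if length > best_len or (length == best_len and allowed):
                    best_len, best_allowed = length, allowed
    return best_allowed


allow_rules = ["/private/public-page"]
disallow_rules = ["/private/", "/tmp/", "/*.pdf$"]

print(is_allowed("/private/secret", allow_rules, disallow_rules))       # False
print(is_allowed("/private/public-page", allow_rules, disallow_rules))  # True
print(is_allowed("/docs/a.pdf", allow_rules, disallow_rules))           # False
print(is_allowed("/docs/a.pdf?x=1", allow_rules, disallow_rules))       # True
```

With an empty rule set, as for a site whose robots.txt returns 404, the function allows every path, matching the convention in the second note.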