Pricing

from $1.00 / 1,000 results

robots.txt Parser & URL Tester

Fetch and parse robots.txt for any site: user-agent rules, crawl-delay, and declared sitemaps. Optionally test whether specific URLs are allowed for a given user-agent, using correct longest-match rules.

Rating

0.0

(0)

Developer

Nicolas van Arkens

2 months ago

Last modified

.dockerignore

.git
.venv
__pycache__
*.pyc
storage
test_*.py

requirements.txt

apify < 3.0
pydantic >= 2.8, < 2.12
browserforge == 1.2.3
httpx

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "robots-txt",
    "title": "robots.txt Parser & URL Tester",
    "description": "Fetch and parse robots.txt for any site: user-agent rules, crawl-delay, and declared sitemaps. Optionally test whether specific URLs are allowed for a given user-agent, using correct longest-match rules.",
    "version": "0.1",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storages": { "dataset": "./dataset_schema.json" }
}

.actor/dataset_schema.json

{
    "actorSpecification": 1,
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": { "fields": ["site","userAgentChecked","appliedGroupDisallow","appliedGroupAllow","crawlDelay","sitemaps","robotsUrl"] },
            "display": {
                "component": "table",
                "properties": {
                    "site": { "label": "Site", "format": "text" },
                    "userAgentChecked": { "label": "User-agent", "format": "text" },
                    "appliedGroupDisallow": { "label": "Disallow", "format": "array" },
                    "appliedGroupAllow": { "label": "Allow", "format": "array" },
                    "crawlDelay": { "label": "Crawl-delay", "format": "number" },
                    "sitemaps": { "label": "Sitemaps", "format": "array" },
                    "robotsUrl": { "label": "robots.txt", "format": "link" }
                }
            }
        }
    }
}

.actor/Dockerfile

FROM apify/actor-python:3.12
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . ./
CMD ["python", "-m", "src"]

.actor/input_schema.json

{
    "title": "robots.txt Parser & URL Tester",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "sites": {
            "title": "Sites",
            "type": "array",
            "description": "List of sites or URLs to check. The actor fetches /robots.txt at each site's root (scheme added automatically if omitted).",
            "editor": "stringList",
            "prefill": ["https://www.apify.com", "https://github.com"]
        },
        "userAgent": {
            "title": "User-agent",
            "type": "string",
            "description": "Which user-agent's rules to apply and report (e.g. 'Googlebot', 'bingbot'). Use '*' for the default/all-robots group.",
            "editor": "textfield",
            "default": "*"
        },
        "testPaths": {
            "title": "Test paths (optional)",
            "type": "array",
            "description": "Optional list of paths or URLs to test for the chosen user-agent. The output marks each as allowed or disallowed using standard longest-match rules. Example: ['/search', '/admin/', '/products/123'].",
            "editor": "stringList"
        }
    },
    "required": []
}
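
The `sites` description above says a missing scheme is added automatically and that robots.txt is fetched at each site's root. A minimal sketch of that normalization, assuming it follows the same steps as `src/main.py`; the helper name `robots_url_for` is hypothetical and not part of the actor:

```python
import re
from urllib.parse import urlparse


def robots_url_for(site: str) -> str:
    """Return the root robots.txt URL for a site or URL (hypothetical helper)."""
    site = site.strip()
    # Default to https when the entry has no http(s) scheme.
    if not re.match(r"^https?://", site, re.I):
        site = "https://" + site
    parts = urlparse(site)
    # Keep only scheme and host; any path or query on the entry is ignored.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url_for("github.com"))
print(robots_url_for("http://example.com/a/b?x=1"))
```

Under this sketch, an entry such as `https://example.com/blog/post` and a bare `example.com` both resolve to the same robots.txt URL.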

.actor/README.md

# robots.txt Parser & URL Tester 🤖

Fetch and parse **robots.txt** for any site and get a clean, structured breakdown: per-user-agent **allow/disallow rules**, **crawl-delay**, and every declared **sitemap**. Optionally test whether specific URLs are **allowed or blocked** for a chosen crawler, using correct longest-match precedence.

Built for SEO audits, crawler and bot development, compliance checks, and anyone who needs to know what a site permits before crawling it.

## Why use it

- 📋 **Structured rules**: allow/disallow lists per user-agent, not raw text
- 🤖 **User-agent aware**: see the rules that actually apply to Googlebot, bingbot, or `*`
- ✅ **URL allow/deny testing**: check exact paths against the rules with proper `*` wildcard, `$` anchor, and longest-match logic
- 🐌 **Crawl-delay**: extracted per user-agent
- 🗺️ **Sitemaps**: every sitemap the site declares, ready to feed into a sitemap extractor
- 🌐 **Batch**: check many sites at once

## Use cases

- **SEO audits**: verify a site isn't accidentally blocking important pages
- **Crawler development**: respect robots.txt correctly before scraping
- **Compliance**: confirm what a site permits for your user-agent
- **Sitemap discovery**: pull declared sitemaps to drive further crawling
- **Monitoring**: track robots.txt changes over time

## Input

| Field | Description |
|-------|-------------|
| **Sites** | List of sites/URLs; robots.txt is fetched at each root. |
| **User-agent** | Which crawler's rules to apply (e.g. `Googlebot`, or `*`). |
| **Test paths** | Optional paths/URLs to test for allowed/blocked. |

## Output

```json
{
  "site": "https://example.com",
  "robotsUrl": "https://example.com/robots.txt",
  "success": true,
  "statusCode": 200,
  "userAgentChecked": "*",
  "sitemaps": ["https://example.com/sitemap.xml"],
  "userAgentsDeclared": ["*", "googlebot", "badbot"],
  "appliedGroupDisallow": ["/private/", "/tmp/"],
  "appliedGroupAllow": ["/private/public-page"],
  "crawlDelay": 10,
  "testResults": [
    { "path": "/private/secret", "allowed": false },
    { "path": "/private/public-page", "allowed": true }
  ]
}
```

Export to JSON, CSV, or Excel, or pull via the Apify API.

## Notes

- Implements standard robots.txt semantics: longest-match wins between Allow and Disallow, with `*` wildcards and `$` end-anchors (per Google's specification).
- A site whose robots.txt cannot be fetched (for example, a 404) is reported with `success: false` and its `statusCode`. By convention, a missing robots.txt means all crawling is allowed.
- Independent tool. Always honor robots.txt in your own crawling.
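
The longest-match rule in the notes above can be illustrated with a small, self-contained sketch. It is a simplified illustration rather than the actor's exact code: the function names `to_regex`, `longest_match`, and `allowed` are assumptions made for this example, and a `$` is treated as an anchor only when it is the last character of a pattern.

```python
import re


def to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern: `*` is a wildcard, trailing `$` anchors."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    parts = [".*" if ch == "*" else re.escape(ch) for ch in body]
    return re.compile("^" + "".join(parts) + ("$" if anchored else ""))


def longest_match(rules: list[str], path: str) -> int:
    """Length of the longest rule that matches the path, or -1 if none match."""
    lengths = [len(r) for r in rules if r and to_regex(r).match(path)]
    return max(lengths, default=-1)


def allowed(allow: list[str], disallow: list[str], path: str) -> bool:
    """Longest match wins; a tie between Allow and Disallow goes to Allow."""
    a = longest_match(allow, path)
    d = longest_match(disallow, path)
    return d == -1 or a >= d


allow_rules = ["/private/public-page"]
disallow_rules = ["/private/", "/*.pdf$"]

print(allowed(allow_rules, disallow_rules, "/private/secret"))       # False
print(allowed(allow_rules, disallow_rules, "/private/public-page"))  # True
print(allowed(allow_rules, disallow_rules, "/docs/file.pdf"))        # False
print(allowed(allow_rules, disallow_rules, "/docs/file.pdf?x=1"))    # True
```

The last case shows the effect of the `$` anchor: `/*.pdf$` blocks only paths that end in `.pdf`, so the same file with a query string is not matched by that rule.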

src/main.py

"""robots.txt Parser & Tester (Apify Actor).

Fetches and parses robots.txt for one or more sites, returning a structured
breakdown: per-user-agent allow/disallow rules, crawl-delay, declared sitemaps,
and (optionally) a verdict on whether specific test URLs are allowed for a
chosen user-agent.

Implements the standard robots.txt matching semantics: longest-match wins
between Allow and Disallow, `*` wildcards, and `$` end-anchors.
"""

from __future__ import annotations

import asyncio
import re
from urllib.parse import urlparse

import httpx
from apify import Actor


def parse_robots(text: str) -> dict:
    """Parse robots.txt content into structured groups + sitemaps.

    Returns {'groups': {ua: {'allow':[...], 'disallow':[...], 'crawlDelay':x}},
             'sitemaps': [...]}.
    """
    groups: dict = {}
    sitemaps: list[str] = []
    current_uas: list[str] = []
    # Track whether the previous non-empty line was a user-agent (to group
    # consecutive user-agent lines that share the same rule block).
    last_was_ua = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()

        if field == "user-agent":
            if not last_was_ua:
                current_uas = []
            ua = value.lower()
            current_uas.append(ua)
            groups.setdefault(ua, {"allow": [], "disallow": [], "crawlDelay": None})
            last_was_ua = True
            continue

        last_was_ua = False

        if field == "sitemap":
            if value:
                sitemaps.append(value)
        elif field in ("allow", "disallow") and current_uas:
            for ua in current_uas:
                groups[ua][field].append(value)
        elif field == "crawl-delay" and current_uas:
            try:
                cd = float(value)
            except ValueError:
                cd = None
            for ua in current_uas:
                groups[ua]["crawlDelay"] = cd

    return {"groups": groups, "sitemaps": sitemaps}


def _rule_to_regex(pattern: str) -> re.Pattern:
    """Convert a robots.txt path pattern to a regex (handles * and $)."""
    # `$` anchors the match only as the final character; elsewhere it is literal.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape regex specials, translating each `*` into "match anything".
    out = [".*" if ch == "*" else re.escape(ch) for ch in body]
    return re.compile("^" + "".join(out) + ("$" if anchored else ""))


def _match_len(rules: list[str], path: str) -> int:
    """Return the length of the longest matching rule pattern, or -1 if none."""
    best = -1
    for r in rules:
        if r == "":
            continue  # an empty rule (e.g. "Disallow:") matches nothing
        try:
            if _rule_to_regex(r).match(path):
                # Match specificity = pattern length (standard heuristic).
                best = max(best, len(r))
        except re.error:
            continue
    return best


def select_group(groups: dict, user_agent: str) -> dict | None:
    """Pick the rule group for a user-agent.

    Order: exact match, then the longest declared token contained in the
    user-agent string, then the '*' group.
    """
    ua = user_agent.lower()
    if ua in groups:
        return groups[ua]
    # Substring match (e.g. 'googlebot/2.1' should match 'googlebot' group).
    candidates = [key for key in groups if key and key != "*" and key in ua]
    if candidates:
        return groups[max(candidates, key=len)]
    return groups.get("*")


def is_allowed(parsed: dict, url_path: str, user_agent: str) -> bool:
    """Determine if url_path is allowed for user_agent per longest-match rule."""
    group = select_group(parsed["groups"], user_agent)
    if not group:
        return True  # no rules -> allowed
    allow_len = _match_len(group["allow"], url_path)
    disallow_len = _match_len(group["disallow"], url_path)
    if disallow_len == -1:
        return True
    if allow_len == -1:
        return False
    # Longest match wins; ties go to Allow (per Google's spec).
    return allow_len >= disallow_len


async def fetch_robots(client: httpx.AsyncClient, robots_url: str, log) -> tuple[str | None, int | None]:
    """Fetch robots.txt with up to three attempts.

    Returns (text, status_code). text is None for HTTP status >= 400; both are
    None when every attempt fails with a transport error.
    """
    for attempt in range(1, 4):
        try:
            resp = await client.get(robots_url)
            return (resp.text if resp.status_code < 400 else None), resp.status_code
        except httpx.HTTPError as exc:
            log.warning(f"Fetch attempt {attempt} failed for {robots_url}: {exc}")
            if attempt < 3:
                await asyncio.sleep(attempt * 2)  # wait 2 s, then 4 s
    return None, None


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        sites = [s.strip() for s in (actor_input.get("sites", []) or []) if s and s.strip()]
        user_agent = (actor_input.get("userAgent") or "*").strip() or "*"
        test_paths = [p.strip() for p in (actor_input.get("testPaths", []) or []) if p and p.strip()]

        if not sites:
            Actor.log.warning("No sites provided.")
            await Actor.push_data([])
            return

        async with httpx.AsyncClient(
            timeout=30.0, follow_redirects=True,
            headers={"User-Agent": "scrapeworks-robots-txt/0.1 (https://apify.com/scrapeworks)"},
        ) as client:
            for site in sites:
                # Normalize to a robots.txt URL at the site root.
                if not re.match(r"^https?://", site, re.I):
                    site = "https://" + site
                parsed_url = urlparse(site)
                robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"

                text, status = await fetch_robots(client, robots_url, Actor.log)
                if text is None:
                    await Actor.push_data([{
                        "site": site, "robotsUrl": robots_url, "success": False,
                        "statusCode": status, "error": "No robots.txt found or not reachable",
                    }])
                    Actor.log.info(f"  {robots_url}: not available (status {status})")
                    continue

                parsed = parse_robots(text)

                # Build per-UA rule summary.
                group = select_group(parsed["groups"], user_agent)
                record = {
                    "site": site,
                    "robotsUrl": robots_url,
                    "success": True,
                    "statusCode": status,
                    "userAgentChecked": user_agent,
                    "sitemaps": parsed["sitemaps"],
                    "userAgentsDeclared": sorted(parsed["groups"].keys()),
                    "appliedGroupAllow": group["allow"] if group else [],
                    "appliedGroupDisallow": group["disallow"] if group else [],
                    "crawlDelay": group["crawlDelay"] if group else None,
                }

                # Optional URL allow/deny testing.
                if test_paths:
                    results = []
                    for p in test_paths:
                        # Accept full URLs or bare paths.
                        if p.lower().startswith(("http://", "https://")):
                            u = urlparse(p)
                            # Rules match against path plus query string.
                            path = (u.path or "/") + (f"?{u.query}" if u.query else "")
                        else:
                            path = p if p.startswith("/") else "/" + p
                        results.append({"path": p, "allowed": is_allowed(parsed, path, user_agent)})
                    record["testResults"] = results

                await Actor.push_data([record])
                Actor.log.info(f"  {robots_url}: parsed, {len(parsed['sitemaps'])} sitemap(s), "
                               f"{len(parsed['groups'])} UA group(s)")

        Actor.log.info("Done.")
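
The parser above lets consecutive `User-agent` lines share one rule block. A compact sketch of just that grouping behavior, written as a simplified restatement that tracks only allow/disallow; the name `group_rules` and the sample file are made up for illustration:

```python
def group_rules(text: str) -> dict[str, dict[str, list[str]]]:
    """Group Allow/Disallow rules by user-agent (simplified illustration)."""
    groups: dict[str, dict[str, list[str]]] = {}
    current: list[str] = []
    last_was_ua = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # also skips blank lines
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not last_was_ua:
                current = []  # a rule line came in between: start a new group
            current.append(value.lower())
            groups.setdefault(value.lower(), {"allow": [], "disallow": []})
            last_was_ua = True
            continue
        last_was_ua = False
        if field in ("allow", "disallow"):
            for ua in current:
                groups[ua][field].append(value)
    return groups


SAMPLE = """\
User-agent: Googlebot
User-agent: bingbot
Disallow: /search   # shared by both crawlers

User-agent: *
Disallow: /private/
Allow: /private/public-page
"""

groups = group_rules(SAMPLE)
for ua, rules in groups.items():
    print(ua, rules)
```

In this sample, `googlebot` and `bingbot` each receive the `/search` rule, while the `*` group keeps its own separate allow and disallow lists.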

src/__main__.py

import asyncio
from .main import main

asyncio.run(main())

robots.txt & Sitemap Analyzer

bgfc97/robots-txt-sitemap-analyzer

Fetch and parse robots.txt for many domains: user-agent groups, allow/disallow rules, crawl-delay, declared sitemaps and whether crawlers are blocked. Technical SEO auditing. No key.

Bruno

Robots.txt Checker & Parser - Crawl Rules API

pink_comic/robots-txt-validator

Check, parse, and validate robots.txt files in bulk. Extract crawl rules, sitemaps, crawl-delay, blocked paths, and per-user-agent allow/disallow results for SEO audits and crawler compliance.

Ava Torres

Robots.txt Validator - Check Rules, Sitemaps & Crawl Directives

scrappy_garden/robots-txt-validator

Validate robots.txt for one or more websites: fetches /robots.txt per host, parses directive groups (User-agent/Allow/Disallow/Crawl-delay/Sitemap), reports common errors and warnings, and can test URLs against the chosen User-Agent.

Bikram Adhikari

Robots Sitemap Analyzer - SEO Crawl Rules

benthepythondev/robots-sitemap-analyzer

Analyze robots.txt files and discover sitemap URLs, user-agent groups, allow rules, disallow rules and crawl-delay directives.

Ben

Robots.txt & Sitemap Analyzer 🕷️

perryay/robots-txt-sitemap-analyzer

Fetch, parse, and analyze robots.txt and sitemap.xml for any domain. Extract crawl directives, test URL compliance against robots.txt rules, and discover all URLs from sitemaps including nested sitemap indexes. Supports batch analysis with structured JSON output.

Perry AY

Robots.txt Auditor

junipr/robots-txt-auditor

Fetch and audit robots.txt syntax, user-agent rules, blocked paths, sitemap declarations, and crawl risks.

junipr

Robots.txt Analyzer

sootesting/robots-txt-analyzer

Analyze robots.txt for any list of domains — crawl rules per user-agent, declared sitemaps, crawl-delay, and indexing red flags (e.g. 'Disallow: /' blocking the whole site). One clean report per domain. Pay-per-event: $0.01 per batch of up to 200 domains.

soot

Robots.txt Analyzer

mahogany_songbird/robots-txt-analyzer

Read robots.txt disallow rules and sitemap declarations.

Britton Furness

Robots.txt & Sitemap Extractor by Domain

technicaldost/robots-sitemap-extractor

Fetch and parse robots.txt and XML sitemaps for any domain. Extract allowed and disallowed paths, sitemap URLs and every listed page. Great for SEO audits and crawl planning. JSON output.

Technical Dost Solutions

Robots.txt & Sitemap Analyzer

automation-lab/robots-sitemap-analyzer

This actor fetches and parses robots.txt and sitemap.xml files for any list of websites. It extracts crawl directives (user-agent rules, allowed/disallowed paths, crawl-delay), discovers sitemap URLs, and counts the number of pages listed in each sitemap. Use it for SEO audits, competitive...