Robots.txt & Sitemap Analyzer
Pricing: pay per event
Developer: Stas Persiianenko
Fetch and analyze robots.txt and sitemap.xml for any website. Returns crawl rules (allow/disallow paths), sitemap locations, URL counts, and crawl-delay directives.
What does Robots.txt & Sitemap Analyzer do?
This actor fetches and parses robots.txt and sitemap.xml files for any list of websites. It extracts crawl directives (user-agent rules, allowed/disallowed paths, crawl-delay), discovers sitemap URLs, and counts the number of pages listed in each sitemap. Use it for SEO audits, competitive analysis, or monitoring crawl policies at scale.
Use cases
- SEO specialists -- audit robots.txt rules and sitemap coverage across client websites
- Competitive analysts -- compare crawl policies and indexed page counts across competitors
- Site reliability engineers -- monitor changes in crawl rules or sitemap size over time
- Web scraping engineers -- estimate site size and check allowed paths before building a crawler
- Technical SEO consultants -- find missing sitemaps or misconfigured crawl directives during site audits
Why use Robots.txt & Sitemap Analyzer?
- Bulk analysis -- check hundreds of websites in a single run
- Robots.txt parsing -- extracts all user-agent rules, allow/disallow paths, and crawl-delay values
- Sitemap discovery -- finds sitemaps declared in robots.txt and falls back to /sitemap.xml
- Sitemap parsing -- distinguishes sitemap indexes from URL sets and counts pages
- Raw text included -- returns the full robots.txt text for custom analysis
- Fast and lightweight -- HTTP-only with no browser needed, so results come back quickly
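The sitemap-discovery fallback described above can be sketched in a few lines: sitemaps declared in robots.txt are preferred, and `/sitemap.xml` is tried only when none are declared. This is an illustrative sketch of the logic, not the actor's actual source; the function name `discover_sitemaps` is hypothetical.

```python
def discover_sitemaps(robots_txt, base_url):
    """Collect sitemap URLs declared in robots.txt; fall back to the
    conventional /sitemap.xml location when none are declared."""
    declared = []
    for line in (robots_txt or "").splitlines():
        # Split on the FIRST colon only, so the URL's own "https:" survives
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            declared.append(value.strip())
    return declared or [base_url.rstrip("/") + "/sitemap.xml"]
```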
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | -- | List of websites to analyze. Domain names without a protocol are auto-prefixed with `https://` |
| `parseSitemaps` | boolean | No | `true` | Fetch and parse sitemap.xml files to count URLs |
| `maxSitemapUrls` | integer | No | `1000` | Maximum number of URLs to count per sitemap file (100-50,000) |
Example input
```json
{
  "urls": ["apify.com", "google.com", "github.com"],
  "parseSitemaps": true,
  "maxSitemapUrls": 1000
}
```
Output example
```json
{
  "url": "https://apify.com",
  "robotsTxt": {
    "exists": true,
    "rules": [
      {
        "userAgent": "*",
        "allow": ["/"],
        "disallow": ["/api/", "/admin/"],
        "crawlDelay": null
      }
    ],
    "sitemapUrls": ["https://apify.com/sitemap.xml"],
    "rawText": "User-agent: *\nAllow: /\nDisallow: /api/\nDisallow: /admin/\nSitemap: https://apify.com/sitemap.xml"
  },
  "sitemaps": [
    {
      "url": "https://apify.com/sitemap.xml",
      "urlCount": 542,
      "type": "urlset",
      "fetchError": null
    }
  ],
  "totalSitemapUrls": 542,
  "checkTimeMs": 1234,
  "error": null,
  "checkedAt": "2026-03-01T12:00:00.000Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The analyzed website URL |
| `robotsTxt.exists` | boolean | Whether robots.txt was found |
| `robotsTxt.rules` | array | Parsed user-agent rules with allow/disallow paths |
| `robotsTxt.sitemapUrls` | string[] | Sitemap URLs declared in robots.txt |
| `robotsTxt.rawText` | string | Full robots.txt content |
| `sitemaps` | array | Parsed sitemap info with URL counts |
| `totalSitemapUrls` | number | Total URLs found across all sitemaps |
| `checkTimeMs` | number | Analysis time in milliseconds |
| `error` | string | Error message if analysis failed |
| `checkedAt` | string | ISO timestamp of the check |
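Once downloaded, dataset items with this shape are easy to post-process. The sketch below reduces a list of result items to a quick audit summary (sites with no robots.txt and sites where no sitemap URLs were found); the `summarize` helper is illustrative, not part of the actor.

```python
def summarize(items):
    """Reduce analyzer output items to a quick audit summary:
    sites missing robots.txt and sites where no sitemap URLs were found."""
    no_robots = [i["url"] for i in items if not i["robotsTxt"]["exists"]]
    no_sitemap = [i["url"] for i in items if i["totalSitemapUrls"] == 0]
    return {"missingRobots": no_robots, "missingSitemaps": no_sitemap}
```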
How much does it cost to analyze robots.txt and sitemaps?
The actor uses Apify's pay-per-event pricing. You only pay for what you use.
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| Site analyzed | $0.001 | Per website analyzed |
Example costs:
- 5 websites: $0.035 + 5 x $0.001 = $0.04
- 100 websites: $0.035 + 100 x $0.001 = $0.135
- 1,000 websites: $0.035 + 1,000 x $0.001 = $1.035
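The cost formula behind these examples is a flat start fee plus a per-site fee, which you can reproduce directly:

```python
START_FEE = 0.035   # one-time "Start" event, charged once per run
PER_SITE = 0.001    # "Site analyzed" event, charged per website

def run_cost(n_sites):
    """Estimated cost in USD of one run analyzing n_sites websites."""
    return START_FEE + n_sites * PER_SITE
```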
How to analyze robots.txt and sitemaps
- Go to the Robots.txt & Sitemap Analyzer page on Apify
- Enter one or more website URLs in the URLs field (e.g., `apify.com`, `google.com`)
- Choose whether to enable sitemap parsing (enabled by default)
- Set the Max Sitemap URLs limit if needed (default is 1,000)
- Click Start and wait for results
- Download the analysis as JSON, CSV, or Excel
Using the Apify API
You can start Robots.txt & Sitemap Analyzer programmatically from your own applications using the Apify API. The following examples show how to run the actor and retrieve results in both Node.js and Python.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/robots-sitemap-analyzer').call({
    urls: ['apify.com', 'google.com'],
    parseSitemaps: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/robots-sitemap-analyzer').call(run_input={
    'urls': ['apify.com', 'google.com'],
    'parseSitemaps': True,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
cURL
```shell
curl "https://api.apify.com/v2/acts/automation-lab~robots-sitemap-analyzer/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"urls": ["apify.com", "google.com"], "parseSitemaps": true}'
```
Use with AI agents via MCP
Robots.txt & Sitemap Analyzer is available as a tool for AI assistants via the Model Context Protocol (MCP).
Setup for Claude Code
```shell
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/robots-sitemap-analyzer"
```
Setup for Claude Desktop, Cursor, or VS Code
```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/robots-sitemap-analyzer"
    }
  }
}
```
Example prompts
- "Analyze robots.txt for example.com"
- "Check the sitemap structure for our website"
Learn more in the Apify MCP documentation.
Integrations
Robots.txt & Sitemap Analyzer works with all major automation platforms available on Apify. Export results to Google Sheets to build a crawl policy dashboard across all your monitored sites. Use Zapier or Make to schedule weekly checks and get notified when robots.txt rules change. Send alerts to Slack when a sitemap disappears or URL count drops significantly. Pipe results into n8n workflows for custom processing, or set up webhooks to trigger downstream actions as soon as a run finishes.
Tips and best practices
- Use domain names without protocol -- the actor auto-prefixes `https://`, so you can just enter `apify.com` instead of `https://apify.com`
- Increase `maxSitemapUrls` for large sites -- the default of 1,000 is fast but may undercount sites with tens of thousands of pages; increase to 50,000 for accurate counts
- Set `parseSitemaps` to `false` if you only need robots.txt rules -- this speeds up the run by skipping sitemap fetching
- Schedule regular runs to detect when competitors change their crawl policies or sitemap structure
- Combine with Sitemap URL Extractor to first analyze sitemaps here, then extract the full URL list from the sitemaps that matter
Legality
This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.
FAQ
What happens if a website has no robots.txt?
The result will show robotsTxt.exists: false and empty rules. The actor will still try to find and parse sitemap.xml at the default location.
Does the actor follow sitemap index files?
Yes. When a sitemap is a sitemap index (containing links to other sitemaps), the actor follows the child sitemaps and counts URLs across all of them.
The sitemap URL count is lower than expected. Why?
By default, the actor counts up to 1,000 URLs per sitemap file (maxSitemapUrls). If a sitemap has more URLs, the count will be capped at the limit. Increase maxSitemapUrls to 50,000 for large sites to get an accurate count. Also, some sites use sitemap indexes with many child sitemaps -- the actor follows these, but very large sitemap trees may take longer to process.
The actor found no sitemaps but my site has one. What happened?
The actor looks for sitemaps declared in robots.txt first, then falls back to checking /sitemap.xml. If your sitemap is at a non-standard location (e.g., /sitemap_index.xml or /sitemaps/main.xml) and is not listed in robots.txt, the actor will not find it. Add a Sitemap: directive to your robots.txt pointing to your sitemap location.
Can I analyze sites that require authentication?
No. The actor uses plain HTTP requests and cannot handle login-protected robots.txt or sitemap files. It works with publicly accessible files only.
How do I check if a website's robots.txt is blocking search engines?
To check if a website's robots.txt is blocking Googlebot or other search engine crawlers, fetch the robots.txt file and look for Disallow directives under the User-agent: * or User-agent: Googlebot blocks. Robots.txt & Sitemap Analyzer returns all parsed rules in structured JSON, so you can instantly see which paths are blocked for each user agent.
A common misconfiguration is Disallow: / under User-agent: * — this blocks all crawlers from indexing the entire site. Another is forgetting to remove a staging-environment robots.txt rule (Disallow: /) after launching a site, which has accidentally de-indexed thousands of sites over the years. With bulk analysis, you can audit hundreds of client or competitor sites in a single run to catch these issues.
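If you want to verify a specific user agent against the `rawText` field yourself, Python's standard library already ships a robots.txt parser. This sketch (the `is_blocked_for` wrapper is illustrative) checks whether a given robots.txt text blocks a crawler from a URL:

```python
from urllib import robotparser

def is_blocked_for(robots_txt, user_agent, url):
    """Return True if the given robots.txt text blocks user_agent from url.
    Pass the rawText field from the analyzer's output as robots_txt."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines
    return not rp.can_fetch(user_agent, url)
```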
How do I find all sitemaps for a website?
Websites can have multiple sitemaps: a main sitemap.xml, a sitemap index that links to sub-sitemaps (one per content type or section), and news or video sitemaps. To find them all you need to:
- Check robots.txt for `Sitemap:` directives — this is where most well-configured sites list their sitemap locations.
- Try the default location `/sitemap.xml` as a fallback.
- Follow any sitemap index files to discover child sitemaps.
Robots.txt & Sitemap Analyzer automates all three steps. It reads the Sitemap: entries in robots.txt, fetches each sitemap, detects whether it is a sitemap index or URL set, and recursively follows child sitemaps. The output shows the URL count for each discovered sitemap and a combined totalSitemapUrls figure.
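The index-versus-urlset distinction comes down to the root element of the sitemap XML (`<sitemapindex>` vs `<urlset>`, per the sitemaps.org protocol). A minimal sketch of that classification step, using only the standard library; the `classify_sitemap` helper is illustrative, not the actor's code:

```python
import xml.etree.ElementTree as ET

def classify_sitemap(xml_text):
    """Return (kind, locations) for a sitemap document, where kind is
    'sitemapindex' (locations are child sitemap URLs) or 'urlset'
    (locations are page URLs). Tag matching ignores the XML namespace."""
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]  # strip "{namespace}" prefix, if any
    locs = [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]
    return kind, locs
```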
How can I compare competitor website sizes using sitemaps?
Sitemap URL counts are a useful proxy for website size and content volume. A competitor with 50,000 URLs in their sitemap likely has a much larger content moat than one with 500. You can use this data to:
- Benchmark your own indexed page count against competitors in your niche.
- Identify competitors who are aggressively publishing new content (sitemap grows fast over repeated checks).
- Spot thin-content sites that have a large URL count relative to their traffic — a signal of low content quality.
To compare competitors, input a list of their domains, enable sitemap parsing, and set maxSitemapUrls to 50,000 for accuracy. Schedule the run weekly to track how competitor content volumes change over time.
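With the results in hand, ranking competitors by content volume is a one-liner over `totalSitemapUrls`. The `rank_by_size` helper below is an illustrative sketch that also drops failed analyses so error rows don't skew the comparison:

```python
def rank_by_size(items):
    """Sort analyzer results by sitemap URL count, largest first,
    skipping sites where the analysis errored out."""
    ok = [i for i in items if i.get("error") is None]
    return sorted(ok, key=lambda i: i["totalSitemapUrls"], reverse=True)
```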
What is the difference between robots.txt and a sitemap?
robots.txt is an instruction file for crawlers — it tells search engines which parts of a site they are allowed to visit and how fast they may crawl. It does not affect which pages are indexed if those pages are linked from elsewhere; it only controls crawler access.
A sitemap (typically sitemap.xml) is a directory of URLs the site owner wants search engines to discover and index. Sitemaps help with crawl efficiency, especially for large sites or pages with few inbound links. Including a URL in a sitemap does not guarantee indexing, but it does signal to Google that the page should be crawled.
Together, robots.txt and sitemaps form the foundation of a site's crawl configuration. Robots.txt & Sitemap Analyzer reads both files in a single request, giving you a complete picture of how a site manages crawler access and content discovery.
Can I monitor robots.txt changes automatically?
Yes. One of the most powerful use cases for Robots.txt & Sitemap Analyzer is scheduled monitoring. A competitor changing their robots.txt to block a previously crawlable section often signals a strategic content change — they may be hiding a new product category, blocking price comparison crawlers, or restructuring their site.
To monitor changes automatically:
- Schedule a daily or weekly run via Apify Schedules.
- Export results to Google Sheets to build a change log.
- Use a Zapier or Make integration to send a Slack alert whenever the `rawText` content or `sitemapUrls` list changes compared to a previous run.
This gives you early warning when competitors alter their crawl policies — without any manual checking.
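A simple way to implement the comparison step is to fingerprint each site's crawl config and diff fingerprints between runs. This is a sketch under the output schema shown earlier; the `robots_fingerprint` helper is illustrative:

```python
import hashlib

def robots_fingerprint(item):
    """Stable fingerprint of a site's crawl config: a SHA-256 hash of the
    robots.txt text plus the sorted sitemap URL list. Store it per run
    and alert when the value changes."""
    raw = item["robotsTxt"].get("rawText") or ""
    sitemaps = ",".join(sorted(item["robotsTxt"].get("sitemapUrls", [])))
    return hashlib.sha256((raw + "\n" + sitemaps).encode("utf-8")).hexdigest()
```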
Other SEO tools
- Redirect Chain Analyzer — trace HTTP redirect chains and detect issues
- Broken Link Checker — find broken links across your website
- HTTP Status Checker — check HTTP status codes for a list of URLs
- Sitemap URL Extractor — extract all URLs from XML sitemaps
- Website Health Report — comprehensive website health audit