Indexability Audit
Pricing: $4.99/month + usage
Indexability audit tool that checks robots.txt, meta robots tags, X-Robots-Tag headers, and canonical URLs for any list of pages, so SEO teams know which ones Google can actually crawl and index.
Developer: ZeroBreak
Indexability Audit: Check Which Pages Search Engines Can Index
Indexability audit tool that scans any list of URLs and tells you which pages Google and other search engines can actually crawl and index. For each URL, it checks robots.txt rules, meta robots tags, X-Robots-Tag response headers, canonical tags, and HTTP status codes, then returns a pass or fail with the specific reason.
Run it before a site launch, after a migration, or when organic traffic drops and you need to know which pages have gone dark in search.
Use cases
- Site audits: scan hundreds of pages to confirm that product pages, blog posts, and landing pages are all indexable
- Pre-launch checks: verify pages before go-live so you don't ship with accidental noindex tags baked in
- Post-migration recovery: after a domain move or CMS switch, confirm canonical tags and robots.txt rules are set up correctly
- Traffic investigation: when rankings drop without warning, check whether key pages are still visible to search engines
- Ongoing monitoring: run on a schedule to catch noindex regressions before they affect rankings
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | | A single URL to audit. Use this, urls, or both. |
| urls | array | | A list of URLs to audit, one per line. |
| checkRobotsTxt | boolean | true | Fetch and check the robots.txt file for each domain. |
| maxUrls | integer | 100 | Maximum number of URLs to process per run. Hard cap: 1,000. |
| timeoutSecs | integer | 300 | Total actor timeout in seconds. |
| requestTimeoutSecs | integer | 30 | Per-request timeout in seconds. Increase for slow sites. |
| proxyConfiguration | object | Datacenter (Anywhere) | Proxy type and location for requests. Supports datacenter, residential, special, and custom proxies. Optional. |
Example input
```json
{
  "urls": [
    "https://apify.com",
    "https://apify.com/blog",
    "https://apify.com/about"
  ],
  "checkRobotsTxt": true,
  "maxUrls": 100,
  "requestTimeoutSecs": 30,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
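If you prefer to start runs programmatically, the example input above can be posted to Apify's run-sync-get-dataset-items API endpoint. A minimal stdlib sketch, assuming placeholder values for the actor ID and API token (substitute your own):

```python
import json

# Hypothetical placeholders -- substitute your own actor ID and API token.
ACTOR_ID = "<ACTOR_ID>"
API_TOKEN = "<YOUR_APIFY_TOKEN>"

run_input = {
    "urls": [
        "https://apify.com",
        "https://apify.com/blog",
        "https://apify.com/about",
    ],
    "checkRobotsTxt": True,
    "maxUrls": 100,
    "requestTimeoutSecs": 30,
    "proxyConfiguration": {"useApifyProxy": True},
}

# Apify's run-sync-get-dataset-items endpoint starts the run and
# returns the dataset items once the run finishes.
endpoint = (
    f"https://api.apify.com/v2/acts/{ACTOR_ID}"
    f"/run-sync-get-dataset-items?token={API_TOKEN}"
)
payload = json.dumps(run_input)
# POST `payload` to `endpoint` with Content-Type: application/json,
# e.g. via urllib.request or the apify-client package.
```

The same input works unchanged in the Apify Console editor; the API route is only needed for automation.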
What data does this actor extract?
The actor saves one result per URL in the dataset.
```json
{
  "url": "https://apify.com/blog",
  "finalUrl": "https://apify.com/blog/",
  "httpStatus": 200,
  "isIndexable": true,
  "indexabilityIssues": [],
  "metaRobotsContent": "",
  "xRobotsTag": "",
  "canonicalUrl": "https://apify.com/blog/",
  "isSelfCanonical": true,
  "robotsTxtBlocked": false,
  "redirectChain": [],
  "pageTitle": "Blog | Apify",
  "metaDescription": "News and tutorials about web scraping and automation.",
  "checkedAt": "2025-03-05T10:23:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| url | string | Original URL from input |
| finalUrl | string | Final URL after following any redirects |
| httpStatus | integer | HTTP response status code (200, 301, 404, etc.) |
| isIndexable | boolean | Whether the page passes all indexability checks |
| indexabilityIssues | array | List of issues found. Empty if the page is indexable. |
| metaRobotsContent | string | Content of the meta robots tag, e.g. noindex, nofollow |
| xRobotsTag | string | Value of the X-Robots-Tag HTTP response header |
| canonicalUrl | string | Canonical URL from the `<link rel="canonical">` tag |
| isSelfCanonical | boolean | Whether the canonical tag points back to the same page |
| robotsTxtBlocked | boolean | Whether the URL path is blocked by robots.txt |
| redirectChain | array | List of intermediate URLs in any redirect chain |
| pageTitle | string | Content of the `<title>` tag |
| metaDescription | string | Content of the meta description tag |
| checkedAt | string | ISO 8601 timestamp of when the URL was checked |
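Once the dataset is downloaded, failing pages are easy to pull out of the results. A short sketch over sample items shaped like the fields above (the issue strings are illustrative, not the actor's exact wording):

```python
# Sample dataset items shaped like the documented output fields.
items = [
    {"url": "https://apify.com/blog", "isIndexable": True,
     "indexabilityIssues": []},
    {"url": "https://apify.com/old-page", "isIndexable": False,
     "indexabilityIssues": ["meta robots noindex"]},
]

# Collect only the failing pages, keyed by URL, with their reasons.
failures = {
    item["url"]: item["indexabilityIssues"]
    for item in items
    if not item["isIndexable"]
}

for url, issues in failures.items():
    print(f"{url}: {', '.join(issues)}")
```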
How it works
- Takes the input URL list, merges single and multi-URL inputs, and deduplicates
- Fetches each page with a standard browser user-agent, tracking redirects manually to capture the full chain
- Reads the X-Robots-Tag response header for noindex directives
- Parses the HTML to extract the meta robots tag, canonical tag, title, and meta description
- Optionally fetches robots.txt once per domain and checks whether the URL path is disallowed
- Marks a page as indexable if the HTTP status is under 400 and none of the above checks fail
- Saves the full result to the dataset
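The per-URL decision described above can be sketched with the Python standard library. This is a simplified stand-in, not the actor's implementation: the regex-based meta tag extraction replaces a real HTML parser, the canonical and redirect checks are omitted for brevity, and the issue strings are illustrative.

```python
import re
import urllib.robotparser

def indexability_issues(status, html, x_robots_tag, robots_txt, url):
    """Return the reasons a page is not indexable (empty list = indexable)."""
    issues = []
    # HTTP status must be under 400.
    if status >= 400:
        issues.append(f"HTTP {status}")
    # X-Robots-Tag response header, e.g. "noindex, nofollow".
    if "noindex" in (x_robots_tag or "").lower():
        issues.append("X-Robots-Tag noindex")
    # Meta robots tag -- a regex stand-in for a real HTML parser.
    m = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)',
        html, re.I)
    if m and "noindex" in m.group(1).lower():
        issues.append("meta robots noindex")
    # robots.txt disallow rules matching this URL's path.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch("*", url):
        issues.append("blocked by robots.txt")
    return issues
```

A page with a meta robots noindex, for example, fails with that single reason even when its status is 200 and robots.txt allows the path.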
FAQ
What causes a page to fail the indexability check? Any of the following: HTTP 400+ status, meta robots noindex, X-Robots-Tag noindex, a canonical tag pointing to a different URL, or a robots.txt disallow rule matching the path.
Does it handle JavaScript-rendered pages? No. It fetches raw HTML. Most sites include the meta tags that matter for indexability in the initial server response, so this covers the majority of cases. If your site injects meta robots tags only via JavaScript, those will not be detected here.
How many URLs can I check per run? Default is 100. Raise it up to 1,000 with the maxUrls input. For larger audits, split the list across multiple runs.
Will target sites block my requests? This actor sends a standard Chrome user-agent. Most sites allow normal crawling. If a site blocks repeated requests, enabling a proxy in the input usually resolves it.
Can I audit URLs directly from a sitemap? Not directly. Extract the URLs from your sitemap first, then paste them into the urls input field.
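Extracting the URLs from a sitemap is a few lines of stdlib Python. A sketch using an inline sample sitemap (in practice, fetch yours with urllib.request first):

```python
import xml.etree.ElementTree as ET

# A sample sitemap; in practice, fetch yours with urllib.request.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://apify.com/</loc></url>
  <url><loc>https://apify.com/blog</loc></url>
</urlset>"""

# Sitemap files use the sitemaps.org namespace, so findall needs a prefix map.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]
print(urls)  # paste these into the urls input field
```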
Does it detect noindex in HTTP headers as well as meta tags? Yes. It checks both the X-Robots-Tag HTTP response header and the HTML meta robots tag independently.
Integrations
Connect Indexability Audit with other apps and services using Apify integrations. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and many more. You can also use webhooks to trigger actions whenever results are available.