# Broken Link Checker — Recursive Site Crawler (`accurate_pouch/broken-link-checker`) Actor

Recursively crawl your website and find every broken link, 404, redirect, and timeout. Checks internal and external links with configurable depth. 100 links free per run.

- **URL**: https://apify.com/accurate_pouch/broken-link-checker.md
- **Developed by:** [Manchitt Sanan](https://apify.com/accurate_pouch) (community)
- **Categories:** SEO tools, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Broken Link Checker — Recursive Site Crawler

Find every broken link on your website. Recursively crawl from any start URL and report all 404 errors, bad redirects, timeouts, and server errors — with the exact page and anchor text where each broken link was found.

---

### Why this exists

Broken links hurt your SEO rankings, frustrate visitors, and make your site look unmaintained. Manually checking links on a 500-page site takes hours. This Actor crawls your entire site in minutes, checks every internal and external link, and gives you a structured report.

- **Recursive crawling** — follows internal links up to a configurable depth, not just one page
- **External link checking** — lightweight HEAD requests to verify links to other domains
- **Status categorization** — every link classified as broken (404/410/5xx), redirect (301/302), timeout, or OK
- **Severity levels** — critical (404, 5xx), warning (redirects, timeouts), info (working links)
- **Context** — shows which page the broken link was found on and what the anchor text says
- **100 links free per run** so you can try it on your site with zero risk

---

### Quick start

```json
{
    "startUrl": "https://example.com",
    "maxDepth": 3,
    "maxPages": 500,
    "checkExternalLinks": true
}
````

Hit **Start** and get a full report in minutes.

***

### Feature comparison

| Feature | HTTP Status Checker | parseforge | **This Actor** |
|---------|-------------------|------------|--------------|
| Single URL check | Yes | Yes | Yes |
| Recursive site crawl | No | Yes | Yes |
| External link checking | No | Yes | Yes |
| Status categorization | No | Basic | 404/301/302/500/timeout |
| Severity classification | No | No | critical / warning / info |
| Anchor text context | No | No | Yes |
| Source page tracking | No | Yes | Yes |
| Configurable depth | No | Yes | Yes |
| Configurable max pages | No | Yes | Yes |
| Respect robots.txt | No | No | Yes (configurable) |
| URL pattern exclusion | No | No | Yes (glob patterns) |
| Dry run mode | No | No | Yes |
| Free tier | No | No | **100 links free** |

***

### Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `startUrl` | string | *(required)* | URL to start crawling from |
| `maxDepth` | integer | `3` | Maximum link depth to follow (1–10) |
| `maxPages` | integer | `500` | Maximum pages to crawl (1–10,000) |
| `checkExternalLinks` | boolean | `true` | Check links pointing to other domains |
| `respectRobotsTxt` | boolean | `true` | Skip pages disallowed by robots.txt |
| `ignoredPatterns` | array | `[]` | URL patterns to skip (glob-style: `*logout*`, `*admin*`) |
| `outputFormat` | enum | `broken-only` | `broken-only` or `all` |
| `sitemapUrl` | string | *(auto-detect)* | URL to sitemap.xml. If not set, auto-checks `/sitemap.xml` and `/sitemap_index.xml` |
| `webhookUrl` | string | *(optional)* | POST full JSON results to this URL when audit completes |
| `googleSheetsId` | string | *(optional)* | Export broken links to this Google Sheet (spreadsheet ID) |
| `googleServiceAccountKey` | string | *(optional)* | Google Service Account JSON key for Sheets export |
| `dryRun` | boolean | `false` | Preview what would be crawled — no charges |
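
The ranges and allowed values in the table above can be checked locally before a run is started. The following is a minimal sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and the requirement that `startUrl` begins with `http://` or `https://` is an assumption that the table does not state.

```python
from typing import Any

ALLOWED_OUTPUT_FORMATS = {"broken-only", "all"}


def validate_input(actor_input: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems: list[str] = []

    # Assumption: a usable start URL is an absolute http(s) URL.
    start_url = actor_input.get("startUrl")
    if not isinstance(start_url, str) or not start_url.startswith(("http://", "https://")):
        problems.append("startUrl is required and should be an http(s) URL")

    # Ranges from the Input table: maxDepth 1-10, maxPages 1-10,000.
    for field, low, high in (("maxDepth", 1, 10), ("maxPages", 1, 10_000)):
        value = actor_input.get(field)
        if value is None:
            continue  # omitted fields fall back to the documented defaults
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{field} must be an integer between {low} and {high}")

    if actor_input.get("outputFormat", "broken-only") not in ALLOWED_OUTPUT_FORMATS:
        problems.append("outputFormat must be 'broken-only' or 'all'")

    return problems


print(validate_input({"startUrl": "https://example.com", "maxDepth": 3, "maxPages": 500}))
```

Running a check like this before calling the Actor catches out-of-range values without starting a run.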

***

### Output

```json
{
    "status": "success",
    "startUrl": "https://example.com",
    "summary": {
        "pagesChecked": 142,
        "linksChecked": 1847,
        "brokenLinks": 12,
        "redirects": 34,
        "errors": 3
    },
    "brokenLinks": [
        {
            "url": "https://example.com/old-page",
            "statusCode": 404,
            "statusCategory": "broken",
            "severity": "critical",
            "foundOn": "https://example.com/blog/post-1",
            "anchorText": "Learn more",
            "lastChecked": "2026-04-13T10:30:00Z",
            "error": null
        }
    ]
}
```
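
Because each entry carries `foundOn` and `anchorText`, a report like the one above is easy to regroup by the page that needs fixing. The sketch below assumes the result has exactly the shape shown in the sample; `group_by_source_page` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_source_page(result: dict) -> dict[str, list[dict]]:
    """Group broken-link entries by the page they were found on."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for link in result.get("brokenLinks", []):
        grouped[link["foundOn"]].append(link)
    return dict(grouped)


# Trimmed copy of the sample output above.
sample = {
    "status": "success",
    "brokenLinks": [
        {
            "url": "https://example.com/old-page",
            "statusCode": 404,
            "severity": "critical",
            "foundOn": "https://example.com/blog/post-1",
            "anchorText": "Learn more",
        }
    ],
}

for page, links in group_by_source_page(sample).items():
    print(page)
    for link in links:
        print(f"  [{link['severity']}] {link['statusCode']} {link['url']} ({link['anchorText']!r})")
```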

#### Status categories

| Category | HTTP codes | Severity | Meaning |
|----------|-----------|----------|---------|
| `broken` | 404, 410, 5xx | critical | Link target is dead or server is failing |
| `redirect` | 301, 302, 303, 307, 308 | warning | Link works but goes through a redirect — consider updating |
| `timeout` | — | warning | Server did not respond within 10 seconds |
| `error` | — | critical | Network error, DNS failure, or connection refused |
| `ok` | 2xx | info | Link is working (only shown in `all` output mode) |
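
The table above can be read as a simple mapping from a check result to a category and severity. The sketch below restates it in Python for illustration only; it is not the Actor's source code. The `classify` name and the `unknown` fallback for codes the table does not list (for example 401 or 403) are assumptions.

```python
from typing import Optional

REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify(status_code: Optional[int], timed_out: bool = False) -> tuple[str, str]:
    """Map a check result to (statusCategory, severity) following the table above."""
    if timed_out:
        return ("timeout", "warning")
    if status_code is None:
        # No HTTP response at all: network error, DNS failure, connection refused.
        return ("error", "critical")
    if status_code in (404, 410) or 500 <= status_code <= 599:
        return ("broken", "critical")
    if status_code in REDIRECT_CODES:
        return ("redirect", "warning")
    if 200 <= status_code <= 299:
        return ("ok", "info")
    # Hypothetical fallback: the table does not say how other codes are treated.
    return ("unknown", "warning")


print(classify(404))
print(classify(301))
```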

***

### Pricing

**$0.003 per link checked** (pay-per-event pricing).

- Only charged on successful runs — errors and dry runs are never charged.
- 500 links = $1.50
- 2,000 links = $6.00
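
The figures above are the per-link price multiplied by the number of links checked. A small estimator, assuming the $0.003 rate stays as listed, might look like this. The `free_links` parameter is hypothetical: this page does not say whether the 100 free links are deducted from the billed total, so it defaults to 0.

```python
from decimal import Decimal

PRICE_PER_LINK = Decimal("0.003")  # rate quoted in the Pricing section


def estimate_cost(links_checked: int, free_links: int = 0) -> Decimal:
    """Estimate the charge in USD for a successful run."""
    billable = max(links_checked - free_links, 0)
    return PRICE_PER_LINK * billable


for count in (500, 2000):
    print(f"{count} links: ${estimate_cost(count):.2f}")
```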

***

### Performance

- Uses CheerioCrawler (pure HTTP) — no headless browser overhead
- Default concurrency handled by Crawlee's built-in request queue
- External links checked with parallel HEAD requests (batches of 20)
- Typical: 200–500 links/minute depending on target site response time
- 10-second timeout per request, 1 retry on failure
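
The throughput range above gives a rough way to budget run time. The helper below is a back-of-the-envelope sketch that simply divides the link count by the quoted 200 and 500 links per minute; actual duration depends on the target site.

```python
def estimate_minutes(links: int, slow_rate: int = 200, fast_rate: int = 500) -> tuple[float, float]:
    """Return (best_case, worst_case) minutes for checking `links` links."""
    return (links / fast_rate, links / slow_rate)


best, worst = estimate_minutes(1000)
print(f"1000 links: roughly {best:.0f} to {worst:.0f} minutes")
```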

***

### Limitations

- JavaScript-rendered links are not detected. This Actor uses HTTP requests only (CheerioCrawler), not a headless browser. Links injected by JavaScript after page load will be missed.
- Some sites aggressively block crawlers. If you see many timeouts, try disabling `checkExternalLinks`. A `maxConcurrency` option is not part of the documented input schema, so concurrency may not be adjustable through the input.
- External links are checked with HEAD requests only. Some servers respond differently to HEAD vs GET — a HEAD 404 does not always mean GET would also 404.
- Maximum 10,000 pages per run to prevent runaway costs.
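
Since a failing HEAD request does not always mean GET would fail too, external links reported as broken can be double-checked on your side. The sketch below uses only the Python standard library; `fetch_status` and `confirm_broken` are hypothetical helpers and not part of the Actor. The 10-second timeout mirrors the value quoted under Performance.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, method: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for `url` using the given method.

    urllib follows redirects, so this is the status of the final response.
    Network-level failures (DNS, refused connection) raise URLError.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def confirm_broken(url: str, fetch: Callable[[str, str], int] = fetch_status) -> bool:
    """Treat a link as broken only if both HEAD and GET return an error status."""
    if fetch(url, "HEAD") < 400:
        return False
    return fetch(url, "GET") >= 400
```

Passing `fetch` as a parameter keeps the function easy to test without network access.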

***

### Related Tools by Rashadamom

- **[Domain Age Checker](https://apify.com/accurate_pouch/domain-age-checker)** — Check registration date, expiration, registrar, and age for any domain via RDAP.
- **[Tech Stack Detector](https://apify.com/accurate_pouch/tech-stack-detector)** — Detect frameworks, CMS, analytics, CDN, and 100+ technologies for any list of URLs.
- **[Google Sheets Reader & Writer](https://apify.com/accurate_pouch/google-sheets-rw)** — Read any Google Sheet to JSON or append rows. Service Account auth — no OAuth blocks.

***

### Run on Apify

[![Run on Apify](https://apify.com/static/run-on-apify.svg)](https://apify.com/accurate_pouch/broken-link-checker)

No setup needed. Click above to run in the cloud. $0.003 per link checked.

# Actor input Schema

## `startUrl` (type: `string`):

The URL to start crawling from. All internal links will be followed up to maxDepth.

## `maxDepth` (type: `integer`):

Maximum link depth to follow from the start URL. 1 = only links on the start page. 3 = three levels deep.
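
To make the depth setting concrete, here is a sketch of a depth-limited breadth-first walk over an in-memory link graph. It assumes the start page counts as depth 0 and links found on it as depth 1, which matches "1 = only links on the start page"; the Actor's internal counting may differ. `links_within_depth` and the `site` graph are made up for illustration.

```python
from collections import deque


def links_within_depth(graph: dict[str, list[str]], start: str, max_depth: int) -> dict[str, int]:
    """Breadth-first walk returning each discovered URL with its depth (start = 0)."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # links on this page would exceed the depth limit
        for target in graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: "/" links to "/a" and "/b", "/a" to "/c", "/c" to "/d".
site = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}

print(links_within_depth(site, "/", 1))
print(links_within_depth(site, "/", 3))
```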

## `maxPages` (type: `integer`):

Maximum number of pages to crawl. Prevents runaway crawls on large sites.

## `checkExternalLinks` (type: `boolean`):

Also check links pointing to other domains (uses HEAD requests — fast and lightweight).

## `respectRobotsTxt` (type: `boolean`):

Skip pages disallowed by the site's robots.txt file.

## `ignoredPatterns` (type: `array`):

URL patterns to skip. Uses glob-style matching. Example: `*logout*`, `*admin*`, `*.pdf`
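
To preview which URLs a pattern list would skip, Python's `fnmatch` module gives a close approximation of glob-style matching. The Actor's exact matching rules are not documented here, so treat this as an assumption; `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_ignored(url: str, patterns: list[str]) -> bool:
    """Return True if the URL matches any glob-style pattern (case-sensitive)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


patterns = ["*logout*", "*admin*", "*.pdf"]
for url in (
    "https://example.com/logout?next=/",
    "https://example.com/files/report.pdf",
    "https://example.com/blog",
):
    print(url, is_ignored(url, patterns))
```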

## `outputFormat` (type: `string`):

broken-only: only report broken/redirect/error links. all: include every link checked.

## `sitemapUrl` (type: `string`):

URL to a sitemap.xml file. All URLs from the sitemap will be crawled. If not provided, the Actor auto-detects `/sitemap.xml` and `/sitemap_index.xml`.
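
For reference, the URLs a sitemap contributes can be listed with the standard library. The sketch below assumes the sitemap uses the standard sitemaps.org namespace; `extract_sitemap_urls` is a hypothetical helper and not the Actor's parser. Because both `<urlset>` and `<sitemapindex>` documents store their URLs in `<loc>` elements, the same function handles either.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample_sitemap))
```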

## `webhookUrl` (type: `string`):

POST full results (JSON) to this URL when the audit completes. Useful for alerts and integrations.

## `googleSheetsId` (type: `string`):

Export broken links to a Google Sheet. Paste the spreadsheet ID (from the URL between /d/ and /edit).
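
If you only have the full sheet URL, the ID can be pulled out with a short regular expression. This sketch assumes the ID consists of letters, digits, hyphens, and underscores; the URL in the example is a made-up placeholder.

```python
import re
from typing import Optional

# Assumption: the ID is the path segment right after "/d/".
_SHEET_ID = re.compile(r"/d/([A-Za-z0-9_-]+)")


def spreadsheet_id_from_url(url: str) -> Optional[str]:
    """Return the spreadsheet ID from a Google Sheets URL, or None if absent."""
    match = _SHEET_ID.search(url)
    return match.group(1) if match else None


print(spreadsheet_id_from_url("https://docs.google.com/spreadsheets/d/EXAMPLE_ID_123/edit#gid=0"))
```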

## `googleServiceAccountKey` (type: `string`):

Full JSON contents of your Google Service Account key file. Required for Sheets export. See the README of the Google Sheets Reader & Writer Actor (`accurate_pouch/google-sheets-rw`) for a 5-minute setup guide.

## `dryRun` (type: `boolean`):

Preview what would be crawled without any charges. Results are still returned.

## Actor input object example

```json
{
  "startUrl": "https://example.com",
  "maxDepth": 3,
  "maxPages": 500,
  "checkExternalLinks": true,
  "respectRobotsTxt": true,
  "outputFormat": "broken-only",
  "dryRun": false
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://example.com"
};

// Run the Actor and wait for it to finish
const run = await client.actor("accurate_pouch/broken-link-checker").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrl": "https://example.com" }

# Run the Actor and wait for it to finish
run = client.actor("accurate_pouch/broken-link-checker").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://example.com"
}' |
apify call accurate_pouch/broken-link-checker --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=accurate_pouch/broken-link-checker",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Broken Link Checker — Recursive Site Crawler",
        "description": "Recursively crawl your website and find every broken link, 404, redirect, and timeout. Checks internal and external links with configurable depth. 100 links free per run.",
        "version": "0.1",
        "x-build-id": "7SRoDFaY8adrdi2af"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/accurate_pouch~broken-link-checker/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-accurate_pouch-broken-link-checker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/accurate_pouch~broken-link-checker/runs": {
            "post": {
                "operationId": "runs-sync-accurate_pouch-broken-link-checker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/accurate_pouch~broken-link-checker/run-sync": {
            "post": {
                "operationId": "run-sync-accurate_pouch-broken-link-checker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "The URL to start crawling from. All internal links will be followed up to maxDepth."
                    },
                    "maxDepth": {
                        "title": "Max Crawl Depth",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum link depth to follow from the start URL. 1 = only links on the start page. 3 = three levels deep.",
                        "default": 3
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl. Prevents runaway crawls on large sites.",
                        "default": 500
                    },
                    "checkExternalLinks": {
                        "title": "Check External Links",
                        "type": "boolean",
                        "description": "Also check links pointing to other domains (uses HEAD requests — fast and lightweight).",
                        "default": true
                    },
                    "respectRobotsTxt": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Skip pages disallowed by the site's robots.txt file.",
                        "default": true
                    },
                    "ignoredPatterns": {
                        "title": "Ignored URL Patterns",
                        "type": "array",
                        "description": "URL patterns to skip. Uses glob-style matching. Example: *logout*, *admin*, *.pdf",
                        "items": {
                            "type": "string"
                        }
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "broken-only",
                            "all"
                        ],
                        "type": "string",
                        "description": "broken-only: only report broken/redirect/error links. all: include every link checked.",
                        "default": "broken-only"
                    },
                    "sitemapUrl": {
                        "title": "Sitemap URL",
                        "type": "string",
                        "description": "URL to a sitemap.xml file. All URLs from the sitemap will be crawled. If not provided, the actor auto-detects /sitemap.xml and /sitemap_index.xml."
                    },
                    "webhookUrl": {
                        "title": "Webhook URL",
                        "type": "string",
                        "description": "POST full results (JSON) to this URL when the audit completes. Useful for alerts and integrations."
                    },
                    "googleSheetsId": {
                        "title": "Google Sheets Export — Spreadsheet ID",
                        "type": "string",
                        "description": "Export broken links to a Google Sheet. Paste the spreadsheet ID (from the URL between /d/ and /edit)."
                    },
                    "googleServiceAccountKey": {
                        "title": "Google Sheets Export — Service Account Key",
                        "type": "string",
                        "description": "Full JSON contents of your Google Service Account key file. Required for Sheets export. See google-sheets-rw actor README for 5-minute setup guide."
                    },
                    "dryRun": {
                        "title": "Dry Run",
                        "type": "boolean",
                        "description": "Preview what would be crawled without any charges. Results are still returned.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
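
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be used with nothing but the Python standard library. This is a minimal sketch: `build_run_request` is a hypothetical helper, and the specification does not describe the response body beyond its summary (the Actor's dataset items), so the commented-out parsing step is an assumption.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/accurate_pouch~broken-link-checker/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request described by the OpenAPI specification above."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request("<YOUR_API_TOKEN>", {"startUrl": "https://example.com"})
print(request.get_method(), request.full_url)

# Sending the request starts a real run and may incur charges:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Building the request separately from sending it makes it possible to inspect the URL and payload before anything is billed.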
