# Find Broken Links (`crawlerbros/find-broken-links`) Actor

Crawl a website (start URL + same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only — no proxy or browser needed.

- **URL**: https://apify.com/crawlerbros/find-broken-links.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** SEO tools, Developer tools, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 21 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.
Since this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Find Broken Links

Crawl a website and report every link that returns a 4xx / 5xx status, times out, or fails DNS resolution. Bounded by `maxCrawlDepth` and `maxPages` so it stays predictable on large sites. HTTP-only: no browser, and a proxy is used only for the optional `verifyWithProxy` recheck.

### What it does

You give it a start URL; the Actor crawls the start page (and optionally same-host internal links up to depth N), gathers every `<a href>`, and probes each one with HEAD (falling back to GET when servers reject HEAD). Records are emitted only for links that fail.
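
The HEAD-then-GET probe can be sketched as follows. This is a minimal illustration rather than the Actor's actual code: `probe_status` and the injected `send` callable are hypothetical names, and treating 405 and 501 as "HEAD rejected" is taken from the FAQ below.

```python
from typing import Callable

# Statuses that signal "HEAD not supported" (per the FAQ: 405 or 501).
HEAD_REJECTED = {405, 501}


def probe_status(url: str, send: Callable[[str, str], int]) -> int:
    """Return the status for `url`, trying HEAD first and GET as a fallback.

    `send(method, url)` is a hypothetical transport hook that performs the
    request and returns the final HTTP status code.
    """
    status = send("HEAD", url)
    if status in HEAD_REJECTED:
        # The server rejected HEAD, so the GET status is the one that counts.
        status = send("GET", url)
    return status


def fake_send(method: str, url: str) -> int:
    # Stand-in transport: this pretend server rejects HEAD but serves GET.
    return 405 if method == "HEAD" else 200


print(probe_status("https://example.com/page", fake_send))
```

Injecting the transport keeps the sketch runnable offline; a real implementation would pass a function backed by an HTTP client.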

The dataset is never empty: even a perfectly clean site gets a final summary record with run statistics.

### Input

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string (required) | `https://apify.com` | Page to start crawling from. Must be `http://` or `https://`. |
| `maxCrawlDepth` | integer | `1` (0–5) | 0 = check links on the start URL only; 1 = also follow internal links from the start page and check links on those pages; higher values widen the crawl. |
| `maxPages` | integer | `50` (1–5000) | Hard cap on pages crawled. |
| `checkExternalLinks` | boolean | `true` | Also probe links that leave the start URL's host. |
| `verifyWithProxy` | boolean | `true` | When a link returns `401 / 403 / 405 / 429 / 451` (typical anti-bot signals), retry once via Apify residential proxy. If the proxy retry succeeds the link is treated as OK — eliminates false positives from sites that block datacenter IPs (G2, Capterra, etc.). Turn off to skip the retry pass. |
| `maxConcurrency` | integer | `10` (1–50) | Concurrent HEAD/GET requests during the check phase. |
| `userAgent` | string (optional) | (Chrome 131) | Override only if a target server filters by UA. |

#### Example input

```json
{
  "startUrl": "https://apify.com",
  "maxCrawlDepth": 1,
  "maxPages": 50,
  "checkExternalLinks": true,
  "maxConcurrency": 10
}
````

### Output

#### Broken-link record (one per failure)

```json
{
  "url": "https://example.com/old-blog-post",
  "sourcePage": "https://apify.com/blog/index",
  "anchorText": "Read more",
  "linkType": "external",
  "linkDomain": "example.com",
  "isExternalLink": true,
  "httpStatus": 404,
  "errorReason": "not_found",
  "proxyRecheckStatus": 404,
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
```

#### Summary record (always emitted last)

```json
{
  "_recordType": "summary",
  "startUrl": "https://apify.com",
  "pagesCrawled": 12,
  "linksDiscovered": 480,
  "linksChecked": 480,
  "brokenCount": 3,
  "okCount": 477,
  "breakdown": {"not_found": 2, "server_error": 1},
  "maxCrawlDepth": 1,
  "checkExternalLinks": true,
  "scrapedAt": "2024-12-16T14:23:18+00:00"
}
```
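
Because the summary shares the dataset with the broken-link records, consumers usually need to separate the two. A minimal sketch, assuming the items have already been downloaded as a list of dictionaries; `split_results` is a hypothetical helper and the sample items are made up for illustration:

```python
from typing import Any, Dict, List, Optional, Tuple

Item = Dict[str, Any]


def split_results(items: List[Item]) -> Tuple[List[Item], Optional[Item]]:
    """Separate broken-link records from the summary record."""
    broken: List[Item] = []
    summary: Optional[Item] = None
    for item in items:
        if item.get("_recordType") == "summary":
            summary = item
        else:
            broken.append(item)
    return broken, summary


# Illustrative sample shaped like the records documented above.
sample_items = [
    {"url": "https://example.com/old-blog-post", "httpStatus": 404,
     "errorReason": "not_found"},
    {"_recordType": "summary", "brokenCount": 1, "okCount": 479},
]

broken_links, run_summary = split_results(sample_items)
print(len(broken_links), run_summary["brokenCount"])
```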

#### Output fields

- **`url`** — the broken link's absolute URL.
- **`sourcePage`** — page where the link was first discovered.
- **`anchorText`** — visible text of the `<a>` element (when present).
- **`linkType`** — `"internal"` (same host as start URL) or `"external"`.
- **`linkDomain`** — derived hostname of the broken `url` (lowercase, includes any port).
- **`isExternalLink`** — derived boolean: `true` when the broken link's host differs from `sourcePage`'s host.
- **`httpStatus`** — HTTP status code (omitted for network errors / timeouts).
- **`errorReason`** — one of:
  - `not_found` (404), `gone` (410), `forbidden` (403), `unauthorized` (401), `server_error` (5xx), `client_error_<NNN>` (other 4xx)
  - `timeout`, `dns_error`, `connection_refused`, `tls_error`, `redirect_loop`, `network_error`
- **`proxyRecheckStatus`** — only present when `verifyWithProxy: true` triggered a retry. Shows the status returned via residential proxy (use this to distinguish real broken links from anti-bot blocks).
- **`scrapedAt`** — ISO-8601 timestamp.
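
The status-based `errorReason` values listed above can be reproduced with a small lookup. This is a sketch of the documented mapping only, not the Actor's source; `error_reason_for_status` is a hypothetical name, and network-level reasons such as `timeout` or `dns_error` are out of scope because they have no HTTP status.

```python
from typing import Optional

# Statuses that get a dedicated name in the documented list.
NAMED_REASONS = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    410: "gone",
}


def error_reason_for_status(status: int) -> Optional[str]:
    """Map an HTTP status to an errorReason, or None if it is not an error."""
    if status in NAMED_REASONS:
        return NAMED_REASONS[status]
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return f"client_error_{status}"
    return None


for code in (404, 410, 429, 503, 200):
    print(code, error_reason_for_status(code))
```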

### Use cases

- **SEO audits** — every broken link costs link equity and damages user trust.
- **Site migration validation** — after a CMS move, find the URLs that didn't get redirected.
- **Editorial QA** — catch dead links in blog content, reference pages, footer navigation.
- **Internal-tools health** — spot broken links to deprecated wikis, retired tools, expired SSO redirects.

### FAQ

**Does it need a proxy?**
For the bulk crawl, no: the Actor uses `curl_cffi` with a Chrome User-Agent from a datacenter IP. When `verifyWithProxy` is `true` (the default), any link that returns `401 / 403 / 405 / 429 / 451` is retried once via Apify residential proxy. If that retry succeeds, the link is treated as OK, which eliminates the false positives that used to surface from sites like G2, Capterra, or rate-limited APIs. The retried status is surfaced as `proxyRecheckStatus` so you can see both checks.
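
The recheck rule described here can be expressed as two small predicates. This is a hedged sketch of the documented behavior, not the Actor's code: the function names are hypothetical, and "succeeds" is read as a 2xx status, as stated in the input schema.

```python
from typing import Optional

# Statuses the documentation lists as typical anti-bot signals.
RECHECK_STATUSES = {401, 403, 405, 429, 451}


def needs_proxy_recheck(status: int, verify_with_proxy: bool = True) -> bool:
    """True when a first-pass status should be retried via proxy."""
    return verify_with_proxy and status in RECHECK_STATUSES


def is_broken(first_status: int, proxy_status: Optional[int] = None) -> bool:
    """Final verdict: a 2xx proxy retry overrides the first-pass failure."""
    if proxy_status is not None and 200 <= proxy_status <= 299:
        return False
    return first_status >= 400


print(needs_proxy_recheck(403))  # 403 is on the recheck list
print(is_broken(403, 200))       # proxy retry succeeded, link treated as OK
print(is_broken(403, 403))       # retry also failed, link stays reported
```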

**HEAD vs GET — which is used?**
HEAD first (saves bandwidth). If a server returns 405 or 501, the Actor falls back to GET and uses that status instead.

**Will it follow redirects?**
Yes — `allow_redirects=True` for both HEAD and GET. The final status is what gets recorded.

**Can I limit it to internal links only?**
Set `checkExternalLinks: false`. The Actor still walks the same-host graph for discovery but only probes internal links.
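
A sketch of how internal and external links might be told apart, using only the standard library. The helper names are hypothetical, and comparing lowercase host plus port is an assumption based on the `linkDomain` field description; the Actor's exact comparison is not documented.

```python
from typing import List
from urllib.parse import urlsplit


def link_domain(url: str) -> str:
    """Lowercase hostname of `url`, including the port when one is given."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    return f"{host}:{parts.port}" if parts.port else host


def is_external(url: str, reference_url: str) -> bool:
    """True when `url` lives on a different host than `reference_url`."""
    return link_domain(url) != link_domain(reference_url)


def links_to_probe(links: List[str], start_url: str,
                   check_external_links: bool) -> List[str]:
    """Drop external links when checkExternalLinks is false."""
    if check_external_links:
        return list(links)
    return [link for link in links if not is_external(link, start_url)]


found = ["https://apify.com/pricing", "https://example.com/old-blog-post"]
print(links_to_probe(found, "https://apify.com", check_external_links=False))
```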

**Why is the dataset never empty?**
Even when no broken links are found, a `_recordType: "summary"` record is emitted with run stats. This keeps Apify's automated daily test passing and gives you a quick health pulse for the site.

**My start URL has thousands of pages — will this finish in time?**
Use `maxPages` and `maxCrawlDepth` to keep runs bounded. For large sites, consider running with `maxCrawlDepth: 0` first to audit the start page's links, then expand outward.
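
To see how the two limits interact, here is a minimal breadth-first sketch over an in-memory link graph. It is an illustration under assumptions, not the Actor's crawler: `bounded_crawl` and the `site` graph are hypothetical, and the depth semantics follow the input description (0 = start page only).

```python
from collections import deque
from typing import Dict, List


def bounded_crawl(start: str, links: Dict[str, List[str]],
                  max_depth: int, max_pages: int) -> List[str]:
    """Return the pages a depth- and page-limited BFS would visit, in order."""
    crawled: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(crawled) < max_pages:
        page, depth = queue.popleft()
        crawled.append(page)
        if depth >= max_depth:
            continue  # at the depth limit: do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled


# Hypothetical same-host link graph: page -> internal links found on it.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}

print(bounded_crawl("/", site, max_depth=0, max_pages=50))
print(bounded_crawl("/", site, max_depth=1, max_pages=50))
print(bounded_crawl("/", site, max_depth=2, max_pages=2))
```

Starting at depth 0 and raising the limits step by step shows how quickly the page count grows before committing to a large run.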

**The summary says `brokenCount: 0` but I know some links are dead.**

- The link may use a non-HTTP scheme (`mailto:`, `javascript:`, `data:`); those aren't checkable.
- The link may be JS-rendered (this Actor sees only server-rendered HTML).
- The target may return a different status to a generic crawler than to a regular browser; try overriding the User-Agent via `userAgent`.
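
The first point can be checked locally before blaming the crawl. A small sketch, assuming links are resolved against the page URL and only `http` / `https` results are probed; `resolve_checkable` is a hypothetical helper, not part of the Actor:

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_checkable(page_url: str, href: str) -> Optional[str]:
    """Resolve `href` against `page_url`; return None for non-HTTP schemes."""
    absolute = urljoin(page_url, href.strip())
    if urlsplit(absolute).scheme in ("http", "https"):
        return absolute
    return None


page = "https://apify.com/blog/post"
for href in ("../pricing", "mailto:hello@example.com", "javascript:void(0)"):
    print(href, "->", resolve_checkable(page, href))
```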

# Actor input Schema

## `startUrl` (type: `string`):

Page to start crawling from. Internal links discovered on this page (and on any further pages crawled to depth N) get checked.

## `maxCrawlDepth` (type: `integer`):

0 = only check links on the start URL. 1 = also follow internal links from the start page and check links on those. Higher values widen the crawl.

## `maxPages` (type: `integer`):

Hard cap on how many pages the crawler fetches. Helps keep runs bounded for large sites.

## `checkExternalLinks` (type: `boolean`):

Also probe links that point to other domains. Disable to only check links within the start URL's host (faster, less noisy).

## `verifyWithProxy` (type: `boolean`):

When a link returns 401/403/405/429/451 — often anti-bot blocks rather than truly broken — retry once via Apify residential proxy. If the proxy retry returns 2xx, the link is treated as OK (avoids false positives). Adds slight cost; turn off to skip retries.

## `maxConcurrency` (type: `integer`):

How many HEAD/GET requests to run in parallel during the link-check phase.
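
A concurrency cap like this is commonly implemented with a semaphore. The sketch below shows the general technique with `asyncio`; it is an assumption about how such a limit could work, not the Actor's implementation, and `check_all` and `fake_check` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def check_all(urls: List[str],
                    check: Callable[[str], Awaitable[int]],
                    max_concurrency: int = 10) -> Dict[str, int]:
    """Run `check` on every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> Tuple[str, int]:
        async with semaphore:
            return url, await check(url)

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(pairs)


# Stand-in checker that records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}


async def fake_check(url: str) -> int:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    stats["in_flight"] -= 1
    return 200


demo_urls = [f"https://example.com/{i}" for i in range(20)]
results = asyncio.run(check_all(demo_urls, fake_check, max_concurrency=3))
print(len(results), "checked; peak concurrency:", stats["peak"])
```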

## `userAgent` (type: `string`):

Override the default Chrome User-Agent. Most sites accept the default; only set this if a target server filters by UA.

## Actor input object example

```json
{
  "startUrl": "https://apify.com",
  "maxCrawlDepth": 1,
  "maxPages": 50,
  "checkExternalLinks": true,
  "verifyWithProxy": true,
  "maxConcurrency": 10
}
```
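
Before starting a run, the input can be checked client-side against the documented constraints. A minimal sketch: `validate_input` is a hypothetical helper, and the ranges are copied from the input table and OpenAPI schema in this document.

```python
from typing import Any, Dict, List

# Inclusive ranges taken from the documented input schema.
RANGES = {
    "maxCrawlDepth": (0, 5),
    "maxPages": (1, 5000),
    "maxConcurrency": (1, 50),
}


def validate_input(actor_input: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors: List[str] = []
    start_url = actor_input.get("startUrl")
    if not isinstance(start_url, str) or not start_url.startswith(
            ("http://", "https://")):
        errors.append("startUrl must be an http:// or https:// URL")
    for field, (low, high) in RANGES.items():
        if field not in actor_input:
            continue  # optional field, the Actor applies its default
        value = actor_input[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) \
                or not low <= value <= high:
            errors.append(f"{field} must be an integer between {low} and {high}")
    return errors


print(validate_input({"startUrl": "https://apify.com", "maxPages": 50}))
print(validate_input({"startUrl": "ftp://apify.com", "maxPages": 0}))
```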

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://apify.com",
    "maxCrawlDepth": 1,
    "maxPages": 50
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/find-broken-links").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrl": "https://apify.com",
    "maxCrawlDepth": 1,
    "maxPages": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/find-broken-links").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://apify.com",
  "maxCrawlDepth": 1,
  "maxPages": 50
}' |
apify call crawlerbros/find-broken-links --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/find-broken-links",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Find Broken Links",
        "description": "Crawl a website (start URL + same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only — no proxy or browser needed.",
        "version": "0.1",
        "x-build-id": "aZl2HVqxXEWn8nyXY"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~find-broken-links/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-find-broken-links",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~find-broken-links/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-find-broken-links",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~find-broken-links/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-find-broken-links",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "Page to start crawling from. Internal links discovered on this page (and on any further pages crawled to depth N) get checked."
                    },
                    "maxCrawlDepth": {
                        "title": "Maximum crawl depth",
                        "minimum": 0,
                        "maximum": 5,
                        "type": "integer",
                        "description": "0 = only check links on the start URL. 1 = also follow internal links from the start page and check links on those. Higher values widen the crawl.",
                        "default": 1
                    },
                    "maxPages": {
                        "title": "Maximum pages to crawl",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Hard cap on how many pages the crawler fetches. Helps keep runs bounded for large sites.",
                        "default": 50
                    },
                    "checkExternalLinks": {
                        "title": "Check external links",
                        "type": "boolean",
                        "description": "Also probe links that point to other domains. Disable to only check links within the start URL's host (faster, less noisy).",
                        "default": true
                    },
                    "verifyWithProxy": {
                        "title": "Verify suspicious 4xx via residential proxy",
                        "type": "boolean",
                        "description": "When a link returns 401/403/405/429/451 — often anti-bot blocks rather than truly broken — retry once via Apify residential proxy. If the proxy retry returns 2xx, the link is treated as OK (avoids false positives). Adds slight cost; turn off to skip retries.",
                        "default": true
                    },
                    "maxConcurrency": {
                        "title": "Maximum concurrent link checks",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "How many HEAD/GET requests to run in parallel during link-check phase.",
                        "default": 10
                    },
                    "userAgent": {
                        "title": "Custom User-Agent (optional)",
                        "type": "string",
                        "description": "Override the default Chrome User-Agent. Most sites accept the default; only set this if a target server filters by UA."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
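
For projects that call the REST API directly, the `run-sync-get-dataset-items` endpoint from the specification above can be addressed with the standard library alone. This sketch only builds the request so it runs without a token or network access; `build_run_sync_request` is a hypothetical helper, and sending the request is left as a comment.

```python
import json
from typing import Any, Dict
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "crawlerbros~find-broken-links"


def build_run_sync_request(token: str, actor_input: Dict[str, Any]) -> Request:
    """Build a POST request that runs the Actor and returns dataset items."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


request = build_run_sync_request(
    "<YOUR_API_TOKEN>", {"startUrl": "https://apify.com", "maxCrawlDepth": 0})
print(request.get_method(), request.full_url)

# To actually send it (requires a valid token and network access):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     items = json.load(response)
```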
