Find Broken Links

Pricing

from $1.00 / 1,000 results

Crawl a website (start URL + same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only — no proxy or browser needed.

Rating

5.0

(21)

Developer

Crawler Bros

Maintained by Community

Actor stats

Bookmarked: 21
Total users: 2
Monthly active users: 1
Last modified: 6 days ago

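Crawl a website and report every link that returns a 4xx / 5xx status, times out, or fails DNS. Runs are bounded by maxCrawlDepth and maxPages, so they stay predictable on large sites. HTTP-only: the crawl itself needs no proxy and no browser; a residential proxy is used only for the optional verifyWithProxy recheck.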

What it does

You give it a start URL; the actor crawls the start page (and optionally same-host internal links up to a depth N), gathers every <a href>, and probes each one with HEAD (falling back to GET when servers reject HEAD). Records are emitted only for links that fail.
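The link-gathering step can be sketched with the Python standard library. This is a minimal illustration, not the actor's actual code: `AnchorCollector`, `extract_links`, and `is_same_host` are hypothetical names, and the choice to compare hosts by lowercased netloc is an assumption.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.page_url, href.strip())
        # Only http(s) links can be probed; mailto:, javascript:, data: are skipped.
        if urlsplit(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def extract_links(page_url: str, html: str) -> list[str]:
    """Return absolute http(s) links found in server-rendered HTML."""
    collector = AnchorCollector(page_url)
    collector.feed(html)
    return collector.links


def is_same_host(url: str, start_url: str) -> bool:
    """Assumed rule: internal means the same lowercased netloc as the start URL."""
    return urlsplit(url).netloc.lower() == urlsplit(start_url).netloc.lower()
```

Links injected by JavaScript never appear in the HTML this parser sees, which matches the server-rendered-only limitation noted in the FAQ.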

The dataset is never empty — even a perfectly-clean site gets a final summary record with run statistics.

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string (required) | https://apify.com | Page to start crawling from. Must be http:// or https://. |
| maxCrawlDepth | integer | 1 (0–5) | 0 = check links on the start URL only; 1 = also follow internal links one level deep and check the links on those pages; higher values go deeper. |
| maxPages | integer | 50 (1–5000) | Hard cap on pages crawled. |
| checkExternalLinks | boolean | true | Also probe links that leave the start URL's host. |
| verifyWithProxy | boolean | true | When a link returns 401 / 403 / 405 / 429 / 451 (typical anti-bot signals), retry once via Apify residential proxy. If the proxy retry succeeds, the link is treated as OK, which eliminates false positives from sites that block datacenter IPs (G2, Capterra, etc.). Turn off to skip the retry pass. |
| maxConcurrency | integer | 10 (1–50) | Concurrent HEAD/GET requests during the check phase. |
| userAgent | string (optional) | (Chrome 131) | Override only if a target server filters by UA. |
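The defaults and ranges in the table above can be checked before starting a run. The sketch below is one possible client-side check; `normalize_input` is a hypothetical helper, and rejecting out-of-range values (rather than clamping them) is an assumption, not documented actor behavior.

```python
from urllib.parse import urlsplit

# (default, minimum, maximum) copied from the input table.
INT_FIELDS = {
    "maxCrawlDepth": (1, 0, 5),
    "maxPages": (50, 1, 5000),
    "maxConcurrency": (10, 1, 50),
}
BOOL_DEFAULTS = {"checkExternalLinks": True, "verifyWithProxy": True}


def normalize_input(raw: dict) -> dict:
    """Apply table defaults and raise ValueError on invalid values."""
    start_url = str(raw.get("startUrl") or "")
    if urlsplit(start_url).scheme not in ("http", "https"):
        raise ValueError("startUrl must be http:// or https://")
    result = {"startUrl": start_url}
    for name, (default, low, high) in INT_FIELDS.items():
        value = raw.get(name, default)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        result[name] = value
    for name, default in BOOL_DEFAULTS.items():
        result[name] = bool(raw.get(name, default))
    if raw.get("userAgent"):
        result["userAgent"] = raw["userAgent"]
    return result
```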

Example input

{
  "startUrl": "https://apify.com",
  "maxCrawlDepth": 1,
  "maxPages": 50,
  "checkExternalLinks": true,
  "maxConcurrency": 10
}

Output

{
  "url": "https://example.com/old-blog-post",
  "sourcePage": "https://apify.com/blog/index",
  "anchorText": "Read more",
  "linkType": "external",
  "linkDomain": "example.com",
  "isExternalLink": true,
  "httpStatus": 404,
  "errorReason": "not_found",
  "proxyRecheckStatus": 404,
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

Summary record (always emitted last)

{
  "_recordType": "summary",
  "startUrl": "https://apify.com",
  "pagesCrawled": 12,
  "linksDiscovered": 480,
  "linksChecked": 480,
  "brokenCount": 3,
  "okCount": 477,
  "breakdown": {"not_found": 2, "server_error": 1},
  "maxCrawlDepth": 1,
  "checkExternalLinks": true,
  "scrapedAt": "2024-12-16T14:23:18+00:00"
}
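Because the summary record lives in the same dataset as the broken-link records, a consumer will usually want to separate the two. A minimal sketch, assuming the dataset items have already been loaded as a list of dicts; `split_records` and `group_by_source` are hypothetical helper names.

```python
from __future__ import annotations

from collections import defaultdict


def split_records(items: list[dict]) -> tuple[list[dict], dict | None]:
    """Separate broken-link records from the summary record."""
    broken = [item for item in items if item.get("_recordType") != "summary"]
    summary = next(
        (item for item in items if item.get("_recordType") == "summary"), None
    )
    return broken, summary


def group_by_source(broken: list[dict]) -> dict[str, list[str]]:
    """Map each sourcePage to the broken URLs first discovered on it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for record in broken:
        grouped[record["sourcePage"]].append(record["url"])
    return dict(grouped)
```

Grouping by sourcePage gives a per-page fix list, which tends to be more practical for editorial QA than a flat list of URLs.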

Output fields

  • url — the broken link's absolute URL.
  • sourcePage — page where the link was first discovered.
  • anchorText — visible text of the <a> element (when present).
  • linkType — "internal" (same host as start URL) or "external".
  • linkDomain — derived hostname of the broken url (lowercase, includes any port).
  • isExternalLink — derived boolean: true when the broken link's host differs from sourcePage's host.
  • httpStatus — HTTP status code (omitted for network errors / timeouts).
  • errorReason — one of:
    • not_found (404), gone (410), forbidden (403), unauthorized (401), server_error (5xx), client_error_<NNN> (other 4xx)
    • timeout, dns_error, connection_refused, tls_error, redirect_loop, network_error
  • proxyRecheckStatus — only present when verifyWithProxy: true triggered a retry. Shows the status returned via residential proxy (use this to distinguish real broken links from anti-bot blocks).
  • scrapedAt — ISO-8601 timestamp.
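The status-to-errorReason mapping and the two derived fields listed above can be illustrated as follows. This is a sketch reconstructed from the field descriptions, not the actor's source; `error_reason`, `host_with_port`, and `derived_fields` are hypothetical names.

```python
from __future__ import annotations

from urllib.parse import urlsplit

NAMED_REASONS = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    410: "gone",
}


def error_reason(status: int) -> str | None:
    """Map an HTTP status to an errorReason; None means the link is not broken."""
    if status in NAMED_REASONS:
        return NAMED_REASONS[status]
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return f"client_error_{status}"
    return None


def host_with_port(url: str) -> str:
    """Lowercase hostname, keeping an explicit port if the URL has one."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases hostname
    return f"{host}:{parts.port}" if parts.port else host


def derived_fields(url: str, source_page: str) -> dict:
    """Compute linkDomain and isExternalLink as described in the field list."""
    link_domain = host_with_port(url)
    return {
        "linkDomain": link_domain,
        "isExternalLink": link_domain != host_with_port(source_page),
    }
```

Network-level reasons such as timeout or dns_error have no status code, so they would be assigned from the exception type instead; that path is left out of this sketch.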

Use cases

  • SEO audits — every broken link costs link equity and damages user trust.
  • Site migration validation — after a CMS move, find the URLs that didn't get redirected.
  • Editorial QA — catch dead links in blog content, reference pages, footer navigation.
  • Internal-tools health — spot broken links to deprecated wikis, retired tools, expired SSO redirects.

FAQ

Does it need a proxy? For the bulk crawl, no — the actor uses curl_cffi with a Chrome User-Agent from a datacenter IP. Optionally, when verifyWithProxy: true (default), any link that returns 401 / 403 / 405 / 429 / 451 is retried once via Apify residential proxy. If that retry succeeds, the link is treated as OK — this eliminates the false positives that used to surface from sites like G2, Capterra, or rate-limited APIs. The retried status is surfaced as proxyRecheckStatus so you can see both checks.
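The recheck decision described above can be sketched as a pure function. The proxy request is passed in as a callable so the logic runs without network access; `recheck_verdict` is a hypothetical name, and treating any proxy status below 400 as "succeeds" is an assumption.

```python
from typing import Callable

# Statuses the README lists as typical anti-bot signals.
ANTI_BOT_STATUSES = {401, 403, 405, 429, 451}


def recheck_verdict(direct_status: int, probe_via_proxy: Callable[[], int]) -> dict:
    """Decide whether a link is broken, retrying once via proxy when warranted."""
    if direct_status not in ANTI_BOT_STATUSES:
        # No retry: the direct status is final.
        return {"httpStatus": direct_status, "broken": direct_status >= 400}
    proxy_status = probe_via_proxy()  # called at most once per link
    return {
        "httpStatus": direct_status,
        "proxyRecheckStatus": proxy_status,
        # Assumption: a proxy status below 400 counts as a successful retry.
        "broken": proxy_status >= 400,
    }
```

Keeping both statuses in the result mirrors the record layout, where httpStatus and proxyRecheckStatus sit side by side.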

HEAD vs GET — which is used? HEAD first (saves bandwidth). If a server returns 405 or 501, the actor falls back to GET and uses that status instead.
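A minimal sketch of the HEAD-then-GET fallback. The `send` callable is a stand-in for whatever HTTP client issues the request and returns the final status after redirects; it and `probe_status` are hypothetical names, not part of the actor's API.

```python
from typing import Callable

# Statuses that indicate the server does not support HEAD for this resource.
HEAD_REJECTED = (405, 501)


def probe_status(url: str, send: Callable[[str, str], int]) -> int:
    """Try HEAD first; if the server rejects it, use the GET status instead."""
    status = send("HEAD", url)
    if status in HEAD_REJECTED:
        status = send("GET", url)
    return status
```

Injecting `send` keeps the fallback rule testable in isolation: a fake that returns 405 for HEAD and 200 for GET should yield 200.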

Will it follow redirects? Yes — allow_redirects=True for both HEAD and GET. The final status is what gets recorded.

Can I limit it to internal links only? Set checkExternalLinks: false. The actor still walks the same-host graph for discovery but only probes internal links.

Why is the dataset never empty? Even when no broken links are found, a _recordType: "summary" record is emitted with run stats. This keeps Apify's daily-test happy and gives you a quick health pulse for the site.

My start URL has thousands of pages — will this finish in time? Use maxPages and maxCrawlDepth to keep runs bounded. For large sites, consider running with maxCrawlDepth: 0 first to audit the start page's links, then expand outward.
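How the two limits bound a run can be shown with a breadth-first traversal. This is an illustrative sketch under the assumption that the crawl is breadth-first; `crawl_pages` and `fetch_internal_links` are hypothetical names, and the real actor may order or deduplicate pages differently.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable


def crawl_pages(
    start_url: str,
    fetch_internal_links: Callable[[str], Iterable[str]],
    max_crawl_depth: int = 1,
    max_pages: int = 50,
) -> list[str]:
    """Return the pages visited, bounded by depth and by total page count."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled: list[str] = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_crawl_depth:
            continue  # page is visited, but its internal links are not followed
        for link in fetch_internal_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled
```

With max_crawl_depth set to 0 only the start page is visited, which is the cheap first pass suggested above; raising the depth widens the crawl one level at a time while max_pages keeps the total capped.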

The summary says brokenCount: 0 but I know some links are dead.

  • The link may use a non-HTTP scheme (mailto:, javascript:, data:); those aren't checkable.
  • The link may be JS-rendered (this scraper sees only server-rendered HTML).
  • The target may return a different status to a generic crawler than to regular visitors; try overriding the User-Agent via userAgent.