Find Broken Links
Pricing
from $1.00 / 1,000 results
Crawl a website (start URL + same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only — no proxy or browser needed.
Developer
Crawler Bros
Maintained by Community
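Crawl a website and report every link that returns a 4xx / 5xx status, times out, or fails DNS resolution. Runs are bounded by maxCrawlDepth and maxPages, so they stay predictable on large sites. The crawl is HTTP-only: no browser, and no proxy for the bulk crawl (the optional verifyWithProxy recheck, on by default, is the only step that uses one).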
What it does
You give it a start URL; the actor crawls the start page (and optionally same-host internal links up to a depth N), gathers every <a href>, and probes each one with HEAD (falling back to GET when servers reject HEAD). Records are emitted only for links that fail.
The dataset is never empty — even a perfectly-clean site gets a final summary record with run statistics.
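The discovery step described above can be sketched with the Python standard library. This is an illustration only, not the actor's actual code: `AnchorCollector` and `extract_links` are hypothetical names, and the actor may parse HTML differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a href> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, page_url, start_url):
    """Return checkable links on a page, tagged internal or external."""
    collector = AnchorCollector()
    collector.feed(html)
    start_host = urlsplit(start_url).netloc.lower()
    results = []
    for href, text in collector.links:
        absolute = urljoin(page_url, href)  # resolve relative hrefs
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, data: cannot be probed over HTTP
        same_host = parts.netloc.lower() == start_host
        results.append({
            "url": absolute,
            "sourcePage": page_url,
            "anchorText": text,
            "linkType": "internal" if same_host else "external",
        })
    return results
```

Each returned dictionary uses the same field names as the output records, so a failed probe could be turned into a record by adding the status and error reason.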
Input
| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string (required) | https://apify.com | Page to start crawling from. Must be http:// or https://. |
| maxCrawlDepth | integer | 1 (0–5) | 0 = check links on the start URL only; 1 = also crawl internal pages linked from the start page and check their links; higher values follow internal links that many levels deep. |
| maxPages | integer | 50 (1–5000) | Hard cap on pages crawled. |
| checkExternalLinks | boolean | true | Also probe links that leave the start URL's host. |
| verifyWithProxy | boolean | true | When a link returns 401 / 403 / 405 / 429 / 451 (typical anti-bot signals), retry once via Apify residential proxy. If the proxy retry succeeds, the link is treated as OK, which eliminates false positives from sites that block datacenter IPs (G2, Capterra, etc.). Turn off to skip the retry pass. |
| maxConcurrency | integer | 10 (1–50) | Concurrent HEAD/GET requests during the check phase. |
| userAgent | string (optional) | (Chrome 131) | Override only if a target server filters by UA. |
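If you build inputs programmatically, it can help to apply the documented defaults and ranges before starting a run. The sketch below is a hypothetical client-side helper (`normalize_input` is not part of the actor), and it assumes out-of-range values should be rejected; the actor itself may clamp or validate them differently.

```python
from urllib.parse import urlsplit

# Defaults and inclusive ranges copied from the input table.
DEFAULTS = {
    "maxCrawlDepth": 1,
    "maxPages": 50,
    "checkExternalLinks": True,
    "verifyWithProxy": True,
    "maxConcurrency": 10,
}
RANGES = {
    "maxCrawlDepth": (0, 5),
    "maxPages": (1, 5000),
    "maxConcurrency": (1, 50),
}


def normalize_input(raw):
    """Hypothetical helper: fill in defaults and reject invalid values."""
    start_url = raw.get("startUrl")
    if not start_url or urlsplit(start_url).scheme not in ("http", "https"):
        raise ValueError("startUrl is required and must be http:// or https://")
    config = {**DEFAULTS, **raw}  # caller-supplied values override defaults
    for field, (low, high) in RANGES.items():
        if not low <= config[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    return config
```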
Example input

    {
      "startUrl": "https://apify.com",
      "maxCrawlDepth": 1,
      "maxPages": 50,
      "checkExternalLinks": true,
      "maxConcurrency": 10
    }

Output
Broken-link record (one per failure)

    {
      "url": "https://example.com/old-blog-post",
      "sourcePage": "https://apify.com/blog/index",
      "anchorText": "Read more",
      "linkType": "external",
      "linkDomain": "example.com",
      "isExternalLink": true,
      "httpStatus": 404,
      "errorReason": "not_found",
      "proxyRecheckStatus": 404,
      "scrapedAt": "2024-12-16T14:23:11+00:00"
    }

Summary record (always emitted last)

    {
      "_recordType": "summary",
      "startUrl": "https://apify.com",
      "pagesCrawled": 12,
      "linksDiscovered": 480,
      "linksChecked": 480,
      "brokenCount": 3,
      "okCount": 477,
      "breakdown": {"not_found": 2, "server_error": 1},
      "maxCrawlDepth": 1,
      "checkExternalLinks": true,
      "scrapedAt": "2024-12-16T14:23:18+00:00"
    }

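Because the dataset mixes broken-link records with one summary record, downstream code usually separates them first. A minimal sketch, assuming the dataset items have already been loaded as a list of dictionaries; `split_records` and `group_by_source` are hypothetical helper names:

```python
def split_records(items):
    """Separate the summary record from the broken-link records."""
    summary = None
    broken = []
    for item in items:
        if item.get("_recordType") == "summary":
            summary = item
        else:
            broken.append(item)
    return summary, broken


def group_by_source(broken):
    """Map each source page to the broken URLs discovered on it."""
    pages = {}
    for record in broken:
        pages.setdefault(record["sourcePage"], []).append(record["url"])
    return pages
```

Grouping by sourcePage gives an editor a per-page fix list rather than a flat list of dead URLs.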
Output fields
- url: the broken link's absolute URL.
- sourcePage: page where the link was first discovered.
- anchorText: visible text of the <a> element (when present).
- linkType: "internal" (same host as start URL) or "external".
- linkDomain: derived hostname of the broken url (lowercase, includes any port).
- isExternalLink: derived boolean; true when the broken link's host differs from sourcePage's host.
- httpStatus: HTTP status code (omitted for network errors / timeouts).
- errorReason: one of not_found (404), gone (410), forbidden (403), unauthorized (401), server_error (5xx), client_error_<NNN> (other 4xx), timeout, dns_error, connection_refused, tls_error, redirect_loop, network_error.
- proxyRecheckStatus: only present when verifyWithProxy: true triggered a retry. Shows the status returned via residential proxy (use this to distinguish real broken links from anti-bot blocks).
- scrapedAt: ISO-8601 timestamp.
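The status-based errorReason values follow a simple mapping from the HTTP status code. The function below reconstructs that mapping from the field description; `error_reason` is a hypothetical name and the actor's internal logic may differ. Network-level reasons such as timeout or dns_error have no status code and are not covered here.

```python
# Statuses that get a dedicated label in the errorReason list.
NAMED_STATUSES = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    410: "gone",
}


def error_reason(status):
    """Return the errorReason label for a status, or None if it is not a failure."""
    if status in NAMED_STATUSES:
        return NAMED_STATUSES[status]
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return f"client_error_{status}"  # e.g. client_error_429
    return None
```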
Use cases
- SEO audits — every broken link costs link equity and damages user trust.
- Site migration validation — after a CMS move, find the URLs that didn't get redirected.
- Editorial QA — catch dead links in blog content, reference pages, footer navigation.
- Internal-tools health — spot broken links to deprecated wikis, retired tools, expired SSO redirects.
FAQ
Does it need a proxy?
For the bulk crawl, no — the actor uses curl_cffi with a Chrome User-Agent from a datacenter IP. Optionally, when verifyWithProxy: true (default), any link that returns 401 / 403 / 405 / 429 / 451 is retried once via Apify residential proxy. If that retry succeeds, the link is treated as OK — this eliminates the false positives that used to surface from sites like G2, Capterra, or rate-limited APIs. The retried status is surfaced as proxyRecheckStatus so you can see both checks.
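The recheck rule can be written as a small decision function. This is a sketch under stated assumptions: `verify` and `proxy_probe` are hypothetical names, `proxy_probe` stands in for a single request routed through the residential proxy, and "succeeds" is assumed to mean a status below 400.

```python
# Statuses the FAQ lists as typical anti-bot signals.
ANTI_BOT_STATUSES = {401, 403, 405, 429, 451}


def verify(url, direct_status, proxy_probe, verify_with_proxy=True):
    """Return (is_broken, proxy_recheck_status) for one probed link."""
    if direct_status < 400:
        return False, None  # direct check passed, nothing to recheck
    if not verify_with_proxy or direct_status not in ANTI_BOT_STATUSES:
        return True, None   # genuine failure, or recheck disabled
    proxy_status = proxy_probe(url)  # exactly one retry via the proxy
    return proxy_status >= 400, proxy_status
```

A 403 that becomes 200 through the proxy yields `(False, 200)` and no record is emitted; a 403 that stays 403 yields `(True, 403)`, which matches a record carrying both httpStatus and proxyRecheckStatus.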
HEAD vs GET: which is used?
HEAD first (saves bandwidth). If a server returns 405 or 501, the actor falls back to GET and uses that status instead.
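The fallback can be sketched as follows. The actor uses curl_cffi; this illustration swaps in the standard library's urllib so it has no dependencies, and `probe`, `send`, and `urllib_send` are hypothetical names. Timeouts and DNS failures raise `urllib.error.URLError` and are deliberately left unhandled here.

```python
import urllib.error
import urllib.request

# Statuses that indicate the server does not support HEAD.
HEAD_REJECTED = {405, 501}


def probe(url, send):
    """Try HEAD first; fall back to GET when the server rejects HEAD."""
    status = send("HEAD", url)
    if status in HEAD_REJECTED:
        status = send("GET", url)  # the GET status replaces the HEAD status
    return status


def urllib_send(method, url, timeout=10):
    """One possible `send`; urllib follows redirects by default."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib reports 4xx / 5xx as exceptions
```

Passing the transport in as `send` keeps the fallback logic testable without network access; in real use you would call `probe(url, urllib_send)`.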
Will it follow redirects?
Yes — allow_redirects=True for both HEAD and GET. The final status is what gets recorded.
Can I limit it to internal links only?
Set checkExternalLinks: false. The actor still walks the same-host graph for discovery but only probes internal links.
Why is the dataset never empty?
Even when no broken links are found, a _recordType: "summary" record is emitted with run stats. This keeps Apify's daily-test happy and gives you a quick health pulse for the site.
My start URL has thousands of pages — will this finish in time?
Use maxPages and maxCrawlDepth to keep runs bounded. For large sites, consider running with maxCrawlDepth: 0 first to audit the start page's links, then expand outward.
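To see how the two limits interact, here is a minimal sketch of a breadth-first crawl bounded by both. It is not the actor's implementation: `crawl` and `get_internal_links` are hypothetical names, and `get_internal_links` stands in for fetching a page and returning its same-host links.

```python
from collections import deque


def crawl(start_url, get_internal_links, max_crawl_depth=1, max_pages=50):
    """Breadth-first crawl; returns the pages visited, in visit order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth from the start page)
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_crawl_depth:
            continue  # deepest allowed level: do not follow links further
        for link in get_internal_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled
```

With maxCrawlDepth: 0 only the start page is visited, and maxPages stops the run even if the depth limit would allow more pages, which is why raising the depth on a large site does not make the run unbounded.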
The summary says brokenCount: 0 but I know some links are dead.
- The link may use a non-HTTP scheme (mailto:, javascript:, data:); those aren't checkable.
- The link may be JS-rendered (this scraper sees only server-rendered HTML).
- The target may return a different status to a regular browser than to a generic crawler; try overriding the User-Agent via userAgent.