Broken Link Checker — Recursive Site Crawler
Recursively crawl your website and find every broken link, 404, redirect, and timeout. Checks internal and external links with configurable depth. 100 links free per run.
Developer: Manchitt Sanan
Find every broken link on your website. Recursively crawl from any start URL and report all 404 errors, bad redirects, timeouts, and server errors — with the exact page and anchor text where each broken link was found.
Why this exists
Broken links hurt your SEO rankings, frustrate visitors, and make your site look unmaintained. Manually checking links on a 500-page site takes hours. This actor crawls your entire site in minutes, checks every internal and external link, and gives you a structured report.
- Recursive crawling — follows internal links up to a configurable depth, not just one page
- External link checking — lightweight HEAD requests to verify links to other domains
- Status categorization — every link classified as broken (404/410/5xx), redirect (301/302), timeout, or OK
- Severity levels — critical (404, 5xx), warning (redirects, timeouts), info (working links)
- Context — shows which page the broken link was found on and what the anchor text says
- 100 free links per run — try it on your site with zero risk
Quick start
{"startUrl": "https://example.com","maxDepth": 3,"maxPages": 500,"checkExternalLinks": true}
Hit Start and get a full report in minutes.
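If you prefer to trigger runs programmatically, a sketch using the `apify-client` Python package looks like the following. The actor ID `username/broken-link-checker` and the API token are placeholders — substitute the real ID from this actor's page.

```python
# Build the run input shown in Quick start; the commented-out lines
# sketch how it would be submitted with apify-client (placeholder
# actor ID and token — replace both before running).
import json

run_input = {
    "startUrl": "https://example.com",
    "maxDepth": 3,
    "maxPages": 500,
    "checkExternalLinks": True,
}

# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_API_TOKEN>")
# run = client.actor("username/broken-link-checker").call(run_input=run_input)
# items = client.dataset(run["defaultDatasetId"]).list_items().items

print(json.dumps(run_input, indent=2))
```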
Feature comparison
| Feature | HTTP Status Checker | parseforge | This actor |
|---|---|---|---|
| Single URL check | Yes | Yes | Yes |
| Recursive site crawl | No | Yes | Yes |
| External link checking | No | Yes | Yes |
| Status categorization | No | Basic | 404/301/302/500/timeout |
| Severity classification | No | No | critical / warning / info |
| Anchor text context | No | No | Yes |
| Source page tracking | No | Yes | Yes |
| Configurable depth | No | Yes | Yes |
| Configurable max pages | No | Yes | Yes |
| Respect robots.txt | No | No | Yes (configurable) |
| URL pattern exclusion | No | No | Yes (glob patterns) |
| Dry run mode | No | No | Yes |
| Free tier | No | No | 100 links free |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | (required) | URL to start crawling from |
| `maxDepth` | integer | 3 | Maximum link depth to follow (1–10) |
| `maxPages` | integer | 500 | Maximum pages to crawl (1–10,000) |
| `checkExternalLinks` | boolean | true | Check links pointing to other domains |
| `respectRobotsTxt` | boolean | true | Skip pages disallowed by robots.txt |
| `ignoredPatterns` | array | [] | URL patterns to skip (glob-style: `*logout*`, `*admin*`) |
| `outputFormat` | enum | broken-only | `broken-only` or `all` |
| `sitemapUrl` | string | (auto-detect) | URL to sitemap.xml. If not set, auto-checks /sitemap.xml and /sitemap_index.xml |
| `webhookUrl` | string | (optional) | POST full JSON results to this URL when the audit completes |
| `googleSheetsId` | string | (optional) | Export broken links to this Google Sheet (spreadsheet ID) |
| `googleServiceAccountKey` | string | (optional) | Google Service Account JSON key for Sheets export |
| `dryRun` | boolean | false | Preview what would be crawled — no charges |
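The `ignoredPatterns` examples above are standard glob patterns. A minimal sketch of how such filtering could work, using Python's `fnmatch` — the actor's exact matching logic may differ:

```python
# Glob-style URL exclusion, as illustrated by the ignoredPatterns
# examples (*logout*, *admin*). Sketch only, not the actor's source.
from fnmatch import fnmatch

def is_ignored(url, patterns):
    """True if the URL matches any glob pattern in the list."""
    return any(fnmatch(url, p) for p in patterns)

patterns = ["*logout*", "*admin*"]
print(is_ignored("https://example.com/admin/settings", patterns))  # True
print(is_ignored("https://example.com/blog/post-1", patterns))     # False
```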
Output
{"status": "success","startUrl": "https://example.com","summary": {"pagesChecked": 142,"linksChecked": 1847,"brokenLinks": 12,"redirects": 34,"errors": 3},"brokenLinks": [{"url": "https://example.com/old-page","statusCode": 404,"statusCategory": "broken","severity": "critical","foundOn": "https://example.com/blog/post-1","anchorText": "Learn more","lastChecked": "2026-04-13T10:30:00Z","error": null}]}
Status categories
| Category | HTTP codes | Severity | Meaning |
|---|---|---|---|
| broken | 404, 410, 5xx | critical | Link target is dead or server is failing |
| redirect | 301, 302, 303, 307, 308 | warning | Link works but goes through a redirect — consider updating |
| timeout | — | warning | Server did not respond within 10 seconds |
| error | — | critical | Network error, DNS failure, or connection refused |
| ok | 2xx | info | Link is working (only shown in `all` output mode) |
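The mapping in the table is simple enough to express directly. An illustrative classifier — a sketch of the documented mapping, not the actor's source:

```python
# Map a link-check result to (statusCategory, severity) per the
# table above. status_code=None models DNS/connection failures.
def classify(status_code, timed_out=False):
    """Return (statusCategory, severity) for a checked link."""
    if timed_out:
        return ("timeout", "warning")
    if status_code is None:  # DNS failure, connection refused, etc.
        return ("error", "critical")
    if status_code in (404, 410) or 500 <= status_code < 600:
        return ("broken", "critical")
    if status_code in (301, 302, 303, 307, 308):
        return ("redirect", "warning")
    if 200 <= status_code < 300:
        return ("ok", "info")
    return ("error", "critical")  # anything unexpected counts as an error

print(classify(404))  # ('broken', 'critical')
print(classify(302))  # ('redirect', 'warning')
print(classify(200))  # ('ok', 'info')
```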
Pricing
$0.003 per link checked (pay-per-event pricing).
- Only charged on successful runs — errors and dry runs are never charged.
- 500 links = $1.50
- 2,000 links = $6.00
Performance
- Uses CheerioCrawler (pure HTTP) — no headless browser overhead
- Default concurrency handled by Crawlee's built-in request queue
- External links checked with parallel HEAD requests (batches of 20)
- Typical: 200–500 links/minute depending on target site response time
- 10-second timeout per request, 1 retry on failure
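The "parallel HEAD requests in batches of 20" behaviour can be sketched with the standard library alone. Here `check_fn` stands in for the real HEAD request so the example stays network-free; the actor's internals may differ:

```python
# Process URLs batch_size at a time, each batch checked in parallel
# threads, preserving input order. check_fn is a stand-in for a
# real HEAD request.
from concurrent.futures import ThreadPoolExecutor

def check_in_batches(urls, check_fn, batch_size=20):
    """Run check_fn over urls in parallel batches; results keep order."""
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            results.extend(pool.map(check_fn, batch))
    return results

urls = [f"https://example.com/p{n}" for n in range(45)]
codes = check_in_batches(urls, lambda u: 200)  # stub: every link returns 200
print(len(codes), set(codes))  # 45 {200}
```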
Limitations
- JavaScript-rendered links are not detected. This actor uses HTTP requests only (CheerioCrawler), not a headless browser. Links injected by JavaScript after page load will be missed.
- Some sites aggressively block crawlers. If you see many timeouts, try reducing `maxConcurrency` or disabling `checkExternalLinks`.
- External links are checked with HEAD requests only. Some servers respond differently to HEAD vs GET — a HEAD 404 does not always mean GET would also 404.
- Maximum 10,000 pages per run to prevent runaway costs.
Related Tools by Rashadamom
- Domain Age Checker — Check registration date, expiration, registrar, and age for any domain via RDAP.
- Tech Stack Detector — Detect frameworks, CMS, analytics, CDN, and 100+ technologies for any list of URLs.
- Google Sheets Reader & Writer — Read any Google Sheet to JSON or append rows. Service Account auth — no OAuth blocks.
Run on Apify
No setup needed. Click above to run in the cloud. $0.003 per link checked.
