Broken Link Checker — Recursive Site Crawler

Recursively crawl your website and find every broken link, 404, redirect, and timeout. Checks internal and external links with configurable depth. 100 links free per run.

Pricing: Pay per usage

Developer: Manchitt Sanan (Maintained by Community)

Find every broken link on your website. Recursively crawl from any start URL and report all 404 errors, bad redirects, timeouts, and server errors — with the exact page and anchor text where each broken link was found.


Why this exists

Broken links hurt your SEO rankings, frustrate visitors, and make your site look unmaintained. Manually checking links on a 500-page site takes hours. This actor crawls your entire site in minutes, checks every internal and external link, and gives you a structured report.

  • Recursive crawling — follows internal links up to a configurable depth, not just one page
  • External link checking — lightweight HEAD requests to verify links to other domains
  • Status categorization — every link classified as broken (404/410/5xx), redirect (301/302), timeout, or OK
  • Severity levels — critical (404, 5xx), warning (redirects, timeouts), info (working links)
  • Context — shows which page the broken link was found on and what the anchor text says
  • 100 links free per run — try it on your site with zero risk

Quick start

```json
{
  "startUrl": "https://example.com",
  "maxDepth": 3,
  "maxPages": 500,
  "checkExternalLinks": true
}
```

Hit Start and get a full report in minutes.
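For reference, here is an input that writes out the remaining options at the defaults listed in the Input table, including an example `ignoredPatterns` list:

```json
{
  "startUrl": "https://example.com",
  "maxDepth": 3,
  "maxPages": 500,
  "checkExternalLinks": true,
  "respectRobotsTxt": true,
  "ignoredPatterns": ["*logout*", "*admin*"],
  "outputFormat": "broken-only",
  "dryRun": false
}
```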


Feature comparison

| Feature | HTTP Status Checker | parseforge | This actor |
| --- | --- | --- | --- |
| Single URL check | Yes | Yes | Yes |
| Recursive site crawl | No | Yes | Yes |
| External link checking | No | Yes | Yes |
| Status categorization | No | Basic | 404/301/302/500/timeout |
| Severity classification | No | No | critical / warning / info |
| Anchor text context | No | No | Yes |
| Source page tracking | No | Yes | Yes |
| Configurable depth | No | Yes | Yes |
| Configurable max pages | No | Yes | Yes |
| Respect robots.txt | No | No | Yes (configurable) |
| URL pattern exclusion | No | No | Yes (glob patterns) |
| Dry run mode | No | No | Yes |
| Free tier | No | No | 100 links free |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | (required) | URL to start crawling from |
| maxDepth | integer | 3 | Maximum link depth to follow (1–10) |
| maxPages | integer | 500 | Maximum pages to crawl (1–10,000) |
| checkExternalLinks | boolean | true | Check links pointing to other domains |
| respectRobotsTxt | boolean | true | Skip pages disallowed by robots.txt |
| ignoredPatterns | array | [] | URL patterns to skip (glob-style: `*logout*`, `*admin*`) |
| outputFormat | enum | broken-only | `broken-only` or `all` |
| sitemapUrl | string | (auto-detect) | URL to sitemap.xml. If not set, auto-checks /sitemap.xml and /sitemap_index.xml |
| webhookUrl | string | (optional) | POST full JSON results to this URL when the audit completes |
| googleSheetsId | string | (optional) | Export broken links to this Google Sheet (spreadsheet ID) |
| googleServiceAccountKey | string | (optional) | Google Service Account JSON key for Sheets export |
| dryRun | boolean | false | Preview what would be crawled — no charges |
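The `ignoredPatterns` field takes glob-style wildcards. As a rough illustration of how such patterns can be matched (this is a sketch of the concept, not the actor's published implementation; `globToRegExp` and `isIgnored` are illustrative names):

```javascript
// Convert a glob-style pattern (only "*" wildcards) into a RegExp,
// then test a URL against a list of patterns.
function globToRegExp(pattern) {
  // Escape regex metacharacters, then turn each "*" into ".*"
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

function isIgnored(url, ignoredPatterns) {
  return ignoredPatterns.some((p) => globToRegExp(p).test(url));
}
```

With patterns `["*logout*", "*admin*"]`, a URL like `https://example.com/admin/users` would be skipped while `https://example.com/about` would still be crawled.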

Output

```json
{
  "status": "success",
  "startUrl": "https://example.com",
  "summary": {
    "pagesChecked": 142,
    "linksChecked": 1847,
    "brokenLinks": 12,
    "redirects": 34,
    "errors": 3
  },
  "brokenLinks": [
    {
      "url": "https://example.com/old-page",
      "statusCode": 404,
      "statusCategory": "broken",
      "severity": "critical",
      "foundOn": "https://example.com/blog/post-1",
      "anchorText": "Learn more",
      "lastChecked": "2026-04-13T10:30:00Z",
      "error": null
    }
  ]
}
```
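Because the report is plain JSON, it is easy to post-process. A minimal sketch that pulls only the critical items out of a report shaped like the example above, grouped by the page they were found on (`criticalLinksByPage` is an illustrative helper, not part of the actor):

```javascript
// Extract critical broken links from a report shaped like the example
// above, grouped by the page each link was found on.
function criticalLinksByPage(report) {
  const byPage = {};
  for (const link of report.brokenLinks) {
    if (link.severity !== "critical") continue;
    (byPage[link.foundOn] ??= []).push(link.url);
  }
  return byPage;
}
```

This gives you a per-page fix list, which is usually how broken links get repaired in practice.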

Status categories

| Category | HTTP codes | Severity | Meaning |
| --- | --- | --- | --- |
| broken | 404, 410, 5xx | critical | Link target is dead or server is failing |
| redirect | 301, 302, 303, 307, 308 | warning | Link works but goes through a redirect — consider updating |
| timeout | (none) | warning | Server did not respond within 10 seconds |
| error | (none) | critical | Network error, DNS failure, or connection refused |
| ok | 2xx | info | Link is working (only shown in `all` output mode) |
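The table maps directly to a classification rule. A sketch of that mapping, assuming a status code of `null` means a network-level failure (`classifyLink` is an illustrative name; statuses outside the table, such as 403, are treated here as errors, and the actor's internal handling may differ):

```javascript
// Map an HTTP status code (or null for a network failure) plus a
// timed-out flag to the category/severity scheme in the table above.
function classifyLink(statusCode, timedOut = false) {
  if (timedOut) return { category: "timeout", severity: "warning" };
  if (statusCode == null) return { category: "error", severity: "critical" };
  if (statusCode === 404 || statusCode === 410 || statusCode >= 500)
    return { category: "broken", severity: "critical" };
  if ([301, 302, 303, 307, 308].includes(statusCode))
    return { category: "redirect", severity: "warning" };
  if (statusCode >= 200 && statusCode < 300)
    return { category: "ok", severity: "info" };
  // Statuses not covered by the table (e.g. 403) fall through as errors.
  return { category: "error", severity: "critical" };
}
```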

Pricing

$0.003 per link checked (pay-per-event pricing).

  • Only charged on successful runs — errors and dry runs are never charged.
  • 500 links = $1.50
  • 2,000 links = $6.00
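At $0.003 per link, cost is linear in links checked, so a run is easy to budget in advance (`estimateCost` below is just a convenience one-liner, not an official calculator):

```javascript
// Estimate run cost at $0.003 per link checked (pay-per-event),
// rounded to whole cents.
const COST_PER_LINK = 0.003;
const estimateCost = (linksChecked) =>
  Math.round(linksChecked * COST_PER_LINK * 100) / 100;
```

For example, a 500-link run comes to $1.50 and a 2,000-link run to $6.00, matching the figures above.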

Performance

  • Uses CheerioCrawler (pure HTTP) — no headless browser overhead
  • Default concurrency handled by Crawlee's built-in request queue
  • External links checked with parallel HEAD requests (batches of 20)
  • Typical: 200–500 links/minute depending on target site response time
  • 10-second timeout per request, 1 retry on failure
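The "batches of 20" behavior amounts to chunking the external URL list and firing the HEAD requests in each chunk concurrently. A sketch using `fetch` with a 10-second abort (the actor's real scheduling is internal; `chunk` and `headCheckAll` are illustrative names, and this requires Node 18+ for global `fetch`):

```javascript
// Split an array into fixed-size batches.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Check URLs with parallel HEAD requests, 20 at a time, 10 s timeout each.
// A null status means the request failed or timed out.
async function headCheckAll(urls, batchSize = 20) {
  const results = [];
  for (const batch of chunk(urls, batchSize)) {
    const settled = await Promise.allSettled(
      batch.map((url) =>
        fetch(url, { method: "HEAD", signal: AbortSignal.timeout(10_000) })
      )
    );
    settled.forEach((r, i) =>
      results.push({
        url: batch[i],
        status: r.status === "fulfilled" ? r.value.status : null,
      })
    );
  }
  return results;
}
```

Batching caps the number of sockets open at once, which keeps memory flat and avoids tripping rate limits on the target servers.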

Limitations

  • JavaScript-rendered links are not detected. This actor uses HTTP requests only (CheerioCrawler), not a headless browser. Links injected by JavaScript after page load will be missed.
  • Some sites aggressively block crawlers. If you see many timeouts, try reducing maxConcurrency or disabling checkExternalLinks.
  • External links are checked with HEAD requests only. Some servers respond differently to HEAD vs GET — a HEAD 404 does not always mean GET would also 404.
  • Maximum 10,000 pages per run to prevent runaway costs.

Related actors

  • Domain Age Checker — Check registration date, expiration, registrar, and age for any domain via RDAP.
  • Tech Stack Detector — Detect frameworks, CMS, analytics, CDN, and 100+ technologies for any list of URLs.
  • Google Sheets Reader & Writer — Read any Google Sheet to JSON or append rows. Service Account auth — no OAuth blocks.

Run on Apify

No setup needed. Click above to run in the cloud. $0.003 per link checked.