Pricing

from $0.15 / 1,000 discovered links

Website URL Crawler & Link Extractor

Crawl JavaScript-rendered websites and export a URL link map. Get source pages, depth, anchor text, link type, HTTP metadata, and crawl status.

Developer

Maxime Dupré

Last modified: 3 days ago

🔗 Website URL crawler for rendered links and sitemaps

Website URL Crawler crawls public websites and extracts URLs from both rendered pages and sitemaps. Add one or more website URLs or domains, and the Actor returns a clean URL inventory with depth, parent URL, anchor text, link type, HTTP status, and sitemap metadata when the site provides it.

Use this website URL crawler for SEO audits, website migrations, QA checks, internal linking reviews, broken-link workflows, and RAG source inventories. It is useful when a plain sitemap URL extractor is not enough, because the same run can also find links from JavaScript-rendered navigation.

For a quick first run, keep the prefilled start URLs: the IANA reserved domains page, Crawlee, and Apify Docs. The IANA page produces a small rendered-link crawl. Crawlee and Apify Docs have public sitemaps, so you can see sitemap-backed URL rows too.

🧭 What this Actor does

Crawls one or more public website URLs or bare domains.
Opens pages in a browser and extracts rendered anchor links.
Discovers sitemap URLs from robots.txt and common sitemap paths.
Parses sitemap URL sets, sitemap indexes, text sitemaps, and gzipped sitemap files.
Merges rendered and sitemap evidence for the same normalized URL.
Keeps hierarchy facts such as parent URL, depth, and anchor text when rendered navigation finds the link.
Adds sitemap facts such as sitemap URL, lastmod, priority, and changefreq when available.
Filters the final URL inventory by keywords that appear in the URL.
Exports rows from Apify as JSON, CSV, Excel, XML, RSS, HTML, or through the Apify API.
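
The merge step in the list above can be pictured with a minimal Python sketch. This is not the Actor's code: `normalize_url` and `merge_evidence` are hypothetical helpers, and the normalization rule (lowercase host, drop the fragment) is an assumption made only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_evidence(rendered: list[dict], sitemap: list[dict]) -> list[dict]:
    """Combine rendered and sitemap rows that point at the same normalized URL."""
    merged: dict[str, dict] = {}
    for source, rows in (("rendered", rendered), ("sitemap", sitemap)):
        for row in rows:
            key = normalize_url(row["url"])
            entry = merged.setdefault(key, {"url": key, "discoverySources": []})
            if source not in entry["discoverySources"]:
                entry["discoverySources"].append(source)
            # Keep the first non-empty value seen for every other field.
            for field, value in row.items():
                if field != "url" and value is not None:
                    entry.setdefault(field, value)
    return list(merged.values())
```

A URL found both in rendered navigation and in a sitemap ends up as one row that carries the hierarchy facts and the sitemap facts together.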

The Actor is made for URL discovery and link-map exports. It does not scrape full page content, submit forms, log in, click through menus, check search-index status, or mirror website files into storage.

📦 Data you get

Every saved item is one accepted website URL found from rendered navigation, sitemap discovery, or both.

startUrl - normalized submitted website or domain that produced the row.
url - normalized URL found by the crawl.
discoverySources - list of discovery sources for the URL: rendered, sitemap, or both.
parentUrl - rendered page where the URL was found, when available.
depth - rendered crawl depth from the start URL, when available.
anchorText - visible link text from rendered navigation, when available.
relationship - whether the URL is internal or external for the submitted website.
linkType - page, document, media, or other.
crawlStatus - whether the URL was loaded as a page or only discovered.
httpStatusCode, finalUrl, and contentType - page response facts for loaded pages when HTTP checks are enabled.
sitemapUrl, lastmod, priority, and changefreq - sitemap source and metadata when the sitemap provides them.

🚀 How to run it

Add one or more public websites, domains, or page URLs.
Leave URL keywords empty for a full URL inventory, or add words such as blog, docs, or pricing to keep only matching URLs.
Set Max URL rows to control the total output size and cost.
Choose how many rendered pages to open per website.
Pick the crawl depth and maximum rendered links per page.
Choose whether to stay on the same host, stay on the same domain, or include external URLs as discovered rows.
Run the Actor and open the dataset.

Domains such as example.com are accepted and normalized to HTTPS. Full URLs such as https://example.com/docs are also accepted.
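
As a rough illustration of that input handling, here is a Python sketch. The function name `normalize_start_url` is hypothetical, and the trailing slash on bare domains is an assumption inferred from the `startUrl` value in the output example; the Actor's exact rules may differ.

```python
from urllib.parse import urlsplit


def normalize_start_url(value: str) -> str:
    """Turn a bare domain or a full URL into a start URL (illustrative only)."""
    value = value.strip()
    if "://" not in value:
        # Bare domains are assumed to default to HTTPS.
        value = "https://" + value
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a website URL or domain: {value!r}")
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{path}{query}"
```

With this sketch, `example.com` becomes `https://example.com/`, and `https://example.com/docs` is returned unchanged.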

🧾 Input options

Website URLs is the only required input. Add the websites or pages you want to map.

URL keywords keeps URLs that contain all listed words or path parts. The filter checks URL text, not page meaning.
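
The all-keywords rule can be sketched in a few lines of Python. The helper name `matches_keywords` is hypothetical, and case-insensitive matching is an assumption; the README only states that every listed word must appear in the URL text.

```python
def matches_keywords(url: str, keywords: list[str]) -> bool:
    """Return True when the URL text contains every non-empty keyword."""
    haystack = url.lower()
    wanted = [k.strip().lower() for k in keywords if k.strip()]
    # An empty keyword list keeps every URL (full inventory).
    return all(word in haystack for word in wanted)
```

For example, `["blog", "pricing"]` keeps `https://example.com/blog/pricing-update` but drops `https://example.com/blog/hello`.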

Max URL rows limits accepted output rows across the whole run. Set it to 0 only when you want no row cap, so that every URL found within the other crawl limits is saved.

Max pages per website controls how many rendered pages are opened for each submitted website.

Max crawl depth controls how many link levels the Actor follows from each start URL. Use 0 when you only want links from the submitted page plus sitemap-discovered URLs.

Max links per page limits how many rendered links are saved and considered from each loaded page.

Crawl scope controls which internal links can be followed. External links can be saved as discovered rows, but they are not crawled further.

Asset links controls whether the dataset includes only page URLs, page plus document URLs, or all links including media assets.

Ignored extensions skips common file extensions unless Asset links is set to include all links.

Check HTTP status adds status code, final URL, and content type for loaded pages.

🧪 Output example

{
	"startUrl": "https://crawlee.dev/",
	"url": "https://crawlee.dev/blog",
	"discoverySources": ["sitemap", "rendered"],
	"parentUrl": "https://crawlee.dev/js",
	"depth": 2,
	"anchorText": "Blog",
	"relationship": "internal",
	"linkType": "page",
	"crawlStatus": "discovered",
	"httpStatusCode": null,
	"finalUrl": null,
	"contentType": null,
	"sitemapUrl": "https://crawlee.dev/sitemap.xml",
	"lastmod": null,
	"priority": 0.5,
	"changefreq": "weekly"
}

💳 Pricing

This Actor uses pay-per-event pricing. You are charged for each accepted website URL saved to the dataset. The pricing event is called Discovered link.

Use a small Max URL rows value for your first run. Once the output looks right, increase Max URL rows, Max pages per website, and Max crawl depth for broader website inventories.
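
To budget a run before raising the limits, a back-of-the-envelope estimate is enough. The sketch below assumes the listed starting rate of $0.15 per 1,000 discovered links and counts only the Discovered link event; the listing says "from", so your actual rate may differ.

```python
from decimal import Decimal


def estimate_cost_usd(rows: int, price_per_1000: Decimal = Decimal("0.15")) -> Decimal:
    """Estimate the Discovered link charge for a given number of saved rows."""
    if rows < 0:
        raise ValueError("rows must be non-negative")
    return Decimal(rows) * price_per_1000 / Decimal(1000)
```

At that rate, a run capped at 25,000 rows would cost about $3.75 in Discovered link events.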

⚠️ Limits and caveats

Website URL Crawler uses a browser for rendered link discovery, so it favors coverage over the lowest possible runtime cost. Large sites can publish thousands of sitemap URLs and many rendered links; set limits before running broad crawls.

Sitemap metadata is source-backed only. If a sitemap omits lastmod, priority, or changefreq, those fields stay empty (null in the JSON output).

HTTP status, final URL, and content type are available for loaded pages. URLs that are only discovered from a sitemap or an unvisited link may not have those page response fields.

The Actor reads public website URLs. It does not use source credentials, user cookies, private APIs, browser extensions, or page content enrichment.

❓ FAQ

🌐 Does this crawl JavaScript-rendered websites?

Yes. Rendered pages are opened in a browser, and links are extracted after the page loads.

🗺️ Does it parse sitemaps?

Yes. The Actor checks robots.txt, common sitemap paths, sitemap indexes, text sitemaps, and gzipped sitemap files. It saves sitemap metadata when the source sitemap provides it.
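
For readers curious what sitemap parsing involves, here is a minimal Python sketch using only the standard library. It is an illustration, not the Actor's parser: `parse_sitemap` is a hypothetical helper, it handles XML URL sets, sitemap indexes, and gzipped payloads, and it leaves out text sitemaps and robots.txt discovery.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(data: bytes) -> dict:
    """Parse one XML sitemap payload, decompressing it first if gzipped."""
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    root = ET.fromstring(data)

    if root.tag.endswith("sitemapindex"):
        children = [(loc.text or "").strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]
        return {"kind": "index", "sitemaps": [c for c in children if c], "urls": []}

    urls = []
    for node in root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue
        priority = node.findtext("sm:priority", namespaces=NS)
        urls.append({
            "url": loc,
            # Missing metadata stays None, mirroring the source-backed rule.
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
            "priority": float(priority) if priority else None,
            "changefreq": node.findtext("sm:changefreq", namespaces=NS),
        })
    return {"kind": "urlset", "sitemaps": [], "urls": urls}
```

A sitemap index returns child sitemap URLs to fetch next, while a URL set returns one entry per listed page.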

🔎 Can I filter URLs by keyword?

Yes. Add URL keywords such as blog, docs, or product. A URL must contain every listed keyword to be saved.

🌍 Will it crawl external websites too?

No. External URLs can be saved as discovered rows when your settings allow them, but the crawler only follows internal page URLs within the selected scope.

📄 Can I crawl only one page?

Yes. Set Max crawl depth to 0 and keep Max pages per website low when you only want the submitted page's rendered links plus sitemap-discovered URLs.

🧯 Is this a broken link checker?

It can support broken-link workflows by exporting URL rows and HTTP metadata for loaded pages, but the main output is a website URL inventory and link map.
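
One way to use the export in a broken-link workflow is to split rows by their HTTP status. The Python sketch below assumes the rows were exported as JSON and loaded into a list of dicts; `split_by_status` is a hypothetical helper, and treating any status of 400 or above as broken is an assumption you may want to adjust.

```python
def split_by_status(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate rows with an error status from rows that were never loaded."""
    broken, unchecked = [], []
    for row in rows:
        status = row.get("httpStatusCode")
        if status is None:
            # Discovered-only URLs carry no page response facts.
            unchecked.append(row)
        elif status >= 400:
            broken.append(row)
    return broken, unchecked
```

The unchecked list is a natural input for a dedicated link checker, since those URLs were discovered but not loaded during the crawl.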

📝 Changelog

1.0: Added sitemap discovery, sitemap metadata, URL keyword filtering, total row limits, and the new merged URL inventory output.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

🔗 Other actors

Sitemap Sniffer ↗ - Find public sitemap files and optional sitemap URL inventory rows.
Website Emails Scraper ↗ - Find public email addresses on websites you already plan to crawl.
Business Address Scraper ↗ - Extract physical business addresses from public company websites.
Font Detector ↗ - Audit fonts, font files, and typography metadata from public pages.
SEMrush Free Website Stats Scraper ↗ - Export public SEMrush website stats for domains and URLs.

Made with ❤️ by Maxime Dupré
