Website URL Crawler & Link Extractor
Pricing
from $0.15 / 1,000 discovered links
Crawl JavaScript-rendered websites and export a URL link map. Get source pages, depth, anchor text, link type, HTTP metadata, and crawl status.
Developer
Maxime Dupré
Maintained by Community
🔗 Website URL crawler for rendered links and sitemaps
Website URL Crawler crawls public websites and extracts URLs from both rendered pages and sitemaps. Add one or more website URLs or domains, and the Actor returns a clean URL inventory with depth, parent URL, anchor text, link type, HTTP status, and sitemap metadata when the site provides it.
Use this website URL crawler for SEO audits, website migrations, QA checks, internal linking reviews, broken-link workflows, and RAG source inventories. It is useful when a plain sitemap URL extractor is not enough, because the same run can also find links from JavaScript-rendered navigation.
For a quick first run, keep the prefilled start URLs: the IANA reserved domains page, Crawlee, and Apify Docs. The IANA page is a small rendered-link crawl. Crawlee and Apify Docs have public sitemaps, so you can see sitemap-backed URL rows too.
🧭 What this Actor does
- Crawls one or more public website URLs or bare domains.
- Opens pages in a browser and extracts rendered anchor links.
- Discovers sitemap URLs from `robots.txt` and common sitemap paths.
- Parses sitemap URL sets, sitemap indexes, text sitemaps, and gzipped sitemap files.
- Merges rendered and sitemap evidence for the same normalized URL.
- Keeps hierarchy facts such as parent URL, depth, and anchor text when rendered navigation finds the link.
- Adds sitemap facts such as sitemap URL, `lastmod`, `priority`, and `changefreq` when available.
- Filters the final URL inventory by keywords that appear in the URL.
- Exports rows from Apify as JSON, CSV, Excel, XML, RSS, HTML, or through the Apify API.
The Actor is made for URL discovery and link-map exports. It does not scrape full page content, submit forms, log in, click through menus, check search-index status, or mirror website files into storage.
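The merge of rendered and sitemap evidence can be pictured with a small sketch. It is illustrative only: the Actor's real normalization rules are not documented here, so `normalize_url` and `merge_evidence` are hypothetical helpers that assume lowercasing the scheme and host and dropping the fragment.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_evidence(rendered: list[dict], sitemap: list[dict]) -> list[dict]:
    """Combine rows that share a normalized URL into a single row."""
    merged: dict[str, dict] = {}
    for source, rows in (("rendered", rendered), ("sitemap", sitemap)):
        for row in rows:
            key = normalize_url(row["url"])
            target = merged.setdefault(key, {"url": key, "discoverySources": []})
            if source not in target["discoverySources"]:
                target["discoverySources"].append(source)
            # Keep the first non-empty value seen for every other field.
            for field, value in row.items():
                if field != "url" and value is not None:
                    target.setdefault(field, value)
    return list(merged.values())


rows = merge_evidence(
    [{"url": "https://Example.com/blog#top", "anchorText": "Blog", "depth": 1}],
    [{"url": "https://example.com/blog", "lastmod": "2024-01-01"}],
)
```

Here both inputs collapse into one row that lists both discovery sources and carries the hierarchy facts from the rendered crawl alongside the sitemap metadata.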
📦 Data you get
Every saved item is one accepted website URL found from rendered navigation, sitemap discovery, or both.
- `startUrl` - normalized submitted website or domain that produced the row.
- `url` - normalized URL found by the crawl.
- `discoverySources` - `rendered`, `sitemap`, or both.
- `parentUrl` - rendered page where the URL was found, when available.
- `depth` - rendered crawl depth from the start URL, when available.
- `anchorText` - visible link text from rendered navigation, when available.
- `relationship` - whether the URL is internal or external for the submitted website.
- `linkType` - page, document, media, or other.
- `crawlStatus` - whether the URL was loaded as a page or only discovered.
- `httpStatusCode`, `finalUrl`, and `contentType` - page response facts for loaded pages when HTTP checks are enabled.
- `sitemapUrl`, `lastmod`, `priority`, and `changefreq` - sitemap source and metadata when the sitemap provides them.
🚀 How to run it
- Add one or more public websites, domains, or page URLs.
- Leave URL keywords empty for a full URL inventory, or add words such as `blog`, `docs`, or `pricing` to keep only matching URLs.
- Set `Max URL rows` to control the total output size and cost.
- Choose how many rendered pages to open per website.
- Pick the crawl depth and maximum rendered links per page.
- Choose whether to stay on the same host, stay on the same domain, or include external URLs as discovered rows.
- Run the Actor and open the dataset.
Domains such as example.com are accepted and normalized to HTTPS. Full URLs such as https://example.com/docs are also accepted.
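As a rough illustration of that normalization, the sketch below prefixes bare domains with `https://` and leaves full URLs untouched. `to_start_url` is a hypothetical helper, not the Actor's code, and whether the Actor rewrites `http://` inputs is not stated, so this sketch leaves them as given.

```python
from urllib.parse import urlsplit


def to_start_url(value: str) -> str:
    """Turn a bare domain into an HTTPS URL; keep full URLs as they are."""
    value = value.strip()
    if "://" not in value:
        # Assumption: inputs without a scheme are bare domains.
        value = "https://" + value
    if not urlsplit(value).netloc:
        raise ValueError("input has no host")
    return value


start_urls = [to_start_url(v) for v in ["example.com", "https://example.com/docs"]]
```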
🧾 Input options
Website URLs is the only required input. Add the websites or pages you want to map.
URL keywords keeps URLs that contain all listed words or path parts. The filter checks URL text, not page meaning.
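The all-keywords rule can be sketched as a plain substring check. `matches_keywords` is a hypothetical helper, and case-insensitive matching is an assumption of this sketch rather than documented behavior.

```python
def matches_keywords(url: str, keywords: list[str]) -> bool:
    """Return True when the URL text contains every non-empty keyword."""
    haystack = url.lower()  # assumption: matching ignores case
    needles = [k.strip().lower() for k in keywords if k.strip()]
    # An empty keyword list matches everything, giving a full inventory.
    return all(needle in haystack for needle in needles)


urls = [
    "https://example.com/docs/pricing",
    "https://example.com/docs/api",
    "https://example.com/blog",
]
kept = [u for u in urls if matches_keywords(u, ["docs", "pricing"])]
```

Only the first URL is kept, because it is the only one that contains both `docs` and `pricing`.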
Max URL rows limits accepted output rows across the whole run. Use 0 only when you want every URL found within the other crawl limits.
Max pages per website controls how many rendered pages are opened for each submitted website.
Max crawl depth controls how many link levels the Actor follows from each start URL. Use 0 when you only want links from the submitted page plus sitemap-discovered URLs.
Max links per page limits how many rendered links are saved and considered from each loaded page.
Crawl scope controls which internal links can be followed. External links can be saved as discovered rows, but they are not crawled further.
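How these limits interact is easier to see in a toy breadth-first crawl over an in-memory link graph. This is a sketch under stated assumptions, not the Actor's implementation: `crawl` and `links_by_page` are hypothetical, no browser is involved, and "internal" is simplified to "same host as the start URL".

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, links_by_page, max_depth, max_pages, max_links_per_page):
    """Breadth-first link discovery bounded by depth, page, and link limits."""
    start_host = urlsplit(start_url).netloc
    depth_of = {start_url: 0}
    queue = deque([start_url])
    pages_loaded = 0
    rows = []
    while queue and pages_loaded < max_pages:
        page = queue.popleft()
        pages_loaded += 1
        for link in links_by_page.get(page, [])[:max_links_per_page]:
            if link in depth_of:
                continue  # already discovered from an earlier page
            depth = depth_of[page] + 1
            depth_of[link] = depth
            internal = urlsplit(link).netloc == start_host
            rows.append({
                "url": link,
                "parentUrl": page,
                "depth": depth,
                "relationship": "internal" if internal else "external",
            })
            # External links are recorded but never followed.
            if internal and depth <= max_depth:
                queue.append(link)
    return rows


graph = {
    "https://a.test/": ["https://a.test/docs", "https://b.test/"],
    "https://a.test/docs": ["https://a.test/docs/api"],
}
only_start_page = crawl("https://a.test/", graph, 0, 10, 50)
one_level_deeper = crawl("https://a.test/", graph, 1, 10, 50)
```

With a depth of 0 only the start page is opened, so the result holds its two links. With a depth of 1 the internal `/docs` page is opened as well, which adds `/docs/api` at depth 2.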
Asset links controls whether the dataset includes only page URLs, page plus document URLs, or all links including media assets.
Ignored extensions skips common file extensions unless Asset links is set to include all links.
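A file-extension check is one plausible way to picture the `linkType` values. The extension sets below are assumptions chosen for illustration; the Actor's real lists and rules are not documented here, and `link_type` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed extension groups, for illustration only.
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php"}
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".csv"}
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".mp3", ".mp4"}


def link_type(url: str) -> str:
    """Classify a URL as page, document, media, or other by its path extension."""
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    if extension in PAGE_EXTENSIONS:
        return "page"
    if extension in DOCUMENT_EXTENSIONS:
        return "document"
    if extension in MEDIA_EXTENSIONS:
        return "media"
    return "other"


types = [link_type(u) for u in [
    "https://example.com/blog",
    "https://example.com/files/report.PDF",
    "https://example.com/img/logo.svg?v=2",
]]
```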
Check HTTP status adds status code, final URL, and content type for loaded pages.
🧪 Output example
{
  "startUrl": "https://crawlee.dev/",
  "url": "https://crawlee.dev/blog",
  "discoverySources": ["sitemap", "rendered"],
  "parentUrl": "https://crawlee.dev/js",
  "depth": 2,
  "anchorText": "Blog",
  "relationship": "internal",
  "linkType": "page",
  "crawlStatus": "discovered",
  "httpStatusCode": null,
  "finalUrl": null,
  "contentType": null,
  "sitemapUrl": "https://crawlee.dev/sitemap.xml",
  "lastmod": null,
  "priority": 0.5,
  "changefreq": "weekly"
}
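Rows shaped like this can be turned into a simple parent-to-children link map after export. The sketch below assumes the rows have already been loaded as a list of dictionaries, for example from the JSON export; `build_link_map` is a hypothetical helper.

```python
import json
from collections import defaultdict


def build_link_map(rows: list[dict]) -> dict[str, list[str]]:
    """Group discovered URLs under the rendered page that linked to them."""
    link_map = defaultdict(list)
    for row in rows:
        parent = row.get("parentUrl")
        if parent:  # sitemap-only rows have no parent page
            link_map[parent].append(row["url"])
    return dict(link_map)


exported = json.loads("""[
  {"url": "https://crawlee.dev/blog", "parentUrl": "https://crawlee.dev/js"},
  {"url": "https://crawlee.dev/python", "parentUrl": null}
]""")
link_map = build_link_map(exported)
```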
💳 Pricing
This Actor uses pay-per-event pricing. You are charged for each accepted website URL saved to the dataset. The pricing event is called Discovered link.
Use a small Max URL rows value for your first run. Once the output looks right, increase Max URL rows, Max pages per website, and Max crawl depth for broader website inventories.
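A back-of-the-envelope estimate follows directly from the per-row charge. The sketch uses the listed "from $0.15 / 1,000 discovered links" rate as a default; the rate that applies to your account may differ, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(url_rows: int, usd_per_thousand: float = 0.15) -> float:
    """Estimate the charge in USD for a number of saved URL rows."""
    if url_rows < 0:
        raise ValueError("url_rows must not be negative")
    return round(url_rows * usd_per_thousand / 1000, 4)


# Compare a small first run with a broader inventory.
first_run = estimate_cost(200)
broad_run = estimate_cost(10_000)
```

At the listed rate, 200 rows come to about $0.03 and 10,000 rows to about $1.50, which is why `Max URL rows` is the main cost control.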
⚠️ Limits and caveats
Website URL Crawler uses a browser for rendered link discovery, so it favors coverage over the lowest possible runtime cost. Large sites can publish thousands of sitemap URLs and many rendered links; set limits before running broad crawls.
Sitemap metadata is source-backed only. If a sitemap omits lastmod, priority, or changefreq, those fields stay empty.
HTTP status, final URL, and content type are available for loaded pages. URLs that are only discovered from a sitemap or an unvisited link may not have those page response fields.
The Actor reads public website URLs. It does not use source credentials, user cookies, private APIs, browser extensions, or page content enrichment.
❓ FAQ
🌐 Does this crawl JavaScript-rendered websites?
Yes. Rendered pages are opened in a browser, and links are extracted after the page loads.
🗺️ Does it parse sitemaps?
Yes. The Actor checks robots.txt, common sitemap paths, sitemap indexes, text sitemaps, and gzipped sitemap files. It saves sitemap metadata when the source sitemap provides it.
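For readers curious what sitemap parsing involves, here is a minimal standard-library sketch that handles plain or gzipped XML and both URL sets and sitemap indexes. It is not the Actor's parser: `parse_sitemap` is a hypothetical helper, and text sitemaps and network fetching are left out.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(data: bytes) -> dict:
    """Parse sitemap XML bytes (optionally gzipped) into a kind and entries."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    kind = root.tag.rsplit("}", 1)[-1]  # "urlset" or "sitemapindex"
    child = "sm:url" if kind == "urlset" else "sm:sitemap"
    entries = []
    for node in root.findall(child, NS):
        entries.append({
            "loc": (node.findtext("sm:loc", default="", namespaces=NS) or "").strip(),
            # Optional fields stay None when the sitemap omits them.
            "lastmod": node.findtext("sm:lastmod", namespaces=NS),
            "priority": node.findtext("sm:priority", namespaces=NS),
            "changefreq": node.findtext("sm:changefreq", namespaces=NS),
        })
    return {"kind": kind, "entries": entries}


xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/blog</loc><priority>0.5</priority></url>"
    b"</urlset>"
)
parsed = parse_sitemap(gzip.compress(xml_bytes))
```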
🔎 Can I filter URLs by keyword?
Yes. Add URL keywords such as blog, docs, or product. A URL must contain every listed keyword to be saved.
🌍 Will it crawl external websites too?
No. External URLs can be saved as discovered rows when your settings allow them, but the crawler only follows internal page URLs within the selected scope.
📄 Can I crawl only one page?
Yes. Set Max crawl depth to 0 and keep Max pages per website low when you only want the submitted page's rendered links plus sitemap-discovered URLs.
🧯 Is this a broken link checker?
It can support broken-link workflows by exporting URL rows and HTTP metadata for loaded pages, but the main output is a website URL inventory and link map.
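One way to use the export in a broken-link workflow is to filter rows by status code. The sketch assumes rows loaded from the dataset as dictionaries and treats any status of 400 or above as broken; `broken_rows` is a hypothetical helper, and rows without a status (discovered but never loaded) are skipped rather than flagged.

```python
def broken_rows(rows: list[dict]) -> list[dict]:
    """Return rows whose loaded page answered with a 4xx or 5xx status."""
    return [
        row for row in rows
        if isinstance(row.get("httpStatusCode"), int) and row["httpStatusCode"] >= 400
    ]


sample = [
    {"url": "https://example.com/", "httpStatusCode": 200},
    {"url": "https://example.com/old-page", "httpStatusCode": 404},
    {"url": "https://example.com/blog", "httpStatusCode": None},
]
broken = broken_rows(sample)
```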
📝 Changelog
- 1.0: Added sitemap discovery, sitemap metadata, URL keyword filtering, total row limits, and the new merged URL inventory output.
🆘 Support
For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡
🔗 Other actors
- Sitemap Sniffer - Find public sitemap files and optional sitemap URL inventory rows.
- Website Emails Scraper - Find public email addresses on websites you already plan to crawl.
- Business Address Scraper - Extract physical business addresses from public company websites.
- Font Detector - Audit fonts, font files, and typography metadata from public pages.
- SEMrush Free Website Stats Scraper - Export public SEMrush website stats for domains and URLs.
Made with ❤️ by Maxime Dupré