Website URL Crawler & Link Extractor
Pricing
from $10.80 / 1,000 discovered website links
Crawl JavaScript-rendered websites and export a URL link map. Get source pages, depth, anchor text, link type, HTTP metadata, and crawl status.
Rating: 0.0 (0)
Developer: Maxime Dupré
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 days ago
🔗 Website URL crawler for rendered pages
Website URL Crawler crawls JavaScript-rendered public websites and exports a clean link map. Add one or more website URLs or domains, and the Actor opens pages in a browser, reads the rendered links, follows the pages you allow, and saves one dataset item per discovered link.
Use it for SEO audits, website migrations, QA checks, broken-link investigation, internal linking reviews, and RAG source inventories. It works well when you need more than a raw list of URLs: each link keeps its source page, parent URL, depth, anchor text, link type, crawl status, and optional HTTP metadata.
For a quick first run, keep the prefilled IANA reserved domains page. It is small, public, and gives you a readable website link map without needing your own test site.
🧭 What this Actor does
Website URL Crawler starts from your submitted websites and discovers links from rendered HTML pages. That means links added by common client-side JavaScript can be included in the crawl output after the page loads.
The Actor can crawl within the same host, within the same registrable domain, or emit external links as discovered-only rows. Only internal page links are followed further. Document and media links can be included or skipped depending on the asset setting you choose.
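The scope rule above can be sketched in a few lines. This is an illustration only, not the Actor's implementation: the scope names `same-host` and `same-domain` and both helper functions are hypothetical, and the registrable-domain check is a deliberately naive "last two labels" approximation that a real crawler would replace with a Public Suffix List lookup.

```python
from urllib.parse import urlsplit


def naive_registrable_domain(host: str) -> str:
    """Approximate the registrable domain as the last two host labels.

    Assumption: this is wrong for suffixes such as co.uk; a real
    implementation would consult the Public Suffix List.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_in_scope(start_url: str, link_url: str, scope: str) -> bool:
    """Return True when link_url may be followed under the given scope.

    scope is a hypothetical setting value: "same-host" or "same-domain".
    """
    start_host = urlsplit(start_url).hostname or ""
    link_host = urlsplit(link_url).hostname or ""
    if not start_host or not link_host:
        return False
    if scope == "same-host":
        return link_host == start_host
    if scope == "same-domain":
        return naive_registrable_domain(link_host) == naive_registrable_domain(start_host)
    raise ValueError(f"unknown scope: {scope}")
```

Under this sketch, a link that fails the check would still be saved as a discovered-only row when external links are enabled, but it would never be queued for crawling.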
Each run is designed for link extraction and crawl mapping, not full content scraping. The output helps you answer practical questions such as:
- Which pages does this website link to?
- Where was each URL found?
- What anchor text points to each link?
- How deep is the link from the start page?
- Is the link internal, external, a document, or an asset?
- Which crawled pages returned HTTP status and content type metadata?
📦 Data you get
Every saved item represents one crawled or discovered website link. Fields include:
- startUrl - the original website URL this crawl started from
- url and normalizedUrl - the discovered link and its normalized version
- sourceUrl - the rendered page where the link was found
- parentUrl - the page that led to a crawled URL, when available
- depth - crawl depth from the start URL
- anchorText - visible link text when present
- linkType - page, document, asset, or external
- crawlStatus - crawled or discovered
- httpStatusCode, finalUrl, and contentType - when HTTP status checks are enabled and the page is crawled
- isInternal, isExternal, isAsset, and isDuplicate - booleans for filtering and audits
- rawHref, foundOnTitle, sourceIndex, and discoveredAt - source evidence and scrape metadata
You can export the dataset from Apify as JSON, CSV, Excel, XML, RSS, or HTML, or consume it through the Apify API, schedules, webhooks, and integrations.
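Once the dataset is exported as JSON, a link map per source page is a short grouping step. The sketch below assumes the items are already loaded into a list of dictionaries (for example with `json.load`) and uses only the `sourceUrl` and `url` fields listed above. The function name `links_by_source` and the sample rows are illustrative, not part of the Actor.

```python
from collections import defaultdict


def links_by_source(items: list[dict]) -> dict[str, list[str]]:
    """Map each sourceUrl to the distinct URLs found on it, in first-seen order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in items:
        source, url = item.get("sourceUrl"), item.get("url")
        # Skip incomplete rows and repeated links from the same source page.
        if source and url and url not in grouped[source]:
            grouped[source].append(url)
    return dict(grouped)


# Illustrative rows, trimmed to the two fields used here.
sample_items = [
    {"sourceUrl": "https://example.com/", "url": "https://example.com/docs"},
    {"sourceUrl": "https://example.com/", "url": "https://example.com/pricing"},
    {"sourceUrl": "https://example.com/", "url": "https://example.com/docs"},
    {"sourceUrl": "https://example.com/docs", "url": "https://example.com/"},
]
link_map = links_by_source(sample_items)
```

The same pattern works for grouping by `depth` or `linkType` when you need a different view of the crawl.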
⚙️ How to run it
- Add one or more website URLs or domains.
- Choose how many pages to crawl per website.
- Set the crawl depth and maximum links per page.
- Pick whether to stay on the same host, same domain, or include external links as discovered-only rows.
- Choose whether to include document links, all asset links, or pages only.
- Run the Actor and open the dataset.
Domains such as example.com are accepted and normalized to HTTPS. Full URLs such as https://example.com/docs are also accepted.
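As a rough illustration of that normalization, the sketch below prefixes bare domains with `https://` and passes full URLs through unchanged. The function name `normalize_start_url` is hypothetical, and leaving explicit `http://` URLs untouched is an assumption; the Actor may handle them differently.

```python
from urllib.parse import urlsplit


def normalize_start_url(value: str) -> str:
    """Turn a bare domain into an HTTPS URL; leave full web URLs unchanged."""
    value = value.strip()
    if not value:
        raise ValueError("empty start URL")
    # A value without a scheme separator is treated as a bare domain.
    if "://" not in value:
        value = "https://" + value
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a web URL: {value!r}")
    return value
```

Validating start URLs this way before a run can catch typos such as a missing host before any pages are opened.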
🧾 Input options
Website URLs is the only required input. Add the sites you want to crawl.
Max pages per website controls how many HTML pages are opened for each start URL. Discovered links can still be emitted before the page cap is reached.
Max crawl depth controls how many levels of links the Actor follows from the start page. Use 0 when you only want links from the submitted page itself.
Max links per page limits how many rendered links are emitted and considered from each crawled page.
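To see how the three limits interact, here is a minimal breadth-first sketch over an in-memory link graph. It is not the Actor's code: `crawl_plan`, its parameter names, and the toy graph are assumptions made for illustration, and scope and asset filtering are left out.

```python
from collections import deque


def crawl_plan(graph, start, max_pages, max_depth, max_links_per_page):
    """Breadth-first walk over an in-memory link graph.

    graph maps a page URL to the ordered list of links rendered on it.
    Returns (opened_pages, rows), where each row is (url, source, depth).
    """
    opened, rows = [], []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(opened) < max_pages:
        page, depth = queue.popleft()
        opened.append(page)
        # Only the first max_links_per_page links are emitted and considered.
        for link in graph.get(page, [])[:max_links_per_page]:
            rows.append((link, page, depth + 1))
            # Follow a link only if it is new and within the depth limit.
            if link not in seen and depth + 1 <= max_depth:
                seen.add(link)
                queue.append((link, depth + 1))
    return opened, rows


# Toy site: page "a" links to "b", "c", and "d"; "b" links to "e"; "c" links back to "a".
toy_graph = {"a": ["b", "c", "d"], "b": ["e"], "c": ["a"]}
opened_pages, link_rows = crawl_plan(
    toy_graph, "a", max_pages=10, max_depth=0, max_links_per_page=2
)
```

With a depth of 0, only the start page is opened, and its first two links are emitted at depth 1 without being followed, which matches the single-page use case described above.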
Crawl scope controls which internal links can be followed. External links are never crawled further; they can be emitted as discovered-only rows when your settings allow it.
Asset links controls whether the dataset includes only HTML page links, document links such as PDFs and spreadsheets, or all links including media assets.
Ignored extensions lets you skip common file types unless you choose to include all links.
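One plausible way to derive a link type from a URL is to look at its file extension. The sketch below is an assumption about how such a classifier could work: the extension sets, the function name `classify_link`, and the choice to label every non-internal link as external are all hypothetical, and the Actor's real lists are not documented here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical extension sets, chosen only for illustration.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv"}
ASSET_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".mp3", ".mp4"}


def classify_link(url: str, is_internal: bool) -> str:
    """Return "external", "document", "asset", or "page" for a link."""
    if not is_internal:
        return "external"
    # Use only the URL path so query strings and fragments are ignored.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if suffix in ASSET_EXTENSIONS:
        return "asset"
    return "page"
```

A pages-only run would then keep rows classified as `page`, while a documents run would also keep `document` rows.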
Check HTTP status adds status code, final URL, and content type for crawled pages.
🧪 Output example
{
  "startUrl": "https://www.iana.org/domains/reserved",
  "url": "https://www.iana.org/domains/root",
  "normalizedUrl": "https://www.iana.org/domains/root",
  "sourceUrl": "https://www.iana.org/domains/reserved",
  "parentUrl": "https://www.iana.org/domains/reserved",
  "depth": 1,
  "anchorText": "Root Zone Management",
  "linkType": "page",
  "crawlStatus": "discovered",
  "isInternal": true,
  "isExternal": false,
  "isAsset": false,
  "isDuplicate": false,
  "rawHref": "/domains/root",
  "foundOnTitle": "IANA-managed Reserved Domains",
  "sourceIndex": 24,
  "discoveredAt": "2026-05-26T00:00:00.000Z"
}
💳 Pricing
This Actor uses pay-per-event pricing. You are charged for each saved website link item. The pricing event is called Discovered website link.
Use a small Max pages per website value for your first run, then increase the limit once the output shape looks right.
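For budgeting, the charge scales linearly with the number of saved items. The sketch below uses the listed "from" price of $10.80 per 1,000 discovered website links. The function name `estimate_cost` is hypothetical, and the actual rate may differ by plan, so treat the result as a rough estimate.

```python
from decimal import Decimal

# Listed "from" price: $10.80 per 1,000 discovered website links.
PRICE_PER_1000_LINKS = Decimal("10.80")


def estimate_cost(saved_links: int) -> Decimal:
    """Estimate the charge in USD for a number of saved link items."""
    if saved_links < 0:
        raise ValueError("saved_links must be non-negative")
    cost = PRICE_PER_1000_LINKS * saved_links / 1000
    # Round to whole cents for display.
    return cost.quantize(Decimal("0.01"))
```

At this rate, a test run that saves 250 links comes to about $2.70.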
⚠️ Limits and caveats
Website URL Crawler renders pages in a browser, so it prioritizes rendering capability over minimum runtime cost. Large sites can produce many links quickly; start with a small page limit and expand from there.
The Actor reads links from public rendered pages. It does not log in, submit forms, click through interactive menus, or guarantee that every route in a single-page app is discoverable from normal anchor links.
HTTP status, final URL, and content type are available for crawled pages. Links that are only discovered but not crawled are still useful for mapping, but they may not have those HTTP fields.
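When reviewing an export, it helps to separate rows with a failing status from rows that were never checked. The sketch below assumes the `crawlStatus` and `httpStatusCode` fields described earlier and treats any status of 400 or higher as broken. The function name `split_by_http_status` and the sample rows are illustrative.

```python
def split_by_http_status(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (broken, unchecked) rows from exported dataset items.

    broken: crawled rows whose httpStatusCode is 400 or higher.
    unchecked: rows without an integer httpStatusCode, such as discovered-only links.
    """
    broken, unchecked = [], []
    for item in items:
        status = item.get("httpStatusCode")
        if not isinstance(status, int):
            unchecked.append(item)
        elif item.get("crawlStatus") == "crawled" and status >= 400:
            broken.append(item)
    return broken, unchecked


# Illustrative rows, trimmed to the fields used here.
sample_rows = [
    {"url": "https://example.com/", "crawlStatus": "crawled", "httpStatusCode": 200},
    {"url": "https://example.com/old", "crawlStatus": "crawled", "httpStatusCode": 404},
    {"url": "https://example.org/", "crawlStatus": "discovered"},
]
broken_rows, unchecked_rows = split_by_http_status(sample_rows)
```

Unchecked rows are not necessarily healthy; they would need a separate status check before being ruled out.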
❓ FAQ
🌐 Does this crawl JavaScript-rendered websites?
Yes. Pages are opened in a browser and links are extracted from the rendered page, not only the initial HTML response.
🌍 Will it crawl external websites too?
No. External links can be saved as discovered links, but the crawler only follows internal page links within the scope you choose.
📄 Can I crawl only one page?
Yes. Set Max crawl depth to 0 when you want links from the submitted page without following deeper links.
🧯 Is this a broken link checker?
It can help with broken-link workflows by exporting discovered links and HTTP metadata for crawled pages, but the core output is a website URL crawl map.
📝 Changelog
- 0.0: Initial release.
🆘 Support
For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡
🔗 Other actors
- Website Emails Scraper - Find public email addresses on the websites you already crawl.
- Font Detector - Audit fonts, font files, and typography metadata from public pages.
- Business Address Scraper - Extract physical business addresses from company websites.
- Product Hunt Scraper - Build startup lead lists and enrich launches with website details.
- LinkedIn Company Scraper - Export public company profile data for lead and market research.
Made with ❤️ by Maxime Dupré