Pricing

from $0.15 / 1,000 discovered links

Website URL Crawler & Link Extractor

Crawl public websites and collect URLs from rendered navigation and sitemaps. Export a link map with source pages, depth, anchor text, HTTP facts, crawl status, and available sitemap metadata.

Developer

Maxime Dupré

Last modified: 6 days ago

🔗 Build a website URL inventory from links and sitemaps

Website URL Crawler is for SEO specialists, developers, and site owners who need a structured map of a public website. It collects accepted URLs from rendered navigation and public sitemaps, with hierarchy, classification, crawl state, HTTP facts for loaded pages, and available sitemap metadata.

Website crawler tool — create a URL inventory for a public website.
Website migration URL inventory — collect paths before planning redirects or comparing site structures.
Internal linking review — inspect parent pages, anchor text, and crawl depth.
Sitemap URL extractor — collect URLs and source-provided metadata from public sitemaps.
Rendered link crawler — find links exposed by page navigation after rendering.
Broken link workflow crawler — export loaded-page HTTP facts for downstream checks.

📦 Returned data

Each saved row is one accepted URL found through rendered navigation, sitemap discovery, or both. Fields include:

startUrl and normalized url
discoverySources: rendered, sitemap, or both
parentUrl, depth, and anchorText when rendered navigation provides them
relationship and linkType
crawlStatus: crawled or discovered
httpStatusCode, finalUrl, and contentType for loaded pages
sitemapUrl, lastmod, priority, and changefreq when supplied by a sitemap

Unavailable source metadata remains empty.

🚀 Running the Actor

Add one or more public website URLs, page URLs, or bare domains.
Choose the crawl scope.
Set row, page, depth, and per-page link limits.
Run the Actor and open its dataset.

Bare domains such as example.com are normalized to HTTPS. The Actor crawls public pages only; it does not log in, submit forms, scrape full page content, or check search-engine index status.
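The bare-domain rule above can be sketched as follows. This is an illustration only: the Actor's exact normalization rules are not documented here, and the function name `normalize_start_url`, the lowercasing of the host, and the trailing slash on an empty path are assumptions.

```python
from urllib.parse import urlsplit


def normalize_start_url(raw: str) -> str:
    """Return an absolute URL, defaulting inputs without a scheme to HTTPS.

    Hypothetical helper: it mirrors the documented behavior for bare domains,
    not the Actor's actual code.
    """
    value = raw.strip()
    if not value:
        raise ValueError("empty start URL")
    # Inputs such as "example.com" or "example.com/docs" have no scheme.
    if "://" not in value:
        value = "https://" + value
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported start URL: {raw!r}")
    # Assumption: lowercase the host and use "/" when the path is empty.
    path = parts.path or "/"
    normalized = f"{parts.scheme}://{parts.netloc.lower()}{path}"
    if parts.query:
        normalized += "?" + parts.query
    return normalized
```

With these assumptions, `normalize_start_url("example.com")` yields `https://example.com/`, while an explicit `http://` URL keeps its scheme.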

🧾 Input

Website URLs is required. The other public fields are:

URL keywords — words or path parts used by the URL filter setting.
Crawl scope — controls which links can be followed and saved. Same host and same domain save only internal URLs. The external option can also save external URLs, but does not follow them.
Asset links — selects pages, pages and documents, or all link types.
Ignored extensions — skips listed extensions unless all links are included.
Max URL rows — caps accepted rows across the run; 0 uses the other crawl limits only.
Max pages per website, Max crawl depth, and Max links per page — control traversal breadth.

{
  "urls": [{ "url": "https://example.com" }],
  "keywords": ["docs"],
  "crawlScope": "same-domain",
  "assetPolicy": "include-documents",
  "maxResults": 100,
  "maxPagesPerStartUrl": 10,
  "maxDepth": 2,
  "maxLinksPerPage": 120
}
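A minimal sketch of starting a run from Python with the same input, assuming the `apify-client` package is installed. The Actor ID is not stated on this page, so `actor_id` is a placeholder you must supply. The `APIFY_TOKEN` environment variable name and the helper names `build_run_input` and `run_actor` are assumptions for illustration.

```python
import os


def build_run_input(start_urls, keywords=(), max_results=100):
    """Build a run input dict using the field names from the JSON example above."""
    return {
        "urls": [{"url": url} for url in start_urls],
        "keywords": list(keywords),
        "crawlScope": "same-domain",
        "assetPolicy": "include-documents",
        "maxResults": max_results,
        "maxPagesPerStartUrl": 10,
        "maxDepth": 2,
        "maxLinksPerPage": 120,
    }


def run_actor(actor_id, run_input):
    """Start a run, wait for it to finish, and return the dataset rows.

    Requires the apify-client package and an API token in APIFY_TOKEN
    (the variable name is an assumption).
    """
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For a first run, something like `run_actor("<username>/<actor-name>", build_run_input(["https://example.com"], ["docs"]))` keeps the row cap at 100 so the output can be inspected before the limits are raised.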

📤 Output

A realistic dataset row looks like this:

{
  "startUrl": "https://example.com/",
  "url": "https://example.com/docs",
  "discoverySources": ["rendered", "sitemap"],
  "parentUrl": "https://example.com/",
  "depth": 1,
  "anchorText": "Documentation",
  "relationship": "internal",
  "linkType": "page",
  "crawlStatus": "crawled",
  "httpStatusCode": 200,
  "finalUrl": "https://example.com/docs",
  "contentType": "text/html; charset=utf-8",
  "sitemapUrl": "https://example.com/sitemap.xml",
  "lastmod": "2026-05-14",
  "priority": 0.5,
  "changefreq": "weekly"
}

Rows are available through the Apify dataset and its standard export formats.
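Once the dataset is exported as JSON, a quick summary helps check that the limits produced the inventory you expected. The sketch below assumes the export is a JSON array of row objects shaped like the example above. The helper names `load_rows` and `summarize_rows` are hypothetical.

```python
import json
from collections import Counter


def load_rows(path):
    """Load a dataset exported as a JSON array of row objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_rows(rows):
    """Count rows by depth, crawl status, and discovery source."""
    by_depth = Counter()
    by_status = Counter()
    by_source = Counter()
    for row in rows:
        # depth can be missing when rendered navigation did not provide it;
        # such rows are grouped under None.
        by_depth[row.get("depth")] += 1
        by_status[row.get("crawlStatus")] += 1
        for source in row.get("discoverySources") or []:
            by_source[source] += 1
    return {
        "depth": dict(by_depth),
        "crawlStatus": dict(by_status),
        "discoverySources": dict(by_source),
    }
```

A row found by both rendered navigation and a sitemap is counted once under each source, so the source totals can exceed the row count.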

💳 Pricing

This Actor uses pay-per-event pricing. The Website URL event is charged once for each accepted URL saved to the dataset. Discovered URLs that are not accepted or saved do not create this event. Start with a small Max URL rows value, inspect the output, and then broaden the limits if needed.
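As a rough budgeting aid, the Website URL event cost scales linearly with the number of saved rows. The sketch below assumes the listed starting price of $0.15 per 1,000 applies to your account. The listing says "from", so the actual rate may differ, and other platform charges are not modeled.

```python
from decimal import Decimal

# Assumption: the listing's starting price, in USD per 1,000 saved URLs.
RATE_PER_1000_USD = Decimal("0.15")


def estimate_url_event_cost(saved_rows: int,
                            rate_per_1000: Decimal = RATE_PER_1000_USD) -> Decimal:
    """Estimate the Website URL event charge for a number of saved rows."""
    if saved_rows < 0:
        raise ValueError("saved_rows must be non-negative")
    # One event per saved row, priced per thousand events.
    return rate_per_1000 * Decimal(saved_rows) / Decimal(1000)
```

At the assumed rate, a 100-row trial run comes to $0.015 in Website URL events, and a 10,000-row inventory to $1.50.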

🔌 Integrations

Use Apify datasets, the API, schedules, webhooks, and platform integrations to process or deliver the URL inventory.

❓ FAQ

Why can a URL be marked `discovered` without HTTP fields?

A URL can come from a sitemap or from a link the crawl never visited. Only pages that the crawl actually loads are marked crawled and receive the available HTTP response facts.

When are external URLs saved?

Same host and Same domain save only internal URLs. The Include external URLs as discovered option also saves external URLs, but the Actor does not crawl them.
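The save-versus-follow rule can be sketched like this. It is not the Actor's implementation. Only the `same-domain` value appears in the input example, so the `same-host` value, the `include_external` flag, and the naive "last two labels" domain match are assumptions. A production check would use a public-suffix list.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def is_internal(start_url: str, candidate_url: str, scope: str) -> bool:
    """Decide whether a candidate URL is internal under the given scope."""
    start, candidate = host_of(start_url), host_of(candidate_url)
    if not start or not candidate:
        return False
    if scope == "same-host":  # assumed value name
        return candidate == start
    if scope == "same-domain":
        # Naive assumption: the last two labels form the site's domain.
        root = ".".join(start.split(".")[-2:])
        return candidate == root or candidate.endswith("." + root)
    raise ValueError(f"unknown scope: {scope!r}")


def plan_link(start_url, candidate_url, scope, include_external=False):
    """Return (save, follow) for a link found during the crawl."""
    if is_internal(start_url, candidate_url, scope):
        return True, True
    # External links are never followed; they are saved only on request.
    return include_external, False
```

Under these assumptions, `docs.example.com` is internal to `example.com` for the same-domain scope but not for the same-host scope.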

Can I crawl only one submitted page?

Set Max crawl depth to 0 and use a low Max pages per website. Public sitemap discovery can still add sitemap-backed URLs.

Does it parse sitemap indexes?

Yes. It reads public XML URL sitemaps and nested sitemap indexes. Sitemap metadata is included only when the source supplies it.
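To illustrate the difference between a URL sitemap and a sitemap index, here is a minimal sketch using Python's standard XML parser and the sitemaps.org namespace. It is a generic example, not the Actor's parser, and the function name `parse_sitemap` is hypothetical. Do not feed it untrusted XML without additional hardening.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return ("index", [child sitemap URLs]) or ("urlset", [entry dicts])."""
    root = ET.fromstring(xml_text)
    # Strip the "{namespace}" prefix from the root tag.
    kind = root.tag.rsplit("}", 1)[-1]
    if kind == "sitemapindex":
        locs = [el.text.strip()
                for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
        return "index", locs
    if kind == "urlset":
        entries = []
        for url_el in root.findall("sm:url", NS):
            entry = {}
            for field in ("loc", "lastmod", "priority", "changefreq"):
                text = url_el.findtext(f"sm:{field}", default=None, namespaces=NS)
                # Keep only metadata the source actually supplies.
                if text and text.strip():
                    entry[field] = text.strip()
            if "loc" in entry:
                entries.append(entry)
        return "urlset", entries
    raise ValueError(f"not a sitemap document: <{kind}>")
```

A crawler would fetch each URL returned for an index and parse it the same way, which is how nested indexes are resolved.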

Is this a full broken-link checker?

No. It provides HTTP facts for pages actually loaded during the crawl, which can support a downstream broken-link workflow. It does not promise an HTTP check for every discovered URL.
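A downstream broken-link workflow might start by splitting rows into those with an error status and those that were never loaded and still need a separate check. The sketch below assumes rows shaped like the output example. The function name `triage_rows` and the 400 threshold are illustrative choices.

```python
def triage_rows(rows):
    """Split rows into loaded-with-error, loaded-ok, and unchecked buckets."""
    errors, ok, unchecked = [], [], []
    for row in rows:
        status = row.get("httpStatusCode")
        # Rows that were only discovered carry no HTTP facts.
        if row.get("crawlStatus") != "crawled" or status is None:
            unchecked.append(row["url"])
        elif int(status) >= 400:
            # Keep the parent page so the broken link can be located.
            errors.append((row["url"], int(status), row.get("parentUrl")))
        else:
            ok.append(row["url"])
    return {"errors": errors, "ok": ok, "unchecked": unchecked}
```

The `unchecked` bucket is the input for whatever tool performs the remaining HTTP checks.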

Does it scrape page text or private pages?

No. It extracts URL and link evidence from public pages and sitemaps. It does not return full page content, log in, or access private pages.

📝 Changelog

1.0: Added sitemap discovery, sitemap metadata, URL keyword filtering, total row limits, and the merged URL inventory output.

🆘 Support

For issues, questions, or feature requests, file a ticket and I'll fix or implement it in less than 24h 🫡

Related Actors:

Sitemap Sniffer — discover sitemap files and optionally export their listed URLs without crawling site navigation.
XML Sitemap Validator — check sitemap-listed URLs for HTTP status, redirects, response time, and issues.
Robots.txt Generator — create and validate crawler directives after reviewing a site's crawl surface.
Seobility SEO Checker — run public single-page SEO checks on selected URLs from your inventory.
SEMrush Free Website Stats Scraper — add public SEMrush domain metrics to website research workflows.

Made with ❤️ by Maxime Dupré
