Get Urls Pro

Pricing

$8.00/month + usage

Developed by

Maged Safwat

Maintained by Community

Total users: 4
Monthly users: 3
Runs succeeded: >99%
Last modified: 18 days ago

Website Crawler

This Apify actor crawls websites, extracts and creates a hierarchy of links, allowing you to visualize the structure of a website. The crawler can be configured to use either standard HTTP requests with BeautifulSoup (fast HTML parsing) or Selenium (for JavaScript-heavy pages).

Features

  • Crawl any website starting from a specified URL
  • Control crawl depth and number of links per page
  • Filter out specific file extensions
  • Option to use Selenium for JavaScript-heavy websites
  • Prevent duplicate URLs in the output
  • Proxy support (via Apify Proxy)

Input Parameters

Parameter | Type | Description
startUrl | String | The starting URL to crawl (e.g., https://jamesclear.com/five-step-creative-process)
useSelenium | Boolean | Use Selenium for JavaScript-heavy pages
allowDuplicates | Boolean | Allow duplicate URLs in the output
maxDepth | Integer | Maximum depth of link recursion (1-30)
maxChildrenPerLink | Integer | Maximum number of child links per parent link (1-100)
sameDomainOnly | Boolean | Only crawl URLs on the same domain as the start URL (default: true)
ignoredExtensions | Array | File extensions to ignore when crawling
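As a rough illustration of the documented ranges, the sketch below checks an input object before a run. The helper `validate_input` is hypothetical. Apart from `sameDomainOnly`, the default values are assumptions chosen for the example, as is treating `startUrl` as required.

```python
def validate_input(raw):
    """Merge assumed defaults into `raw` and check the documented ranges."""
    defaults = {
        "useSelenium": False,       # assumed default
        "allowDuplicates": False,   # assumed default
        "maxDepth": 2,              # assumed default
        "maxChildrenPerLink": 5,    # assumed default
        "sameDomainOnly": True,     # documented default
        "ignoredExtensions": [],    # assumed default
    }
    if not raw.get("startUrl"):
        raise ValueError("startUrl is required")  # assumption: required field

    cfg = {**defaults, **raw}
    if not 1 <= cfg["maxDepth"] <= 30:
        raise ValueError("maxDepth must be between 1 and 30")
    if not 1 <= cfg["maxChildrenPerLink"] <= 100:
        raise ValueError("maxChildrenPerLink must be between 1 and 100")
    return cfg
```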

Output

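The actor outputs a JSON array with one object per discovered URL. Each object records the URL, its link text (name), its query string, its depth, and the URL of the page it was found on (parentUrl):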

[
  {
    "url": "https://jamesclear.com/five-step-creative-process",
    "name": null,
    "query": "",
    "depth": 0,
    "parentUrl": null
  },
  {
    "url": "https://jamesclear.com/",
    "name": null,
    "query": "",
    "depth": 1,
    "parentUrl": "https://jamesclear.com/five-step-creative-process"
  },
  {
    "url": "https://jamesclear.com/books",
    "name": "Books",
    "query": "",
    "depth": 1,
    "parentUrl": "https://jamesclear.com/five-step-creative-process"
  },
  {
    "url": "https://jamesclear.com/articles",
    "name": "Articles",
    "query": "",
    "depth": 1,
    "parentUrl": "https://jamesclear.com/five-step-creative-process"
  },
  {
    "url": "https://jamesclear.com/3-2-1",
    "name": "Newsletter",
    "query": "",
    "depth": 2,
    "parentUrl": "https://jamesclear.com/"
  },
  {
    "url": "https://jamesclear.com/events?g=4",
    "name": "Speaking",
    "query": "g=4",
    "depth": 2,
    "parentUrl": "https://jamesclear.com/"
  }
]
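Because each record carries a parentUrl, the flat output can be turned back into a hierarchy on the consumer side. The sketch below is one way to do that. The helper names `build_tree` and `render` are hypothetical, and the sample records are a subset of the output shown above.

```python
from collections import defaultdict


def build_tree(records):
    """Group child URLs under their parent; return (root_url, children_map)."""
    children = defaultdict(list)
    root = None
    for rec in records:
        if rec["parentUrl"] is None:
            root = rec["url"]  # the start URL has no parent
        else:
            children[rec["parentUrl"]].append(rec["url"])
    return root, dict(children)


def render(url, children, depth=0, seen=None):
    """Return indented lines for the subtree rooted at `url`."""
    seen = set() if seen is None else seen
    lines = ["  " * depth + url]
    if url in seen:
        return lines  # already expanded; avoids infinite loops on cycles
    seen.add(url)
    for child in children.get(url, []):
        lines.extend(render(child, children, depth + 1, seen))
    return lines


records = [
    {"url": "https://jamesclear.com/five-step-creative-process", "parentUrl": None},
    {"url": "https://jamesclear.com/",
     "parentUrl": "https://jamesclear.com/five-step-creative-process"},
    {"url": "https://jamesclear.com/books",
     "parentUrl": "https://jamesclear.com/five-step-creative-process"},
    {"url": "https://jamesclear.com/3-2-1", "parentUrl": "https://jamesclear.com/"},
]

root, children = build_tree(records)
print("\n".join(render(root, children)))
```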

Example Usage

Basic Crawl

To create a basic map of a website with default settings:

{
  "startUrl": "https://google.com",
  "useSelenium": false,
  "maxDepth": 2,
  "maxChildrenPerLink": 5
}

Deep Crawl with Selenium

For a deeper crawl of a JavaScript-heavy website:

{
  "startUrl": "https://jamesclear.com/five-step-creative-process",
  "useSelenium": true,
  "maxDepth": 2,
  "maxChildrenPerLink": 5,
  "allowDuplicates": false,
  "ignoredExtensions": ["gif", "jpg", "png", "css", "jpeg", "pdf", "doc", "docx"]
}
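The same inputs can also be sent programmatically. The following is a minimal sketch using the apify-client Python package. The helper names are hypothetical, the actor ID is a placeholder you would copy from the store page, and it is an assumption that the actor writes its records to the run's default dataset.

```python
def build_run_input(start_url, use_selenium=False, max_depth=2, max_children=5):
    """Build an input object using the documented parameter names."""
    return {
        "startUrl": start_url,
        "useSelenium": use_selenium,
        "maxDepth": max_depth,
        "maxChildrenPerLink": max_children,
    }


def run_crawler(token, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Assumption: results are stored in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example (placeholders, not real values):
# items = run_crawler("<APIFY_TOKEN>", "<username>/<actor-name>",
#                     build_run_input("https://google.com"))
```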

Implementation Details

This actor is built with:

  • Apify Python SDK
  • BeautifulSoup for standard HTML parsing
  • Selenium with Chrome WebDriver for JavaScript-heavy pages
  • Asynchronous processing for better performance
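To make the crawl logic concrete, here is a simplified sketch of a breadth-first link crawl that produces records shaped like the actor's output. It is not the actor's implementation: it uses the standard library's html.parser in place of BeautifulSoup, runs synchronously, and takes a caller-supplied `fetch` function (hypothetical) that returns a page's HTML.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip() or None
            self.links.append((self._href, name))
            self._href = None


def make_record(url, name, depth, parent):
    return {"url": url, "name": name, "query": urlparse(url).query,
            "depth": depth, "parentUrl": parent}


def crawl(start_url, fetch, max_depth=2, max_children=5):
    """Breadth-first crawl; `fetch(url)` must return the page HTML as a string."""
    records = [make_record(start_url, None, 0, None)]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        parent, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the maximum depth
        parser = LinkParser()
        parser.feed(fetch(parent))
        added = 0
        for href, name in parser.links:
            if added >= max_children:
                break  # cap on children per parent link
            url = urljoin(parent, href)  # resolve relative links
            if url in seen:
                continue  # skip duplicates
            seen.add(url)
            records.append(make_record(url, name, depth + 1, parent))
            queue.append((url, depth + 1))
            added += 1
    return records
```

Passing `fetch` as an argument keeps the traversal independent of how pages are retrieved, so the same loop could sit behind either a plain HTTP fetcher or a browser-driven one.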

Notes

  • JavaScript-heavy pages may require the useSelenium option to be enabled
  • For very large websites, use lower maxDepth and maxChildrenPerLink values to avoid hitting memory limits or very long run times
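A quick way to see why these two settings matter is to estimate the worst-case number of URLs. The sketch below assumes that every page contributes at most maxChildrenPerLink new links and that depth is counted from 0 at the start URL, as in the sample output. The function name `max_urls` is hypothetical.

```python
def max_urls(max_depth, max_children):
    """Upper bound on records: 1 + c + c^2 + ... + c^max_depth."""
    return sum(max_children ** d for d in range(max_depth + 1))


print(max_urls(2, 5))   # 1 + 5 + 25 = 31
print(max_urls(3, 10))  # 1 + 10 + 100 + 1000 = 1111
```

The bound grows exponentially with maxDepth, so lowering the depth by one usually cuts the crawl size far more than a small reduction in maxChildrenPerLink.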