URL Parser: Parse for domains, subdomains, slugs, and more

sorrek/url-parser

A more powerful URL parser to extract domain names, subdomains, slugs, extensions/TLDs, and more.

👋 About Us

Sorrek is a domain intelligence provider that tracks domain redirects for over 300 million websites globally to identify and resolve duplicates in company datasets such as CRMs. For more information on the types of data we provide, please visit sorrek.io.

Check out our other Actors too!

🚀 What makes this better than urllib or other tools?

We developed this URL parser internally after existing tools and packages, such as urllib, couldn't live up to the task. We found that these tools rely on simple text-parsing methods, such as splitting a URL by a period, which fail in some cases: if a website has an uncommon top-level domain, a second-level domain, several subdomains, a complicated URL slug, or some other quirk, the incorrect output causes errors downstream.

Our URL Parser is purpose-built for extracting the individual pieces of even the messiest URLs. We've curated a comprehensive list of Top-Level Domains to identify the correct extension in the URL, and from there we can extract all of the other pieces with greater accuracy while retaining the speed and efficiency of other tools.
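To make the difference concrete, here is a minimal Python sketch (illustrative only, not the Actor's internal code) of why splitting on the last period misreads multi-label extensions, and how a suffix-aware lookup recovers the correct TLD:

# Illustrative sketch only -- not the Actor's implementation.
# KNOWN_TLDS is a tiny stand-in for the 5,000+ TLDs the parser recognizes.
KNOWN_TLDS = {"com", "org", "io", "co.uk", "com.au", "ac.uk"}

def naive_extension(host: str) -> str:
    # Splitting on the last period misreads multi-label extensions like "ac.uk".
    return host.split(".")[-1]

def suffix_aware_extension(host: str) -> str | None:
    # Try the longest matching suffix first (e.g. "ac.uk" before "uk").
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_TLDS:
            return candidate
    return None

print(naive_extension("www.sbs.ox.ac.uk"))         # uk  (wrong)
print(suffix_aware_extension("www.sbs.ox.ac.uk"))  # ac.uk (matches the sample output below)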

Domain name extraction from email addresses is also supported (see Sample 2 below).

All of these features make it far more reliable to identify valid webpages in the link extracts produced by BeautifulSoup and other libraries, saving time and computational effort, and the parsed URL components can help prioritize crawling.

🌐 What is a Top-Level Domain (TLD)?

A Top-Level Domain, also known as a TLD or a domain extension, is the highest level in the hierarchical domain name system of the internet. It is the last part of a domain name, appearing after the final dot in an internet address. TLDs are divided into two main categories:

  1. Generic Top-Level Domains (gTLDs):
    • .com
    • .org
    • .net
    • .gov
    • .edu
    • .info
  2. Country Code Top-Level Domains (ccTLDs):
    • .us (United States)
    • .uk (United Kingdom)
    • .ca (Canada)
    • .au (Australia)
    • .jp (Japan)
    • .de (Germany)

This is of course not a comprehensive list, as there are thousands of TLDs and several other categories. The picture is further complicated by Second-Level Domains, which often sit within a country code TLD (ccTLD) structure. Some examples are:

  • .co.uk
  • .com.au
  • .gov.uk

The Sorrek URL Parser currently recognizes over 5,000 TLDs, including Second-Level Domains. The list is updated frequently by checking the ICANN and IANA databases and by analyzing our parser's results against millions of new domain registrations each month.

More information on TLDs is available from ICANN and IANA.

📥 Input

The input for the URL Parser is a list of URLs that you would like to parse. If you only wish to parse one URL, include it as a single-element list.
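If you are calling the Actor programmatically, a minimal sketch with the Apify Python client could look like the one below. The Actor ID comes from this listing; the name of the input field ("urls") is an assumption, so confirm it against the Actor's input schema in the Apify Console.

# Minimal sketch using the Apify Python client (pip install apify-client).
# NOTE: the "urls" input key is an assumption -- check the Actor's input schema.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("sorrek/url-parser").call(
    run_input={"urls": ["https://en.wikipedia.org/wiki/Top-level_domain"]},
)

# Each dataset item is one parsed record with the fields described below.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["extracted_domain_name"], item["extracted_slug"])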

📤 Output Details

The output of the URL Parser is a JSON array of records, one per input URL, with the following fields:

  • input_url: The input URL value.
  • extracted_status: The status of the extraction for this particular record.
  • extracted_ssl: The scheme of the URL (e.g. http or https).
  • extracted_subdomain: The full subdomain string; if multiple subdomain levels exist, they are returned as a single dot-separated string (e.g. www.sbs).
  • extracted_branded_domain: The domain name in the URL without a top-level domain (TLD). For example, the extracted_branded_domain for https://docs.apify.com/ would be "apify".
  • extracted_extension: The top-level domain (TLD) for the URL.
  • extracted_slug: The slug for the URL, for example, the extracted_slug for https://apify.com/about would be "about".
  • extracted_query: The query string in the URL, if it exists. This is relatively uncommon and appears after a ? in the URL. If multiple query strings exist (denoted by multiple ? characters), they are concatenated and separated by a ?.
  • extracted_port: The port in the URL, if it exists.
  • extracted_domain_name: The full domain name. This is a concatenation of the extracted_branded_domain and extracted_extension, and is what you would normally see in an email address.
  • extracted_url: Your input URL, lowercased, with the extracted_query and extracted_port removed.
  • extracted_website: The website. This is a concatenation of the extracted_subdomain (if present), extracted_domain_name, and extracted_slug.

If the input URL does not conform to a valid URL format (e.g. it contains special characters that cannot appear in URLs or is missing a recognized top-level domain), then all of the extracted fields are returned as None-type values. NOTE: This tool does not determine whether a URL or website actually exists, only whether it conforms to an accepted URL format.
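Because invalid inputs come back with None-type fields, it can be handy to split results before any downstream processing. Here is a small sketch, assuming results is the list of output records shown in the samples below:

# Separate successfully parsed records from invalid ones.
def split_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    valid = [r for r in results if r.get("extracted_domain_name") is not None]
    invalid = [r for r in results if r.get("extracted_domain_name") is None]
    return valid, invalid

# Example: collect the unique registered domains, e.g. for CRM deduplication.
def unique_domains(results: list[dict]) -> set[str]:
    return {r["extracted_domain_name"] for r in results
            if r.get("extracted_domain_name") is not None}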

🔍 Output Samples

Sample 1: Parsing a single URL

Input: ["https://en.wikipedia.org/wiki/Top-level_domain"]

[
  {
    "input_url": "https://en.wikipedia.org/wiki/Top-level_domain",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "en",
    "extracted_branded_domain": "wikipedia",
    "extracted_extension": "org",
    "extracted_slug": "wiki/top-level_domain",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "wikipedia.org",
    "extracted_url": "https://en.wikipedia.org/wiki/top-level_domain",
    "extracted_website": "en.wikipedia.org/wiki/top-level_domain"
  }
]

Sample 2: Parsing multiple URLs

Input: ["https://sorrek.io/", "https://www.godaddy.com/domainsearch/find?domainToCheck=examplesite.com", "https://www.sbs.ox.ac.uk/about-us/school/our-history", "https://calculator.aws/#/", "no-reply@sorrek.io", "https://invalid.domain"]

[
  {
    "input_url": "https://sorrek.io/",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": null,
    "extracted_branded_domain": "sorrek",
    "extracted_extension": "io",
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "sorrek.io",
    "extracted_url": "https://sorrek.io",
    "extracted_website": "sorrek.io"
  },
  {
    "input_url": "https://www.godaddy.com/domainsearch/find?domainToCheck=examplesite.com",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "www",
    "extracted_branded_domain": "godaddy",
    "extracted_extension": "com",
    "extracted_slug": "domainsearch/find",
    "extracted_query": "?domaintocheck=examplesite.com",
    "extracted_port": null,
    "extracted_domain_name": "godaddy.com",
    "extracted_url": "https://www.godaddy.com/domainsearch/find",
    "extracted_website": "www.godaddy.com/domainsearch/find"
  },
  {
    "input_url": "https://www.sbs.ox.ac.uk/about-us/school/our-history",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "www.sbs",
    "extracted_branded_domain": "ox",
    "extracted_extension": "ac.uk",
    "extracted_slug": "about-us/school/our-history",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "ox.ac.uk",
    "extracted_url": "https://www.sbs.ox.ac.uk/about-us/school/our-history",
    "extracted_website": "www.sbs.ox.ac.uk/about-us/school/our-history"
  },
  {
    "input_url": "https://calculator.aws/#/",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": null,
    "extracted_branded_domain": "calculator",
    "extracted_extension": "aws",
    "extracted_slug": "#/",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "calculator.aws",
    "extracted_url": "https://calculator.aws/#/",
    "extracted_website": "calculator.aws/#/"
  },
  {
    "input_url": "no-reply@sorrek.io",
    "extracted_status": "Success",
    "extracted_ssl": null,
    "extracted_subdomain": null,
    "extracted_branded_domain": "sorrek",
    "extracted_extension": "io",
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "sorrek.io",
    "extracted_url": "sorrek.io",
    "extracted_website": "sorrek.io"
  },
  {
    "input_url": "https://invalid.domain",
    "extracted_status": null,
    "extracted_ssl": null,
    "extracted_subdomain": null,
    "extracted_branded_domain": null,
    "extracted_extension": null,
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": null,
    "extracted_url": null,
    "extracted_website": null
  }
]

📣 Your Feedback

We are always working on improving the performance of our Actors, so if you have any technical feedback or have found a bug, please create an issue on the Actor's Issues tab in the Apify Console.

Happy probing! 🕵️‍♂️✨
