URL Parser: Parse for domains, subdomains, slugs, and more

7 days trial then $8.00/month - No credit card required now

sorrek/url-parser

A more powerful URL parser to extract domain names, subdomains, slugs, extensions/TLDs, and more.

👋 About Us

Sorrek is a domain intelligence provider that tracks domain redirects for over 300 million websites globally to identify and resolve duplicates in company datasets such as CRMs. For more information on the types of data we provide, please visit sorrek.io.

Check out our other Actors too!

🚀 What makes this better than urllib or other tools?

We developed this URL parser internally after existing tools and packages, such as urllib, couldn't live up to the task. We found that these tools relied on simple text parsing methods, such as splitting a URL on periods, which didn't work in some cases. Namely, if a website had an uncommon top-level domain, a second-level domain, several subdomains, a complicated URL slug, or some other challenge, those tools would produce incorrect output that caused downstream errors.
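
To illustrate the failure mode, here is a minimal sketch of the naive approach. It is not taken from any particular library's internals: it takes the hostname from Python's urllib.parse.urlsplit and splits it on periods, assuming the last label is the extension and the one before it is the domain. The helper name naive_parts is hypothetical.

```python
from urllib.parse import urlsplit


def naive_parts(url):
    """Naive split: assume the last label is the TLD and the one before it is the domain."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return None
    return {
        "subdomain": ".".join(labels[:-2]) or None,
        "domain": labels[-2],
        "extension": labels[-1],
    }


# Works for a simple .com host ...
simple = naive_parts("https://docs.apify.com/")
# ... but mislabels a host under a second-level domain such as ac.uk.
tricky = naive_parts("https://www.sbs.ox.ac.uk/about-us/school/our-history")

print(simple)
print(tricky)
```

For the second URL the naive split reports "ac" as the domain and "uk" as the extension, whereas the intended result is "ox" and "ac.uk".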

Our URL Parser is purpose-built for extracting the individual pieces of even the messiest URLs. We've curated a comprehensive list of Top-Level Domains to identify the correct extension in the URL. Then, from there, we're able to extract all of the other pieces with greater accuracy while retaining the speed and efficiency of other tools.

Domain name extraction from email addresses is also supported.
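
As a rough illustration of what this means for an input such as no-reply@sorrek.io, the sketch below takes everything after the last @ as the domain name. This is an assumption about the general idea, not the Actor's actual implementation, and the function name domain_from_email is hypothetical.

```python
def domain_from_email(address):
    """Return the lowercased part after the last '@', or None if the input is not email-like."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


print(domain_from_email("no-reply@sorrek.io"))
```

The resulting domain name can then be split into its branded domain and extension in the same way as the host of a URL.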

Together, these features make it far more reliable to identify valid webpages among the links extracted from an HTML page with BeautifulSoup or similar libraries. This can save time and computational effort, and the individual URL parts can help prioritize crawling.

🌐 What is a Top-Level Domain (TLD)?

A Top-Level Domain, also known as a TLD or a domain extension, is the highest level in the hierarchical domain name system of the internet. It is the last part of a domain name, appearing after the final dot in an internet address. TLDs are divided into two main categories:

  1. Generic Top-Level Domains (gTLDs):
    • .com
    • .org
    • .net
    • .gov
    • .edu
    • .info
  2. Country Code Top-Level Domains (ccTLDs):
    • .us (United States)
    • .uk (United Kingdom)
    • .ca (Canada)
    • .au (Australia)
    • .jp (Japan)
    • .de (Germany)

This is not a comprehensive list: there are thousands of TLDs and several other categories. The picture is further complicated by Second-Level Domains, which often exist within a country code TLD (ccTLD) structure. Some examples are:

  • .co.uk
  • .com.au
  • .gov.uk

The Sorrek URL Parser currently recognizes over 5,000 TLDs, including Second-Level Domains. We keep this list up to date by regularly checking the ICANN and IANA databases and by analyzing the results of our parser against millions of new domain registrations each month.
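
The idea of matching a host against a list of known extensions can be sketched as a longest-suffix lookup. The example below is an illustration under stated assumptions, not the Actor's code: KNOWN_SUFFIXES is a tiny hand-picked set standing in for the full list, and split_host is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Tiny illustrative set; a stand-in for the full list of recognized TLDs.
KNOWN_SUFFIXES = {"com", "org", "io", "uk", "co.uk", "ac.uk", "gov.uk", "au", "com.au", "aws"}


def split_host(url):
    """Split a URL's host into subdomain, branded domain, and extension."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the longest candidate suffix first so that "ac.uk" wins over "uk".
    # Starting at index 1 guarantees at least one label is left for the domain.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return {
                "subdomain": ".".join(labels[:i - 1]) or None,
                "branded_domain": labels[i - 1],
                "extension": suffix,
            }
    # No known extension found: treat the input as invalid.
    return None


print(split_host("https://www.sbs.ox.ac.uk/about-us/school/our-history"))
print(split_host("https://invalid.domain"))
```

Checking longer suffixes before shorter ones is what keeps "ox" from being swallowed into the subdomain when both "uk" and "ac.uk" are in the list.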

More information on TLDs can be found here.

📥 Input

The input for the URL Parser is a list of URLs that you would like to parse. If you only wish to parse one URL you will need to include it in its own list.
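
If your code sometimes holds a single URL and sometimes a collection, a small helper can make sure the value is always a list before it is passed on. This is only a sketch: as_url_list is a hypothetical name, and the name of the input field in the Actor's input form is not shown here.

```python
import json


def as_url_list(urls):
    """Wrap a single URL string in a list; return other iterables as a list."""
    if isinstance(urls, str):
        return [urls]
    return list(urls)


single = as_url_list("https://en.wikipedia.org/wiki/Top-level_domain")
several = as_url_list(("https://sorrek.io/", "no-reply@sorrek.io"))

# Serialize to the JSON list form used in the samples.
print(json.dumps(single))
print(json.dumps(several))
```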

📤 Output Details

The output of the URL Parser is a JSON array containing one object per input URL, each with the following fields:

  • input_url: The input URL value.
  • extracted_status: The status of the extraction for this particular record.
  • extracted_ssl: The protocol (scheme) of the URL (e.g. http or https).
  • extracted_subdomain: The full subdomain string. If more than one subdomain exists, they are all included in a single string.
  • extracted_branded_domain: The domain name in the URL without a top-level domain (TLD), for example the extracted_branded_domain for https://docs.apify.com/ would be "apify".
  • extracted_extension: The top-level domain (TLD) for the URL.
  • extracted_slug: The slug for the URL, for example, the extracted_slug for https://apify.com/about would be "about".
  • extracted_query: The query string in the URL, if it exists. Query strings are relatively uncommon and appear after a ? following the URL slug. If multiple query strings exist, denoted by multiple ? characters, they are concatenated and separated by a ?.
  • extracted_port: The port in the URL, if it exists.
  • extracted_domain_name: The full domain name. This is a concatenation of the extracted_branded_domain and extracted_extension, and is what you would normally see in an email address.
  • extracted_url: This is your input URL with the extracted_query and extracted_port removed.
  • extracted_website: The website. This is a concatenation of the extracted_subdomain (if present), extracted_domain_name, and extracted_slug.
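
The relationship between the composite fields and their parts can be checked with a short sketch. The record below copies values from the third record of Sample 2; the compose function is hypothetical and reflects how the fields appear to combine in the samples, not the Actor's internal code.

```python
record = {
    "extracted_subdomain": "www.sbs",
    "extracted_branded_domain": "ox",
    "extracted_extension": "ac.uk",
    "extracted_slug": "about-us/school/our-history",
}


def compose(rec):
    """Rebuild the domain name and website strings from their component fields."""
    domain_name = f"{rec['extracted_branded_domain']}.{rec['extracted_extension']}"
    host = domain_name
    if rec.get("extracted_subdomain"):
        host = f"{rec['extracted_subdomain']}.{domain_name}"
    website = host
    if rec.get("extracted_slug"):
        website = f"{host}/{rec['extracted_slug']}"
    return domain_name, website


domain_name, website = compose(record)
print(domain_name)
print(website)
```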

If the input URL is not valid, meaning it does not conform to a URL format (e.g. it includes special characters that can't exist in URLs or is missing a valid top-level domain), then every extracted field is returned as a None-type value (null in the JSON output). NOTE: This tool does not determine whether a URL or website exists, only that it conforms to an accepted URL format.
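
When consuming the output, it can be handy to separate the records that parsed from those that came back empty. The sketch below assumes the output has already been downloaded as a JSON string; the records are abbreviated from Sample 2, and the is_parsed helper is hypothetical.

```python
import json

# Abbreviated records in the shape of the Actor's output.
raw = """
[
  {"input_url": "https://sorrek.io/", "extracted_status": "Success",
   "extracted_domain_name": "sorrek.io"},
  {"input_url": "https://invalid.domain", "extracted_status": null,
   "extracted_domain_name": null}
]
"""

records = json.loads(raw)  # JSON null becomes Python None


def is_parsed(rec):
    """A record counts as parsed when its status field reads 'Success'."""
    return rec.get("extracted_status") == "Success"


parsed_domains = [r["extracted_domain_name"] for r in records if is_parsed(r)]
rejected_inputs = [r["input_url"] for r in records if not is_parsed(r)]

print(parsed_domains)
print(rejected_inputs)
```

Keeping the rejected inputs alongside the parsed ones makes it easier to review which values were not recognized as URLs.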

🔍 Output Samples

Sample 1: Parsing a single URL

Input: ["https://en.wikipedia.org/wiki/Top-level_domain"]

[
  {
    "input_url": "https://en.wikipedia.org/wiki/Top-level_domain",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "en",
    "extracted_branded_domain": "wikipedia",
    "extracted_extension": "org",
    "extracted_slug": "wiki/top-level_domain",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "wikipedia.org",
    "extracted_url": "https://en.wikipedia.org/wiki/top-level_domain",
    "extracted_website": "en.wikipedia.org/wiki/top-level_domain"
  }
]

Sample 2: Parsing multiple URLs

Input: ["https://sorrek.io/", "https://www.godaddy.com/domainsearch/find?domainToCheck=examplesite.com", "https://www.sbs.ox.ac.uk/about-us/school/our-history", "https://calculator.aws/#/", "no-reply@sorrek.io", "https://invalid.domain"]

[
  {
    "input_url": "https://sorrek.io/",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": null,
    "extracted_branded_domain": "sorrek",
    "extracted_extension": "io",
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "sorrek.io",
    "extracted_url": "https://sorrek.io",
    "extracted_website": "sorrek.io"
  },
  {
    "input_url": "https://www.godaddy.com/domainsearch/find?domainToCheck=examplesite.com",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "www",
    "extracted_branded_domain": "godaddy",
    "extracted_extension": "com",
    "extracted_slug": "domainsearch/find",
    "extracted_query": "?domaintocheck=examplesite.com",
    "extracted_port": null,
    "extracted_domain_name": "godaddy.com",
    "extracted_url": "https://www.godaddy.com/domainsearch/find",
    "extracted_website": "www.godaddy.com/domainsearch/find"
  },
  {
    "input_url": "https://www.sbs.ox.ac.uk/about-us/school/our-history",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": "www.sbs",
    "extracted_branded_domain": "ox",
    "extracted_extension": "ac.uk",
    "extracted_slug": "about-us/school/our-history",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "ox.ac.uk",
    "extracted_url": "https://www.sbs.ox.ac.uk/about-us/school/our-history",
    "extracted_website": "www.sbs.ox.ac.uk/about-us/school/our-history"
  },
  {
    "input_url": "https://calculator.aws/#/",
    "extracted_status": "Success",
    "extracted_ssl": "https",
    "extracted_subdomain": null,
    "extracted_branded_domain": "calculator",
    "extracted_extension": "aws",
    "extracted_slug": "#/",
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "calculator.aws",
    "extracted_url": "https://calculator.aws/#/",
    "extracted_website": "calculator.aws/#/"
  },
  {
    "input_url": "no-reply@sorrek.io",
    "extracted_status": "Success",
    "extracted_ssl": null,
    "extracted_subdomain": null,
    "extracted_branded_domain": "sorrek",
    "extracted_extension": "io",
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": "sorrek.io",
    "extracted_url": "sorrek.io",
    "extracted_website": "sorrek.io"
  },
  {
    "input_url": "https://invalid.domain",
    "extracted_status": null,
    "extracted_ssl": null,
    "extracted_subdomain": null,
    "extracted_branded_domain": null,
    "extracted_extension": null,
    "extracted_slug": null,
    "extracted_query": null,
    "extracted_port": null,
    "extracted_domain_name": null,
    "extracted_url": null,
    "extracted_website": null
  }
]

📣 Your Feedback

We are always working on improving the performance of our Actors. If you have any technical feedback on this Actor or have found a bug, please create an issue on the Actor's Issues tab in the Apify Console.

Happy probing! 🕵️‍♂️✨

Developer
Maintained by Community
