VATIN PL Finder

This Actor is deprecated

This Actor is unavailable because the developer has decided to deprecate it.

whitefox/nip-finder

The actor finds and retrieves the VATIN (Polish 'numer NIP') from Polish websites. Scraping takes place exclusively on subpages where the PL VATIN is most likely to appear, e.g. contact, privacy policy, and terms and conditions pages. This tool can be used to enrich data in your CRM.

PL VATIN Finder

Introduction

The PL VATIN Finder is an Apify actor that finds and retrieves VAT Identification Numbers (Polish 'numer NIP') from Polish websites. The actor performs scraping exclusively on subpages where there is a higher probability of finding the PL VATIN, such as contact, privacy policy, or terms and conditions pages. Download your data in various formats: HTML table, JSON, CSV, Excel, and more.

Use Cases

This tool can be used to fetch NIP numbers from any number of websites. The collected data can be used for purposes such as:

Enriching data in your CRM system: If you have the domain addresses of your prospects and leads, this data allows you to remove duplicate companies in your system. By combining it with other available databases like CEIDG or KRS, you can enrich your CRM data with additional information.

Updating or completing accounting data: This tool can help you fill in missing data or update existing accounting records.

Prospecting: With the NIP number, you can look up a company in databases like CEIDG, CRBR, or KRS and obtain information about its owners. With the domain and the owner's name, you can derive a likely email address to reach the decision-maker directly.

Input

  • start_urls (array): List of URLs to start scraping from.
  • waitForRedirect (boolean): Whether to wait 3 seconds after each page load so that JavaScript redirects can complete (default: false).

Example Input

{
  "start_urls": ["https://example.com"],
  "waitForRedirect": false
}
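
If your starting point is a list of bare domains exported from a CRM, the input above can be generated rather than written by hand. The following is a minimal sketch in Python; the helper name `build_run_input` and the choice to default to `https://` are assumptions for illustration, not part of the actor.

```python
import json


def build_run_input(domains, wait_for_redirect=False):
    """Build an input object with the two documented fields.

    Hypothetical helper: it adds "https://" to bare domains and
    drops blanks and duplicates, which the actor itself may or may not do.
    """
    start_urls = []
    for domain in domains:
        url = domain.strip()
        if not url:
            continue  # skip empty entries
        if not url.startswith(("http://", "https://")):
            url = "https://" + url  # assumed default scheme
        if url not in start_urls:
            start_urls.append(url)
    return {"start_urls": start_urls, "waitForRedirect": wait_for_redirect}


if __name__ == "__main__":
    run_input = build_run_input(["example.com", " https://example.com ", ""])
    print(json.dumps(run_input, indent=2))
```

Both field names, `start_urls` and `waitForRedirect`, are taken from the input description; everything else in the sketch is illustrative.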

Output

The actor stores its results in the default dataset. Each item is an object containing the following properties:

  • url (string): The URL where the NIP(s) were found.
  • title (string): The title of the page.
  • nips (array): A list of found NIPs on the page.
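
Because the actor may visit several subpages of one website, the same NIP can appear in more than one dataset item. As a rough sketch of post-processing, the Python function below groups the results by domain. It assumes that `nips` holds strings that may contain hyphens or spaces; the function name `nips_by_domain` and the sample items are hypothetical.

```python
from urllib.parse import urlparse


def nips_by_domain(items):
    """Collect the unique, digits-only NIPs found for each domain."""
    grouped = {}
    for item in items:
        host = urlparse(item["url"]).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        bucket = grouped.setdefault(host, set())
        for nip in item.get("nips", []):
            digits = "".join(ch for ch in nip if ch.isdigit())
            if digits:
                bucket.add(digits)
    return {host: sorted(nips) for host, nips in grouped.items()}


if __name__ == "__main__":
    # Hypothetical dataset items in the documented shape (url, title, nips).
    sample = [
        {"url": "https://www.example.com/kontakt", "title": "Kontakt",
         "nips": ["769-19-76-060"]},
        {"url": "https://example.com/regulamin", "title": "Regulamin",
         "nips": ["7691976060", "9721276346"]},
    ]
    print(nips_by_domain(sample))
```

Normalizing to digits only makes `769-19-76-060` and `7691976060` compare as equal, which helps when matching against CRM records.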

How it works

The actor scrapes the specified websites in search of a Polish tax identification number (NIP). The data retrieval process works as follows:

  1. The actor loads the provided websites.

  2. The actor checks whether there is a redirect. If the input address is wp.com.pl and it redirects, the actor searches the target website, e.g., wp.pl.

  3. The actor searches for NIP numbers only within the given domain, taking into account the first redirect.

  4. You can enable the waitForRedirect parameter. The actor then waits 3 seconds after loading each website so that the page can finish loading. In practice, this is worth enabling when a website implements its redirect in JavaScript, which is a rather rare case.

  5. Upon entering the website, the actor collects all links up to 1 level deep. It keeps only the links that contain a keyword from the following list, plus the main page:

['kontakt', 'contact', 'polityka', 'przetwarzanie', 'regulamin', 'zwrot', 'privacy'] + main page

The rest of the subpages are ignored. Thanks to this, the actor can search hundreds of pages in a very short time with high efficiency. This works particularly well when looking for a NIP number on ecommerce-type websites.
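
The link selection described above can be sketched in Python as follows. This is an illustration under assumptions, not the actor's code: it matches keywords against the URL path only, and the function name `select_candidate_pages` is hypothetical. The keyword list is the one given above, and `base_url` stands for the address reached after the first redirect.

```python
from urllib.parse import urljoin, urlparse

KEYWORDS = ['kontakt', 'contact', 'polityka', 'przetwarzanie',
            'regulamin', 'zwrot', 'privacy']


def select_candidate_pages(base_url, hrefs):
    """Return the main page plus same-domain links whose path has a keyword."""
    base_host = urlparse(base_url).netloc.lower()
    selected = [base_url]  # the main page is always searched
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parsed.netloc.lower() != base_host:
            continue  # stay within the given domain
        path = parsed.path.lower()
        if not any(keyword in path for keyword in KEYWORDS):
            continue  # all other subpages are ignored
        if absolute not in selected:
            selected.append(absolute)
    return selected


if __name__ == "__main__":
    links = ["/kontakt", "/blog", "https://other.com/contact",
             "polityka-prywatnosci.html", "mailto:a@b.pl"]
    for page in select_candidate_pages("https://example.com/", links):
        print(page)
```

Restricting the crawl to a handful of pages per site is what keeps the number of requests small, at the cost of missing NIPs that appear only on other subpages.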

  6. To find the PL VATIN on a page, the actor uses a regular expression that gave the best results across many tests. The success rate, depending on the input, is about 40% across all pages. When the NIP number is actually present on every page, it rises to about 80-90%. The expression covers cases such as:
  • NIP: 9721276346
  • NIP 769-19-76-060
  • NIP: 542-000-01-62
  7. Data can be saved in formats such as CSV, JSON, or XLSX.
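
The actor's regular expression is not published here, so the Python sketch below uses a hypothetical pattern that merely covers the three formats listed above. It also adds a check-digit test (weights 6, 5, 7, 2, 3, 4, 5, 6, 7, sum modulo 11) as an optional filter for false positives. That filter is an addition for illustration, and nothing in this description says the actor applies it.

```python
import re

# Hypothetical pattern, NOT the actor's actual expression: the label "NIP",
# an optional colon or whitespace, then 10 digits with optional separators.
NIP_PATTERN = re.compile(r"\bNIP[:\s]*((?:\d[\s-]?){9}\d)(?!\d)", re.IGNORECASE)

# Weights for the first nine digits of a NIP.
WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)


def is_valid_nip(nip):
    """Check the 10th digit against the weighted sum of the first nine."""
    if len(nip) != 10 or not nip.isdecimal():
        return False
    checksum = sum(w * int(d) for w, d in zip(WEIGHTS, nip)) % 11
    return checksum == int(nip[9])  # a remainder of 10 never matches


def extract_nips(text, validate=True):
    """Return unique digits-only NIPs in order of first appearance."""
    found = []
    for match in NIP_PATTERN.finditer(text):
        nip = re.sub(r"\D", "", match.group(1))  # drop hyphens and spaces
        if validate and not is_valid_nip(nip):
            continue
        if nip not in found:
            found.append(nip)
    return found


if __name__ == "__main__":
    page_text = "NIP: 9721276346, NIP 769-19-76-060 oraz NIP: 542-000-01-62"
    print(extract_nips(page_text))
```

All three example numbers listed above pass the check-digit test, while a made-up string such as `1234567890` is rejected unless `validate=False` is passed.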
Developer
Maintained by Community