AI Product Matcher
No credit card required
AI Product Matcher
No credit card required
Match products across multiple e-commerce websites. Use this AI product matching Actor whenever you need to find matching pairs of products from different online shops for dynamic pricing, competitor analysis or market research.
Do you want to learn more about this Actor?
Get a demoWhat is the AI Product Matcher and how does it work?
The AI Product Matcher Actor uses a custom machine learning model we've developed to solve the issue of mapping products across various online stores. You can use this tool to find the same products across different online stores for the purposes of dynamic pricing, competitor analysis, and market research. You can use it both to completely replace the manual mapping of products or to make it more efficient, using different settings explained in the Input section of this readme.
In order to use the matcher, you will need to already have the datasets of products that you want to match. To get them, you can either scrape them directly on our platform (by making your own custom scraper or using one of the many available to you in our Store) or upload them to the platform using our API (with clients available in Javascript and in Python). Keep in mind that the matcher currently only works with English data.
In case you want our help with building the scrapers or even want a complete data pipeline fully managed by us and tailored to your specific use case, you can contact our enterprise team and we will be happy to help you. Meanwhile, you can start off with any of our available e-commerce scrapers:
📦 Amazon Product Scraper | 💅 AliExpress Scraper |
🔍 Google Shopping Scraper | 🛍️ Shopify Scraper |
💸 eBay Items Scraper | 🛒 Walmart Scraper |
How should the input look?
This section describes how to prepare the input for the matcher Actor. At the end of it, you can find examples of what the filled input could look like in practice. For guidance, follow our step-by-step Product Matcher tutorial 🔗 or this video tutorial:
How to specify input datasets?
There are two ways to use the Actor depending on the format your dataset is in:
- a dataset containing candidate pairs - You might already have a dataset in which each row contains information about the two products that you want the matcher to compare. In this case, you put the ids of these pair datasets in the pair_dataset_ids input (you can put multiple ids there in case you have multiple datasets that you want to be processed at once). The matcher will check each row and output a decision on whether the products in each row are the same.
- two separate datasets of products - In other use cases, you might have two separate datasets of products, each containing products from one online store. In this case, you put the ids of these datasets into the inputs dataset1_ids (for datasets of products from the online shop) and dataset2_ids (for datasets of products from the second online shop). The matcher will then look at all the possible pairings of products between the dataset and output a decision about each of them.
How to specify the input dataset format?
The next part of the Actor's input is telling the matcher what format the dataset you are giving to it is in, represented by the input_mapping Actor input. It should be a JSON object containing two attributes: eshop1 and eshop2. Each one describes under what attributes can the Actor find the necessary data for the specific online store. Each of these two attributes should contain another object that looks like this, for example:
1{ 2 "id": "productUrl", 3 "name": "productName", 4 "price": "currentPrice", 5 "short_description": "short_description", 6 "long_description": "long_description", 7 "specification": "specification", 8 "code": [ 9 "SKU", 10 "ASIN" 11 ] 12}
Each attribute of this object (such as name) specifies where each necessary attribute of the product can be found in the product dataset (continuing with the same example, it says that the product's name can be found in the attribute productName of the dataset which you gave to the matcher). These necessary attributes are as follows:
- id - unique identifier of the product. Has to be provided for the matcher to function, but isn't used as input of the machine learning model, so it can be anything convenient, such as the URL.
- name - the product's name.
- price - the current selling price of the product. Can be empty if not available. Can also contain the currency symbol (such as "$50" instead of just the number 50). However, the matcher currently disregards the currency so if you want to compare products in different currencies, you need to perform the currency conversion yourself.
- short_description - most online stores provide short (several lines at most) descriptions of the product close to the product name, price, and image. It usually describes the most important features or specifies some of the product's parameters (e.g. "32GB RAM, 500GB Hard Drive, Intel Core i3" for a laptop).
- long_description - most online stores also provide a longer description of the product, often including text provided by the manufacturer.
- specification - this should be a JSON array containing within it the product's parameters (such as weight, dimensions, components in case of electronics, color, etc.) often provided in a big table on the product's page. Each parameter should be represented by a JSON object containing properties key and value. The whole specification for one product could for example look like this:
1[ 2 { 3 "key": "RAM memory", 4 "value": "16 GB" 5 }, 6 { 7 "key": "CPU", 8 "value": "Intel Core i3" 9 }, 10 { 11 "key": "Display resolution", 12 "value": "1920:1080" 13 } 14]
- code - this attribute is special in that it allows you to specify more than one input dataset attribute, as you can see in the example above. The attributes specified should contain codes of the product if these are available, such as ASIN, SKU, EAN, etc. The more of them you can get, the better.
Since many scrapers use nested values, we also added support for them. For instance, your data might include a price attribute looking like this:
1{ 2 ... 3 "price": { 4 "value": "123.45", 5 "currency": "$" 6 }, 7 ... 8}
In that case, you can specify which nested attribute you want using the "/" path notation, same as you would for example for files. For the price example, it would look like this:
1{ 2 ... 3 "price": "price/value", 4 ... 5}
You don't always have to provide all of these attributes, some of them might not even be present on the specific e-shops you wish to use the matcher for. However, not providing them might result in degraded accuracy of the matcher. For more details, see the section about performance.
How to specify the output dataset format?
After specifying the format of the input dataset, you should specify which attributes should be included in the output dataset of the matcher. This can be done using the output_mapping Actor input, which is very similar to input_mapping, as you can see from this example:
1{ 2 "eshop1": { 3 "id_source": "productUrl", 4 "name_source": "productName" 5 }, 6 "eshop2": { 7 "id_target": "EAN", 8 "name_target": "productName" 9 } 10}
Same as before, you specify the attributes separately for each online store. Each line then specifies what will the attribute be called in the output dataset (e.g. id_source) and which attribute it was in the corresponding input dataset (e.g. productUrl). Apart from these, the output dataset will also contain two more attributes for each considered product pair:
- predicted_match - will be 1 in case the matcher thinks the two products in the considered pair are the same, 0 if not.
- predicted_scores - specifies how much the matcher thinks the two products are the same. Will be close to 1 for those that it considers to definitely be the same, and close to 0 for those it considers to definitely not be the same. This output attribute will be useful if you decide to apply your own threshold. For example, if you need to be as sure as possible, you could only take pairs with very high predicted_scores.
Precision/recall tradeoff
As mentioned in the previous section, you can either use the matcher to replace the manual mapping of products, or make it more efficient by using different settings. For this, you need to specify a setting which is generally called a precision/recall tradeoff, represented by the "Precision/recall tradeoff" input in the form, or the "precision_recall" attribute in the JSON form of input. Since no machine learning model is flawless (even humans aren't, after all), its results can contain mistakes. This setting allows you to specify which type of mistake is more of an issue to you and thus which of them the model should try to minimize. You can set it to one of two settings:
- precision - the model will try to make sure that if it marks two products as the same, they will be the same with the highest possible degree of accuracy. While this produces more reliable product pairs, it also means that more true product pairs will be marked as different products, since the model has to be more discerning to achieve higher precision.
- recall - the model will try to make sure that as many pairs of products where both products are the same are found as possible, even if it means that there will also be more pairs where the model made a mistake and the two products aren't actually the same.
For specific numbers on the performance, check the expected performance section of this readme.
Sample input for a dataset of candidate pairs
1{ 2 "pair_dataset_ids": [ 3 "Insert your dataset IDs here" 4 ], 5 "input_mapping": { 6 "eshop1": { 7 "id": "id1", 8 "name": "name1", 9 "price": "price1", 10 "short_description": "short_description1", 11 "long_description": "long_description1", 12 "specification": "specification1", 13 "code": [ 14 "SKU", 15 "ASIN" 16 ] 17 }, 18 "eshop2": { 19 "id": "id2", 20 "name": "name2", 21 "price": "price2", 22 "short_description": "short_description2", 23 "long_description": "long_description2", 24 "specification": "specification2", 25 "code": [ 26 "EAN", 27 "ASIN" 28 ] 29 } 30 }, 31 "output_mapping": { 32 "eshop1": { 33 "id_source": "id1", 34 "name_source": "name1" 35 }, 36 "eshop2": { 37 "id_target": "id2", 38 "name_target": "name2" 39 } 40 }, 41 "precision_recall": "precision" 42}
Sample input for two separate datasets of products
1{ 2 "dataset1_ids": [ 3 "Insert your dataset IDs here" 4 ], 5 "dataset2_ids": [ 6 "Insert your dataset IDs here" 7 ], 8 "input_mapping": { 9 "eshop1": { 10 "id": "url", 11 "name": "name", 12 "price": "price", 13 "short_description": "shortDescription", 14 "long_description": "longDescription", 15 "specification": "specification", 16 "code": [ 17 "SKU", 18 "ASIN" 19 ] 20 }, 21 "eshop2": { 22 "id": "productUrl", 23 "name": "name", 24 "price": "price", 25 "short_description": "shortDescription", 26 "long_description": "longDescription", 27 "specification": "specifications", 28 "code": [ 29 "EAN", 30 "ASIN" 31 ] 32 } 33 }, 34 "output_mapping": { 35 "eshop1": { 36 "id_source": "url", 37 "name_source": "name" 38 }, 39 "eshop2": { 40 "id_target": "productUrl", 41 "name_target": "name" 42 } 43 }, 44 "precision_recall": "precision" 45}
What will the results look like and where can you find them?
The results will be stored in the default dataset of the Actor run, accessible through the run page in Apify Console. You can download them both manually and through an API in a variety of formats including JSON, CSV, and Excel.
For a detailed description of the format of the results, see the output format subsection of the previous section.
How accurate is the matcher?
We have striven to make the matcher as accurate as we can, gathering thousands of manually annotated pairs of products from various categories and using them to train the model. As is the common practice, we’ve also prepared a separate dataset of product pairs that the model never saw during the training process and then estimated the model's performance using these pairs. The precise results of course depend on the precision/recall tradeoff model setting:
- With the AI model trained for precision, we measured precision of 95% (meaning that when the matcher said that two products from different e-shops were the same product, it was true in 95% of cases) with recall being 60% (meaning that the matcher found 60% of the pairs containing the same product that could be found).
- With the AI model trained for recall, we measured recall of 95% with precision being 55%.
Of course, since you will be providing your own data which will probably be coming from different online stores than we tested, the performance you see might differ, even though we tried to make the matcher as general as possible. For that reason, we recommend you do your own investigation of the accuracy of the matcher's results before you use the results for your use case. Future versions of this Actor will give you the option to use your data to train the matcher in order to alleviate this issue.
Please also note that:
- The matcher's performance is heavily dependent on what data you provide to it. Missing any of the expected data attributes might result in degraded performance, especially in case of the name, price, and code attributes.
- The matcher is trained to consider different color variants of the same product to be the same product (e.g. different color variants of the same smartphone will be considered the same product).
- The matcher's decisions might differ with different build versions of this Actor, due to changes to the underlying machine learning model we might make in the future in order to improve its general performance. If you want to make sure the matcher's decisions don't change, pick a specific build version and set the Actor to it in your account instead of the "latest" version.
How much will it cost?
This Actor is paid using the pay-per-result model, meaning you will pay a small amount for each row in the output dataset (you can find the amount per 1000 results at the top right of this Actor's detail page in Store). In this case, a result is a decision about whether a specific pair of products is the same or not. This means that the amount you pay also depends on the type of input you provided:
- a dataset of candidate pairs - this case is very simple - the number of results is the same as the number of rows in the input datasets.
- two separate datasets of products - this one is more complicated because the Actor has to try all the possible pairings of products. Meaning that if you have x products from the first online store and y products from the second, the number of results will be x*y (e.g. for 50 products from the first online store and 30 from the second, the final number of results will be 1500).
In case you want to limit the number of potential results (and thus limit how much you pay at maximum), you can set the maximum number of results in the Actor's options.
- 24 monthly users
- 9 stars
- 84.8% runs succeeded
- 49 days response time
- Created in Apr 2023
- Modified 5 months ago