E-commerce Product Matching Tool
Pricing
from $1.00 / 1,000 vector matching results
Match products across e-commerce datasets with the E-Commerce Product Matching Tool. Use it with E-Commerce Scraping Tool datasets to automatically find identical and similar products and to power price monitoring or catalog comparison.
Developer: Tri⟁angle (maintained by Apify)
🛒 E-Commerce Product Matching Tool
Match and compare products across any two e-commerce datasets. Find identical, similar, and related products between your own catalog and any competitor - with optional AI validation for higher-confidence results.
🧠 What it does
The E-Commerce Product Matching Tool takes two product datasets and automatically finds which products in one match products in the other. It runs each dataset through a three-stage pipeline - converting products into comparable representations, scoring every possible pair for similarity, and optionally using an AI model to validate the results and explain its reasoning.
It is designed to work with datasets collected by the E-Commerce Scraping Tool.
The tool is useful for anyone who needs to reconcile, compare, or deduplicate product information across two sources - without manually reviewing thousands of rows.
⚙️ How it works
The tool runs your two datasets through a three-stage process:
🔢 Stage 1 - Vectorization
Five fields from each product are extracted and converted into numerical vectors: title, brand, category, description, and specifications. These vectors are stored in a vector database, which makes it possible to compare thousands of products in seconds based on semantic meaning - not just exact text matches. This means products can be matched even when they use different wording or formatting across retailers.
📐 Stage 2 - Similarity matching
Every product in Dataset A is compared against Dataset B and assigned a similarity score from 0 to 100. You can choose to include all evaluated pairs in the output, or filter to only the pairs that meet your similarity threshold.
🤖 Stage 3 - AI validation (optional)
If you enable LLM matching, an AI model reviews each candidate pair and gives a final verdict: is this a genuine match? It also provides a reasoning explanation so you can understand why it made each decision. This stage runs only on the pairs that passed the similarity threshold, which keeps costs under control.
    Dataset A + Dataset B
            ↓
    Vectorization
            ↓
    Similarity scoring  ←── threshold filter (optional)
            ↓
    AI validation       ←── enable with "Use LLM matching" (optional)
            ↓
    Output
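To make Stages 1 and 2 concrete, here is a toy sketch in Python. It is not the tool's implementation: the real pipeline uses semantic vectors stored in a vector database, while this stand-in uses plain word counts and cosine similarity. The function names (`vectorize`, `similarity_score`, `score_pairs`) are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# The five fields named in Stage 1. The bag-of-words vectors below are an
# illustrative stand-in, NOT the embedding model the tool actually uses.
FIELDS = ("title", "brand", "category", "description", "specifications")


def vectorize(product: dict) -> Counter:
    """Turn a product into a sparse word-count vector over the five fields."""
    text = " ".join(str(product.get(field, "")) for field in FIELDS)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity_score(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors, scaled to 0-100."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 100.0 * dot / (norm_a * norm_b)


def score_pairs(dataset_a: list, dataset_b: list, threshold: float = 70) -> list:
    """Score every A x B pair and flag those that meet the threshold."""
    vectors_b = [vectorize(product) for product in dataset_b]
    pairs = []
    for product_a in dataset_a:
        vector_a = vectorize(product_a)
        for product_b, vector_b in zip(dataset_b, vectors_b):
            score = similarity_score(vector_a, vector_b)
            pairs.append({
                "productA": product_a,
                "productB": product_b,
                "similarityScore": score,
                "is_match": score >= threshold,
            })
    return pairs
```

Word counts only match shared tokens, so this sketch cannot reproduce the "different wording" matching that semantic vectors provide. It only shows the shape of the scoring and thresholding steps.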
🚀 Before you start
You need two Apify datasets containing product data. The easiest way to collect them is with the E-Commerce Scraping Tool, which lets you scrape product listings from Amazon, Walmart, eBay, and hundreds of other retailers in a single run.
Once you have your datasets, copy their dataset IDs from Apify Console and paste them into the input fields below.
⚙️ Input
Required
| Parameter | Type | Description |
|---|---|---|
| `datasetIdA` | string | Dataset ID for your first product list (e.g. your own catalog) |
| `datasetIdB` | string | Dataset ID for your second product list (e.g. a competitor's catalog) |
Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `useLlmMatching` | boolean | false | Run AI validation on similarity candidates for higher-confidence results with reasoning explanations |
| `vectorMatchesOnly` | boolean | false | Only include product pairs that meet the similarity threshold in the output. When disabled, all evaluated pairs are returned with their scores |
| `maxOutputItems` | number | unlimited | Stop processing after this many output items. Use this to cap cost on large datasets |
Advanced options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorSimilarityThreshold` | number (0-100) | 70 | Minimum similarity score for a pair to qualify as a match. Lower values return more results with more potential false positives; higher values return fewer, more precise results |
📦 Output
Each output item represents one evaluated product pair. The output always includes the similarity assessment from Stage 2. When LLM matching is enabled, it also includes the AI verdict and reasoning from Stage 3.
📦 Output fields - similarity matching
| Field | Type | Description |
|---|---|---|
| `productA` | object | Product data from Dataset A |
| `productB` | object | Product data from Dataset B |
| `similarityScore` | number | Similarity score from 0 to 100 |
| `is_match` | boolean | Whether the pair meets the similarity threshold |
Additional fields when LLM matching is enabled
| Field | Type | Description |
|---|---|---|
| `llm_is_match` | boolean | AI verdict: true if the model considers this a genuine product match |
| `llm_reasoning` | string | The AI model's explanation of its verdict |
| `llm_relationship` | string | The AI model's classification of the relationship between the two products. Possible values: "same-product", "variant", "different-product" |
| `llm_differences` | array | List of specific differences identified by the AI model between the two products. Empty array when products are identical or near-identical |
Example output - similarity matching only
    {
      "productA": {
        "title": "Apple AirPods Pro (2nd Generation)",
        "price": 249,
        "brand": "Apple",
        "url": "https://www.amazon.com/..."
      },
      "productB": {
        "title": "Apple AirPods Pro 2nd Gen - USB-C",
        "price": 229,
        "brand": "Apple",
        "url": "https://www.walmart.com/..."
      },
      "similarityScore": 94,
      "is_match": true
    }
Example output - with LLM matching enabled
    {
      "productA": {
        "title": "Apple AirPods Pro (2nd Generation)",
        "price": 249,
        "brand": "Apple",
        "url": "https://www.amazon.com/..."
      },
      "productB": {
        "title": "Apple AirPods Pro 2nd Gen - USB-C",
        "price": 229,
        "brand": "Apple",
        "url": "https://www.walmart.com/..."
      },
      "similarityScore": 94,
      "is_match": true,
      "llm_is_match": true,
      "llm_reasoning": "Both products are the Apple AirPods Pro 2nd generation. The title variation reflects the USB-C connector variant, which is the same product sold under a slightly different listing title. Brand, model generation, and key features are identical.",
      "llm_relationship": "same-product",
      "llm_differences": []
    }
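Because the LLM fields are present only when LLM matching is enabled, downstream code has to handle both output shapes. The sketch below shows one way to do that. Letting the AI verdict override the similarity check is a design choice made here, not something the tool mandates, and `final_verdict`, `relationship_counts`, and the `"not-validated"` label are hypothetical names.

```python
from collections import Counter


def final_verdict(item: dict) -> bool:
    """Use the AI verdict when present, otherwise the similarity check."""
    if "llm_is_match" in item:
        return bool(item["llm_is_match"])
    return bool(item["is_match"])


def relationship_counts(items: list) -> Counter:
    """Count pairs per llm_relationship value.

    Items without the field (LLM matching disabled) are grouped under the
    made-up label "not-validated".
    """
    return Counter(item.get("llm_relationship", "not-validated") for item in items)
```

A quick summary such as `relationship_counts(items)` can show how many high-similarity pairs the model classified as "variant" rather than "same-product".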
💼 Use cases
🏷️ Competitive price monitoring
Scrape your own product catalog and a competitor's catalog using the E-Commerce Scraping Tool, then run both datasets through this tool to find where the same products are priced differently. Schedule it to run weekly for ongoing price intelligence.
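As a rough sketch of that workflow, the function below turns matched pairs into a list of price gaps. It assumes each product object carries a numeric `price` field, as in the example output; real scraped data may store prices differently or include currency information that this code ignores. The name `price_gaps` is hypothetical.

```python
def price_gaps(items: list, min_gap: float = 0.0) -> list:
    """Return matched pairs whose prices differ by at least min_gap.

    Rows are sorted by absolute gap, largest first. Pairs with a missing or
    non-numeric price are skipped.
    """
    rows = []
    for item in items:
        # Prefer the AI verdict when it exists, else the similarity check.
        if not item.get("llm_is_match", item.get("is_match")):
            continue
        price_a = item["productA"].get("price")
        price_b = item["productB"].get("price")
        if not isinstance(price_a, (int, float)) or not isinstance(price_b, (int, float)):
            continue
        gap = price_a - price_b
        if abs(gap) >= min_gap:
            rows.append({
                "title": item["productA"].get("title"),
                "priceA": price_a,
                "priceB": price_b,
                "gap": gap,
            })
    return sorted(rows, key=lambda row: abs(row["gap"]), reverse=True)
```

A positive `gap` means the product is more expensive in Dataset A than in Dataset B.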
🗂️ Catalog deduplication
If you manage product feeds from multiple suppliers, run any two feeds through the tool to identify duplicate or near-duplicate listings before merging them into your master catalog.
🛍️ Marketplace comparison
Compare your Amazon listings against your Walmart listings to find products that exist in one place but not the other, or that have mismatched titles, prices, or descriptions across platforms.
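Finding products that exist on one platform but not the other can be sketched as follows. This rests on two assumptions: the run used `vectorMatchesOnly: false`, so every product from Dataset A appears in at least one output pair, and each product has a `url` field that works as a unique identifier, as in the example output. The function name `unmatched_in_a` is hypothetical.

```python
def unmatched_in_a(items: list, key: str = "url") -> list:
    """Return Dataset A products that have no matching pair in the output."""
    seen = {}        # identifier -> product, in first-seen order
    matched = set()  # identifiers with at least one matching pair
    for item in items:
        product = item["productA"]
        identifier = product.get(key)
        seen.setdefault(identifier, product)
        if item.get("is_match"):
            matched.add(identifier)
    return [product for identifier, product in seen.items() if identifier not in matched]
```

Swapping `productA` for `productB` in the function body gives the reverse view: products in Dataset B with no counterpart in Dataset A.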
🔄 Product feed alignment
Reconcile an internal product database against an external feed (a distributor, a retailer, or a data provider) to verify coverage and spot discrepancies.
💰 Pricing
The tool uses a pay-per-event pricing model - you are charged based on the number of product pairs processed, not for the run itself.
Controlling costs
- Set `maxOutputItems` to cap the number of pairs processed in a single run. The tool stops as soon as the limit is reached, so your cost is fully bounded.
- Use `vectorMatchesOnly: true` to filter early - only pairs that pass the similarity threshold proceed to output (and to LLM validation if enabled), which reduces cost on datasets with low match rates.
- LLM matching adds cost per validated item. Disable it if the similarity score alone gives you sufficient signal for your use case.
- Run a small test with a sample of each dataset to calibrate your similarity threshold before processing the full dataset.
🔗 API integration
JavaScript
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({
        token: '<YOUR_API_TOKEN>',
    });

    const input = {
        datasetIdA: '<YOUR_FIRST_DATASET_ID>',
        datasetIdB: '<YOUR_SECOND_DATASET_ID>',
        useLlmMatching: true,
        vectorMatchesOnly: true,
        vectorSimilarityThreshold: 70,
        maxOutputItems: 1000,
    };

    const run = await client.actor('tri_angle/e-commerce-product-matching-tool').call(input);

    console.log(`Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
Python
    from apify_client import ApifyClient

    client = ApifyClient('<YOUR_API_TOKEN>')

    run_input = {
        'datasetIdA': '<YOUR_FIRST_DATASET_ID>',
        'datasetIdB': '<YOUR_SECOND_DATASET_ID>',
        'useLlmMatching': True,
        'vectorMatchesOnly': True,
        'vectorSimilarityThreshold': 70,
        'maxOutputItems': 1000,
    }

    run = client.actor('tri_angle/e-commerce-product-matching-tool').call(run_input=run_input)

    print('Check your data here: https://console.apify.com/storage/datasets/' + run['defaultDatasetId'])
CLI
    echo '{
      "datasetIdA": "<YOUR_FIRST_DATASET_ID>",
      "datasetIdB": "<YOUR_SECOND_DATASET_ID>",
      "useLlmMatching": true,
      "vectorMatchesOnly": true,
      "vectorSimilarityThreshold": 70,
      "maxOutputItems": 1000
    }' |
    apify call tri_angle/e-commerce-product-matching-tool --input-file - --silent --output-dataset
🚀 Getting started
- Collect two product datasets - use the E-Commerce Scraping Tool or any Apify scraper that returns product data
- Find each dataset's ID in Apify Console under Storage > Datasets
- Open the E-Commerce Product Matching Tool and paste both dataset IDs into the input
- Choose your options: enable LLM matching for higher confidence, or keep it off for faster, lower-cost results
- Click Start and wait for results - the run time depends on dataset size and whether LLM matching is enabled
- Download your output as JSON, CSV, or Excel, or connect it to your data pipeline via the API
❓ FAQ
Do I have to use E-Commerce Scraping Tool to collect the data?
No. Any Apify dataset that contains product data works as input. E-Commerce Scraping Tool output is natively compatible, but you can use data from any source as long as it's stored in an Apify dataset.
How accurate is the matching?
Similarity matching works well for products with consistent names, brands, or standard identifiers (like EAN or UPC). For products with ambiguous or highly variable descriptions, enable LLM matching - the AI model reads the full product context and provides a verdict with reasoning, which significantly improves accuracy.
What similarity threshold should I use?
The default of 70 is a good starting point for most cases. Lower it (e.g. 50-60) if you want more results and are willing to review some false positives. Raise it (e.g. 85-90) if you want only very high-confidence matches. Test with a small dataset sample first.
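One way to run that small test is to hand-label a sample of scored pairs and see how precision and recall move as the threshold changes. The sketch below assumes you have produced such labels yourself as `(similarityScore, is_true_match)` tuples. The tool does not provide them, and `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(labeled_pairs: list, thresholds=(50, 60, 70, 80, 90)) -> list:
    """Report precision and recall for each candidate threshold.

    labeled_pairs holds (similarity_score, is_true_match) tuples from a
    manually reviewed sample.
    """
    total_true = sum(1 for _, is_true in labeled_pairs if is_true)
    rows = []
    for threshold in thresholds:
        # Labels of the pairs that would be kept at this threshold.
        kept = [is_true for score, is_true in labeled_pairs if score >= threshold]
        true_positives = sum(kept)
        rows.append({
            "threshold": threshold,
            "kept": len(kept),
            "precision": true_positives / len(kept) if kept else None,
            "recall": true_positives / total_true if total_true else None,
        })
    return rows
```

Pick the lowest threshold whose precision is acceptable for your use case. Lower thresholds keep more true matches at the price of more manual review.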
Can I match more than two datasets at once?
Not in a single run. To compare three datasets, run the tool once per pair: A vs. B and B vs. C, plus A vs. C if you need every pairwise comparison. Each run produces a separate output dataset.
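When every pairwise comparison is needed, the run inputs can be generated rather than written by hand. The sketch below builds one input per unordered pair of dataset IDs. `pairwise_inputs` is a hypothetical helper, and the option names passed through it are the input parameters documented above.

```python
from itertools import combinations


def pairwise_inputs(dataset_ids: list, **options) -> list:
    """Build one run input per unordered pair of dataset IDs."""
    return [
        {"datasetIdA": id_a, "datasetIdB": id_b, **options}
        for id_a, id_b in combinations(dataset_ids, 2)
    ]
```

Three datasets yield three inputs (A-B, A-C, B-C), and each would be started as its own run with its own output dataset.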
How do I control costs on large datasets?
Set maxOutputItems to a number that fits your budget. The tool stops processing as soon as that limit is reached, so your cost is fully bounded. You can also use vectorMatchesOnly: true to skip outputting low-similarity pairs, which reduces the number of items LLM matching needs to process.
Can I schedule this to run automatically?
Yes. Use Apify's built-in scheduler to run the tool on a recurring basis - daily, weekly, or at any custom interval. Combine it with the E-Commerce Scraping Tool on a matching schedule to keep your match data up to date automatically.
What export formats are available?
Output is available as JSON, CSV, Excel, XML, and HTML. You can also connect directly to the output dataset via the Apify API or integrate with tools like Google Sheets, Zapier, n8n, and others.