Advanced Product Matcher Pro
Pricing
$0.10 / 1,000 results
Last modified
a day ago
AI Product Matcher Actor
A powerful Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication and inventory reconciliation.
Features
- Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
- Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
- Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
- Configurable Attributes: Weight different product attributes based on importance
- Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
- Performance Optimization: Group products by categories or other attributes for faster processing
- Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
- Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
- Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
Quick Start
Basic Configuration Example
{"dataFormat": "csv","dataSource": "datasets","dataset1": "catalog_products","dataset1Name": "Catalog","dataset1PrimaryKey": "ProductId","dataset2": "retailer_products","dataset2Name": "Retailer","dataset2PrimaryKey": "ProductId","threshold": 0.7,"maxMatches": 2,"language": "en","groupByAttribute": "category","csvSeparator": ",","includeOriginalValues": true,"attributes": [{"name": "title","weight": 1.0,"useForMatching": true},{"name": "brand","weight": 0.8,"useForMatching": true},{"name": "price","weight": 0.3,"useForMatching": false}]}
Core Input Parameters
Parameter | Type | Description | Default |
---|---|---|---|
dataFormat | string | Data format: "csv" or "json" | "json" |
dataSource | string | Source type: "datasets" or "keyvaluestore" | "datasets" |
keyValuestoreNameOrId | string | Name or ID of KeyValueStore (if dataSource is "keyvaluestore") | none |
dataset1 | string | First dataset key/ID (CSV filename or Dataset ID) | required |
dataset1Name | string | Friendly name for dataset 1 | "Dataset1" |
dataset1PrimaryKey | string | Primary key field name in dataset 1 | "ProductId" |
dataset2 | string | Second dataset key/ID | required |
dataset2Name | string | Friendly name for dataset 2 | "Dataset2" |
dataset2PrimaryKey | string | Primary key field name in dataset 2 | "ProductId" |
threshold | number | Minimum overall similarity score for matches (0.0–1.0) | 0.5 |
maxMatches | integer | Maximum number of matches returned per item | 2 |
language | string | Embedding model selection: "en", "multilingual", "es", "fr", "de", "it", "pt", "nl" | "en" |
groupByAttribute | string | Attribute name to group by for efficient matching (optional) | none |
csvSeparator | string | CSV delimiter (only when dataFormat is "csv") | "," |
includeOriginalValues | boolean | Include original attribute values in the output records | true |
dataset1OutputFields | array | Include specific attribute values in the output records from dataset 1 | ["Field1"] |
dataset2OutputFields | array | Include specific attribute values in the output records from dataset 2 | ["Field1", "Field2"] |
attributes | array | Required. List of attribute configurations (see below) | required |
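Checking a configuration against the table above before starting a run can catch type and spelling mistakes early. The following is a minimal sketch, not the Actor's own validation: the function name validate_input is hypothetical, and two of its rules are assumptions (that keyValuestoreNameOrId is needed whenever dataSource is "keyvaluestore", and that maxMatches must be at least 1).

```python
ALLOWED_FORMATS = {"csv", "json"}
ALLOWED_SOURCES = {"datasets", "keyvaluestore"}
ALLOWED_LANGUAGES = {"en", "multilingual", "es", "fr", "de", "it", "pt", "nl"}


def validate_input(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Parameters marked "required" in the table.
    for key in ("dataset1", "dataset2", "attributes"):
        if not config.get(key):
            problems.append(f"missing required parameter: {key}")
    if config.get("dataFormat", "json") not in ALLOWED_FORMATS:
        problems.append("dataFormat must be 'csv' or 'json'")
    if config.get("dataSource", "datasets") not in ALLOWED_SOURCES:
        problems.append("dataSource must be 'datasets' or 'keyvaluestore'")
    # Assumption: the store name or ID is mandatory for keyvaluestore sources.
    if config.get("dataSource") == "keyvaluestore" and not config.get("keyValuestoreNameOrId"):
        problems.append("keyValuestoreNameOrId is needed when dataSource is 'keyvaluestore'")
    threshold = config.get("threshold", 0.5)
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        problems.append("threshold must be a number, not a string")
    elif not 0.0 <= threshold <= 1.0:
        problems.append("threshold must be between 0.0 and 1.0")
    # Assumption: maxMatches below 1 is never meaningful.
    max_matches = config.get("maxMatches", 2)
    if isinstance(max_matches, bool) or not isinstance(max_matches, int) or max_matches < 1:
        problems.append("maxMatches must be a positive integer")
    if config.get("language", "en") not in ALLOWED_LANGUAGES:
        problems.append("language is not one of the supported values")
    return problems
```

A configuration with "dataSource": "dataset" and "threshold": "0.6" would produce two problems here: one for the misspelled source type and one for the quoted number.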
Attribute Configuration
Each attribute in attributes supports:
- name (string, required): Column name (CSV) or attribute key (JSON)
- weight (number): Importance weight for matching (higher = more important)
- useForMatching (boolean): Whether to include in similarity calculation
- jsonPath (string): JSON path expression for nested data
- wordsToRemove (array): List of words to strip before matching
- wordReplacements (object): Mapping of terms to replace prior to matching
- regex (string): Regex to apply during preprocessing
- normalizationRegex (string): Regex applied before similarity calculation
- normalizationReplacement (string): Replacement for normalization regex
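This README does not spell out how weight and useForMatching combine into the overall score. One plausible reading is a weighted average of per-attribute cosine similarities, sketched below under that assumption. The function names, and the defaults of 1.0 for weight and true for useForMatching, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def overall_similarity(per_attribute, attributes):
    """Assumed formula: weighted average over attributes with useForMatching true."""
    used = [a for a in attributes if a.get("useForMatching", True)]
    total_weight = sum(a.get("weight", 1.0) for a in used)
    if total_weight == 0:
        return 0.0
    weighted = sum(a.get("weight", 1.0) * per_attribute[a["name"]] for a in used)
    return weighted / total_weight
```

Under this reading, an attribute with useForMatching set to false (such as price in the Quick Start example) contributes nothing to the score regardless of its weight.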
Text Preprocessing Example
{"name": "brand","weight": 0.8,"useForMatching": true,"wordsToRemove": ["inc", "llc", "ltd", "corp"],"wordReplacements": {"apple": "apple inc","samsung": "samsung electronics"},"regex": "\\b(inc|llc|ltd|corp)\\b","normalizationRegex": "[^a-zA-Z0-9\\s]","normalizationReplacement": ""}
Property | Type | Description |
---|---|---|
wordsToRemove | array | Words to remove from text |
wordReplacements | object | Word substitution mapping |
regex | string | Regex pattern for text cleaning |
normalizationRegex | string | Regex for similarity calculation normalization |
normalizationReplacement | string | Replacement for normalization regex |
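To see how these properties interact, it can help to run them locally on a few sample values. The sketch below is an approximation, not the Actor's code. It assumes lowercasing first, then wordReplacements, wordsToRemove, regex (matches removed), and finally normalizationRegex, and it assumes whole-word matching. The actual order and matching rules may differ.

```python
import re

# The brand attribute rules from the example above, written as a Python dict.
BRAND_RULES = {
    "wordsToRemove": ["inc", "llc", "ltd", "corp"],
    "wordReplacements": {"apple": "apple inc", "samsung": "samsung electronics"},
    "regex": r"\b(inc|llc|ltd|corp)\b",
    "normalizationRegex": r"[^a-zA-Z0-9\s]",
    "normalizationReplacement": "",
}


def preprocess(text: str, attr: dict) -> str:
    """Apply one attribute's text rules in the assumed order described above."""
    value = text.lower()
    # 1. wordReplacements: substitute whole words only.
    for old, new in attr.get("wordReplacements", {}).items():
        pattern = rf"(?<!\w){re.escape(old.lower())}(?!\w)"
        value = re.sub(pattern, lambda _m: new, value)
    # 2. wordsToRemove: delete whole words.
    for word in attr.get("wordsToRemove", []):
        value = re.sub(rf"(?<!\w){re.escape(word.lower())}(?!\w)", "", value)
    # 3. regex: assumed to delete whatever it matches.
    if attr.get("regex"):
        value = re.sub(attr["regex"], "", value)
    # 4. normalizationRegex: replace matches with normalizationReplacement.
    if attr.get("normalizationRegex"):
        value = re.sub(attr["normalizationRegex"], attr.get("normalizationReplacement", ""), value)
    # Collapse the whitespace left behind by removals.
    return " ".join(value.split())
```

With these rules and this order, "Apple, Inc." and "APPLE" both reduce to "apple". The replacement "apple" to "apple inc" is undone by the later removal of "inc", so rule order is worth validating on sample data.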
Real-World Examples
1. E-commerce Catalog Matching
{"dataFormat": "csv","dataSource": "datasets","dataset1": "manufacturer_catalog.csv","dataset1Name": "Manufacturer","dataset1PrimaryKey": "ProductId","dataset2": "retailer_inventory.csv","dataset2Name": "Retailer","dataset2PrimaryKey": "ProductId","threshold": 0.75,"maxMatches": 3,"language": "en","groupByAttribute": "category","attributes": [{"name": "product_name","weight": 1.5,"useForMatching": true,"wordsToRemove": ["new", "original", "authentic"],"wordReplacements": {"&": "and", "w/": "with"}},{"name": "brand","weight": 1.2,"useForMatching": true,"wordsToRemove": ["inc", "llc", "corp"],"wordReplacements": {"apple": "apple inc", "hp": "hewlett packard"}},{"name": "model_number","weight": 1.8,"useForMatching": true,"normalizationRegex": "[^A-Za-z0-9]","normalizationReplacement": ""},{"name": "price","weight": 0.3,"useForMatching": false,"regex": "\\D"}]}
2. Fashion Product Matching with Complex JSON
Matching fashion products from different suppliers with nested JSON data:
{"dataFormat": "json","dataSource": "datasets","dataset1": "fashion_supplier_a","dataset1Name": "SupplierA","dataset1PrimaryKey": "ID","dataset2": "fashion_supplier_b","dataset2Name": "SupplierB","dataset2PrimaryKey": "ID","threshold": 0.65,"language": "multilingual","maxMatches": 2,"attributes": [{"name": "Color","jsonPath": "ProductAttributes[Type=Color].Value","weight": 1.5,"useForMatching": true,"wordReplacements": {"gray": "grey", "navy": "navy blue"}},{"name": "Size","jsonPath": "ProductAttributes[Type=Size].Value","weight": 1.8,"useForMatching": true,"wordsToRemove": ["size", "us", "eu"],"normalizationRegex": "[^0-9XLS]","normalizationReplacement": ""},{"name": "Material","jsonPath": "Details.Fabric.Primary","weight": 1.2,"useForMatching": true}],"includeOriginalValues": false}
3. Home & Garden Products
{"dataFormat": "json","dataSource": "datasets","dataset1": "bedbath","dataset1Name": "BedBath","dataset1PrimaryKey": "ProductId","dataset2": "overstock","dataset2Name": "Overstock","dataset2PrimaryKey": "ProductId","threshold": 0.6,"language": "en","csvSeparator": ",","groupByAttribute": "Model","maxMatches": 3,"attributes": [{"name": "Model","jsonPath": "AdhocDataAttributes[Name=Model].value","weight": 1,"useForMatching": false},{"name": "Color","jsonPath": "AdhocDataAttributes[Name=Color].value","weight": 2,"useForMatching": true,"wordReplacements": {"gray": "grey","/": " "}},{"name": "Size","jsonPath": "AdhocDataAttributes[Name=Size].value","weight": 3,"useForMatching": true,"regex": "\\D"},{"name": "Shape","jsonPath": "AdhocDataAttributes[Name=Shape].value","weight": 1,"useForMatching": true}],"dataset1OutputFields": ["Address","ProductName"]}
Advanced Configuration
JSON Path Expressions
- Dot notation: "product.details.name"
- Array search: "Attributes[Name=Color].Value"
- Nested arrays/objects for complex structures
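The two path forms above can be tested locally before a run. The resolver below is a small sketch of one way to interpret them. It is not the Actor's parser, the name resolve_path is hypothetical, and it assumes that filter values contain no dots or brackets.

```python
import re

# One path segment: a key, optionally followed by a [Field=Value] filter.
_SEGMENT = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<field>[^=\]]+)=(?P<value>[^\]]*)\])?$")


def resolve_path(record, path):
    """Return the value at `path`, or None if any segment is missing."""
    current = record
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None or not isinstance(current, dict):
            return None
        current = current.get(match["key"])
        if match["field"] is not None:
            # Array search: pick the first item whose field equals the value.
            if not isinstance(current, list):
                return None
            current = next(
                (item for item in current
                 if isinstance(item, dict) and str(item.get(match["field"])) == match["value"]),
                None,
            )
        if current is None:
            return None
    return current


sample = {
    "ProductAttributes": [{"Type": "Color", "Value": "Red"}, {"Type": "Size", "Value": "Large"}],
    "Details": {"Pricing": {"MSRP": 29.99}},
}
print(resolve_path(sample, "ProductAttributes[Type=Color].Value"))  # Red
print(resolve_path(sample, "Details.Pricing.MSRP"))  # 29.99
```

Returning None for a missing segment makes it easy to count how many records in a sample lack a given attribute, which is a common cause of empty match results.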
Complex Nested Structures
{"ProductAttributes": [{"Type": "Color", "Value": "Red"},{"Type": "Size", "Value": "Large"},{"Type": "Material", "Value": "Cotton"}],"Details": {"Pricing": {"MSRP": 29.99, "Sale": 19.99},"Specifications": {"Weight": "2.5 lbs"}}}
Corresponding JSON paths:
- Color: "ProductAttributes[Type=Color].Value"
- Size: "ProductAttributes[Type=Size].Value"
- MSRP: "Details.Pricing.MSRP"
- Weight: "Details.Specifications.Weight"
Regular Expression Patterns
- Size cleaning: remove non-digits {"regex": "\\D"}
- Model normalization: keep alphanumeric {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
- Price extraction: strip currency symbols {"regex": "[^0-9.]"}
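The effect of each pattern is easy to check with Python's re module. This sketch assumes that text matched by regex is deleted, which the README implies but does not state. The helper names are hypothetical.

```python
import re


def strip_non_digits(text: str) -> str:
    """Size cleaning: delete every character that is not a digit."""
    return re.sub(r"\D", "", text)


def keep_alphanumeric(text: str) -> str:
    """Model normalization: delete everything except ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", text)


def strip_currency(text: str) -> str:
    """Price extraction: delete everything except digits and periods."""
    return re.sub(r"[^0-9.]", "", text)


print(strip_non_digits("Size 42 EU"))    # 42
print(keep_alphanumeric("SM-G991B/DS"))  # SMG991BDS
print(strip_currency("$1,299.99"))       # 1299.99
```

Note that the price pattern keeps every period, so a value such as "Rs. 500" becomes ".500". Prices with abbreviations or European decimal commas may need a more specific pattern.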
Size Normalization
{"name": "size","regex": "\\D","normalizationRegex": "[^0-9XLS]","normalizationReplacement": ""}
- regex: Removes all non-digit characters during preprocessing
- normalizationRegex: For similarity calculation, keeps only digits and X, L, S
Model Number Cleaning
{"name": "model","regex": "\\b(model|version|v\\d+)\\b","normalizationRegex": "[^a-zA-Z0-9]","normalizationReplacement": ""}
- Removes common model prefixes
- Normalizes to alphanumeric only for comparison
Price Extraction
{"name": "price","regex": "[^0-9.]","normalizationRegex": "\\$|,","normalizationReplacement": ""}
- Extracts numeric price values
- Removes currency symbols and commas
Brand Standardization
{"name": "brand","regex": "\\b(inc|llc|ltd|corp|company)\\b","wordReplacements": {"apple": "apple inc","hp": "hewlett packard","ms": "microsoft"}}
Performance Optimization
- Grouping by attribute reduces N×M comparisons to subsets
- Note: If the group-by field lives in nested JSON, make sure it is also listed in attributes
- Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
- Limit maxMatches for large catalogs
- Disable matching (useForMatching: false) on grouping fields
Grouping Strategy
Use groupByAttribute to partition products into smaller groups:
{"groupByAttribute": "category","attributes": [{"name": "category","weight": 0.5,"useForMatching": false}]}
Benefits:
- Reduces comparison matrix size from N×M to smaller subsets
- Improves processing speed significantly for large datasets
- More accurate matches within similar product categories
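The size of the saving depends on how evenly products spread across groups. The sketch below estimates it for a sample, assuming that only products sharing the same group value are compared with each other. The function name comparison_counts is hypothetical.

```python
from collections import Counter


def comparison_counts(products1, products2, group_key):
    """Return (pairs without grouping, pairs when only same-group items are compared)."""
    full = len(products1) * len(products2)
    groups1 = Counter(p.get(group_key) for p in products1)
    groups2 = Counter(p.get(group_key) for p in products2)
    # Counter returns 0 for groups that appear in only one dataset.
    grouped = sum(count * groups2[group] for group, count in groups1.items())
    return full, grouped


catalog = [{"category": "tv"}] * 3 + [{"category": "phone"}] * 2
retailer = [{"category": "tv"}] * 2 + [{"category": "phone"}] * 4 + [{"category": "laptop"}]
print(comparison_counts(catalog, retailer, "category"))  # (35, 14)
```

A group value that appears in only one dataset contributes no comparisons, so inconsistent category labels between the two sources can silently hide matches.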
Language Model Selection
Choose appropriate models based on your data:
- English: "en" - Fastest, best for English-only data
- Multilingual: "multilingual" - Slower but handles mixed languages
- Specific Languages: "es", "fr", "de" - Optimized for specific languages
Output Format
The Actor generates matches with the following structure:
{"Dataset1ProductId": "PROD123","Dataset2ProductId": "SKU456","overallSimilarity": 0.85,"titleSimilarity": 0.92,"brandSimilarity": 1.0,"colorSimilarity": 0.75,"Dataset1Title": "Apple iPhone 13 Pro","Dataset2Title": "iPhone 13 Pro - Apple","Dataset1Brand": "Apple","Dataset2Brand": "Apple Inc"}
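When maxMatches is greater than 1, each product can appear in several output records. A small post-processing step, sketched here, keeps only the highest-scoring record per product. It assumes records shaped like the example above. The function name best_matches is hypothetical, and the ID field name will differ if dataset1Name or dataset1PrimaryKey are changed.

```python
def best_matches(records, id_field="Dataset1ProductId"):
    """Keep the record with the highest overallSimilarity for each dataset 1 product."""
    best = {}
    for record in records:
        key = record[id_field]
        if key not in best or record["overallSimilarity"] > best[key]["overallSimilarity"]:
            best[key] = record
    return best


records = [
    {"Dataset1ProductId": "PROD123", "Dataset2ProductId": "SKU456", "overallSimilarity": 0.85},
    {"Dataset1ProductId": "PROD123", "Dataset2ProductId": "SKU789", "overallSimilarity": 0.70},
    {"Dataset1ProductId": "PROD999", "Dataset2ProductId": "SKU111", "overallSimilarity": 0.66},
]
print(best_matches(records)["PROD123"]["Dataset2ProductId"])  # SKU456
```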
Reading the SUMMARY
After execution, a SUMMARY record is saved to KeyValueStore containing:
- Total products per dataset
- Number of matches and unique matches
- Match rate
- Model and data format used
- Any collected errors with type, code, message, and suggestions
Review this summary to diagnose configuration or data issues quickly.
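The exact layout of the SUMMARY record is not documented here, so the following sketch makes an assumption: that the collected errors sit in a list under a key named errors. That key name is hypothetical, as is the function name format_errors. Only the per-error fields type, code, message, and suggestions come from the description above.

```python
def format_errors(summary: dict) -> list[str]:
    """Turn the collected errors into printable lines."""
    lines = []
    # Assumption: errors are stored as a list under the "errors" key.
    for error in summary.get("errors", []):
        code = error.get("code", "?")
        kind = error.get("type", "Error")
        lines.append(f"[{code}] {kind}: {error.get('message', '')}")
        for suggestion in error.get("suggestions", []):
            lines.append(f"  suggestion: {suggestion}")
    return lines


summary = {
    "errors": [
        {
            "type": "DataLoadingError",
            "code": "PME-200",
            "message": "file not found",
            "suggestions": ["check the dataset key"],
        }
    ]
}
for line in format_errors(summary):
    print(line)
```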
Best Practices
- Attribute Weighting:
  - High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
  - Medium Weight (0.8-1.2): Important descriptors (brand, title)
  - Low Weight (0.3-0.7): Secondary attributes (color, price)
- Threshold Selection:
  - High Precision (0.8-0.9): Few false positives, may miss some matches
  - Balanced (0.6-0.8): Good balance of precision and recall
  - High Recall (0.4-0.6): Catches more matches, requires manual review
- Text Preprocessing:
  - Start with simple wordReplacements
  - Add regex for cleaning patterns
  - Use normalizationRegex only for similarity calculation
  - Validate on sample data
- Scaling to Large Datasets:
  - Always use groupByAttribute when > 10,000 items
  - Adjust maxMatches and disable output of original values to reduce output dataset size
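One practical way to choose a threshold is to run once with a low value and then count how many results would survive stricter cutoffs. The sketch below does that for a list of overallSimilarity scores. It assumes a match is kept when its score is greater than or equal to the threshold. The function name sweep_thresholds is hypothetical.

```python
def sweep_thresholds(scores, thresholds=(0.4, 0.6, 0.8)):
    """Count how many scores would pass each candidate threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


sample_scores = [0.95, 0.82, 0.71, 0.64, 0.58, 0.45, 0.31]
print(sweep_thresholds(sample_scores))  # {0.4: 6, 0.6: 4, 0.8: 2}
```

Manually reviewing the records that fall between two candidate thresholds gives a quick sense of where false positives begin for a particular catalog.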
Troubleshooting & Error Handling
Common Issues
- No matches found
  - Lower the threshold value
  - Verify attribute names and JSON paths
  - Adjust text preprocessing rules
- Too many false positives
  - Increase threshold to 0.8–0.9
  - Add stricter wordsToRemove or regex
  - Increase weights for unique identifiers
- Performance bottlenecks
  - Enable groupByAttribute for large datasets
  - Use the English model for English-only data
  - Reduce maxMatches
Error Types
This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.
Error Class | Code | Description |
---|---|---|
InputValidationError | PME-100 | Schema or type validation failed for actor input |
DataLoadingError | PME-200 | CSV/JSON file not found, unreadable, or unparseable |
AttributeConfigError | PME-300 | Issues in the attributes section (missing columns, bad JSON paths, invalid weights) |
ModelLoadingError | PME-400 | Sentence-Transformer model fetch or cache failure |
ProcessingError | PME-500 | Failures during matching workflow (e.g., zero vectors, similarity computation errors) |