Advanced Product Matcher Pro

Developed by Whisperers. Maintained by Community.
Pricing: $0.10 / 1,000 results
Last modified: a day ago

AI Product Matcher Actor

A powerful Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

Features

  • Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
  • Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
  • Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
  • Configurable Attributes: Weight different product attributes based on importance
  • Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
  • Performance Optimization: Group products by categories or other attributes for faster processing
  • Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
  • Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
  • Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
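The embedding-and-cosine-similarity approach in the feature list can be pictured with a minimal, self-contained sketch. Toy vectors stand in for real sentence embeddings (a real run would obtain them from a Sentence Transformers model such as all-MiniLM-L6-v2); this illustrates the scoring technique, not the Actor's internal code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        # Guard against zero vectors (cf. ProcessingError PME-500 below)
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two similar product titles
v1 = [0.9, 0.1, 0.3]
v2 = [0.8, 0.2, 0.4]
print(round(cosine_similarity(v1, v2), 3))
```

Scores close to 1.0 indicate near-identical meaning; the threshold parameter below decides how high a score must be to count as a match.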

Quick Start

Basic Configuration Example

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "catalog_products",
  "dataset1Name": "Catalog",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_products",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.7,
  "maxMatches": 2,
  "language": "en",
  "groupByAttribute": "category",
  "csvSeparator": ",",
  "includeOriginalValues": true,
  "attributes": [
    {
      "name": "title",
      "weight": 1.0,
      "useForMatching": true
    },
    {
      "name": "brand",
      "weight": 0.8,
      "useForMatching": true
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false
    }
  ]
}

Core Input Parameters

  • dataFormat (string, default "json") — Data format: "csv" or "json"
  • dataSource (string, default "datasets") — Source type: "datasets" or "keyvaluestore"
  • keyValuestoreNameOrId (string, optional) — Name or ID of the KeyValueStore (required when dataSource is "keyvaluestore")
  • dataset1 (string, required) — First dataset key/ID (CSV filename or Dataset ID)
  • dataset1Name (string, default "Dataset1") — Friendly name for dataset 1
  • dataset1PrimaryKey (string, default "ProductId") — Primary key field name in dataset 1
  • dataset2 (string, required) — Second dataset key/ID
  • dataset2Name (string, default "Dataset2") — Friendly name for dataset 2
  • dataset2PrimaryKey (string, default "ProductId") — Primary key field name in dataset 2
  • threshold (number, default 0.5) — Minimum overall similarity score for matches (0.0–1.0)
  • maxMatches (integer, default 2) — Maximum number of matches returned per item
  • language (string, default "en") — Embedding model selection: "en", "multilingual", "es", "fr", "de", "it", "pt", "nl"
  • groupByAttribute (string, optional) — Attribute name to group by for efficient matching
  • csvSeparator (string, default ",") — CSV delimiter (only when dataFormat is "csv")
  • includeOriginalValues (boolean, default true) — Include original attribute values in the output records
  • dataset1OutputFields (array, default ["Field1"]) — Specific attribute values from dataset 1 to include in the output records
  • dataset2OutputFields (array, default ["Field1", "Field2"]) — Specific attribute values from dataset 2 to include in the output records
  • attributes (array, required) — List of attribute configurations (see below)

Attribute Configuration

Each attribute in attributes supports:

  • name (string, required) — Column name (CSV) or attribute key (JSON)
  • weight (number) — Importance weight for matching (higher = more important)
  • useForMatching (boolean) — Whether to include in similarity calculation
  • jsonPath (string) — JSON path expression for nested data
  • wordsToRemove (array) — List of words to strip before matching
  • wordReplacements (object) — Mapping of terms to replace prior to matching
  • regex (string) — Regex to apply during preprocessing
  • normalizationRegex (string) — Regex applied before similarity calculation
  • normalizationReplacement (string) — Replacement for normalization regex

Text Preprocessing example

{
  "name": "brand",
  "weight": 0.8,
  "useForMatching": true,
  "wordsToRemove": ["inc", "llc", "ltd", "corp"],
  "wordReplacements": {
    "apple": "apple inc",
    "samsung": "samsung electronics"
  },
  "regex": "\\b(inc|llc|ltd|corp)\\b",
  "normalizationRegex": "[^a-zA-Z0-9\\s]",
  "normalizationReplacement": ""
}

  • wordsToRemove (array) — Words to remove from text
  • wordReplacements (object) — Word substitution mapping
  • regex (string) — Regex pattern for text cleaning
  • normalizationRegex (string) — Regex for similarity-calculation normalization
  • normalizationReplacement (string) — Replacement for the normalization regex
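The preprocessing options can be pictured as a pipeline. The sketch below is illustrative only; the exact order in which the Actor applies replacements, removals, and regex cleaning is an assumption:

```python
import re

def preprocess(text, words_to_remove=None, word_replacements=None, regex=None):
    """Illustrative preprocessing pipeline (assumed order:
    lowercase -> replacements -> word removal -> regex cleaning)."""
    t = text.lower()
    for old, new in (word_replacements or {}).items():
        t = t.replace(old, new)
    if words_to_remove:
        pattern = r"\b(" + "|".join(map(re.escape, words_to_remove)) + r")\b"
        t = re.sub(pattern, "", t)
    if regex:
        t = re.sub(regex, "", t)
    # Collapse whitespace left behind by the removals
    return re.sub(r"\s+", " ", t).strip()

print(preprocess(
    "Apple Inc. MacBook Pro",
    words_to_remove=["inc"],
    regex=r"[^a-z0-9\s]",
))  # → "apple macbook pro"
```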

Real-World Examples

1. E-commerce Catalog Matching

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "manufacturer_catalog.csv",
  "dataset1Name": "Manufacturer",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_inventory.csv",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.75,
  "maxMatches": 3,
  "language": "en",
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "product_name",
      "weight": 1.5,
      "useForMatching": true,
      "wordsToRemove": ["new", "original", "authentic"],
      "wordReplacements": {"&": "and", "w/": "with"}
    },
    {
      "name": "brand",
      "weight": 1.2,
      "useForMatching": true,
      "wordsToRemove": ["inc", "llc", "corp"],
      "wordReplacements": {"apple": "apple inc", "hp": "hewlett packard"}
    },
    {
      "name": "model_number",
      "weight": 1.8,
      "useForMatching": true,
      "normalizationRegex": "[^A-Za-z0-9]",
      "normalizationReplacement": ""
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false,
      "regex": "\\D"
    }
  ]
}

2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "fashion_supplier_a",
  "dataset1Name": "SupplierA",
  "dataset1PrimaryKey": "ID",
  "dataset2": "fashion_supplier_b",
  "dataset2Name": "SupplierB",
  "dataset2PrimaryKey": "ID",
  "threshold": 0.65,
  "language": "multilingual",
  "maxMatches": 2,
  "attributes": [
    {
      "name": "Color",
      "jsonPath": "ProductAttributes[Type=Color].Value",
      "weight": 1.5,
      "useForMatching": true,
      "wordReplacements": {"gray": "grey", "navy": "navy blue"}
    },
    {
      "name": "Size",
      "jsonPath": "ProductAttributes[Type=Size].Value",
      "weight": 1.8,
      "useForMatching": true,
      "wordsToRemove": ["size", "us", "eu"],
      "normalizationRegex": "[^0-9XLS]",
      "normalizationReplacement": ""
    },
    {
      "name": "Material",
      "jsonPath": "Details.Fabric.Primary",
      "weight": 1.2,
      "useForMatching": true
    }
  ],
  "includeOriginalValues": false
}

3. Home & Garden Products

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "bedbath",
  "dataset1Name": "BedBath",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "overstock",
  "dataset2Name": "Overstock",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.6,
  "language": "en",
  "groupByAttribute": "Model",
  "maxMatches": 3,
  "attributes": [
    {
      "name": "Model",
      "jsonPath": "AdhocDataAttributes[Name=Model].value",
      "weight": 1,
      "useForMatching": false
    },
    {
      "name": "Color",
      "jsonPath": "AdhocDataAttributes[Name=Color].value",
      "weight": 2,
      "useForMatching": true,
      "wordReplacements": {
        "gray": "grey",
        "/": " "
      }
    },
    {
      "name": "Size",
      "jsonPath": "AdhocDataAttributes[Name=Size].value",
      "weight": 3,
      "useForMatching": true,
      "regex": "\\D"
    },
    {
      "name": "Shape",
      "jsonPath": "AdhocDataAttributes[Name=Shape].value",
      "weight": 1,
      "useForMatching": true
    }
  ],
  "dataset1OutputFields": [
    "Address",
    "ProductName"
  ]
}

Advanced Configuration

JSON Path Expressions

  • Dot notation: "product.details.name"
  • Array search: "Attributes[Name=Color].Value"
  • Nested arrays/objects for complex structures

Complex Nested Structures

{
  "ProductAttributes": [
    {"Type": "Color", "Value": "Red"},
    {"Type": "Size", "Value": "Large"},
    {"Type": "Material", "Value": "Cotton"}
  ],
  "Details": {
    "Pricing": {"MSRP": 29.99, "Sale": 19.99},
    "Specifications": {"Weight": "2.5 lbs"}
  }
}

Corresponding JSON paths:

  • Color: "ProductAttributes[Type=Color].Value"
  • Size: "ProductAttributes[Type=Size].Value"
  • MSRP: "Details.Pricing.MSRP"
  • Weight: "Details.Specifications.Weight"
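A rough sketch of how such path expressions could be resolved in Python. This is purely illustrative (resolve_json_path is a hypothetical helper, not the Actor's parser), and filter values are limited to single word tokens here:

```python
import re

def resolve_json_path(data, path):
    """Resolve dot paths and [Key=Value] array filters, e.g.
    "ProductAttributes[Type=Color].Value" or "Details.Pricing.MSRP"."""
    current = data
    for part in path.split("."):
        m = re.match(r"(\w+)\[(\w+)=(\w+)\]$", part)
        if m:
            # Array-search segment: find the first element whose key matches
            field, key, value = m.groups()
            items = current.get(field, []) if isinstance(current, dict) else []
            current = next((i for i in items if str(i.get(key)) == value), None)
        elif isinstance(current, dict):
            # Plain dot-notation segment
            current = current.get(part)
        else:
            return None
        if current is None:
            return None
    return current

product = {
    "ProductAttributes": [
        {"Type": "Color", "Value": "Red"},
        {"Type": "Size", "Value": "Large"},
    ],
    "Details": {"Pricing": {"MSRP": 29.99, "Sale": 19.99}},
}
print(resolve_json_path(product, "ProductAttributes[Type=Color].Value"))  # → Red
print(resolve_json_path(product, "Details.Pricing.MSRP"))                 # → 29.99
```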

Regular Expression Patterns

  • Size cleaning: remove non-digits {"regex": "\\D"}
  • Model normalization: keep alphanumeric {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
  • Price extraction: strip currency symbols {"regex": "[^0-9.]"}

Size Normalization

{
  "name": "size",
  "regex": "\\D",
  "normalizationRegex": "[^0-9XLS]",
  "normalizationReplacement": ""
}
  • regex: Removes all non-digit characters during preprocessing
  • normalizationRegex: For similarity calculation, keeps only numbers and X, L, S

Model Number Cleaning

{
  "name": "model",
  "regex": "\\b(model|version|v\\d+)\\b",
  "normalizationRegex": "[^a-zA-Z0-9]",
  "normalizationReplacement": ""
}
  • Removes common model prefixes
  • Normalizes to alphanumeric only for comparison

Price Extraction

{
  "name": "price",
  "regex": "[^0-9.]",
  "normalizationRegex": "\\$|,",
  "normalizationReplacement": ""
}
  • Extracts numeric price values
  • Removes currency symbols and commas
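Applied with Python's re module, the patterns from these examples behave as follows:

```python
import re

# Price extraction: keep only digits and the decimal point
print(re.sub(r"[^0-9.]", "", "$1,299.99"))              # → 1299.99

# Model normalization: keep alphanumeric characters only
print(re.sub(r"[^A-Za-z0-9]", "", "Model: ABC-123/v2"))  # → ModelABC123v2

# Size cleaning: strip everything that is not a digit
print(re.sub(r"\D", "", "Size: 10.5 US"))                # → 105
```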

Brand Standardization

{
  "name": "brand",
  "regex": "\\b(inc|llc|ltd|corp|company)\\b",
  "wordReplacements": {
    "apple": "apple inc",
    "hp": "hewlett packard",
    "ms": "microsoft"
  }
}

Performance Optimization

  • Grouping by an attribute reduces the N×M comparison matrix to smaller per-group subsets
    • Note: if the group-by field lives in nested JSON, make sure it is also listed in attributes (with its jsonPath)
  • Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
  • Limit maxMatches for large catalogs
  • Disable matching (useForMatching: false) on grouping fields

Grouping Strategy

Use groupByAttribute to partition products into smaller groups:

{
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "category",
      "weight": 0.5,
      "useForMatching": false
    }
  ]
}

Benefits:

  • Reduces comparison matrix size from N×M to smaller subsets
  • Improves processing speed significantly for large datasets
  • More accurate matches within similar product categories
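The comparison savings can be sketched as follows. candidate_pairs is an illustrative helper, not the Actor's implementation; without grouping it enumerates the full N×M matrix, with grouping only same-group pairs are compared:

```python
from collections import defaultdict
from itertools import product as cartesian

def candidate_pairs(items1, items2, group_by=None):
    """Yield pairs to compare; with group_by, only same-group pairs are produced."""
    if group_by is None:
        yield from cartesian(items1, items2)  # full N×M matrix
        return
    # Index the second dataset by the grouping attribute
    groups2 = defaultdict(list)
    for item in items2:
        groups2[item.get(group_by)].append(item)
    for item in items1:
        for other in groups2.get(item.get(group_by), []):
            yield (item, other)

catalog = [{"id": 1, "category": "shoes"}, {"id": 2, "category": "bags"}]
inventory = [{"id": "a", "category": "shoes"}, {"id": "b", "category": "bags"},
             {"id": "c", "category": "shoes"}]
print(len(list(candidate_pairs(catalog, inventory))))               # → 6
print(len(list(candidate_pairs(catalog, inventory, "category"))))   # → 3
```

Even in this toy case grouping halves the number of comparisons; on large catalogs with many categories the reduction is far larger.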

Language Model Selection

Choose appropriate models based on your data:

  • English: "en" - Fastest, best for English-only data
  • Multilingual: "multilingual" - Slower but handles mixed languages
  • Specific Languages: "es", "fr", "de" - Optimized for specific languages

Output Format

The Actor generates matches with the following structure:

{
  "Dataset1ProductId": "PROD123",
  "Dataset2ProductId": "SKU456",
  "overallSimilarity": 0.85,
  "titleSimilarity": 0.92,
  "brandSimilarity": 1.0,
  "colorSimilarity": 0.75,
  "Dataset1Title": "Apple iPhone 13 Pro",
  "Dataset2Title": "iPhone 13 Pro - Apple",
  "Dataset1Brand": "Apple",
  "Dataset2Brand": "Apple Inc"
}
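The overallSimilarity field presumably combines the per-attribute scores according to their weights. The sketch below assumes a weighted average; the Actor's exact formula may differ:

```python
def overall_similarity(scores, weights):
    """Weighted average of per-attribute similarity scores.
    Assumption: this mirrors how overallSimilarity is derived from the
    per-attribute fields shown above; the actual formula is not documented."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"title": 0.92, "brand": 1.0, "color": 0.75}
weights = {"title": 1.0, "brand": 0.8, "color": 0.5}
print(round(overall_similarity(scores, weights), 3))  # → 0.911
```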

Reading the SUMMARY

After execution, a SUMMARY record is saved to KeyValueStore containing:

  • Total products per dataset
  • Number of matches and unique matches
  • Match rate
  • Model and data format used
  • Any collected errors with type, code, message, and suggestions

Review this summary to diagnose configuration or data issues quickly.

Best Practices

  • Attribute Weighting:
    • High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
    • Medium Weight (0.8-1.2): Important descriptors (brand, title)
    • Low Weight (0.3-0.7): Secondary attributes (color, price)
  • Threshold Selection:
    • High Precision (0.8-0.9): Few false positives, may miss some matches
    • Balanced (0.6-0.8): Good balance of precision and recall
    • High Recall (0.4-0.6): Catches more matches, requires manual review
  • Text Preprocessing:
  1. Start with simple wordReplacements
  2. Add regex for cleaning patterns
  3. Use normalizationRegex only for similarity calculation
  4. Validate on sample data
  • Scaling to Large Datasets:
    • Always use groupByAttribute when > 10,000 items
    • Adjust maxMatches and disable output of original values to reduce output dataset size

Troubleshooting & Error Handling

Common Issues

  • No matches found
    • Lower the threshold value
    • Verify attribute names and JSON paths
    • Adjust text preprocessing rules
  • Too many false positives
    • Increase threshold to 0.8–0.9
    • Add stricter wordsToRemove or regex
    • Increase weights for unique identifiers
  • Performance bottlenecks
    • Enable groupByAttribute for large datasets
    • Use the English model for English-only data
    • Reduce maxMatches

Error Types

This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.

  • InputValidationError (PME-100) — Schema or type validation failed for Actor input
  • DataLoadingError (PME-200) — CSV/JSON file not found, unreadable, or unparseable
  • AttributeConfigError (PME-300) — Issues in the attributes section (missing columns, bad JSON paths, invalid weights)
  • ModelLoadingError (PME-400) — Sentence Transformer model fetch or cache failure
  • ProcessingError (PME-500) — Failures during the matching workflow (e.g., zero vectors, similarity computation errors)