Advanced Product Matcher Pro

Pricing

$0.10 / 1,000 results

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

Rating

5.0

(1)

Developer

Whisperers

Maintained by Community

Last modified: 4 months ago

AI Product Matcher Actor

Features

Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
Configurable Attributes: Weight different product attributes based on importance
Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
Performance Optimization: Group products by categories or other attributes for faster processing
Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
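
The semantic matching listed above comes down to comparing embedding vectors by cosine similarity. Below is a minimal sketch in plain Python, with tiny hand-written vectors standing in for real Sentence Transformers embeddings. The vectors and the function name `cosine_similarity` are illustrative assumptions, not the Actor's own code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions).
phone_a = [0.9, 0.1, 0.3]
phone_b = [0.8, 0.2, 0.4]
toaster = [0.1, 0.9, 0.0]

# The two phone vectors point in nearly the same direction, the toaster does not.
print(cosine_similarity(phone_a, phone_b) > cosine_similarity(phone_a, toaster))
```

Returning 0.0 for a zero vector is a design choice of this sketch; the error table on this page suggests the Actor itself reports zero vectors as a processing error.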

Quick Start

Basic Configuration Example

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "catalog_products",
  "dataset1Name": "Catalog",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_products",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.7,
  "maxMatches": 2,
  "language": "en",
  "groupByAttribute": "category",
  "csvSeparator": ",",
  "includeOriginalValues": true,
  "attributes": [
    {
      "name": "title",
      "weight": 1.0,
      "useForMatching": true
    },
    {
      "name": "brand",
      "weight": 0.8,
      "useForMatching": true
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false
    }
  ]
}

Core Input Parameters

Parameter	Type	Description	Default
`dataFormat`	string	Data format: `"csv"` or `"json"`	`"json"`
`dataSource`	string	Source type: `"datasets"` or `"keyvaluestore"`	`"datasets"`
`keyValuestoreNameOrId`	string	Name or ID of KeyValueStore (if `dataSource: keyvaluestore`)	none
`dataset1`	string	First dataset key/ID (CSV filename or Dataset ID)	required
`dataset1Name`	string	Friendly name for dataset 1	`"Dataset1"`
`dataset1PrimaryKey`	string	Primary key field name in dataset 1	`"ProductId"`
`dataset2`	string	Second dataset key/ID	required
`dataset2Name`	string	Friendly name for dataset 2	`"Dataset2"`
`dataset2PrimaryKey`	string	Primary key field name in dataset 2	`"ProductId"`
`threshold`	number	Minimum overall similarity score for matches (0.0–1.0)	`0.5`
`maxMatches`	integer	Maximum number of matches returned per item	`2`
`language`	string	Embedding model selection: `"en"`, `"multilingual"`, `"es"`, `"fr"`, `"de"`, `"it"`, `"pt"`, `"nl"`	`"en"`
`groupByAttribute`	string	Attribute name to group by for efficient matching (optional)	none
`csvSeparator`	string	CSV delimiter (only when `dataFormat: csv`)	`","`
`includeOriginalValues`	boolean	Include original attribute values in the output records	`true`
`dataset1OutputFields`	array	Include specific attribute values in the output records from dataset 1	`["Field1"]`
`dataset2OutputFields`	array	Include specific attribute values in the output records from dataset 2	`["Field1", "Field2"]`
`attributes`	array	List of attribute configurations (see Attribute Configuration)	required
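
Checking an input object against the table above before starting a run can catch typos early. The following is a hedged sketch of such a check: the function `validate_input` and its messages are hypothetical, and the Actor's own validation may differ.

```python
VALID_FORMATS = {"csv", "json"}
VALID_SOURCES = {"datasets", "keyvaluestore"}

def validate_input(config):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if config.get("dataFormat", "json") not in VALID_FORMATS:
        problems.append("dataFormat must be 'csv' or 'json'")
    source = config.get("dataSource", "datasets")
    if source not in VALID_SOURCES:
        problems.append("dataSource must be 'datasets' or 'keyvaluestore'")
    if source == "keyvaluestore" and not config.get("keyValuestoreNameOrId"):
        problems.append("keyValuestoreNameOrId is needed for keyvaluestore")
    for key in ("dataset1", "dataset2"):
        if not config.get(key):
            problems.append(f"{key} is required")
    threshold = config.get("threshold", 0.5)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        problems.append("threshold must be a number, not a string")
    elif not 0.0 <= threshold <= 1.0:
        problems.append("threshold must be between 0.0 and 1.0")
    max_matches = config.get("maxMatches", 2)
    if isinstance(max_matches, bool) or not isinstance(max_matches, int) or max_matches < 1:
        problems.append("maxMatches must be a positive integer")
    attributes = config.get("attributes")
    if not isinstance(attributes, list) or not attributes:
        problems.append("attributes must be a non-empty list")
    return problems

bad = {
    "dataSource": "dataset",   # typo: should be "datasets"
    "dataset1": "a",
    "dataset2": "b",
    "threshold": "0.6",        # string instead of number
    "attributes": [{"name": "title"}],
}
for problem in validate_input(bad):
    print(problem)
```

The defaults used here are the ones listed in the table; everything else is an assumption made for illustration.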

Attribute Configuration

Each attribute in attributes supports:

name (string, required) — Column name (CSV) or attribute key (JSON)
weight (number) — Importance weight for matching (higher = more important)
useForMatching (boolean) — Whether to include in similarity calculation
jsonPath (string) — JSON path expression for nested data
wordsToRemove (array) — List of words to strip before matching
wordReplacements (object) — Mapping of terms to replace prior to matching
regex (string) — Regex to apply during preprocessing
normalizationRegex (string) — Regex applied before similarity calculation
normalizationReplacement (string) — Replacement for normalization regex
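
To see how weight and useForMatching might interact, here is a minimal sketch that combines per-attribute scores into one overall score. This page does not state the Actor's formula, so the weighted mean below, the function name `overall_similarity`, and the fallback values for missing keys are all assumptions.

```python
def overall_similarity(attribute_configs, per_attribute_scores):
    """Weighted mean of per-attribute scores over attributes used for matching."""
    weighted_sum = 0.0
    total_weight = 0.0
    for attr in attribute_configs:
        if not attr.get("useForMatching", True):
            continue  # excluded attributes do not influence the score
        score = per_attribute_scores.get(attr["name"])
        if score is None:
            continue  # no score available for this attribute; skip it in this sketch
        weight = attr.get("weight", 1.0)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0

attributes = [
    {"name": "title", "weight": 1.0, "useForMatching": True},
    {"name": "brand", "weight": 0.8, "useForMatching": True},
    {"name": "price", "weight": 0.3, "useForMatching": False},
]
scores = {"title": 0.92, "brand": 1.0, "price": 0.10}

# (1.0 * 0.92 + 0.8 * 1.0) / (1.0 + 0.8); price is ignored.
print(round(overall_similarity(attributes, scores), 3))
```

Under this assumed formula only the ratio between weights matters, so weights of 1.0 and 0.8 behave the same as 10 and 8.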

Text Preprocessing example

{
  "name": "brand",
  "weight": 0.8,
  "useForMatching": true,
  "wordsToRemove": ["inc", "llc", "ltd", "corp"],
  "wordReplacements": {
    "apple": "apple inc",
    "samsung": "samsung electronics"
  },
  "regex": "\\b(inc|llc|ltd|corp)\\b",
  "normalizationRegex": "[^a-zA-Z0-9\\s]",
  "normalizationReplacement": ""
}

Property	Type	Description
`wordsToRemove`	array	Words to remove from text
`wordReplacements`	object	Word substitution mapping
`regex`	string	Regex pattern for text cleaning
`normalizationRegex`	string	Regex for similarity calculation normalization
`normalizationReplacement`	string	Replacement for normalization regex
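
The effect of these properties on a value can be sketched as a small pipeline. The step order (word removal, replacement, regex cleaning, normalization) follows the order in the feature list, but whether the Actor applies them in exactly this order, lowercases first, or replaces substrings rather than whole words is an assumption; `preprocess` is a hypothetical helper.

```python
import re

def preprocess(text, attr):
    """Apply one attribute's cleaning rules to a raw value (assumed step order)."""
    value = text.lower()
    # 1. Drop whole tokens listed in wordsToRemove.
    remove = set(attr.get("wordsToRemove", []))
    value = " ".join(tok for tok in value.split() if tok not in remove)
    # 2. Apply wordReplacements (plain substring replacement in this sketch).
    for old, new in attr.get("wordReplacements", {}).items():
        value = value.replace(old, new)
    # 3. Remove everything the cleaning regex matches.
    if attr.get("regex"):
        value = re.sub(attr["regex"], "", value)
    # 4. Normalize for comparison.
    if attr.get("normalizationRegex"):
        value = re.sub(attr["normalizationRegex"],
                       attr.get("normalizationReplacement", ""), value)
    return " ".join(value.split())  # collapse leftover whitespace

brand = {
    "wordsToRemove": ["inc", "llc", "ltd", "corp"],
    "wordReplacements": {"apple": "apple inc"},
    "regex": r"\b(inc|llc|ltd|corp)\b",          # "\\b(...)\\b" in JSON
    "normalizationRegex": r"[^a-zA-Z0-9\s]",
    "normalizationReplacement": "",
}

print(preprocess("Apple Inc.", brand))
print(preprocess("APPLE", brand))
```

Both spellings end up as the same string, which is the point of preprocessing: variants of one brand should compare as equal before embeddings are computed.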

Real-World Examples

1. E-commerce Catalog Matching

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "manufacturer_catalog.csv",
  "dataset1Name": "Manufacturer",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_inventory.csv",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.75,
  "maxMatches": 3,
  "language": "en",
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "product_name",
      "weight": 1.5,
      "useForMatching": true,
      "wordsToRemove": ["new", "original", "authentic"],
      "wordReplacements": {"&": "and", "w/": "with"}
    },
    {
      "name": "brand",
      "weight": 1.2,
      "useForMatching": true,
      "wordsToRemove": ["inc", "llc", "corp"],
      "wordReplacements": {"apple": "apple inc", "hp": "hewlett packard"}
    },
    {
      "name": "model_number",
      "weight": 1.8,
      "useForMatching": true,
      "normalizationRegex": "[^A-Za-z0-9]",
      "normalizationReplacement": ""
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false,
      "regex": "\\D"
    }
  ]
}

2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "fashion_supplier_a",
  "dataset1Name": "SupplierA",
  "dataset1PrimaryKey": "ID",
  "dataset2": "fashion_supplier_b",
  "dataset2Name": "SupplierB",
  "dataset2PrimaryKey": "ID",
  "threshold": 0.65,
  "language": "multilingual",
  "maxMatches": 2,
  "attributes": [
    {
      "name": "Color",
      "jsonPath": "ProductAttributes[Type=Color].Value",
      "weight": 1.5,
      "useForMatching": true,
      "wordReplacements": {"gray": "grey", "navy": "navy blue"}
    },
    {
      "name": "Size",
      "jsonPath": "ProductAttributes[Type=Size].Value",
      "weight": 1.8,
      "useForMatching": true,
      "wordsToRemove": ["size", "us", "eu"],
      "normalizationRegex": "[^0-9XLS]",
      "normalizationReplacement": ""
    },
    {
      "name": "Material",
      "jsonPath": "Details.Fabric.Primary",
      "weight": 1.2,
      "useForMatching": true
    }
  ],
  "includeOriginalValues": false
}

3. Home & Garden Products

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "bedbath",
  "dataset1Name": "BedBath",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "overstock",
  "dataset2Name": "Overstock",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.6,
  "language": "en",
  "csvSeparator": ",",
  "groupByAttribute": "Model",
  "maxMatches": 3,
  "attributes": [
    {
      "name": "Model",
      "jsonPath": "AdhocDataAttributes[Name=Model].value",
      "weight": 1,
      "useForMatching": false
    },
    {
      "name": "Color",
      "jsonPath": "AdhocDataAttributes[Name=Color].value",
      "weight": 2,
      "useForMatching": true,
      "wordReplacements": {
        "gray": "grey",
        "/": " "
      }
    },
    {
      "name": "Size",
      "jsonPath": "AdhocDataAttributes[Name=Size].value",
      "weight": 3,
      "useForMatching": true,
      "regex": "\\D"
    },
    {
      "name": "Shape",
      "jsonPath": "AdhocDataAttributes[Name=Shape].value",
      "weight": 1,
      "useForMatching": true
    }
  ],
  "dataset1OutputFields": [
    "Address",
    "ProductName"
  ]
}

Advanced Configuration

JSON Path Expressions

Dot notation: "product.details.name"
Array search: "Attributes[Name=Color].Value"
Nested arrays/objects for complex structures

Complex Nested Structures

{
  "ProductAttributes": [
    {"Type": "Color", "Value": "Red"},
    {"Type": "Size", "Value": "Large"},
    {"Type": "Material", "Value": "Cotton"}
  ],
  "Details": {
    "Pricing": {"MSRP": 29.99, "Sale": 19.99},
    "Specifications": {"Weight": "2.5 lbs"}
  }
}

Corresponding JSON paths:

Color: "ProductAttributes[Type=Color].Value"
Size: "ProductAttributes[Type=Size].Value"
MSRP: "Details.Pricing.MSRP"
Weight: "Details.Specifications.Weight"
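
The path syntax above (dot notation plus a `[Key=Value]` array search) can be illustrated with a small resolver. This is a sketch of how such paths could be evaluated, not the Actor's parser; `resolve_path` is a hypothetical name, and filter values that contain a dot are not handled.

```python
import re

# One path segment: a key, optionally followed by [Field=Value].
SEGMENT = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<field>[^=\]]+)=(?P<value>[^\]]+)\])?$")

def resolve_path(record, path):
    """Follow a path like 'A[Type=Color].Value' through nested dicts and lists."""
    current = record
    for segment in path.split("."):
        match = SEGMENT.match(segment)
        if match is None or not isinstance(current, dict):
            return None
        current = current.get(match["key"])
        if match["field"] is not None:
            if not isinstance(current, list):
                return None
            # Pick the first list item whose field equals the requested value.
            current = next(
                (item for item in current
                 if isinstance(item, dict)
                 and str(item.get(match["field"])) == match["value"]),
                None,
            )
        if current is None:
            return None
    return current

product = {
    "ProductAttributes": [
        {"Type": "Color", "Value": "Red"},
        {"Type": "Size", "Value": "Large"},
    ],
    "Details": {"Pricing": {"MSRP": 29.99, "Sale": 19.99}},
}

print(resolve_path(product, "ProductAttributes[Type=Color].Value"))
print(resolve_path(product, "Details.Pricing.MSRP"))
print(resolve_path(product, "Details.Fabric.Primary"))  # missing path
```

Returning `None` for a missing path keeps the sketch simple; a production matcher would more likely report the bad path as a configuration error.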

Regular Expression Patterns

Size cleaning: remove non-digits {"regex": "\\D"}
Model normalization: keep alphanumeric {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
Price extraction: strip currency symbols {"regex": "[^0-9.]"}
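
To check what these patterns do before putting them in a configuration, they can be tried with Python's `re` module. This assumes the Actor's regex dialect behaves like Python's for these simple patterns; the helper names are illustrative. Note that JSON needs the backslash doubled (`"\\D"`), while a Python raw string uses a single one.

```python
import re

def clean_size(value):
    """Size cleaning: drop every non-digit character."""
    return re.sub(r"\D", "", value)

def normalize_model(value):
    """Model normalization: keep letters and digits only."""
    return re.sub(r"[^A-Za-z0-9]", "", value)

def extract_price(value):
    """Price extraction: strip everything except digits and dots."""
    return re.sub(r"[^0-9.]", "", value)

print(clean_size("Size 42 EU"))        # 42
print(normalize_model("SM-G991B/DS"))  # SMG991BDS
print(extract_price("$1,299.99"))      # 1299.99
```

The price pattern keeps every dot, so a value with a European thousands separator such as "1.299,99" would come out as "1.29999"; such inputs need a different rule.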

Size Normalization

{
  "name": "size",
  "regex": "\\D",
  "normalizationRegex": "[^0-9XLS]",
  "normalizationReplacement": ""
}

regex: Removes all non-digit characters during preprocessing
normalizationRegex: For similarity calculation, keeps only numbers and X, L, S

Model Number Cleaning

{
  "name": "model",
  "regex": "\\b(model|version|v\\d+)\\b",
  "normalizationRegex": "[^a-zA-Z0-9]",
  "normalizationReplacement": ""
}

Removes common model prefixes
Normalizes to alphanumeric only for comparison

Price Extraction

{
  "name": "price",
  "regex": "[^0-9.]",
  "normalizationRegex": "\\$|,",
  "normalizationReplacement": ""
}

Extracts numeric price values
Removes currency symbols and commas

Brand Standardization

{
  "name": "brand",
  "regex": "\\b(inc|llc|ltd|corp|company)\\b",
  "wordReplacements": {
    "apple": "apple inc",
    "hp": "hewlett packard",
    "ms": "microsoft"
  }
}

Performance Optimization

Grouping by attribute reduces N×M comparisons to subsets
- Note: if the group-by field is nested in JSON, make sure it is also listed in the attributes
Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
Limit maxMatches for large catalogs
Disable matching (useForMatching: false) on grouping fields

Grouping Strategy

Use groupByAttribute to partition products into smaller groups:

{
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "category",
      "weight": 0.5,
      "useForMatching": false
    }
  ]
}

Benefits:

Reduces comparison matrix size from N×M to smaller subsets
Improves processing speed significantly for large datasets
More accurate matches within similar product categories
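
The size reduction can be made concrete by counting candidate pairs with and without grouping. This sketch assumes that products are compared only when their group values are equal; `comparison_count` is a hypothetical helper and the sample data is invented.

```python
from collections import Counter

def comparison_count(items1, items2, group_key=None):
    """Number of pairwise comparisons, optionally restricted to matching groups."""
    if group_key is None:
        return len(items1) * len(items2)
    counts1 = Counter(item[group_key] for item in items1)
    counts2 = Counter(item[group_key] for item in items2)
    # Counter returns 0 for groups that exist on one side only.
    return sum(n * counts2[group] for group, n in counts1.items())

catalog = [{"category": c} for c in ["phone", "phone", "phone", "laptop", "laptop"]]
retailer = [{"category": c} for c in ["phone", "phone", "laptop", "laptop", "laptop", "tablet"]]

print(comparison_count(catalog, retailer))              # 5 * 6 = 30
print(comparison_count(catalog, retailer, "category"))  # 3*2 + 2*3 = 12
```

The saving grows with the number of groups: with evenly sized groups, splitting into k groups cuts the work by roughly a factor of k. Products whose group value is wrong or missing are never compared, so the grouping field needs to be clean.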

Language Model Selection

Choose appropriate models based on your data:

English: "en" - Fastest, best for English-only data
Multilingual: "multilingual" - Slower but handles mixed languages
Specific Languages: "es", "fr", "de", "it", "pt", "nl" - Optimized for a single language

Output Format

The Actor generates matches with the following structure:

{
  "Dataset1ProductId": "PROD123",
  "Dataset2ProductId": "SKU456",
  "overallSimilarity": 0.85,
  "titleSimilarity": 0.92,
  "brandSimilarity": 1.0,
  "colorSimilarity": 0.75,
  "Dataset1Title": "Apple iPhone 13 Pro",
  "Dataset2Title": "iPhone 13 Pro - Apple",
  "Dataset1Brand": "Apple",
  "Dataset2Brand": "Apple Inc"
}

Reading the SUMMARY

After execution, a SUMMARY record is saved to KeyValueStore containing:

Total products per dataset
Number of matches and unique matches
Match rate
Model and data format used
Any collected errors with type, code, message, and suggestions

Review this summary to diagnose configuration or data issues quickly.

Best Practices

Attribute Weighting:
- High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
- Medium Weight (0.8-1.2): Important descriptors (brand, title)
- Low Weight (0.3-0.7): Secondary attributes (color, price)
Threshold Selection:
- High Precision (0.8-0.9): Few false positives, may miss some matches
- Balanced (0.6-0.8): Good balance of precision and recall
- High Recall (0.4-0.6): Catches more matches, requires manual review
Text Preprocessing:
- Start with simple wordReplacements
- Add regex for cleaning patterns
- Use normalizationRegex only for similarity calculation
- Validate on sample data

Scaling to Large Datasets:
- Always use groupByAttribute when > 10,000 items
- Adjust maxMatches and disable output of original values to reduce output dataset size
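
The threshold and maxMatches guidance above can be tried on a handful of scored candidates. This is a sketch of the selection step only: `select_matches` is a hypothetical helper, the scores are invented, and whether the Actor treats a score equal to the threshold as a match is an assumption.

```python
def select_matches(candidates, threshold=0.5, max_matches=2):
    """Keep candidates scoring at least `threshold`, best first, at most `max_matches`."""
    kept = [c for c in candidates if c[1] >= threshold]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:max_matches]

# (dataset 2 ID, overall similarity) pairs for one dataset 1 product.
candidates = [("SKU1", 0.91), ("SKU2", 0.72), ("SKU3", 0.55), ("SKU4", 0.83)]

for threshold in (0.8, 0.6, 0.4):
    ids = [sku for sku, _ in select_matches(candidates, threshold, max_matches=3)]
    print(threshold, ids)
```

Lowering the threshold from 0.8 to 0.6 adds SKU2, while going down to 0.4 changes nothing here because maxMatches already caps the list at three; the two settings need to be tuned together.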

Troubleshooting & Error Handling

Common Issues

No matches found
- Lower the threshold value
- Verify attribute names and JSON paths
- Adjust text preprocessing rules
Too many false positives
- Increase threshold to 0.8–0.9
- Add stricter wordsToRemove or regex
- Increase weights for unique identifiers
Performance bottlenecks
- Enable groupByAttribute for large datasets
- Use the English model for English-only data
- Reduce maxMatches

Error Types

This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.

Error Class	Code	Description
InputValidationError	PME-100	Schema or type validation failed for actor input
DataLoadingError	PME-200	CSV/JSON file not found, unreadable, or unparseable
AttributeConfigError	PME-300	Issues in the `attributes` section (missing columns, bad JSON paths, invalid weights)
ModelLoadingError	PME-400	Sentence-Transformer model fetch or cache failure
ProcessingError	PME-500	Failures during matching workflow (e.g., zero vectors, similarity computation errors)
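
One way to picture how structured errors like these can be collected for the SUMMARY is a small class hierarchy. The class names and codes come from the table above; the base class `MatcherError`, its placeholder code, the `suggestion` field layout, and `run_step` are hypothetical and only illustrate the pattern.

```python
class MatcherError(Exception):
    """Hypothetical base class; the Actor's real hierarchy is not shown on this page."""
    code = "PME-000"  # placeholder, not a code from the table

    def __init__(self, message, suggestion=""):
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion

    def to_record(self):
        return {
            "type": type(self).__name__,
            "code": self.code,
            "message": self.message,
            "suggestion": self.suggestion,
        }

class InputValidationError(MatcherError):
    code = "PME-100"

class DataLoadingError(MatcherError):
    code = "PME-200"

class AttributeConfigError(MatcherError):
    code = "PME-300"

def run_step(step, errors):
    """Run one pipeline step, recording structured errors instead of crashing."""
    try:
        return step()
    except MatcherError as exc:
        errors.append(exc.to_record())
        return None

def load_catalog():
    raise DataLoadingError("catalog.csv not found", "Check the KeyValueStore key")

errors = []
run_step(load_catalog, errors)
summary = {"errors": errors}
print(summary)
```

Only three of the five classes are spelled out to keep the sketch short; the remaining two would follow the same pattern with codes PME-400 and PME-500.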
