E-commerce Product Matching Tool
Under maintenance

Pricing

Pay per event

Developed by

Tri⟁angle

Maintained by Community

Quickly find and rank matching products from two sources using vector similarity search. This actor works with pre-built FAISS indexes to identify the best matches. Use it after vectorizing each dataset with the Product Matching Vectorizer, which stores the index in a Key-Value Store.

Last modified

3 days ago

Product Vector Matcher - Apify Actor

Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.

Overview

This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.

Key Features:

  • Fast FAISS-based similarity search
  • Configurable top-K results per product
  • Similarity threshold filtering
  • Streaming output with batch processing
  • Migration recovery with automatic checkpoint/resume
  • Real-time progress tracking with ETA
  • Detailed performance metrics

How It Works

  1. Load Indexes: Loads FAISS indexes and metadata from two KV stores
  2. Extract Vectors: Extracts vectors from Index A for searching
  3. Match Products: For each product in A, finds top K similar products in B
  4. Filter Results: Applies similarity threshold to identify matches
  5. Save Output: Streams results to dataset in batches
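
The matching step (step 3) can be illustrated with a small brute-force sketch. This is not the actor's code: it swaps FAISS for a plain Python loop so it runs without dependencies, and the names `l2_normalize` and `top_k_matches` are hypothetical. It relies only on the fact that the dot product of two unit-length vectors equals their cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave zero vectors unchanged
    return [x / norm for x in vec]


def top_k_matches(query, index_b, k):
    """Return the k (position, similarity) pairs in index_b closest to query."""
    q = l2_normalize(query)
    scored = []
    for pos, vec in enumerate(index_b):
        v = l2_normalize(vec)
        scored.append((pos, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first, then keep only the top k candidates.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (illustrative data only).
index_a = [[1.0, 0.0], [0.0, 2.0]]
index_b = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
results = [top_k_matches(vec, index_b, k=2) for vec in index_a]
```

A FAISS index performs the same search far more efficiently; the loop above is only meant to show what "top K similar products in B" means.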

Input Parameters

Required Parameters

  • kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
    • Example: "MDhkhfJXV2O3Ir7GE"
    • Must contain index.faiss and metadata.json created by product-matching-vectorizer
  • kvStoreIdB (string): ID of the second Key-Value Store to match against
    • Example: "aBcDeFgHiJkLmNoP"
    • Must contain index.faiss and metadata.json created by product-matching-vectorizer

Optional Parameters

  • topK (integer, default: 5): Number of top matches to return per product
    • Range: 1-100
    • Example: 5 returns the 5 most similar products from Index B for each product in A
  • similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
    • Converted to 0-1 scale internally (75 = 0.75)
    • Results are marked as matches if similarity >= threshold
    • Example: 75 means 75% similarity or higher
  • maxItems (integer, optional): Limit number of products to process from Index A
    • Useful for testing or partial runs
    • If not specified, processes all products
  • matchesOnly (boolean, default: false): Save only matches above threshold
    • false: Save all results (including non-matches) with is_match flag
    • true: Save only results where similarity >= threshold

Input Example

{
    "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
    "kvStoreIdB": "aBcDeFgHiJkLmNoP",
    "topK": 5,
    "similarityThreshold": 75,
    "maxItems": null,
    "matchesOnly": false
}

Output Format

Results are saved to a dataset with the following structure:

{
    "product_a_id": "prod_123",
    "product_a_metadata": {
        "title": "Canvas Tote Bag",
        "brand": "EcoBrand",
        "price": 2499
    },
    "product_b_id": "prod_456",
    "product_b_metadata": {
        "title": "Eco Canvas Tote",
        "brand": "GreenGoods",
        "price": 2599
    },
    "similarity": 0.8234,
    "rank": 1,
    "is_match": true
}

Fields:

  • product_a_id: Product ID from Index A
  • product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
  • product_b_id: Matched product ID from Index B
  • product_b_metadata: Metadata from Index B
  • similarity: Cosine similarity score (0-1, higher = more similar)
  • rank: Rank of this match (1 = best match, 2 = second best, etc.)
  • is_match: Boolean flag (true if similarity >= threshold)

Output Characteristics:

  • Each product from Index A generates up to topK result rows
  • Results are sorted by rank (best matches first)
  • If matchesOnly=true, only rows with is_match=true are saved
  • Metadata structure depends on the metadataMapping used in the vectorizer
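
The relationship between rank, is_match, and matchesOnly can be sketched as follows. The `build_rows` helper and the `{"id": ..., "metadata": ...}` shape of its inputs are assumptions made for illustration; only the output field names come from the structure documented above.

```python
def build_rows(product_a, candidates, threshold, matches_only=False):
    """Turn (product_b, similarity) candidates into ranked output rows.

    `threshold` is on the 0-1 scale. Hypothetical helper, not the actor's code.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    rows = []
    for rank, (product_b, similarity) in enumerate(ranked, start=1):
        is_match = similarity >= threshold
        if matches_only and not is_match:
            continue  # matchesOnly drops rows below the threshold
        rows.append({
            "product_a_id": product_a["id"],
            "product_a_metadata": product_a["metadata"],
            "product_b_id": product_b["id"],
            "product_b_metadata": product_b["metadata"],
            "similarity": similarity,
            "rank": rank,
            "is_match": is_match,
        })
    return rows


tote_a = {"id": "prod_123", "metadata": {"title": "Canvas Tote Bag"}}
tote_b = {"id": "prod_456", "metadata": {"title": "Eco Canvas Tote"}}
other_b = {"id": "prod_789", "metadata": {"title": "Leather Wallet"}}  # made-up product

all_rows = build_rows(tote_a, [(other_b, 0.61), (tote_b, 0.8234)], threshold=0.75)
match_rows = build_rows(tote_a, [(other_b, 0.61), (tote_b, 0.8234)], threshold=0.75, matches_only=True)
```

Here `all_rows` holds both candidates with their is_match flags, while `match_rows` keeps only the row at or above the threshold.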

Migration Recovery

The actor automatically handles Apify server migrations:

  1. State Persistence: Progress is saved on PERSIST_STATE events
  2. Checkpoint Resume: On restart, skips already processed products
  3. No Data Loss: All saved matches are preserved across migrations

State Management:

  • State is stored in the default Key-Value Store as product-vector-matcher-state
  • State is automatically cleared on successful completion
  • Resume is automatic and requires no manual intervention
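
The checkpoint/resume idea can be sketched in a few lines. Several things here are assumptions: a plain dict stands in for the default Key-Value Store, the state holds a single `processed` counter, and the checkpoint is written after every product rather than on PERSIST_STATE events. Only the state key name comes from the description above.

```python
STATE_KEY = "product-vector-matcher-state"


class Interrupted(Exception):
    """Stands in for a migration that stops the run mid-way."""


def run_matching(product_ids, kv_store, process):
    """Process products in order, resuming from a saved checkpoint if present."""
    state = dict(kv_store.get(STATE_KEY) or {"processed": 0})
    start = state["processed"]
    for offset, product_id in enumerate(product_ids[start:], start=start):
        process(product_id)
        state["processed"] = offset + 1
        kv_store[STATE_KEY] = dict(state)  # simplified: checkpoint after each product
    kv_store.pop(STATE_KEY, None)  # clear state on successful completion


seen = []
interrupt_at = {"p4"}


def process(product_id):
    # Fail once at "p4" to simulate an interruption, then succeed on retry.
    if product_id in interrupt_at:
        interrupt_at.discard(product_id)
        raise Interrupted(product_id)
    seen.append(product_id)


ids = ["p1", "p2", "p3", "p4", "p5"]
store = {}
try:
    run_matching(ids, store, process)  # first run stops at p4
except Interrupted:
    pass
state_after_interrupt = dict(store[STATE_KEY])
run_matching(ids, store, process)  # second run resumes at p4
```

After the second run every product has been processed exactly once and the state key is gone, mirroring the "cleared on successful completion" behaviour.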

Usage Examples

Example 1: Basic Matching

Match products between two catalogs with default settings:

{
    "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
    "kvStoreIdB": "aBcDeFgHiJkLmNoP"
}

This returns the top 5 candidates per product and saves all of them, whether or not they reach the default 75% threshold.

Example 2: High-Confidence Matches Only

Find only strong matches (80%+ similarity):

{
    "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
    "kvStoreIdB": "aBcDeFgHiJkLmNoP",
    "topK": 3,
    "similarityThreshold": 80,
    "matchesOnly": true
}

This returns up to 3 matches per product, but only saves results with 80%+ similarity.

Example 3: Testing with Limited Products

Test matching on a small subset:

{
    "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
    "kvStoreIdB": "aBcDeFgHiJkLmNoP",
    "topK": 5,
    "similarityThreshold": 75,
    "maxItems": 100
}

This processes only the first 100 products from Index A.

Example 4: Broad Candidate Search

Find many potential matches per product:

{
    "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
    "kvStoreIdB": "aBcDeFgHiJkLmNoP",
    "topK": 20,
    "similarityThreshold": 60,
    "matchesOnly": false
}

This returns up to 20 results per product with a lower threshold (60%).

Understanding Similarity Scores

The actor uses cosine similarity between L2-normalized embeddings:

  • 1.0: Perfect match (identical vectors)
  • 0.9-1.0: Extremely similar (likely same product or very close variants)
  • 0.8-0.9: Very similar (likely matching products with minor differences)
  • 0.7-0.8: Similar (related products, same category/brand)
  • 0.6-0.7: Somewhat similar (shared characteristics)
  • < 0.6: Not very similar

Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
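
One way to analyze a sample of results is to count how many rows would qualify as matches at several candidate thresholds. The sketch below assumes you have already collected the `similarity` values from a run with matchesOnly set to false; the `match_counts` helper and the sample scores are made up for illustration.

```python
def match_counts(similarities, thresholds):
    """Count rows at or above each threshold (thresholds on the 0-100 scale)."""
    return {
        t: sum(1 for s in similarities if s >= t / 100)
        for t in thresholds
    }


# Illustrative similarity scores, e.g. copied from a sample of dataset rows.
sample = [0.93, 0.8234, 0.78, 0.66, 0.52]
counts = match_counts(sample, thresholds=[60, 75, 80, 90])
```

Comparing these counts with a manual review of the rows near each cut-off gives a rough sense of where false positives start to appear.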

Performance Monitoring

The actor logs detailed performance metrics:

Timing Breakdown:

  • Index A load time
  • Index B load time
  • Vector extraction time
  • Total matching time
  • Average time per search

Throughput:

  • Products processed per second
  • Total runtime

Memory:

  • Initial and final memory usage
  • Memory delta

Output:

  • Number of batches saved
  • Total matches found
  • Matches above threshold
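
The throughput and ETA figures are simple arithmetic over the number of processed products and the elapsed time. The following sketch shows one plausible way to derive them; the `progress` function and its return keys are hypothetical and not taken from the actor's logs.

```python
def progress(processed, total, elapsed_seconds):
    """Compute percent done, products per second, and a naive ETA in seconds."""
    rate = processed / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = total - processed
    # ETA is undefined until at least one product has been processed.
    eta = remaining / rate if rate > 0 else None
    percent = 100 * processed / total if total else 100.0
    return {"percent": percent, "rate": rate, "eta_seconds": eta}


snapshot = progress(processed=250, total=1000, elapsed_seconds=5.0)
```

With 250 of 1000 products done in 5 seconds, the rate is 50 products per second and the remaining 750 products need about 15 more seconds.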

Files

Core Actor Files

  • src/main.py - Main actor with matching logic and migration recovery
  • .actor/actor.json - Actor metadata
  • .actor/input_schema.json - Input parameter schema
  • .actor/output_schema.json - Output result schema
  • .actor/dataset_schema.json - Dataset structure schema
  • Dockerfile - Container definition
  • requirements.txt - Python dependencies

Deployment

Deploy to Apify:

$ apify push

Or connect via Git repository in the Apify Console.

Troubleshooting

Error: "index.faiss not found in KV store"

Cause: The specified KV store ID doesn't exist or doesn't contain a vectorized index.

Solution:

  • Verify the KV store ID is correct
  • Ensure the product-matching-vectorizer has completed successfully
  • Check that the KV store contains both index.faiss and metadata.json

Error: "metadata.json not found in KV store"

Cause: The KV store exists but is missing the metadata file.

Solution: Re-run the product-matching-vectorizer to rebuild the index.

Memory Issues

Symptom: Out of memory errors during vector extraction or matching.

Solution:

  • Use maxItems to process products in chunks
  • Ensure adequate memory allocation in Actor settings
  • Consider using smaller indexes or upgrading Actor memory

Related Actors

  • Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)

Workflow Integration

This actor is typically used as the second step in a matching pipeline:

  1. Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
  2. Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
  3. Run product-vector-matcher with both KV store IDs → produces match results (Dataset)
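
The three steps above could be chained from a script along these lines. This is a sketch under several assumptions: `call_actor` is a caller-supplied function that starts an actor and returns its run record, the vectorizer accepts a `datasetId` input field, and it writes its index to the run's default Key-Value Store. None of these details are confirmed by this README, so adapt them to the vectorizer's actual input schema.

```python
def run_pipeline(call_actor, dataset_a_id, dataset_b_id, top_k=5, threshold=75):
    """Vectorize two datasets, then match them; returns the result dataset ID."""
    # Steps 1 and 2: build one index per dataset ("datasetId" is an assumed field name).
    run_a = call_actor("product-matching-vectorizer", {"datasetId": dataset_a_id})
    run_b = call_actor("product-matching-vectorizer", {"datasetId": dataset_b_id})

    # Step 3: match the two indexes (assumes each index is in the run's default KV store).
    matcher_input = {
        "kvStoreIdA": run_a["defaultKeyValueStoreId"],
        "kvStoreIdB": run_b["defaultKeyValueStoreId"],
        "topK": top_k,
        "similarityThreshold": threshold,
    }
    run_match = call_actor("product-vector-matcher", matcher_input)
    return run_match["defaultDatasetId"]


# A fake runner so the sketch can be exercised without network access.
calls = []


def fake_call_actor(name, run_input):
    calls.append((name, run_input))
    n = len(calls)
    return {"defaultKeyValueStoreId": f"kv-{n}", "defaultDatasetId": f"ds-{n}"}


result_dataset_id = run_pipeline(fake_call_actor, "dataset-a", "dataset-b")
```

In a real script, `call_actor` would likely wrap the Apify API client; check its documentation for the exact call signature and the shape of the returned run record.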