E-commerce Product Matching Tool
Under maintenance

Pricing

Pay per event

Quickly find and rank matching products from two sources using intelligent similarity search. This actor works with pre-built product data to identify the best matches. Use it after uploading your dataset to the vector database with the Product Matching Vectorizer.

Rating: 0.0 (0)

Developer: Tri⟁angle (Maintained by Community)

Actor stats: 0 bookmarked, 1 total user, 0 monthly active users; last modified 25 days ago

Product Vector Matcher - Apify Actor

Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.

Overview

This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.

Key Features:

  • Fast FAISS-based similarity search
  • Configurable top-K results per product
  • Similarity threshold filtering
  • Streaming output with batch processing
  • Migration recovery with automatic checkpoint/resume
  • Real-time progress tracking with ETA
  • Detailed performance metrics

How It Works

  1. Load Indexes: Loads FAISS indexes and metadata from two KV stores
  2. Extract Vectors: Extracts vectors from Index A for searching
  3. Match Products: For each product in A, finds top K similar products in B
  4. Filter Results: Applies similarity threshold to identify matches
  5. Save Output: Streams results to dataset in batches
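
The steps above can be sketched in a few lines of Python. This is an illustration only: it swaps the FAISS search for a brute-force pure-Python loop, and the helper names (normalize, match_products) and the toy vectors are assumptions, not the actor's actual code.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def match_products(index_a, index_b, top_k=5, threshold=0.75, matches_only=False):
    """Brute-force stand-in for the FAISS search (hypothetical helper, not the actor's code)."""
    b_units = [(b_id, normalize(vec)) for b_id, vec in index_b.items()]
    rows = []
    for a_id, a_vec in index_a.items():
        a_unit = normalize(a_vec)
        # Score every product in B, then keep the top_k highest similarities.
        scored = [(sum(x * y for x, y in zip(a_unit, b_unit)), b_id) for b_id, b_unit in b_units]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for rank, (similarity, b_id) in enumerate(scored[:top_k], start=1):
            is_match = similarity >= threshold
            if matches_only and not is_match:
                continue
            rows.append({
                "product_a_id": a_id,
                "product_b_id": b_id,
                "similarity": round(similarity, 4),
                "rank": rank,
                "is_match": is_match,
            })
    return rows


# Toy "indexes": product ID -> embedding vector (made-up values).
index_a = {"a1": [1.0, 0.0], "a2": [0.0, 2.0]}
index_b = {"b1": [2.0, 0.1], "b2": [0.0, 1.0], "b3": [1.0, 1.0]}
results = match_products(index_a, index_b, top_k=2, threshold=0.9)
for row in results:
    print(row)
```

A real FAISS index performs the same ranking without scoring every pair in Python, which is what makes the search fast on large catalogs.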

Input Parameters

Required Parameters

  • kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
    • Example: "MDhkhfJXV2O3Ir7GE"
    • Must contain index.faiss and metadata.json created by product-matching-vectorizer
  • kvStoreIdB (string): ID of the second Key-Value Store to match against
    • Example: "aBcDeFgHiJkLmNoP"
    • Must contain index.faiss and metadata.json created by product-matching-vectorizer

Optional Parameters

  • topK (integer, default: 5): Number of top matches to return per product
    • Range: 1-100
    • Example: 5 returns the 5 most similar products from Index B for each product in A
  • similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
    • Converted to 0-1 scale internally (75 = 0.75)
    • Results are marked as matches if similarity >= threshold
    • Example: 75 means 75% similarity or higher
  • maxItems (integer, optional): Limit number of products to process from Index A
    • Useful for testing or partial runs
    • If not specified, processes all products
  • matchesOnly (boolean, default: false): Save only matches above threshold
    • false: Save all results (including non-matches) with is_match flag
    • true: Save only results where similarity >= threshold
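
As a rough illustration of how these parameters combine, the sketch below applies the documented defaults and ranges to a raw input dictionary and converts the threshold to the 0-1 scale. The function name resolve_input and the returned key names are hypothetical; the actor's own validation may differ.

```python
def resolve_input(raw):
    """Apply the documented defaults and ranges to a raw input dict (hypothetical helper)."""
    for key in ("kvStoreIdA", "kvStoreIdB"):
        if not raw.get(key):
            raise ValueError(f"{key} is required")

    # Treat a missing or null value as "use the default".
    top_k = raw.get("topK")
    if top_k is None:
        top_k = 5
    if not 1 <= top_k <= 100:
        raise ValueError("topK must be between 1 and 100")

    threshold = raw.get("similarityThreshold")
    if threshold is None:
        threshold = 75
    if not 0 <= threshold <= 100:
        raise ValueError("similarityThreshold must be between 0 and 100")

    return {
        "kv_store_id_a": raw["kvStoreIdA"],
        "kv_store_id_b": raw["kvStoreIdB"],
        "top_k": top_k,
        "threshold": threshold / 100,  # 75 -> 0.75
        "max_items": raw.get("maxItems"),  # None means "process everything"
        "matches_only": bool(raw.get("matchesOnly", False)),
    }


config = resolve_input({"kvStoreIdA": "MDhkhfJXV2O3Ir7GE", "kvStoreIdB": "aBcDeFgHiJkLmNoP"})
print(config)
```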

Input Example

{
"kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
"kvStoreIdB": "aBcDeFgHiJkLmNoP",
"topK": 5,
"similarityThreshold": 75,
"maxItems": null,
"matchesOnly": false
}

Output Format

Results are saved to a dataset with the following structure:

{
"product_a_id": "prod_123",
"product_a_metadata": {
"title": "Canvas Tote Bag",
"brand": "EcoBrand",
"price": 2499
},
"product_b_id": "prod_456",
"product_b_metadata": {
"title": "Eco Canvas Tote",
"brand": "GreenGoods",
"price": 2599
},
"similarity": 0.8234,
"rank": 1,
"is_match": true
}

Fields:

  • product_a_id: Product ID from Index A
  • product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
  • product_b_id: Matched product ID from Index B
  • product_b_metadata: Metadata from Index B
  • similarity: Cosine similarity score (0-1, higher = more similar)
  • rank: Rank of this match (1 = best match, 2 = second best, etc.)
  • is_match: Boolean flag (true if similarity >= threshold)

Output Characteristics:

  • Each product from Index A generates up to topK result rows
  • Results are sorted by rank (best matches first)
  • If matchesOnly=true, only rows with is_match=true are saved
  • Metadata structure depends on the metadataMapping used in the vectorizer
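
Because each product from Index A can produce several rows, a common follow-up is to collapse the dataset to one best match per product. The sketch below shows one way to do that on rows shaped like the example above; the helper name best_matches and the sample rows are made up for illustration.

```python
def best_matches(rows):
    """Keep, for each product_a_id, the flagged match with the highest similarity."""
    best = {}
    for row in rows:
        if not row["is_match"]:
            continue  # skip rows below the threshold
        current = best.get(row["product_a_id"])
        if current is None or row["similarity"] > current["similarity"]:
            best[row["product_a_id"]] = row
    return best


# Sample rows (metadata fields omitted for brevity).
rows = [
    {"product_a_id": "prod_123", "product_b_id": "prod_456", "similarity": 0.8234, "rank": 1, "is_match": True},
    {"product_a_id": "prod_123", "product_b_id": "prod_789", "similarity": 0.7710, "rank": 2, "is_match": True},
    {"product_a_id": "prod_124", "product_b_id": "prod_456", "similarity": 0.5120, "rank": 1, "is_match": False},
]
pairs = {a_id: row["product_b_id"] for a_id, row in best_matches(rows).items()}
print(pairs)
```

Products with no row above the threshold, such as prod_124 here, are simply absent from the result.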

Migration Recovery

The actor automatically handles Apify server migrations:

  1. State Persistence: Progress is saved on PERSIST_STATE events
  2. Checkpoint Resume: On restart, skips already processed products
  3. No Data Loss: All saved matches are preserved across migrations

State Management:

  • State is stored in the default Key-Value Store as product-vector-matcher-state
  • State is automatically cleared on successful completion
  • Resume is automatic and requires no manual intervention
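
The checkpoint/resume idea can be sketched as follows. This is a simplified model under stated assumptions: a plain dict stands in for the default Key-Value Store, the state shape (a processed_count field) is hypothetical, and the state is written after every product rather than on PERSIST_STATE events as the actor does.

```python
STATE_KEY = "product-vector-matcher-state"


def run_with_checkpoint(product_ids, store, process, fail_after=None):
    """Process products in order, resuming from the saved position if one exists."""
    state = store.get(STATE_KEY) or {"processed_count": 0}
    start = state["processed_count"]
    for position in range(start, len(product_ids)):
        if fail_after is not None and position == fail_after:
            raise RuntimeError("simulated migration")  # stands in for a server migration
        process(product_ids[position])
        store[STATE_KEY] = {"processed_count": position + 1}  # checkpoint
    store.pop(STATE_KEY, None)  # clear state on successful completion


ids = ["p1", "p2", "p3", "p4"]
store, seen = {}, []

# First run is interrupted after two products.
try:
    run_with_checkpoint(ids, store, seen.append, fail_after=2)
except RuntimeError:
    pass
state_after_interrupt = dict(store)

# Second run resumes at the third product and finishes.
run_with_checkpoint(ids, store, seen.append)
print(seen, store)
```

Each product is processed exactly once across the two runs, and the state key is gone once the run completes.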

Usage Examples

Example 1: Basic Matching

Match products between two catalogs with default settings:

{
"kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
"kvStoreIdB": "aBcDeFgHiJkLmNoP"
}

This returns the top 5 results per product and saves all of them, flagging each with is_match according to the default 75% threshold.

Example 2: High-Confidence Matches Only

Find only strong matches (80%+ similarity):

{
"kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
"kvStoreIdB": "aBcDeFgHiJkLmNoP",
"topK": 3,
"similarityThreshold": 80,
"matchesOnly": true
}

This returns up to 3 matches per product, but only saves results with 80%+ similarity.

Example 3: Testing with Limited Products

Test matching on a small subset:

{
"kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
"kvStoreIdB": "aBcDeFgHiJkLmNoP",
"topK": 5,
"similarityThreshold": 75,
"maxItems": 100
}

This processes only the first 100 products from Index A.

Example 4: Many Potential Matches per Product

Find many potential matches per product:

{
"kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
"kvStoreIdB": "aBcDeFgHiJkLmNoP",
"topK": 20,
"similarityThreshold": 60,
"matchesOnly": false
}

This returns up to 20 results per product with a lower threshold (60%).

Understanding Similarity Scores

The actor uses cosine similarity between L2-normalized embeddings:

  • 1.0: Perfect match (identical vectors)
  • 0.9-1.0: Extremely similar (likely same product or very close variants)
  • 0.8-0.9: Very similar (likely matching products with minor differences)
  • 0.7-0.8: Similar (related products, same category/brand)
  • 0.6-0.7: Somewhat similar (shared characteristics)
  • < 0.6: Not very similar

Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
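
One way to analyze a sample of results is to count how many products would have a best match at several candidate thresholds. The sketch below assumes rows exported from a run with matchesOnly set to false; the helper name sweep_thresholds and the sample scores are illustrative only.

```python
def sweep_thresholds(rows, thresholds=(0.6, 0.7, 0.8, 0.9)):
    """Count products whose rank-1 result is at or above each candidate threshold."""
    top_scores = [row["similarity"] for row in rows if row["rank"] == 1]
    return {t: sum(score >= t for score in top_scores) for t in thresholds}


# Made-up sample: four products from Index A, plus one rank-2 row that is ignored.
sample_rows = [
    {"product_a_id": "a1", "similarity": 0.95, "rank": 1},
    {"product_a_id": "a1", "similarity": 0.71, "rank": 2},
    {"product_a_id": "a2", "similarity": 0.82, "rank": 1},
    {"product_a_id": "a3", "similarity": 0.65, "rank": 1},
    {"product_a_id": "a4", "similarity": 0.41, "rank": 1},
]
counts = sweep_thresholds(sample_rows)
print(counts)
```

Reviewing a handful of rows just above and just below each candidate threshold by hand usually shows where true matches start to thin out.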

Performance Monitoring

The actor logs detailed performance metrics:

Timing Breakdown:

  • Index A load time
  • Index B load time
  • Vector extraction time
  • Total matching time
  • Average time per search

Throughput:

  • Products processed per second
  • Total runtime

Memory:

  • Initial and final memory usage
  • Memory delta

Output:

  • Number of batches saved
  • Total matches found
  • Matches above threshold
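
For orientation, throughput and ETA figures of this kind can be derived from three numbers: products processed, total products, and elapsed time. The sketch below is an assumption about how such figures might be computed, not the actor's logging code; progress_stats is a hypothetical name.

```python
def progress_stats(processed, total, elapsed_seconds):
    """Derive throughput, completion percentage, and a simple linear ETA."""
    rate = processed / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = total - processed
    # Without a measurable rate there is no meaningful estimate yet.
    eta_seconds = remaining / rate if rate > 0 else None
    percent_done = 100 * processed / total if total else 100.0
    return {
        "products_per_second": rate,
        "percent_done": percent_done,
        "eta_seconds": eta_seconds,
    }


stats = progress_stats(processed=250, total=1000, elapsed_seconds=50.0)
print(stats)
```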

Files

Core Actor Files

  • src/main.py - Main actor with matching logic and migration recovery
  • .actor/actor.json - Actor metadata
  • .actor/input_schema.json - Input parameter schema
  • .actor/output_schema.json - Output result schema
  • .actor/dataset_schema.json - Dataset structure schema
  • Dockerfile - Container definition
  • requirements.txt - Python dependencies

Deployment

Deploy to Apify:

$ apify push

Or connect via Git repository in the Apify Console.

Troubleshooting

Error: "index.faiss not found in KV store"

Cause: The specified KV store ID doesn't exist or doesn't contain a vectorized index.

Solution:

  • Verify the KV store ID is correct
  • Ensure the product-matching-vectorizer has completed successfully
  • Check that the KV store contains both index.faiss and metadata.json

Error: "metadata.json not found in KV store"

Cause: The KV store exists but is missing the metadata file.

Solution: Re-run the product-matching-vectorizer to rebuild the index.

Memory Issues

Symptom: Out of memory errors during vector extraction or matching.

Solution:

  • Use maxItems to limit how many products from Index A are processed in a run
  • Ensure adequate memory allocation in Actor settings
  • Consider using smaller indexes or upgrading Actor memory

Related Actors

  • Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)

Workflow Integration

This actor is typically used as the second step in a matching pipeline:

  1. Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
  2. Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
  3. Run product-vector-matcher with both KV store IDs → produces match results (Dataset)