E-commerce Product Matching Tool

Pricing

from $5.00 / 1,000 results

Quickly find and rank matching products from two sources using intelligent similarity search. This actor works with pre-built product data to identify the best matches. Use it after uploading your dataset to the vector database with the Product Matching Vectorizer.

Developer

Tri⟁angle

Maintained by Apify

Product Vector Matcher - Apify Actor

Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.

Overview

This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.

Key Features:

  • Fast FAISS-based similarity search
  • Configurable top-K results per product
  • Similarity threshold filtering
  • Streaming output with batch processing
  • Migration recovery with automatic checkpoint/resume
  • Real-time progress tracking with ETA
  • Detailed performance metrics

How It Works

  1. Load Manifests: Downloads manifest.json from both KV stores to discover chunks
  2. Chunk-by-Chunk Matching: For each chunk of Index A, searches against all chunks of Index B
  3. Cross-Chunk Merging: Merges top-K results across all B chunks using min-heaps
  4. Filter & Output: Applies similarity threshold and streams results to dataset

Processing Flow

Load manifest_A and manifest_B from KV stores
  |
  v
For each chunk_a in A's chunks:
    Load chunk_a into RAM (FAISS index + ids + metadata, ~15MB)
    Initialize TopKAccumulator(top_k) for cross-chunk merging
      |
      v
    For each chunk_b in B's chunks:
        Load chunk_b into RAM (~15MB)
          |
          v
        For each batch of vectors in chunk_a (1000 at a time):
            Reconstruct vectors from chunk_a's FAISS index
            Search against chunk_b's FAISS index -> top-K per product
            Merge results into accumulator (keeps best K across all B chunks)
          |
          v
        Free chunk_b from memory
      |
      v
    # All B chunks searched -- emit final results for this A chunk
    For each product in chunk_a:
        Get global top-K from accumulator (merged across all B chunks)
        Output match results with metadata
      |
      v
    Mark chunk_a as completed, save state checkpoint
    Free chunk_a + accumulator from memory
  |
  v
Clear state, done

Memory at any point: ~30MB (1 A chunk + 1 B chunk) + accumulator
No mmap needed -- each chunk fits entirely in RAM
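
The cross-chunk merge in the flow above can be illustrated with a small min-heap accumulator. This is a minimal sketch, not the actor's actual code: the class name TopKAccumulator comes from the flow, but its methods (add, results) and internal layout are assumptions made for illustration.

```python
from __future__ import annotations

import heapq
from collections import defaultdict


class TopKAccumulator:
    """Keeps the best `top_k` matches per A product across many B chunks.

    Hypothetical sketch: method names and storage layout are assumptions.
    """

    def __init__(self, top_k: int) -> None:
        self.top_k = top_k
        # One min-heap per A product; the heap root is the weakest kept match.
        self._heaps: dict[str, list[tuple[float, str]]] = defaultdict(list)

    def add(self, product_a_id: str, product_b_id: str, similarity: float) -> None:
        heap = self._heaps[product_a_id]
        if len(heap) < self.top_k:
            heapq.heappush(heap, (similarity, product_b_id))
        elif similarity > heap[0][0]:
            # The new candidate beats the weakest kept match: swap it in.
            heapq.heapreplace(heap, (similarity, product_b_id))

    def results(self, product_a_id: str) -> list[tuple[str, float]]:
        """Return (product_b_id, similarity) pairs, best first."""
        ranked = sorted(self._heaps.get(product_a_id, []), reverse=True)
        return [(b_id, sim) for sim, b_id in ranked]


acc = TopKAccumulator(top_k=2)
# Candidates found in the first B chunk
acc.add("a1", "b1", 0.71)
acc.add("a1", "b2", 0.64)
# A stronger candidate found in a later B chunk displaces the weakest one
acc.add("a1", "b9", 0.88)
print(acc.results("a1"))  # [('b9', 0.88), ('b1', 0.71)]
```

Because each heap never holds more than top_k entries, the accumulator's memory grows with the number of products in the current A chunk, not with the size of Index B.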

Input Parameters

Required Parameters

  • kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
    • Example: "MDhkhfJXV2O3Ir7GE"
    • Must contain manifest.json and chunk files created by product-matching-vectorizer (v2.0+)
  • kvStoreIdB (string): ID of the second Key-Value Store to match against
    • Example: "aBcDeFgHiJkLmNoP"
    • Must contain manifest.json and chunk files created by product-matching-vectorizer (v2.0+)

Optional Parameters

  • topK (integer, default: 5): Number of top matches to return per product
    • Range: 1-100
    • Example: 5 returns the 5 most similar products from Index B for each product in A
  • similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
    • Converted to 0-1 scale internally (75 = 0.75)
    • Results are marked as matches if similarity >= threshold
    • Example: 75 means 75% similarity or higher
  • maxItems (integer, optional): Limit number of products to process from Index A
    • Useful for testing or partial runs
    • If not specified, processes all products
  • matchesOnly (boolean, default: false): Save only matches above threshold
    • false: Save all results (including non-matches) with is_match flag
    • true: Save only results where similarity >= threshold
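
When building run inputs programmatically, it can help to apply the documented defaults and ranges before starting a run. The helper below is a hypothetical sketch (build_run_input is not part of the actor); it only encodes the defaults and limits listed above.

```python
from __future__ import annotations

# Defaults as documented in the parameter list above.
DEFAULTS = {
    "topK": 5,
    "similarityThreshold": 75,
    "maxItems": None,
    "matchesOnly": False,
}


def build_run_input(kv_store_id_a: str, kv_store_id_b: str, **overrides) -> dict:
    """Merge overrides onto the documented defaults and check documented ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    run_input = {
        "kvStoreIdA": kv_store_id_a,
        "kvStoreIdB": kv_store_id_b,
        **DEFAULTS,
        **overrides,
    }
    if not 1 <= run_input["topK"] <= 100:
        raise ValueError("topK must be between 1 and 100")
    if not 0 <= run_input["similarityThreshold"] <= 100:
        raise ValueError("similarityThreshold must be between 0 and 100")
    return run_input


print(build_run_input("MDhkhfJXV2O3Ir7GE", "aBcDeFgHiJkLmNoP", topK=3))
```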

Input Example

{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": null,
  "matchesOnly": false
}

Output Format

Results are saved to a dataset with the following structure:

{
  "product_a_id": "prod_123",
  "product_a_metadata": {
    "title": "Canvas Tote Bag",
    "brand": "EcoBrand",
    "price": 2499
  },
  "product_b_id": "prod_456",
  "product_b_metadata": {
    "title": "Eco Canvas Tote",
    "brand": "GreenGoods",
    "price": 2599
  },
  "similarity": 0.8234,
  "rank": 1,
  "is_match": true
}

Fields:

  • product_a_id: Product ID from Index A
  • product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
  • product_b_id: Matched product ID from Index B
  • product_b_metadata: Metadata from Index B
  • similarity: Cosine similarity score (0-1, higher = more similar)
  • rank: Rank of this match (1 = best match, 2 = second best, etc.)
  • is_match: Boolean flag (true if similarity >= threshold)

Output Characteristics:

  • Each product from Index A generates up to topK result rows
  • Results are sorted by rank (best matches first)
  • If matchesOnly=true, only rows with is_match=true are saved
  • Metadata structure depends on the metadataMapping used in the vectorizer
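
The relationship between similarityThreshold, is_match, rank, and matchesOnly can be sketched as follows. This is an illustration under assumptions, not the actor's code: build_rows is a hypothetical helper, metadata fields are omitted, and it assumes rank keeps its position from the full ranking even when matchesOnly drops rows.

```python
from __future__ import annotations


def build_rows(
    product_a_id: str,
    ranked_candidates: list[tuple[str, float]],
    similarity_threshold: int = 75,
    matches_only: bool = False,
) -> list[dict]:
    """Turn best-first (product_b_id, similarity) pairs into output rows."""
    threshold = similarity_threshold / 100  # 75 -> 0.75
    rows = []
    for rank, (product_b_id, similarity) in enumerate(ranked_candidates, start=1):
        is_match = similarity >= threshold
        if matches_only and not is_match:
            continue  # matchesOnly=true keeps only rows at or above the threshold
        rows.append(
            {
                "product_a_id": product_a_id,
                "product_b_id": product_b_id,
                "similarity": similarity,
                "rank": rank,
                "is_match": is_match,
            }
        )
    return rows


candidates = [("prod_456", 0.8234), ("prod_789", 0.7012)]
print(len(build_rows("prod_123", candidates)))                     # 2
print(len(build_rows("prod_123", candidates, matches_only=True)))  # 1
```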

Migration Recovery

The actor tracks progress at the chunk level:

  • Completed A chunks are recorded in state and skipped on resume
  • Within the current A chunk, already-emitted product IDs are tracked
  • On restart: re-searches all B chunks for the current A chunk (fast: <1s per 10k x 10k search)
  • No duplicate output: processed_ids prevents re-emission

State Management:

  • State is stored in the default Key-Value Store as product-vector-matcher-state
  • State is automatically cleared on successful completion
  • Resume is automatic and requires no manual intervention
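
The resume behavior can be sketched with a plain dict standing in for the default Key-Value Store. Everything here is an assumption made for illustration: the state layout (completed_chunks, processed_ids), the run_with_checkpoints helper, and saving after every product (a real actor would more likely persist periodically or on a migration event). Only the state key name comes from the description above.

```python
from __future__ import annotations

STATE_KEY = "product-vector-matcher-state"


def _save(kv_store: dict, completed: set, processed: set) -> None:
    kv_store[STATE_KEY] = {
        "completed_chunks": sorted(completed),
        "processed_ids": sorted(processed),
    }


def run_with_checkpoints(chunks_a: dict, kv_store: dict, emit) -> None:
    """Process A chunks, skipping finished chunks and already-emitted products.

    chunks_a: {chunk_name: [product_id, ...]}
    kv_store: dict standing in for the default Key-Value Store
    emit:     callable invoked once per product
    """
    state = kv_store.get(STATE_KEY) or {"completed_chunks": [], "processed_ids": []}
    completed = set(state["completed_chunks"])
    processed = set(state["processed_ids"])

    for chunk_name, product_ids in chunks_a.items():
        if chunk_name in completed:
            continue  # chunk finished before the restart
        for product_id in product_ids:
            if product_id in processed:
                continue  # already emitted before the restart
            emit(product_id)
            processed.add(product_id)
            _save(kv_store, completed, processed)
        completed.add(chunk_name)
        processed.clear()  # processed_ids only covers the current chunk
        _save(kv_store, completed, processed)

    kv_store.pop(STATE_KEY, None)  # state is cleared on successful completion


chunks = {"chunk_0": ["a1", "a2"], "chunk_1": ["a3", "a4"]}
store, emitted = {}, []


def emit_until_a4(product_id: str) -> None:
    if product_id == "a4":
        raise RuntimeError("simulated migration")
    emitted.append(product_id)


try:
    run_with_checkpoints(chunks, store, emit_until_a4)
except RuntimeError:
    pass  # interrupted after a1, a2, a3 were emitted

run_with_checkpoints(chunks, store, emitted.append)  # resume
print(emitted)  # ['a1', 'a2', 'a3', 'a4']
```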

Usage Examples

Example 1: Basic Matching

Match products between two catalogs with default settings:

{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP"
}

This returns the top 5 results per product and saves all of them, matches and non-matches alike, each flagged with is_match.

Example 2: High-Confidence Matches Only

Find only strong matches (80%+ similarity):

{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 3,
  "similarityThreshold": 80,
  "matchesOnly": true
}

This returns up to 3 matches per product, but only saves results with 80%+ similarity.

Example 3: Testing with Limited Products

Test matching on a small subset:

{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": 100
}

This processes only the first 100 products from Index A.

Example 4: Broad Candidate Search

Find many potential matches per product:

{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 20,
  "similarityThreshold": 60,
  "matchesOnly": false
}

This returns up to 20 results per product with a lower threshold (60%).

Understanding Similarity Scores

The actor uses cosine similarity between L2-normalized embeddings:

  • 1.0: Perfect match (identical vectors)
  • 0.9-1.0: Extremely similar (likely same product or very close variants)
  • 0.8-0.9: Very similar (likely matching products with minor differences)
  • 0.7-0.8: Similar (related products, same category/brand)
  • 0.6-0.7: Somewhat similar (shared characteristics)
  • < 0.6: Not very similar
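
For L2-normalized vectors, cosine similarity reduces to a plain inner product, which is a common way to obtain cosine scores from an inner-product search index. The pure-Python sketch below demonstrates that identity on two toy vectors; whether the actor's FAISS indexes are built exactly this way is an assumption.

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(u, v):
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )


a = [3.0, 4.0]
b = [4.0, 3.0]

# Both lines print the same value: normalizing first makes the
# inner product equal to the cosine similarity.
print(round(cosine_similarity(a, b), 4))                           # 0.96
print(round(inner_product(l2_normalize(a), l2_normalize(b)), 4))   # 0.96
```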

Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
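
One way to explore thresholds is to run with matchesOnly set to false and count how many products would have a best match at each candidate cutoff. The sketch below assumes result rows shaped like the output format above; matches_per_threshold is a hypothetical helper and the sample rows are invented purely for illustration.

```python
from __future__ import annotations


def matches_per_threshold(rows: list[dict], thresholds=(60, 70, 75, 80, 90)) -> dict:
    """Count A products whose best (rank 1) result clears each threshold."""
    best = [row["similarity"] for row in rows if row["rank"] == 1]
    return {t: sum(sim >= t / 100 for sim in best) for t in thresholds}


# Invented sample rows, not real actor output.
sample_rows = [
    {"product_a_id": "a1", "rank": 1, "similarity": 0.91},
    {"product_a_id": "a1", "rank": 2, "similarity": 0.77},
    {"product_a_id": "a2", "rank": 1, "similarity": 0.72},
    {"product_a_id": "a3", "rank": 1, "similarity": 0.58},
]

print(matches_per_threshold(sample_rows))
# {60: 2, 70: 2, 75: 1, 80: 1, 90: 1}
```

Spot-checking a handful of rows just above and just below a candidate cutoff is usually more informative than the counts alone.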

Performance Monitoring

The actor logs detailed performance metrics:

Timing Breakdown:

  • Index A load time
  • Index B load time
  • Vector extraction time
  • Total matching time
  • Average time per search

Throughput:

  • Products processed per second
  • Total runtime

Memory:

  • Initial and final memory usage
  • Memory delta

Output:

  • Number of batches saved
  • Total matches found
  • Matches above threshold

Files

Core Actor Files

  • src/main.py - Main actor with matching logic and migration recovery
  • .actor/actor.json - Actor metadata
  • .actor/input_schema.json - Input parameter schema
  • .actor/output_schema.json - Output result schema
  • .actor/dataset_schema.json - Dataset structure schema
  • Dockerfile - Container definition
  • requirements.txt - Python dependencies

Deployment

Deploy to Apify:

apify push

Or connect via Git repository in the Apify Console.

Troubleshooting

Error: "manifest.json not found or empty in KV store"

Cause: The specified KV store doesn't contain a chunked index (requires vectorizer v2.0+).

Solution:

  • Verify the KV store ID is correct
  • Ensure the product-matching-vectorizer (v2.0+) has completed successfully
  • Check that the KV store contains manifest.json and chunk files

Memory Issues

Symptom: Out of memory errors during matching.

Solution:

  • Each chunk is ~15 MB, so peak memory is ~30 MB (1 A chunk + 1 B chunk) plus accumulator
  • If still hitting limits, ensure adequate memory allocation in Actor settings
  • Use maxItems to limit the number of products processed

Related Actors

  • Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)

Workflow Integration

This actor is typically used as the second step in a matching pipeline:

  1. Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
  2. Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
  3. Run product-vector-matcher with both KV store IDs → produces match results (Dataset)
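
Step 3 can also be driven from code. The sketch below assumes a client object with the interface of ApifyClient from the apify-client Python package (actor(...).call(run_input=...) and dataset(...).iterate_items()). The run_matcher helper and the actor ID passed to it are placeholders, and the vectorizer steps are left out because their input schema is not described here.

```python
from __future__ import annotations


def run_matcher(client, matcher_actor_id: str, kv_store_id_a: str,
                kv_store_id_b: str, **options) -> list[dict]:
    """Run the matcher on two vectorized KV stores and return its result rows.

    client:            an apify_client.ApifyClient instance (or compatible object)
    matcher_actor_id:  placeholder; use the ID shown on the actor's store page
    options:           optional inputs such as topK or similarityThreshold
    """
    run_input = {
        "kvStoreIdA": kv_store_id_a,
        "kvStoreIdB": kv_store_id_b,
        **options,
    }
    # Start the actor and wait for the run to finish.
    run = client.actor(matcher_actor_id).call(run_input=run_input)
    # Match rows are written to the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

With the apify-client package installed, a call could look like run_matcher(ApifyClient("<API_TOKEN>"), "<matcher-actor-id>", "MDhkhfJXV2O3Ir7GE", "aBcDeFgHiJkLmNoP", topK=3), where the token and actor ID are yours to fill in.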