E-commerce Product Matching Tool
Pricing
Pay per event
Under maintenance
Quickly find and rank matching products from two sources using intelligent similarity search. This actor works with pre-built product data to identify the best matches. Use it after uploading your dataset to the vector database with the Product Matching Vectorizer.
Last modified
3 days ago
Product Vector Matcher - Apify Actor
Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.
Overview
This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.
Key Features:
- Fast FAISS-based similarity search
- Configurable top-K results per product
- Similarity threshold filtering
- Streaming output with batch processing
- Migration recovery with automatic checkpoint/resume
- Real-time progress tracking with ETA
- Detailed performance metrics
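The "streaming output with batch processing" feature can be pictured as grouping result rows into fixed-size batches before each write to the dataset. The sketch below is illustrative only: the helper name `batched` and the batch size are assumptions, not taken from the actor's source.

```python
def batched(rows, batch_size):
    """Yield lists of at most `batch_size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Five rows split into batches of two; each batch would be one dataset write.
batches = list(batched(range(5), batch_size=2))
```

Writing in batches keeps memory use bounded, because only one batch of rows is held at a time.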
How It Works
- Load Indexes: Loads FAISS indexes and metadata from two KV stores
- Extract Vectors: Extracts vectors from Index A for searching
- Match Products: For each product in A, finds top K similar products in B
- Filter Results: Applies similarity threshold to identify matches
- Save Output: Streams results to dataset in batches
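The matching and filtering steps above can be sketched in plain Python. This is a brute-force stand-in for the FAISS search, not the actor's implementation: the function names and the tuple layout are assumptions made for illustration, and a real index would replace the inner loop.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def top_k_matches(vectors_a, vectors_b, k=5, threshold=0.75):
    """For each vector in A, return the k most similar vectors in B.

    Brute-force stand-in for a FAISS similarity search.
    Each hit is (index_in_b, similarity, rank, is_match).
    """
    normed_b = [l2_normalize(v) for v in vectors_b]
    results = []
    for vec_a in vectors_a:
        a = l2_normalize(vec_a)
        scored = [
            (idx_b, sum(x * y for x, y in zip(a, b)))
            for idx_b, b in enumerate(normed_b)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # best first
        hits = [
            (idx_b, sim, rank, sim >= threshold)
            for rank, (idx_b, sim) in enumerate(scored[:k], start=1)
        ]
        results.append(hits)
    return results


# Toy 2-dimensional "indexes" (hypothetical data).
index_a = [[1.0, 0.0], [0.0, 2.0]]
index_b = [[2.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matches = top_k_matches(index_a, index_b, k=2, threshold=0.75)
```

In this toy run, the first vector in A ranks the first vector in B at the top and flags it as a match, while its second-ranked hit falls below the threshold.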
Input Parameters
Required Parameters
- kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
  - Example: "MDhkhfJXV2O3Ir7GE"
  - Must contain index.faiss and metadata.json created by product-matching-vectorizer
- kvStoreIdB (string): ID of the second Key-Value Store to match against
  - Example: "aBcDeFgHiJkLmNoP"
  - Must contain index.faiss and metadata.json created by product-matching-vectorizer
Optional Parameters
- topK (integer, default: 5): Number of top matches to return per product
  - Range: 1-100
  - Example: 5 returns the 5 most similar products from Index B for each product in A
- similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
  - Converted to 0-1 scale internally (75 = 0.75)
  - Results are marked as matches if similarity >= threshold
  - Example: 75 means 75% similarity or higher
- maxItems (integer, optional): Limit number of products to process from Index A
  - Useful for testing or partial runs
  - If not specified, processes all products
- matchesOnly (boolean, default: false): Save only matches above threshold
  - false: Save all results (including non-matches) with is_match flag
  - true: Save only results where similarity >= threshold
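As a rough illustration of how these parameters might be read, the sketch below applies the documented defaults and ranges and converts the threshold to the 0-1 scale. The function name `normalize_input` and the snake_case keys it returns are assumptions for this example, not part of the actor.

```python
def normalize_input(raw):
    """Apply the documented defaults and ranges to a raw input dict."""
    for key in ("kvStoreIdA", "kvStoreIdB"):
        if not raw.get(key):
            raise ValueError(f"{key} is required")

    top_k = raw.get("topK", 5)
    if not 1 <= top_k <= 100:
        raise ValueError("topK must be between 1 and 100")

    threshold_pct = raw.get("similarityThreshold", 75)
    if not 0 <= threshold_pct <= 100:
        raise ValueError("similarityThreshold must be between 0 and 100")

    return {
        "kv_store_id_a": raw["kvStoreIdA"],
        "kv_store_id_b": raw["kvStoreIdB"],
        "top_k": top_k,
        "threshold": threshold_pct / 100,  # 75 -> 0.75
        "max_items": raw.get("maxItems"),  # None means "process everything"
        "matches_only": raw.get("matchesOnly", False),
    }


config = normalize_input(
    {"kvStoreIdA": "MDhkhfJXV2O3Ir7GE", "kvStoreIdB": "aBcDeFgHiJkLmNoP"}
)
```

With only the two required IDs supplied, `config` carries the defaults: top 5 results, a 0.75 threshold, no item limit, and all results saved.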
Input Example
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": null,
  "matchesOnly": false
}
Output Format
Results are saved to a dataset with the following structure:
{
  "product_a_id": "prod_123",
  "product_a_metadata": {
    "title": "Canvas Tote Bag",
    "brand": "EcoBrand",
    "price": 2499
  },
  "product_b_id": "prod_456",
  "product_b_metadata": {
    "title": "Eco Canvas Tote",
    "brand": "GreenGoods",
    "price": 2599
  },
  "similarity": 0.8234,
  "rank": 1,
  "is_match": true
}
Fields:
- product_a_id: Product ID from Index A
- product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
- product_b_id: Matched product ID from Index B
- product_b_metadata: Metadata from Index B
- similarity: Cosine similarity score (0-1, higher = more similar)
- rank: Rank of this match (1 = best match, 2 = second best, etc.)
- is_match: Boolean flag (true if similarity >= threshold)
Output Characteristics:
- Each product from Index A generates up to topK result rows
- Results are sorted by rank (best matches first)
- If matchesOnly=true, only rows with is_match=true are saved
- Metadata structure depends on the metadataMapping used in the vectorizer
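The row structure and the effect of matchesOnly can be illustrated with a small sketch. The helper `build_rows` and the second candidate product (`prod_789`) are hypothetical. Keeping each row's original rank after filtering is also an assumption, since the README does not say whether ranks are renumbered.

```python
def build_rows(product_a, hits, threshold=0.75, matches_only=False):
    """Turn ranked hits for one product from Index A into output rows.

    `hits` is a list of (product_b, similarity) pairs, already sorted
    from most to least similar and truncated to topK.
    """
    rows = []
    for rank, (product_b, similarity) in enumerate(hits, start=1):
        is_match = similarity >= threshold
        if matches_only and not is_match:
            continue  # assumption: filtered rows are simply not saved
        rows.append({
            "product_a_id": product_a["id"],
            "product_a_metadata": product_a["metadata"],
            "product_b_id": product_b["id"],
            "product_b_metadata": product_b["metadata"],
            "similarity": similarity,
            "rank": rank,
            "is_match": is_match,
        })
    return rows


tote = {"id": "prod_123", "metadata": {"title": "Canvas Tote Bag"}}
hits = [
    ({"id": "prod_456", "metadata": {"title": "Eco Canvas Tote"}}, 0.8234),
    ({"id": "prod_789", "metadata": {"title": "Leather Wallet"}}, 0.41),
]
all_rows = build_rows(tote, hits)
only_matches = build_rows(tote, hits, matches_only=True)
```

Here `all_rows` holds both candidates with their is_match flags, while `only_matches` keeps just the row at or above the threshold.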
Migration Recovery
The actor automatically handles Apify server migrations:
- State Persistence: Progress is saved on PERSIST_STATE events
- Checkpoint Resume: On restart, skips already processed products
- No Data Loss: All saved matches are preserved across migrations
State Management:
- State is stored in the default Key-Value Store as product-vector-matcher-state
- State is automatically cleared on successful completion
- Resume is automatic and requires no manual intervention
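The checkpoint/resume behaviour described above might look roughly like the following. This is a simplified sketch under stated assumptions: a plain dict stands in for the default Key-Value Store, and the state layout (`{"processed": n}`) and the `fail_after` parameter are invented for the demo. It also saves state after every product, whereas the actor is described as saving on PERSIST_STATE events.

```python
STATE_KEY = "product-vector-matcher-state"


def run_matching(product_ids, store, process, fail_after=None):
    """Process products in order, checkpointing the position in `store`.

    `store` is a plain dict standing in for the default Key-Value Store.
    `fail_after` simulates a migration by stopping after that many items.
    """
    state = store.get(STATE_KEY, {"processed": 0})
    start = state["processed"]  # skip products handled before the restart
    for offset, product_id in enumerate(product_ids[start:], start=start):
        if fail_after is not None and offset - start >= fail_after:
            return False  # interrupted; the saved state stays in the store
        process(product_id)
        store[STATE_KEY] = {"processed": offset + 1}
    store.pop(STATE_KEY, None)  # clear state on successful completion
    return True


store, seen = {}, []
products = ["a", "b", "c", "d"]
finished = run_matching(products, store, seen.append, fail_after=2)
resumed = run_matching(products, store, seen.append)
```

The first call stops after two products and leaves its state behind; the second call picks up at the third product, so each product is processed exactly once.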
Usage Examples
Example 1: Basic Matching
Match products between two catalogs with default settings:
{"kvStoreIdA": "MDhkhfJXV2O3Ir7GE","kvStoreIdB": "aBcDeFgHiJkLmNoP"}
This returns the top 5 matches per product, with all results (matches and non-matches).
Example 2: High-Confidence Matches Only
Find only strong matches (80%+ similarity):
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 3,
  "similarityThreshold": 80,
  "matchesOnly": true
}
This returns up to 3 matches per product, but only saves results with 80%+ similarity.
Example 3: Testing with Limited Products
Test matching on a small subset:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": 100
}
This processes only the first 100 products from Index A.
Example 4: Comprehensive Search
Find many potential matches per product:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 20,
  "similarityThreshold": 60,
  "matchesOnly": false
}
This returns up to 20 results per product with a lower threshold (60%).
Understanding Similarity Scores
The actor uses cosine similarity between L2-normalized embeddings:
- 1.0: Perfect match (identical vectors)
- 0.9-1.0: Extremely similar (likely same product or very close variants)
- 0.8-0.9: Very similar (likely matching products with minor differences)
- 0.7-0.8: Similar (related products, same category/brand)
- 0.6-0.7: Somewhat similar (shared characteristics)
- < 0.6: Not very similar
Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
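One simple way to analyze a sample of results, as suggested in the note, is to count how many rows would be flagged as matches at several candidate thresholds. The helper below is a hypothetical sketch, and the sample scores are made up for illustration.

```python
def match_counts_by_threshold(similarities, thresholds=(60, 70, 75, 80, 90)):
    """Count how many sampled results would be flagged at each threshold.

    `thresholds` use the 0-100 scale of `similarityThreshold`;
    `similarities` are 0-1 scores as found in the output dataset.
    """
    return {
        pct: sum(1 for s in similarities if s >= pct / 100)
        for pct in thresholds
    }


# Similarity values from a small, made-up sample of output rows.
sample = [0.95, 0.82, 0.78, 0.74, 0.66, 0.51]
counts = match_counts_by_threshold(sample)
```

Comparing these counts with a manual review of the same rows gives a sense of where false matches start to appear for a particular catalog.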
Performance Monitoring
The actor logs detailed performance metrics:
Timing Breakdown:
- Index A load time
- Index B load time
- Vector extraction time
- Total matching time
- Average time per search
Throughput:
- Products processed per second
- Total runtime
Memory:
- Initial and final memory usage
- Memory delta
Output:
- Number of batches saved
- Total matches found
- Matches above threshold
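The throughput and ETA figures can be derived from three numbers: products processed, total products, and elapsed time. The sketch below shows one plausible calculation; the function name and the returned keys are assumptions, and the actor's own log format may differ.

```python
def progress_report(processed, total, elapsed_seconds):
    """Return throughput (products/second) and estimated seconds remaining."""
    if processed <= 0 or elapsed_seconds <= 0:
        return {"per_second": 0.0, "eta_seconds": None}  # not enough data yet
    per_second = processed / elapsed_seconds
    remaining = max(total - processed, 0)
    return {"per_second": per_second, "eta_seconds": remaining / per_second}


report = progress_report(processed=250, total=1000, elapsed_seconds=50.0)
```

For 250 of 1000 products processed in 50 seconds, this gives 5 products per second and an estimated 150 seconds remaining.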
Files
Core Actor Files
- src/main.py - Main actor with matching logic and migration recovery
- .actor/actor.json - Actor metadata
- .actor/input_schema.json - Input parameter schema
- .actor/output_schema.json - Output result schema
- .actor/dataset_schema.json - Dataset structure schema
- Dockerfile - Container definition
- requirements.txt - Python dependencies
Deployment
Deploy to Apify:
$ apify push
Or connect via Git repository in the Apify Console.
Troubleshooting
Error: "index.faiss not found in KV store"
Cause: The specified KV store ID doesn't exist or doesn't contain a vectorized index.
Solution:
- Verify the KV store ID is correct
- Ensure the product-matching-vectorizer has completed successfully
- Check that the KV store contains both index.faiss and metadata.json
Error: "metadata.json not found in KV store"
Cause: The KV store exists but is missing the metadata file.
Solution: Re-run the product-matching-vectorizer to rebuild the index.
Memory Issues
Symptom: Out of memory errors during vector extraction or matching.
Solution:
- Use maxItems to process products in chunks
- Ensure adequate memory allocation in Actor settings
- Consider using smaller indexes or upgrading Actor memory
Related Actors
- Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)
Workflow Integration
This actor is typically used as the second step in a matching pipeline:
- Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
- Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
- Run product-vector-matcher with both KV store IDs → produces match results (Dataset)
