E-commerce Product Matching Tool
Pricing
Pay per event
Quickly find and rank matching products from two sources using intelligent similarity search. This actor works with pre-built product data to identify the best matches. Use it after uploading your dataset to the vector database with the Product Matching Vectorizer.
Developer
Tri⟁angle
Product Vector Matcher - Apify Actor
Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.
Overview
This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.
Key Features:
- Fast FAISS-based similarity search
- Configurable top-K results per product
- Similarity threshold filtering
- Streaming output with batch processing
- Migration recovery with automatic checkpoint/resume
- Real-time progress tracking with ETA
- Detailed performance metrics
How It Works
- Load Indexes: Loads FAISS indexes and metadata from two KV stores
- Extract Vectors: Extracts vectors from Index A for searching
- Match Products: For each product in A, finds top K similar products in B
- Filter Results: Applies similarity threshold to identify matches
- Save Output: Streams results to dataset in batches
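The matching step above can be sketched without FAISS. This is a minimal pure-Python illustration, not the Actor's implementation: it replaces the FAISS index with a brute-force scan, and the names `l2_normalize`, `top_k_matches`, `index_a`, and `index_b` are hypothetical. It relies on the fact that the dot product of two L2-normalized vectors equals their cosine similarity.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k_matches(query: Vector, index_b: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k entries of index_b most similar to query, best first.

    Brute-force stand-in for a FAISS search: every vector in B is scored.
    """
    q = l2_normalize(query)
    scored = []
    for product_id, vec in index_b.items():
        v = l2_normalize(vec)
        # Dot product of unit vectors == cosine similarity.
        scored.append((product_id, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings", purely illustrative.
index_a = {"a1": [1.0, 0.0], "a2": [0.0, 2.0]}
index_b = {"b1": [2.0, 0.0], "b2": [1.0, 1.0], "b3": [0.0, 1.0]}

# For each product in A, find the top 2 similar products in B.
matches = {pid: top_k_matches(vec, index_b, k=2) for pid, vec in index_a.items()}
```

A FAISS index answers the same query far faster on large catalogs; the brute-force loop is only meant to make the ranking logic visible.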
Input Parameters
Required Parameters
kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
- Example: "MDhkhfJXV2O3Ir7GE"
- Must contain index.faiss and metadata.json created by product-matching-vectorizer
kvStoreIdB (string): ID of the second Key-Value Store to match against
- Example: "aBcDeFgHiJkLmNoP"
- Must contain index.faiss and metadata.json created by product-matching-vectorizer
Optional Parameters
topK (integer, default: 5): Number of top matches to return per product
- Range: 1-100
- Example: 5 returns the 5 most similar products from Index B for each product in A
similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
- Converted to a 0-1 scale internally (75 = 0.75)
- Results are marked as matches if similarity >= threshold
- Example: 75 means 75% similarity or higher
maxItems (integer, optional): Limit on the number of products to process from Index A
- Useful for testing or partial runs
- If not specified, all products are processed
matchesOnly (boolean, default: false): Save only matches at or above the threshold
- false: Save all results (including non-matches) with the is_match flag
- true: Save only results where similarity >= threshold
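How these parameters might be validated and defaulted can be sketched as follows. The helper `normalize_input` and the derived `threshold` key are hypothetical; only the parameter names, defaults, the 1-100 range for topK, and the 0-100 to 0-1 conversion come from the descriptions above.

```python
from typing import Any, Dict

# Defaults as documented for the optional parameters.
DEFAULTS: Dict[str, Any] = {
    "topK": 5,
    "similarityThreshold": 75,
    "maxItems": None,
    "matchesOnly": False,
}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults, check ranges, and add a 0-1 'threshold' value."""
    for key in ("kvStoreIdA", "kvStoreIdB"):
        if not raw.get(key):
            raise ValueError(f"Missing required parameter: {key}")

    # Explicit nulls (e.g. "maxItems": null) fall back to the defaults.
    cfg = {**DEFAULTS, **{k: v for k, v in raw.items() if v is not None}}

    if not 1 <= cfg["topK"] <= 100:
        raise ValueError("topK must be between 1 and 100")
    if not 0 <= cfg["similarityThreshold"] <= 100:
        raise ValueError("similarityThreshold must be between 0 and 100")

    # 75 on the 0-100 input scale becomes 0.75 on the cosine scale.
    cfg["threshold"] = cfg["similarityThreshold"] / 100.0
    return cfg


cfg = normalize_input({"kvStoreIdA": "MDhkhfJXV2O3Ir7GE", "kvStoreIdB": "aBcDeFgHiJkLmNoP", "maxItems": None})
```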
Input Example
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": null,
  "matchesOnly": false
}
Output Format
Results are saved to a dataset with the following structure:
{
  "product_a_id": "prod_123",
  "product_a_metadata": {
    "title": "Canvas Tote Bag",
    "brand": "EcoBrand",
    "price": 2499
  },
  "product_b_id": "prod_456",
  "product_b_metadata": {
    "title": "Eco Canvas Tote",
    "brand": "GreenGoods",
    "price": 2599
  },
  "similarity": 0.8234,
  "rank": 1,
  "is_match": true
}
Fields:
- product_a_id: Product ID from Index A
- product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
- product_b_id: Matched product ID from Index B
- product_b_metadata: Metadata from Index B
- similarity: Cosine similarity score (0-1, higher = more similar)
- rank: Rank of this match (1 = best match, 2 = second best, etc.)
- is_match: Boolean flag (true if similarity >= threshold)
Output Characteristics:
- Each product from Index A generates up to topK result rows
- Results are sorted by rank (best matches first)
- If matchesOnly=true, only rows with is_match=true are saved
- Metadata structure depends on the metadataMapping used in the vectorizer
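The row-building rules above can be illustrated with a short sketch. The function `build_rows` and its arguments are hypothetical; the field names, the rank ordering, and the `matchesOnly` behavior follow the output description, and the product data reuses the example record.

```python
from typing import Any, Dict, List, Tuple


def build_rows(
    product_a_id: str,
    meta_a: Dict[str, Any],
    hits: List[Tuple[str, float]],
    meta_b: Dict[str, Dict[str, Any]],
    threshold: float,
    matches_only: bool,
) -> List[Dict[str, Any]]:
    """Turn (product_b_id, similarity) hits into ranked output rows."""
    rows = []
    # Best match first, so rank 1 is the highest similarity.
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    for rank, (product_b_id, similarity) in enumerate(ranked, start=1):
        is_match = similarity >= threshold
        if matches_only and not is_match:
            continue  # matchesOnly=true drops rows below the threshold
        rows.append({
            "product_a_id": product_a_id,
            "product_a_metadata": meta_a,
            "product_b_id": product_b_id,
            "product_b_metadata": meta_b.get(product_b_id, {}),
            "similarity": similarity,
            "rank": rank,
            "is_match": is_match,
        })
    return rows


meta_b = {
    "prod_456": {"title": "Eco Canvas Tote", "brand": "GreenGoods", "price": 2599},
    "prod_789": {"title": "Leather Satchel"},  # made-up second candidate
}
hits = [("prod_789", 0.61), ("prod_456", 0.8234)]
meta_a = {"title": "Canvas Tote Bag", "brand": "EcoBrand", "price": 2499}

all_rows = build_rows("prod_123", meta_a, hits, meta_b, threshold=0.75, matches_only=False)
match_rows = build_rows("prod_123", meta_a, hits, meta_b, threshold=0.75, matches_only=True)
```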
Migration Recovery
The actor automatically handles Apify server migrations:
- State Persistence: Progress is saved on PERSIST_STATE events
- Checkpoint Resume: On restart, skips already processed products
- No Data Loss: All saved matches are preserved across migrations
State Management:
- State is stored in the default Key-Value Store as product-vector-matcher-state
- State is automatically cleared on successful completion
- Resume is automatic and requires no manual intervention
Usage Examples
Example 1: Basic Matching
Match products between two catalogs with default settings:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP"
}
This returns the top 5 matches per product, with all results (matches and non-matches).
Example 2: High-Confidence Matches Only
Find only strong matches (80%+ similarity):
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 3,
  "similarityThreshold": 80,
  "matchesOnly": true
}
This returns up to 3 matches per product, but only saves results with 80%+ similarity.
Example 3: Testing with Limited Products
Test matching on a small subset:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": 100
}
This processes only the first 100 products from Index A.
Example 4: Comprehensive Search
Find many potential matches per product:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 20,
  "similarityThreshold": 60,
  "matchesOnly": false
}
This returns up to 20 results per product with a lower threshold (60%).
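The inputs above can also be submitted programmatically. The sketch below assumes the `apify-client` Python package and a placeholder Actor ID; `build_run_input` and `run_matcher` are hypothetical helpers, and the real Actor ID should be taken from the Actor's page.

```python
from typing import Any, Dict, Iterator, Optional

# Placeholder: replace with the Actor ID shown in the Apify Console.
MATCHER_ACTOR_ID = "<username>/product-vector-matcher"


def build_run_input(
    kv_store_id_a: str,
    kv_store_id_b: str,
    top_k: int = 5,
    similarity_threshold: int = 75,
    matches_only: bool = False,
    max_items: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the Actor input using the documented parameter names."""
    run_input: Dict[str, Any] = {
        "kvStoreIdA": kv_store_id_a,
        "kvStoreIdB": kv_store_id_b,
        "topK": top_k,
        "similarityThreshold": similarity_threshold,
        "matchesOnly": matches_only,
    }
    if max_items is not None:
        run_input["maxItems"] = max_items
    return run_input


def run_matcher(run_input: Dict[str, Any], token: str) -> Iterator[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and yield its dataset rows."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(MATCHER_ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()


# Example 2 from above: high-confidence matches only.
high_confidence = build_run_input(
    "MDhkhfJXV2O3Ir7GE", "aBcDeFgHiJkLmNoP",
    top_k=3, similarity_threshold=80, matches_only=True,
)
```

Calling `run_matcher(high_confidence, token)` with a valid API token would then iterate over the match rows; it is not invoked here because it requires network access and a real Actor ID.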
Understanding Similarity Scores
The actor uses cosine similarity between L2-normalized embeddings:
- 1.0: Perfect match (identical vectors)
- 0.9-1.0: Extremely similar (likely same product or very close variants)
- 0.8-0.9: Very similar (likely matching products with minor differences)
- 0.7-0.8: Similar (related products, same category/brand)
- 0.6-0.7: Somewhat similar (shared characteristics)
- < 0.6: Not very similar
Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
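One way to analyze a sample, sketched under the assumption that a run with matchesOnly set to false has been exported and its similarity values collected into a list: count how many rows would qualify as matches at several candidate thresholds. The function `match_counts` is hypothetical and the sample scores are made up for illustration.

```python
from typing import Dict, Iterable, List


def match_counts(similarities: Iterable[float], thresholds_pct: List[int]) -> Dict[int, int]:
    """Count rows with similarity >= threshold for each 0-100 threshold."""
    sims = list(similarities)
    return {t: sum(1 for s in sims if s >= t / 100.0) for t in thresholds_pct}


# Illustrative similarity values, not real results.
sample = [0.95, 0.88, 0.82, 0.76, 0.71, 0.64, 0.52]
counts = match_counts(sample, [60, 70, 75, 80, 90])

for threshold, count in counts.items():
    print(f"threshold {threshold}: {count} of {len(sample)} rows would be matches")
```

Manually reviewing the rows just above and just below each candidate threshold shows where false positives start to appear for a given catalog.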
Performance Monitoring
The actor logs detailed performance metrics:
Timing Breakdown:
- Index A load time
- Index B load time
- Vector extraction time
- Total matching time
- Average time per search
Throughput:
- Products processed per second
- Total runtime
Memory:
- Initial and final memory usage
- Memory delta
Output:
- Number of batches saved
- Total matches found
- Matches above threshold
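The throughput and ETA figures can be derived from three numbers: products processed, total products, and elapsed time. The sketch below shows one plausible calculation; the function `progress_stats` and its return keys are hypothetical and do not reproduce the Actor's actual log format.

```python
from typing import Dict, Optional


def progress_stats(processed: int, total: int, elapsed_seconds: float) -> Dict[str, Optional[float]]:
    """Compute throughput, percent complete, and a linear ETA estimate."""
    rate = processed / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = total - processed
    # Without any throughput yet, no ETA can be estimated.
    eta = remaining / rate if rate > 0 else None
    percent = 100.0 * processed / total if total else 100.0
    return {
        "products_per_second": rate,
        "percent_complete": percent,
        "eta_seconds": eta,
    }


stats = progress_stats(processed=250, total=1000, elapsed_seconds=50.0)
```

The ETA assumes the remaining products take as long on average as the ones already processed.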
Files
Core Actor Files
- src/main.py - Main actor with matching logic and migration recovery
- .actor/actor.json - Actor metadata
- .actor/input_schema.json - Input parameter schema
- .actor/output_schema.json - Output result schema
- .actor/dataset_schema.json - Dataset structure schema
- Dockerfile - Container definition
- requirements.txt - Python dependencies
Deployment
Deploy to Apify:
$ apify push
Or connect via Git repository in the Apify Console.
Troubleshooting
Error: "index.faiss not found in KV store"
Cause: The specified KV store ID doesn't exist or doesn't contain a vectorized index.
Solution:
- Verify the KV store ID is correct
- Ensure the product-matching-vectorizer has completed successfully
- Check that the KV store contains both index.faiss and metadata.json
Error: "metadata.json not found in KV store"
Cause: The KV store exists but is missing the metadata file.
Solution: Re-run the product-matching-vectorizer to rebuild the index.
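Both errors can be caught before starting a run by checking the store's keys. The sketch below is a hypothetical pre-flight helper: `missing_index_files` takes any iterable of key names, which could be obtained, for example, from the key listing of a Key-Value Store in the Apify Console or API client (an assumption; the lookup itself is not shown).

```python
from typing import Iterable, List

# File names the matcher expects in each Key-Value Store.
REQUIRED_FILES = ("index.faiss", "metadata.json")


def missing_index_files(present_keys: Iterable[str]) -> List[str]:
    """Return the required files that are absent from a store's key list."""
    present = set(present_keys)
    return [name for name in REQUIRED_FILES if name not in present]


incomplete = missing_index_files(["index.faiss", "INPUT"])
complete = missing_index_files(["metadata.json", "index.faiss"])

if incomplete:
    print("Re-run product-matching-vectorizer; missing:", ", ".join(incomplete))
```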
Memory Issues
Symptom: Out of memory errors during vector extraction or matching.
Solution:
- Use maxItems to process products in chunks
- Ensure adequate memory allocation in Actor settings
- Consider using smaller indexes or upgrading Actor memory
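A rough lower bound on the memory an index needs can help when choosing an Actor memory setting. The sketch below assumes an uncompressed flat index of float32 vectors (4 bytes per dimension); the index type and embedding dimension actually produced by the vectorizer are not stated here, so `estimate_flat_index_bytes` and the numbers in the example are illustrative only.

```python
BYTES_PER_FLOAT32 = 4


def estimate_flat_index_bytes(num_vectors: int, dimension: int) -> int:
    """Raw vector storage for an uncompressed float32 index."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


# Hypothetical catalog: one million products with 384-dimensional embeddings.
index_bytes = estimate_flat_index_bytes(1_000_000, 384)
index_gib = index_bytes / 1024 ** 3

print(f"~{index_gib:.2f} GiB for the vectors alone")
```

Both indexes are loaded, and vectors are also extracted from Index A, so actual usage will be higher than the figure for a single index.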
Related Actors
- Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)
Workflow Integration
This actor is typically used as the second step in a matching pipeline:
- Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
- Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
- Run product-vector-matcher with both KV store IDs → produces match results (Dataset)


