E-commerce Product Matching Tool
Pricing
from $5.00 / 1,000 results
Quickly find and rank matching products from two sources using intelligent similarity search. This actor works with pre-built product data to identify the best matches. Use it after uploading your dataset to the vector database with the Product Matching Vectorizer.
Developer

Tri⟁angle
2 days ago
Last modified
Product Vector Matcher - Apify Actor
Matches products between two vectorized indexes using FAISS similarity search. Loads pre-built indexes from Key-Value Stores and generates ranked matches based on cosine similarity.
Overview
This actor takes two FAISS indexes (created by the product-matching-vectorizer) and finds similar products between them. For each product in Index A, it searches for the most similar products in Index B and outputs ranked match results.
Key Features:
- Fast FAISS-based similarity search
- Configurable top-K results per product
- Similarity threshold filtering
- Streaming output with batch processing
- Migration recovery with automatic checkpoint/resume
- Real-time progress tracking with ETA
- Detailed performance metrics
How It Works
- Load Manifests: Downloads manifest.json from both KV stores to discover chunks
- Chunk-by-Chunk Matching: For each chunk of Index A, searches against all chunks of Index B
- Cross-Chunk Merging: Merges top-K results across all B chunks using min-heaps
- Filter & Output: Applies similarity threshold and streams results to dataset
Processing Flow
Load manifest_A and manifest_B from KV stores
  |
  v
For each chunk_a in A's chunks:
    Load chunk_a into RAM (FAISS index + ids + metadata, ~15MB)
    Initialize TopKAccumulator(top_k) for cross-chunk merging
      |
      v
    For each chunk_b in B's chunks:
        Load chunk_b into RAM (~15MB)
          |
          v
        For each batch of vectors in chunk_a (1000 at a time):
            Reconstruct vectors from chunk_a's FAISS index
            Search against chunk_b's FAISS index -> top-K per product
            Merge results into accumulator (keeps best K across all B chunks)
          |
          v
        Free chunk_b from memory
      |
      v
    # All B chunks searched -- emit final results for this A chunk
    For each product in chunk_a:
        Get global top-K from accumulator (merged across all B chunks)
        Output match results with metadata
      |
      v
    Mark chunk_a as completed, save state checkpoint
    Free chunk_a + accumulator from memory
  |
  v
Clear state, done

Memory at any point: ~30MB (1 A chunk + 1 B chunk) + accumulator
No mmap needed -- each chunk fits entirely in RAM
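The cross-chunk merge step can be illustrated with a small standalone sketch. The class name TopKAccumulator comes from the flow above, but its methods (add, results) and the sample IDs and scores are assumptions made for illustration, not the actor's actual code. The sketch keeps one min-heap per product from chunk A, so the weakest kept match is always at the heap root and can be replaced cheaply when a later B chunk yields a better candidate.

```python
import heapq
from collections import defaultdict


class TopKAccumulator:
    """Keeps the best `top_k` (score, b_id) pairs per A product across partial searches."""

    def __init__(self, top_k: int):
        self.top_k = top_k
        # One min-heap per A product; the heap root is the weakest kept match.
        self._heaps = defaultdict(list)

    def add(self, a_id: str, b_id: str, score: float) -> None:
        heap = self._heaps[a_id]
        if len(heap) < self.top_k:
            heapq.heappush(heap, (score, b_id))
        elif score > heap[0][0]:
            # The new candidate beats the current weakest match, so swap it in.
            heapq.heapreplace(heap, (score, b_id))

    def results(self, a_id: str) -> list:
        # Best match first; rank is the 1-based position in this list.
        return sorted(self._heaps.get(a_id, []), reverse=True)


acc = TopKAccumulator(top_k=2)
# Hypothetical per-chunk search results as (a_id, b_id, similarity) tuples.
chunk_b1 = [("a1", "b1", 0.71), ("a1", "b2", 0.64)]
chunk_b2 = [("a1", "b7", 0.93), ("a1", "b9", 0.50)]
for a_id, b_id, score in chunk_b1 + chunk_b2:
    acc.add(a_id, b_id, score)
print(acc.results("a1"))
```

Because only top_k entries are retained per product, the accumulator's size depends on the number of products in the current A chunk and on top_k, not on the total size of Index B.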
Input Parameters
Required Parameters
- kvStoreIdA (string): ID of the first Key-Value Store containing the vectorized products
  - Example: "MDhkhfJXV2O3Ir7GE"
  - Must contain manifest.json and chunk files created by product-matching-vectorizer (v2.0+)
- kvStoreIdB (string): ID of the second Key-Value Store to match against
  - Example: "aBcDeFgHiJkLmNoP"
  - Must contain manifest.json and chunk files created by product-matching-vectorizer (v2.0+)
Optional Parameters
- topK (integer, default: 5): Number of top matches to return per product
  - Range: 1-100
  - Example: 5 returns the 5 most similar products from Index B for each product in A
- similarityThreshold (integer, default: 75): Minimum similarity score (0-100)
  - Converted to 0-1 scale internally (75 = 0.75)
  - Results are marked as matches if similarity >= threshold
  - Example: 75 means 75% similarity or higher
- maxItems (integer, optional): Limit number of products to process from Index A
  - Useful for testing or partial runs
  - If not specified, processes all products
- matchesOnly (boolean, default: false): Save only matches above threshold
  - false: Save all results (including non-matches) with is_match flag
  - true: Save only results where similarity >= threshold
Input Example
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": null,
  "matchesOnly": false
}
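To make the defaults and ranges concrete, here is a minimal sketch of how such an input could be validated and normalized. The function name normalize_input and the error messages are hypothetical; only the parameter names, defaults, and ranges are taken from the descriptions above, and the actor's real validation may differ.

```python
def normalize_input(raw: dict) -> dict:
    """Apply documented defaults and range checks to a raw input dict (illustrative only)."""
    for key in ("kvStoreIdA", "kvStoreIdB"):
        if not raw.get(key):
            raise ValueError(f"{key} is required")

    top_k = raw.get("topK", 5)
    if not 1 <= top_k <= 100:
        raise ValueError("topK must be between 1 and 100")

    threshold_pct = raw.get("similarityThreshold", 75)
    if not 0 <= threshold_pct <= 100:
        raise ValueError("similarityThreshold must be between 0 and 100")

    return {
        "kvStoreIdA": raw["kvStoreIdA"],
        "kvStoreIdB": raw["kvStoreIdB"],
        "topK": top_k,
        # Converted to the 0-1 scale used for cosine similarity (75 -> 0.75).
        "similarityThreshold": threshold_pct / 100.0,
        # None means "process all products from Index A".
        "maxItems": raw.get("maxItems"),
        "matchesOnly": bool(raw.get("matchesOnly", False)),
    }


config = normalize_input({"kvStoreIdA": "MDhkhfJXV2O3Ir7GE", "kvStoreIdB": "aBcDeFgHiJkLmNoP"})
print(config)
```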
Output Format
Results are saved to a dataset with the following structure:
{
  "product_a_id": "prod_123",
  "product_a_metadata": {
    "title": "Canvas Tote Bag",
    "brand": "EcoBrand",
    "price": 2499
  },
  "product_b_id": "prod_456",
  "product_b_metadata": {
    "title": "Eco Canvas Tote",
    "brand": "GreenGoods",
    "price": 2599
  },
  "similarity": 0.8234,
  "rank": 1,
  "is_match": true
}
Fields:
- product_a_id: Product ID from Index A
- product_a_metadata: Metadata from Index A (structure depends on vectorizer config)
- product_b_id: Matched product ID from Index B
- product_b_metadata: Metadata from Index B
- similarity: Cosine similarity score (0-1, higher = more similar)
- rank: Rank of this match (1 = best match, 2 = second best, etc.)
- is_match: Boolean flag (true if similarity >= threshold)
Output Characteristics:
- Each product from Index A generates up to topK result rows
- Results are sorted by rank (best matches first)
- If matchesOnly=true, only rows with is_match=true are saved
- Metadata structure depends on the metadataMapping used in the vectorizer
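The relationship between rank, is_match, and matchesOnly can be sketched as follows. The helper build_rows is hypothetical and omits the metadata fields for brevity. It also assumes that rank is assigned before filtering, so a row keeps its original rank even when weaker rows are dropped; the actor may number ranks differently.

```python
def build_rows(a_id, ranked, threshold_pct=75, matches_only=False):
    """Turn best-first (similarity, b_id) pairs into output rows (illustrative only)."""
    threshold = threshold_pct / 100.0  # 75 -> 0.75
    rows = []
    for rank, (similarity, b_id) in enumerate(ranked, start=1):
        is_match = similarity >= threshold
        if matches_only and not is_match:
            continue  # matchesOnly=true drops rows below the threshold
        rows.append({
            "product_a_id": a_id,
            "product_b_id": b_id,
            "similarity": round(similarity, 4),
            "rank": rank,
            "is_match": is_match,
        })
    return rows


ranked = [(0.8234, "prod_456"), (0.61, "prod_789")]
all_rows = build_rows("prod_123", ranked)
only_matches = build_rows("prod_123", ranked, matches_only=True)
print(len(all_rows), len(only_matches))
```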
Migration Recovery
The actor tracks progress at the chunk level:
- Completed A chunks are recorded in state and skipped on resume
- Within the current A chunk, already-emitted product IDs are tracked
- On restart: re-searches all B chunks for the current A chunk (fast: <1s per 10k x 10k search)
- No duplicate output: processed_ids prevents re-emission
State Management:
- State is stored in the default Key-Value Store as product-vector-matcher-state
- State is automatically cleared on successful completion
- Resume is automatic and requires no manual intervention
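The resume behavior described in this section can be sketched with a plain dictionary standing in for the persisted state. Only the name processed_ids comes from the text above; the completed_chunks key, the run_matching function, and the emit callback are assumptions for illustration, and the real state layout is not documented here.

```python
def run_matching(a_chunks, state, emit):
    """Process A chunks, skipping work recorded in `state` (illustrative only).

    a_chunks: dict mapping chunk name -> list of product IDs in that chunk.
    state:    dict persisted between runs (hypothetical layout).
    emit:     callback that writes the results for one product.
    """
    completed = set(state.setdefault("completed_chunks", []))
    for chunk_name, product_ids in a_chunks.items():
        if chunk_name in completed:
            continue  # this chunk finished before the restart
        already_emitted = set(state.setdefault("processed_ids", []))
        for pid in product_ids:
            if pid in already_emitted:
                continue  # emitted before the restart, so skip to avoid duplicates
            emit(pid)
            state["processed_ids"].append(pid)
        state["completed_chunks"].append(chunk_name)
        state["processed_ids"] = []  # the next chunk starts with a clean list
    state.clear()  # mirrors "automatically cleared on successful completion"


# Simulated restart: chunk_0 is done, and p3 from chunk_1 was already emitted.
state = {"completed_chunks": ["chunk_0"], "processed_ids": ["p3"]}
a_chunks = {"chunk_0": ["p1", "p2"], "chunk_1": ["p3", "p4"]}
emitted = []
run_matching(a_chunks, state, emitted.append)
print(emitted)
```

In this simulated restart only p4 is emitted, which is the no-duplicates guarantee the section describes.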
Usage Examples
Example 1: Basic Matching
Match products between two catalogs with default settings:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP"
}
This returns the top 5 candidates per product and saves all of them, whether or not they meet the default 75% threshold.
Example 2: High-Confidence Matches Only
Find only strong matches (80%+ similarity):
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 3,
  "similarityThreshold": 80,
  "matchesOnly": true
}
This returns up to 3 matches per product, but only saves results with 80%+ similarity.
Example 3: Testing with Limited Products
Test matching on a small subset:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 5,
  "similarityThreshold": 75,
  "maxItems": 100
}
This processes only the first 100 products from Index A.
Example 4: Comprehensive Search
Find many potential matches per product:
{
  "kvStoreIdA": "MDhkhfJXV2O3Ir7GE",
  "kvStoreIdB": "aBcDeFgHiJkLmNoP",
  "topK": 20,
  "similarityThreshold": 60,
  "matchesOnly": false
}
This returns up to 20 results per product with a lower threshold (60%).
Understanding Similarity Scores
The actor uses cosine similarity between L2-normalized embeddings:
- 1.0: Perfect match (identical vectors)
- 0.9-1.0: Extremely similar (likely same product or very close variants)
- 0.8-0.9: Very similar (likely matching products with minor differences)
- 0.7-0.8: Similar (related products, same category/brand)
- 0.6-0.7: Somewhat similar (shared characteristics)
- < 0.6: Not very similar
Note: Optimal similarity thresholds can vary significantly depending on your dataset characteristics, product categories, and data quality. It's recommended to analyze a sample of results to determine the appropriate threshold for your specific use case.
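For readers unfamiliar with the metric, the following sketch shows why L2 normalization matters: once both vectors have unit length, their inner product equals their cosine similarity. It uses plain Python lists and hypothetical helper names, and it assumes non-zero vectors. The actor itself performs this search through FAISS rather than code like this.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Inner product of the L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Vectors pointing in the same direction score ~1.0 regardless of magnitude.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```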
Performance Monitoring
The actor logs detailed performance metrics:
Timing Breakdown:
- Index A load time
- Index B load time
- Vector extraction time
- Total matching time
- Average time per search
Throughput:
- Products processed per second
- Total runtime
Memory:
- Initial and final memory usage
- Memory delta
Output:
- Number of batches saved
- Total matches found
- Matches above threshold
Files
Core Actor Files
- src/main.py - Main actor with matching logic and migration recovery
- .actor/actor.json - Actor metadata
- .actor/input_schema.json - Input parameter schema
- .actor/output_schema.json - Output result schema
- .actor/dataset_schema.json - Dataset structure schema
- Dockerfile - Container definition
- requirements.txt - Python dependencies
Deployment
Deploy to Apify:
$ apify push
Or connect via Git repository in the Apify Console.
Troubleshooting
Error: "manifest.json not found or empty in KV store"
Cause: The specified KV store doesn't contain a chunked index (requires vectorizer v2.0+).
Solution:
- Verify the KV store ID is correct
- Ensure the product-matching-vectorizer (v2.0+) has completed successfully
- Check that the KV store contains manifest.json and chunk files
Memory Issues
Symptom: Out of memory errors during matching.
Solution:
- Each chunk is ~15 MB, so peak memory is ~30 MB (1 A chunk + 1 B chunk) plus accumulator
- If still hitting limits, ensure adequate memory allocation in Actor settings
- Use maxItems to limit the number of products processed
Related Actors
- Product Matching Vectorizer: Creates FAISS indexes from product datasets (required before using this actor)
Workflow Integration
This actor is typically used as the second step in a matching pipeline:
- Run product-matching-vectorizer on Dataset A → produces Index A (KV Store)
- Run product-matching-vectorizer on Dataset B → produces Index B (KV Store)
- Run product-vector-matcher with both KV store IDs → produces match results (Dataset)


