Product Matching Vectorizer
Pricing
Pay per usage
Builds a FAISS vector database from products in an Apify dataset using an ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search. After uploading your dataset to the vector database, use our E-commerce Product Matching Tool to find matching products.
Last modified
a day ago
Product Matching Vectorizer - Apify Actor
Builds a FAISS vector database from products in an Apify dataset using a fine-tuned ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search.
Overview
This actor:
- Loads products from an Apify dataset (using pagination)
- Extracts fields using flexible dot-notation mapping
- Generates 384-dimensional embeddings using a fine-tuned ONNX model
- Builds a FAISS index for similarity search
- Saves index + metadata to a named Key-Value Store
Key Features:
- Flexible mapping: Extract nested fields with dot notation (e.g., product.name.translated)
- Metadata options: Store full items or selective fields
- Migration recovery: Automatic checkpoint and resume on server migration
- Batch processing: Efficient vectorization in configurable batches
- Progress tracking: Real-time status updates with ETA
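As a rough illustration of the paginated loading and batch processing described above, the sketch below pulls items page by page and groups them into batches. It is not the actor's actual code: `iter_dataset_items`, `batched`, and `fake_fetch_page` are hypothetical names, and the in-memory list stands in for a real dataset. With the Apify Python client, a `fetch_page` callable could plausibly wrap `client.dataset(dataset_id).list_items(offset=..., limit=...).items`.

```python
from typing import Callable, Iterable, Iterator, List


def iter_dataset_items(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield every item by requesting pages until an empty page comes back."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield from page
        offset += len(page)


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an item stream into lists of at most batch_size items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Hypothetical stand-in for a real dataset: 2,500 fake products.
_fake_dataset = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset: int, limit: int) -> List[dict]:
    return _fake_dataset[offset : offset + limit]


batch_sizes = [len(b) for b in batched(iter_dataset_items(fake_fetch_page), 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Streaming items through a generator keeps only one page and one batch in memory at a time, which matches the note that memory usage scales with batch size.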
Required vs Optional Fields
The product matching model uses up to 5 fields for generating embeddings: 3 required and 2 optional.
Required Fields
These fields MUST be provided for every product:
- titlePath - Product name/title
- brandPath - Brand name
- categoryPath - Product category
Optional Fields
These fields improve matching quality when available:
- descriptionPath - Product description (highly recommended for all products)
- specificationsPath - Technical specifications (highly recommended for technical products)
When to use specifications:
- Electronics - Screen size, processor, RAM, storage, etc.
- Appliances - Dimensions, power, capacity, features
- Technical gear - Materials, measurements, technical details
- Fashion/Clothing - Usually not needed; size and color go in metadata, not embeddings
- Simple products - Usually not needed; most non-technical products don't need this
Important Notes:
- The model does NOT use other fields like price, SKU, color, size, etc. for similarity matching
- Additional fields can be stored in metadataMapping for retrieval, but won't affect matching
- Missing required fields will generate warnings but won't stop processing
- Set optional fields to null or omit them if not applicable to your products
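The "warn but keep processing" behavior for missing required fields could look roughly like the sketch below. This is an assumption-laden illustration, not the actor's implementation: `check_fields`, the logger name, and the rule that blank strings count as missing are all hypothetical.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("vectorizer-sketch")  # hypothetical logger name

# The three required embedding fields named in this README.
REQUIRED = ("title", "brand", "category")


def check_fields(product_id: str, fields: Dict[str, Optional[str]]) -> List[str]:
    """Return the missing required field names, logging a warning for each.

    Assumption: None, empty, and whitespace-only values all count as missing.
    """
    missing = [name for name in REQUIRED if not (fields.get(name) or "").strip()]
    for name in missing:
        logger.warning("Product %s is missing required field '%s'", product_id, name)
    return missing


missing = check_fields(
    "p_123",
    {"title": "Canvas Tote Bag", "brand": "", "category": "Bags & Totes"},
)
print(missing)  # ['brand']
```

Returning the list rather than raising lets the caller continue with the product, in line with the note that missing fields do not stop processing.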
Input Parameters
The actor accepts the following input parameters:
    {
      "datasetId": "bp0kO9SGUQckUnDJb",
      "idField": "product.token",
      "titlePath": "product.name.translated",
      "brandPath": "brand.name",
      "categoryPath": "product.details.taxonomy_type.group_name",
      "descriptionPath": "product.details.description.translated",
      "specificationsPath": null,
      "metadataMapping": {
        "title": "product.name.translated",
        "brand": "brand.name",
        "price": "product.options.{first}.retail_price_cents"
      },
      "kvStoreName": "customer-products-index",
      "maxItems": null,
      "batchSize": 1000
    }
Required Parameters
- datasetId (string): Apify dataset ID containing products to vectorize
- idField (string): Dot-notation path to product ID field
  - Example: "product.token", "id", "sku"
- titlePath (string): Path to product title field
- brandPath (string): Path to brand name field
- categoryPath (string): Path to category field
- kvStoreName (string): Name of Key-Value Store to save the index
Optional Parameters
- descriptionPath (string, optional): Path to product description field
  - Highly recommended - significantly improves matching quality
- specificationsPath (string, optional): Path to technical specifications
  - Recommended for technical products (electronics, appliances, etc.)
- metadataMapping (object, optional): Fields to store as metadata
  - If not specified: Full dataset items are stored (preserves all data)
  - If specified: Only mapped fields are stored (compact, optimized)
  - Can include any fields (price, SKU, images, etc.) for retrieval
- maxItems (integer, optional): Limit number of products (useful for testing)
- batchSize (integer, default: 1000): Products to encode per batch
  - Larger batches = faster but more memory
  - Range: 1-10,000
- debugMode (boolean, default: false): Enable verbose debug logging
  - When enabled: Shows detailed data structure information and extraction paths
  - When disabled (recommended): Cleaner production logs with better security
  - ⚠️ Warning: Debug mode may expose internal data structures in logs
Dot Notation Mapping
Basic Syntax
Extract nested fields using dot notation:
{"title": "product.name.translated","brand": "brand.name","category": "product.details.taxonomy_type.group_name"}
Given this dataset item:
    {
      "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "details": {"taxonomy_type": {"group_name": "Bags & Totes"}}
      },
      "brand": {"name": "EcoBrand"}
    }
Extracts:
{"title": "Canvas Tote Bag","brand": "EcoBrand","category": "Bags & Totes"}
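A minimal sketch of how such dot-notation extraction might work is shown below. `get_path` is a hypothetical helper written for illustration (the actor's own logic lives in src/mapping.py and may differ), and returning None for a missing key is an assumption.

```python
from typing import Any


def get_path(item: dict, path: str) -> Any:
    """Follow a dot-separated path through nested dicts.

    Assumption: a missing key anywhere along the path yields None.
    """
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


item = {
    "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "details": {"taxonomy_type": {"group_name": "Bags & Totes"}},
    },
    "brand": {"name": "EcoBrand"},
}
mapping = {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
}

extracted = {name: get_path(item, path) for name, path in mapping.items()}
print(extracted)
# {'title': 'Canvas Tote Bag', 'brand': 'EcoBrand', 'category': 'Bags & Totes'}
```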
Special Syntax: {first}
Use {first} to select the first key from a dictionary:
{"price": "product.options.{first}.retail_price_cents"}
Given:
    {
      "product": {
        "options": {
          "opt_abc123": {"retail_price_cents": 2499},
          "opt_def456": {"retail_price_cents": 3499}
        }
      }
    }
Extracts: 2499 (from first option)
Null Values
Set a field to null to explicitly omit it:
{"title": "product.name","specifications": null}
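The {first} selector and null handling described above could be sketched as follows. The helper names `get_path` and `apply_mapping` are hypothetical, and two behaviors are assumptions rather than documented facts: "first" is taken to mean the first key in the parsed order, and null-mapped fields are dropped from the result instead of being stored as empty values.

```python
from typing import Any, Dict, Optional


def get_path(item: Any, path: str) -> Any:
    """Resolve a dot-separated path, treating the segment '{first}' specially."""
    current = item
    for key in path.split("."):
        if not isinstance(current, dict) or not current:
            return None
        if key == "{first}":
            # Python dicts preserve insertion order, so this is the first key as parsed.
            key = next(iter(current))
        elif key not in current:
            return None
        current = current[key]
    return current


def apply_mapping(item: dict, mapping: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Extract mapped fields; entries whose path is None are skipped (assumption)."""
    return {
        name: get_path(item, path)
        for name, path in mapping.items()
        if path is not None
    }


item = {
    "product": {
        "options": {
            "opt_abc123": {"retail_price_cents": 2499},
            "opt_def456": {"retail_price_cents": 3499},
        }
    }
}
result = apply_mapping(
    item,
    {"price": "product.options.{first}.retail_price_cents", "specifications": None},
)
print(result)  # {'price': 2499}
```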
Embedding Model
Uses a fine-tuned sentence transformer model optimized for product matching:
- Base model: sentence-transformers/all-MiniLM-L6-v2
- Fine-tuned: On product matching task
- Format: ONNX for fast CPU inference
- Embedding dimension: 384
- Normalization: L2-normalized (cosine similarity via inner product)
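To see why an inner-product index gives cosine similarity for L2-normalized vectors, the small pure-Python check below compares the two quantities on toy 3-dimensional vectors. The vectors and helper names are illustrative only; real embeddings from this model have 384 dimensions.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Textbook cosine similarity: dot product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors equals the cosine of the originals.
inner = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(inner, cosine(a, b)))  # True
```

Normalizing once at indexing time means the search only needs a plain dot product per candidate.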
Embedding Format
Products are formatted before encoding:
title: {title} | brand: {brand} | category: {category} | desc: {description} | spec: {specifications}
Important: Price is NOT included in embeddings (it's metadata only).
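The template above might be applied along these lines. `format_product` is a hypothetical helper, and dropping the desc and spec segments when those optional fields are empty is an assumption; the actor may instead emit empty segments.

```python
from typing import Optional


def format_product(
    title: str,
    brand: str,
    category: str,
    description: Optional[str] = None,
    specifications: Optional[str] = None,
) -> str:
    """Build the text that is fed to the embedding model."""
    parts = [f"title: {title}", f"brand: {brand}", f"category: {category}"]
    # Assumption: optional segments are omitted when the field is empty.
    if description:
        parts.append(f"desc: {description}")
    if specifications:
        parts.append(f"spec: {specifications}")
    return " | ".join(parts)


text = format_product("Canvas Tote Bag", "EcoBrand", "Bags & Totes")
print(text)
# title: Canvas Tote Bag | brand: EcoBrand | category: Bags & Totes
```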
Output Format
The actor saves two files to the specified Key-Value Store:
1. index.faiss (Binary)
FAISS IndexFlatIP (Inner Product) containing normalized embeddings.
- Type: Inner Product index (cosine similarity for normalized vectors)
- Usage: Load with faiss.deserialize_index, passing the raw bytes as a NumPy uint8 array
2. metadata.json (JSON)
Complete metadata about the index:
    {
      "version": "1.0",
      "created_at": "2025-10-30T19:00:00Z",
      "total_products": 104321,
      "embedding_dim": 384,
      "model": "product-matcher-onnx",
      "embedding_mapping": {
        "title": "product.name.translated",
        "brand": "brand.name"
      },
      "metadata_mapping": {
        "title": "product.name.translated",
        "price": "product.options.{first}.retail_price_cents"
      },
      "ids": ["p_123", "p_456", ...],
      "metadata": [
        {"title": "Canvas Tote", "price": 2499},
        {"title": "Water Bottle", "price": 1999},
        ...
      ]
    }
Fields:
- version: Metadata schema version
- created_at: UTC timestamp
- total_products: Number of products in index
- embedding_dim: Vector dimension (384)
- model: Model identifier
- embedding_mapping: Mapping used for embeddings
- metadata_mapping: Mapping used for metadata (or null if full items)
- ids: Array of product IDs (same order as FAISS index)
- metadata: Array of product metadata (same order as FAISS index)
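Because ids and metadata are positional (row i of the FAISS index corresponds to ids[i] and metadata[i]), a consumer may want a quick consistency check before searching. The sketch below is one hypothetical way to do that; `validate_metadata` is not part of the actor, and it assumes total_products is meant to equal the length of both arrays.

```python
import json
from typing import List


def validate_metadata(metadata: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    total = metadata.get("total_products")
    for key in ("ids", "metadata"):
        values = metadata.get(key)
        if not isinstance(values, list):
            problems.append(f"'{key}' is missing or not a list")
        elif len(values) != total:
            problems.append(f"'{key}' has {len(values)} entries, expected {total}")
    if metadata.get("embedding_dim") != 384:
        problems.append("embedding_dim is not 384")
    return problems


# Tiny illustrative document in the shape described above.
sample = json.loads("""
{
  "version": "1.0",
  "total_products": 2,
  "embedding_dim": 384,
  "ids": ["p_123", "p_456"],
  "metadata": [{"title": "Canvas Tote"}, {"title": "Water Bottle"}]
}
""")
print(validate_metadata(sample))  # []
```

When the index is loaded as well, comparing its ntotal against the length of ids is a natural extra check.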
Usage Examples
Example 1: Minimal Configuration (Full Metadata)
Store full dataset items as metadata, required fields only:
    {
      "datasetId": "abc123",
      "idField": "product.token",
      "titlePath": "product.name.translated",
      "brandPath": "brand.name",
      "categoryPath": "category.name",
      "kvStoreName": "products-full"
    }
Result: metadata.json contains full dataset items (preserves all data).
Example 2: Complete Configuration with Descriptions
Include all fields for best matching quality:
    {
      "datasetId": "abc123",
      "idField": "product.token",
      "titlePath": "product.name.translated",
      "brandPath": "brand.name",
      "categoryPath": "product.details.taxonomy_type.group_name",
      "descriptionPath": "product.details.description.translated",
      "specificationsPath": "product.specifications",
      "metadataMapping": {
        "title": "product.name.translated",
        "brand": "brand.name",
        "category": "product.details.taxonomy_type.group_name",
        "price": "product.options.{first}.retail_price_cents",
        "image": "product.image_url"
      },
      "kvStoreName": "products-complete"
    }
Result: Best embedding quality + compact metadata with price and image.
Example 3: Products Without Specifications
For non-technical products (clothing, home goods, etc.) that don't have specifications:
    {
      "datasetId": "abc123",
      "idField": "product.token",
      "titlePath": "product.name.translated",
      "brandPath": "brand.name",
      "categoryPath": "product.category",
      "descriptionPath": "product.description",
      "specificationsPath": null,
      "kvStoreName": "fashion-products"
    }
Note: specificationsPath set to null (or omitted) since fashion products typically don't have technical specs.
Example 4: Testing with Limit
Test with 100 products (minimal setup):
    {
      "datasetId": "abc123",
      "idField": "id",
      "titlePath": "name",
      "brandPath": "brand",
      "categoryPath": "category",
      "kvStoreName": "test-index",
      "maxItems": 100,
      "batchSize": 50
    }
Using the Generated Index
Python Example
    from apify_client import ApifyClient
    import faiss
    import numpy as np

    # Initialize Apify client
    client = ApifyClient("YOUR_API_TOKEN")

    # Get KV store (the client expects a store ID or "username~store-name";
    # adjust this identifier to match your account)
    kv_store = client.key_value_store("customer-products-index")

    # Load index: FAISS deserializes from a uint8 NumPy array, not raw bytes
    index_bytes = kv_store.get_record("index.faiss")["value"]
    index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))

    # Load metadata (ids and metadata share the FAISS row order)
    metadata = kv_store.get_record("metadata.json")["value"]
    ids = metadata["ids"]
    product_metadata = metadata["metadata"]
    print(f"Loaded index with {index.ntotal} products")

    # Search example (assuming you have a query embedding)
    query_embedding = ...  # placeholder: 384-dim NumPy vector, L2-normalized
    k = 5  # Top 5 results
    similarities, indices = index.search(
        query_embedding.reshape(1, -1).astype("float32"),
        k,
    )

    # Get results
    for rank, (sim, idx) in enumerate(zip(similarities[0], indices[0])):
        if idx < 0:  # FAISS pads with -1 when fewer than k results exist
            continue
        product_id = ids[idx]
        meta = product_metadata[idx]
        print(f"{rank+1}. {meta['title']} (id: {product_id}, similarity: {sim:.3f})")
Migration Recovery
The actor automatically handles server migrations:
- State Persistence: Progress is saved on PERSIST_STATE events
- Batch Checkpoints: In-progress batches are saved before migration
- Auto-Resume: On restart, actor resumes from last checkpoint
- No Data Loss: All processed embeddings are preserved
State stored in default KV store:
- vectorizer-state: Progress tracking
- vectorizer-batch-checkpoint: In-progress batch data
These are automatically cleaned up on successful completion.
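The checkpoint-and-resume cycle can be illustrated with a plain dictionary standing in for the default Key-Value Store. Everything here is a simplified sketch: `ProgressTracker` and the `{"processed": n}` state shape are hypothetical, and only the key name vectorizer-state comes from this README. In a real actor, the Apify Python SDK's `Actor.get_value`, `Actor.set_value`, and a listener on the PERSIST_STATE event would take the place of the dictionary operations.

```python
from typing import Dict

STATE_KEY = "vectorizer-state"  # key name taken from this README


class ProgressTracker:
    """Tracks how many products are done and persists that count to a store."""

    def __init__(self, store: Dict[str, dict]):
        self.store = store
        saved = store.get(STATE_KEY) or {}
        # Resume from the saved count if a previous run left one behind.
        self.processed = saved.get("processed", 0)

    def persist(self) -> None:
        """Would be called from a PERSIST_STATE handler in a real actor."""
        self.store[STATE_KEY] = {"processed": self.processed}

    def clear(self) -> None:
        """Remove the checkpoint after successful completion."""
        self.store.pop(STATE_KEY, None)


store: Dict[str, dict] = {}  # stand-in for the default KV store

# First run: two batches of 1,000 complete, then a migration interrupts it.
first_run = ProgressTracker(store)
first_run.processed += 2000
first_run.persist()

# Restarted run: picks up where the first one stopped.
resumed = ProgressTracker(store)
print(resumed.processed)  # 2000

resumed.clear()
print(STATE_KEY in store)  # False
```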
Performance
The actor is optimized for efficient processing:
- Fast model loading and initialization
- Efficient batch vectorization
- Quick FAISS index building
- Memory usage scales with batch size
Optimization tips:
- Increase batchSize for faster processing (up to 10,000)
- Use selective metadataMapping to reduce memory usage
- For very large datasets (>1M products), consider chunking
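For a rough sense of scale, a flat FAISS index stores each vector as raw float32 values, so its size is about the number of products times 384 dimensions times 4 bytes. The helper below is a back-of-the-envelope estimate only; `flat_index_bytes` is a hypothetical name, and the figure excludes metadata, batch buffers, and model memory.

```python
def flat_index_bytes(num_products: int, dim: int = 384, bytes_per_float: int = 4) -> int:
    """Approximate vector storage for a flat float32 index."""
    return num_products * dim * bytes_per_float


# The 104,321-product example from the metadata section.
size = flat_index_bytes(104_321)
print(f"{size / 1024**2:.1f} MiB")  # 152.8 MiB

# One million products, the threshold mentioned for chunking.
print(f"{flat_index_bytes(1_000_000) / 1024**3:.2f} GiB")  # 1.43 GiB
```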
Files
Core Actor Files
- src/main.py - Main actor entry point with migration recovery
- src/vectorizer.py - ONNX vectorizer wrapper
- src/mapping.py - Dot-notation field extraction
- src/preprocessing.py - Text preprocessing utilities
Configuration
- .actor/actor.json - Actor metadata
- .actor/input_schema.json - Input parameter schema
- Dockerfile - Container definition
- requirements.txt - Python dependencies
Model Files
- models/product-matcher-onnx/ - ONNX model files
  - model.onnx - Optimized inference model
  - tokenizer.json - Tokenizer configuration
  - pooling_config.json - Pooling configuration
Deployment
- Configure input: Set dataset ID, mappings, and KV store name
- Run actor: Via Apify Console or API
- Monitor progress: Real-time status updates with ETA
- Retrieve index: Access from specified KV store
Deploy to Apify:
    $ apify push
Troubleshooting
Missing ID Field
Error: Missing or empty ID field at path: product.token
Solution: Check that idField path is correct and all items have IDs.
Empty Dataset
Warning: No items found in dataset
Solution: Verify dataset ID and that it contains items.
Invalid Mapping
Error: embeddingMapping cannot be empty
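Solution: Provide at least one field in embeddingMapping. This mapping appears to be built from the embedding path parameters (titlePath, brandPath, categoryPath, descriptionPath, specificationsPath), so make sure at least one of them is set.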
Memory Issues
Error: Out of memory during batch processing
Solution: Reduce batchSize (try 500 or 250).
Related Actors
- Product Matcher: Uses this index to find similar products
- Product Scraper: Collects products for indexing
Support
For issues or questions, please create an issue in the repository.
License
MIT
