Product Matching Vectorizer

Pricing

Pay per usage

Developed by

Tri⟁angle

Maintained by Apify

Builds a FAISS vector database from products in an Apify dataset using an ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search. After uploading your dataset to the vector database, use our E-commerce Product Matching Tool to find matching products.

Last modified

a day ago

Product Matching Vectorizer - Apify Actor

Builds a FAISS vector database from products in an Apify dataset using a fine-tuned ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search.

Overview

This actor:

  1. Loads products from an Apify dataset (using pagination)
  2. Extracts fields using flexible dot-notation mapping
  3. Generates 384-dimensional embeddings using a fine-tuned ONNX model
  4. Builds a FAISS index for similarity search
  5. Saves index + metadata to a named Key-Value Store

Key Features:

  • Flexible mapping: Extract nested fields with dot notation (e.g., product.name.translated)
  • Metadata options: Store full items or selective fields
  • Migration recovery: Automatic checkpoint and resume on server migration
  • Batch processing: Efficient vectorization in configurable batches
  • Progress tracking: Real-time status updates with ETA

Required vs Optional Fields

The product matching model builds embeddings from at most five fields: three required and two optional.

Required Fields

These paths must be set in the input. Items that lack a value for one of them are still processed, but each missing value generates a warning:

  • titlePath - Product name/title
  • brandPath - Brand name
  • categoryPath - Product category

Optional Fields

These fields improve matching quality when available:

  • descriptionPath - Product description (highly recommended for all products)
  • specificationsPath - Technical specifications (highly recommended for technical products)

When to use specifications:

  • Electronics - Screen size, processor, RAM, storage, etc.
  • Appliances - Dimensions, power, capacity, features
  • Technical gear - Materials, measurements, technical details

When to skip specifications:

  • Fashion/Clothing - Size/color go in metadata, not embeddings
  • Simple products - Most non-technical products don't need this

Important Notes:

  • The model does NOT use other fields like price, SKU, color, size, etc. for similarity matching
  • Additional fields can be stored in metadataMapping for retrieval, but won't affect matching
  • Missing required fields will generate warnings but won't stop processing
  • Set optional fields to null or omit them if not applicable to your products
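
The warn-but-continue behaviour for missing required fields could look roughly like the sketch below. The function name `missing_required` and the keys `title`, `brand`, and `category` are illustrative assumptions, not the actor's actual code.

```python
import logging

logger = logging.getLogger("vectorizer-sketch")

# The three fields this README lists as required for embeddings.
REQUIRED_FIELDS = ("title", "brand", "category")


def missing_required(fields: dict) -> list[str]:
    """Return the required fields that are absent or empty, warning about each."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    for name in missing:
        # Warn instead of raising, so the item is still processed.
        logger.warning("Missing required field: %s", name)
    return missing


extracted = {"title": "Canvas Tote Bag", "brand": "", "category": "Bags & Totes"}
print(missing_required(extracted))
```

An empty string counts as missing here, which is a design choice of this sketch rather than documented behaviour.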

Input Parameters

The actor accepts the following input parameters:

{
  "datasetId": "bp0kO9SGUQckUnDJb",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": null,
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "price": "product.options.{first}.retail_price_cents"
  },
  "kvStoreName": "customer-products-index",
  "maxItems": null,
  "batchSize": 1000
}

Required Parameters

  • datasetId (string): Apify dataset ID containing products to vectorize
  • idField (string): Dot-notation path to product ID field
    • Example: "product.token", "id", "sku"
  • titlePath (string): Path to product title field
  • brandPath (string): Path to brand name field
  • categoryPath (string): Path to category field
  • kvStoreName (string): Name of Key-Value Store to save the index

Optional Parameters

  • descriptionPath (string, optional): Path to product description field
    • Highly recommended - significantly improves matching quality
  • specificationsPath (string, optional): Path to technical specifications
    • Recommended for technical products (electronics, appliances, etc.)
  • metadataMapping (object, optional): Fields to store as metadata
    • If not specified: Full dataset items are stored (preserves all data)
    • If specified: Only mapped fields are stored (compact, optimized)
    • Can include any fields (price, SKU, images, etc.) for retrieval
  • maxItems (integer, optional): Limit number of products (useful for testing)
  • batchSize (integer, default: 1000): Products to encode per batch
    • Larger batches = faster but more memory
    • Range: 1-10,000
  • debugMode (boolean, default: false): Enable verbose debug logging
    • When enabled: Shows detailed data structure information and extraction paths
    • When disabled (recommended): Cleaner production logs with better security
    • ⚠️ Warning: Debug mode may expose internal data structures in logs
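
As a rough illustration of the parameter rules above, the following sketch checks the required parameters, applies the documented `batchSize` default and range, and splits a list of items into batches. The helper names `validate_input` and `batches` are hypothetical; the actor's real validation is defined by its input schema and may differ.

```python
from typing import Iterator

# Required parameters as listed in this README.
REQUIRED_PARAMS = (
    "datasetId", "idField", "titlePath",
    "brandPath", "categoryPath", "kvStoreName",
)


def validate_input(actor_input: dict) -> int:
    """Check required parameters and return the effective batch size."""
    missing = [p for p in REQUIRED_PARAMS if not actor_input.get(p)]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    batch_size = actor_input.get("batchSize")
    if batch_size is None:
        batch_size = 1000  # documented default
    if not 1 <= batch_size <= 10_000:  # documented range
        raise ValueError("batchSize must be between 1 and 10,000")
    return batch_size


def batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


example_input = {
    "datasetId": "abc123",
    "idField": "id",
    "titlePath": "name",
    "brandPath": "brand",
    "categoryPath": "category",
    "kvStoreName": "test-index",
    "batchSize": 50,
}
size = validate_input(example_input)
print([len(b) for b in batches(list(range(120)), size)])
```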

Dot Notation Mapping

Basic Syntax

Extract nested fields using dot notation:

{
  "title": "product.name.translated",
  "brand": "brand.name",
  "category": "product.details.taxonomy_type.group_name"
}

Given this dataset item:

{
  "product": {
    "name": {"translated": "Canvas Tote Bag"},
    "details": {
      "taxonomy_type": {"group_name": "Bags & Totes"}
    }
  },
  "brand": {"name": "EcoBrand"}
}

Extracts:

{
  "title": "Canvas Tote Bag",
  "brand": "EcoBrand",
  "category": "Bags & Totes"
}

Special Syntax: {first}

Use {first} to step into the value stored under the first key of a dictionary, which is useful when the keys are unpredictable IDs:

{
  "price": "product.options.{first}.retail_price_cents"
}

Given:

{
  "product": {
    "options": {
      "opt_abc123": {"retail_price_cents": 2499},
      "opt_def456": {"retail_price_cents": 3499}
    }
  }
}

Extracts: 2499 (from first option)
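
A minimal sketch of how such a path could be resolved is shown below. The function `extract_path` is hypothetical and is not the code in src/mapping.py; it only handles nested dictionaries and assumes that an unresolvable path yields None.

```python
from typing import Any, Optional


def extract_path(item: Any, path: Optional[str]) -> Any:
    """Follow a dot-notation path through nested dicts; return None if it fails."""
    if path is None:  # a null path means "omit this field"
        return None
    current = item
    for segment in path.split("."):
        if not isinstance(current, dict) or not current:
            return None
        if segment == "{first}":
            # Dicts keep insertion order, so this is the first key as parsed.
            current = next(iter(current.values()))
        elif segment in current:
            current = current[segment]
        else:
            return None
    return current


item = {
    "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "options": {
            "opt_abc123": {"retail_price_cents": 2499},
            "opt_def456": {"retail_price_cents": 3499},
        },
    },
    "brand": {"name": "EcoBrand"},
}
print(extract_path(item, "product.options.{first}.retail_price_cents"))
```

Because {first} depends on key order, the result is only as stable as the order in which the source data lists its keys.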

Null Values

Set a field to null to explicitly omit it:

{
  "title": "product.name",
  "specifications": null
}

Embedding Model

Uses a fine-tuned sentence transformer model optimized for product matching:

  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Fine-tuned: On product matching task
  • Format: ONNX for fast CPU inference
  • Embedding dimension: 384
  • Normalization: L2-normalized (cosine similarity via inner product)
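
The normalization point can be checked with a few lines of plain Python: once two vectors are scaled to unit length, their inner product equals their cosine similarity. The toy 3-dimensional vectors below are made up for illustration; the real embeddings have 384 dimensions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
inner = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(inner, cosine(a, b)))
```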

Embedding Format

Products are formatted before encoding:

title: {title} | brand: {brand} | category: {category} | desc: {description} | spec: {specifications}

Important: Price is NOT included in embeddings (it's metadata only).
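
The template above could be produced by a helper along these lines. `format_product_text` is a hypothetical name, and the sketch assumes that missing optional fields are simply left out; the actor may instead emit empty segments.

```python
from typing import Optional


def format_product_text(
    title: str,
    brand: str,
    category: str,
    description: Optional[str] = None,
    specifications: Optional[str] = None,
) -> str:
    """Join the five embedding fields into the 'label: value | ...' template."""
    parts = [
        ("title", title),
        ("brand", brand),
        ("category", category),
        ("desc", description),
        ("spec", specifications),
    ]
    # Assumption: empty or missing values are skipped, not emitted as blanks.
    return " | ".join(f"{label}: {value}" for label, value in parts if value)


print(format_product_text("Canvas Tote Bag", "EcoBrand", "Bags & Totes"))
```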

Output Format

The actor saves two files to the specified Key-Value Store:

1. index.faiss (Binary)

FAISS IndexFlatIP (Inner Product) containing normalized embeddings.

  • Type: Inner Product index (cosine similarity for normalized vectors)
  • Usage: Load with faiss.deserialize_index(np.frombuffer(data, dtype=np.uint8)), where data holds the record's raw bytes
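
Conceptually, a flat inner-product index scores the query against every stored vector and returns the top k. The pure-Python function below is a stand-in that illustrates that computation; it is not the FAISS API, and `search_flat_ip` is a hypothetical name.

```python
def search_flat_ip(
    vectors: list[list[float]], query: list[float], k: int
) -> list[tuple[int, float]]:
    """Exhaustive inner-product search: score all vectors, keep the k best."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy unit-length vectors; real embeddings have 384 dimensions.
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
for position, score in search_flat_ip(stored, [1.0, 0.0], k=2):
    print(position, round(score, 3))
```

The returned positions are row numbers in the index, which is why the ids and metadata arrays in metadata.json are kept in the same order.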

2. metadata.json (JSON)

Complete metadata about the index:

{
  "version": "1.0",
  "created_at": "2025-10-30T19:00:00Z",
  "total_products": 104321,
  "embedding_dim": 384,
  "model": "product-matcher-onnx",
  "embedding_mapping": {
    "title": "product.name.translated",
    "brand": "brand.name"
  },
  "metadata_mapping": {
    "title": "product.name.translated",
    "price": "product.options.{first}.retail_price_cents"
  },
  "ids": ["p_123", "p_456", ...],
  "metadata": [
    {"title": "Canvas Tote", "price": 2499},
    {"title": "Water Bottle", "price": 1999},
    ...
  ]
}

Fields:

  • version: Metadata schema version
  • created_at: UTC timestamp
  • total_products: Number of products in index
  • embedding_dim: Vector dimension (384)
  • model: Model identifier
  • embedding_mapping: Mapping used for embeddings
  • metadata_mapping: Mapping used for metadata (or null if full items)
  • ids: Array of product IDs (same order as FAISS index)
  • metadata: Array of product metadata (same order as FAISS index)
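
Because ids and metadata share the order of the FAISS index, a row number returned by a search maps directly onto both arrays. The sketch below uses a shortened, made-up metadata document; `resolve_hit` and `build_lookup` are hypothetical helpers.

```python
import json

# Shortened stand-in for metadata.json with two products.
raw = """
{
  "ids": ["p_123", "p_456"],
  "metadata": [
    {"title": "Canvas Tote", "price": 2499},
    {"title": "Water Bottle", "price": 1999}
  ]
}
"""
doc = json.loads(raw)


def resolve_hit(doc: dict, position: int) -> tuple[str, dict]:
    """Map an index row number to its product ID and metadata record."""
    return doc["ids"][position], doc["metadata"][position]


def build_lookup(doc: dict) -> dict[str, dict]:
    """Build an ID -> metadata dictionary for lookups by product ID."""
    ids, records = doc["ids"], doc["metadata"]
    if len(ids) != len(records):
        raise ValueError("ids and metadata must have the same length")
    return dict(zip(ids, records))


print(resolve_hit(doc, 1))
print(build_lookup(doc)["p_123"]["price"])
```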

Usage Examples

Example 1: Minimal Configuration (Full Metadata)

Store full dataset items as metadata, required fields only:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "category.name",
  "kvStoreName": "products-full"
}

Result: metadata.json contains full dataset items (preserves all data).

Example 2: Complete Configuration with Descriptions

Include all fields for best matching quality:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": "product.specifications",
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
    "price": "product.options.{first}.retail_price_cents",
    "image": "product.image_url"
  },
  "kvStoreName": "products-complete"
}

Result: Best embedding quality + compact metadata with price and image.

Example 3: Products Without Specifications

For non-technical products (clothing, home goods, etc.) that don't have specifications:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.category",
  "descriptionPath": "product.description",
  "specificationsPath": null,
  "kvStoreName": "fashion-products"
}

Note: specificationsPath is set to null here (it can also be omitted) because fashion products typically don't have technical specs.

Example 4: Testing with Limit

Test with 100 products (minimal setup):

{
  "datasetId": "abc123",
  "idField": "id",
  "titlePath": "name",
  "brandPath": "brand",
  "categoryPath": "category",
  "kvStoreName": "test-index",
  "maxItems": 100,
  "batchSize": 50
}

Using the Generated Index

Python Example

from apify_client import ApifyClient
import faiss
import numpy as np

# Initialize Apify client
client = ApifyClient("YOUR_API_TOKEN")

# Resolve the named KV store to its ID, then open it
store_info = client.key_value_stores().get_or_create(name="customer-products-index")
kv_store = client.key_value_store(store_info["id"])

# Load index
index_bytes = kv_store.get_record("index.faiss")["value"]
index = faiss.deserialize_index(np.frombuffer(index_bytes, dtype=np.uint8))

# Load metadata
metadata = kv_store.get_record("metadata.json")["value"]
ids = metadata["ids"]
product_metadata = metadata["metadata"]
print(f"Loaded index with {index.ntotal} products")

# Search example (assuming you have a query embedding)
query_embedding = ...  # 384-dim vector, L2-normalized
k = 5  # Top 5 results
similarities, indices = index.search(
    query_embedding.reshape(1, -1).astype("float32"),
    k,
)

# Get results (FAISS pads with -1 when the index holds fewer than k vectors)
for rank, (sim, idx) in enumerate(zip(similarities[0], indices[0]), start=1):
    if idx < 0:
        break
    product_id = ids[idx]
    meta = product_metadata[idx]
    print(f"{rank}. {product_id}: {meta['title']} (similarity: {sim:.3f})")

Migration Recovery

The actor automatically handles server migrations:

  1. State Persistence: Progress is saved on PERSIST_STATE events
  2. Batch Checkpoints: In-progress batches are saved before migration
  3. Auto-Resume: On restart, actor resumes from last checkpoint
  4. No Data Loss: All processed embeddings are preserved

State stored in default KV store:

  • vectorizer-state: Progress tracking
  • vectorizer-batch-checkpoint: In-progress batch data

These are automatically cleaned up on successful completion.
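
The checkpoint-and-resume idea can be sketched without the Apify SDK. In the example below a plain dict stands in for the default Key-Value Store, state is saved after every batch rather than on PERSIST_STATE events, and embeddings are kept inside the state for brevity; all of these are simplifying assumptions, and `run_with_checkpoints` is a hypothetical function.

```python
from typing import Callable, Optional

STATE_KEY = "vectorizer-state"  # key name taken from this README


def run_with_checkpoints(
    items: list,
    batch_size: int,
    store: dict,
    encode: Callable[[list], list],
    stop_after: Optional[int] = None,
) -> Optional[list]:
    """Encode items in batches, saving progress to `store` after each batch.

    `stop_after` simulates a migration by returning early after that many
    batches, leaving the saved state in place for the next run.
    """
    state = store.get(STATE_KEY, {"next_offset": 0, "embeddings": []})
    batches_done = 0
    while state["next_offset"] < len(items):
        if stop_after is not None and batches_done >= stop_after:
            return None  # interrupted: state stays in the store
        start = state["next_offset"]
        batch = items[start:start + batch_size]
        state = {
            "next_offset": start + len(batch),
            "embeddings": state["embeddings"] + encode(batch),
        }
        store[STATE_KEY] = state  # checkpoint
        batches_done += 1
    store.pop(STATE_KEY, None)  # clean up on successful completion
    return state["embeddings"]


def fake_encode(batch: list) -> list:
    """Placeholder for the embedding model: doubles each number."""
    return [x * 2 for x in batch]


kv = {}
run_with_checkpoints(list(range(10)), 3, kv, fake_encode, stop_after=2)
print(kv[STATE_KEY]["next_offset"])  # progress survived the interruption
print(run_with_checkpoints(list(range(10)), 3, kv, fake_encode))
```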

Performance

The actor is optimized for efficient processing:

  • Fast model loading and initialization
  • Efficient batch vectorization
  • Quick FAISS index building
  • Memory usage scales with batch size

Optimization tips:

  • Increase batchSize for faster processing (up to 10,000)
  • Use selective metadataMapping to reduce memory usage
  • For very large datasets (>1M products), consider chunking
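
For sizing, a flat index stores every vector as raw float32 values, so its vector storage is roughly products × 384 × 4 bytes. The sketch below only estimates that part; metadata.json and per-batch working memory come on top, and `flat_index_bytes` is a hypothetical helper.

```python
def flat_index_bytes(num_products: int, dim: int = 384) -> int:
    """Raw vector storage of a flat float32 index: 4 bytes per dimension."""
    return num_products * dim * 4


# Product count taken from the metadata.json example in this README.
size = flat_index_bytes(104_321)
print(f"{size / 1024**2:.1f} MiB")
```

For the 104,321 products in the metadata example this comes to about 153 MiB, and a million products would need roughly ten times as much.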

Files

Core Actor Files

  • src/main.py - Main actor entry point with migration recovery
  • src/vectorizer.py - ONNX vectorizer wrapper
  • src/mapping.py - Dot-notation field extraction
  • src/preprocessing.py - Text preprocessing utilities

Configuration

  • .actor/actor.json - Actor metadata
  • .actor/input_schema.json - Input parameter schema
  • Dockerfile - Container definition
  • requirements.txt - Python dependencies

Model Files

  • models/product-matcher-onnx/ - ONNX model files
    • model.onnx - Optimized inference model
    • tokenizer.json - Tokenizer configuration
    • pooling_config.json - Pooling configuration

Deployment

  1. Configure input: Set dataset ID, mappings, and KV store name
  2. Run actor: Via Apify Console or API
  3. Monitor progress: Real-time status updates with ETA
  4. Retrieve index: Access from specified KV store

Deploy to Apify:

$ apify push

Troubleshooting

Missing ID Field

Error: Missing or empty ID field at path: product.token

Solution: Check that idField path is correct and all items have IDs.

Empty Dataset

Warning: No items found in dataset

Solution: Verify dataset ID and that it contains items.

Invalid Mapping

Error: embeddingMapping cannot be empty

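Solution: Make sure the path parameters used for embeddings are set (at minimum titlePath, brandPath, and categoryPath); together these appear to form the embedding mapping that the error refers to.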

Memory Issues

Error: Out of memory during batch processing

Solution: Reduce batchSize (try 500 or 250).

Related Actors

  • Product Matcher: Uses this index to find similar products
  • Product Scraper: Collects products for indexing

Support

For issues or questions, please create an issue in the repository.

License

MIT