Product Matching Vectorizer

Pricing

Pay per usage

Builds a FAISS vector database from products in an Apify dataset using an ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search. After uploading your dataset to the vector database, use our E-commerce Product Matching Tool to find matching products.

Developer

Tri⟁angle

Maintained by Apify

Last modified: 24 days ago

Product Matching Vectorizer - Apify Actor

Builds a FAISS vector database from products in an Apify dataset using a fine-tuned ONNX embedding model. The resulting index is saved to a Key-Value Store for fast similarity search.

Overview

This actor:

Loads products from an Apify dataset (using pagination)
Extracts fields using flexible dot-notation mapping
Generates 384-dimensional embeddings using a fine-tuned ONNX model
Builds a FAISS index for similarity search
Saves index + metadata to a named Key-Value Store

Key Features:

Flexible mapping: Extract nested fields with dot notation (e.g., product.name.translated)
Metadata options: Store full items or selective fields
Migration recovery: Automatic checkpoint and resume on server migration
Batch processing: Efficient vectorization in configurable batches
Progress tracking: Real-time status updates with ETA

Required vs Optional Fields

The product matching model builds embeddings from at most five fields: three required and two optional.

Required Fields

These fields MUST be provided for every product:

titlePath - Product name/title
brandPath - Brand name
categoryPath - Product category

Optional Fields

These fields improve matching quality when available:

descriptionPath - Product description (highly recommended for all products)
specificationsPath - Technical specifications (highly recommended for technical products)

When to use specifications:

Electronics - Screen size, processor, RAM, storage, etc.
Appliances - Dimensions, power, capacity, features
Technical gear - Materials, measurements, technical details

When to skip specifications:

Fashion/Clothing - Size/color go in metadata, not embeddings
Simple products - Most non-technical products don't need this

Important Notes:

The model does NOT use other fields like price, SKU, color, size, etc. for similarity matching
Additional fields can be stored in metadataMapping for retrieval, but won't affect matching
If a required field is missing for an individual product, the actor logs a warning and continues processing
Set optional fields to null or omit them if not applicable to your products

Input Parameters

The actor accepts the following input parameters:

{
  "datasetId": "bp0kO9SGUQckUnDJb",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": null,
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "price": "product.options.{first}.retail_price_cents"
  },
  "kvStoreName": "customer-products-index",
  "maxItems": null,
  "batchSize": 1000
}

Required Parameters

datasetId (string): Apify dataset ID containing products to vectorize
idField (string): Dot-notation path to product ID field
- Example: "product.token", "id", "sku"
titlePath (string): Path to product title field
brandPath (string): Path to brand name field
categoryPath (string): Path to category field
kvStoreName (string): Name of Key-Value Store to save the index

Optional Parameters

descriptionPath (string, optional): Path to product description field
- Highly recommended - significantly improves matching quality
specificationsPath (string, optional): Path to technical specifications
- Recommended for technical products (electronics, appliances, etc.)
metadataMapping (object, optional): Fields to store as metadata
- If not specified: Full dataset items are stored (preserves all data)
- If specified: Only mapped fields are stored (compact, optimized)
- Can include any fields (price, SKU, images, etc.) for retrieval
maxItems (integer, optional): Limit number of products (useful for testing)
batchSize (integer, default: 1000): Products to encode per batch
- Larger batches = faster but more memory
- Range: 1-10,000
debugMode (boolean, default: false): Enable verbose debug logging
- When enabled: Shows detailed data structure information and extraction paths
- When disabled (recommended): Cleaner production logs with better security
- ⚠️ Warning: Debug mode may expose internal data structures in logs
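The parameter rules above can be checked before starting a run. The following is a minimal sketch, not the actor's own validation: the function name `validate_input` and the error wording are hypothetical, and only the required parameters and the `batchSize` range listed above are checked.

```python
from typing import List

# Required parameters as listed in this README.
REQUIRED_PARAMS = (
    "datasetId", "idField", "titlePath",
    "brandPath", "categoryPath", "kvStoreName",
)


def validate_input(actor_input: dict) -> List[str]:
    """Return a list of problems found in the input (empty if none).

    Hypothetical helper: the actor's real validation may differ.
    """
    errors = [
        f"missing required parameter: {name}"
        for name in REQUIRED_PARAMS
        if not actor_input.get(name)
    ]
    # batchSize defaults to 1000 and must stay within 1-10,000.
    batch_size = actor_input.get("batchSize", 1000)
    if not isinstance(batch_size, int) or not 1 <= batch_size <= 10_000:
        errors.append("batchSize must be an integer between 1 and 10000")
    return errors


good_input = {
    "datasetId": "abc123",
    "idField": "id",
    "titlePath": "name",
    "brandPath": "brand",
    "categoryPath": "category",
    "kvStoreName": "test-index",
    "batchSize": 50,
}
bad_input = {"datasetId": "abc123", "batchSize": 20_000}

print(validate_input(good_input))
print(validate_input(bad_input))
```

Running a check like this locally catches a missing path parameter before any compute is spent on a run.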

Dot Notation Mapping

Basic Syntax

Extract nested fields using dot notation:

{
  "title": "product.name.translated",
  "brand": "brand.name",
  "category": "product.details.taxonomy_type.group_name"
}

Given this dataset item:

{
  "product": {
    "name": {"translated": "Canvas Tote Bag"},
    "details": {
      "taxonomy_type": {"group_name": "Bags & Totes"}
    }
  },
  "brand": {"name": "EcoBrand"}
}

Extracts:

{
  "title": "Canvas Tote Bag",
  "brand": "EcoBrand",
  "category": "Bags & Totes"
}
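To make the lookup concrete, here is a minimal sketch of dot-notation extraction over the item above. The helper `extract_path` is hypothetical and is not the code in src/mapping.py; returning None for a missing step is an assumption.

```python
from typing import Any, Optional


def extract_path(item: dict, path: Optional[str]) -> Any:
    """Follow a dot-notation path through nested dicts.

    Returns None when the path is None or any step is missing
    (assumed behaviour, the actor may handle this differently).
    """
    if path is None:
        return None
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


item = {
    "product": {
        "name": {"translated": "Canvas Tote Bag"},
        "details": {"taxonomy_type": {"group_name": "Bags & Totes"}},
    },
    "brand": {"name": "EcoBrand"},
}
mapping = {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
}

extracted = {field: extract_path(item, path) for field, path in mapping.items()}
print(extracted)
```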

Special Syntax: `{first}`

Use {first} to select the first key from a dictionary:

{
  "price": "product.options.{first}.retail_price_cents"
}

Given:

{
  "product": {
    "options": {
      "opt_abc123": {"retail_price_cents": 2499},
      "opt_def456": {"retail_price_cents": 3499}
    }
  }
}

Extracts: 2499 (from first option)
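A sketch of how a `{first}` step could be resolved, assuming "first" means the first key in document order (Python dicts preserve insertion order, so a parsed JSON object keeps the order of the source). The helper name `extract_with_first` is hypothetical.

```python
from typing import Any


def extract_with_first(item: dict, path: str) -> Any:
    """Resolve a dot-notation path where "{first}" selects the
    value under the first key of the current dictionary."""
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or not current:
            return None  # nothing to descend into
        if key == "{first}":
            current = next(iter(current.values()))
        elif key in current:
            current = current[key]
        else:
            return None
    return current


item = {
    "product": {
        "options": {
            "opt_abc123": {"retail_price_cents": 2499},
            "opt_def456": {"retail_price_cents": 3499},
        }
    }
}

price = extract_with_first(item, "product.options.{first}.retail_price_cents")
print(price)
```

This is useful when option keys are opaque identifiers that differ per product, so a fixed key cannot be written into the mapping.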

Null Values

Set a field to null to explicitly omit it:

{
  "title": "product.name",
  "specifications": null
}

Embedding Model

Uses a fine-tuned sentence transformer model optimized for product matching:

Base model: sentence-transformers/all-MiniLM-L6-v2
Fine-tuned: On product matching task
Format: ONNX for fast CPU inference
Embedding dimension: 384
Normalization: L2-normalized (cosine similarity via inner product)
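The last point is worth a short illustration: once vectors are scaled to unit length, their inner product equals their cosine similarity, which is why an inner-product index can serve cosine search. The sketch below uses plain Python lists and toy 3-dimensional vectors instead of real 384-dimensional embeddings.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

inner_of_normalized = dot(l2_normalize(a), l2_normalize(b))
print(inner_of_normalized, cosine(a, b))
```

Query embeddings need the same normalization, otherwise the scores returned by the index are no longer cosine similarities.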

Embedding Format

Products are formatted before encoding:

title: {title} | brand: {brand} | category: {category} | desc: {description} | spec: {specifications}

Important: Price is NOT included in embeddings (it's metadata only).
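The template above could be filled in roughly as follows. This is a sketch under one stated assumption: segments for missing optional fields are simply left out. The README does not say how the actor handles them, and `format_product` is a hypothetical name.

```python
from typing import Optional


def format_product(
    title: str,
    brand: str,
    category: str,
    description: Optional[str] = None,
    specifications: Optional[str] = None,
) -> str:
    """Build the text that is passed to the embedding model.

    Assumption: empty or missing fields are skipped rather than
    emitted as empty segments.
    """
    parts = [
        ("title", title),
        ("brand", brand),
        ("category", category),
        ("desc", description),
        ("spec", specifications),
    ]
    return " | ".join(f"{label}: {value}" for label, value in parts if value)


text = format_product(
    "Canvas Tote Bag",
    "EcoBrand",
    "Bags & Totes",
    description="Reusable cotton tote",
)
print(text)
```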

Output Format

The actor saves two files to the specified Key-Value Store:

1. `index.faiss` (Binary)

FAISS IndexFlatIP (Inner Product) containing normalized embeddings.

Type: Inner Product index (cosine similarity for normalized vectors)
Usage: Load with faiss.deserialize_index, passing the file bytes as a NumPy uint8 array

2. `metadata.json` (JSON)

Complete metadata about the index:

{
  "version": "1.0",
  "created_at": "2025-10-30T19:00:00Z",
  "total_products": 104321,
  "embedding_dim": 384,
  "model": "product-matcher-onnx",
  "embedding_mapping": {
    "title": "product.name.translated",
    "brand": "brand.name"
  },
  "metadata_mapping": {
    "title": "product.name.translated",
    "price": "product.options.{first}.retail_price_cents"
  },
  "ids": ["p_123", "p_456", ...],
  "metadata": [
    {"title": "Canvas Tote", "price": 2499},
    {"title": "Water Bottle", "price": 1999},
    ...
  ]
}

Fields:

version: Metadata schema version
created_at: UTC timestamp
total_products: Number of products in index
embedding_dim: Vector dimension (384)
model: Model identifier
embedding_mapping: Mapping used for embeddings
metadata_mapping: Mapping used for metadata (or null if full items)
ids: Array of product IDs (same order as FAISS index)
metadata: Array of product metadata (same order as FAISS index)
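Because ids and metadata are aligned with the FAISS index by position, a quick consistency check after downloading metadata.json can catch a truncated or mismatched file. A minimal sketch, where `check_metadata` is a hypothetical helper and the sample document is made up for illustration:

```python
from typing import List


def check_metadata(meta: dict) -> List[str]:
    """Return a list of inconsistencies found in a metadata.json document."""
    problems = []
    ids = meta.get("ids", [])
    records = meta.get("metadata", [])
    if meta.get("total_products") != len(ids):
        problems.append("total_products does not match number of ids")
    if len(records) != len(ids):
        problems.append("ids and metadata have different lengths")
    if len(set(ids)) != len(ids):
        problems.append("duplicate product ids")
    return problems


# Illustrative sample, not real actor output.
sample = {
    "version": "1.0",
    "total_products": 2,
    "embedding_dim": 384,
    "ids": ["p_123", "p_456"],
    "metadata": [
        {"title": "Canvas Tote", "price": 2499},
        {"title": "Water Bottle", "price": 1999},
    ],
}

print(check_metadata(sample))

# Row i of the FAISS index corresponds to ids[i] and metadata[i].
row = 1
print(sample["ids"][row], sample["metadata"][row]["title"])
```

The same comparison can be extended to the loaded index by checking that its vector count equals total_products.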

Usage Examples

Example 1: Minimal Configuration (Full Metadata)

Store full dataset items as metadata, required fields only:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "category.name",
  "kvStoreName": "products-full"
}

Result: metadata.json contains full dataset items (preserves all data).

Example 2: Complete Configuration with Descriptions

Include all fields for best matching quality:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.details.taxonomy_type.group_name",
  "descriptionPath": "product.details.description.translated",
  "specificationsPath": "product.specifications",
  "metadataMapping": {
    "title": "product.name.translated",
    "brand": "brand.name",
    "category": "product.details.taxonomy_type.group_name",
    "price": "product.options.{first}.retail_price_cents",
    "image": "product.image_url"
  },
  "kvStoreName": "products-complete"
}

Result: Best embedding quality + compact metadata with price and image.

Example 3: Products Without Specifications

For non-technical products (clothing, home goods, etc.) that don't have specifications:

{
  "datasetId": "abc123",
  "idField": "product.token",
  "titlePath": "product.name.translated",
  "brandPath": "brand.name",
  "categoryPath": "product.category",
  "descriptionPath": "product.description",
  "specificationsPath": null,
  "kvStoreName": "fashion-products"
}

Note: specificationsPath set to null (or omitted) since fashion products typically don't have technical specs.

Example 4: Testing with Limit

Test with 100 products (minimal setup):

{
  "datasetId": "abc123",
  "idField": "id",
  "titlePath": "name",
  "brandPath": "brand",
  "categoryPath": "category",
  "kvStoreName": "test-index",
  "maxItems": 100,
  "batchSize": 50
}

Using the Generated Index

Python Example

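from apify_client import ApifyClient
import faiss
import numpy as np

# Initialize Apify client
client = ApifyClient("YOUR_API_TOKEN")

# Get KV store. The Apify API addresses stores by ID or "username~store-name";
# substitute one of those if the bare name does not resolve for your account.
kv_store = client.key_value_store("customer-products-index")

# Load index: the record value holds the raw bytes of index.faiss
index_record = kv_store.get_record("index.faiss")
if index_record is None:
    raise RuntimeError("index.faiss not found in the Key-Value Store")
index = faiss.deserialize_index(np.frombuffer(index_record["value"], dtype=np.uint8))

# Load metadata (JSON records are returned already parsed)
metadata_record = kv_store.get_record("metadata.json")
if metadata_record is None:
    raise RuntimeError("metadata.json not found in the Key-Value Store")
metadata = metadata_record["value"]
ids = metadata["ids"]
product_metadata = metadata["metadata"]

print(f"Loaded index with {index.ntotal} products")

# Search example (assuming you have a query embedding)
query_embedding = ...  # placeholder: 384-dim NumPy vector, L2-normalized
k = 5  # Top 5 results

similarities, indices = index.search(
    query_embedding.reshape(1, -1).astype('float32'),
    k
)

# Get results (FAISS returns index -1 when fewer than k vectors are available)
for rank, (sim, idx) in enumerate(zip(similarities[0], indices[0])):
    if idx < 0:
        continue
    product_id = ids[idx]
    meta = product_metadata[idx]
    # 'title' is present only if it was stored in the metadata
    print(f"{rank+1}. {product_id}: {meta.get('title')} (similarity: {sim:.3f})")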

Migration Recovery

The actor automatically handles server migrations:

State Persistence: Progress is saved on PERSIST_STATE events
Batch Checkpoints: In-progress batches are saved before migration
Auto-Resume: On restart, actor resumes from last checkpoint
No Data Loss: All processed embeddings are preserved

State stored in default KV store:

vectorizer-state: Progress tracking
vectorizer-batch-checkpoint: In-progress batch data

These are automatically cleaned up on successful completion.
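The resume logic can be pictured with a small simulation. This is only a sketch of the checkpoint-and-resume idea: a plain dict stands in for the default KV store, the `{"processed": n}` state layout is hypothetical, `encode` is a stand-in for the embedding step, and the `output` list stands in for embeddings that the real actor persists.

```python
from typing import Dict, List, Optional

STATE_KEY = "vectorizer-state"


def encode(text: str) -> str:
    """Stand-in for the real embedding step."""
    return text.upper()


def run(
    items: List[str],
    store: Dict[str, dict],
    output: List[str],
    batch_size: int,
    stop_after_batches: Optional[int] = None,
) -> bool:
    """Process items in batches, saving progress after each batch.

    Returns False if interrupted (simulated migration), True when done.
    """
    processed = store.get(STATE_KEY, {"processed": 0})["processed"]
    batches_done = 0
    while processed < len(items):
        if stop_after_batches is not None and batches_done >= stop_after_batches:
            return False  # simulated migration: state stays in the store
        batch = items[processed:processed + batch_size]
        output.extend(encode(text) for text in batch)
        processed += len(batch)
        store[STATE_KEY] = {"processed": processed}
        batches_done += 1
    store.pop(STATE_KEY, None)  # clean up on successful completion
    return True


items = ["a", "b", "c", "d", "e"]
store: Dict[str, dict] = {}
output: List[str] = []

finished_first = run(items, store, output, batch_size=2, stop_after_batches=1)
resume_point = store[STATE_KEY]["processed"]
finished_second = run(items, store, output, batch_size=2)
print(finished_first, resume_point, finished_second, output)
```

The second call picks up at the saved offset, so no item is encoded twice and none is skipped.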

Performance

The actor is optimized for efficient processing:

Fast model loading and initialization
Efficient batch vectorization
Quick FAISS index building
Memory usage scales with batch size

Optimization tips:

Increase batchSize for faster processing (up to 10,000)
Use selective metadataMapping to reduce memory usage
For very large datasets (>1M products), consider chunking
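To see how batchSize shapes a run, the sketch below splits a product list into batches and estimates remaining time from the average pace so far. Both helpers are hypothetical, and the linear ETA formula is an assumption, not necessarily the one the actor reports.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def eta_seconds(done: int, total: int, elapsed: float) -> float:
    """Estimate remaining seconds assuming a constant processing rate."""
    if done == 0:
        return float("inf")
    return elapsed * (total - done) / done


products = list(range(2500))
batch_sizes = [len(batch) for batch in iter_batches(products, 1000)]
print(batch_sizes)

# After 1000 of 2500 products in 10 seconds, the remaining 1500
# products take another 15 seconds at the same pace.
print(eta_seconds(done=1000, total=2500, elapsed=10.0))
```

Only one batch of texts and embeddings needs to be held for encoding at a time, which is why lowering batchSize reduces peak memory.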

Files

Core Actor Files

src/main.py - Main actor entry point with migration recovery
src/vectorizer.py - ONNX vectorizer wrapper
src/mapping.py - Dot-notation field extraction
src/preprocessing.py - Text preprocessing utilities

Configuration

.actor/actor.json - Actor metadata
.actor/input_schema.json - Input parameter schema
Dockerfile - Container definition
requirements.txt - Python dependencies

Model Files

models/product-matcher-onnx/ - ONNX model files
- model.onnx - Optimized inference model
- tokenizer.json - Tokenizer configuration
- pooling_config.json - Pooling configuration

Deployment

Configure input: Set dataset ID, mappings, and KV store name
Run actor: Via Apify Console or API
Monitor progress: Real-time status updates with ETA
Retrieve index: Access from specified KV store

Deploy to Apify:

$ apify push

Troubleshooting

Missing ID Field

Error: Missing or empty ID field at path: product.token

Solution: Check that idField path is correct and all items have IDs.

Empty Dataset

Warning: No items found in dataset

Solution: Verify dataset ID and that it contains items.

Invalid Mapping

Error: embeddingMapping cannot be empty

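Solution: Provide at least one embedding field. The embedding mapping named in the error appears to correspond to the path parameters (titlePath, brandPath, categoryPath, descriptionPath, specificationsPath).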

Memory Issues

Error: Out of memory during batch processing

Solution: Reduce batchSize (try 500 or 250).

Related Actors

Product Matcher: Uses this index to find similar products
Product Scraper: Collects products for indexing

Support

For issues or questions, please create an issue in the repository.

License

MIT
