# Vector Embeddings Generator (`mhamas/vector-embeddings-generator`) Actor

Turn any text into semantic embedding vectors — perfect for search, similarity matching, clustering, and recommendations. Just feed your texts as JSON or a URL and get 768-dimensional vectors back. Powered by nomic-embed-text-v1.5 with 8K token context. No GPU needed.

- **URL**: https://apify.com/mhamas/vector-embeddings-generator.md
- **Developed by:** [Matej Hamas](https://apify.com/mhamas) (community)
- **Categories:** Automation, Other
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for the Apify platform resources it consumes, which get cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, covering all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

When integrating an Actor into a project, adapt to the project's stack and aim for integrations that are safe, well-documented, and production-ready.
The recommended approaches are as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Vector Embeddings Generator

An [Apify Actor](https://apify.com/actors) that converts text into 768-dimensional embedding vectors. Provide a JSON object of key-value pairs (or a URL pointing to one), and the Actor returns a matching object where each key maps to its embedding vector, stored in the default key-value store.

### What are text embeddings?

Text embeddings are numerical representations of text that capture semantic meaning. Similar texts produce vectors that are close together in a high-dimensional space, which lets you compare meaning mathematically rather than relying on exact keyword matches.

### Use cases

- **Semantic search** -- find results relevant to a query even when the wording differs
- **Similarity matching** -- measure how closely related two pieces of text are
- **Clustering** -- group related texts automatically by vector proximity
- **Deduplication** -- detect near-duplicate content regardless of phrasing
- **Recommendations** -- suggest similar items based on description similarity

### Model

This Actor uses [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) via [FastEmbed](https://github.com/qdrant/fastembed), a lightweight ONNX-based inference library optimized for CPU.

| Property | Value |
|---|---|
| Dimensions | 768 |
| Max sequence length | 8,192 tokens (~6,000 English words) |
| Language | English |
| Similarity metric | Cosine similarity (or dot product -- vectors are L2-normalized) |

Because the output vectors are L2-normalized (unit length), cosine similarity and dot product produce identical results -- use whichever your downstream tool expects.
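A quick sanity check of that equivalence, using tiny hypothetical 2-D vectors in place of real 768-D model output:

```python
import math

# Two hypothetical L2-normalized vectors (real output has 768 dimensions)
a = [0.6, 0.8]
b = [0.8, 0.6]

dot = sum(x * y for x, y in zip(a, b))
norm = lambda v: math.sqrt(sum(x * x for x in v))
cosine = dot / (norm(a) * norm(b))

# For unit-length vectors, the two metrics coincide
assert abs(dot - cosine) < 1e-12
print(dot)  # 0.96
```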

### Input

The Actor accepts three parameters:

#### `jsonData` or `jsonUrl` (required -- provide exactly one)

**`jsonData`** -- A JSON object where keys are identifiers and values are the texts to embed.

**`jsonUrl`** -- A publicly accessible URL that returns a JSON object in the same key-value format.

```json
{
    "jsonData": {
        "product_a": "A lightweight running shoe for daily training",
        "product_b": "Heavy duty waterproof hiking boots",
        "product_c": "Casual summer sandal for the beach"
    },
    "taskType": "search_document"
}
```

- Provide `jsonData` or `jsonUrl`, not both.
- All values must be strings; keys can be any string and are preserved as-is in the output.
- The object must contain at least one entry.
- When using `jsonUrl`, the URL must be publicly accessible and return raw JSON (not an HTML page).
- Each text value can be up to **8,192 tokens** long (roughly **6,000 English words**). Longer texts are truncated by the model.

**Using Google Drive as a JSON source:** Google Drive share links (`https://drive.google.com/file/d/FILE_ID/view?usp=sharing`) return an HTML preview page, not raw JSON. To get the direct download URL, extract the `FILE_ID` from the share link and use this format instead:

```
https://drive.google.com/uc?export=download&id=FILE_ID
```
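A small helper that performs this conversion (the function name is illustrative):

```python
import re

def drive_direct_url(share_url: str) -> str:
    """Convert a Google Drive share link to a direct-download URL."""
    match = re.search(r"/file/d/([^/]+)", share_url)
    if not match:
        raise ValueError("Not a recognized Google Drive share link")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

print(drive_direct_url("https://drive.google.com/file/d/FILE_ID/view?usp=sharing"))
# https://drive.google.com/uc?export=download&id=FILE_ID
```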

#### `taskType` (optional, default: `search_document`)

The nomic model optimizes embeddings differently depending on the intended use case. The selected task type is prepended to each text internally before embedding.

| Task type | When to use |
|---|---|
| `search_document` | Embedding content that will be searched against -- product descriptions, articles, knowledge base entries. |
| `search_query` | Embedding the user's search query. For best retrieval accuracy, embed your documents with `search_document` and your queries with `search_query`. |
| `clustering` | Grouping texts by similarity -- topic detection, organizing collections of documents. |
| `classification` | Feeding embeddings into a classifier that assigns labels or categories to texts. |

Embeddings generated with different task types are not directly comparable -- always use the same task type for texts you intend to compare, except for the `search_document` / `search_query` pair which is designed to work together.
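In practice, a search system makes two runs that differ only in `taskType`: one for the corpus with `search_document`, and one for the queries with `search_query`. An illustrative query-side input (keys and texts are examples):

```json
{
    "jsonData": {
        "q1": "comfortable shoes for jogging"
    },
    "taskType": "search_query"
}
```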

### Output

Results are stored in the default run key-value store under the key `embeddings`. The output mirrors the input structure: each key maps to a 768-element array of floats. The vectors are L2-normalized (unit length), so you can use dot product directly as cosine similarity.

```json
{
    "product_a": [0.0123, -0.0456, 0.0789, "... (768 floats)"],
    "product_b": [-0.0321, 0.0654, -0.0987, "... (768 floats)"],
    "product_c": [0.0111, -0.0222, 0.0333, "... (768 floats)"]
}
```
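Once you have downloaded the `embeddings` object, ranking items by similarity to one of them is a plain dot product. A sketch with tiny hypothetical 2-D vectors standing in for the real 768-D output:

```python
def rank_by_similarity(embeddings: dict, query_key: str) -> list:
    """Rank the other keys by dot-product similarity to the query key's vector.

    Vectors are L2-normalized, so dot product equals cosine similarity.
    """
    q = embeddings[query_key]
    scores = {
        key: sum(a * b for a, b in zip(q, vec))
        for key, vec in embeddings.items()
        if key != query_key
    }
    return sorted(scores, key=scores.get, reverse=True)

# Tiny hypothetical vectors standing in for real 768-D output
demo = {
    "product_a": [0.6, 0.8],
    "product_b": [0.8, 0.6],
    "product_c": [-0.6, 0.8],
}
print(rank_by_similarity(demo, "product_a"))  # ['product_b', 'product_c']
```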

### Technology

The Actor is built with the [Apify SDK for Python](https://docs.apify.com/sdk/python) and runs the [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) model through [FastEmbed](https://github.com/qdrant/fastembed), a lightweight inference library from Qdrant. FastEmbed ships a pre-converted ONNX version of the model, so the Actor needs neither PyTorch nor GPU drivers. At runtime, FastEmbed downloads and caches the ONNX weights, tokenizes the input, runs inference via ONNX Runtime on CPU, and returns normalized vectors. This keeps the Docker image small (~0.5-1 GB compared to ~5 GB for PyTorch-based alternatives).

### Limitations

- **English only** -- Other languages will produce lower-quality embeddings.
- **Token limit** -- Texts exceeding ~8,192 tokens (~6,000 English words) are truncated. Split long documents into chunks before embedding.
- **Memory** -- The ONNX model alone requires ~520 MB. With the default batch size of 16, total memory usage stays around 1-2 GB regardless of input size (larger inputs just take more batches). Choose an Apify memory tier of 2 GB or above.
- **CPU inference** -- The first batch (~16 texts) takes up to a minute due to ONNX Runtime warm-up. Subsequent batches are much faster. Embedding 1,000 short texts takes roughly 1-3 seconds after warm-up. Very large inputs (10,000+ texts) scale linearly; consider splitting across multiple runs.
- **Output size** -- Each embedding is 768 floats. At 10,000 keys the output JSON is approximately 150-200 MB.
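To stay under the token limit, long documents can be split with a simple overlapping word-window chunker before embedding. A rough sketch; the word budget is a heuristic, not a guarantee about token counts:

```python
def chunk_words(text: str, max_words: int = 5000, overlap: int = 200) -> list:
    """Split text into overlapping word windows that stay under the model's limit."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # step back by `overlap` words for context
    return chunks

doc = "word " * 12000  # a 12,000-word stand-in document
chunks = chunk_words(doc)
print(len(chunks))  # 3
```

Embed each chunk as a separate key (e.g. `doc1_chunk0`, `doc1_chunk1`) so results can be mapped back to the source document.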

# Actor input Schema

## `jsonData` (type: `object`):

A JSON object mapping string keys to string values. Each value will be embedded. Example: {"product1": "A lightweight running shoe", "product2": "Heavy duty hiking boots"}

## `jsonUrl` (type: `string`):

A public URL pointing to a JSON file with string keys and string values. The URL must return a JSON response.

## `taskType` (type: `string`):

The nomic embedding model optimizes embeddings for different use cases based on the selected task type.

- `search_document` (default): Use when embedding content that will be searched against. For example, product descriptions, articles, or knowledge base entries.
- `search_query`: Use when embedding the search query itself. If you build a search system, embed your documents with `search_document` and your queries with `search_query` for best retrieval accuracy.
- `clustering`: Use when grouping texts by similarity, such as topic detection or organizing a collection of documents.
- `classification`: Use when the embeddings feed into a classifier that assigns labels or categories to texts.

## Actor input object example

```json
{
  "jsonData": {
    "product_a": "A lightweight running shoe for daily training"
  },
  "taskType": "search_document"
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input (provide jsonData or jsonUrl; texts are illustrative)
const input = {
    "jsonData": {
        "product_a": "A lightweight running shoe for daily training",
        "product_b": "Heavy duty waterproof hiking boots"
    },
    "taskType": "search_document"
};

// Run the Actor and wait for it to finish
const run = await client.actor("mhamas/vector-embeddings-generator").call(input);

// Fetch the embeddings from the run's default key-value store
console.log(`💾 Check your data here: https://console.apify.com/storage/key-value-stores/${run.defaultKeyValueStoreId}`);
const record = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('embeddings');
console.log(record.value);

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input (provide jsonData or jsonUrl; texts are illustrative)
run_input = {
    "jsonData": {
        "product_a": "A lightweight running shoe for daily training",
        "product_b": "Heavy duty waterproof hiking boots",
    },
    "taskType": "search_document",
}

# Run the Actor and wait for it to finish
run = client.actor("mhamas/vector-embeddings-generator").call(run_input=run_input)

# Fetch the embeddings from the run's default key-value store
print("💾 Check your data here: https://console.apify.com/storage/key-value-stores/" + run["defaultKeyValueStoreId"])
record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("embeddings")
print(record["value"])

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "jsonData": {
    "product_a": "A lightweight running shoe for daily training"
  },
  "taskType": "search_document"
}' |
apify call mhamas/vector-embeddings-generator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=mhamas/vector-embeddings-generator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Vector Embeddings Generator",
        "description": "Turn any text into semantic embedding vectors — perfect for search, similarity matching, clustering, and recommendations. Just feed your texts as JSON or a URL and get 768-dimensional vectors back. Powered by nomic-embed-text-v1.5 with 8K token context. No GPU needed.",
        "version": "0.0",
        "x-build-id": "Ygj0eayP6fswswaf6"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/mhamas~vector-embeddings-generator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-mhamas-vector-embeddings-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/mhamas~vector-embeddings-generator/runs": {
            "post": {
                "operationId": "runs-sync-mhamas-vector-embeddings-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/mhamas~vector-embeddings-generator/run-sync": {
            "post": {
                "operationId": "run-sync-mhamas-vector-embeddings-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "jsonData": {
                        "title": "JSON Data",
                        "type": "object",
                        "description": "A JSON object mapping string keys to string values. Each value will be embedded. Example: {\"product1\": \"A lightweight running shoe\", \"product2\": \"Heavy duty hiking boots\"}"
                    },
                    "jsonUrl": {
                        "title": "JSON URL",
                        "type": "string",
                        "description": "A public URL pointing to a JSON file with string keys and string values. The URL must return a JSON response."
                    },
                    "taskType": {
                        "title": "Task Type",
                        "enum": [
                            "search_document",
                            "search_query",
                            "clustering",
                            "classification"
                        ],
                        "type": "string",
                        "description": "The nomic embedding model optimizes embeddings for different use cases based on the selected task type.\n\n• search_document (default) — Use when embedding content that will be searched against. For example, product descriptions, articles, or knowledge base entries.\n• search_query — Use when embedding the search query itself. If you build a search system, embed your documents with 'search_document' and your queries with 'search_query' for best retrieval accuracy.\n• clustering — Use when grouping texts by similarity, such as topic detection or organizing a collection of documents.\n• classification — Use when the embeddings feed into a classifier that assigns labels or categories to texts.",
                        "default": "search_document"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
