Vector Embeddings Generator

Under maintenance

Pricing

Pay per usage

Turn any text into semantic embedding vectors — perfect for search, similarity matching, clustering, and recommendations. Just feed your texts as JSON or a URL and get 768-dimensional vectors back. Powered by nomic-embed-text-v1.5 with 8K token context. No GPU needed.


Rating: 0.0 (0 ratings)

Developer: Matej Hamas

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 9 days ago


An Apify Actor that converts text into 768-dimensional embedding vectors. Provide a JSON object of key-value pairs (or a URL pointing to one), and the Actor returns a matching object where each key maps to its embedding vector, stored in the default key-value store.

What are text embeddings?

Text embeddings are numerical representations of text that capture semantic meaning. Similar texts produce vectors that are close together in a high-dimensional space, which lets you compare meaning mathematically rather than relying on exact keyword matches.

Use cases

  • Semantic search -- find results relevant to a query even when the wording differs
  • Similarity matching -- measure how closely related two pieces of text are
  • Clustering -- group related texts automatically by vector proximity
  • Deduplication -- detect near-duplicate content regardless of phrasing
  • Recommendations -- suggest similar items based on description similarity

Model

This Actor uses nomic-ai/nomic-embed-text-v1.5 via FastEmbed, a lightweight ONNX-based inference library optimized for CPU.

  • Dimensions: 768
  • Max sequence length: 8,192 tokens (~6,000 English words)
  • Language: English
  • Similarity metric: cosine similarity (or dot product -- vectors are L2-normalized)

Because the output vectors are L2-normalized (unit length), cosine similarity and dot product produce identical results -- use whichever your downstream tool expects.
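To see why the two metrics coincide, here is a minimal pure-Python sketch. The tiny 3-D unit vectors stand in for the real 768-dimensional embeddings; the function names are illustrative, not part of the Actor.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 3-D unit vectors standing in for real 768-D embeddings.
v1 = [0.6, 0.8, 0.0]
v2 = [0.8, 0.6, 0.0]

# Both metrics agree because the vectors already have unit length,
# so the norms in the cosine formula are 1.
print(cosine_similarity(v1, v2))  # ≈ 0.96
print(dot_product(v1, v2))        # ≈ 0.96
```

For L2-normalized vectors the denominator is always 1, so skipping it (plain dot product) is a cheap optimization that many vector databases apply automatically.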

Input

The Actor accepts three parameters:

jsonData or jsonUrl (required -- provide exactly one)

jsonData -- A JSON object where keys are identifiers and values are the texts to embed.

jsonUrl -- A publicly accessible URL that returns a JSON object in the same key-value format.

{
  "jsonData": {
    "product_a": "A lightweight running shoe for daily training",
    "product_b": "Heavy duty waterproof hiking boots",
    "product_c": "Casual summer sandal for the beach"
  },
  "taskType": "search_document"
}
  • Provide jsonData or jsonUrl, not both.
  • All values must be strings; keys can be any string and are preserved as-is in the output.
  • The object must contain at least one entry.
  • When using jsonUrl, the URL must be publicly accessible and return raw JSON (not an HTML page).
  • Each text value can be up to 8,192 tokens long (roughly 6,000 English words). Longer texts are truncated by the model.

Using Google Drive as a JSON source: Google Drive share links (https://drive.google.com/file/d/FILE_ID/view?usp=sharing) return an HTML preview page, not raw JSON. To get the direct download URL, extract the FILE_ID from the share link and use this format instead:

https://drive.google.com/uc?export=download&id=FILE_ID
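The FILE_ID extraction can be automated. A small sketch (the helper name is ours, not part of the Actor):

```python
import re

def drive_direct_url(share_url: str) -> str:
    """Convert a Google Drive share link to a direct-download URL.

    Raises ValueError if the link doesn't contain a /file/d/FILE_ID/ segment.
    """
    match = re.search(r"/file/d/([^/]+)", share_url)
    if not match:
        raise ValueError(f"Not a recognizable Drive share link: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

print(drive_direct_url(
    "https://drive.google.com/file/d/abc123XYZ/view?usp=sharing"
))
# https://drive.google.com/uc?export=download&id=abc123XYZ
```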

taskType (optional, default: search_document)

The nomic model optimizes embeddings differently depending on the intended use case. The selected task type is prepended to each text internally before embedding.

  • search_document -- Embedding content that will be searched against: product descriptions, articles, knowledge base entries.
  • search_query -- Embedding the user's search query. For best retrieval accuracy, embed your documents with search_document and your queries with search_query.
  • clustering -- Grouping texts by similarity: topic detection, organizing collections of documents.
  • classification -- Feeding embeddings into a classifier that assigns labels or categories to texts.

Embeddings generated with different task types are not directly comparable -- always use the same task type for texts you intend to compare, except for the search_document / search_query pair which is designed to work together.

Output

Results are stored in the default run key-value store under the key embeddings. The output mirrors the input structure: each key maps to a 768-element array of floats. The vectors are L2-normalized (unit length), so you can use dot product directly as cosine similarity.

{
  "product_a": [0.0123, -0.0456, 0.0789, "... (768 floats)"],
  "product_b": [-0.0321, 0.0654, -0.0987, "... (768 floats)"],
  "product_c": [0.0111, -0.0222, 0.0333, "... (768 floats)"]
}
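A sketch of running the Actor and reading the embeddings record with the official Apify client for Python (`pip install apify-client`). The Actor ID and token are placeholders -- copy the real ID from the Actor's page in Apify Store.

```python
# Input for the Actor, in the format described above.
run_input = {
    "jsonData": {
        "product_a": "A lightweight running shoe for daily training",
        "product_b": "Heavy duty waterproof hiking boots",
    },
    "taskType": "search_document",
}

def fetch_embeddings(api_token: str, actor_id: str) -> dict:
    """Run the Actor and return {key: [768 floats], ...} from its key-value store."""
    from apify_client import ApifyClient  # requires `pip install apify-client`

    client = ApifyClient(api_token)
    # Start the Actor run and wait for it to finish.
    run = client.actor(actor_id).call(run_input=run_input)
    # Read the `embeddings` record from the run's default key-value store.
    record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("embeddings")
    return record["value"]

# embeddings = fetch_embeddings("<APIFY_TOKEN>", "<ACTOR_ID>")
```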

Technology

The Actor is built with the Apify SDK for Python and runs the nomic-embed-text-v1.5 model through FastEmbed, a lightweight inference library from Qdrant. FastEmbed ships a pre-converted ONNX version of the model, so the Actor needs neither PyTorch nor GPU drivers. At runtime, FastEmbed downloads and caches the ONNX weights, tokenizes the input, runs inference via ONNX Runtime on CPU, and returns normalized vectors. This keeps the Docker image small (~0.5-1 GB compared to ~5 GB for PyTorch-based alternatives).

Limitations

  • English only -- Other languages will produce lower-quality embeddings.
  • Token limit -- Texts exceeding ~8,192 tokens (~6,000 English words) are truncated. Split long documents into chunks before embedding.
  • Memory -- The ONNX model alone requires ~520 MB. With the default batch size of 16, total memory usage stays around 1-2 GB regardless of input size (larger inputs just take more batches). Choose an Apify memory tier of 2 GB or above.
  • CPU inference -- The first batch (~16 texts) takes up to a minute due to ONNX Runtime warm-up. Subsequent batches are much faster. Embedding 1,000 short texts takes roughly 1-3 seconds after warm-up. Very large inputs (10,000+ texts) scale linearly; consider splitting across multiple runs.
  • Output size -- Each embedding is 768 floats. At 10,000 keys the output JSON is approximately 150-200 MB.
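For the token-limit point above, a simple word-based chunker is usually enough. The 5,000-word chunk size is our assumption: it leaves headroom below the ~6,000-word (~8,192-token) cutoff, since the token-to-word ratio varies with the text. The output dict can be passed directly as jsonData.

```python
def chunk_by_words(text: str, max_words: int = 5000) -> dict:
    """Split a long text into word-based chunks keyed chunk_0, chunk_1, ...

    Keeps each chunk safely under the model's ~6,000-English-word limit.
    """
    words = text.split()
    return {
        f"chunk_{i // max_words}": " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    }

doc = "word " * 12000  # stand-in for a 12,000-word document
chunks = chunk_by_words(doc)
print(len(chunks))  # 3 chunks: 5000 + 5000 + 2000 words
```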