Vector Embeddings Generator
Pricing
Pay per usage
Turn any text into semantic embedding vectors — perfect for search, similarity matching, clustering, and recommendations. Just feed your texts as JSON or a URL and get 768-dimensional vectors back. Powered by nomic-embed-text-v1.5 with 8K token context. No GPU needed.
Developer
Matej Hamas
Last modified: 9 days ago
An Apify Actor that converts text into 768-dimensional embedding vectors. Provide a JSON object of key-value pairs (or a URL pointing to one), and the Actor returns a matching object where each key maps to its embedding vector, stored in the default key-value store.
What are text embeddings?
Text embeddings are numerical representations of text that capture semantic meaning. Similar texts produce vectors that are close together in a high-dimensional space, which lets you compare meaning mathematically rather than relying on exact keyword matches.
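That mathematical comparison is usually cosine similarity. A minimal sketch -- the 4-dimensional vectors here are made-up stand-ins for real embeddings, chosen only to illustrate the idea:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three texts
running_shoe = [0.9, 0.1, 0.3, 0.2]
trail_shoe   = [0.8, 0.2, 0.4, 0.1]   # similar topic -> high similarity
tax_form     = [0.1, 0.9, 0.1, 0.8]   # unrelated topic -> low similarity

print(cosine_similarity(running_shoe, trail_shoe))  # close to 1
print(cosine_similarity(running_shoe, tax_form))    # much lower
```

Texts about the same topic score near 1; unrelated texts score much lower, even with zero keyword overlap.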
Use cases
- Semantic search -- find results relevant to a query even when the wording differs
- Similarity matching -- measure how closely related two pieces of text are
- Clustering -- group related texts automatically by vector proximity
- Deduplication -- detect near-duplicate content regardless of phrasing
- Recommendations -- suggest similar items based on description similarity
Model
This Actor uses nomic-ai/nomic-embed-text-v1.5 via FastEmbed, a lightweight ONNX-based inference library optimized for CPU.
| Property | Value |
|---|---|
| Dimensions | 768 |
| Max sequence length | 8,192 tokens (~6,000 English words) |
| Language | English |
| Similarity metric | Cosine similarity (or dot product -- vectors are L2-normalized) |
Because the output vectors are L2-normalized (unit length), cosine similarity and dot product produce identical results -- use whichever your downstream tool expects.
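The identity is easy to verify: once both vectors have unit length, the norms in the cosine formula are 1 and only the dot product remains. A small sketch with arbitrary example vectors:

```python
import math

def l2_normalize(v):
    # Scale the vector to unit length
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])   # -> [0.6, 0.8], length 1
b = l2_normalize([1.0, 2.0])

# For unit vectors the cosine formula collapses to the plain dot product
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
assert abs(cosine - dot(a, b)) < 1e-12
```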
Input
The Actor accepts three parameters:
jsonData or jsonUrl (required -- provide exactly one)
jsonData -- A JSON object where keys are identifiers and values are the texts to embed.
jsonUrl -- A publicly accessible URL that returns a JSON object in the same key-value format.
```json
{
  "jsonData": {
    "product_a": "A lightweight running shoe for daily training",
    "product_b": "Heavy duty waterproof hiking boots",
    "product_c": "Casual summer sandal for the beach"
  },
  "taskType": "search_document"
}
```
- Provide jsonData or jsonUrl, not both.
- All values must be strings; keys can be any string and are preserved as-is in the output.
- The object must contain at least one entry.
- When using jsonUrl, the URL must be publicly accessible and return raw JSON (not an HTML page).
- Each text value can be up to 8,192 tokens long (roughly 6,000 English words). Longer texts are truncated by the model.
Using Google Drive as a JSON source: Google Drive share links (https://drive.google.com/file/d/FILE_ID/view?usp=sharing) return an HTML preview page, not raw JSON. To get the direct download URL, extract the FILE_ID from the share link and use this format instead:
https://drive.google.com/uc?export=download&id=FILE_ID
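A call from Python could look like the sketch below. The input validation mirrors the rules above; the Actor ID and token are placeholders, and the apify-client call at the end is an untested sketch of that library's standard pattern:

```python
import json

# Input payload in the format the Actor expects (field names from this README)
run_input = {
    "jsonData": {
        "product_a": "A lightweight running shoe for daily training",
        "product_b": "Heavy duty waterproof hiking boots",
    },
    "taskType": "search_document",
}

# Basic checks mirroring the Actor's input rules
assert ("jsonData" in run_input) != ("jsonUrl" in run_input), "provide exactly one"
assert len(run_input["jsonData"]) >= 1, "at least one entry"
assert all(isinstance(v, str) for v in run_input["jsonData"].values())
print(json.dumps(run_input, indent=2))

# With the apify-client package (sketch; substitute your own token and Actor ID):
# from apify_client import ApifyClient
# client = ApifyClient("MY_APIFY_TOKEN")
# run = client.actor("username/vector-embeddings-generator").call(run_input=run_input)
# record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("embeddings")
# embeddings = record["value"]
```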
taskType (optional, default: search_document)
The nomic model optimizes embeddings differently depending on the intended use case. The selected task type is prepended to each text internally before embedding.
| Task type | When to use |
|---|---|
search_document | Embedding content that will be searched against -- product descriptions, articles, knowledge base entries. |
search_query | Embedding the user's search query. For best retrieval accuracy, embed your documents with search_document and your queries with search_query. |
clustering | Grouping texts by similarity -- topic detection, organizing collections of documents. |
classification | Feeding embeddings into a classifier that assigns labels or categories to texts. |
Embeddings generated with different task types are not directly comparable -- always use the same task type for texts you intend to compare, except for the search_document / search_query pair which is designed to work together.
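The internal prepending can be pictured as a simple transformation. The exact prefix format is an assumption based on the nomic model's documented convention, not something this Actor exposes:

```python
def with_task_prefix(text: str, task_type: str = "search_document") -> str:
    # nomic-style instruction prefix prepended before embedding
    # (assumed format: "<task_type>: <text>")
    return f"{task_type}: {text}"

# Documents and queries get different prefixes, which is why the
# search_document / search_query pair is tuned to work together
doc = with_task_prefix("Heavy duty waterproof hiking boots", "search_document")
query = with_task_prefix("boots for rain", "search_query")
```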
Output
Results are stored in the default run key-value store under the key embeddings. The output mirrors the input structure: each key maps to a 768-element array of floats. The vectors are L2-normalized (unit length), so you can use dot product directly as cosine similarity.
```json
{
  "product_a": [0.0123, -0.0456, 0.0789, "... (768 floats)"],
  "product_b": [-0.0321, 0.0654, -0.0987, "... (768 floats)"],
  "product_c": [0.0111, -0.0222, 0.0333, "... (768 floats)"]
}
```
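Once downloaded, this structure can feed a similarity lookup directly. A sketch with made-up 4-dimensional stand-ins for the 768-dimensional output vectors:

```python
# Toy stand-ins for the Actor's output; real values come from the
# "embeddings" record in the key-value store and have 768 dimensions
embeddings = {
    "product_a": [0.7, 0.1, 0.6, 0.2],
    "product_b": [0.1, 0.8, 0.2, 0.5],
    "product_c": [0.6, 0.2, 0.7, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def most_similar(key, table):
    # Rank every other entry by dot product (equals cosine for unit vectors)
    others = ((k, dot(table[key], v)) for k, v in table.items() if k != key)
    return max(others, key=lambda kv: kv[1])[0]

print(most_similar("product_a", embeddings))
```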
Technology
The Actor is built with the Apify SDK for Python and runs the nomic-embed-text-v1.5 model through FastEmbed, a lightweight inference library from Qdrant. FastEmbed ships a pre-converted ONNX version of the model, so the Actor needs neither PyTorch nor GPU drivers. At runtime, FastEmbed downloads and caches the ONNX weights, tokenizes the input, runs inference via ONNX Runtime on CPU, and returns normalized vectors. This keeps the Docker image small (~0.5-1 GB compared to ~5 GB for PyTorch-based alternatives).
Limitations
- English only -- Other languages will produce lower-quality embeddings.
- Token limit -- Texts exceeding ~8,192 tokens (~6,000 English words) are truncated. Split long documents into chunks before embedding.
- Memory -- The ONNX model alone requires ~520 MB. With the default batch size of 16, total memory usage stays around 1-2 GB regardless of input size (larger inputs just take more batches). Choose an Apify memory tier of 2 GB or above.
- CPU inference -- The first batch (~16 texts) takes up to a minute due to ONNX Runtime warm-up. Subsequent batches are much faster. Embedding 1,000 short texts takes roughly 1-3 seconds after warm-up. Very large inputs (10,000+ texts) scale linearly; consider splitting across multiple runs.
- Output size -- Each embedding is 768 floats. At 10,000 keys the output JSON is approximately 150-200 MB.
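For texts over the token limit, a rough word-based splitter is often enough; a production pipeline would usually chunk by tokens and tune the overlap, so treat the numbers below as illustrative:

```python
def chunk_words(text, max_words=6000, overlap=200):
    # Emit windows of at most max_words words, overlapping by `overlap`
    # words so content at chunk boundaries is not lost
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words), 1), step)]

# A 13,000-word document yields chunks starting at words 0, 5800, 11600
chunks = chunk_words("word " * 13000, max_words=6000, overlap=200)
```

Each chunk can then become its own key in jsonData (e.g. doc1_chunk0, doc1_chunk1 -- naming is up to you).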
