Similarity Graph From Embeddings

Under maintenance

Pricing

Pay per usage

Builds a similarity graph from vector embeddings. Fetches vectors from URLs, computes pairwise cosine similarities using optimized linear algebra, and connects each point to its K nearest neighbors - revealing hidden clusters and relationships in your high-dimensional data.

Developer

Matej Hamas

Maintained by Community

Last modified: 12 days ago

This Apify Actor builds a similarity graph from vector embeddings using cosine similarity. It fetches vectors from the provided URLs, computes pairwise cosine similarities, filters edges by configurable outgoing and incoming limits per node, and outputs the graph as JSON.

How it works

  1. Fetch vectors: Downloads JSON data from each provided URL. Each URL must return a JSON object mapping IDs to float arrays: { "id1": [0.1, 0.2, ...], "id2": [0.3, 0.4, ...] }.
  2. Validate: All vectors must have the same dimensionality. Duplicate IDs across URLs are not allowed.
  3. Compute similarities: Builds the full cosine similarity matrix using vectorized numpy operations (L2-normalize every vector, then a single matrix multiply via BLAS).
  4. Filter edges: Applies the global top-percentage threshold, then limits outgoing edges per node, then limits incoming edges per node. Each filter keeps only the highest-similarity edges, using argpartition to select them in linear time rather than fully sorting.
  5. Build graph: Each vector becomes a node. Surviving edges become directed edges with cosine similarity as the weight.
  6. Store output: The graph is saved as graph.json in the default key-value store. A link to the file is pushed to the default dataset and displayed on the output tab.
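
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual source: the function names `cosine_similarity_matrix` and `top_k_outgoing` are hypothetical, zero vectors are assumed to be left as zeros, and only the outgoing-edge limit is shown.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the full pairwise cosine similarity matrix for row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Assumption: zero vectors are left as zeros instead of raising an error.
    unit = vectors / np.where(norms == 0.0, 1.0, norms)
    # One matrix multiply of the L2-normalized rows gives every cosine at once.
    return unit @ unit.T


def top_k_outgoing(sim: np.ndarray, k: int) -> list[tuple[int, int, float]]:
    """Keep the k most similar neighbors of each node as directed edges."""
    n = sim.shape[0]
    scores = sim.astype(float)  # astype returns a copy, so sim is not modified
    np.fill_diagonal(scores, -np.inf)  # exclude self-loops
    k = min(k, n - 1)
    edges: list[tuple[int, int, float]] = []
    if k <= 0:
        return edges
    for i in range(n):
        # argpartition moves the k largest entries into the last k slots
        # (in arbitrary order) using linear-time selection.
        neighbors = np.argpartition(scores[i], n - k)[n - k:]
        edges.extend((i, int(j), float(scores[i, j])) for j in neighbors)
    return edges
```

With the three sample vectors from the "Expected URL response format" section (apple, banana, car) and k = 1, this sketch links apple and banana to each other and points car at apple.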

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | List of URLs, each returning JSON of form { id: [float, float, ...] }. All vectors must have the same dimensionality. |
| topPercentage | number | No | 100 | Keep only the top X% of all pairwise similarities (globally). Lower values produce sparser graphs. Applied before the per-node edge limits. |
| maxOutgoingEdgesPerNode | integer | No | | For each node, keep only the top K most similar neighbors as outgoing edges. If not set, all edges surviving the top percentage filter are kept. Applied before the incoming edges limit. |
| maxIncomingEdgesPerNode | integer | No | | For each node, keep only the top K highest-similarity incoming edges. If not set, incoming edges are not limited. Applied after the outgoing edges limit. |
| keepAtLeastOneEdge | boolean | No | false | When enabled, each node always keeps its most similar neighbor regardless of other filtering. Prevents isolated nodes in the graph. |

Example input

{
  "urls": [
    "https://example.com/embeddings-part1.json",
    "https://example.com/embeddings-part2.json"
  ],
  "maxOutgoingEdgesPerNode": 10,
  "maxIncomingEdgesPerNode": 20
}

Expected URL response format

Each URL must return a JSON object where keys are string IDs and values are arrays of floats (all the same length):

{
  "apple": [0.12, 0.85, 0.33, 0.67],
  "banana": [0.11, 0.82, 0.30, 0.71],
  "car": [0.90, 0.05, 0.88, 0.12]
}
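
The format rules above (same vector length everywhere, no duplicate IDs across URLs) could be checked along these lines. This is only a sketch: `merge_vector_payloads` is a hypothetical helper that takes already-downloaded response bodies as strings, and the error messages are invented for illustration.

```python
import json


def merge_vector_payloads(payloads: list[str]) -> dict[str, list[float]]:
    """Parse JSON payloads and merge them, enforcing the documented rules."""
    merged: dict[str, list[float]] = {}
    dim = None  # dimensionality of the first vector seen
    for text in payloads:
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("each payload must be a JSON object")
        for vec_id, values in data.items():
            if vec_id in merged:
                raise ValueError(f"duplicate ID: {vec_id!r}")
            if not isinstance(values, list) or not all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in values
            ):
                raise ValueError(f"vector {vec_id!r} is not an array of numbers")
            if dim is None:
                dim = len(values)
            elif len(values) != dim:
                raise ValueError(
                    f"vector {vec_id!r} has length {len(values)}, expected {dim}"
                )
            merged[vec_id] = [float(v) for v in values]
    return merged
```

Note that `json.loads` silently keeps the last value when a key is repeated inside a single object, so this sketch only detects duplicates across payloads.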

Output

Key-value store

The Actor stores a single file graph.json in the default key-value store. Example:

{
  "version": "1",
  "nodes": [
    { "id": "apple" },
    { "id": "banana" },
    { "id": "car" }
  ],
  "edges": [
    { "source": "apple", "target": "banana", "weight": 0.987 },
    { "source": "banana", "target": "apple", "weight": 0.987 }
  ]
}

  • Nodes: One per vector ID from the input data.
  • Edges: Directed. Outgoing edges per node are limited by maxOutgoingEdgesPerNode, incoming edges by maxIncomingEdgesPerNode. Edge weight is the cosine similarity, which in general lies between -1 and 1.
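
As an illustration of consuming this file, the sketch below turns the graph JSON into per-node neighbor lists sorted by weight. The function name `neighbors_by_weight` is hypothetical, and the input is assumed to be the text of graph.json that has already been downloaded.

```python
import json


def neighbors_by_weight(graph_json: str) -> dict[str, list[tuple[str, float]]]:
    """Map each node ID to its outgoing (target, weight) pairs, strongest first."""
    graph = json.loads(graph_json)
    # Start from the node list so nodes without outgoing edges still appear.
    adjacency: dict[str, list[tuple[str, float]]] = {
        node["id"]: [] for node in graph["nodes"]
    }
    for edge in graph["edges"]:
        adjacency[edge["source"]].append((edge["target"], edge["weight"]))
    for targets in adjacency.values():
        targets.sort(key=lambda pair: pair[1], reverse=True)
    return adjacency
```

Applied to the example graph above, the apple entry holds a single neighbor (banana), while car maps to an empty list because it has no outgoing edges.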

Graph JSON schema

The output graph.json conforms to the following JSON schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Similarity Graph",
  "type": "object",
  "required": ["version", "nodes", "edges"],
  "properties": {
    "version": {
      "type": "string",
      "const": "1"
    },
    "nodes": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id"],
        "properties": {
          "id": {
            "type": "string",
            "description": "Vector ID from the input data."
          }
        }
      }
    },
    "edges": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["source", "target", "weight"],
        "properties": {
          "source": {
            "type": "string",
            "description": "ID of the source node."
          },
          "target": {
            "type": "string",
            "description": "ID of the target node."
          },
          "weight": {
            "type": "number",
            "description": "Cosine similarity between source and target vectors."
          }
        }
      }
    }
  }
}
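
For a quick sanity check without a JSON Schema library, the constraints above can be mirrored by hand. The sketch below is a hand-written approximation of the schema, not a general JSON Schema validator; `check_graph` and its messages are hypothetical.

```python
def check_graph(graph: object) -> list[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(graph, dict):
        return ["graph must be an object"]
    problems: list[str] = []
    if graph.get("version") != "1":
        problems.append('"version" must be the string "1"')
    nodes = graph.get("nodes")
    if not isinstance(nodes, list):
        problems.append('"nodes" must be an array')
    else:
        for i, node in enumerate(nodes):
            if not isinstance(node, dict) or not isinstance(node.get("id"), str):
                problems.append(f"nodes[{i}] needs a string id")
    edges = graph.get("edges")
    if not isinstance(edges, list):
        problems.append('"edges" must be an array')
    else:
        for i, edge in enumerate(edges):
            if not isinstance(edge, dict):
                problems.append(f"edges[{i}] must be an object")
                continue
            for key in ("source", "target"):
                if not isinstance(edge.get(key), str):
                    problems.append(f"edges[{i}].{key} must be a string")
            weight = edge.get("weight")
            # JSON Schema "number" covers ints and floats but not booleans.
            if isinstance(weight, bool) or not isinstance(weight, (int, float)):
                problems.append(f"edges[{i}].weight must be a number")
    return problems
```

Returning a list of messages instead of raising on the first failure makes it easier to report every structural issue in one pass.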

Dataset

The default dataset contains a single record with the public URL of the graph JSON file:

{
"graphUrl": "https://api.apify.com/v2/key-value-stores/<store-id>/records/graph.json"
}