The Apify Chroma integration transfers selected data from Apify Actors to a Chroma database. It processes the data, optionally splits it into chunks, computes embeddings, and saves them to Chroma.

This integration supports incremental updates, updating only the data that has changed. This approach reduces unnecessary embedding computation and storage operations, making it suitable for search and retrieval augmented generation (RAG) use cases.

💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate Chroma integration to save web data as vectors to Chroma.

What is the Chroma vector database?

Chroma is an open-source, AI-native vector database designed for simplicity and developer productivity. It provides SDKs for Python and JavaScript/TypeScript and includes an option for self-hosted servers.

📋 How does the Apify Chroma integration work?

The Apify Chroma integration computes text embeddings and stores them in Chroma. It uses LangChain to compute the embeddings and to interact with Chroma.

1. Retrieve a dataset as output from an Actor.
2. [Optional] Split text data into chunks using LangChain's RecursiveCharacterTextSplitter (enable or disable with performChunking, and specify chunkSize and chunkOverlap).
3. [Optional] Update only changed data (select dataUpdatesStrategy).
4. Compute embeddings, e.g., using OpenAI or Cohere (specify embeddingsProvider and embeddingsConfig).
5. Save the data into the database.

✅ Before you start

To utilize this integration, ensure you have:

- Chroma operational on a remote server or cloud instance.
- An account to compute embeddings using one of the providers, e.g., OpenAI or Cohere.

👉 Examples

The configuration consists of three parts: Chroma, embeddings provider, and data.

Ensure that the vector size of your embeddings matches the configuration of your Chroma collection. For instance, the text-embedding-3-small model from OpenAI generates vectors of size 1536, so the collection you write to must also hold vectors of size 1536.
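
The size check described above can be sketched in a few lines of Python. This is only an illustration: the lookup table and the check_dimension helper are hypothetical names that are not part of the integration, and the sizes listed are the default output dimensions published for the models used in the examples on this page.

```python
# Default output dimensions for the models used in this page's examples.
# KNOWN_DIMENSIONS and check_dimension are illustrative names only.
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-multilingual-v3.0": 1024,
}


def check_dimension(model: str, collection_dimension: int) -> None:
    """Raise ValueError if the model's vector size differs from the collection's."""
    expected = KNOWN_DIMENSIONS[model]
    if expected != collection_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the collection holds {collection_dimension}-dimensional vectors"
        )


# Matching sizes pass silently; a mismatch raises ValueError.
check_dimension("text-embedding-3-small", 1536)
```

Running a check like this before the first write is cheaper than discovering a mismatch after embeddings have already been computed.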

For detailed input information, refer to the Input page.

Database: Chroma (simple)

{
  "chromaCollectionName": "chroma",
  "chromaClientHost": "https://your-chroma-instance.com",
  "chromaApiToken": "your-api-token"
}

Database: Chroma with tenant and database (cloud/enterprise)

{
  "chromaCollectionName": "chroma",
  "chromaClientHost": "https://your-chroma-instance.chroma.cloud",
  "chromaApiToken": "your-api-token",
  "chromaTenant": "your-tenant-id",
  "chromaDatabase": "your-database-name"
}

Embeddings provider: OpenAI

{
  "embeddingsProvider": "OpenAI",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model":  "text-embedding-3-large"}
}

Embeddings provider: Cohere

{
  "embeddingsProvider": "Cohere",
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddingsConfig": {"model":  "embed-multilingual-v3.0"}
}

Save data from Website Content Crawler to Chroma

Data is transferred in the form of a dataset from Website Content Crawler, which provides a dataset with the following output fields (truncated for brevity):

{
  "url": "https://www.apify.com",
  "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
  "metadata": {"title": "Apify"}
}

This dataset is then processed by the Chroma integration. In the integration settings, use datasetFields to specify which fields to save to Chroma, e.g., ["text"], and metadataDatasetFields to specify which fields to store as metadata, e.g., {"title": "metadata.title"}. Without any other configuration, the data is saved to Chroma as is.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"}
}
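
To make the field mapping concrete, here is a minimal Python sketch of how a dot-separated path such as metadata.title could be resolved against a dataset item. The helper names get_nested and select_fields are hypothetical, and joining several datasetFields with a newline is an assumption; the integration's actual implementation may differ.

```python
from functools import reduce
from typing import Any


def get_nested(item: dict[str, Any], path: str) -> Any:
    """Follow a dot-separated path such as "metadata.title" through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), item)


def select_fields(
    item: dict[str, Any],
    dataset_fields: list[str],
    metadata_fields: dict[str, str],
) -> tuple[str, dict[str, Any]]:
    """Return the text to embed and the metadata to store for one dataset item."""
    # Assumption: multiple selected fields are joined with a newline.
    text = "\n".join(str(get_nested(item, field)) for field in dataset_fields)
    metadata = {name: get_nested(item, path) for name, path in metadata_fields.items()}
    return text, metadata


item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}
text, metadata = select_fields(item, ["text"], {"title": "metadata.title"})
```

With the settings shown above, text holds the page text and metadata holds the title taken from the nested metadata object.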

Create chunks from Website Content Crawler data and save them to the database

Assume that the text from the Website Content Crawler is too long to embed in a single request. In that case, the data needs to be divided into smaller pieces called chunks. The integration can use LangChain's RecursiveCharacterTextSplitter to split the text into chunks before saving them to the database. The chunkSize and chunkOverlap parameters control the maximum length of each chunk and how much text adjacent chunks share. The right values depend on your use case: well-chosen chunking improves retrieval and helps produce accurate responses.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"},
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
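
The effect of chunkSize and chunkOverlap can be illustrated with a simplified sliding-window splitter in Python. Note that this is not LangChain's RecursiveCharacterTextSplitter, which additionally tries to break on separators such as paragraphs and sentences; the chunk_text function is a hypothetical stand-in that only shows how the two parameters interact.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a plain
    character window, not a separator-aware splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
no_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
```

A non-zero overlap repeats the end of one chunk at the start of the next, which can help when a relevant passage would otherwise be cut at a chunk boundary, at the cost of storing and embedding more text.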

Configure update strategy

To control how the integration updates data in the database, use the dataUpdatesStrategy parameter. This parameter allows you to choose between different update strategies based on your use case, such as adding new data, upserting records, or incrementally updating records based on changes (deltas). Below are the available strategies and explanations for when to use each:

Add data (add):
- Appends new data to the database without checking for duplicates or updating existing records.
- Suitable for cases where deduplication or updates are unnecessary, and the data simply needs to be added.
- For example, you might use this strategy to continually append data from independent crawls without regard for overlaps.
Upsert data (upsert):
- Deletes existing records in the database if they match a key or identifier, then inserts the new records.
- Ideal when you want to maintain accurate and up-to-date data while avoiding duplication.
- For instance, this is useful in cases where unique items (such as user profiles or documents) need to be managed, ensuring the database reflects the latest changes.
- Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
Update changed data based on deltas (deltaUpdates):
- Incrementally updates records by identifying differences (deltas) between the new dataset and the existing database records.
- Ensures only new or modified records are processed, leaving unchanged records untouched. This minimizes unnecessary database operations and improves efficiency.
- This is the most efficient strategy when integrating data that evolves over time, such as website content or recurring crawls.
- Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
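
The difference between the add and upsert strategies can be sketched with an in-memory list standing in for the database. The function names and the list-based store are hypothetical and only mirror the behavior described above; they are not the integration's code.

```python
def record_key(record: dict, primary_fields: list[str]) -> tuple:
    """Build the identifier of a record from its primary fields."""
    return tuple(record[field] for field in primary_fields)


def add(store: list[dict], new_records: list[dict]) -> None:
    """add: append everything, without checking for duplicates."""
    store.extend(new_records)


def upsert(store: list[dict], new_records: list[dict], primary_fields: list[str]) -> None:
    """upsert: delete stored records that share a key with a new record, then insert."""
    new_keys = {record_key(r, primary_fields) for r in new_records}
    store[:] = [r for r in store if record_key(r, primary_fields) not in new_keys]
    store.extend(new_records)


old = {"url": "https://www.apify.com", "text": "old text"}
new = {"url": "https://www.apify.com", "text": "new text"}

added = [old]
add(added, [new])            # both versions are now stored

upserted = [old]
upsert(upserted, [new], ["url"])  # only the new version remains
```

The deltaUpdates strategy goes one step further than upsert by skipping records whose content has not changed.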

Incrementally update database from the Website Content Crawler

To incrementally update data from the Website Content Crawler in the database, configure the integration to update only changed or new data. This is controlled by the dataUpdatesStrategy setting. This way, the integration minimizes unnecessary updates and ensures that only new or modified data is processed.

A checksum is computed for each dataset item (together with all metadata) and stored in the database alongside the vectors. When the data is re-crawled, the checksum is recomputed and compared with the stored checksum. If the checksum is different, the old data (including vectors) is deleted and new data is saved. Otherwise, only the last_seen_at metadata field is updated to indicate when the data was last seen.
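
The checksum comparison described above can be sketched as follows. This is a simplified model under stated assumptions: SHA-256 over a key-sorted JSON dump is an assumed checksum (the integration's actual hash function is not specified here), a dictionary stands in for the database, and delta_update is a hypothetical name.

```python
import hashlib
import json


def checksum(item: dict) -> str:
    """Hash an item, including its metadata, in a key-order-independent way."""
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def delta_update(store: dict, items: list[dict], primary_fields: list[str], now: str) -> dict:
    """Apply a crawl to the store and report what happened to each item.

    store maps a primary key to {"checksum", "item", "last_seen_at"}.
    """
    counts = {"added": 0, "replaced": 0, "unchanged": 0}
    for item in items:
        key = tuple(item[field] for field in primary_fields)
        digest = checksum(item)
        existing = store.get(key)
        if existing is not None and existing["checksum"] == digest:
            # Same content: only refresh the last_seen_at timestamp.
            existing["last_seen_at"] = now
            counts["unchanged"] += 1
            continue
        # New or changed content: replace the stored record (and its vectors).
        counts["replaced" if existing is not None else "added"] += 1
        store[key] = {"checksum": digest, "item": item, "last_seen_at": now}
    return counts
```

In this model, embeddings would only need to be recomputed for items counted as added or replaced, which is where the savings of the delta strategy come from.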

Provide unique identifier for each dataset item

To incrementally update the data, you need to be able to uniquely identify each dataset item. The dataUpdatesPrimaryDatasetFields parameter specifies which fields are used to uniquely identify each dataset item and helps track content changes across different crawls. For instance, when working with the Website Content Crawler, you can use the URL as a unique identifier.

{
  "dataUpdatesStrategy": "deltaUpdates",
  "dataUpdatesPrimaryDatasetFields": ["url"]
}

To get the most out of incremental data updates, start with an empty database. While it is possible to use this feature with an existing database, records that were not originally saved using a prefix or metadata will not be updated.

Delete outdated (expired) data

The integration can delete data from the database that hasn't been crawled for a specified period, which is useful when data becomes outdated, such as when a page is removed from a website.

The deletion feature can be enabled or disabled using the deleteExpiredObjects setting.

For each crawl, the last_seen_at metadata field is created or updated. This field records the most recent time the data object was crawled. The expiredObjectDeletionPeriodDays setting controls the number of days since the last crawl after which a data object is considered expired. If a database object has not been seen for more than expiredObjectDeletionPeriodDays days, it is deleted automatically.

The specific value of expiredObjectDeletionPeriodDays depends on your use case.

- If a website is crawled daily, expiredObjectDeletionPeriodDays can be set to 7.
- If you crawl weekly, it can be set to 30.

To disable this feature, set deleteExpiredObjects to false.

{
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 30
}
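
The expiry rule can be expressed as a small Python check. This is a sketch that assumes last_seen_at is available as a timezone-aware datetime; the is_expired helper is a hypothetical name, and how the integration stores and compares timestamps internally may differ.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_seen_at: datetime, now: datetime, period_days: int) -> bool:
    """Return True if the object was last seen more than period_days ago."""
    return now - last_seen_at > timedelta(days=period_days)


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
seen_recently = datetime(2024, 6, 20, tzinfo=timezone.utc)  # 10 days ago
seen_long_ago = datetime(2024, 5, 1, tzinfo=timezone.utc)   # 60 days ago

# With expiredObjectDeletionPeriodDays set to 30, only the second object expires.
keep = is_expired(seen_recently, now, period_days=30)
drop = is_expired(seen_long_ago, now, period_days=30)
```

Choosing a period several times longer than the crawl interval leaves room for an occasional failed or skipped crawl before data is deleted.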

💡 If you are using multiple Actors to update the same database, ensure that all Actors crawl the data at the same frequency. Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.

Batch size configuration

You can control the number of documents sent to Chroma in a single request using the chromaBatchSize parameter. The default is 300. Lower this value if you experience timeouts or want finer control over insert operations.
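
Batching of this kind can be sketched as follows. The batched helper is a hypothetical illustration of splitting documents into requests of at most chromaBatchSize items; it is not the integration's own code.

```python
from typing import Iterator, Sequence


def batched(documents: Sequence, batch_size: int = 300) -> Iterator[Sequence]:
    """Yield consecutive slices of documents with at most batch_size items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


documents = [f"doc-{i}" for i in range(650)]
batch_sizes = [len(batch) for batch in batched(documents)]              # default of 300
small_batch_sizes = [len(batch) for batch in batched(documents, 100)]   # lowered value
```

A smaller batch size means more requests, each carrying less data, which is the trade-off to consider when requests time out.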

💾 Outputs

This integration will save the selected fields from your Actor to Chroma.

🔢 Example configuration

Full Input Example for Website Content Crawler Actor with Chroma integration

{
  "chromaCollectionName": "chroma",
  "chromaClientHost": "https://your-chroma-instance.com",
  "chromaClientSsl": true,
  "embeddingsProvider": "OpenAI",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {
    "model": "text-embedding-3-small"
  },
  "datasetFields": [
    "text"
  ],
  "dataUpdatesStrategy": "deltaUpdates",
  "dataUpdatesPrimaryDatasetFields": ["url"],
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 7,
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}

Chroma (simple)

{
  "chromaCollectionName": "chroma",
  "chromaClientHost": "https://your-chroma-instance.com",
  "chromaApiToken": "your-api-token"
}

Chroma (cloud/enterprise with tenant and database)

{
  "chromaCollectionName": "chroma",
  "chromaClientHost": "https://your-chroma-instance.chroma.cloud",
  "chromaApiToken": "your-api-token",
  "chromaTenant": "your-tenant-id",
  "chromaDatabase": "your-database-name"
}

OpenAI embeddings

{
  "embeddingsProvider": "OpenAI",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model":  "text-embedding-3-large"}
}

Cohere embeddings

{
  "embeddingsProvider": "Cohere",
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddingsConfig": {"model":  "embed-multilingual-v3.0"}
}
