Chroma Integration
This integration transfers data from Apify Actors to a Chroma database and is a good starting point for a question-answering, search, or RAG use case.
The Apify Chroma integration transfers selected data from Apify Actors to a Chroma database. It processes the data, optionally splits it into chunks, computes embeddings, and saves them to Chroma.
This integration supports incremental updates, updating only the data that has changed. This approach reduces unnecessary embedding computation and storage operations, making it suitable for search and retrieval augmented generation (RAG) use cases.
💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate Chroma integration to save web data as vectors to Chroma.
What is Chroma vector database?
Chroma is an open-source, AI-native vector database designed for simplicity and developer productivity. It provides SDKs for Python and JavaScript/TypeScript and includes an option for self-hosted servers.
📋 How does the Apify-Chroma integration work?
The Apify Chroma integration computes text embeddings and stores them in Chroma. It uses LangChain to compute embeddings and interact with Chroma.
- Retrieve a dataset as output from an Actor
- [Optional] Split text data into chunks using langchain's `RecursiveCharacterTextSplitter` (enable/disable using `performChunking` and specify `chunkSize`, `chunkOverlap`)
- [Optional] Update only changed data in Chroma (enable/disable using `enableDeltaUpdates`)
- Compute embeddings, e.g. using `OpenAI` or `Cohere` (specify `embeddings` and `embeddingsConfig`)
- Save data into the database
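The steps above can be sketched end to end in Python. This is a minimal illustrative sketch, not the Actor's actual code: `chunk_text`, `embed_text`, and the in-memory `store` list are hypothetical stand-ins (the real Actor uses LangChain splitters, a real embeddings provider, and a Chroma collection).

```python
# Illustrative sketch of the integration pipeline.
# All helpers here are hypothetical stand-ins for the Actor's internals.

def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Naive fixed-size character chunking. The real Actor uses LangChain's
    RecursiveCharacterTextSplitter, which prefers natural boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed_text(text):
    """Stub embedding; the real Actor calls e.g. OpenAI or Cohere."""
    return [float(len(text))]  # placeholder vector

def run_integration(dataset, perform_chunking=True, chunk_size=1000, chunk_overlap=0):
    store = []  # stands in for the Chroma collection
    for item in dataset:
        if perform_chunking:
            texts = chunk_text(item["text"], chunk_size, chunk_overlap)
        else:
            texts = [item["text"]]
        for t in texts:
            store.append({
                "text": t,
                "vector": embed_text(t),
                "metadata": item.get("metadata", {}),
            })
    return store
```

With `chunk_size=100` and no overlap, a 250-character item yields three records: two full chunks and a 50-character remainder, each carrying the item's metadata.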
✅ Before you start
To utilize this integration, ensure you have:
- Chroma operational on a server or on localhost.
- An account to compute embeddings using one of the providers, e.g., OpenAI or Cohere.
For quick Chroma setup, refer to Chroma deployment. Chroma can be run in a Docker container with the following commands:
Docker
```shell
docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
```
Authentication with Docker
To enable static API token authentication, create a `.env` file with:

```shell
CHROMA_SERVER_AUTHN_CREDENTIALS=test-token
CHROMA_SERVER_AUTHN_PROVIDER=chromadb.auth.token_authn.TokenAuthenticationServerProvider
```
Then run Docker with:
```shell
docker run --env-file ./.env -p 8000:8000 chromadb/chroma
```
If you are running Chroma locally, you can expose localhost using ngrok.
Install ngrok (you can use it for free or create an account), then expose Chroma with:

```shell
ngrok http http://localhost:8000
```
You'll see an output similar to:
```text
Session Status    online
Account           a@a.ai (Plan: Free)
Forwarding        https://fdfe-82-208-25-82.ngrok-free.app -> http://localhost:8000
```
The forwarding URL (`https://fdfe-82-208-25-82.ngrok-free.app`) can then be used as the value of the input variable `chromaClientHost`.
Note that your specific URL will vary.
👉 Examples
The configuration consists of three parts: Chroma, embeddings provider, and data.
Ensure that the vector size of your embeddings aligns with the configuration of your Chroma database.
For instance, if you're using the `text-embedding-3-small` model from OpenAI, it generates vectors of size 1536.
This means your Chroma collection should also be configured to accommodate vectors of the same size, 1536 in this case.
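A quick sanity check against the published embedding sizes can catch a mismatch before any data is written. The dimension table below reflects OpenAI's documented model sizes; the helper itself is just an illustrative utility, not part of the Actor.

```python
# Published output dimensions for common OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

def check_dimensions(model: str, collection_dim: int) -> bool:
    """Return True if the model's vector size matches the collection's."""
    expected = EMBEDDING_DIMENSIONS.get(model)
    if expected is None:
        raise ValueError(f"Unknown model: {model}")
    return expected == collection_dim
```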
For detailed input information refer to the Input page.
Database: Chroma
```json
{
  "chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
  "chromaCollectionName": "chroma",
  "chromaServerAuthCredentials": "test-token"
}
```
Embeddings provider: OpenAI
```json
{
  "embeddingsProvider": "OpenAIEmbeddings",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}
```
Save data from Website Content Crawler to Chroma
Data is transferred in the form of a dataset from Website Content Crawler, which provides a dataset with the following output fields (truncated for brevity):
```json
{
  "url": "https://www.apify.com",
  "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
  "metadata": {"title": "Apify"}
}
```
This dataset is then processed by the Chroma integration.
In the integration settings you need to specify which fields you want to save to Chroma, e.g., `["text"]`, and which of them should be used as metadata, e.g., `{"title": "metadata.title"}`.
Without any other configuration, the data is saved to Chroma as is.
```json
{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"}
}
```
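Conceptually, the field mapping resolves dotted paths like `metadata.title` against each dataset item. The helpers below are a minimal sketch of that idea, assuming hypothetical function names; they are not the Actor's actual implementation.

```python
def get_by_path(item: dict, path: str):
    """Resolve a dotted path such as 'metadata.title' against a dataset item."""
    value = item
    for key in path.split("."):
        value = value[key]
    return value

def select_fields(item, dataset_fields, metadata_dataset_fields):
    """Join the selected text fields and build the metadata mapping."""
    text = " ".join(str(get_by_path(item, f)) for f in dataset_fields)
    metadata = {name: get_by_path(item, path)
                for name, path in metadata_dataset_fields.items()}
    return text, metadata

item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}
text, meta = select_fields(item, ["text"], {"title": "metadata.title"})
```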
Create chunks from Website Content Crawler data and save them to the database
Assume that the text data from the Website Content Crawler is too long to compute embeddings for.
Therefore, we need to divide the data into smaller pieces called chunks.
We can leverage LangChain's `RecursiveCharacterTextSplitter` to split the text into chunks and save them into the database.
The parameters `chunkSize` and `chunkOverlap` are important; the right settings depend on your use case, where proper chunking helps optimize retrieval and ensures accurate responses.
```json
{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"},
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
```
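To see what `chunkSize` and `chunkOverlap` mean in practice, here is a simplified character-window splitter. It is only an illustration of the overlap semantics; LangChain's `RecursiveCharacterTextSplitter` additionally prefers splitting at paragraph and sentence boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Simplified sliding-window chunking: consecutive chunks share
    chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_with_overlap("abcdefghij" * 10, chunk_size=40, chunk_overlap=10)
```

A 100-character text with `chunk_size=40` and `chunk_overlap=10` yields three chunks of 40 characters each, with the last 10 characters of one chunk repeated at the start of the next.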
Incrementally update database from the Website Content Crawler
To incrementally update data from the Website Content Crawler to Chroma, configure the integration to update only the changed or new data.
This is controlled by the `enableDeltaUpdates` setting.
This way, the integration minimizes unnecessary updates and ensures that only new or modified data is processed.
A checksum is computed for each dataset item (together with all metadata) and stored in the database alongside the vectors.
When the data is re-crawled, the checksum is recomputed and compared with the stored checksum.
If the checksum is different, the old data (including vectors) is deleted and new data is saved.
Otherwise, only the `last_seen_at` metadata field is updated to indicate when the data was last seen.
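The checksum comparison can be sketched as follows. This is an illustrative scheme (canonical JSON hashed with SHA-256); the Actor's exact hashing details may differ.

```python
import hashlib
import json

def item_checksum(item: dict) -> str:
    """Stable checksum over a dataset item together with its metadata,
    using a canonical JSON serialization (illustrative scheme)."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_update(new_item: dict, stored_checksum: str) -> bool:
    """True if the re-crawled item differs from what is stored."""
    return item_checksum(new_item) != stored_checksum
```

An unchanged item produces the same checksum on re-crawl, so only `last_seen_at` would be refreshed; any change in the text or metadata flips the comparison and triggers a delete-and-reinsert.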
Provide unique identifier for each dataset item
To incrementally update the data, you need to be able to uniquely identify each dataset item.
The variable `deltaUpdatesPrimaryDatasetFields` specifies which fields are used to uniquely identify each dataset item and helps track content changes across different crawls.
For instance, when working with the Website Content Crawler, you can use the URL as a unique identifier.
```json
{
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"]
}
```
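One way to turn the primary fields into a stable identifier is to hash their joined values, as sketched below. This is an assumption about how such an ID could be derived, not the Actor's documented internals.

```python
import hashlib

def item_id(item: dict, primary_fields: list) -> str:
    """Derive a stable identifier from the primary dataset fields
    (e.g. the URL), so the same page maps to the same ID across crawls."""
    key = "|".join(str(item[f]) for f in primary_fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Re-crawling the same URL yields the same ID, which is what lets the integration find and compare the previously stored record.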
Delete outdated (expired) data
The integration can delete data from the database that hasn't been crawled for a specified period, which is useful when data becomes outdated, such as when a page is removed from a website.
The deletion feature can be enabled or disabled using the `deleteExpiredObjects` setting.
For each crawl, the `last_seen_at` metadata field is created or updated.
This field records the most recent time the data object was crawled.
The `expiredObjectDeletionPeriodDays` setting controls the number of days since the last crawl after which a data object is considered expired.
If a database object has not been seen for more than `expiredObjectDeletionPeriodDays` days, it is deleted automatically.
The specific value of `expiredObjectDeletionPeriodDays` depends on your use case.
- If a website is crawled daily, `expiredObjectDeletionPeriodDays` can be set to 7.
- If you crawl weekly, it can be set to 30.
To disable this feature, set `deleteExpiredObjects` to `false`.
```json
{
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 30
}
```
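The expiry check amounts to comparing each object's `last_seen_at` timestamp against a cutoff derived from the deletion period. The sketch below illustrates that logic with a hypothetical helper; it is not the Actor's actual code.

```python
from datetime import datetime, timedelta, timezone

def expired_ids(last_seen: dict, deletion_period_days: int, now=None):
    """Return the IDs of objects whose last_seen_at is older than the
    deletion period (illustrative sketch of deleteExpiredObjects)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=deletion_period_days)
    return [obj_id for obj_id, seen in last_seen.items() if seen < cutoff]
```

With a 30-day period, an object last seen 60 days ago is selected for deletion, while one seen yesterday is kept.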
💡 If you are using multiple Actors to update the same database, ensure that all Actors crawl the data at the same frequency. Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.
💾 Outputs
This integration will save the selected fields from your Actor to Chroma.
🔢 Example configuration
Full Input Example for Website Content Crawler Actor with Chroma integration
```json
{
  "chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
  "chromaClientSsl": false,
  "chromaCollectionName": "chroma",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {
    "model": "text-embedding-3-small"
  },
  "embeddingsProvider": "OpenAI",
  "datasetFields": [
    "text"
  ],
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"],
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 30,
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}
```
Chroma
```json
{
  "chromaClientHost": "https://fdfe-82-208-25-82.ngrok-free.app",
  "chromaCollectionName": "chroma",
  "chromaServerAuthCredentials": "test-token"
}
```
OpenAI embeddings
```json
{
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddings": "OpenAI",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}
```
Cohere embeddings
```json
{
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddings": "Cohere",
  "embeddingsConfig": {"model": "embed-multilingual-v3.0"}
}
```
Actor Metrics
Created in Jun 2024
Modified 3 months ago