Milvus Integration

apify/milvus-integration
This integration transfers data from Apify Actors to a Milvus/Zilliz database and is a good starting point for a question-answering, search, or RAG use case.

The Apify Milvus integration transfers selected data from Apify Actors to a Milvus/Zilliz database. It processes the data, optionally splits it into chunks, computes embeddings, and saves them to Milvus.

This integration supports incremental updates, updating only the data that has changed. This approach reduces unnecessary embedding computation and storage operations, making it suitable for search and retrieval augmented generation (RAG) use cases.

💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate Milvus integration to save web data as vectors to Milvus.

What is Milvus/Zilliz vector database?

Milvus is an open-source vector database designed for similarity searches on large datasets of high-dimensional vectors. Its emphasis on efficient vector similarity search enables the development of robust and scalable retrieval systems. The Milvus database hosted at Zilliz demonstrates top performance in the Vector Database Benchmark.

📋 How does the Apify-Milvus/Zilliz integration work?

The Apify Milvus integration computes text embeddings and stores them in Milvus. It uses LangChain to compute embeddings and interact with Milvus.

  1. Retrieve a dataset as output from an Actor
  2. [Optional] Split text data into chunks using LangChain's RecursiveCharacterTextSplitter (enable/disable using performChunking and specify chunkSize, chunkOverlap)
  3. [Optional] Update only changed data in Milvus (enable/disable using enableDeltaUpdates)
  4. Compute embeddings, e.g. using OpenAI or Cohere (specify embeddingsProvider and embeddingsConfig)
  5. Save data into the database

✅ Before you start

To use this integration, ensure you have:

  • A new or existing Milvus database. You need to know milvusUri, milvusToken, and milvusCollectionName.
  • If the collection does not exist, it will be created automatically.
  • An account to compute embeddings using one of the providers, e.g., OpenAI or Cohere.

Set up Milvus/Zilliz URI, token and collection name

You can run Milvus using Docker or try the managed Milvus service at Zilliz. For more details, please refer to the Milvus documentation.

You need the URI and token of your Milvus/Zilliz instance to set up the client.

  • If you have a self-deployed Milvus server on Docker or Kubernetes, use the server address and port as your uri, e.g. http://localhost:19530. If you enable the authentication feature on Milvus, use "<your_username>:<your_password>" as the token; otherwise, leave the token as an empty string.
  • If you use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.

Note that the collection does not need to exist beforehand. It will be automatically created when data is uploaded to the database.
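
The URI and token rules above can be sketched as a small helper. This is only an illustration: the function name and its parameters are hypothetical and not part of the integration. Only the milvusUri and milvusToken keys come from the integration's input.

```python
from typing import Optional


def milvus_connection_settings(
    uri: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    api_key: Optional[str] = None,
) -> dict:
    """Build the milvusUri / milvusToken pair (hypothetical helper)."""
    if api_key:
        # Zilliz Cloud: uri is the Public Endpoint, token is the API key.
        token = api_key
    elif username and password:
        # Self-deployed Milvus with authentication enabled.
        token = f"{username}:{password}"
    else:
        # Self-deployed Milvus without authentication: empty token.
        token = ""
    return {"milvusUri": uri, "milvusToken": token}


local = milvus_connection_settings("http://localhost:19530")
secured = milvus_connection_settings("http://localhost:19530", "alice", "s3cret")
```

The resulting dictionary can be merged with milvusCollectionName to form the Milvus part of the input.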

👉 Examples

The configuration consists of three parts: Milvus, embeddings provider and data.

Ensure that the vector size of your embeddings aligns with the configuration of your Milvus index. For instance, if you're using the text-embedding-3-small model from OpenAI, it generates vectors of size 1536. This means your Milvus index should also be configured to accommodate vectors of the same size, 1536 in this case.
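
A quick way to catch a size mismatch before uploading is a lookup of the expected dimension per model. The sketch below is an assumption-laden illustration, not part of the integration: the helper name is hypothetical, and the table lists the default output sizes of the models mentioned on this page.

```python
# Default output dimensions of the embedding models mentioned on this page.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-multilingual-v3.0": 1024,
}


def dimensions_match(model: str, collection_dim: int) -> bool:
    """Return True if the model's vector size equals the collection's (hypothetical helper)."""
    try:
        expected = EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model}") from None
    return expected == collection_dim
```

For example, dimensions_match("text-embedding-3-large", 1536) is False, which signals that a collection created for text-embedding-3-small cannot be reused after switching to the larger model.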

For detailed input information, refer to the Input page.

Database: Milvus

{
  "milvusUri": "YOUR-MILVUS-URI",
  "milvusToken": "YOUR-MILVUS-TOKEN",
  "milvusCollectionName": "YOUR-MILVUS-COLLECTION-NAME"
}

Please refer to the instructions above on how to set up the Milvus/Zilliz URI, token, and collection name.

Embeddings provider: OpenAI

{
  "embeddingsProvider": "OpenAI",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}

Save data from Website Content Crawler to Milvus

Data is transferred in the form of a dataset from Website Content Crawler, which provides a dataset with the following output fields (truncated for brevity):

{
  "url": "https://www.apify.com",
  "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
  "metadata": {"title": "Apify"}
}

This dataset is then processed by the Milvus integration. In the integration settings you need to specify which fields you want to save to Milvus, e.g., ["text"] and which of them should be used as metadata, e.g., {"title": "metadata.title"}. Without any other configuration, the data is saved to Milvus as is.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"}
}
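
To make the field selection concrete, here is a minimal sketch of how a dotted path such as metadata.title could be resolved against a dataset item. Both helper functions are hypothetical. Joining multiple dataset fields with a newline is an assumption, since the page does not say how the integration combines them.

```python
from typing import Any, Dict, List


def get_nested(item: dict, path: str) -> Any:
    """Resolve a dotted path such as 'metadata.title' inside a dataset item."""
    value: Any = item
    for key in path.split("."):
        value = value[key]
    return value


def select_fields(
    item: dict, dataset_fields: List[str], metadata_fields: Dict[str, str]
) -> dict:
    """Pick the text to embed and the metadata to store (hypothetical helper)."""
    # Assumption: several selected fields are joined with a newline.
    text = "\n".join(str(get_nested(item, field)) for field in dataset_fields)
    metadata = {name: get_nested(item, path) for name, path in metadata_fields.items()}
    return {"text": text, "metadata": metadata}


item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}
selected = select_fields(item, ["text"], {"title": "metadata.title"})
```

With the configuration above, selected holds the crawled text and a metadata dictionary with a single title entry.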

Create chunks from Website Content Crawler data and save them to the database

Suppose the text from the Website Content Crawler is too long to embed in one piece. In that case, the data needs to be divided into smaller pieces called chunks. LangChain's RecursiveCharacterTextSplitter can split the text into chunks before they are saved to the database. The chunkSize and chunkOverlap parameters control the chunk length and how much consecutive chunks overlap. The right values depend on your use case; well-chosen chunking improves retrieval and helps produce accurate responses.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"},
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
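
To build intuition for how chunkSize and chunkOverlap interact, the sketch below uses a plain sliding window over characters. It is deliberately simpler than LangChain's RecursiveCharacterTextSplitter, which also tries to split on separators such as paragraphs and sentences. The function name is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Naive fixed-size chunking with overlap (not the LangChain splitter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A larger overlap repeats more context between neighbouring chunks at the cost of more vectors to embed and store.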

Incrementally update database from the Website Content Crawler

To incrementally update data from the Website Content Crawler to Milvus, configure the integration to update only the changed or new data. This is controlled by the enableDeltaUpdates setting. This way, the integration minimizes unnecessary updates and ensures that only new or modified data is processed.

A checksum is computed for each dataset item (together with all metadata) and stored in the database alongside the vectors. When the data is re-crawled, the checksum is recomputed and compared with the stored checksum. If the checksum is different, the old data (including vectors) is deleted and new data is saved. Otherwise, only the last_seen_at metadata field is updated to indicate when the data was last seen.

Provide a unique identifier for each dataset item

To incrementally update the data, you need to be able to uniquely identify each dataset item. The deltaUpdatesPrimaryDatasetFields setting specifies which fields are used to uniquely identify each dataset item and helps track content changes across different crawls. For instance, when working with the Website Content Crawler, you can use the URL as a unique identifier.

{
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"]
}
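
The checksum comparison described above can be sketched as follows. This is an illustration under stated assumptions: the page does not say which hash function the integration uses or how it builds the item identifier, so SHA-256 over the sorted JSON and a "|"-joined key are placeholders, and both function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def compute_checksum(item: dict) -> str:
    """Checksum of a dataset item including its metadata (SHA-256 is an assumption)."""
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_update(
    item: dict, primary_fields: List[str], stored_checksums: Dict[str, str]
) -> Tuple[str, str]:
    """Decide what to do with a re-crawled item (hypothetical helper)."""
    item_id = "|".join(str(item[field]) for field in primary_fields)
    stored = stored_checksums.get(item_id)
    if stored is None:
        return item_id, "insert"   # new item: embed and save
    if stored != compute_checksum(item):
        return item_id, "replace"  # changed: delete old vectors, save new ones
    return item_id, "touch"        # unchanged: only update last_seen_at


stored = {}
page = {"url": "https://www.apify.com", "text": "v1"}
first = plan_update(page, ["url"], stored)
stored[first[0]] = compute_checksum(page)
second = plan_update(page, ["url"], stored)
third = plan_update(dict(page, text="v2"), ["url"], stored)
```

Only the "insert" and "replace" outcomes require new embeddings, which is where the savings of delta updates come from.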

Delete outdated (expired) data

The integration can delete data from the database that hasn't been crawled for a specified period, which is useful when data becomes outdated, such as when a page is removed from a website.

The deletion feature can be enabled or disabled using the deleteExpiredObjects setting.

For each crawl, the last_seen_at metadata field is created or updated. This field records the most recent time the data object was crawled. The expiredObjectDeletionPeriodDays setting controls the number of days since the last crawl after which a data object is considered expired. If a database object has not been seen for more than expiredObjectDeletionPeriodDays days, it is deleted automatically.

The specific value of expiredObjectDeletionPeriodDays depends on your use case.

  • If a website is crawled daily, expiredObjectDeletionPeriodDays can be set to 7.
  • If you crawl weekly, it can be set to 30.

To disable this feature, set deleteExpiredObjects to false.

{
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 30
}

💡 If you are using multiple Actors to update the same database, ensure that all Actors crawl the data at the same frequency. Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.
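
The expiration rule can be expressed as a single comparison on last_seen_at. The sketch below is an illustration only: the function name is hypothetical, and storing last_seen_at as a timezone-aware datetime is an assumption about the representation.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_seen_at: datetime, now: datetime, period_days: int) -> bool:
    """True if the object has not been seen for more than period_days days."""
    return now - last_seen_at > timedelta(days=period_days)


now = datetime(2024, 8, 31, tzinfo=timezone.utc)
seen_recently = datetime(2024, 8, 25, tzinfo=timezone.utc)    # 6 days ago
seen_long_ago = datetime(2024, 7, 15, tzinfo=timezone.utc)    # 47 days ago

keep = not is_expired(seen_recently, now, period_days=7)
drop = is_expired(seen_long_ago, now, period_days=30)
```

This also shows why mixed crawl schedules are risky: an object refreshed only monthly would be dropped by a 7-day period set for a daily crawler.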

💾 Outputs

This integration will save the selected fields from your Actor to Milvus and store the chunked data in the Apify dataset.

🔢 Example configuration

Full Input Example for Website Content Crawler Actor with Milvus integration

{
  "milvusUri": "YOUR-MILVUS-URI",
  "milvusToken": "YOUR-MILVUS-TOKEN",
  "milvusCollectionName": "YOUR-MILVUS-COLLECTION-NAME",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {
    "model": "text-embedding-3-small"
  },
  "embeddingsProvider": "OpenAI",
  "datasetFields": [
    "text"
  ],
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"],
  "expiredObjectDeletionPeriodDays": 7,
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}

Milvus

{
  "milvusUri": "YOUR-MILVUS-URI",
  "milvusToken": "YOUR-MILVUS-TOKEN",
  "milvusCollectionName": "YOUR-MILVUS-COLLECTION-NAME"
}

Managed Milvus service at Zilliz

{
  "milvusUri": "https://in03-***********.api.gcp-us-west1.zillizcloud.com",
  "milvusToken": "d46**********b4b",
  "milvusCollectionName": "YOUR-MILVUS-COLLECTION-NAME"
}

OpenAI embeddings

{
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsProvider": "OpenAI",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}

Cohere embeddings

{
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddingsProvider": "Cohere",
  "embeddingsConfig": {"model": "embed-multilingual-v3.0"}
}

Developer
Maintained by Apify

Actor Metrics

  • 3 monthly users

  • 1 star

  • >99% runs succeeded

  • Created in Jul 2024

  • Modified 2 months ago
