PGVector Integration

apify/pgvector-integration
This integration transfers data from Apify Actors to a PostgreSQL database with the PGVector extension.

PostgreSQL (with PGVector) integration

The Apify PGVector integration transfers selected data from Apify Actors to PostgreSQL (with the PGVector extension). It processes the data, optionally splits it into chunks, computes embeddings, and saves them to PostgreSQL.

This integration supports incremental updates, updating only the data that has changed. This approach reduces unnecessary embedding computation and storage operations, making it suitable for search and retrieval augmented generation (RAG) use cases.

💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate PGVector integration to save web data as vectors to PostgreSQL.

How does it work?

The Apify PGVector integration computes text embeddings and stores them in PostgreSQL. It uses LangChain to compute the embeddings and to interact with PGVector; a minimal sketch of the same flow follows the list of steps below.

  1. Retrieve a dataset as output from an Actor
  2. [Optional] Split text data into chunks using LangChain's RecursiveCharacterTextSplitter (enable/disable using performChunking and specify chunkSize and chunkOverlap)
  3. [Optional] Update only changed data in PostgreSQL (enable/disable using enableDeltaUpdates)
  4. Compute embeddings, e.g., using OpenAI or Cohere (specify embeddings and embeddingsConfig)
  5. Save the data into the database
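
The sketch below is not the Actor's internal code; it only illustrates the same flow with plain LangChain. It assumes the langchain-openai, langchain-postgres, and langchain-text-splitters packages, an OpenAI API key in the environment, and the connection details used in the examples further down.

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. A dataset item as produced by, e.g., Website Content Crawler
item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}

# 2. Optional chunking (performChunking, chunkSize, chunkOverlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = splitter.create_documents(
    [item["text"]],
    metadatas=[{"title": item["metadata"]["title"], "url": item["url"]}],
)

# 4. + 5. Compute embeddings and save them into the PGVector collection.
# langchain-postgres expects an SQLAlchemy-style connection string (postgresql+psycopg://...).
store = PGVector(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="apify-collection",
    connection="postgresql+psycopg://postgres:password@localhost:5432/apify",
)
store.add_documents(docs)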

Before you start

To utilize this integration, ensure you have:

  • A new or existing PostgreSQL database with the PGVector extension enabled (a quick connection and extension check is sketched after this list). You need to know its postgresSqlConnectionStr and postgresCollectionName.
  • An account to compute embeddings using one of the providers, e.g., OpenAI or Cohere.
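
A minimal sketch of such a check, assuming the psycopg (v3) driver and the local connection details used in the examples below; enabling the extension requires sufficient database privileges.

import psycopg

# Verify the connection string works and make sure the PGVector extension is available.
with psycopg.connect("postgresql://postgres:password@localhost:5432/apify") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    row = conn.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'").fetchone()
    print("pgvector version:", row[0])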

Examples

For detailed input information, refer to the input schema.

The configuration consists of three parts: PGVector, embeddings provider, and data.

Ensure that the vector size of your embeddings aligns with the configuration of your PostgreSQL. For instance, if you're using the text-embedding-3-small model from OpenAI, it generates vectors of size 1536. This means your PostgreSQL vector should also be configured to accommodate vectors of the same size, 1536 in this case.
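
A quick way to confirm the dimension is to embed a short string with LangChain's OpenAIEmbeddings and inspect the vector length (an OpenAI API key is assumed to be set in the environment):

from langchain_openai import OpenAIEmbeddings

# Embed a short string and inspect the vector size produced by the chosen model.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
print(len(embeddings.embed_query("dimension check")))  # 1536 for text-embedding-3-small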

Database: PostgreSQL with PGVector

{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection"
}

Embeddings provider: OpenAI

{
  "embeddingsProvider": "OpenAIEmbeddings",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}

Save data from Website Content Crawler to PostgreSQL

Data is transferred in the form of a dataset from Website Content Crawler, which provides a dataset with the following output fields (truncated for brevity):

{
  "url": "https://www.apify.com",
  "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
  "metadata": {"title": "Apify"}
}

This dataset is then processed by the PGVector integration. In the integration settings, you need to specify which fields you want to save to PostgreSQL, e.g., ["text"], and which of them should be used as metadata, e.g., {"title": "metadata.title"}. Without any other configuration, the data is saved to PostgreSQL as is.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"}
}
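
For illustration only (this is not the Actor's internal code), the mapping can be thought of as resolving dotted paths against each dataset item:

from functools import reduce

item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}

def resolve(obj: dict, dotted_path: str):
    # Follow a dotted path such as "metadata.title" into a nested dataset item.
    return reduce(lambda current, key: current[key], dotted_path.split("."), obj)

page_content = item["text"]  # datasetFields: ["text"]
metadata = {name: resolve(item, path) for name, path in {"title": "metadata.title"}.items()}
print(metadata)  # {'title': 'Apify'}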

Create chunks from Website Content Crawler data and save them to the database

Assume that the text data from the Website Content Crawler is too long to compute embeddings for. We therefore need to divide the data into smaller pieces called chunks. We can leverage LangChain's RecursiveCharacterTextSplitter to split the text into chunks and save them into the database. The parameters chunkSize and chunkOverlap are important; the right values depend on your use case, and proper chunking helps optimize retrieval and ensures accurate responses. A short chunking sketch follows the configuration below.

{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"},
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
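
A minimal sketch of what this configuration does with LangChain's splitter (the sample text is made up for the example):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunkSize and chunkOverlap map directly onto the splitter's parameters.
long_text = "Apify is a platform that enables developers to build, run, and share automation tasks. " * 50
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_text(long_text)
print(len(chunks), max(len(chunk) for chunk in chunks))  # every chunk is at most 1000 characters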

Incrementally update database from the Website Content Crawler

To incrementally update data from the Website Content Crawler to PostgreSQL, configure the integration to update only the changed or new data. This is controlled by the enableDeltaUpdates setting. This way, the integration minimizes unnecessary updates and ensures that only new or modified data is processed.

A checksum is computed for each dataset item (together with all metadata) and stored in the database alongside the vectors. When the data is re-crawled, the checksum is recomputed and compared with the stored checksum. If the checksum is different, the old data (including vectors) is deleted and new data is saved. Otherwise, only the last_seen_at metadata field is updated to indicate when the data was last seen.
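
Conceptually, the delta-update logic works like the following sketch; the Actor's actual implementation may differ in detail.

import hashlib
import json

def checksum(page_content: str, metadata: dict) -> str:
    # Stable hash over the item's content and metadata.
    payload = json.dumps({"page_content": page_content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

stored_checksum = checksum("Apify is a platform ...", {"title": "Apify", "url": "https://www.apify.com"})
recrawled_checksum = checksum("Apify is a platform ... (updated)", {"title": "Apify", "url": "https://www.apify.com"})

if recrawled_checksum != stored_checksum:
    ...  # delete the old rows and vectors for this item, then embed and insert the new data
else:
    ...  # only update the last_seen_at metadata field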

Provide a unique identifier for each dataset item

To incrementally update the data, you need to be able to uniquely identify each dataset item. The variable deltaUpdatesPrimaryDatasetFields specifies which fields are used to uniquely identify each dataset item and helps track content changes across different crawls. For instance, when working with the Website Content Crawler, you can use the URL as a unique identifier.

{
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"]
}

Delete outdated data

The integration can also delete data from PostgreSQL that hasn't been crawled for a specified period. This is useful when data in the PostgreSQL database becomes outdated, such as when a page is removed from a website.

This is controlled by the expiredObjectDeletionPeriodDays setting, which automatically deletes data older than the specified number of days. For each crawl, the last_seen_at metadata field is updated. When a database object has not been seen for more than expiredObjectDeletionPeriodDays days, it is deleted.

The specific value of expiredObjectDeletionPeriodDays depends on your use case. Typically, if a website is crawled daily, expiredObjectDeletionPeriodDays can be set to 7. If you crawl weekly, it can be set to 30. To disable this feature, set expiredObjectDeletionPeriodDays to 0.

{
  "enableDeltaUpdates": true,
  "expiredObjectDeletionPeriodDays": 30
}
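
The expiration rule itself is simple; here is an illustrative sketch of the check, with a made-up last_seen_at value:

from datetime import datetime, timedelta, timezone

expired_object_deletion_period_days = 30  # 0 disables deletion
last_seen_at = datetime(2024, 6, 1, tzinfo=timezone.utc)  # stored with each database object

if expired_object_deletion_period_days > 0:
    cutoff = datetime.now(timezone.utc) - timedelta(days=expired_object_deletion_period_days)
    if last_seen_at < cutoff:
        ...  # the object has expired: delete it (and its vectors) from the collection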

Outputs

This integration will save the selected fields from your Actor to PostgreSQL.

Example configuration

Full Input Example for Website Content Crawler Actor with PostgreSQL integration

{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {
    "model": "text-embedding-3-small"
  },
  "embeddingsProvider": "OpenAI",
  "datasetFields": [
    "text"
  ],
  "enableDeltaUpdates": true,
  "deltaUpdatesPrimaryDatasetFields": ["url"],
  "expiredObjectDeletionPeriodDays": 7,
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}

PostgreSQL with PGVector

{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection"
}

OpenAI embeddings

{
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddings": "OpenAI",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}

Cohere embeddings

{
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddings": "Cohere",
  "embeddingsConfig": {"model": "embed-multilingual-v3.0"}
}
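
Both providers map onto LangChain embedding classes; a small sketch for the Cohere case, assuming the langchain-cohere package:

from langchain_cohere import CohereEmbeddings

# The same model name as in embeddingsConfig above.
embeddings = CohereEmbeddings(model="embed-multilingual-v3.0", cohere_api_key="YOUR-COHERE-API-KEY")
print(len(embeddings.embed_query("dimension check")))  # embed-multilingual-v3.0 produces 1024-dimensional vectors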

Local PostgreSQL database with PGVector

To start a local PostgreSQL database with PGVector using Docker, refer to the docker-compose.yaml file and run the following command:

docker-compose up

You can connect to the database using psql with the following command:

psql -h localhost -p 5432 -U postgres -d apify

Or you can use pgAdmin to connect to the database:

docker run -e PGADMIN_DEFAULT_EMAIL=*@apify.com -e PGADMIN_DEFAULT_PASSWORD=root -p 8000:80 dpage/pgadmin4

LangChain uses the concept of collections to store data. Collections help to separate data for different projects or use cases. For each collection, two tables are created: langchain_pg_embedding and langchain_pg_collection. The langchain_pg_embedding table stores the embeddings, page_content, and associated metadata. The langchain_pg_collection table stores the list of collections. LangChain will automatically create these tables when the first embedding is saved to a collection.
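
Once data has been saved, the collection can also be queried back through LangChain. The following is a sketch, assuming the langchain-postgres and langchain-openai packages and the connection details used above:

from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector

# Open the existing collection (langchain-postgres expects an SQLAlchemy-style connection string).
store = PGVector(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="apify-collection",
    connection="postgresql+psycopg://postgres:password@localhost:5432/apify",
)

for doc, score in store.similarity_search_with_score("What is Apify?", k=3):
    print(round(score, 3), doc.metadata.get("title"), doc.page_content[:80])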
