PGVector Integration

apify/pgvector-integration
This integration transfers data from Apify Actors to a PostgreSQL database with the pgvector extension.

PostgreSQL connection string

postgresSqlConnectionStr (string, Required)

Connection string for the PostgreSQL database, in the format postgresql://user:password@host:port/database.
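
For example, a connection string with placeholder credentials (host, user, password, and database name are illustrative, not real values) might look like:

    postgresql://postgres:password@localhost:5432/apify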

PostgreSQL collection name

postgresCollectionName (string, Required)

The name of the collection to use. Note: this is not the name of a table, but the name of the collection.

Embeddings provider (as defined in the LangChain API)

embeddingsProvider (Enum, Required)

Choose the embeddings provider to use for generating embeddings.

Value options:

  • "OpenAI"
  • "Cohere"

Default value of this property is "OpenAI"

Configuration for embeddings provider

embeddingsConfig (object, Optional)

Configure the parameters for the LangChain embedding class. Key points to consider:

  1. Typically, you only need to specify the model name. For example, for OpenAI, set the model name as {"model": "text-embedding-3-small"}.

  2. It's crucial to ensure that the vector size of your embeddings matches the size of embeddings in the database.

  3. Here are some examples of embedding models:

    • OpenAI: text-embedding-3-small, text-embedding-3-large, etc.
    • Cohere: embed-english-v3.0, embed-multilingual-light-v3.0, etc.
  4. For more details about other parameters, refer to the LangChain documentation.
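
For example, a minimal embeddingsConfig for the OpenAI provider might look like the sketch below (the model name is illustrative; make sure its vector size matches the embeddings already stored in your database):

    {
      "model": "text-embedding-3-small"
    }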

Embeddings API key (whenever applicable, depending on the provider)

embeddingsApiKey (string, Required)

Value of the API key for the embeddings provider (if required).

For example, for OpenAI it is OPENAI_API_KEY; for Cohere it is COHERE_API_KEY.

Dataset fields to select from the dataset results and store in the database

datasetFields (array, Required)

This array specifies the dataset fields to be selected and stored in the vector store. Only the fields listed here will be included in the vector store.

For instance, when using the Website Content Crawler, you might choose to include fields such as text, url, and metadata.title in the vector store.

Default value of this property is ["text"]
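
For example, to store the page text together with its URL and title from a Website Content Crawler dataset (field names follow that Actor's output), the array might look like:

    ["text", "url", "metadata.title"]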

Dataset fields to select from the dataset and store as metadata in the database

metadataDatasetFields (object, Optional)

An object specifying which dataset fields should be selected from the dataset and stored as metadata in the vector store, mapping each metadata key to a dataset field.

For example, when using the Website Content Crawler, you might want to store url in metadata. In this case, set the metadataDatasetFields parameter as follows: {"url": "url"}
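
To keep both the page URL and the page title as metadata (assuming the dataset contains url and metadata.title fields, as the Website Content Crawler produces), a sketch might be:

    {
      "url": "url",
      "title": "metadata.title"
    }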

Custom object to be stored as metadata in the vector store database

metadataObject (object, Optional)

This object allows you to store custom metadata for every item in the vector store.

For example, if you want to store the domain as metadata, use the metadataObject like this: {"domain": "apify.com"}.

Dataset ID

datasetId (string, Optional)

Dataset ID (when running standalone without integration)

Enable incremental updates for objects based on deltas

enableDeltaUpdates (boolean, Optional)

When set to true, this setting enables incremental updates for objects in the database by comparing the changes (deltas) between the crawled dataset items and the existing objects, uniquely identified by the fields listed in deltaUpdatesPrimaryDatasetFields.

The integration will only add new objects and update those that have changed, reducing unnecessary updates. The datasetFields, metadataDatasetFields, and metadataObject fields are used to determine the changes.

Default value of this property is true

Dataset fields to uniquely identify dataset items (only relevant when `enableDeltaUpdates` is enabled)

deltaUpdatesPrimaryDatasetFields (array, Optional)

This array contains fields that are used to uniquely identify dataset items, which helps to handle content changes across different runs.

For instance, in a web content crawling scenario, the url field could serve as a unique identifier for each item.

Default value of this property is ["url"]
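
A minimal sketch of the delta-update settings, assuming each crawled item is uniquely identified by its url field:

    {
      "enableDeltaUpdates": true,
      "deltaUpdatesPrimaryDatasetFields": ["url"]
    }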

Delete expired objects from the database after a specified number of days (only relevant when `enableDeltaUpdates` is enabled)

expiredObjectDeletionPeriodDays (integer, Optional)

This setting allows the integration to manage the deletion of objects from the database that have not been crawled for a specified period. It is typically used in subsequent runs after the initial crawl.

When the value is greater than 0, the integration checks whether objects have been seen within the last X days (determined by the expiration period). If the objects have expired, they are deleted from the database. The specific value of expiredObjectDeletionPeriodDays depends on your use case and how frequently you crawl data.

For example, if you crawl data daily, you can set expiredObjectDeletionPeriodDays to 7 days. If you crawl data weekly, you can set it to 30 days.

Setting expiredObjectDeletionPeriodDays to 0 disables this feature.

Default value of this property is 30
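
For instance, for a dataset that is re-crawled daily, the expiration period from the example above could be set as:

    {
      "expiredObjectDeletionPeriodDays": 7
    }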

Enable text chunking

performChunking (boolean, Optional)

When set to true, the text will be divided into smaller chunks based on the settings provided below. Proper chunking helps optimize retrieval and ensures accurate and efficient responses.

Default value of this property is false

Maximum chunk size

chunkSize (integer, Optional)

Defines the maximum number of characters in each text chunk. Choosing the right size balances between detailed context and system performance. Optimal sizes ensure high relevancy and minimal response time.

Default value of this property is 1000

Chunk overlap

chunkOverlap (integer, Optional)

Specifies the number of overlapping characters between consecutive text chunks. Adjusting this helps maintain context across chunks, which is crucial for accuracy in retrieval-augmented generation systems.

Default value of this property is 0
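
Putting the settings together, a complete Actor input might look like the following sketch. The connection string, collection name, and API key are placeholders, and the field selection assumes a Website Content Crawler dataset:

    {
      "postgresSqlConnectionStr": "postgresql://user:password@host:5432/database",
      "postgresCollectionName": "my-collection",
      "embeddingsProvider": "OpenAI",
      "embeddingsConfig": { "model": "text-embedding-3-small" },
      "embeddingsApiKey": "YOUR_OPENAI_API_KEY",
      "datasetFields": ["text"],
      "metadataDatasetFields": { "url": "url" },
      "enableDeltaUpdates": true,
      "deltaUpdatesPrimaryDatasetFields": ["url"],
      "expiredObjectDeletionPeriodDays": 30,
      "performChunking": true,
      "chunkSize": 1000,
      "chunkOverlap": 0
    }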

Developer
Maintained by Apify