OpenSearch Integration

Pricing

Pay per usage

Transfer data from Apify Actors to Amazon OpenSearch Service. This Actor is a good starting point for building question-answering systems, search functionality, or Retrieval-Augmented Generation (RAG) use cases.

Rating

4.4

(5)

Developer

Apify

Maintained by Apify

Actor stats

Bookmarked

Total users

Monthly active users

Last modified

9 months ago

OpenSearch URL

openSearchUrl (string, required)

The URL of the Amazon OpenSearch Service instance to connect to

AWS Access Key ID

awsAccessKeyId (string, required)

The AWS access key ID for the Amazon OpenSearch Service

AWS Secret Access Key

awsSecretAccessKey (string, required)

The AWS secret access key for the Amazon OpenSearch Service

OpenSearch Index Name

openSearchIndexName (string, required)

The name of the index in the Amazon OpenSearch Service where the data will be stored

Auto-create index

autoCreateIndex (boolean, optional)

When set to true, the integration will automatically create the index if it does not exist in the Amazon OpenSearch Service instance

Default value of this property is true

AWS Region

awsRegion (string, optional)

The AWS region where the Amazon OpenSearch Service instance is located

Default value of this property is "us-east-1"

AWS Service Name

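awsServiceName (enum, optional)

The AWS service name for the Amazon OpenSearch Service. Use "aoss" for Amazon OpenSearch Serverless collections and "es" for managed Amazon OpenSearch Service domains.

Value options:

"aoss": string
"es": string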

Default value of this property is "aoss"

Use SSL

useSsl (boolean, optional)

When set to true, the integration will use SSL to connect to the Amazon OpenSearch Service instance

Default value of this property is true

Verify SSL certificates

verifyCerts (boolean, optional)

When set to true, the integration will verify SSL certificates when connecting to the Amazon OpenSearch Service instance

Default value of this property is true

Use AWS4 authentication

useAWS4Auth (boolean, optional)

When set to true, the integration signs requests with AWS Signature Version 4 (AWS4) to authenticate against the Amazon OpenSearch Service instance.

Note: If you are connecting to an OpenSearch instance that is not hosted on AWS, set this to false. In this case, AWS credentials are ignored, but because awsAccessKeyId and awsSecretAccessKey are required fields, you still need to provide values for them; dummy values are fine.

Default value of this property is true
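
As a rough illustration of how the connection settings above fit together, the following sketch assembles the connection part of the Actor input as a Python dictionary. The helper name `build_connection_input`, the example URL, and the index name are hypothetical; only the field names, allowed values, and defaults are taken from the descriptions above.

```python
def build_connection_input(url, index_name, access_key_id=None, secret_access_key=None,
                           region="us-east-1", service_name="aoss", use_aws4_auth=True):
    """Assemble the connection-related fields of the Actor input (illustrative helper)."""
    if service_name not in ("aoss", "es"):
        raise ValueError("awsServiceName must be 'aoss' or 'es'")
    if use_aws4_auth and not (access_key_id and secret_access_key):
        raise ValueError("AWS credentials are needed when useAWS4Auth is true")
    return {
        "openSearchUrl": url,
        "openSearchIndexName": index_name,
        # Both credential fields are required by the input schema, so placeholder
        # values are sent when AWS4 authentication is disabled (they are ignored).
        "awsAccessKeyId": access_key_id or "dummy",
        "awsSecretAccessKey": secret_access_key or "dummy",
        "awsRegion": region,
        "awsServiceName": service_name,
        "useAWS4Auth": use_aws4_auth,
        # Documented defaults, spelled out for clarity.
        "autoCreateIndex": True,
        "useSsl": True,
        "verifyCerts": True,
    }


# Hypothetical self-hosted OpenSearch instance: no AWS signing, dummy credentials.
self_hosted = build_connection_input(
    "https://opensearch.example.com:9200", "apify-docs", use_aws4_auth=False
)
```

For an instance hosted on AWS, you would instead pass real credentials and leave `use_aws4_auth` at its default.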

Embeddings provider (as defined in the LangChain API)

embeddingsProvider (enum, required)

The provider used to generate embeddings.

Value options:

"OpenAI": string
"Cohere": string

Default value of this property is "OpenAI"

Configuration for embeddings provider

embeddingsConfig (object, optional)

Configure the parameters for the LangChain embedding class. Key points to consider:

- Typically, you only need to specify the model name. For example, for OpenAI, set the model name as {"model": "text-embedding-3-small"}.
- The vector size of your embeddings must match the size of the embeddings already stored in the database.
- Examples of embedding models:
  - OpenAI: text-embedding-3-small, text-embedding-3-large, etc.
  - Cohere: embed-english-v3.0, embed-multilingual-light-v3.0, etc.
- For more details about other parameters, refer to the LangChain documentation.
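
To make the vector-size requirement concrete, here is a minimal sketch that compares a model's default output dimension with the dimension of an existing index. The lookup table and the function `check_embedding_dimension` are illustrative and not part of the Actor; the dimensions are the defaults published by the providers for the models named above.

```python
# Default output dimensions of the example models (per the providers' documentation).
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-english-v3.0": 1024,
    "embed-multilingual-light-v3.0": 384,
}


def check_embedding_dimension(embeddings_config, index_dimension):
    """Return True/False if the model is known, or None if it cannot be checked."""
    model = embeddings_config.get("model")
    expected = KNOWN_DIMENSIONS.get(model)
    if expected is None:
        return None  # unknown model: nothing to compare against
    return expected == index_dimension


# An index created for 1536-dimensional vectors works with text-embedding-3-small
# but not with embed-english-v3.0.
ok = check_embedding_dimension({"model": "text-embedding-3-small"}, 1536)
mismatch = check_embedding_dimension({"model": "embed-english-v3.0"}, 1536)
```

If you switch models on an existing index, a check like this can flag the mismatch before any data is written.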

Embeddings API key (when applicable, depending on the provider)

embeddingsApiKey (string, required)

The API key for the embeddings provider (if the provider requires one).

For example, for OpenAI this is the value of OPENAI_API_KEY; for Cohere, the value of COHERE_API_KEY.

Dataset fields to select from the dataset results and store in the database

datasetFields (array, required)

This array specifies the dataset fields to be selected and stored in the vector store. Only the fields listed here will be included in the vector store.

For instance, when using the Website Content Crawler, you might choose to include fields such as text, url, and metadata.title in the vector store.

Default value of this property is ["text"]

Dataset fields to select from the dataset and store as metadata in the database

metadataDatasetFields (object, optional)

A mapping of dataset fields that should be selected from the dataset and stored as metadata in the vector store.

For example, when using the Website Content Crawler, you might want to store url in metadata. In this case, set the metadataDatasetFields parameter to {"url": "url"}.

Custom object to be stored as metadata in the vector store database

metadataObject (object, optional)

This object allows you to store custom metadata for every item in the vector store.

For example, if you want to store the domain as metadata, use the metadataObject like this: {"domain": "apify.com"}.
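
The three settings above (datasetFields, metadataDatasetFields, and metadataObject) can be pictured with the following sketch, which turns one dataset item into a text-plus-metadata record. The helpers `get_nested` and `build_document` are hypothetical, and the way the selected fields are joined into a single text is an assumption; the Actor's actual formatting may differ.

```python
def get_nested(item, dotted_path):
    """Return item["a"]["b"] for the path "a.b", or None if any part is missing."""
    value = item
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_document(item, dataset_fields, metadata_dataset_fields=None, metadata_object=None):
    """Combine selected fields into text and collect metadata (illustrative only)."""
    # Assumption: selected fields are joined as "field: value" lines.
    parts = []
    for field in dataset_fields:
        value = get_nested(item, field)
        if value is not None:
            parts.append(f"{field}: {value}")
    # metadataDatasetFields maps a metadata key to a dataset field path.
    metadata = {
        name: get_nested(item, path)
        for name, path in (metadata_dataset_fields or {}).items()
    }
    # metadataObject is copied unchanged onto every record.
    metadata.update(metadata_object or {})
    return {"text": "\n".join(parts), "metadata": metadata}


item = {
    "url": "https://apify.com",
    "text": "Apify is a platform.",
    "metadata": {"title": "Apify"},
}
doc = build_document(item, ["text", "metadata.title"], {"url": "url"}, {"domain": "apify.com"})
```

Here `doc["metadata"]` holds both the per-item url and the constant domain value.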

Dataset ID

datasetId (string, optional)

The ID of the dataset to read from. Needed only when running the Actor standalone rather than as an integration.

Update strategy (add, upsert, deltaUpdates (default))

dataUpdatesStrategy (enum, optional)

Choose the update strategy for the integration. The update strategy determines how the integration updates the data in the database.

The available options are:

Add data (add):
- Always adds new records to the database.
- No checks for existing records or updates are performed.
- Useful when appending data without concern for duplicates.
Upsert data (upsert):
- Updates existing records if they match a key or identifier.
- Inserts new records into the database if they don't already exist.
- Ideal for ensuring the database contains the most up-to-date data, avoiding duplicates.
Update changed data based on deltas (deltaUpdates):
- Performs incremental updates by identifying differences (deltas) between the new dataset and the existing records.
- Only adds new records and updates those that have changed.
- Unchanged records are left untouched.
- Maximizes efficiency by reducing unnecessary updates.

Select the strategy that best fits your use case.

Value options:

"add": string
"upsert": string
"deltaUpdates": string

Default value of this property is "deltaUpdates"

Dataset fields to uniquely identify dataset items (only relevant when dataUpdatesStrategy is `upsert` or `deltaUpdates`)

dataUpdatesPrimaryDatasetFields (array, optional)

This array contains fields that are used to uniquely identify dataset items, which helps to handle content changes across different runs.

For instance, in a web content crawling scenario, the url field could serve as a unique identifier for each item.

Default value of this property is ["url"]
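
The difference between the three update strategies can be sketched with an in-memory dictionary standing in for the index. This is a simplified model under stated assumptions, not the Actor's implementation: the functions `item_id`, `checksum`, and `apply_updates` are hypothetical, and real records would also carry embeddings and chunk information.

```python
import hashlib
import json


def item_id(item, primary_fields):
    """Derive a stable ID from the primary fields (e.g. ["url"])."""
    key = json.dumps([item.get(field) for field in primary_fields])
    return hashlib.sha256(key.encode()).hexdigest()


def checksum(item):
    """Fingerprint of the item content, used to detect changes."""
    return hashlib.sha256(json.dumps(item, sort_keys=True).encode()).hexdigest()


def apply_updates(store, items, strategy="deltaUpdates", primary_fields=("url",)):
    """Apply items to store (a dict) and return the number of writes performed."""
    if strategy not in ("add", "upsert", "deltaUpdates"):
        raise ValueError(f"unknown strategy: {strategy}")
    writes = 0
    for item in items:
        if strategy == "add":
            # Never looks at existing records, so duplicates accumulate.
            store[f"auto-{len(store)}"] = item
            writes += 1
            continue
        key = item_id(item, primary_fields)
        unchanged = key in store and checksum(store[key]) == checksum(item)
        if strategy == "deltaUpdates" and unchanged:
            continue  # unchanged record is left untouched
        store[key] = item  # insert a new record or overwrite the existing one
        writes += 1
    return writes


page = {"url": "https://example.com", "text": "v1"}
delta_store = {}
first = apply_updates(delta_store, [page], "deltaUpdates")
second = apply_updates(delta_store, [page], "deltaUpdates")
third = apply_updates(delta_store, [{"url": "https://example.com", "text": "v2"}], "deltaUpdates")
```

In this model, upsert rewrites a matching record on every run, while deltaUpdates skips the write when the content fingerprint has not changed.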

Enable incremental updates for objects based on deltas (deprecated)

enableDeltaUpdates (boolean, optional)

When set to true, this setting enables incremental updates for objects in the database by comparing the changes (deltas) between the crawled dataset items and the existing objects, uniquely identified by the deltaUpdatesPrimaryDatasetFields field.

The integration will only add new objects and update those that have changed, reducing unnecessary updates. The datasetFields, metadataDatasetFields, and metadataObject fields are used to determine the changes.

Default value of this property is true

Dataset fields to uniquely identify dataset items (only relevant when `enableDeltaUpdates` is enabled) (deprecated)

deltaUpdatesPrimaryDatasetFields (array, optional)

This array contains fields that are used to uniquely identify dataset items, which helps to handle content changes across different runs.

For instance, in a web content crawling scenario, the url field could serve as a unique identifier for each item.

Default value of this property is ["url"]

Delete expired objects from the database

deleteExpiredObjects (boolean, optional)

When set to true, the integration deletes objects from the database that have not been crawled for the period set in expiredObjectDeletionPeriodDays.

Default value of this property is true

Delete expired objects from the database after a specified number of days

expiredObjectDeletionPeriodDays (integer, optional)

This setting controls the deletion of objects from the database that have not been crawled for a specified period. It is typically used in subsequent runs after the initial crawl.

When the value is greater than 0, the integration checks whether each object has been seen within the last X days, where X is the value of expiredObjectDeletionPeriodDays. Objects that have not been seen within that period are considered expired and are deleted from the database. The right value depends on your use case and how frequently you crawl data.

For example, if you crawl data daily, you can set expiredObjectDeletionPeriodDays to 7. If you crawl data weekly, you can set it to 30.

Default value of this property is 30
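
The expiration check described above amounts to comparing each object's last-seen time with a cutoff. The sketch below assumes each stored object carries a last-seen timestamp; the function `find_expired` and the shape of its input are hypothetical and only illustrate the rule.

```python
from datetime import datetime, timedelta, timezone


def find_expired(last_seen_by_id, period_days, now=None):
    """Return the IDs of objects not seen within the last period_days days."""
    if period_days <= 0:
        return []  # the check applies only when the value is greater than 0
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=period_days)
    return [obj_id for obj_id, seen in last_seen_by_id.items() if seen < cutoff]


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
last_seen = {
    "a": datetime(2024, 6, 29, tzinfo=timezone.utc),
    "b": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "c": datetime(2024, 6, 10, tzinfo=timezone.utc),
}
expired_30 = find_expired(last_seen, 30, now)  # only "b" is older than 30 days
expired_7 = find_expired(last_seen, 7, now)    # "b" and "c" are older than 7 days
```

A shorter period removes stale objects sooner, but it also risks deleting pages that a less frequent crawl simply has not revisited yet.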

Enable text chunking

performChunking (boolean, optional)

When set to true, the text is divided into smaller chunks based on the chunk size and chunk overlap settings below. Chunking lets retrieval return only the relevant passages instead of whole documents.

Default value of this property is true

Maximum chunk size

chunkSize (integer, optional)

Defines the maximum number of characters in each text chunk. Larger chunks keep more context together, while smaller chunks make retrieval more precise; choose a size that balances the two for your data.

Default value of this property is 2000

Chunk overlap

chunkOverlap (integer, optional)

Specifies the number of overlapping characters between consecutive text chunks. A non-zero overlap repeats the end of one chunk at the start of the next, which helps preserve context that would otherwise be cut at a chunk boundary.

Default value of this property is 0
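
To show how chunkSize and chunkOverlap interact, here is a minimal fixed-window chunker. It is a simplified sketch: the function `chunk_text` is hypothetical, and the Actor's actual splitter may also break on separators such as paragraphs or sentences rather than at fixed character offsets.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters with the given overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


no_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# no_overlap   -> ["abcd", "efgh", "ij"]
# with_overlap -> ["abcd", "defg", "ghij"]
```

With an overlap of 1, the last character of each chunk reappears as the first character of the next one.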
