OpenAI Vector Store Integration

Developed by

Jiří Spilka

The Apify OpenAI Vector Store integration uploads data from Apify Actors to the OpenAI Vector Store linked to OpenAI Assistant.

Last modified

6 months ago

Integrations

Open source

OpenAI Vector Store Integration (OpenAI Assistant)

The Apify OpenAI Vector Store integration uploads data from Apify Actors to the OpenAI Vector Store (connected to the OpenAI Assistant). It assumes that you have already created an OpenAI Vector Store and that you need to update its files regularly so the Assistant can provide up-to-date responses.

💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate the Vector Store Files integration to save web content (including DOCX, PPTX, PDF, and other files) for your OpenAI Assistant.

Is there anything you find unclear or missing? Please don't hesitate to inform us by creating an issue.

You can easily run the OpenAI Vector Store Integration on the Apify platform.

Read a detailed guide in the documentation or in the blog post How we built an enterprise support assistant using OpenAI and the Apify platform.

֎ How does OpenAI Assistant Integration work?

Data for the Vector Store and Assistant are provided by various Apify Actors and can include web content, DOCX, PDF, PPTX, and other files.

[Figure: Apify-OpenAI Vector Store integration]

The integration process includes:

Loading data from an Apify Actor
Processing the data to comply with OpenAI Assistant limits (max. 1,000 files, max. 5,000,000 tokens)
Creating OpenAI Files
[Optional] Removing existing files from the Vector Store (specified by fileIdsToDelete and/or filePrefix)
Adding the newly created files to the Vector Store
[Optional] Deleting existing files from OpenAI Files (specified by fileIdsToDelete and/or filePrefix)
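
The processing step above has to keep each file within the token limit. The following is a minimal sketch of how an oversized text could be split into smaller segments. It is not the Actor's actual implementation: the function names are hypothetical, and the word-based token count is only a rough stand-in for the model tokenizer that the Actor uses.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # A real tokenizer will usually report a different count.
    return len(text.split())


def split_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Split text into consecutive segments, each within max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    segments: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        # Close the current segment when adding the word would exceed the budget.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            segments.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        segments.append(" ".join(current))
    return segments
```

Passing a different `count_tokens` callable lets you swap in a real tokenizer without changing the splitting logic.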

💰 How much does it cost?

Find the average usage cost for this Actor on the pricing page under the Which plan do I need? section. Additional costs are associated with the use of OpenAI Assistant. Please refer to their pricing for details.

Since the integration is designed to upload the entire dataset as an OpenAI file, the cost is minimal, typically less than $0.01 per run.

✅ Before you start

To use this integration, ensure you have:

An OpenAI account and an OpenAI API key. Create a free account at OpenAI.
Created an OpenAI Vector Store. You will need vectorStoreId to run this integration.
[Optional] Created an OpenAI Assistant.

➡️ Inputs

Refer to input schema for details.

vectorStoreId: OpenAI Vector Store ID.
openaiApiKey: OpenAI API key.
assistantId: The ID of an OpenAI Assistant. This parameter is required only when a file exceeds the OpenAI size limit of 5,000,000 tokens (as of 2024-04-23). When necessary, the model associated with the assistant is used to count tokens and split the large file into smaller, manageable segments.
datasetFields: Array of dataset fields you want to save, e.g., ["url", "text", "metadata.title"].
filePrefix: Prefix for the files the integration creates. Existing files with this prefix are deleted and new ones are created, which streamlines vector store updates.
fileIdsToDelete: IDs of files to delete from the vector store.
datasetId: [Debug] Apify's Dataset ID (when running the Actor standalone, without integration).
keyValueStoreId: [Debug] Apify's Key-Value Store ID (when running the Actor standalone, without integration).
saveInApifyKeyValueStore: [Debug] Save all created files in the Apify Key-Value Store so you can easily check and retrieve them (typically used when debugging).
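
The datasetFields example above includes a dotted name, "metadata.title", which points at a nested field. The sketch below shows one way such dotted paths could be resolved against a dataset item. The helper names are hypothetical, and returning None for a missing field is an assumption; the Actor may handle missing fields differently.

```python
from typing import Any, Dict, List


def get_nested(item: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as "metadata.title" through nested dicts."""
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # Assumption: missing fields resolve to None.
        current = current[key]
    return current


def select_fields(item: Dict[str, Any], dataset_fields: List[str]) -> Dict[str, Any]:
    """Keep only the requested fields, keyed by their dotted path."""
    return {field: get_nested(item, field) for field in dataset_fields}


item = {"url": "https://example.com", "metadata": {"title": "Example"}}
selected = select_fields(item, ["url", "metadata.title"])
```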

⬅️ Outputs

This integration saves the selected datasetFields from your Actor to the OpenAI Vector Store and, optionally, to the Apify Key-Value Store (useful for debugging).

💾 Save data from Website Content Crawler to OpenAI Vector Store

To use this integration, you need an OpenAI account and an OpenAI API key. Additionally, you need to create an OpenAI Vector Store (vectorStoreId).

The Website Content Crawler can deeply crawl websites and save web page content to Apify's dataset. It also stores files such as PDFs, PPTXs, and DOCXs. A typical run crawling https://platform.openai.com/docs/assistants/overview includes the following dataset fields (truncated for brevity):

[
  {
    "url": "https://platform.openai.com/docs/assistants/overview",
    "text": "Assistants overview - OpenAI API\nThe Assistants API allows you to build AI assistants within your own applications ..."
  },
  {
    "url": "https://platform.openai.com/docs/assistants/overview/step-1-create-an-assistant",
    "text": "Assistants overview - OpenAI API\n An Assistant has instructions and can leverage models, tools, and files to respond to user queries ..."
  }
]

Once you have the dataset, you can store the data in the OpenAI Vector Store. Specify which fields you want to save to the OpenAI Vector Store, e.g., ["text", "url"].

{
  "assistantId": "YOUR-ASSISTANT-ID",
  "datasetFields": ["text", "url"],
  "openaiApiKey": "YOUR-OPENAI-API-KEY",
  "vectorStoreId": "YOUR-VECTOR-STORE-ID"
}
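
If you run the Actor standalone rather than as an integration, the same input can be sent with the Apify Python client. The sketch below makes several assumptions: the Actor ID is inferred and should be checked on the Actor's page, the helper names are hypothetical, and datasetId is passed explicitly because no upstream Actor provides the dataset.

```python
from typing import Any, Dict, List, Optional

# Assumption: the Actor's ID on Apify Store; verify it before use.
ACTOR_ID = "jiri.spilka/openai-vector-store-integration"


def build_run_input(
    vector_store_id: str,
    openai_api_key: str,
    dataset_fields: List[str],
    dataset_id: str,
    assistant_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the Actor input using the parameter names from the Inputs section."""
    run_input: Dict[str, Any] = {
        "vectorStoreId": vector_store_id,
        "openaiApiKey": openai_api_key,
        "datasetFields": dataset_fields,
        "datasetId": dataset_id,
    }
    # assistantId is only needed when a file exceeds the token limit.
    if assistant_id is not None:
        run_input["assistantId"] = assistant_id
    return run_input


def run_integration(apify_token: str, run_input: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Start the Actor and wait for it to finish; returns the run details."""
    # Imported here so the helper above works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(apify_token)
    return client.actor(ACTOR_ID).call(run_input=run_input)
```

For example, `run_integration("YOUR-APIFY-TOKEN", build_run_input("YOUR-VECTOR-STORE-ID", "YOUR-OPENAI-API-KEY", ["text", "url"], "YOUR-DATASET-ID"))` would start a single run.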

🔄 Update existing files in the OpenAI Vector Store

There are two ways to update existing files in the OpenAI Vector Store. You can either delete all files with a specific prefix or delete specific files by their IDs. It is more convenient to use the filePrefix parameter to delete and create files with the same prefix. In the first run, the integration will save all the files with the prefix openai_assistant_. In the next run, it will delete all the files with the prefix openai_assistant_ and create new files.

The settings for the integration are as follows:

{
  "assistantId": "YOUR-ASSISTANT-ID",
  "datasetFields": ["text", "url"],
  "filePrefix": "openai_assistant_",
  "openaiApiKey": "YOUR-OPENAI-API-KEY",
  "vectorStoreId": "YOUR-VECTOR-STORE-ID"
}
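
To make the two update options concrete, here is a small sketch of how files could be selected for deletion by prefix, by ID, or both. It illustrates the idea only and is not the Actor's code: the function name is hypothetical, and the file records are assumed to be plain dicts with "id" and "filename" keys.

```python
from typing import Dict, Iterable, List, Optional


def select_files_to_delete(
    existing_files: Iterable[Dict[str, str]],
    file_prefix: Optional[str] = None,
    file_ids_to_delete: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return IDs of files matched by ID or by filename prefix."""
    ids = set(file_ids_to_delete or [])
    selected: List[str] = []
    for f in existing_files:
        matches_id = f["id"] in ids
        # An empty or missing prefix matches nothing, so no file is deleted by accident.
        matches_prefix = bool(file_prefix) and f["filename"].startswith(file_prefix)
        if matches_id or matches_prefix:
            selected.append(f["id"])
    return selected


files = [
    {"id": "file-1", "filename": "openai_assistant_0.json"},
    {"id": "file-2", "filename": "manual_upload.pdf"},
    {"id": "file-3", "filename": "openai_assistant_1.json"},
]
stale = select_files_to_delete(files, file_prefix="openai_assistant_")
```

With the prefix approach, each run only has to know the prefix it used before, whereas the ID approach requires tracking the IDs returned by earlier runs.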

📦 Save Amazon Products to OpenAI Vector Store

You can also save Amazon products to the OpenAI Vector Store. Again, you need an OpenAI account, an OpenAI API key, and an existing OpenAI Vector Store (vectorStoreId).

To scrape Amazon products, you can use the Amazon Product Scraper Actor.

Let's say that you want to scrape "Apple Watch" and store all the scraped data in the OpenAI Vector Store. For the search URL https://www.amazon.com/s?k=apple+watch, the scraper can yield the following results (truncated for brevity):

[
  {
    "title": "Apple Watch Ultra 2 [GPS + Cellular 49mm] Smartwatch with Rugged Titanium Case ....",
    "asin": "B0CSVGK51Y",
    "brand": "Apple",
    "stars": 4.7,
    "reviewsCount": 357,
    "thumbnailImage": "https://m.media-amazon.com/images/I/81pjcQFaDJL.__AC_SY445_SX342_QL70_FMwebp_.jpg",
    "price": {
      "value": 794,
      "currency": "$"
    },
    "url": "https://www.amazon.com/dp/B0CSVGK51Y"
  }
]

You can easily save the data to the OpenAI Vector Store by creating an integration (in the Amazon Product Scraper integration section) and specifying the fields you want to save:

{
  "assistantId": "YOUR-ASSISTANT-ID",
  "datasetFields": ["title", "brand", "stars", "reviewsCount", "thumbnailImage", "price.value", "price.currency", "url"],
  "openaiApiKey": "YOUR-OPENAI-API-KEY",
  "vectorStoreId": "YOUR-VECTOR-STORE-ID"
}

ⓘ Limitations

Crawled files, such as PDFs, PPTXs, and DOCXs, are saved in the OpenAI Vector Store as single files and uploaded one by one. While this approach is inefficient, it allows for better error handling and the ability to log detailed error messages.
OpenAI can process text-based PDF files but cannot handle image-only or scanned PDFs. For those, you need to run OCR first to extract the text.
