OpenAI Vector Store Integration
The Apify OpenAI Vector Store integration uploads data from Apify Actors to the OpenAI Vector Store linked to an OpenAI Assistant.
Hello,
I've been using the OpenAI Vector Store Integration actor to process a large dataset of scraped pages from a previous Apify run. The dataset contains over 9,000 items, but when the actor attempts to create the vector store batch, it encounters an error due to exceeding the maximum allowed file_ids array length.
It appears that the actor tries to send all the file_ids in a single request to OpenAI's API, exceeding the limit of 500 file_ids per request. As a result, the actor fails to process the entire dataset, and only a portion of the data is inserted into the vector store.
To reproduce the issue, run a web scraping task that collects a large number of results (e.g., over 9,000 items), then use the OpenAI Vector Store Integration actor to process the dataset. The actor fails with the error described above.
The actor should handle large datasets by batching the file_ids into chunks of 500 or fewer when making the create_and_poll API requests to OpenAI. This would comply with OpenAI's API limitations and allow processing of large datasets without encountering the "array too long" error.
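A minimal sketch of the batching suggested above, in Python. It is not the Actor's actual code: `chunked`, `create_batches`, and the `create_batch` callable are hypothetical names, and `create_batch` stands in for whatever wraps the `create_and_poll` request for one vector store. The 500 limit is the one reported in this thread.

```python
from typing import Callable, Iterator, Sequence

# Limit reported in this thread; treat it as an assumption and adjust if needed.
MAX_FILE_IDS_PER_REQUEST = 500


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_batches(
    file_ids: Sequence[str],
    create_batch: Callable[[list], object],
    batch_size: int = MAX_FILE_IDS_PER_REQUEST,
) -> list:
    """Call `create_batch` once per chunk and collect the results in order.

    `create_batch` is a hypothetical callable that sends one request
    (for example, a wrapper around create_and_poll) for the given file IDs.
    """
    return [create_batch(list(chunk)) for chunk in chunked(file_ids, batch_size)]
```

With 9,001 file IDs this would issue 19 requests: 18 with 500 IDs each and one with the remaining ID.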
Run URL: https://console.apify.com/actors/runs/paRZQeERG1bHj3CqQ
Run URL Log: https://api.apify.com/v2/logs/paRZQeERG1bHj3CqQ
Hi, thank you for using the OpenAI Integration!
And thank you for the excellent explanations and examples—they were very helpful in quickly identifying the issue.
The fix is straightforward. I’ll bundle it with a few other changes I’ve been planning, test it, and release it tomorrow.
Hi, I’ve implemented the changes, and the Actor can now handle batch operations. However, during testing on my crawl, the files were created, but attaching them to the vector store failed without a clear reason.
Here’s my log:
VectorStoreFileBatch(id='vsfb_d3d0bdb8cf1f4a8987514d91b1208e84', created_at=1732652119, file_counts=FileCounts(cancelled=0, completed=203, failed=297)
I’m afraid you might encounter a similar issue. I’ll need to investigate this further. My apologies for the inconvenience.
Hi, thank you again for pointing out this issue.
I was able to fix it. The problem was that OpenAI doesn’t handle large batches of files well. I had to reduce the batch size to 100 to avoid many failures.
However, I still couldn’t determine the exact reason for some of the failures. To address this, I modified the code to upload files to the vector store one by one. It turns out that OpenAI cannot process PDF files that are represented as images (e.g., scanned PDFs). Only text-based PDFs can be added to the vector store.
In the latest version (0.0.38), you now get detailed output that allows you to examine which files failed to upload to the vector store.
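For illustration only, here is a rough Python sketch of the one-by-one approach with per-file reporting. It is not the Actor's implementation: `UploadReport`, `attach_one_by_one`, and the `attach_file` callable are hypothetical names, and `attach_file` is assumed to raise an exception when a file cannot be added to the vector store.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class UploadReport:
    """Per-file outcome of attaching files to a vector store."""
    completed: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # file ID -> error message


def attach_one_by_one(
    file_ids: Iterable[str],
    attach_file: Callable[[str], None],
) -> UploadReport:
    """Attach each file separately so one failure does not hide the others.

    `attach_file` is a hypothetical callable that attaches a single file
    and raises an exception if the file cannot be processed.
    """
    report = UploadReport()
    for file_id in file_ids:
        try:
            attach_file(file_id)
        except Exception as exc:  # keep going and record the reason
            report.failed[file_id] = str(exc)
        else:
            report.completed.append(file_id)
    return report
```

The cost is one request per file, which matches the slower uploads mentioned in this thread, but the report shows exactly which files failed and why.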
The trade-off is that the upload to the OpenAI Vector Store is now slower than before.
I hope this helps.
I’ll go ahead and close this issue. If you encounter any further difficulties or have additional questions, feel free to reach out.
Actor Metrics
30 monthly users
-
7 stars
78% runs succeeded
2.8 days response time
Created in Apr 2024
Modified 15 days ago