Pinecone Integration

apify/pinecone-integration

Developed and maintained by Apify

This integration transfers data from Apify Actors to a Pinecone vector database and is a good starting point for a question-answering, search, or RAG use case.


Max token length...

Open · team2 opened this issue a month ago

I can't push big databases; it seems like the data is treated as one long chunk rather than a lot of smaller chunks. Hope there is a solution!

TE

team2

a month ago

For more information, I get this error:

Failed to update the database. Please ensure the following:

  • The database is configured properly.
  • The vector dimension of your embedding model in the Actor input (Embedding settings → model) matches the one set up in the database.

Error message: Error code: 400 - {'error': {'message': 'Requested 1,073,396 tokens, max 600,000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}

So, when I push a database that is larger than 600,000 tokens, it fails. Not just one chunk—even if all chunks are 500 tokens long, if the total exceeds 600,000 tokens, it fails.

I wonder why? Who sets this limit? And is it possible to either:

  • Create a workaround so that after 600k tokens, it creates a new request?
  • Increase the limit?

And another thing: Instead of updating chunks, why is it not possible to just retrieve all chunks with matching URLs, push new ones, and delete the old ones? Would this not minimize the loading cost on Apify and reduce costs on Pinecone, where requesting so many chunks is quite expensive? In this case, we would only request the URL.
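As a rough illustration of the workaround asked about here, one could group the chunks into several embedding requests so that no single request exceeds the 600,000-token cap from the error above. This is only a sketch: the tiktoken tokenizer and the helper below are illustrative assumptions, not part of the integration.

# Sketch only: group chunks so that each embedding request stays under the
# 600,000-token cap reported in the error above. tiktoken and the helper name
# are illustrative assumptions, not the integration's implementation.
import tiktoken

MAX_TOKENS_PER_REQUEST = 600_000
enc = tiktoken.get_encoding("cl100k_base")

def batches_by_token_budget(chunks, budget=MAX_TOKENS_PER_REQUEST):
    """Yield lists of chunk texts whose summed token counts fit within the budget."""
    batch, used = [], 0
    for text in chunks:
        n = len(enc.encode(text))
        if batch and used + n > budget:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch

# Each yielded batch can then be embedded in its own request.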

team2 · a month ago

????

responsible_box · a month ago

:)

responsible_box · a month ago

:(

jiri.spilka

Hi,

Sorry for the late response—I’ve been extremely busy with high-priority tasks over the past few days.

I looked into the issue, and it is partially related to this: Issue Link.

We are using the official Pinecone implementation.

For OpenAI embeddings, use pool_threads>4 when constructing the pinecone.Index,
embedding_chunk_size>1000 and batch_size~64 for best performance.

Args:
    texts: Iterable of strings to add to the vectorstore.
    metadatas: Optional list of metadatas associated with the texts.
    ids: Optional list of ids to associate with the texts.
    namespace: Optional pinecone namespace to add the texts to.
    batch_size: Batch size to use when adding the texts to the vectorstore.
    embedding_chunk_size: Chunk size to use when embedding the texts.
    async_req: Whether runs asynchronously.
    id_prefix: Optional string to use as an ID prefix when upserting vectors.
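To make those parameters concrete, here is a rough sketch of how the LangChain Pinecone vector store can be driven with them. It is not the Actor's actual code; the index name, API key, embedding model, and sample data are placeholders.

# Rough sketch (not the Actor's code) of the parameters described above,
# using the LangChain Pinecone vector store with OpenAI embeddings.
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key="<PINECONE_API_KEY>")
index = pc.Index("<INDEX_NAME>", pool_threads=8)   # pool_threads > 4, as recommended

vectorstore = PineconeVectorStore(
    index=index,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),  # placeholder model
)

chunk_texts = ["example chunk 1", "example chunk 2"]        # placeholder data
chunk_metadatas = [{"url": "https://example.com"}] * 2      # placeholder metadata

# A smaller embedding_chunk_size keeps each embedding request below the
# provider's per-request token limit; batch_size controls the upsert batches.
vectorstore.add_texts(
    texts=chunk_texts,
    metadatas=chunk_metadatas,
    batch_size=64,
    embedding_chunk_size=500,
)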

I can expose these parameters and will implement the fix on Monday.

Apologies for the inconvenience.

responsible_box · a month ago

Thanks, my Pinecone read and write is getting quite expensive. Is there a way to request and push less? I am often only changing a few URLs out of 700 or so.

responsible_box · a month ago

Looking forward to seeing this :)

jiri.spilka

I have it almost ready. I'm currently in the process of correcting the unit tests and expect to have it completed by tonight. I will keep you updated on the progress.

jiri.spilka

Hi,

Sorry for the delayed response.

It took a bit longer as I decided to implement this functionality:

And another thing: Instead of updating chunks, why is it not possible to just retrieve all chunks with matching URLs, push new ones, and delete the old ones? Would this not minimize the loading cost on Apify and reduce costs on Pinecone, where requesting so many chunks is quite expensive? In this case, we would only request the URL.

This is now available in Beta 0.0.59 with the following changes:

  • embeddingBatchSize (Pinecone only) – Batch size for embedding texts. Default: 1000, Minimum: 1.
  • usePineconeIdPrefix (Pinecone only) – Optimizes delta updates using a Pinecone ID prefix (item_id#chunk_id) when enableDeltaUpdates is true. Works only when the database is empty.
  • New parameter dataUpdatesStrategy:
    • Replaces enableDeltaUpdates.
    • Automatically set to deltaUpdates if enableDeltaUpdates = true.
    • Options: deltaUpdates, add, or upsert.
  • Renamed deltaUpdatesPrimaryDatasetFields → dataUpdatesPrimaryDatasetFields:
    • Automatically migrated if the old field is present.
  • Backward Compatibility:
    • Supports legacy enableDeltaUpdates mappings and deltaUpdatesPrimaryDatasetFields.
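For reference, a run input using the new fields might look as follows when calling the Actor through the Apify API client. Only embeddingBatchSize, usePineconeIdPrefix, dataUpdatesStrategy, and dataUpdatesPrimaryDatasetFields are taken from the list above; the remaining keys and values are illustrative placeholders, not confirmed field names.

# Sketch of a run input using the Beta 0.0.59 fields listed above.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run_input = {
    "embeddingBatchSize": 500,                   # smaller batches avoid the per-request token limit
    "usePineconeIdPrefix": True,                 # item_id#chunk_id prefix; empty database only
    "dataUpdatesStrategy": "deltaUpdates",       # or "add" / "upsert"
    "dataUpdatesPrimaryDatasetFields": ["url"],  # fields that uniquely identify a dataset item
    # Placeholder connection settings, not confirmed field names:
    "pineconeApiKey": "<PINECONE_API_KEY>",
    "pineconeIndexName": "<INDEX_NAME>",
}

client.actor("apify/pinecone-integration").call(run_input=run_input)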

I have also updated the documentation:

Configure update strategy

To control how the integration updates data in the database, use the dataUpdatesStrategy parameter. This parameter allows you to choose between different update strategies based on your use case, such as adding new data, upserting records, or incrementally updating records based on changes (deltas). Below are the available strategies and explanations for when to use each:

  • Add data (add):

    • Appends new data to the database without checking for duplicates or updating existing records.
    • Suitable for cases where deduplication or updates are unnecessary, and the data simply needs to be added.
    • For example, you might use this strategy to continually append data from independent crawls without regard for overlaps.
  • Upsert data (upsert):

    • Updates existing records in the database if they match a key or identifier and inserts new records if they don’t already exist.
    • Ideal when you want to maintain accurate and up-to-date data while avoiding duplication.
    • For instance, this is useful in cases where unique items (such as user profiles or documents) need to be managed, ensuring the database reflects the latest changes.
    • Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
  • Delta updates (deltaUpdates):

    • Incrementally updates records by identifying differences (deltas) between the new dataset and the existing database records.
    • Ensures only new or modified records are processed, leaving unchanged records untouched. This minimizes unnecessary database operations and improves efficiency.
    • This is the most efficient strategy when integrating data that evolves over time, such as website content or recurring crawls.
    • Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
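To illustrate the idea behind the ID-prefix option mentioned earlier (usePineconeIdPrefix), here is a sketch of how the chunks of a single changed item can be replaced without fetching every chunk in the index. It assumes a serverless Pinecone index (prefix listing is only available there) and placeholder embeddings; it is not the Actor's implementation.

# Sketch of prefix-based replacement (not the Actor's code): with vector IDs
# shaped like "item_id#chunk_id", the stale chunks of one changed item can be
# listed by prefix, deleted, and re-upserted, instead of querying all chunks.
from pinecone import Pinecone

pc = Pinecone(api_key="<PINECONE_API_KEY>")
index = pc.Index("<INDEX_NAME>")

item_id = "https://example.com/page"                      # e.g. the item's URL
dimension = 1536                                          # placeholder index dimension
new_embeddings = [[0.0] * dimension, [0.1] * dimension]   # placeholder vectors

# Delete every existing chunk of this item (serverless indexes support prefix listing).
for id_batch in index.list(prefix=f"{item_id}#"):
    index.delete(ids=id_batch)

# Upsert the re-embedded chunks under the same prefix.
index.upsert(vectors=[
    {"id": f"{item_id}#{i}", "values": vec, "metadata": {"url": item_id}}
    for i, vec in enumerate(new_embeddings)
])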

I have tested this with my (small) Pinecone database and unit tests.

Please let me know if everything is working as expected, and I’ll proceed with the release.

Best,
Jiri

jiri.spilka avatar

Sorry for the long version—here's the TL;DR:

  • Set embeddingBatchSize to 500 or a smaller value.
  • Set dataUpdatesStrategy to 'upsert' to delete old entries and add new ones.
  • Important: Ensure dataUpdatesPrimaryDatasetFields is set up correctly.

jiri.spilka

Hi, were you able to try this? Thank you. Jiri

Pricing

Pay per usage

This Actor is paid per platform usage: the Actor itself is free to use, and you only pay for Apify platform usage.