Pinecone Integration

apify/pinecone-integration

Developed and maintained by Apify

This integration transfers data from Apify Actors to a Pinecone vector database and is a good starting point for a question-answering, search, or RAG use case.


Max token length...

Open · team2 opened this issue a month ago

I can't push big databases; it seems like the data is treated as one long chunk rather than a lot of smaller chunks. Hope there is a solution!

TE

team2

a month ago

For more information, I get this error:

Failed to update the database. Please ensure the following:

  • The database is configured properly.
  • The vector dimension of your embedding model in the Actor input (Embedding settings → model) matches the one set up in the database.

Error message: Error code: 400 - {'error': {'message': 'Requested 1,073,396 tokens, max 600,000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}

So, when I push a database that is larger than 600,000 tokens, it fails. Not just one chunk—even if all chunks are 500 tokens long, if the total exceeds 600,000 tokens, it fails.

I wonder why? Who sets this limit? And is it possible to either:

  • Create a workaround so that after 600k tokens, it creates a new request?
  • Increase the limit?

And another thing: Instead of updating chunks, why is it not possible to just retrieve all chunks with matching URLs, push new ones, and delete the old ones? Would this not minimize the loading cost on Apify and reduce costs on Pinecone, where requesting so many chunks is quite expensive? In this case, we would only request the URL.
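As a rough illustration of the workaround asked about here, one could group the chunks into several embedding requests so that no single request exceeds the 600,000-token cap from the error above. This is only a sketch: the tiktoken tokenizer and the helper below are illustrative assumptions, not part of the integration.

# Sketch only: group chunks so that each embedding request stays under the
# 600,000-token cap reported in the error above. tiktoken and the helper name
# are illustrative assumptions, not the integration's implementation.
import tiktoken

MAX_TOKENS_PER_REQUEST = 600_000
enc = tiktoken.get_encoding("cl100k_base")

def batches_by_token_budget(chunks, budget=MAX_TOKENS_PER_REQUEST):
    """Yield lists of chunk texts whose summed token counts fit within the budget."""
    batch, used = [], 0
    for text in chunks:
        n = len(enc.encode(text))
        if batch and used + n > budget:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch

# Each yielded batch can then be embedded in its own request.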

team2 · a month ago

????

responsible_box · a month ago

:)

responsible_box · a month ago

:(

jiri.spilka

Hi,

Sorry for the late response—I’ve been extremely busy with high-priority tasks over the past few days.

I looked into the issue, and it is partially related to this: Issue Link.

We are using the official Pinecone implementation.

For OpenAI embeddings, use pool_threads>4 when constructing the pinecone.Index,
embedding_chunk_size>1000 and batch_size~64 for best performance.

Args:
    texts: Iterable of strings to add to the vectorstore.
    metadatas: Optional list of metadatas associated with the texts.
    ids: Optional list of ids to associate with the texts.
    namespace: Optional pinecone namespace to add the texts to.
    batch_size: Batch size to use when adding the texts to the vectorstore.
    embedding_chunk_size: Chunk size to use when embedding the texts.
    async_req: Whether runs asynchronously.
    id_prefix: Optional string to use as an ID prefix when upserting vectors.
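To make those parameters concrete, here is a rough sketch of how the LangChain Pinecone vector store can be driven with them. It is not the Actor's actual code; the index name, API key, embedding model, and sample data are placeholders.

# Rough sketch (not the Actor's code) of the parameters described above,
# using the LangChain Pinecone vector store with OpenAI embeddings.
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key="<PINECONE_API_KEY>")
index = pc.Index("<INDEX_NAME>", pool_threads=8)   # pool_threads > 4, as recommended

vectorstore = PineconeVectorStore(
    index=index,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),  # placeholder model
)

chunk_texts = ["example chunk 1", "example chunk 2"]        # placeholder data
chunk_metadatas = [{"url": "https://example.com"}] * 2      # placeholder metadata

# A smaller embedding_chunk_size keeps each embedding request below the
# provider's per-request token limit; batch_size controls the upsert batches.
vectorstore.add_texts(
    texts=chunk_texts,
    metadatas=chunk_metadatas,
    batch_size=64,
    embedding_chunk_size=500,
)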

I can expose these parameters and will implement the fix on Monday.

Apologies for the inconvenience.

responsible_box · a month ago

Thanks, my Pinecone read and write is getting quite expensive. Is there a way to request and push less? I am often only changing a few URLs out of 700 or so.

responsible_box · a month ago

Looking forward to seeing this :)

jiri.spilka

I have it almost ready. I'm currently in the process of correcting the unit tests and expect to have it completed by tonight. I will keep you updated on the progress.

jiri.spilka

Hi,

Sorry for the delayed response.

It took a bit longer as I decided to implement this functionality:

And another thing: Instead of updating chunks, why is it not possible to just retrieve all chunks with matching URLs, push new ones, and delete the old ones? Would this not minimize the loading cost on Apify and reduce costs on Pinecone, where requesting so many chunks is quite expensive? In this case, we would only request the URL.

This is now available in Beta 0.0.59 with the following changes:

  • embeddingBatchSize (Pinecone only) – Batch size for embedding texts. Default: 1000, Minimum: 1.
  • usePineconeIdPrefix (Pinecone only) – Optimizes delta updates using a Pinecone ID prefix (item_id#chunk_id) when enableDeltaUpdates is true. Works only when the database is empty.
  • New parameter dataUpdatesStrategy:
    • Replaces enableDeltaUpdates.
    • Automatically set to deltaUpdates if enableDeltaUpdates = true.
    • Options: deltaUpdates, add, or upsert.
  • Renamed deltaUpdatesPrimaryDatasetFields → dataUpdatesPrimaryDatasetFields:
    • Automatically migrated if the old field is present.
  • Backward Compatibility:
    • Supports legacy enableDeltaUpdates mappings and deltaUpdatesPrimaryDatasetFields.
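For reference, a run input using the new fields might look as follows when calling the Actor through the Apify API client. Only embeddingBatchSize, usePineconeIdPrefix, dataUpdatesStrategy, and dataUpdatesPrimaryDatasetFields are taken from the list above; the remaining keys and values are illustrative placeholders, not confirmed field names.

# Sketch of a run input using the Beta 0.0.59 fields listed above.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run_input = {
    "embeddingBatchSize": 500,                   # smaller batches avoid the per-request token limit
    "usePineconeIdPrefix": True,                 # item_id#chunk_id prefix; empty database only
    "dataUpdatesStrategy": "deltaUpdates",       # or "add" / "upsert"
    "dataUpdatesPrimaryDatasetFields": ["url"],  # fields that uniquely identify a dataset item
    # Placeholder connection settings, not confirmed field names:
    "pineconeApiKey": "<PINECONE_API_KEY>",
    "pineconeIndexName": "<INDEX_NAME>",
}

client.actor("apify/pinecone-integration").call(run_input=run_input)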

I have also updated the documentation:

Configure update strategy

To control how the integration updates data in the database, use the dataUpdatesStrategy parameter. This parameter allows you to choose between different update strategies based on your use case, such as adding new data, upserting records, or incrementally updating records based on changes (deltas). Below are the available strategies and explanations for when to use each:

  • Add data (add):

    • Appends new data to the database without checking for duplicates or updating existing records.
    • Suitable for cases where deduplication or updates are unnecessary, and the data simply needs to be added.
    • For example, you might use this strategy to continually append data from independent crawls without regard for overlaps.
  • Upsert data (upsert):

    • Updates existing records in the database if they match a key or identifier and inserts new records if they don’t already exist.
    • Ideal when you want to maintain accurate and up-to-date data while avoiding duplication.
    • For instance, this is useful in cases where unique items (such as user profiles or documents) need to be managed, ensuring the database reflects the latest changes.
    • Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
  • Delta updates (deltaUpdates):

    • Incrementally updates records by identifying differences (deltas) between the new dataset and the existing database records.
    • Ensures only new or modified records are processed, leaving unchanged records untouched. This minimizes unnecessary database operations and improves efficiency.
    • This is the most efficient strategy when integrating data that evolves over time, such as website content or recurring crawls.
    • Check the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
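To illustrate the idea behind the ID-prefix option mentioned earlier (usePineconeIdPrefix), here is a sketch of how the chunks of a single changed item can be replaced without fetching every chunk in the index. It assumes a serverless Pinecone index (prefix listing is only available there) and placeholder embeddings; it is not the Actor's implementation.

# Sketch of prefix-based replacement (not the Actor's code): with vector IDs
# shaped like "item_id#chunk_id", the stale chunks of one changed item can be
# listed by prefix, deleted, and re-upserted, instead of querying all chunks.
from pinecone import Pinecone

pc = Pinecone(api_key="<PINECONE_API_KEY>")
index = pc.Index("<INDEX_NAME>")

item_id = "https://example.com/page"                      # e.g. the item's URL
dimension = 1536                                          # placeholder index dimension
new_embeddings = [[0.0] * dimension, [0.1] * dimension]   # placeholder vectors

# Delete every existing chunk of this item (serverless indexes support prefix listing).
for id_batch in index.list(prefix=f"{item_id}#"):
    index.delete(ids=id_batch)

# Upsert the re-embedded chunks under the same prefix.
index.upsert(vectors=[
    {"id": f"{item_id}#{i}", "values": vec, "metadata": {"url": item_id}}
    for i, vec in enumerate(new_embeddings)
])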

I have tested this with my (small) Pinecone database and unit tests.

Please let me know if everything is working as expected, and I’ll proceed with the release.

Best,
Jiri

jiri.spilka avatar

Sorry for the long version—here's the TL;DR:

  • Set embeddingBatchSize to 500 or a smaller value.
  • Set dataUpdatesStrategy to 'upsert' to delete old entries and add new ones.
  • Important: Ensure dataUpdatesPrimaryDatasetFields is set up correctly.

jiri.spilka

Hi, were you able to try this? Thank you. Jiri

Pricing

Pay per usage

This Actor is paid per platform usage: the Actor itself is free to use, and you only pay for Apify platform usage.