Pinecone integration
Simplify your data operations with this Apify and Pinecone integration. Easily push selected fields from your Apify Actor directly into any Pinecone index. If the index doesn't exist, the integration will create it. A practical and straightforward solution for moving data between Apify and Pinecone.
I believe I've entered all the configurations correctly to push my web scrape results to my Pinecone database, but the Pinecone vector count is 0 and I found an error in the run logs. I don't understand what I need to do to have the documents in Pinecone.
Log files
2023-11-15T20:32:19.661Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-15T20:32:24.285Z ACTOR: Creating Docker container.
2023-11-15T20:32:24.906Z ACTOR: Starting Docker container.
2023-11-15T20:32:27.974Z INFO Initializing actor...
2023-11-15T20:32:27.975Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-15T20:32:28.166Z Loading dataset
2023-11-15T20:32:28.168Z Metadata fields loaded {'source': None}
2023-11-15T20:32:28.181Z Dataset loaded for field text
2023-11-15T20:32:28.182Z Loading documents for field text
2023-11-15T20:32:31.096Z ERROR Actor failed with an exception
2023-11-15T20:32:31.098Z Traceback (most recent call last):
2023-11-15T20:32:31.099Z   File "/usr/src/app/src/main.py", line 59, in main
2023-11-15T20:32:31.100Z     documents = loader.load()
2023-11-15T20:32:31.101Z                 ^^^^^^^^^^^^^
2023-11-15T20:32:31.102Z   File "/usr/local/lib/python3.11/site-packages/langchain/document_loaders/apify_dataset.py", line 54, in load
2023-11-15T20:32:31.103Z     return list(map(self.dataset_mapping_function, dataset_items))
2023-11-15T20:32:31.104Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-15T20:32:31.105Z   File "/usr/src/app/src/main.py", line 52, in
Document format
[
  {
    "url": "https://www.domain.com/",
    "text": "lorem ipsum"
  },
  {
    "url": "https://www.domain.com/",
    "text": "lorem ipsum"
  }
]
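For context on the traceback: the integration uses LangChain's ApifyDatasetLoader, which maps every dataset item through a dataset_mapping_function. Below is a minimal, self-contained sketch of that mapping step (a plain-Python stand-in, not the actor's actual code) showing how a metadata config like {'source': None} from the log can blow up when the configured field doesn't exist in the items:

```python
# Sketch of how a dataset_mapping_function turns Apify dataset items into
# (page_content, metadata) pairs. This is a hypothetical stand-in for the
# integration's internals, kept dependency-free for illustration.

def map_item(item, text_field="text", metadata_fields=None):
    """Map one dataset item to a (text, metadata) pair.

    If a configured metadata field (e.g. 'source' mapped to None, as in
    the log above) is missing from the item, item[field] raises -- one
    plausible cause of the 'Actor failed with an exception' error.
    """
    metadata_fields = metadata_fields or {}
    metadata = {key: item[field] for key, field in metadata_fields.items()}
    return (item[text_field], metadata)

items = [
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
]

# Works when the mapped fields exist in every item:
docs = [map_item(it, metadata_fields={"url": "url"}) for it in items]

# Fails when the config references a field the items don't have,
# e.g. metadata_fields={'source': None} as the log shows:
try:
    map_item(items[0], metadata_fields={"source": None})
except (KeyError, TypeError) as exc:
    print(f"mapping failed: {exc!r}")
```

This would explain why removing (or correcting) the metadata_fields config was the first suggestion.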
Try to remove metadata_fields.
Removed the metadata config, but still get an error.
2023-11-16T14:41:05.889Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-16T14:41:10.573Z ACTOR: Creating Docker container.
2023-11-16T14:41:10.629Z ACTOR: Starting Docker container.
2023-11-16T14:41:13.489Z INFO Initializing actor...
2023-11-16T14:41:13.490Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-16T14:41:13.662Z Loading dataset
2023-11-16T14:41:13.664Z Metadata fields loaded {}
2023-11-16T14:41:13.665Z ERROR Actor failed with an exception
2023-11-16T14:41:13.666Z Traceback (most recent call last):
2023-11-16T14:41:13.667Z   File "/usr/src/app/src/main.py", line 49, in main
2023-11-16T14:41:13.668Z     dataset_id=actor_input.get('payload')['resource']['defaultDatasetId'],
2023-11-16T14:41:13.669Z                ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2023-11-16T14:41:13.669Z TypeError: 'NoneType' object is not subscriptable
2023-11-16T14:41:13.670Z INFO Exiting actor ({"exit_code": 91})
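This traceback points at a different problem than the first run: when the integration is started directly rather than triggered by a webhook from another actor, the input has no 'payload', so actor_input.get('payload') returns None and None['resource'] raises the TypeError. A defensive sketch of the lookup (field names taken from the traceback; the fallback to an explicit dataset_id is an illustrative fix, not the actor's actual code):

```python
# Illustrative fix for the 'NoneType' object is not subscriptable error:
# only dereference the webhook payload when it exists, and fall back to
# an explicit dataset_id input for standalone runs (hypothetical field).

def resolve_dataset_id(actor_input):
    payload = actor_input.get("payload")
    if payload is not None:
        # Webhook-triggered run: the payload carries the dataset id.
        return payload["resource"]["defaultDatasetId"]
    # Standalone run: no payload, so rely on an explicit input field.
    return actor_input.get("dataset_id")

# Webhook-triggered run:
webhook_input = {"payload": {"resource": {"defaultDatasetId": "abc123"}}}
print(resolve_dataset_id(webhook_input))  # abc123

# Standalone run: no crash, just the explicit id (or None if absent):
standalone_input = {"dataset_id": "xyz789"}
print(resolve_dataset_id(standalone_input))  # xyz789
```

This matches the eventual resolution further down the thread, where a dataset_id input was added so the actor can run standalone.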
This was run from the Pinecone actor directly. Will I need to rerun the web scraper actor? It doesn't seem like I should need to, but that's the only thing I can think of. I just didn't want to take the time, or spend the money, if I didn't have to.
Also, without the metadata configured, how will the page urls be stored in pinecone? Traditionally I see them saved as metadata.
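For reference, the usual pattern for keeping source URLs queryable in Pinecone is exactly as described: each upserted vector carries a metadata dict alongside its id and values. A minimal sketch of building that payload (the embed() stub is a placeholder for a real embedding model, and the field choices are illustrative, not what this integration necessarily does):

```python
# Sketch of the common (id, vector, metadata) upsert shape used with
# Pinecone, with the source URL stored as metadata. embed() is a dummy
# placeholder -- real code would call OpenAI or another embedding model.

def embed(text):
    # Placeholder embedding; only the shape matters for this sketch.
    return [float(len(text)), 0.0, 0.0]

items = [
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
]

# Pinecone accepts vectors as (id, values, metadata) tuples.
vectors = [
    (f"doc-{i}", embed(item["text"]), {"url": item["url"]})
    for i, item in enumerate(items)
]

print(vectors[0][2])  # {'url': 'https://www.domain.com/'}
# With a live client this would be: index.upsert(vectors=vectors)
```

Storing the URL in metadata is what later lets you filter or attribute query results back to their source pages.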
Unfortunately, URLs are not in the result set; this integration can't fetch them, as that depends on the actor itself. Also, yes, you need to run it again.
Thanks for the quick response. Rerunning the webscraper now.
Feature requests:
- Let us reference stored datasets we've already created in Apify's storage section. I see they have IDs to uniquely identify them. We shouldn't have to start from scratch every time there's an error; rebuilding a dataset that already exists wastes time and money.
- Open your actor up to more embedding models than just OpenAI. It would be nice to use HuggingFace models.
Also... the Website Content Crawler actor provides lots of good metadata. Would be nice to be able to save that into Pinecone.
{
  "url": "https://www.company.com/",
  "crawl": {
    "loadedUrl": "https://www.company.com/",
    "loadedTime": "2023-11-14T19:12:10.154Z",
    "referrerUrl": "https://www.company.com/",
    "depth": 0,
    "httpStatusCode": 200
  },
  "metadata": {
    "canonicalUrl": "https://www.company.com/",
    "title": "Digital Product Growth | Experience Experts",
    "description": "Company builds technology-enabled solutions that propel businesses and delight customers.",
    "author": null,
    "keywords": null,
    "languageCode": "en-US"
  },
  "screenshotUrl": null,
  "text": "we help companies grow...."
}
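One wrinkle with saving this richer metadata: as I understand Pinecone's metadata rules, values must be strings, numbers, booleans, or lists of strings, and nulls and nested objects are rejected. So a Website Content Crawler item like the one above would need flattening first. A sketch (the field selection here is an assumption about what's worth keeping, not part of the integration):

```python
# Sketch: flatten a Website Content Crawler item into the flat, null-free
# metadata dict Pinecone accepts. Which fields to keep is illustrative.

def to_pinecone_metadata(item):
    meta = item.get("metadata", {})
    crawl = item.get("crawl", {})
    candidate = {
        "url": item.get("url"),
        "title": meta.get("title"),
        "description": meta.get("description"),
        "languageCode": meta.get("languageCode"),
        "httpStatusCode": crawl.get("httpStatusCode"),
    }
    # Drop null values -- Pinecone rejects them as metadata.
    return {k: v for k, v in candidate.items() if v is not None}

item = {
    "url": "https://www.company.com/",
    "crawl": {"httpStatusCode": 200},
    "metadata": {
        "title": "Digital Product Growth | Experience Experts",
        "author": None,
        "languageCode": "en-US",
    },
}
print(to_pinecone_metadata(item))
```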
Made another run with the web crawler. The Pinecone integration kicked off but failed. The logs seem to indicate there is a metadata config, but it was removed prior to the run. Please advise.
2023-11-16T16:43:14.231Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-16T16:43:18.786Z ACTOR: Creating Docker container.
2023-11-16T16:43:18.821Z ACTOR: Starting Docker container.
2023-11-16T16:43:21.319Z INFO Initializing actor...
2023-11-16T16:43:21.321Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-16T16:43:21.464Z Loading dataset
2023-11-16T16:43:21.467Z Metadata fields loaded {'url': None}
2023-11-16T16:43:21.474Z Dataset loaded for field text
2023-11-16T16:43:21.477Z Loading documents for field text
2023-11-16T16:43:23.632Z ERROR Actor failed with an exception
2023-11-16T16:43:23.634Z Traceback (most recent call last):
2023-11-16T16:43:23.636Z   File "/usr/src/app/src/main.py", line 59, in main
2023-11-16T16:43:23.638Z     documents = loader.load()
2023-11-16T16:43:23.640Z     ^^^^^^^^^^^^^
2023-11-16T16:43:23.642Z   File "/usr/local/lib/python3.11/site-packages/langchain/document_loaders/apify_dataset.py", line 54, in load
2023-11-16T16:43:23.644Z     return list(map(self.dataset_mapping_function, dataset_items))
2023-11-16T16:43:23.646Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-16T16:43:23.648Z   File "/usr/src/app/src/main.py", line 52, in
I reviewed the actor, and the metadata_fields field did seem to be buggy. I released a new version fixing it. You can now also optionally pass dataset_id in the input schema and run this actor standalone.
Thanks for the fixes. I'll give them a try.
- 17 monthly users
- 99.7% runs succeeded
- 11 days response time
- Created in May 2023
- Modified 5 days ago