Pinecone integration
Simplify your data operations with this Apify and Pinecone integration. Easily push selected fields from your Apify Actor directly into any Pinecone index. If the index doesn't exist, the integration will create it. A practical and straightforward solution for moving data between Apify and Pinecone.
I believe I've entered all the configurations correctly to push my web scrape results to my Pinecone database, but the Pinecone vector count is 0 and I found an error in the run logs. I don't understand what I need to do to have the documents in Pinecone.
Log files
2023-11-15T20:32:19.661Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-15T20:32:24.285Z ACTOR: Creating Docker container.
2023-11-15T20:32:24.906Z ACTOR: Starting Docker container.
2023-11-15T20:32:27.974Z INFO Initializing actor...
2023-11-15T20:32:27.975Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-15T20:32:28.166Z Loading dataset
2023-11-15T20:32:28.168Z Metadata fields loaded {'source': None}
2023-11-15T20:32:28.181Z Dataset loaded for field text
2023-11-15T20:32:28.182Z Loading documents for field text
2023-11-15T20:32:31.096Z ERROR Actor failed with an exception
2023-11-15T20:32:31.098Z Traceback (most recent call last):
2023-11-15T20:32:31.099Z   File "/usr/src/app/src/main.py", line 59, in main
2023-11-15T20:32:31.100Z     documents = loader.load()
2023-11-15T20:32:31.101Z                 ^^^^^^^^^^^^^
2023-11-15T20:32:31.102Z   File "/usr/local/lib/python3.11/site-packages/langchain/document_loaders/apify_dataset.py", line 54, in load
2023-11-15T20:32:31.103Z     return list(map(self.dataset_mapping_function, dataset_items))
2023-11-15T20:32:31.104Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-15T20:32:31.105Z   File "/usr/src/app/src/main.py", line 52, in
Document format
[
  {
    "url": "https://www.domain.com/",
    "text": "lorem ipsum"
  },
  {
    "url": "https://www.domain.com/",
    "text": "lorem ipsum"
  }
]
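For context on the traceback: the integration uses LangChain's ApifyDatasetLoader, which maps every dataset item through a dataset_mapping_function. Below is a minimal, self-contained sketch of that mapping step (a plain-Python stand-in, not the actor's actual code) showing how a metadata config like {'source': None} from the log can blow up when the configured field doesn't exist in the items:

```python
# Sketch of how a dataset_mapping_function turns Apify dataset items into
# (page_content, metadata) pairs. This is a hypothetical stand-in for the
# integration's internals, kept dependency-free for illustration.

def map_item(item, text_field="text", metadata_fields=None):
    """Map one dataset item to a (text, metadata) pair.

    If a configured metadata field (e.g. 'source' mapped to None, as in
    the log above) is missing from the item, item[field] raises -- one
    plausible cause of the 'Actor failed with an exception' error.
    """
    metadata_fields = metadata_fields or {}
    metadata = {key: item[field] for key, field in metadata_fields.items()}
    return (item[text_field], metadata)

items = [
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
]

# Works when the mapped fields exist in every item:
docs = [map_item(it, metadata_fields={"url": "url"}) for it in items]

# Fails when the config references a field the items don't have,
# e.g. metadata_fields={'source': None} as the log shows:
try:
    map_item(items[0], metadata_fields={"source": None})
except (KeyError, TypeError) as exc:
    print(f"mapping failed: {exc!r}")
```

This would explain why removing (or correcting) the metadata_fields config was the first suggestion.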
Try to remove metadata_fields.
Removed the metadata config, but still get an error.
2023-11-16T14:41:05.889Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-16T14:41:10.573Z ACTOR: Creating Docker container.
2023-11-16T14:41:10.629Z ACTOR: Starting Docker container.
2023-11-16T14:41:13.489Z INFO Initializing actor...
2023-11-16T14:41:13.490Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-16T14:41:13.662Z Loading dataset
2023-11-16T14:41:13.664Z Metadata fields loaded {}
2023-11-16T14:41:13.665Z ERROR Actor failed with an exception
2023-11-16T14:41:13.666Z Traceback (most recent call last):
2023-11-16T14:41:13.667Z   File "/usr/src/app/src/main.py", line 49, in main
2023-11-16T14:41:13.668Z     dataset_id=actor_input.get('payload')['resource']['defaultDatasetId'],
2023-11-16T14:41:13.669Z                ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2023-11-16T14:41:13.669Z TypeError: 'NoneType' object is not subscriptable
2023-11-16T14:41:13.670Z INFO Exiting actor ({"exit_code": 91})
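This traceback points at a different problem than the first run: when the integration is started directly rather than triggered by a webhook from another actor, the input has no 'payload', so actor_input.get('payload') returns None and None['resource'] raises the TypeError. A defensive sketch of the lookup (field names taken from the traceback; the fallback to an explicit dataset_id is an illustrative fix, not the actor's actual code):

```python
# Illustrative fix for the 'NoneType' object is not subscriptable error:
# only dereference the webhook payload when it exists, and fall back to
# an explicit dataset_id input for standalone runs (hypothetical field).

def resolve_dataset_id(actor_input):
    payload = actor_input.get("payload")
    if payload is not None:
        # Webhook-triggered run: the payload carries the dataset id.
        return payload["resource"]["defaultDatasetId"]
    # Standalone run: no payload, so rely on an explicit input field.
    return actor_input.get("dataset_id")

# Webhook-triggered run:
webhook_input = {"payload": {"resource": {"defaultDatasetId": "abc123"}}}
print(resolve_dataset_id(webhook_input))  # abc123

# Standalone run: no crash, just the explicit id (or None if absent):
standalone_input = {"dataset_id": "xyz789"}
print(resolve_dataset_id(standalone_input))  # xyz789
```

This matches the eventual resolution further down the thread, where a dataset_id input was added so the actor can run standalone.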
This was run from the Pinecone actor directly. Will I need to rerun the web scraper actor? It doesn't seem like I should need to, but that's the only thing I can think of. I just didn't want to take the time, or spend the money, if I didn't have to.
Also, without the metadata configured, how will the page urls be stored in pinecone? Traditionally I see them saved as metadata.
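For reference, the usual pattern for keeping source URLs queryable in Pinecone is exactly as described: each upserted vector carries a metadata dict alongside its id and values. A minimal sketch of building that payload (the embed() stub is a placeholder for a real embedding model, and the field choices are illustrative, not what this integration necessarily does):

```python
# Sketch of the common (id, vector, metadata) upsert shape used with
# Pinecone, with the source URL stored as metadata. embed() is a dummy
# placeholder -- real code would call OpenAI or another embedding model.

def embed(text):
    # Placeholder embedding; only the shape matters for this sketch.
    return [float(len(text)), 0.0, 0.0]

items = [
    {"url": "https://www.domain.com/", "text": "lorem ipsum"},
]

# Pinecone accepts vectors as (id, values, metadata) tuples.
vectors = [
    (f"doc-{i}", embed(item["text"]), {"url": item["url"]})
    for i, item in enumerate(items)
]

print(vectors[0][2])  # {'url': 'https://www.domain.com/'}
# With a live client this would be: index.upsert(vectors=vectors)
```

Storing the URL in metadata is what later lets you filter or attribute query results back to their source pages.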
Unfortunately, URLs are not in the result set; this integration can't fetch them, as that depends on the actor itself. Also, yes, you need to run it again.
Thanks for the quick response. Rerunning the webscraper now.
Feature requests:
- Let us reference stored datasets we've already created in Apify's storage section. I see they have IDs to uniquely identify them. We shouldn't have to start from scratch every time there's an error; rebuilding a dataset that already exists wastes time and money.
- Open your actor up to more embedding models than just OpenAI. It would be nice to use HuggingFace models.
Also... the Website Content Crawler actor provides lots of good metadata. Would be nice to be able to save that into Pinecone.
{
  "url": "https://www.company.com/",
  "crawl": {
    "loadedUrl": "https://www.company.com/",
    "loadedTime": "2023-11-14T19:12:10.154Z",
    "referrerUrl": "https://www.company.com/",
    "depth": 0,
    "httpStatusCode": 200
  },
  "metadata": {
    "canonicalUrl": "https://www.company.com/",
    "title": "Digital Product Growth | Experience Experts",
    "description": "Company builds technology-enabled solutions that propel businesses and delight customers.",
    "author": null,
    "keywords": null,
    "languageCode": "en-US"
  },
  "screenshotUrl": null,
  "text": "we help companies grow...."
}
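One wrinkle with saving this richer metadata: as I understand Pinecone's metadata rules, values must be strings, numbers, booleans, or lists of strings, and nulls and nested objects are rejected. So a Website Content Crawler item like the one above would need flattening first. A sketch (the field selection here is an assumption about what's worth keeping, not part of the integration):

```python
# Sketch: flatten a Website Content Crawler item into the flat, null-free
# metadata dict Pinecone accepts. Which fields to keep is illustrative.

def to_pinecone_metadata(item):
    meta = item.get("metadata", {})
    crawl = item.get("crawl", {})
    candidate = {
        "url": item.get("url"),
        "title": meta.get("title"),
        "description": meta.get("description"),
        "languageCode": meta.get("languageCode"),
        "httpStatusCode": crawl.get("httpStatusCode"),
    }
    # Drop null values -- Pinecone rejects them as metadata.
    return {k: v for k, v in candidate.items() if v is not None}

item = {
    "url": "https://www.company.com/",
    "crawl": {"httpStatusCode": 200},
    "metadata": {
        "title": "Digital Product Growth | Experience Experts",
        "author": None,
        "languageCode": "en-US",
    },
}
print(to_pinecone_metadata(item))
```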
Made another run with the web crawler. The Pinecone integration kicked off but failed. The logs seem to indicate there is a metadata config, but it was removed prior to the run. Please advise.
2023-11-16T16:43:14.231Z ACTOR: Pulling Docker image of build JOzHG9AAnkWKYnSo4 from repository.
2023-11-16T16:43:18.786Z ACTOR: Creating Docker container.
2023-11-16T16:43:18.821Z ACTOR: Starting Docker container.
2023-11-16T16:43:21.319Z INFO Initializing actor...
2023-11-16T16:43:21.321Z INFO System info ({"apify_sdk_version": "1.1.1", "apify_client_version": "1.3.0", "python_version": "3.11.6", "os": "linux"})
2023-11-16T16:43:21.464Z Loading dataset
2023-11-16T16:43:21.467Z Metadata fields loaded {'url': None}
2023-11-16T16:43:21.474Z Dataset loaded for field text
2023-11-16T16:43:21.477Z Loading documents for field text
2023-11-16T16:43:23.632Z ERROR Actor failed with an exception
2023-11-16T16:43:23.634Z Traceback (most recent call last):
2023-11-16T16:43:23.636Z   File "/usr/src/app/src/main.py", line 59, in main
2023-11-16T16:43:23.638Z     documents = loader.load()
2023-11-16T16:43:23.640Z     ^^^^^^^^^^^^^
2023-11-16T16:43:23.642Z   File "/usr/local/lib/python3.11/site-packages/langchain/document_loaders/apify_dataset.py", line 54, in load
2023-11-16T16:43:23.644Z     return list(map(self.dataset_mapping_function, dataset_items))
2023-11-16T16:43:23.646Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-16T16:43:23.648Z   File "/usr/src/app/src/main.py", line 52, in
I reviewed the actor, and the metadata_fields field did seem to be buggy. I released a new version fixing it. You can now also optionally pass dataset_id in the input schema and run this actor standalone.
Thanks for the fixes. I'll give them a try.
- 17 monthly users
- 99.7% runs succeeded
- 11 days response time
- Created in May 2023
- Modified 5 days ago