WCC Pinecone Integration
tri_angle/wcc-pinecone-integration

This Actor integrates the Website Content Crawler (WCC) with the Pinecone vector database. It scrapes a website you specify and stores the scraped text in a Pinecone index in the form of embeddings. The stored data can then be used for retrieval-augmented generation (RAG), a technique for improving the accuracy and reliability of generative AI models with facts fetched from external sources (in this case, the user's vector database).

Additionally, you can connect your Pinecone database with OpenAI's GPT model using our Pinecone GPT Chatbot. This Actor provides an interactive chatbot application similar to the well-known ChatGPT. You can ask questions as if you were chatting with GPT, but thanks to the integration with the Pinecone vector database, the model has a richer and more up-to-date knowledge base.

How it works

  1. The Actor triggers WCC to crawl the website specified in the input (url).
  2. When WCC finishes, the scraped text is encoded using OpenAI embeddings and stored in the Pinecone database.
    • To save resources, the Actor encodes and stores only new and updated pages.
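
The Actor's actual change-detection logic is not described here. One common way to skip unchanged pages is to hash each page's content and compare it with the hash stored from the previous run. The sketch below illustrates that idea only; `content_id` and `select_pages_to_embed` are hypothetical helpers, and a plain dict stands in for a persistent cache.

```python
import hashlib


def content_id(url: str, text: str) -> str:
    """Return a stable SHA-256 hex digest for a page's URL and text."""
    digest = hashlib.sha256()
    digest.update(url.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def select_pages_to_embed(pages, cache):
    """Return only the pages that are new or changed since the last run.

    `pages` is a list of {"url": ..., "text": ...} dicts. `cache` maps a URL
    to the content hash seen in the previous run and is updated in place.
    This is an illustrative assumption, not the Actor's real implementation.
    """
    changed = []
    for page in pages:
        new_hash = content_id(page["url"], page["text"])
        if cache.get(page["url"]) != new_hash:
            changed.append(page)
            cache[page["url"]] = new_hash
    return changed
```

Calling `select_pages_to_embed` twice with the same pages and the same cache returns every page the first time and an empty list the second time, so embeddings would only be computed for content that actually changed.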

How to use it

To run the Actor successfully, provide the following fields:

  • Website URL
  • OpenAI API key (required)
  • Pinecone API key (required)
  • Pinecone index name (required): the name of your Pinecone index (the Actor will create a new one if it doesn't exist). If you're using an existing index, make sure its dimension is set to 1536; otherwise the Actor will fail.
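
Before starting a run, it can help to check the input locally. The following is a minimal sketch, assuming the field names used in the input example (`openaiApiKey`, `pineconeApiKey`, `pineconeIndexName`); the helper names are hypothetical and are not part of the Actor.

```python
# Fields marked "(required)" in the list above, using the input example's keys.
REQUIRED_FIELDS = ("openaiApiKey", "pineconeApiKey", "pineconeIndexName")

# Dimension an existing Pinecone index must have for the Actor to work.
EXPECTED_DIMENSION = 1536


def missing_required_fields(actor_input: dict) -> list:
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not actor_input.get(name)]


def index_dimension_ok(dimension: int) -> bool:
    """Return True if an existing index has the dimension the Actor expects."""
    return dimension == EXPECTED_DIMENSION
```

The dimension of an existing index has to be looked up separately (for example in the Pinecone console) and passed to `index_dimension_ok`.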

Other fields to tweak the Actor's settings:

  • You can adjust WCC's settings in Website Content Crawler settings and HTML processing sections
  • Documents (text) processing can be configured in Document chunk settings
  • Use Vector database query to get relevant documents from the database
    • additionally, use No website crawling ... flag to disable scraping and only query the database
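
To illustrate what chunk size and chunk overlap mean, here is a minimal sketch of fixed-window splitting with overlap. It assumes both values are measured in characters and that the text is cut at fixed offsets; the Actor's real splitter is not documented here and may behave differently (for example, by splitting on separators). `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content near a
    chunk boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# With chunk_size=4 and chunk_overlap=1, each chunk repeats the last
# character of the previous one.
print(split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of storing and embedding more text.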

Input example

{
    "url": "https://apify.com/change-log/performance-api-updates-adaptive-playwright-crawler",
    "openaiApiKey": "YOUR_OPENAI_KEY",
    "pineconeApiKey": "YOUR_PINECONE_KEY",
    "cacheKeyValueStoreName": "website-content-vector-cache",
    "noCrawling": false,
    "pineconeIndexName": "your-pinecone-index-name",
    "query": "What is an Adaptive Playwright Crawler and how can I use it to crawl apify.com website? Include TypeScript code example demonstrating the usage of this adaptive crawler.",
    "chunkSize": 2000,
    "chunkOverlap": 200,
    "maxCrawlPages": 1,
    "maxCrawlDepth": 0
}
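
An input like this can also be sent programmatically. The sketch below uses the `apify-client` Python package and is a starting point rather than a verified recipe. The environment variable names (`APIFY_TOKEN`, `OPENAI_API_KEY`, `PINECONE_API_KEY`) are assumptions, and it assumes query results are written to the run's default dataset, which this page does not confirm.

```python
import os

ACTOR_ID = "tri_angle/wcc-pinecone-integration"


def build_run_input(url, index_name, query=None):
    """Build an input object using the field names from the example above."""
    run_input = {
        "url": url,
        # Assumed environment variable names; adjust to your own setup.
        "openaiApiKey": os.environ.get("OPENAI_API_KEY", ""),
        "pineconeApiKey": os.environ.get("PINECONE_API_KEY", ""),
        "pineconeIndexName": index_name,
        "chunkSize": 2000,
        "chunkOverlap": 200,
        "maxCrawlPages": 1,
        "maxCrawlDepth": 0,
        "noCrawling": False,
    }
    if query:
        run_input["query"] = query
    return run_input


def run_actor(run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Assumption: matches for the query end up in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `run_actor(build_run_input("https://apify.com/pricing", "your-pinecone-index-name", query="How does pay as you go work?"))` would crawl one page and then query the index.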

Output example

If you provide a query in the input, the Actor outputs the documents from the Pinecone database that are relevant to your query, sorted from most to least relevant by their score value. Note that the following example merges data from 3 different runs and their corresponding start URLs:

[
  {
    "id": "0b2d4817d2698f166ed02d90d74e6156ffd8ee593ba5ef7cef287ce6deff900f",
    "score": 0.850724638,
    "values": [],
    "metadata": {
      "text": "As part of our continuous performance improvement initiative, we're happy to announce that we successfully improved the Apify API response time by 50% on average and the 90th-percentile startup time of Actors by about 20%. We will continue improving Apify in this direction.\nAPI updates\nUser limits endpoint now returns maxConcurrentActorJobs and activeActorJobCount properties enabling users to keep an eye on the concurrency limit.\nWe also added the missing endpoint /actor-builds/:build-id/log, allowing you to quickly access the log of certain builds without a need for an Actor run ID.\nAdaptive Playwright Crawler\nTry out Crawlee's new AdaptivePlaywrightCrawler class abstraction, which is an extension of PlaywrightCrawler that uses a more limited request handler interface so that it's able to switch to HTTP-only crawling when it detects that it may be possible. This way, you can achieve lower costs when crawling multiple websites.\n1const crawler = new AdaptivePlaywrightCrawler({ 2 renderingTypeDetectionRatio: 0.1, 3 async requestHandler({ querySelector, pushData, enqueueLinks, request, log }) { 4 // This function is called to extract data from a single web page 5 const $prices = await querySelector('span.price') 6 7 await pushData({ 8 url: request.url, 9 price: $prices.filter(':contains(\"$\")').first().text(), 10 }) 11 12 await enqueueLinks({ selector: '.pagination a' }) 13 }, 14}); 15 16await crawler.run([ 17 'http://www.example.com/page-1', 18 'http://www.example.com/page-2', 19]);",
      "url": "https://apify.com/change-log/performance-api-updates-adaptive-playwright-crawler"
    }
  },
  {
    "id": "5a8e1b8a36076db902e9d7c5063ae6eee6e6e5833358d13982a377c480e7b87c",
    "score": 0.818883061,
    "values": [],
    "metadata": {
      "text": "Founded in 2015\nApify was launched by Jan Čurn and Jakub Balada in 2015 from the Y Combinator Fellowship in Mountain View, California. The original idea was to make it easy for developers to build flexible and scalable web crawlers simply using front-end JavaScript, thanks to the back-then new headless browser technology.\nBuilt with ❤️ and 🍺 in Prague\nIn 2016, the team moved back to the Czech Republic, raised a seed investment, and started building a company around its product. Soon it became obvious that customers’ use cases need more than a simple JavaScript crawler, so we committed to building the most flexible full-stack platform for web scraping and browser automation.\nOur mission\nWe make the web more programmable, to let people automate mundane tasks on the web and spend their time on things that matter. We strive to keep the web open as a public good and a basic right for everyone, regardless of the way you want to use it, as its creators intended.\n2,500+Customers worldwide\n4 B+Web pages crawled monthly\n1,600+Ready-made Actors in Store\nBrand resources",
      "url": "https://apify.com/about"
    }
  },
  {
    "id": "c73be5748c2d04783587eebac174c37c31864c1dc66872aa7b7161cb3a1ed8ec",
    "score": 0.801438034,
    "values": [],
    "metadata": {
      "text": "Start URLs\nhttps://crawlee.dev\nGlob patterns\nhttps://crawlee.dev/*/*\nResults of successful run:\nMonthly Actor rental fee\n-$0\nOverall platform usage*\n-$0.036\n*Usage can differ for every run. The example above uses default settings for each Actor.\nPlatform usage breakdown:\nActor compute units\n$0.035\nRegistered users can check their daily usage chart in Apify Console \nWhat is the prepaid platform usage, and how much do I need?\nThe Apify platform has a number of services that are charged based on usages, such as Actors, proxies, data transfer, and storage. See pricing for the full list of platform services.",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "ce47020d694161e4397370e330e711f4b165c4ed63d74995ce5ef1786042631e",
    "score": 0.780740678,
    "values": [],
    "metadata": {
      "text": "The Apify platform has a number of services that are charged based on usages, such as Actors, proxies, data transfer, and storage. See pricing for the full list of platform services. \nEach subscription plan comes with a certain amount of prepaid platform usage that is used to pay for services. If your platform usage in a given billing cycle exceeds this prepaid amount, the excess usage will be added to your next invoice, and you'll get a notification. If you're on the free plan, your access to Apify's services will be blocked until the beginning of the next monthly cycle. \nNote that unused usage credits are not rolled over to the next billing cycle, and they expire at the end of the billing cycle. \nCan I try Apify for free?",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "fb0eeed4c9f6c222a6a629b07a4db428c89a0bb6484ff8e45f7275abf9702aab",
    "score": 0.777982414,
    "values": [],
    "metadata": {
      "text": "How does Apify pay as you go work?\nIf you're on one of Apify's paid plans, you can continue using the platform after reaching the limit by paying the rest as overage. That means you don't have to change your pricing plan to exceed the usage limit of your current plan. \nDoes Apify offer any discounts for charities and universities?\nApify offers a discount on its paid plans to students of accredited educational institutions. Students of those institutions are eligible for 30% off Starter, and Scale plans. If you have any questions, contact us. \nI would like to develop Actors. What should I do?\nApify Academy is a free course that shows you how to start developing Actors on the Apify platform. You can also find more information in the Apify documentation. \nAny other questions? Please contact us.",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "623beea25a535a6fa62f0db309ef7dd5cf6df5e4b1a1c52a7c1de0615072fcda",
    "score": 0.776461363,
    "values": [],
    "metadata": {
      "text": "Increased Actor RAM\n$2 / GB\nDatacenter proxy\nfrom $0.6 / IP address\nPersonal tech training\n$200 / hour\nPriority chat\n$100\nDo you want to build your own Actors?\nHere is a special offer for you: our Creator Plan! For just $1 per month, enjoy $500 worth of free usage and other benefits for 6 months, but please note that you will have access to only some Apify Store Actors.\nHow pricing works\nApify's pricing is all about how you use the platform - here's a breakdown for a typical $49/month plan as an example.\nMonthly prepaid usage $49 + pay as you go\nActor rentals\n$49\nSet your limit\nYour limit\nPay as you go\nMonthly Actor rental fee\n1st Actor usage*\n2nd Actor usage*\n*Each Actor run is different. The above pricing breakdown is just an example.\nRegistered users can check their daily usage chart in Apify Console \nStart URLs\nhttps://crawlee.dev\nGlob patterns\nhttps://crawlee.dev/*/*\nResults of successful run:\nMonthly Actor rental fee\n-$0\nOverall platform usage*\n-$0.036",
      "url": "https://apify.com/pricing"
    }
  }
]
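
Output in this shape can be turned into context for a RAG prompt. The following is a minimal sketch, assuming each item has the `score`, `metadata.text`, and `metadata.url` fields shown above; `build_context`, the score threshold, and the character budget are hypothetical choices, not part of the Actor.

```python
def build_context(matches, min_score=0.8, max_chars=4000):
    """Join the most relevant match texts into one prompt context string.

    Matches are ranked by score (highest first). Ranking stops at the first
    match below min_score or the first one that would exceed max_chars.
    """
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    parts = []
    used = 0
    for match in ranked:
        if match["score"] < min_score:
            break  # everything after this is less relevant still
        text = match["metadata"]["text"]
        if used + len(text) > max_chars:
            break  # keep the context within the character budget
        parts.append(f"Source: {match['metadata']['url']}\n{text}")
        used += len(text)
    return "\n\n".join(parts)
```

With the example output above and the default `min_score=0.8`, only the first three documents (scores 0.85, 0.82, and 0.80) would pass the threshold, subject to the character budget.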
Developer
Maintained by Apify
Created in May 2024