Under maintenance

Pricing

Pay per usage

WCC Pinecone Integration

Developed by

Tri⟁angle

Crawl any website and store its content in your Pinecone vector database. Enhance the accuracy and reliability of your own AI Assistant with facts fetched from external sources or connect this integration to our Pinecone GPT Chatbot assistant available in Apify Store.

3.2 (5)

Issues response

52 days

Last modified

6 months ago

Automation

Integrations

This Actor integrates the Website Content Crawler (WCC) with the Pinecone vector database. It scrapes a specified website and stores the scraped text in a Pinecone database in the form of embeddings, which makes the data ready for retrieval-augmented generation (RAG) out of the box. RAG is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources (in this case, the user's vector database).

Additionally, you can connect your Pinecone database with OpenAI's GPT model using our Pinecone GPT Chatbot. This Actor provides an interactive chatbot application similar to the well-known ChatGPT. You can ask questions as if you were chatting with GPT, but thanks to the integration with the Pinecone vector database, the model has a richer and more up-to-date knowledge base.

How it works

The Actor triggers WCC to crawl the website specified in the input (url).
When WCC finishes, the scraped text is encoded using OpenAI embeddings and stored in the Pinecone database.
- To save resources, the Actor makes sure that only new and updated pages are encoded and stored in Pinecone.
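
The "only new and updated pages" step above can be pictured as a change-detection pass over the crawled pages. The following is a minimal sketch, assuming content hashing against a cache from a previous run; the source does not say how the Actor actually detects changes, and the names `content_hash` and `select_pages_to_embed` are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_pages_to_embed(pages: dict, cache: dict) -> list:
    """Return URLs whose text is new or changed, updating the cache in place.

    `pages` maps URL -> scraped text.
    `cache` maps URL -> hash stored by a previous run (assumed layout).
    """
    changed = []
    for url, text in pages.items():
        digest = content_hash(text)
        if cache.get(url) != digest:
            changed.append(url)
            cache[url] = digest
    return changed


# Two simulated runs sharing one cache: the second run re-embeds only page "b".
cache = {}
first_run = select_pages_to_embed(
    {"https://example.com/a": "v1", "https://example.com/b": "v1"}, cache
)
second_run = select_pages_to_embed(
    {"https://example.com/a": "v1", "https://example.com/b": "v2"}, cache
)
print(first_run)
print(second_run)
```

Skipping unchanged pages avoids paying for embeddings that would be identical to those already stored.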

How to use it

To run the Actor successfully, you need to provide the following fields:

Website URL
OpenAI API key (required)
Pinecone API key (required)
Pinecone index name (required): the name of your Pinecone index (the Actor creates a new one if it doesn't exist). If you're using an existing index, make sure its dimension is set to 1536; otherwise the Actor will fail.
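
Before starting a run, it may help to check the input locally. The sketch below assumes the field names shown in the input example (`url`, `openaiApiKey`, `pineconeApiKey`, `pineconeIndexName`); the helper `check_run_input` is hypothetical and not part of the Actor, and the dimension of an existing index has to be looked up separately (for example in the Pinecone console).

```python
from typing import Optional

# Field names taken from the input example; treating all four as mandatory is an assumption.
REQUIRED_FIELDS = ("url", "openaiApiKey", "pineconeApiKey", "pineconeIndexName")
EXPECTED_DIMENSION = 1536  # dimension the Actor expects for an existing index


def check_run_input(run_input: dict, existing_index_dimension: Optional[int] = None) -> list:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not run_input.get(name)
    ]
    if existing_index_dimension is not None and existing_index_dimension != EXPECTED_DIMENSION:
        problems.append(
            f"index dimension is {existing_index_dimension}, expected {EXPECTED_DIMENSION}"
        )
    return problems


run_input = {
    "url": "https://apify.com/about",
    "openaiApiKey": "YOUR_OPENAI_KEY",
    "pineconeApiKey": "YOUR_PINECONE_KEY",
    "pineconeIndexName": "your-pinecone-index-name",
}
print(check_run_input(run_input, existing_index_dimension=1536))
print(check_run_input(run_input, existing_index_dimension=768))
```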

Other fields to tweak the Actor's settings:

You can adjust WCC's settings in the Website Content Crawler settings and HTML processing sections.
Document (text) processing can be configured in the Document chunk settings section.
Use Vector database query to get relevant documents from the database.
- Additionally, use the No website crawling ... flag to disable scraping and only query the database.

Input example

{
    "url": "https://apify.com/change-log/performance-api-updates-adaptive-playwright-crawler",
    "openaiApiKey": "YOUR_OPENAI_KEY",
    "pineconeApiKey": "YOUR_PINECONE_KEY",
    "cacheKeyValueStoreName": "website-content-vector-cache",
    "noCrawling": false,
    "pineconeIndexName": "your-pinecone-index-name",
    "query": "What is an Adaptive Playwright Crawler and how can I use it to crawl apify.com website? Include TypeScript code example demonstrating the usage of this adaptive crawler.",
    "chunkSize": 2000,
    "chunkOverlap": 200,
    "maxCrawlPages": 1,
    "maxCrawlDepth": 0
}
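
To illustrate what `chunkSize` and `chunkOverlap` in the input above control, here is a minimal sketch of character-based chunking with overlap. It is an assumption about the general technique only: the source does not describe the Actor's actual splitter (which may, for example, split on separators), and `split_text` is a hypothetical helper.

```python
def split_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
chunks = split_text(sample, chunk_size=2000, chunk_overlap=200)
print([len(chunk) for chunk in chunks])
```

Larger chunks keep more context per embedding, while a larger overlap reduces the chance that relevant text is split across two chunks at the cost of storing more vectors.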

Output example

If you provide a query in the input, the Actor outputs the documents from the Pinecone database that are relevant to your query, sorted from most to least relevant by their score value. Note that the following example merges data from 3 different runs and their corresponding start URLs:

[
  {
    "id": "0b2d4817d2698f166ed02d90d74e6156ffd8ee593ba5ef7cef287ce6deff900f",
    "score": 0.850724638,
    "values": [],
    "metadata": {
      "text": "As part of our continuous performance improvement initiative, we're happy to announce that we successfully improved the Apify API response time by 50% on average and the 90th-percentile startup time of Actors by about 20%. We will continue improving Apify in this direction.\nAPI updates\nUser limits endpoint now returns maxConcurrentActorJobs and activeActorJobCount properties enabling users to keep an eye on the concurrency limit.\nWe also added the missing endpoint /actor-builds/:build-id/log, allowing you to quickly access the log of certain builds without a need for an Actor run ID.\nAdaptive Playwright Crawler\nTry out Crawlee's new AdaptivePlaywrightCrawler class abstraction, which is an extension of PlaywrightCrawler that uses a more limited request handler interface so that it's able to switch to HTTP-only crawling when it detects that it may be possible. This way, you can achieve lower costs when crawling multiple websites.\n1const crawler = new AdaptivePlaywrightCrawler({ 2 renderingTypeDetectionRatio: 0.1, 3 async requestHandler({ querySelector, pushData, enqueueLinks, request, log }) { 4 // This function is called to extract data from a single web page 5 const $prices = await querySelector('span.price') 6 7 await pushData({ 8 url: request.url, 9 price: $prices.filter(':contains(\"$\")').first().text(), 10 }) 11 12 await enqueueLinks({ selector: '.pagination a' }) 13 }, 14}); 15 16await crawler.run([ 17 'http://www.example.com/page-1', 18 'http://www.example.com/page-2', 19]);",
      "url": "https://apify.com/change-log/performance-api-updates-adaptive-playwright-crawler"
    }
  },
  {
    "id": "5a8e1b8a36076db902e9d7c5063ae6eee6e6e5833358d13982a377c480e7b87c",
    "score": 0.818883061,
    "values": [],
    "metadata": {
      "text": "Founded in 2015\nApify was launched by Jan Čurn and Jakub Balada in 2015 from the Y Combinator Fellowship in Mountain View, California. The original idea was to make it easy for developers to build flexible and scalable web crawlers simply using front-end JavaScript, thanks to the back-then new headless browser technology.\nBuilt with ❤️ and 🍺 in Prague\nIn 2016, the team moved back to the Czech Republic, raised a seed investment, and started building a company around its product. Soon it became obvious that customers’ use cases need more than a simple JavaScript crawler, so we committed to building the most flexible full-stack platform for web scraping and browser automation.\nOur mission\nWe make the web more programmable, to let people automate mundane tasks on the web and spend their time on things that matter. We strive to keep the web open as a public good and a basic right for everyone, regardless of the way you want to use it, as its creators intended.\n2,500+Customers worldwide\n4 B+Web pages crawled monthly\n1,600+Ready-made Actors in Store\nBrand resources",
      "url": "https://apify.com/about"
    }
  },
  {
    "id": "c73be5748c2d04783587eebac174c37c31864c1dc66872aa7b7161cb3a1ed8ec",
    "score": 0.801438034,
    "values": [],
    "metadata": {
      "text": "Start URLs\nhttps://crawlee.dev\nGlob patterns\nhttps://crawlee.dev/*/*\nResults of successful run:\nMonthly Actor rental fee\n-$0\nOverall platform usage*\n-$0.036\n*Usage can differ for every run. The example above uses default settings for each Actor.\nPlatform usage breakdown:\nActor compute units\n$0.035\nRegistered users can check their daily usage chart in Apify Console \nWhat is the prepaid platform usage, and how much do I need?\nThe Apify platform has a number of services that are charged based on usages, such as Actors, proxies, data transfer, and storage. See pricing for the full list of platform services.",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "ce47020d694161e4397370e330e711f4b165c4ed63d74995ce5ef1786042631e",
    "score": 0.780740678,
    "values": [],
    "metadata": {
      "text": "The Apify platform has a number of services that are charged based on usages, such as Actors, proxies, data transfer, and storage. See pricing for the full list of platform services. \nEach subscription plan comes with a certain amount of prepaid platform usage that is used to pay for services. If your platform usage in a given billing cycle exceeds this prepaid amount, the excess usage will be added to your next invoice, and you'll get a notification. If you're on the free plan, your access to Apify's services will be blocked until the beginning of the next monthly cycle. \nNote that unused usage credits are not rolled over to the next billing cycle, and they expire at the end of the billing cycle. \nCan I try Apify for free?",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "fb0eeed4c9f6c222a6a629b07a4db428c89a0bb6484ff8e45f7275abf9702aab",
    "score": 0.777982414,
    "values": [],
    "metadata": {
      "text": "How does Apify pay as you go work?\nIf you're on one of Apify's paid plans, you can continue using the platform after reaching the limit by paying the rest as overage. That means you don't have to change your pricing plan to exceed the usage limit of your current plan. \nDoes Apify offer any discounts for charities and universities?\nApify offers a discount on its paid plans to students of accredited educational institutions. Students of those institutions are eligible for 30% off Starter, and Scale plans. If you have any questions, contact us. \nI would like to develop Actors. What should I do?\nApify Academy is a free course that shows you how to start developing Actors on the Apify platform. You can also find more information in the Apify documentation. \nAny other questions? Please contact us.",
      "url": "https://apify.com/pricing"
    }
  },
  {
    "id": "623beea25a535a6fa62f0db309ef7dd5cf6df5e4b1a1c52a7c1de0615072fcda",
    "score": 0.776461363,
    "values": [],
    "metadata": {
      "text": "Increased Actor RAM\n$2 / GB\nDatacenter proxy\nfrom $0.6 / IP address\nPersonal tech training\n$200 / hour\nPriority chat\n$100\nDo you want to build your own Actors?\nHere is a special offer for you: our Creator Plan! For just $1 per month, enjoy $500 worth of free usage and other benefits for 6 months, but please note that you will have access to only some Apify Store Actors.\nHow pricing works\nApify's pricing is all about how you use the platform - here's a breakdown for a typical $49/month plan as an example.\nMonthly prepaid usage $49 + pay as you go\nActor rentals\n$49\nSet your limit\nYour limit\nPay as you go\nMonthly Actor rental fee\n1st Actor usage*\n2nd Actor usage*\n*Each Actor run is different. The above pricing breakdown is just an example.\nRegistered users can check their daily usage chart in Apify Console \nStart URLs\nhttps://crawlee.dev\nGlob patterns\nhttps://crawlee.dev/*/*\nResults of successful run:\nMonthly Actor rental fee\n-$0\nOverall platform usage*\n-$0.036",
      "url": "https://apify.com/pricing"
    }
  }
]
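
Output in the shape shown above can be turned into context for a RAG prompt. The sketch below is one possible post-processing step, not part of the Actor: it assumes each item has `score`, `metadata.text`, and `metadata.url` as in the example, and the helper `build_context` with its `min_score` and `max_chars` parameters is hypothetical.

```python
def build_context(matches: list, min_score: float = 0.8, max_chars: int = 4000) -> str:
    """Join the best-scoring documents into one context string.

    Documents below min_score are dropped, and collection stops before the
    total length of the collected entries would exceed max_chars.
    """
    ranked = sorted(matches, key=lambda match: match["score"], reverse=True)
    parts = []
    used = 0
    for match in ranked:
        if match["score"] < min_score:
            break  # everything after this point scores even lower
        entry = f"Source: {match['metadata']['url']}\n{match['metadata']['text']}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


# Shortened stand-ins for the documents in the output example.
matches = [
    {"score": 0.85, "metadata": {"text": "Adaptive Playwright Crawler", "url": "https://apify.com/change-log"}},
    {"score": 0.78, "metadata": {"text": "Pricing details", "url": "https://apify.com/pricing"}},
    {"score": 0.81, "metadata": {"text": "Founded in 2015", "url": "https://apify.com/about"}},
]
context = build_context(matches, min_score=0.8)
print(context)
```

The threshold of 0.8 is arbitrary here; a suitable value depends on the embedding model and the data.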
