Website Content Vector Retriever

hamza.alwan/website-content-vector-retriever

Description

This Actor is primarily designed for use with the Chat with website GPT, but it can also be used for other purposes.

  1. Initiate the Process: Start by providing a URL and a Query. These are your primary inputs, but there are additional optional settings available for exploration.

  2. Web Crawling: The system efficiently crawls the specified URL using the reliable Website Content Crawler Actor.

  3. Data Transformation: The results from the Website Content Crawler are transformed into a query-friendly format using OpenAI's embedding function.

  4. Data Storage: The transformed results are stored in a dedicated Chroma database for efficient storage and quick retrieval.

  5. Query Execution: The Actor queries the transformed results within the Chroma database.

  6. Results: The Actor returns relevant results sorted by relevance into your dataset, prioritizing the most important information.

  7. Chat with Website GPT: The Chat with website GPT reads the results and responds to the user based on the Query.
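The embed, store, and query steps above can be sketched in a few lines. This is only an illustration under stated assumptions: a toy bag-of-words `embed` function stands in for OpenAI's embedding function, an in-memory list stands in for the Chroma database, and the `pages` data and the `retrieve` helper are hypothetical names, not part of the Actor.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding function: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(pages: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Embed each crawled page, score it against the query, return the best matches."""
    query_vector = embed(query)
    scored = [
        {
            "text": page["text"],
            "metadata": {"url": page["url"]},
            "score": cosine(embed(page["text"]), query_vector),
        }
        for page in pages
    ]
    # Most relevant first, mirroring the ordering of the Actor's output.
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Hypothetical crawl output, made up for this sketch.
pages = [
    {"url": "https://example.com/pricing", "text": "Pricing plans and monthly billing"},
    {"url": "https://example.com/docs", "text": "API documentation and code examples"},
    {"url": "https://example.com/blog", "text": "Company news and announcements"},
]
results = retrieve(pages, "monthly pricing", top_k=2)
```

A real embedding model captures meaning rather than word overlap, so its scores will differ, but the shape of the flow (embed once, store, embed the query, rank by similarity) is the same.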

Example Results

{
  "request": {
    "url": "https://www.cnn.com/",
    "query": "top stories"
  },
  "status": "Website crawled completely, and relevant content found.",
  "results": [
    {
      "id": "b6922816-89f4-11ee-9146-4640c8bcfb86",
      "text": "This Moroccan startup is growing crops in the desert \nWarren Buffett donates $870 million to charities ahead of Thanksgiving \nWhat’s open and closed on Thanksgiving Day 2023 \nJake Tapper reveals challenges of covering war, why he feels news outlets ‘censor too much’ and what has left him ‘shocked’ \nChicken wings are on Popeyes menu for good \nCEO of Chinese online game streaming firm arrested amid executive crackdown \nWhat is Binance, why is it in so much trouble, and what does it mean for crypto? \nAd Feedback\nQuote Search\nMarket Movers \nACTIVES GAINERS LOSERS \n$ Price % Change \nAd Feedback",
      "metadata": {
        "url": "https://edition.cnn.com/business"
      },
      "score": 0.4413832212525754
    },
    {
      "id": "b69226f4-89f4-11ee-9146-4640c8bcfb86",
      "text": "Liberia’s President George Weah concedes victory after tight run-off election \n‘They called me a slave’: Witness testimony exposes alleged RSF-led campaign to enslave men and women in Sudan \nDisturbing videos emerge showing atrocities against African ethnic groups in Darfur \nMass visa cancellations at Saudi airport puts relations with Nigeria in spotlight \nAmericas \nShow all\nGetty Images \nAs Argentina heads to the polls, will a plan to ditch the peso for the dollar be a vote-winner? \nMexico’s first openly non-binary magistrate and prominent LGBTQ activist found dead \nProtests against copper mine deal turn deadly in Panama \nPablo Escobar’s ‘cocaine hippos’ face Colombian government cull \nAsia \nShow all\nCaroline Chia/Reuters/File \nAustralian prime minister accuses Chinese navy of ‘dangerous’ conduct \nCivilians caught in the crossfire as fighting escalates between Myanmar military and armed group \nRenewed fighting in Myanmar has displaced 26,000 people since Monday, UN says",
      "metadata": {
        "url": "https://edition.cnn.com/world"
      },
      "score": 0.40209810318097283
    },
    {
      "id": "b6922726-89f4-11ee-9146-4640c8bcfb86",
      "text": "Look of the week: Emily Ratajkowski and the ultra-cinched puffer \nRemember when Angelina Jolie’s thigh-high slit dress kickstarted the ‘Angelina Effect’? \nAnimatronic model of E.T.’s head expected to fetch up to $1 million at auction \nThe Iranian artist using painting to honor her lost homeland \nWeather \nShow all\nCNN Weather \nDisruptive Thanksgiving-week storm affecting much of the East as holiday travel rush begins \nSeveral tornadoes reported as dangerous storms fire up, threatening 20 million from Texas to the Southeast \nSignificant Thanksgiving-week storm will disrupt travel with rain, snow and severe storms \nSnow and rain to bring weather woes to millions of Thanksgiving travelers \nMore of the latest stories\n•Video \n2-year-old dies after boat capsizes near Lampedusa \n•Video \nFamilies of hostages held by Hamas Await news of a possible release of their loved ones \nReuters \n•Video \nVideo shows Israeli soldiers forcefully arresting Palestinian man in his home \n•Video",
      "metadata": {
        "url": "https://edition.cnn.com/world"
      },
      "score": 0.3971370513809591
    }
  ]
}
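A consumer of this output, such as a chat front end, might filter the results by score and join them into one context string. The sketch below assumes that a higher `score` means a more relevant result, as the descending order in the example suggests; the `build_context` helper, its parameters, and the sample item are hypothetical and only mirror the field names shown above.

```python
def build_context(item: dict, min_score: float = 0.0, max_chars: int = 2000) -> str:
    """Join result texts, most relevant first, into one context string."""
    hits = [r for r in item.get("results", []) if r.get("score", 0.0) >= min_score]
    hits.sort(key=lambda r: r.get("score", 0.0), reverse=True)

    parts = []
    used = 0
    for result in hits:
        # Prefix each snippet with its source URL so answers can cite it.
        snippet = f"[{result['metadata']['url']}] {result['text'].strip()}"
        if used + len(snippet) > max_chars:
            break  # stop before the context grows past the size budget
        parts.append(snippet)
        used += len(snippet)
    return "\n\n".join(parts)


# Made-up item with the same shape as the Actor's output.
item = {
    "request": {"url": "https://example.com/", "query": "top stories"},
    "status": "Website crawled completely, and relevant content found.",
    "results": [
        {"id": "a", "text": "Low relevance text",
         "metadata": {"url": "https://example.com/b"}, "score": 0.39},
        {"id": "b", "text": "High relevance text",
         "metadata": {"url": "https://example.com/a"}, "score": 0.44},
    ],
}
context = build_context(item, min_score=0.4)
```

An item whose `results` list is empty, as in the "currently crawling" responses, yields an empty string, so the caller can fall back to showing the `status` message instead.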

Detailed Description

Here's a detailed breakdown of how the Actor works:

  1. Input: Provide a URL and a Query. These are your primary inputs, but there are additional optional settings to explore.

  2. Unique ID Generation: Generate a unique ID from the URL for the key-value store. This ID will be used to store the final results of the Chroma database for querying and caching purposes.

  3. URL Check: Check if the URL has been previously scraped and if the results are saved in the key-value store.

    • If the URL has been previously scraped, retrieve the data from the Chroma database, query it, and return the results to the user.
  4. Scraping Check: Verify if the URL is currently undergoing scraping by another instance of the Actor, as it could be a lengthy process.

    • If the URL is presently under scraping, log the following results to the dataset:

      {
          "request": {
              "url": "url",
              "query": "query"
          },
          "status": "The website '{url}' is currently undergoing crawling. Expect results shortly; they will be cached for faster access in the future.",
          "results": []
      }
      • This information will be processed by ChatGPT, and the user will receive an appropriate response.
      • Refer to the notes for additional details; a webhook is already configured.
  5. Website Content Crawler: If the URL hasn't been scraped before, utilize the Website Content Crawler Actor to scrape the input URL.

  6. Data Transformation: Once the crawl has finished, OpenAI's embedding function converts the results into a format that is easier to query.

  7. Data Storage: The transformed results are then saved in a dedicated Chroma database, which is also cached in the key-value store.

  8. Query Execution: The Actor then queries the transformed results within the Chroma database using the provided Query in the input.

  9. Results: The Actor then returns the relevant results, sorted by relevance, into your dataset (see Example Results above).

  10. Chat with Website GPT: ChatGPT (via the Chat with website GPT) reads the results and responds to the user according to their question (the Query).
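The ID generation and the two checks in steps 2 to 4 can be sketched as follows. The page does not say how the Actor derives its ID, so the SHA-256 scheme in `store_key_for` is an assumption, and `next_action`, `in_progress_item`, and the two key sets are hypothetical stand-ins for lookups against the key-value store and running Actor instances.

```python
import hashlib
from urllib.parse import urlsplit


def store_key_for(url: str) -> str:
    """Derive a stable key from a URL (assumed scheme: hash of a normalized URL)."""
    parts = urlsplit(url.strip())
    normalized = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def next_action(url: str, cached_keys: set[str], crawling_keys: set[str]) -> str:
    """Decide what to do for a URL, checking the cache before the in-progress set."""
    key = store_key_for(url)
    if key in cached_keys:
        return "query_cache"         # step 3: already scraped, query the stored data
    if key in crawling_keys:
        return "report_in_progress"  # step 4: another instance is scraping it
    return "start_crawl"             # step 5: run the Website Content Crawler


def in_progress_item(url: str, query: str) -> dict:
    """Build the dataset item returned while another run is still crawling."""
    return {
        "request": {"url": url, "query": query},
        "status": (
            f"The website '{url}' is currently undergoing crawling. "
            "Expect results shortly; they will be cached for faster access in the future."
        ),
        "results": [],
    }


cached = {store_key_for("https://www.cnn.com/")}
action = next_action("https://www.cnn.com", cached, crawling_keys=set())
```

Normalizing the URL before hashing means that `https://www.cnn.com/` and `https://www.cnn.com` map to the same key, so a trailing slash does not trigger a second crawl.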

Notes

  1. This Actor is primarily designed for use with ChatGPT, which has a timeout of 45 seconds, so the Actor is limited to 42 seconds per run. If it exceeds this limit:

    • The following result is recorded in the dataset:

      {
          "request": {
              "url": "url",
              "query": "query"
          },
          "status": "Currently crawling the website '{url}', but it may take a minute. Results will be cached for faster access in the future.",
          "results": []
      }
      • ChatGPT will interpret this information and advise the user to try again later.
      • A webhook is configured to wait for the completion of the Website Content Crawler Actor. Once completed, it will trigger a callback to the current Actor with the same input and the run ID. This
  2. By default, each URL's data is cached for 30 days, but this can be changed in the input settings.
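The 42-second limit and the 30-day cache can be illustrated with a small polling loop and a freshness check. This is a sketch only: the constants come from the notes above, while `wait_for`, `is_cache_fresh`, and their parameters are hypothetical names, and the Actor's actual mechanism (which relies on a webhook) may differ.

```python
import time

DEADLINE_SECONDS = 42    # run limit stated in the notes
DEFAULT_CACHE_DAYS = 30  # default cache lifetime stated in the notes


def is_cache_fresh(saved_at: float, now: float, cache_days: int = DEFAULT_CACHE_DAYS) -> bool:
    """Return True while a cached entry is younger than cache_days."""
    return now - saved_at < cache_days * 24 * 60 * 60


def wait_for(is_done, deadline_seconds=DEADLINE_SECONDS, poll_seconds=1.0,
             clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll is_done() until it returns True or the deadline passes."""
    start = clock()
    while clock() - start < deadline_seconds:
        if is_done():
            return True
        sleep(poll_seconds)
    return False  # the caller should write the "Currently crawling" item


# Example: a crawl that is already finished returns immediately.
finished_in_time = wait_for(lambda: True)
```

Passing `clock` and `sleep` as parameters keeps the loop easy to test without actually waiting 42 seconds.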

Developer
Maintained by Community

Actor Metrics

  • 1 monthly user

  • 5 stars

  • Created in Sep 2023

  • Modified a year ago
