Initiate the Process: Start by providing a URL and a Query. These are your primary inputs, but there are additional optional settings available for exploration.
Web Crawling: The Actor crawls the specified URL using the Website Content Crawler Actor.
Data Transformation: The results from the Website Content Crawler are transformed into a query-friendly format using OpenAI's embedding function.
Data Storage: The transformed results are stored in a dedicated Chroma database for efficient storage and quick retrieval.
Query Execution: The Actor queries the transformed results within the Chroma database.
Results: The Actor writes the matching results to your dataset, sorted by relevance with the most relevant first.
Chat with Website GPT: The Chat with Website GPT reads the results and responds to the user based on the Query.

Example Results

{
  "request": {
    "url": "https://www.cnn.com/",
    "query": "top stories"
  },
  "status": "Website crawled completely, and relevant content found.",
  "results": [
    {
      "id": "b6922816-89f4-11ee-9146-4640c8bcfb86",
      "text": "This Moroccan startup is growing crops in the desert \nWarren Buffett donates $870 million to charities ahead of Thanksgiving \nWhat’s open and closed on Thanksgiving Day 2023 \nJake Tapper reveals challenges of covering war, why he feels news outlets ‘censor too much’ and what has left him ‘shocked’ \nChicken wings are on Popeyes menu for good \nCEO of Chinese online game streaming firm arrested amid executive crackdown \nWhat is Binance, why is it in so much trouble, and what does it mean for crypto? \nAd Feedback\nQuote Search\nMarket Movers \nACTIVES GAINERS LOSERS \n$ Price % Change \nAd Feedback",
      "metadata": {
        "url": "https://edition.cnn.com/business"
      },
      "score": 0.4413832212525754
    },
    {
      "id": "b69226f4-89f4-11ee-9146-4640c8bcfb86",
      "text": "Liberia’s President George Weah concedes victory after tight run-off election \n‘They called me a slave’: Witness testimony exposes alleged RSF-led campaign to enslave men and women in Sudan \nDisturbing videos emerge showing atrocities against African ethnic groups in Darfur \nMass visa cancellations at Saudi airport puts relations with Nigeria in spotlight \nAmericas \nShow all\nGetty Images \nAs Argentina heads to the polls, will a plan to ditch the peso for the dollar be a vote-winner? \nMexico’s first openly non-binary magistrate and prominent LGBTQ activist found dead \nProtests against copper mine deal turn deadly in Panama \nPablo Escobar’s ‘cocaine hippos’ face Colombian government cull \nAsia \nShow all\nCaroline Chia/Reuters/File \nAustralian prime minister accuses Chinese navy of ‘dangerous’ conduct \nCivilians caught in the crossfire as fighting escalates between Myanmar military and armed group \nRenewed fighting in Myanmar has displaced 26,000 people since Monday, UN says",
      "metadata": {
        "url": "https://edition.cnn.com/world"
      },
      "score": 0.40209810318097283
    },
    {
      "id": "b6922726-89f4-11ee-9146-4640c8bcfb86",
      "text": "Look of the week: Emily Ratajkowski and the ultra-cinched puffer \nRemember when Angelina Jolie’s thigh-high slit dress kickstarted the ‘Angelina Effect’? \nAnimatronic model of E.T.’s head expected to fetch up to $1 million at auction \nThe Iranian artist using painting to honor her lost homeland \nWeather \nShow all\nCNN Weather \nDisruptive Thanksgiving-week storm affecting much of the East as holiday travel rush begins \nSeveral tornadoes reported as dangerous storms fire up, threatening 20 million from Texas to the Southeast \nSignificant Thanksgiving-week storm will disrupt travel with rain, snow and severe storms \nSnow and rain to bring weather woes to millions of Thanksgiving travelers \nMore of the latest stories\n•Video \n2-year-old dies after boat capsizes near Lampedusa \n•Video \nFamilies of hostages held by Hamas Await news of a possible release of their loved ones \nReuters \n•Video \nVideo shows Israeli soldiers forcefully arresting Palestinian man in his home \n•Video",
      "metadata": {
        "url": "https://edition.cnn.com/world"
      },
      "score": 0.3971370513809591
    }
  ]
}
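
As a rough illustration of how a client might consume a record shaped like the one above, the sketch below parses a trimmed copy of the example and lists each result's source URL and score. The `summarize` helper and the `RECORD_JSON` name are hypothetical, and the long "text" values are shortened to placeholders here.

```python
import json

# A trimmed copy of the example record: the long "text" values are
# replaced with short placeholders for brevity.
RECORD_JSON = """
{
  "request": {"url": "https://www.cnn.com/", "query": "top stories"},
  "status": "Website crawled completely, and relevant content found.",
  "results": [
    {"id": "b6922816-89f4-11ee-9146-4640c8bcfb86",
     "text": "(shortened)",
     "metadata": {"url": "https://edition.cnn.com/business"},
     "score": 0.4413832212525754},
    {"id": "b69226f4-89f4-11ee-9146-4640c8bcfb86",
     "text": "(shortened)",
     "metadata": {"url": "https://edition.cnn.com/world"},
     "score": 0.40209810318097283}
  ]
}
"""


def summarize(record: dict) -> list[tuple[str, float]]:
    """Return (source URL, score) pairs in the order the record lists them."""
    return [(r["metadata"]["url"], r["score"]) for r in record["results"]]


record = json.loads(RECORD_JSON)
for url, score in summarize(record):
    print(f"{score:.3f}  {url}")
```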

Detailed Description

Here's a detailed breakdown of how the Actor works:

Input: Provide a URL and a Query. These are your primary inputs, but there are additional optional settings to explore.
Unique ID Generation: Generate a unique ID from the URL for the key-value store. This ID will be used to store the final results of the Chroma database for querying and caching purposes.
URL Check: Check if the URL has been previously scraped and if the results are saved in the key-value store.
- If the URL has been previously scraped, retrieve the data from the Chroma database, query it, and return the results to the user.
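
The ID generation and URL check steps above can be sketched as follows. The description does not say how the ID is derived, so the SHA-256 hashing, the trailing-slash normalization, and the helper names `store_key_for_url` and `lookup_cached` are assumptions for illustration; a plain dict stands in for the key-value store.

```python
import hashlib
from typing import Optional


def store_key_for_url(url: str) -> str:
    """Derive a stable key from a URL (assumed scheme: truncated SHA-256)."""
    # Assumption: treat "https://example.com/" and "https://example.com" alike.
    normalized = url.strip().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def lookup_cached(store: dict, url: str) -> Optional[dict]:
    """Return the cached entry for this URL, or None if it was never scraped."""
    return store.get(store_key_for_url(url))


# The dict below is a stand-in for the key-value store.
store = {store_key_for_url("https://www.cnn.com/"): {"chunks": 3}}
print(lookup_cached(store, "https://www.cnn.com"))      # cache hit
print(lookup_cached(store, "https://www.example.com"))  # cache miss: None
```

Deriving the key from the URL alone means any run with the same URL finds the same cached entry, regardless of the Query.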

Scraping Check: Verify whether another instance of the Actor is already scraping the URL, since scraping can be a lengthy process.

If the URL is currently being scraped, the Actor writes the following record to the dataset:

{
    "request": {
        "url": "url",
        "query": "query"
    },
    "status": "The website '{url}' is currently undergoing crawling. Expect results shortly; they will be cached for faster access in the future.",
    "results": []
}

ChatGPT processes this record and gives the user an appropriate response.
See the Notes section for additional details; a webhook is already configured.
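
A minimal sketch of the scraping check, assuming an in-memory set stands in for whatever shared state the Actor actually uses to track running crawls. The function names are hypothetical; the status text is taken from the record shown above.

```python
from typing import Optional


def in_progress_record(url: str, query: str) -> dict:
    """Build the placeholder record written while a crawl is still running."""
    return {
        "request": {"url": url, "query": query},
        "status": (
            f"The website '{url}' is currently undergoing crawling. "
            "Expect results shortly; they will be cached for faster "
            "access in the future."
        ),
        "results": [],
    }


def check_or_claim(url: str, query: str, in_progress: set) -> Optional[dict]:
    """Return a placeholder if the URL is being crawled, else claim it."""
    if url in in_progress:
        return in_progress_record(url, query)
    in_progress.add(url)  # this run now owns the crawl
    return None


running = set()
print(check_or_claim("https://www.cnn.com/", "top stories", running))  # None
print(check_or_claim("https://www.cnn.com/", "top stories", running)["status"])
```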

Website Content Crawler: If the URL hasn't been scraped before, the Website Content Crawler Actor scrapes the input URL.
Data Transformation: Once the crawler has finished scraping the URL, OpenAI's embedding function converts the results into a format that is easier to query.
Data Storage: The transformed results are then saved in a dedicated Chroma database, which is also cached in the key-value store.
Query Execution: The Actor then queries the transformed results within the Chroma database using the Query provided in the input.
Results: The Actor then writes the relevant results to your dataset, sorted by relevance (see Example Results above).
Chat with Website GPT: ChatGPT (through the Chat with Website GPT) reads the results and responds to the user according to their question (the Query).
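
The transform, store, and query steps above follow the usual embed-then-rank pattern. The toy sketch below illustrates only that pattern: a word-count vector stands in for OpenAI's embedding function, and an in-memory list stands in for the Chroma database. None of it reflects the Actor's actual code, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_chunks(chunks: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Score every stored chunk against the query; most similar first."""
    q = embed(query)
    scored = [
        {
            "text": c["text"],
            "metadata": {"url": c["url"]},
            "score": cosine(embed(c["text"]), q),
        }
        for c in chunks
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


chunks = [
    {"url": "https://example.com/markets", "text": "stock market prices"},
    {"url": "https://example.com/weather", "text": "weather storm forecast"},
]
for hit in query_chunks(chunks, "storm weather"):
    print(round(hit["score"], 3), hit["metadata"]["url"])
```

A real embedding model captures meaning rather than exact word overlap, so its scores will differ; the ranking step is the part this sketch is meant to show.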

Notes

This Actor is primarily designed for use with ChatGPT, which has a timeout of 45 seconds. Therefore, we've set a 42-second limit for the Actor to complete its process. If the Actor exceeds this limit, the following happens:

The Actor records the following result in the dataset:

{
    "request": {
        "url": "url",
        "query": "query"
    },
    "status": "Currently crawling the website '{url}', but it may take a minute. Results will be cached for faster access in the future.",
    "results": []
}

ChatGPT will interpret this information and advise the user to try again later.
A webhook is configured to wait for the completion of the Website Content Crawler Actor. Once the crawler completes, the webhook triggers a callback to the current Actor with the same input and the run ID.
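
One way the 42-second limit described above could be enforced is to run the work in a background thread and stop waiting once the deadline passes. This is a sketch under that assumption, not the Actor's actual mechanism; `run_with_deadline` and the `work` callable are hypothetical, and the status strings are copied from the records shown in this document.

```python
import concurrent.futures
import time

TIME_LIMIT_SECONDS = 42  # leaves headroom under ChatGPT's 45-second timeout


def run_with_deadline(work, url, query, limit=TIME_LIMIT_SECONDS):
    """Run work(); return its results, or a placeholder if it is too slow."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(work)
    try:
        results = future.result(timeout=limit)
    except concurrent.futures.TimeoutError:
        # The crawl keeps running in the background; only the wait is abandoned.
        return {
            "request": {"url": url, "query": query},
            "status": (
                f"Currently crawling the website '{url}', but it may take "
                "a minute. Results will be cached for faster access in "
                "the future."
            ),
            "results": [],
        }
    finally:
        pool.shutdown(wait=False)
    return {
        "request": {"url": url, "query": query},
        "status": "Website crawled completely, and relevant content found.",
        "results": results,
    }


def slow_crawl():
    time.sleep(0.3)  # pretend the crawl takes longer than the limit
    return [{"text": "late"}]


print(run_with_deadline(slow_crawl, "https://www.cnn.com/", "top stories", limit=0.05)["status"])
```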

By default, each URL's data is cached for 30 days, but this can be changed in the input settings.
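
The 30-day default amounts to a simple age check on each cached entry. A minimal sketch, assuming each entry records when it was stored; the `is_cache_fresh` helper and the `max_age_days` parameter name are hypothetical and do not reflect the Actor's real input field.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_CACHE_DAYS = 30


def is_cache_fresh(stored_at: datetime, now: datetime,
                   max_age_days: int = DEFAULT_CACHE_DAYS) -> bool:
    """True while the cached entry is younger than the configured limit."""
    return now - stored_at < timedelta(days=max_age_days)


stored = datetime(2023, 11, 23, tzinfo=timezone.utc)
print(is_cache_fresh(stored, datetime(2023, 12, 1, tzinfo=timezone.utc)))
print(is_cache_fresh(stored, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```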
