Question 1

What is generative AI?

Accepted Answer

Generative AI is a class of deep learning models that generate text, images, audio, video, code, and other types of data, typically in response to text prompts. Examples of generative AI tools are ChatGPT, Midjourney, and Bard.

Question 2

What is the difference between AI and generative AI?

Accepted Answer

AI is a field of computer science that aims to create intelligent machines or systems that can perform tasks that typically require human intelligence. Generative AI is a subfield of AI focused on creating systems capable of generating new content, such as images, text, music, or video.

Question 3

What are large language models?

Accepted Answer

Large language models, or LLMs, are a form of generative AI. They are typically transformer models that use deep learning methods to understand and generate text in a human-like fashion. Examples of LLMs are ChatGPT, LLaMA, LaMDA, and Bard.

Question 4

What is data ingestion?

Accepted Answer

Data ingestion is the process of collecting, processing, and preparing data for analysis or machine learning. In the context of LLMs, data ingestion involves collecting text data (web scraping), preprocessing it (cleaning, normalization, tokenization), and preparing it for training (feature engineering).
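As a rough illustration of the preprocessing step, here is a minimal sketch using only the Python standard library. The function names `clean_text` and `tokenize` are hypothetical, and the word-level tokenizer is a simplification: production LLM pipelines generally use subword tokenizers instead.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. non-breaking space -> space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize(text: str) -> list[str]:
    """Lowercase and split into simple word tokens (illustrative only)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


raw_page = "  Web   scraping\u00a0feeds\n\nLLMs.  It's useful!  "
cleaned = clean_text(raw_page)
tokens = tokenize(cleaned)

print(cleaned)
print(tokens)
```

Keeping cleaning and tokenization as separate functions makes it easier to swap in a real tokenizer later without touching the cleaning logic.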

Question 5

Why use web scraping for AI?

Accepted Answer

Web scraping allows you to collect reliable, up-to-date information that can be used to train or fine-tune large language models (LLMs), or to provide context in prompts for a model such as ChatGPT. As a result, the model can answer questions based on your own or your customers' websites and content.

Question 6

What are vector databases?

Accepted Answer

Vector databases are designed to store and query vector embeddings, which are dense vectors of numbers that represent text. They index these vectors so that, given a query vector, the most similar stored vectors can be found and retrieved quickly.
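To make the idea of similarity search concrete, here is a minimal in-memory sketch. It is not how any particular vector database is implemented: real systems use approximate nearest-neighbor indexes to scale, and embeddings have hundreds or thousands of dimensions. The three-dimensional vectors, document IDs, and the `nearest` helper are all made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings keyed by document ID.
index = {
    "pricing-page": [0.9, 0.1, 0.0],
    "api-docs": [0.1, 0.9, 0.2],
    "blog-post": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

results = nearest(query_vector, index)
print(results)
```

This brute-force version compares the query against every stored vector, which is fine for a handful of documents but is exactly the cost that a dedicated vector database is built to avoid.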

Question 7

What is LangChain?

Accepted Answer

LangChain is an open-source framework for developing applications powered by language models. It connects the AI models you want to use with outside data sources and tools. That means you can chain steps together so the model has what it needs to produce the answers or perform the tasks you require.

Question 8

What is Pinecone?

Accepted Answer

Pinecone is a popular vector database that lets you provide long-term memory for high-performance AI applications. It is used for semantic search, similarity search for images and audio, recommendation systems, record matching, anomaly detection, and natural language processing.

Question 9

How do I train LLMs with scraped data?

Accepted Answer

Data collection: use a tool like Apify's Website Content Crawler to scrape web data. Configure the crawler settings like start URLs, crawler type, HTML processing, and data cleaning to tailor the data to what you need.
Data processing: clean the scraped data by removing unnecessary HTML elements and duplicates, then transform it into a usable format (e.g. JSON or CSV).
Integration and training: integrate the cleaned data with tools like LangChain or Pinecone to supply it to your LLM as context, or use it to fine-tune or train the model according to your specific requirements. Check out this full step-by-step tutorial on how to collect data for LLMs with web scraping.
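The data processing step above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the shape of the `pages` input (a list of dicts with `url` and `html` keys) is hypothetical and is not the output format of any specific crawler, and `TextExtractor`, `html_to_text`, and `to_records` are made-up helper names.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def to_records(pages: list[dict]) -> list[dict]:
    """Strip HTML and drop pages whose extracted text was already seen."""
    seen = set()
    records = []
    for page in pages:
        text = html_to_text(page["html"])
        if text in seen:
            continue  # exact duplicate of an earlier page
        seen.add(text)
        records.append({"url": page["url"], "text": text})
    return records


# Hypothetical scraped pages; the second duplicates the first page's text.
pages = [
    {"url": "https://example.com/a", "html": "<h1>Hello</h1><script>x()</script><p>World</p>"},
    {"url": "https://example.com/b", "html": "<h1>Hello</h1><p>World</p>"},
    {"url": "https://example.com/c", "html": "<p>Something else</p>"},
]
records = to_records(pages)
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Writing one JSON object per line keeps the output easy to stream into later steps. Exact-match deduplication is the simplest option; near-duplicate detection would need a fuzzier comparison.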

Question 10

What is retrieval-augmented generation?

Accepted Answer

Retrieval-augmented generation (RAG) is a technique used in natural language processing that combines retrieval-based and generation-based approaches: relevant documents are first retrieved from an external source and then passed to a generative model as context. It is used to improve the quality and relevance of the text that generative AI systems produce.

Question 11

Why use RAG for chatbots?

Accepted Answer

RAG is a popular method for creating chatbots because it combines retrieval-based and generative models. Retrieval-based models search a database for the most relevant content. Generative models create answers on the fly. Combining these two capabilities makes RAG chatbots adaptable and helps mitigate hallucinations.
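The retrieve-then-generate flow can be sketched in a few lines. This is a toy under stated assumptions: retrieval here is plain word overlap rather than embedding similarity, the sample documents are invented, and `generate` stands in for whatever LLM call you would use. All function names are hypothetical.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


def answer(question: str, documents: list[str], generate: Callable[[str], str]) -> str:
    """Retrieve context, then hand the assembled prompt to the generator."""
    context = retrieve(question, documents)
    return generate(build_prompt(question, context))


# Invented knowledge base for illustration.
documents = [
    "The library opens at 9 am on weekdays.",
    "Returns are accepted within 30 days of purchase.",
]
question = "When does the library open?"


def echo_generate(prompt: str) -> str:
    # Placeholder for a real model call; it just returns the prompt.
    return prompt


print(answer(question, documents, echo_generate))
```

Because the model is told to answer from the retrieved context, its output is grounded in your own content, which is what helps reduce hallucinations.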
