LlamaIndex agent
LlamaIndex agent to scrape, deduplicate and summarize contact details from a website
src/main.py
src/agent.py
src/tools.py
"""This module defines the main entry point for the Apify LlamaIndex Agent.

This Agent template is intended to show how to use a LlamaIndex Agent with Apify Actors.
It extracts contact details from a plain text query with a URL.

Feel free to modify this file to suit your specific needs.

To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
https://docs.apify.com/sdk/python
"""

from __future__ import annotations

import math
from typing import TYPE_CHECKING

from apify import Actor
from llama_index.llms.openai import OpenAI

from .agent import run_agent

if TYPE_CHECKING:
    from llama_index.core.chat_engine.types import AgentChatResponse


async def main() -> None:
    """Main entry point for the Apify LlamaIndex Agent.

    This coroutine is executed using `asyncio.run()`, so it must remain an asynchronous function for proper execution.
    Asynchronous execution is required for communication with Apify platform, and it also enhances performance in
    the field of web scraping significantly.
    """
    async with Actor:
        Actor.log.info('Starting LlamaIndex Agent')
        # Round the allocated memory up to whole gigabytes. True division is used here,
        # because floor division would already truncate and make math.ceil a no-op.
        count = math.ceil((Actor.get_env().get('memory_mbytes', 1024) or 1024) / 1024)
        await Actor.charge(event_name='actor-start-gb', count=count)
        Actor.log.info('Charged for Actor start %d GB', count)
        try:
            if not (actor_input := await Actor.get_input()):
                await Actor.fail(status_message='Actor input was not provided')
                return
            await check_inputs(actor_input)
            answer = await run_query(actor_input['query'], actor_input['modelName'])
            await Actor.push_data({'query': actor_input['query'], 'answer': answer})
            Actor.log.info('Charging for task completed')
            await Actor.charge(event_name='task-completed', count=1)
        except Exception as e:
            await Actor.fail(status_message='Failed to process query', exception=e)


async def check_inputs(actor_input: dict) -> None:
    """Check that provided input exists.

    :raises Exception: If query is not provided
    """
    if not actor_input.get('query'):
        msg = 'Input `query` is not provided. Please verify that the `query` is correctly set.'
        await Actor.fail(status_message=msg)


async def run_query(query: str, model_name: str) -> AgentChatResponse | None:
    """Process query with LlamaIndex Agent."""
    # Temperature 0 keeps the model output as deterministic as possible.
    llm = OpenAI(model=str(model_name), temperature=0)
    try:
        return await run_agent(query=query, llm=llm, verbose=True)
    except Exception as e:
        msg = f'Error running LlamaIndex Agent, error: {e}'
        await Actor.fail(status_message=msg, exception=e)
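The `actor-start-gb` charge in `main()` is counted per gigabyte of allocated memory. The following is a minimal sketch of that rounding in isolation, assuming the intent is to round up to whole gigabytes and to charge for at least one. The helper name `start_charge_gb` and the one-unit minimum are illustrative assumptions, not part of the template.

```python
from __future__ import annotations

import math


def start_charge_gb(memory_mbytes: int | None, default_mb: int = 1024) -> int:
    """Return the number of whole gigabytes to charge for (hypothetical helper).

    Falls back to `default_mb` when the memory value is missing or zero,
    rounds up to the next gigabyte, and never returns less than 1.
    """
    mb = memory_mbytes or default_mb
    # True division keeps the fractional part so that math.ceil can round it up.
    return max(1, math.ceil(mb / 1024))


if __name__ == '__main__':
    for mb in (None, 512, 1024, 1536, 4096):
        print(mb, '->', start_charge_gb(mb))
```

Note that floor division (`//`) would truncate 1536 MB to 1 GB before `math.ceil` ever sees a fraction, which is why the sketch uses `/`.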
Python LlamaIndex Agent Template
Create a new AI Agent with LlamaIndex using this template. It provides a basic structure for the Agent with the Apify SDK and allows you to easily add your own functionality.
Included features
- Apify SDK for Python - a toolkit for building Apify Actors and scrapers in Python.
- Input Schema - define and easily validate a schema for your Actor's input.
- LlamaIndex - a framework for building LLM-powered agents using your data.
- Dataset - a storage solution for structured data where each object stored shares the same attributes.
How it works
The Agent has two main tools:
- call_contact_details_scraper: Calls the Contact Details Scraper to scrape contact details from websites.
- summarize_contact_information: Summarizes the collected contact details.
Given a user query with a URL, the Agent uses the Contact Details Scraper to retrieve the contact information and optionally summarizes the data. The Agent decides how to handle the data: it can process it further, or skip summarization when it is not needed.
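The template description mentions deduplicating the scraped contact details. The sketch below shows one way such a step could look in plain Python, independent of LlamaIndex. The record shape and the field names `emails` and `phones` are assumptions made for illustration; the actual output of the Contact Details Scraper may differ.

```python
from __future__ import annotations


def deduplicate_contacts(records: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-page contact records into one record with unique values per field.

    Values are compared case-insensitively after trimming whitespace, and the
    first spelling encountered is the one that is kept.
    """
    merged: dict[str, list[str]] = {}
    seen: dict[str, set[str]] = {}
    for record in records:
        for field, values in record.items():
            for value in values:
                key = value.strip().lower()
                # Skip empty strings and values already collected for this field.
                if not key or key in seen.setdefault(field, set()):
                    continue
                seen[field].add(key)
                merged.setdefault(field, []).append(value.strip())
    return merged


if __name__ == '__main__':
    # Hypothetical records, one per scraped page.
    pages = [
        {'emails': ['Info@apify.com', 'hello@apify.com'], 'phones': ['+1 555 0100']},
        {'emails': ['info@apify.com '], 'phones': []},
    ]
    print(deduplicate_contacts(pages))
```

Deduplicating before summarization keeps the text sent to the LLM shorter, which also reduces token usage.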
Sample queries:
- Find contact details for apify.com and provide raw results.
- Find contact details for apify.com and summarize them.
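In src/main.py the query is read from the Actor input under the `query` key, together with `modelName`. A minimal sketch of assembling such an input is shown here. The helper `build_actor_input` and the default model name `gpt-4o-mini` are assumptions for illustration; check the template's input schema for the values it actually accepts.

```python
from __future__ import annotations


def build_actor_input(query: str, model_name: str = 'gpt-4o-mini') -> dict[str, str]:
    """Build the input dictionary read by main() (hypothetical helper).

    The keys `query` and `modelName` match those accessed in src/main.py;
    the default model name is only an example value.
    """
    cleaned = query.strip()
    if not cleaned:
        raise ValueError('Input `query` must not be empty.')
    return {'query': cleaned, 'modelName': model_name}


if __name__ == '__main__':
    print(build_actor_input('Find contact details for apify.com and summarize them.'))
```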
Before you start
To run this template locally or on the Apify platform, you need:
- An Apify account and an Apify API token.
- An OpenAI account and API key.
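When running locally, both credentials are typically supplied through environment variables. The sketch below checks for them before the Agent starts, assuming the conventional names `APIFY_TOKEN` and `OPENAI_API_KEY`. The helper `missing_credentials` is hypothetical and not part of the template.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Assumed variable names; adjust them if your setup uses different ones.
REQUIRED_ENV_VARS = ('APIFY_TOKEN', 'OPENAI_API_KEY')


def missing_credentials(env: Mapping[str, str] | None = None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == '__main__':
    missing = missing_credentials()
    if missing:
        print('Missing credentials:', ', '.join(missing))
    else:
        print('All credentials are set.')
```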
Monetization
This template uses the Pay Per Event (PPE) monetization model, which provides flexible pricing based on defined events.
To charge users, define events in JSON format and save them on the Apify platform. Here is an example of .actor/pay_per_event.json with the task-completed event:
[
    {
        "task-completed": {
            "eventTitle": "Task completed",
            "eventDescription": "Cost per query answered.",
            "eventPriceUsd": 0.1
        }
    }
]
In the Actor, trigger the event with:
await Actor.charge(event_name='task-completed', count=1)
This approach allows you to programmatically charge users directly from your Actor, covering the costs of execution and related services, such as LLM input/output tokens.
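To get a feel for what a run costs the user, the event definitions can be combined with event counts. The following sketch parses the example pay_per_event.json shown above and estimates a total. The helpers `load_event_prices` and `estimate_charge_usd` are hypothetical; the actual billing is performed by the Apify platform, not by code like this.

```python
from __future__ import annotations

import json

# The example definition from .actor/pay_per_event.json, inlined for the sketch.
PAY_PER_EVENT_JSON = """
[
    {
        "task-completed": {
            "eventTitle": "Task completed",
            "eventDescription": "Cost per query answered.",
            "eventPriceUsd": 0.1
        }
    }
]
"""


def load_event_prices(raw: str) -> dict[str, float]:
    """Map each event name to its price in USD."""
    prices: dict[str, float] = {}
    for entry in json.loads(raw):
        for event_name, definition in entry.items():
            prices[event_name] = float(definition['eventPriceUsd'])
    return prices


def estimate_charge_usd(prices: dict[str, float], counts: dict[str, int]) -> float:
    """Estimate the total charge for the given number of events per name."""
    unknown = sorted(set(counts) - set(prices))
    if unknown:
        raise KeyError(f'No price defined for events: {unknown}')
    total = sum(prices[name] * count for name, count in counts.items())
    # Round to avoid floating-point noise in the printed estimate.
    return round(total, 6)


if __name__ == '__main__':
    prices = load_event_prices(PAY_PER_EVENT_JSON)
    print(estimate_charge_usd(prices, {'task-completed': 3}))
```

An event that is charged in code but missing from the definitions, such as `actor-start-gb` in this example file, raises a `KeyError` in the sketch, which is a quick way to spot definitions that still need to be added.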
Resources
Useful resources to help you get started:
- Apify Actors
- LlamaIndex agent
- Building a basic agent
- What are AI agents?
- 11 AI agent use cases on Apify
Additional material: Web Scraping Data for Generative AI