AI Agent Interaction Analyzer
Pricing
Pay per usage
Evaluate AI agent conversations for quality, bias, and optimization. Uses DeepEval metrics for rigorous LLM-powered analysis or free heuristic scoring.
Developer
Rams
Maintained by Community
Evaluate your AI agent conversations for quality, bias, hallucination, and toxicity. Get structured scores and actionable insights to improve your LLM-based applications.
What It Does
Feed in AI conversations (prompts + responses) and get back detailed evaluation scores across multiple dimensions. Ideal for AI developers, researchers, and teams building LLM-powered products who need to monitor and improve their AI outputs.
Use Cases
- Quality monitoring — Score AI responses for relevance, coherence, helpfulness, and completeness
- Bias detection — Identify confirmation bias, gender bias, racial bias, and other fairness issues
- Hallucination checking — Detect when AI fabricates facts not grounded in provided context
- Toxicity screening — Flag harmful or inappropriate language in AI outputs
- Model comparison — Compare response quality across different models or prompt versions
- Regression testing — Track quality over time as you update prompts or switch models
Evaluation Modes
| Mode | Cost | What You Get |
|---|---|---|
| heuristic | Free (no API key needed) | Fast scoring using text analysis — relevance, coherence, helpfulness, completeness, keyword-based bias detection |
| deepeval | Uses your OpenAI API key | Rigorous LLM-as-judge metrics — answer relevancy, faithfulness, coherence, helpfulness, hallucination, bias, toxicity |
| full | Uses your OpenAI API key | Both heuristic and DeepEval results combined for a complete picture |
Input
Provide your conversations as a JSON array. Each conversation needs an id and a messages array:
{
  "conversations": [
    {
      "id": "conv_001",
      "messages": [
        {"role": "user", "content": "How do I implement caching in Redis?"},
        {"role": "assistant", "content": "Here's how to implement caching with Redis..."}
      ],
      "context": "Optional: ground truth or source documents for faithfulness/hallucination checks"
    }
  ],
  "mode": "heuristic",
  "openaiApiKey": "sk-... (required for deepeval/full mode only)",
  "modelName": "gpt-4o"
}
Input Fields
| Field | Required | Description |
|---|---|---|
| conversations | Yes (or use URL) | Array of conversation objects to evaluate |
| conversationUrl | Alternative | URL to fetch conversation JSON from |
| mode | No (default: heuristic) | Evaluation mode: heuristic, deepeval, or full |
| openaiApiKey | For deepeval/full | Your OpenAI API key |
| modelName | No (default: gpt-4o) | Which OpenAI model to use for evaluation |
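The rules in the table can be checked locally before starting a run. The following is a minimal sketch, not the actor's own validation: `validate_input` and the wording of its messages are hypothetical, and it only encodes what the table states (the default mode, which modes need a key, and the `conversations` / `conversationUrl` alternative).

```python
VALID_MODES = {"heuristic", "deepeval", "full"}
MODES_NEEDING_KEY = {"deepeval", "full"}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in an input dict (empty if none).

    Hypothetical helper: it mirrors the documented field rules only.
    """
    problems = []

    mode = run_input.get("mode", "heuristic")  # documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode!r}")
    if mode in MODES_NEEDING_KEY and not run_input.get("openaiApiKey"):
        problems.append(f"mode {mode!r} requires openaiApiKey")

    conversations = run_input.get("conversations")
    if not conversations and not run_input.get("conversationUrl"):
        problems.append("provide conversations or conversationUrl")

    # Each conversation needs an id and a messages array.
    for i, conv in enumerate(conversations or []):
        if "id" not in conv:
            problems.append(f"conversations[{i}] is missing id")
        if not isinstance(conv.get("messages"), list):
            problems.append(f"conversations[{i}] is missing a messages array")

    return problems
```

An empty list means the input satisfies the documented rules; it does not guarantee that the actor accepts it.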
Output
Each conversation gets a structured evaluation result pushed to the dataset:
Heuristic Mode Output
{
  "conversation_id": "conv_001",
  "quality": {
    "overall": 0.812,
    "relevance": 1.0,
    "coherence": 0.85,
    "helpfulness": 0.5,
    "completeness": 0.9
  },
  "bias": {
    "toxicity": 0.0,
    "bias_detected": false,
    "categories": []
  }
}
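One way to act on a heuristic result is to find the lowest sub-score and decide whether the conversation deserves a closer look. The sketch below assumes the result shape shown above; `weakest_dimension`, `needs_review`, and the 0.6 threshold are illustrative choices, not part of the actor.

```python
def weakest_dimension(result: dict) -> tuple[str, float]:
    """Return the lowest-scoring quality dimension, ignoring `overall`."""
    scores = {k: v for k, v in result["quality"].items() if k != "overall"}
    name = min(scores, key=scores.get)
    return name, scores[name]


def needs_review(result: dict, threshold: float = 0.6) -> bool:
    """Flag a result if any sub-score is below the threshold or bias was detected.

    The threshold is an arbitrary example value; tune it on your own data.
    """
    _, lowest = weakest_dimension(result)
    return lowest < threshold or bool(result["bias"]["bias_detected"])
```

For the sample result above, `weakest_dimension` returns `("helpfulness", 0.5)`, so the conversation is flagged at the default threshold.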
DeepEval Mode Output
{
  "conversation_id": "conv_001",
  "relevancy": {"score": 1.0, "reason": "...", "passed": true},
  "faithfulness": {"score": 0.8, "reason": "...", "passed": true},
  "coherence": {"score": 0.9, "reason": "...", "passed": true},
  "helpfulness": {"score": 0.85, "reason": "...", "passed": true},
  "hallucination": {"score": 0.0, "reason": "...", "passed": true},
  "bias": {"score": 0.0, "reason": "...", "passed": true},
  "toxicity": {"score": 0.0, "reason": "...", "passed": true},
  "overall": 0.636
}
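Because every metric entry carries a `passed` flag and a `reason`, a result can be reduced to just the metrics that failed. This is a small sketch that assumes the shape shown above; `failed_metrics` is a hypothetical helper name.

```python
def failed_metrics(result: dict) -> dict[str, str]:
    """Map each metric whose `passed` flag is false to its reason.

    Non-metric fields such as `conversation_id` and `overall` are skipped
    because they are not dicts.
    """
    return {
        name: entry.get("reason", "")
        for name, entry in result.items()
        if isinstance(entry, dict) and entry.get("passed") is False
    }
```

An empty dict means every reported metric passed, which is the case for the sample result above.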
Metrics Explained
Heuristic Metrics (Free)
- Relevance — Does the response use terms from the user's question?
- Coherence — Is the response well-structured with clear formatting?
- Helpfulness — Does it contain actionable content (examples, code, steps)?
- Completeness — Is the response proportionally thorough relative to the question?
- Bias categories — Detects confirmation, gender, racial, and age bias patterns
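To make the idea of a text-analysis metric concrete, here is a rough sketch of a term-overlap relevance score: the share of the question's content words that also appear in the response. It illustrates the general technique only and is not the actor's implementation; the stopword list, tokenizer, and function names are assumptions.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "do", "i", "in", "how", "to", "is", "of", "and", "what"}


def terms(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def term_overlap(question: str, response: str) -> float:
    """Fraction of the question's terms that also occur in the response."""
    question_terms = terms(question)
    if not question_terms:
        return 0.0
    return len(question_terms & terms(response)) / len(question_terms)
```

A score like this is cheap to compute, but it rewards repeating the question's vocabulary rather than answering it, which is one reason to follow up with LLM-based metrics on conversations that matter.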
DeepEval Metrics (LLM-Powered)
- Answer Relevancy — Does the response actually answer what was asked?
- Faithfulness — Is the response grounded in the provided context? (requires context field)
- Coherence — Is it logically structured and easy to follow?
- Helpfulness — Does it provide actionable, useful information?
- Hallucination — Does it fabricate facts not in the context? (requires context field)
- Bias — Does it contain biased opinions or unfair statements?
- Toxicity — Does it contain toxic or harmful language?
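Since faithfulness and hallucination need a context field, it can help to check in advance which metrics are meaningful for a given conversation. The sketch below uses the metric keys from the DeepEval output sample; how the actor itself treats a missing context is not documented here, so `applicable_metrics` is only a planning aid.

```python
DEEPEVAL_METRICS = [
    "relevancy", "faithfulness", "coherence", "helpfulness",
    "hallucination", "bias", "toxicity",
]
# Metrics documented as requiring the conversation's context field.
NEEDS_CONTEXT = {"faithfulness", "hallucination"}


def applicable_metrics(conversation: dict) -> list[str]:
    """List the metrics that can be evaluated for this conversation.

    A missing, empty, or whitespace-only context excludes the
    context-dependent metrics.
    """
    has_context = bool(str(conversation.get("context") or "").strip())
    return [m for m in DEEPEVAL_METRICS if has_context or m not in NEEDS_CONTEXT]
```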
Tips
- Start with heuristic mode to quickly screen large batches with no OpenAI API cost
- Use deepeval mode for detailed analysis of important conversations
- Add a context field to your conversations to enable faithfulness and hallucination checks
- Use gpt-4o-mini as the model for cheaper deepeval runs with slightly lower accuracy
- Export results as CSV from the Dataset tab for spreadsheet analysis
Pricing
- Heuristic mode: Only Apify platform compute costs (minimal)
- DeepEval/Full mode: Apify compute + your OpenAI API usage (~$0.01-0.10 per conversation depending on model)
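For budgeting, the per-conversation range quoted above scales linearly with batch size. The helper below is a back-of-the-envelope sketch: it covers the OpenAI portion only, takes the quoted $0.01 to $0.10 figures at face value, and ignores Apify compute; `estimate_openai_cost` is a hypothetical name.

```python
def estimate_openai_cost(
    n_conversations: int,
    low_per_conversation: float = 0.01,
    high_per_conversation: float = 0.10,
) -> tuple[float, float]:
    """Return a (low, high) OpenAI cost estimate in USD for a batch."""
    low = round(n_conversations * low_per_conversation, 2)
    high = round(n_conversations * high_per_conversation, 2)
    return low, high
```

Under these assumptions, a batch of 500 conversations would cost roughly $5 to $50 in OpenAI usage.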