Evaluate AI Agents
Introduction:
So, you’ve built an AI agent. It’s smart, it uses tools, and it feels like magic. But here’s the million-dollar question: Can you actually trust it in front of a customer?
Unlike traditional software that follows a predictable "if-this-then-that" logic, AI agents are autonomous and, frankly, a bit unpredictable. Evaluating them isn’t just a "nice-to-have" task—it’s your safety net.
Let’s dive into why evaluation matters, what you should be measuring, and how to build a rock-solid evaluation strategy.
1. Why is Evaluation Critical?
Where standard code is deterministic, AI agents are probabilistic: they can behave differently every time. Evaluation is your way of moving from "I hope it works" to "I know it works."
- Safety & Alignment: It’s not just about technical performance. You need to ensure your agent respects human values, remains unbiased, and complies with regulations like GDPR or the AI Act. A brilliant agent that ignores ethics is a liability.
- Protecting Your Brand: We’ve all seen the headlines. Remember the recent buzz around models generating inappropriate content from user photos? Without a guardrail, a support agent can ruin your reputation in one bad interaction.
- The Overfitting Trap: It’s easy to build a system that aces a specific test but fails miserably with a real human. Proper evaluation helps you spot if your agent is just "memorizing" or truly "understanding."
- Guaranteed ROI: Tokens aren’t free! Evaluation helps you find the sweet spot between cost and performance, preventing you from over-engineering a solution that doesn’t add business value.
2. What Should You Actually Measure?
Evaluating an agent is different from testing a simple chatbot. You aren’t just checking words; you’re checking actions.
- Task Performance: Can it finish what it started? Benchmarks like AgentBench show a harsh reality: while top models succeed 80% of the time on simple tasks, that number can drop to 50% for complex, end-to-end automation.
- Tool Usage (The "Agentic" Secret Sauce): This is what makes an agent an agent. You need to measure three things (a minimal scoring sketch follows this list):
  - Selection: Did it pick the right tool?
  - Formatting: Are the API parameters correct?
  - Interpretation: Did it understand the tool’s output to plan the next step?
- Pro Tip: Frameworks like T-Eval are specifically designed to measure how accurately LLMs interact with external tools.
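To make those tool-usage metrics concrete, here is a minimal sketch of how you might score tool selection and parameter formatting against a hand-labelled reference trajectory. The score_tool_usage helper and the expected_calls structure are illustrative assumptions, not part of any particular framework.
def score_tool_usage(actual_calls, expected_calls):
    """Compare the agent's tool calls with an expected trajectory.

    Each call is a dict like {"name": "duckduckgo_search", "args": {"query": "AAPL price"}}.
    Returns per-axis scores between 0 and 1.
    """
    if not expected_calls:
        return {"selection": 1.0, "formatting": 1.0}
    selection_hits = 0
    formatting_hits = 0
    for actual, expected in zip(actual_calls, expected_calls):
        if actual["name"] == expected["name"]:
            selection_hits += 1
            # Parameter formatting only counts when the right tool was chosen
            if actual["args"] == expected["args"]:
                formatting_hits += 1
    n = len(expected_calls)
    return {"selection": selection_hits / n, "formatting": formatting_hits / n}

# Example: right tool, wrong parameter (company name instead of ticker)
actual = [{"name": "yahoo_finance_news", "args": {"query": "Apple Inc."}}]
expected = [{"name": "yahoo_finance_news", "args": {"query": "AAPL"}}]
print(score_tool_usage(actual, expected))  # {'selection': 1.0, 'formatting': 0.0}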
3. The Methodology: How to Evaluate Like a Pro
A reliable evaluation needs multiple layers; one perspective is never enough:
- LLM-as-a-Judge (The Automated Layer): Forget old metrics like BLEU or ROUGE—they just compare words. Today, we use a high-performing model (the "Judge") to grade a smaller model (the "Student"). A Judge LLM can understand nuance, reasoning, and context.
- Human-in-the-Loop (The Expert Layer):
  - Domain Expertise: For regulated industries, you need humans. They use Likert scales to grade the "logic" behind an answer.
  - A/B Testing: This is the ultimate test. Deploy two versions (different prompts or RAG setups) and see which one actually makes users happier (see the quick sketch after this list).
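As a rough illustration of the A/B idea, the sketch below compares user satisfaction for two variants on a 1-5 Likert scale. The ratings and the variant_a/variant_b names are made-up placeholders; in a real rollout you would collect far more feedback and run a proper significance test before declaring a winner.
from statistics import mean

# Placeholder Likert ratings (1-5) collected from users of each variant
variant_a = [4, 5, 3, 4, 5, 4, 2, 5]   # e.g. the current prompt
variant_b = [3, 4, 3, 3, 4, 2, 3, 4]   # e.g. a new RAG setup

def summarize(name, ratings):
    satisfied = sum(r >= 4 for r in ratings) / len(ratings)
    print(f"{name}: mean={mean(ratings):.2f}, satisfied (>=4)={satisfied:.0%}")

summarize("variant_a", variant_a)
summarize("variant_b", variant_b)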
4. Your Action Plan for Deployment
Ready to ship? Here’s my advice:
- Define Business Metrics: Don’t just say "it works." Track latency, token cost, and task completion rates (a minimal logging sketch follows this list).
- Build Diverse Datasets: Include your "happy paths," but focus on edge cases and malicious inputs.
- Governance: Get tech, business, and legal in one room to agree on what a "good" answer looks like.
- Gradual Rollout: Start with offline benchmarks, move to a Canary deployment, and only then go global.
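A lightweight way to make those business metrics concrete is to record them per run. The RunMetrics structure and the per-token prices below are illustrative assumptions; in practice you would pull the token counts from your provider’s usage metadata.
import time
from dataclasses import dataclass

@dataclass
class RunMetrics:
    task_completed: bool
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

    def cost_usd(self, price_in=2.5e-6, price_out=1e-5):
        # Placeholder per-token prices; substitute your model's actual pricing
        return self.prompt_tokens * price_in + self.completion_tokens * price_out

start = time.perf_counter()
# ... run the agent here ...
metrics = RunMetrics(
    task_completed=True,
    latency_s=time.perf_counter() - start,
    prompt_tokens=1200,       # e.g. taken from the LLM response's usage metadata
    completion_tokens=350,
)
print(f"latency={metrics.latency_s:.2f}s, cost=${metrics.cost_usd():.4f}")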
5. Putting it into Practice: Examples
Case 1: Testing Precision with LLM-as-a-Judge
Imagine we ask our agent for the US inflation rate. It replies: "It’s 2.5%."
from langchain_classic.evaluation.scoring import ScoreStringEvalChain
from langchain_openai import ChatOpenAI

# The "Judge" model that will grade the agent's answers
llm_as_judge = ChatOpenAI(model_name="gpt-5", temperature=0)
eval_chain = ScoreStringEvalChain.from_llm(llm=llm_as_judge)

input = "What is the current inflation rate in the USA?"
agent_answer = "The inflation rate in the USA is 2.5%."

# Grade the answer without any reference (ground truth) to compare against
evaluation_without_reference_answer = eval_chain.evaluate_strings(
    input=input,
    prediction=agent_answer,
)
print(f"Evaluation result without reference answer : {evaluation_without_reference_answer}")
The Judge gives this a 2/10. Why? Because it lacks context and sourcing. However, if we provide a "Ground Truth" (the expected answer) that matches that figure:
from langchain_classic.evaluation.scoring import LabeledScoreStringEvalChain

labeled_eval_chain = LabeledScoreStringEvalChain.from_llm(llm=llm_as_judge)
reference_answer = "The current inflation rate in the USA is 2.5 percent"

# Grade the same answer again, this time against the reference (ground truth)
evaluation_with_reference_answer = labeled_eval_chain.evaluate_strings(
    input=input,
    prediction=agent_answer,
    reference=reference_answer,
)
print(f"Evaluation result with reference answer : {evaluation_with_reference_answer}")
>> Evaluation result with reference answer : {'reasoning': 'The answer is concise, directly addresses the question, and matches the provided ground truth (2.5%). It is accurate, relevant, and appropriate for the straightforward query. While minimal in depth, additional detail wasn’t necessary.\n\nRating: [[10]]', 'score': 10}
Now the Judge is happy! With the reference provided, the agent gets a 10/10.
Case 2: Tone & Personality
A "mean" agent is a business killer. Here we compare a friendly response versus a rude one to see if our Judge can spot the difference.
from langchain_classic.evaluation import load_evaluator

# A custom criterion: the Judge only checks tone, not factual accuracy
custom_friendliness = {
    "friendliness": "Is the response written in a friendly and approachable tone?"
}
friendliness_evaluator = load_evaluator(
    "criteria", criteria=custom_friendliness, llm=llm_as_judge
)

user_prompt = "How many people live in canada as of 2023?"
friendly_agent_answer = "A billion people live in canada as of 2023, do you need any other information about canada ? It is my pleasure to help you"
not_friendly_agent_answer = "Stop asking stupid questions, you should look for this information by yourself"

result_1 = friendliness_evaluator.evaluate_strings(
    prediction=friendly_agent_answer, input=user_prompt
)
print("Evaluation result for friendly answer :", result_1)

result_2 = friendliness_evaluator.evaluate_strings(
    prediction=not_friendly_agent_answer, input=user_prompt
)
print("Evaluation result for not friendly answer :", result_2)
>> Evaluation result for friendly answer : {'reasoning': '- The criterion is friendliness: whether the response is written in a friendly and approachable tone.\n- The submission uses polite and welcoming language, e.g., "do you need any other information about canada ?" and "It is my pleasure to help you."\n- There is no hostile, dismissive, or curt language; the tone is inviting and helpful.\n- Despite factual inaccuracies, the tone itself is friendly and approachable.\n\nY', 'value': 'Y', 'score': 1}
>> Evaluation result for not friendly answer : {'reasoning': 'Step-by-step reasoning:\n- Identify the criterion: friendliness—tone should be friendly and approachable.\n- Analyze the submission’s language: it uses insulting phrasing (“stupid questions”), a dismissive command (“Stop asking”), and shifts responsibility in a curt way (“look for this information by yourself”).\n- Assess tone: the language is rude, confrontational, and unhelpful, which is the opposite of friendly and approachable.\n- Conclusion: the submission does not meet the friendliness criterion.\n\nN', 'value': 'N', 'score': 0}
Case 3: Evaluating Tool Trajectories
Let’s look at a ReAct agent using Yahoo Finance and DuckDuckGo. We don’t just look at the final answer; we look at the trajectory (the steps taken).
from langchain.agents import create_agent
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_community.tools.yahoo_finance_news import YahooFinanceNewsTool

tools = [DuckDuckGoSearchRun(), YahooFinanceNewsTool()]
agent = create_agent(model="gpt-4.1-nano-2025-04-14", tools=tools)
We ask the agent for the current price of one Apple share and whether it is a buy or a sell:
query = "What is the current stock market price of Apple? And do you think it is a buy or a sell?"
result = agent.invoke({"messages": [query]})
print(result["messages"][-1].content)
>> The current stock price of Apple (AAPL) is approximately $259.23. According to recent analyst consensus, Apple holds a moderate buy recommendation, with an average 12-month price target of $299.49, implying potential for about 15% upside from the current levels. However, whether to buy or sell depends on your individual investment strategy and risk tolerance.
We extract the intermediate steps leading to the final answer:
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

intermediate_steps = []
i = 1
for message in result["messages"][1:-1]:
    if isinstance(message, AIMessage):
        # An AIMessage carries the tool calls the agent decided to make
        step = f"- Step {i}: The agent selected the following tools \n" + "\n".join([
            f"""\t- tool_name: {tool_call["name"]}, tool_args: {tool_call["args"]}"""
            for tool_call in message.tool_calls
        ])
        intermediate_steps.append(step)
    if isinstance(message, ToolMessage):
        # A ToolMessage carries what the tool returned to the agent
        intermediate_steps.append(f"- Step {i}: The tool {message.name} has responded: {message.content}")
    i += 1
intermediate_steps = "\n".join(intermediate_steps)
Finally, we ask our Judge (GPT-5) to grade the execution on three pillars:
- Tool Selection Accuracy
- Parameter Precision
- Final Correctness
import json

def evaluate_agent_step(input_query, trajectory, final_output, reference_answer):
    """
    Evaluates a ReAct agent's trajectory based on T-Eval principles:
    tool selection, reasoning logic, and final accuracy.
    """
    evaluation_prompt = f"""
You are an expert AI Evaluator specializing in ReAct (Reasoning + Acting) patterns.

### TASK
Evaluate the following agent's performance based on the Ground Truth provided.

### INPUT DATA
- User Query: {input_query}
- Trajectory: {trajectory}
- Final Output: {final_output}
- Ground Truth: {reference_answer}

### EVALUATION CRITERIA
1. **Tool Selection Accuracy**: Did the agent pick the right tools for the task?
2. **Parameter Precision**: Were the inputs passed to the tools correct and optimized?
3. **Final Correctness**: Does the final answer match the Ground Truth?

The agent has access to the following tools:
- Tool 1: {tools[0].name}: {tools[0].description}
- Tool 2: {tools[1].name}: {tools[1].description}

### OUTPUT FORMAT
Return ONLY a JSON object with this structure:
{{
    "scores": {{
        "tool_selection": (int 1-5),
        "parameter_precision": (int 1-5),
        "final_accuracy": (int 1-5)
    }},
    "critique": "Brief explanation of the scores",
    "status": "PASS/FAIL"
}}
"""
    judge = ChatOpenAI(model="gpt-5", temperature=0)
    response = judge.invoke(evaluation_prompt)
    # Parse the Judge's JSON string into a Python dictionary
    return json.loads(response.content)
evaluate_agent_step(
    input_query=query,
    trajectory=intermediate_steps,
    final_output=result["messages"][-1].content,
    reference_answer="Apple stock price is 259.04, it is a buy stock",
)
>> {'scores': {'tool_selection': 3, 'parameter_precision': 3, 'final_accuracy': 2},
 'critique': "The agent initially chose the news tool (yahoo_finance_news) for a price query and passed an incorrect parameter ('Apple Inc.' instead of 'AAPL'). It later used duckduckgo_search appropriately and corrected the ticker for the news tool, though the news tool was unnecessary for price retrieval. The final answer reported $259.23 instead of the ground truth $259.04 and did not clearly state a 'buy' recommendation, instead hedging despite citing a moderate buy consensus.",
 'status': 'FAIL'}
Final Thoughts: The Road to Reliable AI
Building an AI agent is only 20% of the journey; the remaining 80% is proving that it works reliably, safely, and cost-effectively. As we move away from simple chatbots toward autonomous agents that can actually "act" on the world, our testing methods must evolve too.
By implementing a multi-layered evaluation strategy—combining the speed of LLM-as-a-judge with the deep nuance of human expertise—you’re not just building a cool demo. You’re building a production-ready tool that provides real business value.
Remember: Evaluation is a marathon, not a sprint. Every time your agent fails in production, don’t just fix the prompt—add that failure to your testing dataset (a minimal sketch of that loop follows below). This continuous loop is what separates "experimental toys" from "enterprise-grade AI."
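One simple way to close that loop is to append every production failure to a versioned regression set that your offline evaluation replays on each release. The file name, record fields, and example values below are just one possible convention, not a prescribed format.
import json
from datetime import datetime, timezone

def log_failure(query, bad_answer, expected_behavior, path="eval_regressions.jsonl"):
    # Append one failed interaction to a JSONL regression dataset
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": query,
        "observed_output": bad_answer,
        "expected_behavior": expected_behavior,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_failure(
    query="What is the current stock market price of Apple?",
    bad_answer="Apple is a fruit company founded in 1976.",  # hypothetical bad answer
    expected_behavior="Return the latest AAPL price and state the recommendation.",
)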