Track
Measure · Judge · Trace · Improve
How do you know if your agent is actually working? A single demo run looks promising, but what about edge cases? What about the time the agent silently ignored its instructions? What about prompt injection?
Evaluation is how you get answers to these questions. Not by reading outputs and guessing, but by running systematic checks that measure specific behaviors — correctness, tool usage, safety, trajectory quality. The goal is to catch regressions before they reach production and to prove that changes actually improve the agent.
The simplest eval: compare the agent's output to a known reference. But exact string comparison is too brittle: 'Hello!' and 'hello' should match. Normalization handles this by lowercasing, stripping punctuation, and collapsing whitespace before comparison.
Both the prediction and the reference go through the same normalization pipeline. This catches the common case where the agent says the right thing in a slightly different format.
Exact match only checks the final answer. For agent behavior, you often want to score multiple dimensions separately: correctness, tool usage, error handling, and efficiency.
Each dimension gets a score from 0 to 1, and the overall score is a weighted combination. This gives you a profile of the agent's strengths and weaknesses, not just a single pass/fail.
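As a rough illustration of the weighted combination, here is a minimal sketch. The dimension names follow the list above, but the specific weights and the `rubric_score` helper are hypothetical and chosen only for illustration; a real rubric would pick weights to match what matters for the agent under test.

```python
from typing import Dict

# Hypothetical weights for illustration only; they are not prescribed anywhere.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.5,
    "tool_usage": 0.2,
    "error_handling": 0.2,
    "efficiency": 0.1,
}


def rubric_score(scores: Dict[str, float],
                 weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total weight keeps the result in [0, 1]
    # even when the weights do not sum to exactly 1.
    return sum(weights[n] * scores[n] for n in weights) / total_weight


profile = {
    "correctness": 1.0,
    "tool_usage": 0.5,
    "error_handling": 0.0,
    "efficiency": 1.0,
}
overall = rubric_score(profile)
```

Keeping the per-dimension `profile` alongside `overall` preserves the breakdown, so a low overall score can be traced to the dimension that caused it.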
Concrete Example
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(s):
    # Lowercase, strip punctuation, then collapse runs of whitespace.
    return " ".join(
        s.lower().translate(_PUNCT).split()
    )

def exact_match(preds, refs):
    if not preds:
        return 0.0
    # zip stops at the shorter list; unpaired predictions count as misses.
    matches = sum(
        1 for p, r in zip(preds, refs)
        if normalize(p) == normalize(r)
    )
    return matches / len(preds)

Both predictions and references pass through the same normalizer: lowercase, strip punctuation, collapse whitespace. Then zip the pairs and compare. The function returns accuracy in [0, 1]. Simple, fast, and catches formatting differences that would break naive string comparison.
Fast, deterministic comparison after normalization. Catches format-only differences.
Multi-dimensional evaluation that profiles agent strengths across categories.
Check if the agent was tricked into ignoring its instructions.
Validate the agent's reasoning path, not just the final answer.
Run multiple seeds or variants and compute aggregate pass rates for confidence.
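The last item, aggregate pass rates, can be sketched as follows. This is a minimal illustration under stated assumptions: `run_check` is a hypothetical callable that runs the agent once for a given seed and returns whether that run passed, and the standard error uses the usual binomial approximation as one simple way to express confidence.

```python
import math
from typing import Callable, Iterable, NamedTuple


class PassRate(NamedTuple):
    runs: int
    passes: int
    rate: float       # fraction of runs that passed, in [0, 1]
    std_error: float  # binomial standard error of the rate


def pass_rate(run_check: Callable[[int], bool],
              seeds: Iterable[int]) -> PassRate:
    """Run one check per seed and aggregate the pass/fail results."""
    results = [bool(run_check(seed)) for seed in seeds]
    n = len(results)
    if n == 0:
        return PassRate(0, 0, 0.0, 0.0)
    passes = sum(results)
    rate = passes / n
    # Binomial approximation: sqrt(p * (1 - p) / n).
    std_error = math.sqrt(rate * (1.0 - rate) / n)
    return PassRate(n, passes, rate, std_error)


# Stand-in for a real agent run (hypothetical): passes on even seeds only.
def even_seed_check(seed: int) -> bool:
    return seed % 2 == 0


summary = pass_rate(even_seed_check, range(10))
```

A single run tells you little about a nondeterministic agent; the standard error shrinks as more seeds are added, which shows how far the measured rate can be trusted.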