Track

Evals

Measure · Judge · Trace · Improve

How do you know if your agent is actually working? A single demo run looks promising, but what about edge cases? What about the time the agent silently ignored its instructions? What about prompt injection?

Evaluation is how you get answers to these questions. Not by reading outputs and guessing, but by running systematic checks that measure specific behaviors — correctness, tool usage, safety, trajectory quality. The goal is to catch regressions before they reach production and to prove that changes actually improve the agent.

Exact Match with Normalization

The simplest eval: compare the agent's output to a known reference. But exact string comparison is too brittle — 'Hello!' and 'hello' should match. Normalization handles this by lowering case, stripping punctuation, and collapsing whitespace before comparison.

Both the prediction and the reference go through the same normalization pipeline. This catches the common case where the agent says the right thing in a slightly different format.

Rubric Scoring

Exact match only checks the final answer. For agent behavior, you often want to score multiple dimensions separately: correctness, tool usage, error handling, and efficiency.

Each dimension gets a score from 0 to 1, and the overall score is a weighted combination. This gives you a profile of the agent's strengths and weaknesses, not just a single pass/fail.

Concrete Example

Normalized Exact Match

import string

_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(s):
    return " ".join(
        s.lower().translate(_PUNCT).split()
    )

def exact_match(preds, refs):
    if not preds:
        return 0.0
    matches = sum(
        1 for p, r in zip(preds, refs)
        if normalize(p) == normalize(r)
    )
    return matches / len(preds)

Both predictions and references pass through the same normalizer: lowercase, strip punctuation, collapse whitespace. Then zip and compare. Returns accuracy in [0, 1]. Simple, fast, and catches formatting differences that would break naive string comparison.

Key Ideas

Exact Match

Fast, deterministic comparison after normalization. Catches format-only differences.

Rubric Scoring

Multi-dimensional evaluation that profiles agent strengths across categories.

Prompt Injection Detection

Check if the agent was tricked into ignoring its instructions.

Trajectory Comparison

Validate the agent's reasoning path, not just the final answer.

Aggregation

Run multiple seeds or variants and compute aggregate pass rates for confidence.

Problems in this track

6 problems. Sign in to start solving.

TitleDifficultyAcceptanceEst.

Exact-Match Eval with Normalization

Score predictions against references after lowercase/whitespace/punctuation normalization.

Easy85%15m

Score Rubric Categories from Test Runs

Translate raw test output into named rubric buckets with partial credit.

Medium53%25m

Detect Prompt Injection in Model Output

Identify outputs that attempt to override system or task instructions.

Hard35%35m

Compare Expected vs Actual Tool Traces

Evaluate agent trajectories against an expected trace order.

Hard38%35m

Aggregate Pass Rates Across Seeds

Average pass rates across multiple stochastic runs.

Medium55%20m

Build a Hidden-Test Failure Summary

Summarize hidden-test failures without leaking the test content.

Medium51%25m

Exact Match with Normalization

Both the prediction and the reference go through the same normalization pipeline. This catches the common case where the agent says the right thing in a slightly different format.

Rubric Scoring

Exact match only checks the final answer. For agent behavior, you often want to score multiple dimensions separately: correctness, tool usage, error handling, and efficiency.

Each dimension gets a score from 0 to 1, and the overall score is a weighted combination. This gives you a profile of the agent's strengths and weaknesses, not just a single pass/fail.

Normalized Exact Match

import string _PUNCT = str.maketrans("", "", string.punctuation) def normalize(s): return " ".join( s.lower().translate(_PUNCT).split() ) def exact_match(preds, refs): if not preds: return 0.0 matches = sum( 1 for p, r in zip(preds, refs) if normalize(p) == normalize(r) ) return matches / len(preds)

Key Ideas

Exact Match

Fast, deterministic comparison after normalization. Catches format-only differences.

Rubric Scoring

Multi-dimensional evaluation that profiles agent strengths across categories.

Prompt Injection Detection

Check if the agent was tricked into ignoring its instructions.

Trajectory Comparison

Validate the agent's reasoning path, not just the final answer.

Aggregation

Run multiple seeds or variants and compute aggregate pass rates for confidence.