Assess AI agent performance at three levels: Final Response (end results), Single Step (individual actions), and Trajectory (execution paths). This guide helps you identify failure points, optimize costs, and improve agent reliability.

Additional Resources

For detailed grader documentation, see Built-in Graders.

Why Evaluate AI Agents?

AI agents operate autonomously through complex reasoning loops, making multiple tool calls and decisions before reaching a final answer. This multi-step nature creates unique evaluation challenges—a wrong tool selection early on can cascade into complete task failure.

Systematic evaluation enables you to:

  • Identify failure points — Pinpoint issues in planning, tool selection, or execution
  • Optimize costs — Reduce unnecessary tool calls and LLM iterations
  • Ensure reliability — Validate performance before deployment
  • Continuously improve — Drive enhancements through data-driven insights

Choose the Right Evaluation Granularity

Granularity      What It Measures                                   When to Use
Final Response   Overall task success and answer quality            Production monitoring, A/B testing
Single Step      Individual action quality (tool calls, planning)   Debugging failures, prompt engineering
Trajectory       Multi-step reasoning paths and efficiency          Cost optimization, training reward models

Evaluation Strategy

Start with Final Response evaluation to establish baseline success rates. When failures occur, use Single Step evaluation to pinpoint root causes. Use Trajectory evaluation to detect systemic issues like loops or inefficiencies.
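
A minimal sketch of this workflow, assuming each record already carries the final answer plus the extracted tool-call fields used later in this guide (the record layout and the passing threshold are assumptions, not a fixed schema):

import asyncio
from openjudge.graders.agent import ToolSelectionGrader
from openjudge.graders.common import CorrectnessGrader
from openjudge.models import OpenAIChatModel

async def triage(records: list[dict]) -> None:
    model = OpenAIChatModel(model="qwen3-32b")
    final_grader = CorrectnessGrader(model=model)
    step_grader = ToolSelectionGrader(model=model)

    for record in records:
        # Level 1: score the final answer first.
        final = await final_grader.aevaluate(
            query=record["query"], response=record["response"]
        )
        if final.score >= 4:  # assumed passing threshold on the 1-5 scale
            continue
        # Level 2: the answer failed, so inspect the tool-selection step.
        step = await step_grader.aevaluate(
            query=record["query"],
            tool_definitions=record["tool_definitions"],
            tool_calls=record["tool_calls"],
        )
        print(f"{record['query']}: final={final.score}, tool_selection={step.score}")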

Evaluate Final Response

Assess the end result of agent execution to determine if the agent successfully completed the user's task.

Step 1: Choose a Grader

Suppose you want to evaluate: Is the agent's final answer factually correct compared to a reference answer?

This is a correctness evaluation task. OpenJudge provides the CorrectnessGrader for exactly this purpose—it compares the response against a reference and scores accuracy on a 1-5 scale.

Your Scenario                                 Recommended Grader
Is the answer correct?                        CorrectnessGrader
Does the response answer the question?        RelevanceGrader
Does the response contain hallucinations?     HallucinationGrader
Is the response harmful or unsafe?            HarmfulnessGrader

For a complete list of available graders, see Built-in Graders.

In this example, we'll use CorrectnessGrader to evaluate the agent's final answer.

Step 2: Initialize the Model

from openjudge.models import OpenAIChatModel

# Option 1: use OPENAI_API_KEY and OPENAI_BASE_URL from the environment
model = OpenAIChatModel(model="qwen3-32b")

# Option 2: pass credentials explicitly
model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

Step 3: Prepare Your Data

Prepare a dictionary with query and response fields:

data = {
    "query": "What is the capital of France?",
    "response": "The capital of France is Paris."
}

Step 4: Run Evaluation

import asyncio
from openjudge.graders.common import CorrectnessGrader
from openjudge.models import OpenAIChatModel

async def main():
    # Initialize model and grader
    model = OpenAIChatModel(
        model="qwen3-32b",
        api_key="your-api-key",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )
    grader = CorrectnessGrader(model=model)

    # Prepare data
    data = {
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris."
    }

    # Evaluate
    result = await grader.aevaluate(**data)
    print(result)

asyncio.run(main())

Output:

GraderScore(
    name='correctness',
    score=5.0,
    reason="The response correctly states that the capital of France is Paris, which is factually consistent with the reference response 'Paris'. The added phrasing 'The capital of France is' provides appropriate context without contradicting or distorting the reference."
)
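
The returned GraderScore exposes the name, score, and reason fields shown above, so you can gate on it programmatically (the 4.0 cutoff below is an assumption for illustration):

# Inside main(), after evaluation: flag low-scoring answers for review.
if result.score < 4.0:  # assumed review threshold on the 1-5 scale
    print(f"Needs review ({result.name}): {result.reason}")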

Evaluate Single Step

Assess individual agent decisions in isolation—one tool call, one planning step, or one memory retrieval at a time.

Step 1: Choose a Grader

Suppose you want to evaluate: Did the agent select the most appropriate tool for the current sub-task?

This is a tool selection evaluation task. OpenJudge provides the ToolSelectionGrader for exactly this purpose—it assesses whether the chosen tool matches the task requirements.

Your Scenario                           Recommended Grader
Did the agent select the right tool?    ToolSelectionGrader
Did the tool call succeed?              ToolCallSuccessGrader
Is the plan feasible?                   PlanFeasibilityGrader
Is the memory retrieval accurate?       MemoryAccuracyGrader
Is the reflection accurate?             ReflectionAccuracyGrader

For a complete list of available graders, see Agent Graders.

In this example, we'll use ToolSelectionGrader to evaluate the agent's tool choice.

Step 2: Initialize the Model

from openjudge.models import OpenAIChatModel

# Option 1: use OPENAI_API_KEY and OPENAI_BASE_URL from the environment
model = OpenAIChatModel(model="qwen3-32b")

# Option 2: pass credentials explicitly
model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

Step 3: Prepare Your Data

Single Step graders require specific fields extracted from your agent traces. Prepare a dictionary with query, tool_definitions, and tool_calls:

data = {
    "query": "What's 15% tip on a $45 bill?",
    "tool_definitions": [
        {"name": "calculator", "description": "Perform mathematical calculations"},
        {"name": "search_web", "description": "Search the web for information"}
    ],
    "tool_calls": [
        {"name": "calculator", "arguments": '{"expression": "45 * 0.15"}'}
    ]
}

Extracting from Agent Traces

If your data is in OpenAI messages format, you'll need to extract the relevant fields. See the complete example below for a mapper function.

Step 4: Run Evaluation

import asyncio
from openjudge.graders.agent import ToolSelectionGrader
from openjudge.models import OpenAIChatModel

async def main():
    # Initialize model and grader
    model = OpenAIChatModel(
        model="qwen3-32b",
        api_key="your-api-key",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )
    grader = ToolSelectionGrader(model=model)

    # Prepare data
    data = {
        "query": "What's 15% tip on a $45 bill?",
        "tool_definitions": [
            {"name": "calculator", "description": "Perform mathematical calculations"},
            {"name": "search_web", "description": "Search the web for information"}
        ],
        "tool_calls": [
            {"name": "calculator", "arguments": '{"expression": "45 * 0.15"}'}
        ]
    }

    # Evaluate
    result = await grader.aevaluate(**data)
    print(result)

asyncio.run(main())

Output:

GraderScore(
    name='tool_selection',
    score=5.0,
    reason="The agent selected the 'calculator' tool with the expression '45 * 0.15', which is the most direct and efficient tool for computing a percentage-based tip. The query is purely mathematical, requiring no external information. The calculator tool is fully capable of performing this arithmetic operation accurately."
)

Evaluate Trajectory

Assess the entire sequence of agent actions to determine if the agent took an optimal path without loops or redundant steps.

Step 1: Choose a Grader

Suppose you want to evaluate: Did the agent complete the task efficiently without unnecessary steps or loops?

This is a trajectory evaluation task. OpenJudge provides the TrajectoryComprehensiveGrader for exactly this purpose—it analyzes the full execution path for efficiency and correctness.

For a complete list of available graders, see Agent Graders.

In this example, we'll use TrajectoryComprehensiveGrader to evaluate the agent's execution path.

Step 2: Initialize the Model

from openjudge.models import OpenAIChatModel

# Option 1: use OPENAI_API_KEY and OPENAI_BASE_URL from the environment
model = OpenAIChatModel(model="qwen3-32b")

# Option 2: pass credentials explicitly
model = OpenAIChatModel(
    model="qwen3-32b",
    api_key="your-api-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

Step 3: Prepare Your Data

Prepare a full agent trajectory in OpenAI messages format:

data = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant with tools."},
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {
            "role": "assistant",
            "content": "I'll check the weather for you.",
            "tool_calls": [{
                "id": "call_1",
                "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
            }]
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "name": "get_weather",
            "content": '{"temp": 22, "condition": "sunny"}'
        },
        {
            "role": "assistant",
            "content": "The weather in Tokyo is sunny with 22°C."
        }
    ]
}

Step 4: Run Evaluation

import asyncio
from openjudge.graders.agent import TrajectoryComprehensiveGrader
from openjudge.models import OpenAIChatModel

async def main():
    # Initialize model and grader
    model = OpenAIChatModel(
        model="qwen3-32b",
        api_key="your-api-key",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )
    grader = TrajectoryComprehensiveGrader(model=model)

    # Prepare data
    data = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant with tools."},
            {"role": "user", "content": "What's the weather in Tokyo?"},
            {
                "role": "assistant",
                "content": "I'll check the weather for you.",
                "tool_calls": [{
                    "id": "call_1",
                    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
                }]
            },
            {
                "role": "tool",
                "tool_call_id": "call_1",
                "name": "get_weather",
                "content": '{"temp": 22, "condition": "sunny"}'
            },
            {
                "role": "assistant",
                "content": "The weather in Tokyo is sunny with 22°C."
            }
        ]
    }

    # Evaluate
    result = await grader.aevaluate(**data)
    print(result)

asyncio.run(main())

Output:

GraderScore(
    name='trajectory_comprehensive',
    score=1.0,
    reason="The agent efficiently completed the task in a single tool call. It correctly identified the need for weather information, selected the appropriate tool, and provided a clear, accurate response based on the tool output. No unnecessary steps or loops were detected."
)
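
For contrast, a trajectory that repeats the same tool call would typically score lower; the duplicated get_weather call below is a hypothetical example of the kind of redundancy this grader is designed to flag:

# Hypothetical inefficient trajectory: the agent repeats an identical tool call.
inefficient = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {
            "role": "assistant",
            "content": "I'll check the weather for you.",
            "tool_calls": [{
                "id": "call_1",
                "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
            }]
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "name": "get_weather",
            "content": '{"temp": 22, "condition": "sunny"}'
        },
        {
            "role": "assistant",
            "content": "Let me check again.",
            "tool_calls": [{
                "id": "call_2",
                "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}
            }]
        },
        {
            "role": "tool",
            "tool_call_id": "call_2",
            "name": "get_weather",
            "content": '{"temp": 22, "condition": "sunny"}'
        },
        {
            "role": "assistant",
            "content": "The weather in Tokyo is sunny with 22°C."
        }
    ]
}
# result = await grader.aevaluate(**inefficient)  # expect a lower score than above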

Batch Evaluation with GradingRunner

For evaluating multiple agent traces efficiently, use GradingRunner to run graders concurrently with automatic progress tracking:

import asyncio
from openjudge.graders.agent import ToolSelectionGrader
from openjudge.models import OpenAIChatModel
from openjudge.runner.grading_runner import GradingRunner, GraderConfig

async def main():
    # Initialize model and grader
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ToolSelectionGrader(model=model)

    # Define mapper to extract grader inputs from agent traces
    def extract_tool_inputs(data: dict) -> dict:
        messages = data["messages"]
        query = next((m["content"] for m in messages if m["role"] == "user"), "")
        tool_calls = []
        for msg in messages:
            if msg.get("role") == "assistant" and msg.get("tool_calls"):
                for tc in msg["tool_calls"]:
                    tool_calls.append({
                        "name": tc["function"]["name"],
                        "arguments": tc["function"]["arguments"]
                    })
        return {
            "query": query,
            "tool_definitions": data["available_tools"],
            "tool_calls": tool_calls
        }

    # Configure runner with mapper
    runner = GradingRunner(
        grader_configs={
            "tool_selection": GraderConfig(
                grader=grader,
                mapper=extract_tool_inputs
            )
        },
        max_concurrency=16,
        show_progress=True
    )

    # Prepare dataset (agent traces)
    dataset = [
        {   # Bad case: should use calculator, not search_web
            "messages": [
                {"role": "user", "content": "What's 15% tip on a $45 bill?"},
                {"role": "assistant", "tool_calls": [{"function": {"name": "search_web", "arguments": '{"query": "15% tip on $45"}'}}]}
            ],
            "available_tools": [
                {"name": "calculator", "description": "Perform mathematical calculations"},
                {"name": "search_web", "description": "Search the web for information"}
            ]
        },
        {   # Good case: correctly uses get_weather
            "messages": [
                {"role": "user", "content": "What's the weather in Tokyo?"},
                {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'}}]}
            ],
            "available_tools": [
                {"name": "get_weather", "description": "Get weather information"},
                {"name": "search_web", "description": "Search the web for information"}
            ]
        },
    ]

    # Run batch evaluation
    results = await runner.arun(dataset)

    # Print results
    for i, result in enumerate(results["tool_selection"]):
        print(f"Trace {i}: Score={result.score}")

asyncio.run(main())

Output:

Trace 0: Score=2.0   # Wrong tool: used search_web instead of calculator
Trace 1: Score=5.0   # Correct tool: used get_weather for weather query
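
Since each entry in the per-grader result list is a GraderScore, you can summarize a batch directly; a minimal sketch continuing inside main() above (the 3.0 flagging threshold is an assumption):

# Continuing inside main(): aggregate batch results and flag weak traces.
scores = [r.score for r in results["tool_selection"]]
print(f"Mean tool-selection score: {sum(scores) / len(scores):.2f}")
for i, r in enumerate(results["tool_selection"]):
    if r.score < 3.0:  # assumed flagging threshold
        print(f"Trace {i} flagged: {r.reason}")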

For more details on batch evaluation, data mapping, and result aggregation, see Run Grading Tasks.

Next Steps