Evaluate AI agent behavior across actions, tools, memory, planning, reflection, and trajectories. These graders help you assess decision quality, detect failures, and optimize agent performance at every step.
Overview
| Category | Grader | Purpose | Type | Score Range | Key Use Case |
|---|---|---|---|---|---|
| Action | ActionAlignmentGrader | Evaluates action-plan consistency | LLM-Based | {0, 1} | ReAct agents, step-by-step reasoning |
| Action | ActionLoopDetectionGrader | Detects repetitive actions | Code-Based | [0, 1] | Multi-step exploration tasks |
| Tool | ToolSelectionGrader | Assesses tool choice quality | LLM-Based | 1-5 | Function calling agents |
| Tool | ToolCallAccuracyGrader | Evaluates tool call accuracy | LLM-Based | 1-5 | API-based assistants |
| Tool | ToolCallSuccessGrader | Checks technical execution success | LLM-Based | {0, 1} | Production agent monitoring |
| Tool | ToolParameterCheckGrader | Validates parameter correctness | LLM-Based | {0, 1} | Slot-filling dialogues |
| Tool | ToolCallSequenceMatchGrader | Compares tool call sequences | Code-Based | [0, 1] | Benchmark evaluation |
| Memory | MemoryAccuracyGrader | Validates memory factuality | LLM-Based | {0, 1} | Memory-augmented agents |
| Memory | MemoryDetailPreservationGrader | Checks detail retention | LLM-Based | {0, 1} | Long-horizon tasks |
| Memory | MemoryRetrievalEffectivenessGrader | Assesses memory retrieval | LLM-Based | {0, 1} | RAG-based agents |
| Plan | PlanFeasibilityGrader | Evaluates plan feasibility | LLM-Based | {0, 1} | Task planning agents |
| Reflection | ReflectionAccuracyGrader | Validates reflection accuracy | LLM-Based | {0, 1} | Self-correcting agents |
| Reflection | ReflectionOutcomeUnderstandingGrader | Checks outcome understanding | LLM-Based | {0, 1} | Error recovery scenarios |
| Reflection | ReflectionProgressAwarenessGrader | Assesses progress awareness | LLM-Based | {0, 1} | Goal-tracking agents |
| Observation | ObservationInformationGainGrader | Measures information gain | Code-Based | [0, 1] | Exploration efficiency |
| Trajectory | TrajectoryComprehensiveGrader | Comprehensive trajectory evaluation | LLM-Based | [0, 1] | End-to-end agent testing |
Performance
Benchmark results using qwen3-max on agent evaluation tasks:
| Category | Grader | Samples | Preference Accuracy | Source |
|---|---|---|---|---|
| Action | ActionAlignmentGrader | 8 | 88% | ALFWorld, WebShop, GAIA |
| Tool | ToolCallAccuracyGrader | 40 | 90% | API-Bank |
| Tool | ToolCallSuccessGrader | 20 | 95% | API-Bank |
| Tool | ToolParameterCheckGrader | 20 | 75% | API-Bank |
| Tool | ToolSelectionGrader | 20 | 85% | API-Bank |
| Memory | MemoryAccuracyGrader | 18 | 78% | ALFWorld, WebShop, GAIA |
| Memory | MemoryDetailPreservationGrader | 25 | 76% | ALFWorld, WebShop, GAIA |
| Memory | MemoryRetrievalEffectivenessGrader | 4 | 100% | ALFWorld |
| Plan | PlanFeasibilityGrader | 7 | 86% | ALFWorld, GAIA |
| Reflection | ReflectionAccuracyGrader | 2 | 100% | ALFWorld |
| Reflection | ReflectionOutcomeUnderstandingGrader | 23 | 78% | ALFWorld, GAIA |
| Reflection | ReflectionProgressAwarenessGrader | 28 | 74% | ALFWorld, WebShop, GAIA |
Performance Metrics
Preference Accuracy measures alignment with human-annotated preference labels (positive and negative samples) on agent evaluation tasks. Higher is better.
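For clarity, preference accuracy here is simply the fraction of samples on which the grader's verdict agrees with the human preference label. The helper below is a hypothetical illustration of that computation, not part of the openjudge API:

```python
def preference_accuracy(grader_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of samples where the grader agrees with the human preference label.

    Hypothetical helper for illustration; `grader_verdicts` and `human_labels` are
    assumed to be parallel lists of boolean judgments over the same samples.
    """
    assert len(grader_verdicts) == len(human_labels)
    matches = sum(g == h for g, h in zip(grader_verdicts, human_labels))
    return matches / len(human_labels)

# e.g. 36 agreements out of 40 samples -> 0.90, i.e. the 90% reported for ToolCallAccuracyGrader
```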
Benchmark Sources:
- API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs (EMNLP 2023) — A benchmark with 73 API tools and 314 tool-use dialogues for evaluating LLM tool utilization capabilities.
- ALFWorld, WebShop, GAIA: Evaluation datasets from Where LLM Agents Fail and How They Can Learn From Failures — A systematic study of agent failure modes with the AgentErrorTaxonomy covering memory, reflection, planning, and action modules.
Action Graders
ActionAlignmentGrader
Evaluates whether agent actions align with stated plans or reasoning.
Use this grader to:
- Verify consistency between planning and execution
- Debug agent decision-making processes
- Ensure actions follow stated intentions
Evaluation criteria: Direct plan implementation, correct object targeting, goal contribution, logical sequencing, and respect for constraints.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `plan` | str | Yes | Agent's planning/reasoning statement |
| `action` | str | Yes | Agent's executed action |
| `history` | List[dict] | No | Previous step dictionaries for context |
| `context` | str | No | Task context (description, environment, available actions) |
Scoring:
- 1.0: Good alignment - action follows plan logically
- 0.0: Poor alignment - action inconsistent with plan
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ActionAlignmentGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ActionAlignmentGrader(model=model)

    result = await grader.aevaluate(
        plan="I will open drawer 1 to find the key.",
        action="open drawer 1",
        context="Task: Find the key to unlock the door"
    )

    print(f"Score: {result.score}")  # 1.0 - good alignment
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The action 'open drawer 1' directly implements the stated plan 'I will open drawer 1 to find the key.' It targets the correct object (drawer 1), contributes to achieving the goal of finding the key, follows the logical order outlined in the plan, and respects any implied preconditions (e.g., needing to open the drawer to access its contents). The alignment is clear and direct, so confidence is high.
ActionLoopDetectionGrader
Detects repetitive or similar actions in agent sequences.
Use this grader to:
- Identify when agents get stuck in loops
- Detect inefficient exploration strategies
- Debug stuck agents in multi-step tasks
Evaluation criteria: Compares all pairs of action signatures for similarity.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `messages` | List[Dict[str, Any]] | Yes | Message list containing agent interactions |
| `similarity_threshold` | float | No | Threshold to consider actions similar (default: 1.0) |
Scoring:
- 1.0: No loops detected
- 0.0: Many similar action pairs found
- Score computed as: 1.0 - (similar_pairs / total_pairs); see the sketch below
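To make the formula concrete, here is a small illustrative sketch of the pair-based computation, assuming exact-match similarity between action signatures (this is not the grader's internal code):

```python
from itertools import combinations

def loop_score(action_signatures: list[str]) -> float:
    """Illustrative sketch of 1.0 - similar_pairs / total_pairs with exact-match similarity."""
    pairs = list(combinations(action_signatures, 2))
    if not pairs:
        return 1.0  # fewer than two actions: nothing to compare, so no loop detected
    similar = sum(a == b for a, b in pairs)
    return 1.0 - similar / len(pairs)

# Two identical `search` calls form one pair, and that pair is similar -> score 0.0,
# consistent with the example output below.
print(loop_score(['search{"query": "python"}', 'search{"query": "python"}']))  # 0.0
```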
Example:
```python
import asyncio

from openjudge.graders.agent import ActionLoopDetectionGrader

async def main():
    grader = ActionLoopDetectionGrader(similarity_threshold=1.0)

    messages = [
        {"role": "assistant", "tool_calls": [{"id": "1", "function": {"name": "search", "arguments": '{"query": "python"}'}}]},
        {"role": "tool", "tool_call_id": "1", "content": "Results..."},
        {"role": "assistant", "tool_calls": [{"id": "2", "function": {"name": "search", "arguments": '{"query": "python"}'}}]},
        {"role": "tool", "tool_call_id": "2", "content": "Results..."},
    ]

    result = await grader.aevaluate(messages=messages)

    print(f"Score: {result.score}")  # Lower score indicates loop
    print(f"Similar pairs: {result.metadata['similar_pair_count']}")

asyncio.run(main())
```
Output:
Score: 0.0
Similar pairs: 1
Tool Graders
ToolSelectionGrader
Evaluates tool selection quality for addressing user queries.
Use this grader to:
- Assess tool choice appropriateness
- Evaluate agent decision-making quality
- Compare different agent architectures
Evaluation criteria: Tool relevance, selection completeness, efficiency, and understanding of tool capabilities.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | str or List[Dict] | Yes | User query or conversation history |
| `tool_definitions` | List[Dict[str, Any]] | Yes | Available tool definitions |
| `tool_calls` | List[Dict[str, Any]] | Yes | Tools actually selected by the agent |
Scoring:
- 5: Optimal tool selection - most direct and efficient
- 4: Reasonable selection - can complete task but not optimal
- 3: Acceptable - related but not direct match
- 2: Poor - clearly mismatched with task
- 1: Incorrect - no tool selected or completely irrelevant
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ToolSelectionGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ToolSelectionGrader(model=model)

    result = await grader.aevaluate(
        query="Find all Python files modified in the last week",
        tool_definitions=[
            {"name": "search_files", "description": "Search for files by pattern"},
            {"name": "git_log", "description": "Get git commit history"},
            {"name": "read_file", "description": "Read file contents"}
        ],
        tool_calls=[
            {"name": "search_files", "arguments": {"pattern": "*.py"}},
            {"name": "git_log", "arguments": {"days": 7}}
        ]
    )

    print(f"Score: {result.score}")  # 4-5 - good tool selection
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 5.0
Reason: The selected tools are highly relevant and directly address the user's query. The 'search_files' tool with the pattern '*.py' effectively identifies all Python files in the system, while the 'git_log' tool with the argument 'days: 7' retrieves the commit history for the last week, which can be used to determine which of those Python files were modified recently. Together, these tools provide a complete and efficient solution without including any unnecessary or redundant tools. The selection demonstrates a clear understanding of both the task intent and the capabilities of the available tools.
ToolCallAccuracyGrader
Evaluates tool call accuracy including parameter correctness and query relevance.
Use this grader to:
- Validate tool call correctness
- Assess parameter extraction accuracy
- Evaluate agent tool-use capability
Evaluation criteria: Tool relevance to query and parameter correctness according to definitions.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | str or List[Dict] | Yes | Query or chat history |
| `tool_definitions` | List[Dict[str, Any]] | Yes | Tool definitions with parameters |
| `tool_calls` | List[Dict[str, Any]] | No | Tool calls to evaluate (or provide `response`) |
| `response` | str or List[Dict] | No | Response containing tool calls |
Scoring:
- 5: Fully relevant, all parameters correct
- 4: Relevant, tools returned errors but agent retried successfully
- 3: Relevant but unnecessary/excessive calls
- 2: Partially relevant, insufficient tools or incorrect parameters
- 1: Irrelevant or tool names not found in definitions
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ToolCallAccuracyGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ToolCallAccuracyGrader(model=model)

    conversation = [
        {"role": "user", "content": "What's the weather like in New York?"}
    ]
    tool_definitions = [
        {
            "name": "get_weather",
            "description": "Get weather information for a location",
            "parameters": {"location": "City name"}
        }
    ]
    tool_calls = [
        {
            "name": "get_weather",
            "arguments": {"location": "New York"}
        }
    ]

    result = await grader.aevaluate(
        query=conversation,
        tool_definitions=tool_definitions,
        tool_calls=tool_calls
    )

    print(f"Score: {result.score}")  # 5.0 - accurate tool call
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 5.0
Reason: The tool call 'get_weather' is fully relevant to the user's query about the weather in New York. The name of the tool call matches one of the function names in the tool definitions, and the parameter 'location' with the value 'New York' is correctly extracted from the conversation and aligns with the description in the tool definition.
ToolCallSuccessGrader
Evaluates technical execution success of tool calls (no errors, exceptions, or timeouts).
Use this grader to:
- Detect technical failures in tool execution
- Monitor agent reliability
- Debug tool integration issues
Evaluation criteria: Checks for technical execution success (no errors, exceptions, or timeouts). Does not evaluate business correctness.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `tool_definitions` | List[Dict[str, Any]] | Yes | Tool definitions for context |
| `tool_calls` | List[Dict[str, Any]] | Yes | Tool calls to evaluate (name and arguments) |
| `tool_responses` | str or List[str] | Yes | Tool responses corresponding to each tool call |
Scoring:
- 1.0: All tool calls successful
- 0.0: At least one tool call failed
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ToolCallSuccessGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ToolCallSuccessGrader(model=model)

    tool_definitions = [
        {
            "name": "get_weather",
            "description": "Get weather information",
            "parameters": {"location": "City name"}
        }
    ]
    tool_calls = [
        {
            "name": "get_weather",
            "arguments": {"location": "New York"}
        }
    ]
    tool_responses = [
        "The weather in New York is sunny and 25 degrees Celsius."
    ]

    result = await grader.aevaluate(
        tool_definitions=tool_definitions,
        tool_calls=tool_calls,
        tool_responses=tool_responses
    )

    print(f"Score: {result.score}")  # 1.0 - successful
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The tool call executed successfully, returned a non-empty result, and did not contain any error messages or exceptions.
ToolParameterCheckGrader
Evaluates parameter extraction accuracy from user queries.
Use this grader to:
- Validate parameter extraction accuracy
- Ensure grounded parameter values
- Detect hallucinated parameters
Evaluation criteria: Parameter completeness, accuracy, grounding, and correct mapping.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | str or List[Dict] | Yes | User query or conversation history |
| `tool_definitions` | List[Dict[str, Any]] | Yes | Tool definitions with parameter specifications |
| `tool_calls` | List[Dict[str, Any]] | Yes | Tool calls made by the agent |
Scoring:
- 1.0: All parameters correct and complete
- 0.0: Parameters have issues (missing, incorrect, or fabricated)
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ToolParameterCheckGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ToolParameterCheckGrader(model=model)

    result = await grader.aevaluate(
        query="Search for Python files in the src directory",
        tool_definitions=[
            {
                "name": "search_files",
                "parameters": {"pattern": "str", "directory": "str"}
            }
        ],
        tool_calls=[
            {
                "name": "search_files",
                "arguments": {"pattern": "*.py", "directory": "src"}
            }
        ]
    )

    print(f"Score: {result.score}")  # 1.0 - correct parameters
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The tool call correctly extracted all required parameters from the user query. The 'pattern' parameter was set to '*.py', which accurately reflects the intent to search for Python files. The 'directory' parameter was set to 'src', matching the specified directory in the query. Both parameters are present, grounded in the query, and formatted correctly as strings. There are no hallucinations or missing parameters, and the data types align with the tool's definition. The tool call is fully executable with correct parameters.
ToolCallSequenceMatchGrader
Compares agent tool call sequences against reference sequences.
Use this grader for:
- Benchmark evaluation against ground truth
- Trajectory comparison and validation
- A/B testing different agent implementations
Evaluation criteria: Strict mode matches name + parameters; loose mode matches name only.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `messages` | List[Dict[str, Any]] | Yes | Agent's message history with tool calls |
| `reference_tool_calls` | List[List[Dict[str, Any]]] | Yes | Ground-truth reference tool calls, grouped by step |
| `strict_mode` | bool | No | Match name + parameters (True) or name only (False); default: True |
| `use_jaccard_similarity` | bool | No | Use Jaccard similarity ignoring order (True) or step-by-step matching (False); default: True |
Scoring:
- Strict mode with Jaccard: intersection over union of (tool_name, parameters) pairs
- Loose mode with Jaccard: intersection over union of tool names
- Step-by-step mode: average F1 score across steps
- Range: 0.0 (no match) to 1.0 (perfect match); see the sketch below
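For intuition, the sketch below shows what a strict-mode Jaccard comparison over (tool_name, parameters) pairs might look like. It is a simplified illustration under that assumption, not the grader's implementation:

```python
import json

def jaccard_strict(predicted: list[dict], reference: list[dict]) -> float:
    """Jaccard similarity over (tool_name, canonicalized-arguments) pairs.

    Illustrative only: each call becomes a set element, so order is ignored,
    mirroring the strict + Jaccard mode described above.
    """
    def to_key(call: dict) -> tuple:
        return (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))

    pred = {to_key(c) for c in predicted}
    ref = {to_key(c) for c in reference}
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

calls = [{"name": "search", "arguments": {"query": "python"}}]
print(jaccard_strict(calls, calls))  # 1.0 - a perfect match
```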
Example:
```python
import asyncio

from openjudge.graders.agent import ToolCallSequenceMatchGrader

async def main():
    grader = ToolCallSequenceMatchGrader(
        strict_mode=True,
        use_jaccard_similarity=True
    )

    messages = [
        {"role": "assistant", "tool_calls": [
            {"id": "1", "function": {"name": "search", "arguments": '{"query": "python"}'}}
        ]},
        {"role": "tool", "tool_call_id": "1", "content": "Results..."},
    ]
    reference_tool_calls = [
        [
            {"name": "search", "arguments": {"query": "python"}}
        ]
    ]

    result = await grader.aevaluate(
        messages=messages,
        reference_tool_calls=reference_tool_calls
    )

    print(f"Score: {result.score}")  # 1.0 - perfect match
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: Tool call sequence evaluation (strict mode, jaccard): jaccard_similarity=1.000
Memory Graders
MemoryAccuracyGrader
Evaluates accuracy and factuality of stored memory content.
Use this grader to:
- Validate memory system correctness
- Ensure grounded information storage
- Debug hallucination in memory
Evaluation criteria: Memory reflects actual observations, stores only factual details, and maintains accurate associations.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation` | str | Yes | Agent's observation from the environment |
| `memory` | str | Yes | Agent's memory content |
| `history` | List[dict] | No | Previous step dictionaries |
| `context` | str | No | Task context |
Scoring:
- 1.0: Accurate and factual memory
- 0.0: Inaccurate or fabricated memory
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import MemoryAccuracyGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = MemoryAccuracyGrader(model=model)

    result = await grader.aevaluate(
        observation="You see a closed cabinet with three drawers.",
        memory="The cabinet is closed and has three drawers.",
        context="Task: Inventory room objects"
    )

    print(f"Score: {result.score}")  # 1.0 - accurate
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The memory accurately reflects the observation by recording only factual details present in the input. The agent correctly notes that the cabinet is 'closed' and has 'three drawers,' which are both explicitly mentioned in the observation. There are no interpretations, assumptions, or fabrications included in the memory. The information is consistent with what was observed, and all recorded elements are grounded in the provided context. This demonstrates good accuracy as per the rubrics.
MemoryDetailPreservationGrader
Evaluates preservation of important details in stored memory.
Use this grader to:
- Validate detail retention
- Ensure actionable memory content
- Debug information loss
Evaluation criteria: Storage of specific details, exact locations, numerical values, and important constraints.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation` | str | Yes | Agent's observation from the environment |
| `memory` | str | Yes | Agent's memory content |
| `history` | List[dict] | No | Previous step dictionaries |
| `context` | str | No | Task context |
Scoring:
- 1.0: Important details preserved
- 0.0: Important details lost or generalized
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import MemoryDetailPreservationGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = MemoryDetailPreservationGrader(model=model)

    result = await grader.aevaluate(
        observation="Cabinet 1 at coordinates (3.5, 2.1) contains 5 red apples.",
        memory="Cabinet 1 at (3.5, 2.1) has 5 red apples.",
        context="Task: Inventory items with precise locations"
    )

    print(f"Score: {result.score}")  # 1.0 - details preserved
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The agent successfully preserves all important details from the observation in its memory. It retains the specific location of Cabinet 1 with exact coordinates (3.5, 2.1), the quantity of items (5 apples), and the attribute (red). These details align directly with the rubrics for preserving spatial information, numerical values, and specific attributes. The memory is sufficiently detailed and actionable for future inventory-related tasks. Confidence is high because the preservation is explicit and matches the original observation precisely.
MemoryRetrievalEffectivenessGrader
Evaluates effectiveness of memory retrieval for planning and decision-making.
Use this grader to:
- Assess memory system effectiveness
- Detect failure to use available information
- Identify repetitive behavior due to poor retrieval
Evaluation criteria: Memory retrieval relevance, usage in planning, and avoidance of redundant exploration.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `plan` | str | Yes | Agent's planning/reasoning |
| `observation` | str | Yes | Current environment observation |
| `memory` | str | Yes | Agent's memory content |
| `history` | List[dict] | No | Previous steps |
| `context` | str | No | Task context |
Scoring:
- 1.0: Effective memory retrieval
- 0.0: Ineffective retrieval or failure to use memory
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import MemoryRetrievalEffectivenessGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = MemoryRetrievalEffectivenessGrader(model=model)

    result = await grader.aevaluate(
        plan="I will use the key from drawer 1 to unlock the door.",
        observation="You are standing in the room with a locked door.",
        memory="The key was found in drawer 1 in step 3.",
        context="Task: Unlock the door"
    )

    print(f"Score: {result.score}")  # 1.0 - effective retrieval
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The agent's plan effectively retrieves relevant information from memory by referencing the key found in drawer 1 during step 3. This demonstrates that the agent is using previously stored and correct information to inform its current action of unlocking the door. The plan aligns with the memory content, avoids repetition of past actions (no indication of trying other drawers), and is consistent with the observation of a locked door. The retrieval is current and accurate, showing strong memory effectiveness. Confidence is high because the connection between memory and plan is clear and directly supports the task at hand.
Plan Graders
PlanFeasibilityGrader
Evaluates logical soundness and feasibility of agent plans.
Use this grader to:
- Validate agent planning capability
- Ensure logical action sequences
- Debug infeasible plans
Evaluation criteria: Causal logic, action order feasibility, executability, and prerequisite awareness.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `plan` | str | Yes | Agent's planning/reasoning |
| `observation` | str | Yes | Current environment observation |
| `memory` | str | Yes | Agent's memory content |
| `history` | List[dict] | No | Previous steps |
| `context` | str | No | Task context |
Scoring:
- 1.0: Feasible and logically sound
- 0.0: Infeasible or illogical
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import PlanFeasibilityGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = PlanFeasibilityGrader(model=model)

    result = await grader.aevaluate(
        plan="I will first open the drawer to get the key, then use it to unlock the door.",
        observation="The drawer is closed. You don't have any items.",
        memory="The key is inside the drawer.",
        context="Task: Unlock the door"
    )

    print(f"Score: {result.score}")  # 1.0 - feasible
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The plan is logically sound and feasible. It respects causal logic by first retrieving the key (which is inside the drawer) before attempting to unlock the door. The sequence of actions—opening the drawer, obtaining the key, and then unlocking the door—is in a correct and necessary order. The plan also accounts for the current environment state: the drawer is closed, and the agent does not yet have the key. Therefore, opening the drawer is a valid prerequisite action. The steps are consistent with the goal of unlocking the door and are executable given the described scenario. Confidence is high because all rubrics for feasibility are clearly satisfied.
Reflection Graders
ReflectionAccuracyGrader
Evaluates accuracy of agent reflections based on actual observations.
Use this grader to:
- Validate agent self-assessment accuracy
- Ensure grounded reflections
- Debug hallucination in reasoning
Evaluation criteria: Reflections only mention observed objects, states, and details.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation` | str | Yes | Agent's observation from the environment |
| `reflection` | str | Yes | Agent's reflection on the situation |
| `history` | List[dict] | No | Previous steps |
| `context` | str | No | Task context |
Scoring:
- 1.0: Accurate and grounded reflection
- 0.0: Contains fabrications
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ReflectionAccuracyGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ReflectionAccuracyGrader(model=model)

    result = await grader.aevaluate(
        observation="You see a closed cabinet.",
        reflection="I observed a closed cabinet.",
        context="Task: Find objects in the room"
    )

    print(f"Score: {result.score}")  # 1.0 - accurate
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The reflection accurately summarizes the observation without adding, omitting, or fabricating any information. The agent mentions only what was observed: a closed cabinet. It does not introduce any additional objects, states, or details that were not present in the original observation. This demonstrates full compliance with all rubrics for reflection accuracy. Confidence is high because the reflection is directly and explicitly grounded in the observation.
ReflectionOutcomeUnderstandingGrader
Evaluates correctness of action outcome interpretation in reflections.
Use this grader to:
- Validate outcome interpretation accuracy
- Detect fabricated or distorted understanding
- Ensure evidence-based reasoning
Evaluation criteria: Strict factual accuracy checking of outcome interpretation.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation` | str | Yes | Agent's observation from the environment |
| `reflection` | str | Yes | Agent's reflection on the situation |
| `history` | List[dict] | No | Previous steps |
| `context` | str | No | Task context |
Scoring:
- 1.0: Correct understanding - reflection accurately mirrors observation
- 0.0: Poor understanding - factual distortion, failure misinterpretation, premature conclusions, scope overreach, inference leaps, fabrication, or format misinterpretation
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ReflectionOutcomeUnderstandingGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ReflectionOutcomeUnderstandingGrader(model=model)

    result = await grader.aevaluate(
        observation="The drawer is now open. You see a key inside.",
        reflection="I successfully opened the drawer and found a key inside.",
        context="Task: Find the key"
    )

    print(f"Score: {result.score}")  # 1.0 - correct understanding
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The reflection accurately mirrors the observation: 'The drawer is now open. You see a key inside.' The agent correctly interprets this as a successful action (opening the drawer) and identifies the presence of the key, which aligns with the task objective of finding the key. There is no factual distortion, no unsupported inference, and no overreach in interpreting partial information. The agent does not claim to have seen all contents or make premature conclusions about absence. The reasoning is directly supported by the observation and demonstrates good understanding of both the outcome and its implications.
ReflectionProgressAwarenessGrader
Evaluates accuracy of task progress awareness and sub-goal recognition.
Use this grader to:
- Assess task progress tracking
- Detect loop/stuck situations
- Validate sub-goal awareness
Evaluation criteria: Correct identification of accomplishments, accurate distance-to-goal assessment, and recognition of all sub-goals.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `observation` | str | Yes | Agent's observation from the environment |
| `reflection` | str | Yes | Agent's reflection on the situation |
| `history` | List[dict] | No | Previous steps |
| `context` | str | No | Task context (critical for sub-goal tracking) |
Scoring:
- 1.0: Accurate awareness - correctly identifies accomplishments, assesses distance to goal, recognizes all sub-goals
- 0.0: Inaccurate awareness - misjudges progress, overlooks sub-goals, claims "almost done" while major requirements unmet
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ReflectionProgressAwarenessGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = ReflectionProgressAwarenessGrader(model=model)

    result = await grader.aevaluate(
        observation="You have collected 3 out of 5 required items.",
        reflection="Good progress! I'm about halfway through the task. Still need to find 2 more items.",
        context="Task: Collect 5 specific items from different locations"
    )

    print(f"Score: {result.score}")  # 1.0 - accurate awareness
    print(f"Reason: {result.reason}")

asyncio.run(main())
```
Output:
Score: 1.0
Reason: The agent demonstrates accurate progress awareness by correctly identifying that it has collected 3 out of the 5 required items and acknowledging that 2 more are still needed. The reflection states, 'I'm about halfway through the task,' which is a realistic estimation given the current state. The agent does not overestimate its progress or ignore any critical sub-goals. It also shows awareness of the exact number of remaining tasks without substituting or omitting any specific item from the original task description. The reflection is concise but contains all necessary information to assess forward progress accurately. Confidence in this evaluation is high because the agent's self-assessment aligns with the observable facts and task constraints.
Observation Graders
ObservationInformationGainGrader
Measures information gain and redundancy in observation sequences.
Use this grader to:
- Evaluate exploration efficiency
- Detect redundant information gathering
- Assess agent curiosity/exploration strategy
Evaluation criteria: Rewards novel observations, penalizes redundant ones based on similarity threshold.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `messages` | List[Dict[str, Any]] | Yes | Message list containing agent interactions |
| `similarity_threshold` | float | No | Redundancy threshold (default: 0.5) |
Scoring:
- 1.0: High information gain, low redundancy
- 0.0: High redundancy, low information gain
- Score is based on average per-observation novelty, with an exponential penalty for similarity; see the sketch below
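The exact penalty function is internal to the grader, but the example output below is consistent with a simple reading in which each observation's novelty is roughly 1 minus its similarity to earlier observations, averaged across turns. A rough sketch under that assumption (not the grader's implementation):

```python
def information_gain(similarities: list[float]) -> float:
    """Rough illustration: average per-observation novelty, i.e. mean of (1 - similarity).

    Assumes `similarities` holds each observation's similarity to prior observations,
    as reported in the grader's `each_turn_similarity` metadata. The real grader may
    additionally apply an exponential penalty above the similarity threshold.
    """
    if not similarities:
        return 1.0
    novelties = [1.0 - s for s in similarities]
    return sum(novelties) / len(novelties)

print(information_gain([0.0, 0.42857142857142855]))  # ~0.7857, matching the example output below
```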
Example:
```python
import asyncio

from openjudge.graders.agent import ObservationInformationGainGrader

async def main():
    grader = ObservationInformationGainGrader(similarity_threshold=0.5)

    messages = [
        {"role": "assistant", "tool_calls": [{"id": "1", "function": {"name": "look", "arguments": '{}'}}]},
        {"role": "tool", "tool_call_id": "1", "content": "You see a red box."},
        {"role": "assistant", "tool_calls": [{"id": "2", "function": {"name": "look", "arguments": '{}'}}]},
        {"role": "tool", "tool_call_id": "2", "content": "You see a blue sphere."},
    ]

    result = await grader.aevaluate(messages=messages)

    print(f"Score: {result.score}")  # Higher = more novel observations
    print(f"Each turn similarity: {result.metadata['each_turn_similarity']}")

asyncio.run(main())
```
Output:
Score: 0.7857142857142857
Each turn similarity: [0.0, 0.42857142857142855]
Trajectory Graders
TrajectoryComprehensiveGrader
Comprehensive evaluation of complete agent trajectories.
Use this grader for:
- End-to-end agent evaluation
- Holistic trajectory assessment
- Agent benchmark evaluation
- Production quality monitoring
Evaluation criteria: Step contribution, relevance, accuracy, and efficiency across the complete trajectory.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `messages` | List[Dict[str, Any]] | Yes | Complete message history including system, user, assistant, and tool messages |
| `resolution_threshold` | float | No | Threshold for success determination (default: 0.8) |
Scoring:
Each dimension uses a 1-5 scale in prompts, normalized to 0-1:
- 5 → 1.0: Excellent
- 4 → 0.75: Good
- 3 → 0.5: Acceptable
- 2 → 0.25: Poor
- 1 → 0.0: Very poor
The overall score is the average across all steps and dimensions; see the sketch below.
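A minimal sketch of that normalization and averaging, assuming the per-step dimension ratings are available as integers (this is a hypothetical helper, not the grader's API, and whether `is_resolved` is exactly this threshold comparison is an assumption):

```python
SCALE_TO_UNIT = {5: 1.0, 4: 0.75, 3: 0.5, 2: 0.25, 1: 0.0}

def overall_trajectory_score(step_ratings: list[dict[str, int]],
                             resolution_threshold: float = 0.8) -> tuple[float, bool]:
    """Normalize 1-5 ratings to 0-1 and average across all steps and dimensions.

    `step_ratings` is a hypothetical structure: one dict of dimension -> rating per step,
    e.g. {"contribution": 5, "relevance": 4, "accuracy": 5, "efficiency": 4}.
    """
    values = [SCALE_TO_UNIT[r] for step in step_ratings for r in step.values()]
    score = sum(values) / len(values)
    return score, score >= resolution_threshold  # assumed resolution check

score, resolved = overall_trajectory_score([
    {"contribution": 5, "relevance": 5, "accuracy": 5, "efficiency": 4},
    {"contribution": 5, "relevance": 4, "accuracy": 5, "efficiency": 5},
])
print(score, resolved)  # 0.9375 True
```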
Example:
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import TrajectoryComprehensiveGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    grader = TrajectoryComprehensiveGrader(
        model=model,
        resolution_threshold=0.75
    )

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Find Python files modified today"},
        {"role": "assistant", "content": "I'll search for Python files.",
         "tool_calls": [{"id": "1", "function": {"name": "search_files", "arguments": '{"pattern": "*.py"}'}}]},
        {"role": "tool", "tool_call_id": "1", "content": "Found: main.py, utils.py"},
        {"role": "assistant", "content": "I'll check their modification dates.",
         "tool_calls": [{"id": "2", "function": {"name": "get_file_info", "arguments": '{"files": ["main.py", "utils.py"]}'}}]},
        {"role": "tool", "tool_call_id": "2", "content": "main.py: today, utils.py: yesterday"},
        {"role": "assistant", "content": "Found 1 file modified today: main.py"}
    ]

    result = await grader.aevaluate(messages=messages)

    print(f"Overall Score: {result.score}")  # 0.0-1.0
    print(f"Is Resolved: {result.metadata['is_resolved']}")
    print(f"Avg Contribution: {result.metadata['avg_contribution']}")
    print(f"Avg Relevance: {result.metadata['avg_relevance']}")
    print(f"Avg Accuracy: {result.metadata['avg_accuracy']}")
    print(f"Avg Efficiency: {result.metadata['avg_efficiency']}")

    # Per-step details
    for step in result.metadata['step_evaluations']:
        print(f"Step {step['step_index']}: {step['step_reason']}")

asyncio.run(main())
```
Output:
Overall Score: 1.0
Is Resolved: True
Avg Contribution: 1.0
Avg Relevance: 1.0
Avg Accuracy: 1.0
Avg Efficiency: 1.0
Step 0: This step searches for all Python files (files ending with .py) in the system. It is a foundational step that identifies the set of candidate files to evaluate for modification date. Without this step, there would be no list of files to analyze further. The pattern used is accurate and directly relevant to the user's query.
Step 1: This step retrieves file metadata (specifically modification dates) for the identified Python files. This information is essential to determine which files were modified today. The result correctly distinguishes between 'today' and 'yesterday', enabling the final answer to be constructed accurately. This is a critical follow-up to Step 0 and directly supports the user's goal.
Summary
Agent graders provide comprehensive evaluation across all aspects of agent behavior—from individual actions and tool calls to memory management, planning, reflection, and complete trajectories.
Key capabilities:
- Process-level debugging — Identify specific failure points in tool selection, parameter extraction, or reasoning
- Outcome-level assessment — Measure overall task success and trajectory quality
- Systematic improvement — Combine multiple graders to diagnose where agents fail, why they fail, and how to improve them
Build complete evaluation pipelines by combining graders from different categories to match your agent architecture and debugging needs.
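As a starting point, the sketch below combines two graders already shown in this guide into a single evaluation pass over one agent step. The inputs are illustrative, and the simple averaging at the end is one possible aggregation rather than a prescribed one:

```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.agent import ActionAlignmentGrader, ToolParameterCheckGrader

async def main():
    model = OpenAIChatModel(model="qwen3-32b")
    alignment = ActionAlignmentGrader(model=model)
    params = ToolParameterCheckGrader(model=model)

    # Run two process-level checks concurrently for the same agent step.
    alignment_result, params_result = await asyncio.gather(
        alignment.aevaluate(
            plan="I will search for Python files in the src directory.",
            action="search_files(pattern='*.py', directory='src')",
            context="Task: List Python files in the src directory",
        ),
        params.aevaluate(
            query="Search for Python files in the src directory",
            tool_definitions=[{"name": "search_files", "parameters": {"pattern": "str", "directory": "str"}}],
            tool_calls=[{"name": "search_files", "arguments": {"pattern": "*.py", "directory": "src"}}],
        ),
    )

    # One simple aggregation: average the two {0, 1} scores into a single step score.
    combined = (alignment_result.score + params_result.score) / 2
    print(f"Alignment: {alignment_result.score}, Parameters: {params_result.score}, Combined: {combined}")

asyncio.run(main())
```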
Next Steps
- Multimodal Graders — Evaluate image and vision tasks
- Code & Math Graders — Evaluate code generation and mathematical problem-solving
- Build Reward for Training — Combine multiple graders for RLHF rewards