The Task Judger evaluates agent outputs and assigns rewards during training. This page covers the built-in judgers for common scenarios and how to create custom judgers for specific evaluation needs.
When to use the task judger
- Is the task judger necessary for all tasks? No.
  - There are two options to generate the reward (see the sketch after this list):
    - Compute the reward inside the user-defined workflow (`WorkflowOutput.reward is not None`)
    - Compute the reward outside the user-defined workflow (`WorkflowOutput.reward is None`)
  - The task judger is how AgentJet handles out-of-workflow reward computation.
  - The task judger is disabled and ignored when the user-defined workflow returns an effective `WorkflowOutput.reward` (i.e., `WorkflowOutput.reward != None`).
  - The task judger is enabled when the user-defined workflow returns `WorkflowOutput.reward = None`.
- When to use the task judger:
  - When you plan to reuse the reward function in multiple other workflows in the future.
  - When you want to decouple rollout and reward-computation logic.
  - When you want to use the OpenJudge integration to generate Auto Rubrics rewards.
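
A minimal sketch of the two options, assuming a workflow shaped like the `MyWorkflow` example later on this page; the helper functions and the equality check here are hypothetical and only illustrate the shape of the returned `WorkflowOutput`.

```python
from ajet.workflow import WorkflowOutput


# Option 1: compute the reward inside the workflow.
# Because reward is not None, the task judger is disabled and ignored.
def finish_with_inline_reward(final_answer: str, reference: str) -> WorkflowOutput:
    reward = 1.0 if final_answer.strip() == reference.strip() else 0.0  # illustrative rule
    return WorkflowOutput(reward=reward, metadata={"final_answer": final_answer})


# Option 2: compute the reward outside the workflow.
# Because reward is None, the configured task judger is enabled and fills it in.
def finish_without_reward(final_answer: str) -> WorkflowOutput:
    return WorkflowOutput(reward=None, metadata={"final_answer": final_answer})
```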
Overview
A Task Judger evaluates the agent's execution results and returns two values:
| Return Value | Type | Description |
|---|---|---|
| `raw_reward` | `float` | Numerical score representing output quality (often 0.0 to 1.0) |
| `is_success` | `bool` | Whether the task was successfully completed |
These values guide the RL training process, helping agents learn which behaviors produce better outcomes.
Base Interface
All Task Judgers inherit from `BaseJudge` and implement the `compute_reward` method:

```python
from ajet.workflow import WorkflowOutput, WorkflowTask


# The interface, as defined in ajet.task_judge.base_judge:
class BaseJudge:
    def __init__(self, config):
        self.config = config

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        """
        Args:
            workflow_task: Contains the task data, including metadata with reference answers
            workflow_output: Contains the agent's output, including metadata with generated answers

        Returns:
            tuple: (raw_reward: float, is_success: bool)
        """
        raise NotImplementedError
```
Built-in Task Judgers
AgentJet provides three built-in judgers for common evaluation scenarios:
1. MathAnswerAsJudge
Evaluates mathematical answers by exact string matching; designed for tasks where answers are formatted in LaTeX `\boxed{}` notation (a sketch of the matching logic follows the metadata table below).

When to use
- Math problem-solving tasks
- Tasks with deterministic, exact answers
- Answers formatted as `\boxed{result}`

How it works
- Extracts the answer from `\boxed{...}` in the agent's output
- Compares it with the reference answer from `workflow_task.task.metadata["answer"]`
- Returns `(1.0, True)` for correct answers, `(0.0, False)` otherwise
Required metadata:

| Field | Source | Description |
|---|---|---|
| `final_answer` | `workflow_output.metadata` | Agent's answer in `\boxed{}` format |
| `answer` | `workflow_task.task.metadata` | Reference answer |
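
A minimal sketch of the kind of matching described above, assuming the answer is pulled out of `\boxed{...}` with a regular expression; the exact extraction and normalization rules in AgentJet may differ.

```python
import re


def boxed_exact_match(model_output: str, reference_answer: str) -> tuple[float, bool]:
    """Illustrative only: extract the \\boxed{...} content and compare by exact string match."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    extracted = match.group(1).strip() if match else ""
    correct = extracted == reference_answer.strip()
    return (1.0, True) if correct else (0.0, False)


# Example usage: prints (1.0, True)
print(boxed_exact_match(r"The final answer is \boxed{42}.", "42"))
```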
2. CountdownAnswerAsJudge
Evaluates mathematical equations with partial credit for proper formatting.
When to use
- Number-puzzle tasks (e.g., the Countdown game)
- Tasks where partial credit is appropriate
- Tasks that should reward proper formatting even when the answer is wrong (see the sketch below)
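
A minimal sketch of partial-credit scoring in this spirit; the specific score values and format rules below are assumptions for illustration, not the exact behavior of CountdownAnswerAsJudge.

```python
import re


def countdown_partial_credit(equation: str, target: int) -> tuple[float, bool]:
    """Illustrative scoring: small credit for a well-formed equation, full credit for a correct one."""
    # Format check (assumed rule): only digits, + - * /, parentheses, and whitespace.
    if not re.fullmatch(r"[\d+\-*/()\s]+", equation):
        return 0.0, False
    try:
        result = eval(equation)  # acceptable in a sketch; never eval untrusted input in production
    except Exception:
        return 0.1, False  # well-formed characters but not evaluable: format credit only (assumed)
    if result == target:
        return 1.0, True   # correct equation
    return 0.1, False      # valid format, wrong result: partial credit (assumed)


# Example usage: prints (1.0, True)
print(countdown_partial_credit("(4 + 5) * 6", 54))
```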
3. EnvServiceJudge
Delegates evaluation to an external environment service, useful for complex interactive environments.
When to use
- Tasks with external simulators (e.g., AppWorld)
- Interactive environments with built-in evaluators
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
```
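
For intuition, here is a purely illustrative sketch of a judger that delegates scoring to an external evaluator over HTTP. The endpoint, request shape, and `task_id` field are hypothetical assumptions, not the actual EnvServiceJudge implementation or an AppWorld API.

```python
import requests  # any HTTP client would do; assumed available

from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask


class MyEnvDelegatingJudge(BaseJudge):
    """Illustrative only: forward reward computation to an external evaluator service."""

    def __init__(self, config):
        super().__init__(config)
        self.evaluate_url = "http://localhost:8000/evaluate"  # hypothetical endpoint

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        # Hypothetical request/response shape: the service scores the episode it observed.
        resp = requests.post(
            self.evaluate_url,
            json={"task_id": workflow_task.task.metadata.get("task_id")},
            timeout=30,
        )
        result = resp.json()
        return float(result["score"]), bool(result["success"])
```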
Creating Custom Task Judgers
For specialized evaluation needs, create your own judger by inheriting from `BaseJudge`:

1. Implement Your Judger: create a new file with your custom judger class.
2. Configure Your Judger: point to your custom class in the YAML configuration.
3. Pass Data to the Judger: populate `workflow_output.metadata` with the data your judger needs.
Step 1: Implement Your Judger
```python
from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask


class MyCustomJudge(BaseJudge):
    def __init__(self, config):
        super().__init__(config)
        self.threshold = 0.8  # minimum similarity required to count as a success

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        agent_answer = workflow_output.metadata.get("final_answer", "")
        reference_answer = workflow_task.task.metadata.get("answer", "")

        similarity = self._compute_similarity(agent_answer, reference_answer)
        is_success = similarity >= self.threshold
        return similarity, is_success

    def _compute_similarity(self, text1: str, text2: str) -> float:
        # Word-overlap ratio: shared words divided by the larger word count.
        return len(set(text1.split()) & set(text2.split())) / max(
            len(text1.split()), len(text2.split()), 1
        )
```
Step 2: Configure Your Judger
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.my_task.my_judge->MyCustomJudge
```
Step 3: Pass Data to the Judger
```python
class MyWorkflow(Workflow):
    async def execute(self, task: WorkflowTask, tuner: AjetTuner) -> WorkflowOutput:
        # ... build the agent's input message `msg` from the task (omitted) ...
        final_answer = await self.agent.reply(msg)

        return WorkflowOutput(
            reward=None,  # will be filled in by the configured judger
            metadata={
                "final_answer": final_answer,  # consumed by MyCustomJudge
            },
        )
```
Configuration Summary
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.<module>-><ClassName>
```