The Task Judger evaluates agent outputs and assigns rewards during training. This page covers built-in judgers for common scenarios and how to create custom judgers for specific evaluation needs.

When to use the task judger

  • Is a task judger necessary for all tasks? No:
    • There are two options for generating the reward (see the sketch after this list):
      • Compute the reward inside the user-defined workflow (WorkflowOutput.reward is not None)
      • Compute the reward outside the user-defined workflow (WorkflowOutput.reward is None)
    • The task judger is how AgentJet handles out-of-workflow reward computation.
    • The task judger is disabled and ignored when the user-defined workflow returns an effective reward, i.e. WorkflowOutput.reward is not None.
    • The task judger is enabled when the user-defined workflow returns WorkflowOutput.reward = None.
  • Use the task judger when:
    • You plan to reuse the reward function in other workflows in the future.
    • You want to decouple rollout logic from reward computation.
    • You want to use the OpenJudge integration to generate Auto Rubrics rewards.
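
The sketch below contrasts the two options. It is a minimal illustration rather than a complete workflow; the final_answer and answer_is_correct values are hypothetical stand-ins for whatever your workflow actually computes.

reward_options.py
from ajet.workflow import WorkflowOutput

final_answer = "\\boxed{42}"   # hypothetical agent output
answer_is_correct = True       # hypothetical check performed inside the workflow

# Option 1: the workflow computes the reward itself; the task judger is disabled.
in_workflow = WorkflowOutput(
    reward=1.0 if answer_is_correct else 0.0,
    metadata={"final_answer": final_answer},
)

# Option 2: the workflow leaves reward=None; the configured task judger computes
# the reward afterwards from the metadata it receives.
via_judger = WorkflowOutput(
    reward=None,
    metadata={"final_answer": final_answer},
)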

Overview

A Task Judger evaluates the agent's execution results and returns two values:

| Return Value | Type  | Description                                                     |
|--------------|-------|-----------------------------------------------------------------|
| raw_reward   | float | Numerical score representing output quality (often 0.0 to 1.0)  |
| is_success   | bool  | Whether the task was successfully completed                     |

These values guide the RL training process, helping agents learn which behaviors produce better outcomes.

Base Interface

All Task Judgers inherit from BaseJudge and implement the compute_reward method:

base_judge.py
from ajet.workflow import WorkflowOutput, WorkflowTask

class BaseJudge:
    def __init__(self, config):
        self.config = config

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        """
        Args:
            workflow_task: Contains the task data, including metadata with reference answers
            workflow_output: Contains the agent's output, including metadata with generated answers

        Returns:
            tuple: (raw_reward: float, is_success: bool)
        """
        raise NotImplementedError

Built-in Task Judgers

AgentJet provides three built-in judgers for common evaluation scenarios:

1. MathAnswerAsJudge

Evaluates mathematical answers by exact string matching, designed for tasks where answers are formatted in LaTeX \boxed{} notation.

When to use

  • Math problem solving tasks
  • Tasks with deterministic, exact answers
  • Answers formatted as \boxed{result}

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.example_math_agent.math_answer_as_judge->MathAnswerAsJudge

How it works:

  1. Extracts the answer from \boxed{...} in the agent's output
  2. Compares it with the reference answer from workflow_task.task.metadata["answer"]
  3. Returns (1.0, True) for a correct answer, (0.0, False) otherwise

Required metadata:

| Field        | Source                      | Description                        |
|--------------|-----------------------------|------------------------------------|
| final_answer | workflow_output.metadata    | Agent's answer in \boxed{} format  |
| answer       | workflow_task.task.metadata | Reference answer                   |
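
For reference, the matching logic amounts to roughly the sketch below. It is an illustrative reimplementation, not the code shipped in tutorial.example_math_agent.math_answer_as_judge; the class name and the regular expression are assumptions.

boxed_answer_judge.py
import re

from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask

class BoxedAnswerJudge(BaseJudge):
    """Exact-match judging on \\boxed{} answers (illustrative sketch)."""

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        final_answer = workflow_output.metadata.get("final_answer", "")
        reference = str(workflow_task.task.metadata.get("answer", "")).strip()

        # Take the content of the last \boxed{...} in the agent's output.
        matches = re.findall(r"\\boxed\{([^}]*)\}", final_answer)
        extracted = matches[-1].strip() if matches else None

        if extracted is not None and extracted == reference:
            return 1.0, True
        return 0.0, False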

2. CountdownAnswerAsJudge

Evaluates mathematical equations with partial credit for proper formatting.

When to use

  • Number puzzle tasks (e.g., Countdown game)
  • Tasks where partial credit is appropriate
  • Tasks where you want to reward proper formatting even when the answer is wrong

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.example_countdown.countdown_answer_as_judge->CountdownAnswerAsJudge

| Score | Condition                                    |
|-------|----------------------------------------------|
| 0.0   | Invalid or missing answer                    |
| 0.1   | Properly formatted equation but wrong result |
| 1.0   | Correct equation and result                  |
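
The partial-credit idea can be sketched as follows. This is an assumption-laden illustration rather than the shipped CountdownAnswerAsJudge: the function name, the use of eval for checking the equation, and the way the target value is obtained are placeholders.

countdown_scoring_sketch.py
def score_countdown(equation: str | None, target: int) -> tuple[float, bool]:
    """Illustrative partial-credit scoring in the spirit of CountdownAnswerAsJudge."""
    if not equation:
        return 0.0, False                          # invalid or missing answer
    try:
        # Evaluate the submitted arithmetic expression with builtins disabled.
        result = eval(equation, {"__builtins__": {}})
    except Exception:
        return 0.0, False                          # not a well-formed equation
    if result == target:
        return 1.0, True                           # correct equation and result
    return 0.1, False                              # well formed, but wrong result

For example, score_countdown("3*4+2", 14) returns (1.0, True), while score_countdown("3*4+2", 15) returns (0.1, False).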

3. EnvServiceJudge

Delegates evaluation to an external environment service, useful for complex interactive environments.

When to use

  • Tasks with external simulators (e.g., AppWorld)
  • Interactive environments with built-in evaluators

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
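
Conceptually, a delegating judger forwards the finished episode to the environment's own evaluator and maps the returned report onto (raw_reward, is_success). The sketch below illustrates only that shape; the EnvEvaluator protocol and the report field names are placeholders, not the real environment-service API.

env_delegation_sketch.py
from typing import Any, Protocol

class EnvEvaluator(Protocol):
    # Placeholder protocol standing in for the external environment service.
    def evaluate(self, episode: dict[str, Any]) -> dict[str, Any]: ...

def delegate_reward(evaluator: EnvEvaluator, episode: dict[str, Any]) -> tuple[float, bool]:
    """Map an external evaluator's report onto (raw_reward, is_success)."""
    report = evaluator.evaluate(episode)
    return float(report.get("score", 0.0)), bool(report.get("success", False))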

Creating Custom Task Judgers

For specialized evaluation needs, create your own judger by inheriting BaseJudge:

Custom Judger Steps
  1. Implement Your Judger: Create a new file with your custom judger class.
  2. Configure Your Judger: Point to your custom class in the YAML configuration.
  3. Pass Data to the Judger: Populate `workflow_output.metadata` with the data your judger needs.

Step 1: Implement Your Judger

tutorial/my_task/my_judge.py
from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask

class MyCustomJudge(BaseJudge):
    def __init__(self, config):
        super().__init__(config)
        self.threshold = 0.8

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        agent_answer = workflow_output.metadata.get("final_answer", "")
        reference_answer = workflow_task.task.metadata.get("answer", "")

        similarity = self._compute_similarity(agent_answer, reference_answer)
        is_success = similarity >= self.threshold
        return similarity, is_success

    def _compute_similarity(self, text1: str, text2: str) -> float:
        # Word-overlap ratio: shared words divided by the longer text's word count.
        return len(set(text1.split()) & set(text2.split())) / max(
            len(text1.split()), len(text2.split()), 1
        )
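
Before wiring the judger into training, you can sanity-check it locally. The snippet below exercises only the similarity helper, since that needs no WorkflowTask or WorkflowOutput objects; an empty config dict is enough here because this judger only stores it.

test_my_judge.py
from tutorial.my_task.my_judge import MyCustomJudge

judge = MyCustomJudge(config={})
# 3 shared words out of a 6-word longer text -> similarity 0.5, below the 0.8 threshold.
print(judge._compute_similarity("the cat sat", "the cat sat on the mat"))  # 0.5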

Step 2: Configure Your Judger

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.my_task.my_judge->MyCustomJudge

Step 3: Pass Data to the Judger

workflow.py
class MyWorkflow(Workflow):
    async def execute(self, task: WorkflowTask, tuner: AjetTuner) -> WorkflowOutput:
        # `msg` is the prompt your workflow builds from `task` (construction omitted here).
        final_answer = await self.agent.reply(msg)
        return WorkflowOutput(
            reward=None,  # Leave as None so the task judger computes the reward
            metadata={
                "final_answer": final_answer,  # Read by the judger via workflow_output.metadata
            }
        )

Configuration Summary

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
 
    judge_protocol: <python.module.path>-><ClassName>  # e.g. tutorial.my_task.my_judge->MyCustomJudge

Next Steps