The Task Judger evaluates agent outputs and assigns rewards during training. This page covers the built-in judgers for common scenarios and how to create custom judgers for specific evaluation needs.
When to use the task judger
- Is the task judger necessary for all tasks? No.
  - There are two options to generate the reward (see the sketch after this list):
    - Compute the reward inside the user-defined workflow (`WorkflowOutput.reward is not None`)
    - Compute the reward outside the user-defined workflow (`WorkflowOutput.reward is None`)
  - The task judger is how AgentJet handles out-of-workflow reward computation.
  - The task judger is disabled and ignored when the user-defined workflow returns an effective `WorkflowOutput.reward` (i.e., `WorkflowOutput.reward != None`).
  - The task judger is enabled when the user-defined workflow returns `WorkflowOutput.reward = None`.
- When to use the task judger:
  - When you plan to reuse the reward function in multiple other workflows in the future.
  - When you want to decouple rollout and reward-computation logic.
  - When you want to use the OpenJudge integration to generate Auto Rubrics rewards.
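
A minimal sketch of the two options, assuming a workflow shaped like the `MyWorkflow` example later on this page; the helper functions and the equality check here are hypothetical and only illustrate the shape of the returned `WorkflowOutput`.

```python
from ajet.workflow import WorkflowOutput


# Option 1: compute the reward inside the workflow.
# Because reward is not None, the task judger is disabled and ignored.
def finish_with_inline_reward(final_answer: str, reference: str) -> WorkflowOutput:
    reward = 1.0 if final_answer.strip() == reference.strip() else 0.0  # illustrative rule
    return WorkflowOutput(reward=reward, metadata={"final_answer": final_answer})


# Option 2: compute the reward outside the workflow.
# Because reward is None, the configured task judger is enabled and fills it in.
def finish_without_reward(final_answer: str) -> WorkflowOutput:
    return WorkflowOutput(reward=None, metadata={"final_answer": final_answer})
```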
Overview
A Task Judger evaluates the agent's execution results and returns two values:
| Return Value | Type | Description |
|---|---|---|
| `raw_reward` | `float` | Numerical score representing output quality (often 0.0 to 1.0) |
| `is_success` | `bool` | Whether the task was successfully completed |
These values guide the RL training process, helping agents learn which behaviors produce better outcomes.
Base Interface
All Task Judgers inherit from `BaseJudge` and implement the `compute_reward` method:

```python
from ajet.workflow import WorkflowOutput, WorkflowTask


# The interface, as defined in ajet.task_judge.base_judge:
class BaseJudge:
    def __init__(self, config):
        self.config = config

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        """
        Args:
            workflow_task: Contains the task data, including metadata with reference answers
            workflow_output: Contains the agent's output, including metadata with generated answers

        Returns:
            tuple: (raw_reward: float, is_success: bool)
        """
        raise NotImplementedError
```
Built-in Task Judgers
AgentJet provides three built-in judgers for common evaluation scenarios:
1. MathAnswerAsJudge
Evaluates mathematical answers by exact string matching; designed for tasks where answers are formatted in LaTeX `\boxed{}` notation (a sketch of the matching logic follows the metadata table below).

When to use
- Math problem-solving tasks
- Tasks with deterministic, exact answers
- Answers formatted as `\boxed{result}`

How it works
- Extracts the answer from `\boxed{...}` in the agent's output
- Compares it with the reference answer from `workflow_task.task.metadata["answer"]`
- Returns `(1.0, True)` for correct answers, `(0.0, False)` otherwise
Required metadata:

| Field | Source | Description |
|---|---|---|
| `final_answer` | `workflow_output.metadata` | Agent's answer in `\boxed{}` format |
| `answer` | `workflow_task.task.metadata` | Reference answer |
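
A minimal sketch of the kind of matching described above, assuming the answer is pulled out of `\boxed{...}` with a regular expression; the exact extraction and normalization rules in AgentJet may differ.

```python
import re


def boxed_exact_match(model_output: str, reference_answer: str) -> tuple[float, bool]:
    """Illustrative only: extract the \\boxed{...} content and compare by exact string match."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    extracted = match.group(1).strip() if match else ""
    correct = extracted == reference_answer.strip()
    return (1.0, True) if correct else (0.0, False)


# Example usage: prints (1.0, True)
print(boxed_exact_match(r"The final answer is \boxed{42}.", "42"))
```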
2. CountdownAnswerAsJudge
Evaluates mathematical equations with partial credit for proper formatting.
When to use
- Number-puzzle tasks (e.g., the Countdown game)
- Tasks where partial credit is appropriate
- Tasks that should reward proper formatting even when the answer is wrong (see the sketch below)
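
A minimal sketch of partial-credit scoring in this spirit; the specific score values and format rules below are assumptions for illustration, not the exact behavior of CountdownAnswerAsJudge.

```python
import re


def countdown_partial_credit(equation: str, target: int) -> tuple[float, bool]:
    """Illustrative scoring: small credit for a well-formed equation, full credit for a correct one."""
    # Format check (assumed rule): only digits, + - * /, parentheses, and whitespace.
    if not re.fullmatch(r"[\d+\-*/()\s]+", equation):
        return 0.0, False
    try:
        result = eval(equation)  # acceptable in a sketch; never eval untrusted input in production
    except Exception:
        return 0.1, False  # well-formed characters but not evaluable: format credit only (assumed)
    if result == target:
        return 1.0, True   # correct equation
    return 0.1, False      # valid format, wrong result: partial credit (assumed)


# Example usage: prints (1.0, True)
print(countdown_partial_credit("(4 + 5) * 6", 54))
```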
3. EnvServiceJudge
Delegates evaluation to an external environment service, useful for complex interactive environments.
When to use
- Tasks with external simulators (e.g., AppWorld)
- Interactive environments with built-in evaluators
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
```
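
For intuition, here is a purely illustrative sketch of a judger that delegates scoring to an external evaluator over HTTP. The endpoint, request shape, and `task_id` field are hypothetical assumptions, not the actual EnvServiceJudge implementation or an AppWorld API.

```python
import requests  # any HTTP client would do; assumed available

from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask


class MyEnvDelegatingJudge(BaseJudge):
    """Illustrative only: forward reward computation to an external evaluator service."""

    def __init__(self, config):
        super().__init__(config)
        self.evaluate_url = "http://localhost:8000/evaluate"  # hypothetical endpoint

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        # Hypothetical request/response shape: the service scores the episode it observed.
        resp = requests.post(
            self.evaluate_url,
            json={"task_id": workflow_task.task.metadata.get("task_id")},
            timeout=30,
        )
        result = resp.json()
        return float(result["score"]), bool(result["success"])
```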
Creating Custom Task Judgers
For specialized evaluation needs, create your own judger by inheriting from `BaseJudge`:

1. Implement Your Judger: create a new file with your custom judger class.
2. Configure Your Judger: point to your custom class in the YAML configuration.
3. Pass Data to the Judger: populate `workflow_output.metadata` with the data your judger needs.
Step 1: Implement Your Judger
```python
from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask


class MyCustomJudge(BaseJudge):
    def __init__(self, config):
        super().__init__(config)
        self.threshold = 0.8  # minimum similarity required to count as a success

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        agent_answer = workflow_output.metadata.get("final_answer", "")
        reference_answer = workflow_task.task.metadata.get("answer", "")

        similarity = self._compute_similarity(agent_answer, reference_answer)
        is_success = similarity >= self.threshold
        return similarity, is_success

    def _compute_similarity(self, text1: str, text2: str) -> float:
        # Word-overlap ratio: shared words divided by the larger word count.
        return len(set(text1.split()) & set(text2.split())) / max(
            len(text1.split()), len(text2.split()), 1
        )
```
Step 2: Configure Your Judger
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.my_task.my_judge->MyCustomJudge
```
Step 3: Pass Data to the Judger
```python
class MyWorkflow(Workflow):
    async def execute(self, task: WorkflowTask, tuner: AjetTuner) -> WorkflowOutput:
        # ... build the agent's input message `msg` from the task (omitted) ...
        final_answer = await self.agent.reply(msg)

        return WorkflowOutput(
            reward=None,  # will be filled in by the configured judger
            metadata={
                "final_answer": final_answer,  # consumed by MyCustomJudge
            },
        )
```
Configuration Summary
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.<module>-><ClassName>
```