Train judge models using three approaches: SFT for foundation learning, Bradley-Terry for scalar preference scoring, and GRPO for generative evaluation with reasoning.

Terminology: Judge Model vs Reward Model

In OpenJudge, we use judge model to refer to models trained for evaluation; this is the same concept the RLHF literature calls a reward model. Both terms describe models that assess and score AI outputs. We prefer "judge model" because it emphasizes the evaluation role.

Overview

OpenJudge provides training pipelines for building custom judge models. Each method serves different use cases:

| Method | Output Type | Training Data | Interpretable | Best For |
|--------|-------------|---------------|---------------|----------|
| SFT | Generative (text) | Demonstrations | ✅ Yes | Model initialization, response generation |
| Bradley-Terry | Scalar score | Preference pairs | ❌ No | RLHF judge modeling, ranking |
| GRPO | Generative (text) | Labeled responses | ✅ Yes | Interpretable evaluation with reasoning |

Common Requirements:

pip install verl==0.6.1

Datasets

All training datasets are available on HuggingFace:

| Method | Dataset | Link |
|--------|---------|------|
| SFT | HelpSteer2 high-quality responses | 🔗 train_rm/sft |
| Bradley-Terry | HelpSteer2 preference pairs | 🔗 train_rm/bradley_terry |
| GRPO Pointwise | RewardBench2 for scoring | 🔗 train_rm/grpo/pointwise |
| GRPO Pairwise | RewardBench2 for comparison | 🔗 train_rm/grpo/pairwise |
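
Once downloaded, the parquet splits load directly with the datasets library (a sketch; the file path is a placeholder):

from datasets import load_dataset

# Load a local parquet split; replace the path with your downloaded file
ds = load_dataset("parquet", data_files={"train": "train.parquet"})
print(ds["train"][0])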

SFT Training

Supervised Fine-Tuning (SFT) learns from high-quality demonstration data. Use it to initialize models before preference training, or when you have expert-labeled responses.

Training Objective

\[\mathcal{L} = -\sum_{t} \log P(y_t | y_{<t}, x)\]
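
A minimal PyTorch sketch of this objective, assuming prompt and padding tokens are masked with the label -100 (verl's trainer layers batching and sequence parallelism on top of this):

import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Shift so that position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    # ignore_index=-100 drops prompt and padding tokens from the loss
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )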

Quick Start

cd cookbooks/training_judge_model/sft
bash run_sft_rm.sh

Data Format

Parquet files with a messages column (compatible with tokenizer.apply_chat_template):

import pandas as pd

messages = [
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Regular exercise improves cardiovascular health..."}
]

df = pd.DataFrame({"messages": [messages]})
df.to_parquet("train.parquet")
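
To sanity-check that a row is compatible, render it through the tokenizer's chat template. A sketch, using Qwen/Qwen3-14B to match the default MODEL_PATH below:

from transformers import AutoTokenizer

# `messages` as defined in the example above
messages = [
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Regular exercise improves cardiovascular health..."},
]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
print(tokenizer.apply_chat_template(messages, tokenize=False))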

Configuration

Key parameters in run_sft_rm.sh:

| Parameter | Default | Description |
|-----------|---------|-------------|
| MODEL_PATH | ./models/Qwen3-14B | Base model path |
| TRAIN_BATCH_SIZE | 96 | Global batch size |
| MICRO_BATCH_SIZE | 12 | Per-GPU micro-batch size |
| MAX_LENGTH | 4096 | Maximum sequence length |
| SP_SIZE | 8 | Sequence parallel size |
| TOTAL_EPOCHS | 1 | Training epochs |

Data configuration:

data:
  train_batch_size: 96
  micro_batch_size: 12
  max_length: 4096
  truncation: right
  multiturn:
    enable: true
    messages_key: messages

Metrics

| Metric | Description |
|--------|-------------|
| train/loss | Cross-entropy loss |
| val/loss | Validation loss |


Bradley-Terry Training

Bradley-Terry training learns to rank responses by modeling the probability that one response is preferred over another. Use it when you have binary preference data (chosen vs. rejected).

Training Objective

\[\mathcal{L} = -\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})\]
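
A minimal PyTorch sketch of this loss; the actual trainer adds batching and padding details:

import torch
import torch.nn.functional as F

def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen, r_rejected: (batch,) scalar scores from the judge head.
    # Equivalent to -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()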

Quick Start

cd cookbooks/training_judge_model/bradley-terry
bash run_bt_rm.sh

Data Format

Parquet files with chosen and rejected columns (JSON strings of message lists):

import json
import pandas as pd

chosen = json.dumps([
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Regular exercise improves cardiovascular health..."}
])
rejected = json.dumps([
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Exercise is good for you."}
])

df = pd.DataFrame({"chosen": [chosen], "rejected": [rejected]})
df.to_parquet("train.parquet")
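
Note that the columns hold JSON strings rather than nested lists, so decode them with json.loads when reading the data back:

import json
import pandas as pd

# Each cell is a JSON string; decode it back into a message list
df = pd.read_parquet("train.parquet")
chosen_messages = json.loads(df.loc[0, "chosen"])
print(chosen_messages[-1]["content"])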

Configuration

Key parameters in run_bt_rm.sh:

| Parameter | Default | Description |
|-----------|---------|-------------|
| MODEL_PATH | ./models/Qwen3-8B | Base model path |
| TRAIN_BATCH_SIZE | 256 | Global batch size |
| MICRO_BATCH_SIZE | 1 | Per-GPU micro-batch size |
| MAX_LENGTH | 4096 | Maximum sequence length |
| LR | 2e-6 | Learning rate |
| TOTAL_EPOCHS | 3 | Training epochs |
| STRATEGY | fsdp2 | FSDP strategy (fsdp or fsdp2) |

Optimizer configuration:

optim:
  lr: 2e-6
  weight_decay: 0.001
  warmup_steps_ratio: 0.03
  clip_grad: 2.0
  lr_scheduler: cosine

Metrics

| Metric | Description |
|--------|-------------|
| train/loss | Bradley-Terry loss |
| train/accuracy | Preference prediction accuracy |
| val/loss | Validation loss |
| val/accuracy | Validation accuracy |
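
train/accuracy is the fraction of pairs where the chosen response outscores the rejected one. A minimal sketch:

import torch

def preference_accuracy(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Fraction of pairs where the chosen response receives the higher score
    return (r_chosen > r_rejected).float().mean()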


GRPO Training (Reinforcement Learning)

Group Relative Policy Optimization (GRPO) trains generative judges that output structured evaluations with reasoning. No separate critic model is required.

Training Objective

\[\mathcal{L} = -\mathbb{E}\left[\sum_{i=1}^{G} A_i \log \pi_\theta(y_i|x)\right]\]

where \(A_i\) is the group-relative advantage of rollout \(y_i\) among the \(G\) samples drawn for the same prompt \(x\).
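
A common formulation standardizes rewards within each group of rollouts; a sketch (not necessarily verl's exact estimator):

import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (G,) one scalar reward per rollout of the same prompt.
    # Standardizing within the group is what removes the need for a critic.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)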

Training Modes

Pointwise

Rate individual responses on a 0-4 helpfulness scale.

Output Format:

<think>Analysis of response quality...</think>
<score>3</score>

Pairwise

Compare two responses and select the better one.

Output Format:

<think>Comparison of Response A vs B...</think>
<better>A</better>
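
During training, each rollout is scored by a rule-based reward derived from these tags. A hypothetical pointwise reward function (the pairwise case would parse <better> instead):

import re

def pointwise_reward(completion: str, gold_score: int) -> float:
    # Illustrative rule: full reward for an exact score match, none otherwise;
    # malformed completions (missing <score> tags) earn no reward.
    match = re.search(r"<score>\s*([0-4])\s*</score>", completion)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == gold_score else 0.0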

Prerequisites

GRPO requires a Ray cluster:

# Start Ray head node
ray start --head --port=6379 --dashboard-port=8265

# Verify cluster
ray status

Quick Start

cd cookbooks/training_judge_model/grpo

# Pointwise: score individual responses
bash pointwise/run_pointwise.sh

# Pairwise: compare response pairs
bash pairwise/run_pairwise.sh

Configuration

Override defaults with environment variables:

MODEL_PATH=Qwen/Qwen3-32B \
N_GPUS_PER_NODE=8 \
RAY_ADDRESS=http://localhost:8265 \
bash pointwise/run_pointwise.sh

| Parameter | Default | Description |
|-----------|---------|-------------|
| MODEL_PATH | Qwen/Qwen3-8B | Base model |
| RAY_ADDRESS | http://127.0.0.1:8265 | Ray dashboard address |
| N_GPUS_PER_NODE | 8 | GPUs per node |
| TRAIN_BATCH_SIZE | 96 | Global batch size |
| ROLLOUT_N | 4 | Samples per prompt |
| KL_LOSS_COEF | 0.001 | KL divergence coefficient |
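
KL_LOSS_COEF is the coefficient \(\beta\) in the KL-regularized objective, which penalizes drift from the reference policy (the exact KL estimator is implementation-dependent):

\[\mathcal{L}_{\text{total}} = \mathcal{L} + \beta \, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\]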

Metrics

| Metric | Description |
|--------|-------------|
| train/reward_mean | Average reward |
| train/kl_divergence | KL from the reference model |
| train/policy_loss | Policy gradient loss |


Troubleshooting

OOM (Out of Memory)

  • Reduce MICRO_BATCH_SIZE or micro_batch_size_per_gpu
  • Enable enable_gradient_checkpointing
  • Reduce max_length
  • Enable cpu_offload (SFT/BT) or param_offload (GRPO); see the sketch below
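
A sketch of the memory-saving switches; the exact config paths vary by trainer, so treat the nesting below as an assumption and check your run script:

model:
  enable_gradient_checkpointing: true  # recompute activations to save memory
  fsdp_config:
    param_offload: true                # GRPO; SFT/BT expose cpu_offload instead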

Training Instability

  • Lower learning rate
  • Lower the clip_grad threshold to clip gradients more aggressively
  • Check data quality and format

Ray Connection Issues (GRPO)

  • Verify Ray is running: ray status
  • Check RAY_ADDRESS is correct
  • Ensure firewall allows ports 6379 and 8265

Next Steps