Reward

Reward functions score the quality of model outputs during RLHF training. Each one takes a list of model-generated trajectories and a list of ground truth trajectories, and returns a list of reward values that guide policy learning.

Basic Interface

class Reward:

    def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]):
        """
        Calculate reward values

        Args:
            trajectories: List of model-generated trajectories
            ground_truths: List of ground truth trajectories

        Returns:
            List of reward values
        """
        ...

MathReward

The math reward function evaluates the correctness of answers to mathematical problems.

from twinkle.reward import MathReward

reward_fn = MathReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

FormatReward

The format reward function checks whether the output conforms to a specified format.

from twinkle.reward import FormatReward

reward_fn = FormatReward()
rewards = reward_fn(trajectories, ground_truths)
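The format that FormatReward checks is not specified above. As a rough illustration of what a format check can look like, the sketch below scores plain completion strings against a regular expression. The `<think>`/`<answer>` layout, the `THINK_ANSWER` pattern, and the `format_rewards` helper are all hypothetical and are not part of the twinkle API.

```python
import re
from typing import List

# Hypothetical target format: reasoning in <think>...</think>, then <answer>...</answer>.
# This pattern is an illustration only, not the format FormatReward actually checks.
THINK_ANSWER = re.compile(r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL)


def format_rewards(completions: List[str], pattern: re.Pattern = THINK_ANSWER) -> List[float]:
    """Return 1.0 for each completion that fully matches the pattern, else 0.0."""
    return [1.0 if pattern.fullmatch(text) else 0.0 for text in completions]
```

Using `fullmatch` rather than `search` rejects completions that carry extra text before or after the expected structure.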

Custom Reward Functions

You can create custom rewards by inheriting from the Reward base class or using functions:

from twinkle.reward import Reward
from twinkle.data_format import Trajectory
from typing import List

class CustomReward(Reward):

    def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]):
        rewards = []
        for traj, gt in zip(trajectories, ground_truths):
            # Custom evaluation logic
            score = self._evaluate(traj, gt)
            rewards.append(score)
        return rewards

    def _evaluate(self, traj, gt):
        # Implement specific evaluation logic
        ...

Or using a function:

def my_reward(trajectories, ground_truths):
    return [1.0 if t == gt else 0.0 for t, gt in zip(trajectories, ground_truths)]

# Use in training
rewards = my_reward(generated, ground_truths)

Usage Scenarios

A typical workflow for reward functions in RLHF training:

from twinkle.sampler import vLLMSampler
from twinkle.reward import MathReward
from twinkle.advantage import GRPOAdvantage

sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = MathReward()
advantage_fn = GRPOAdvantage()

for batch in dataloader:
    # 1. Sample and generate multiple candidate answers
    response = sampler.sample(batch, num_samples=4)

    # 2. Evaluate quality using reward function
    rewards = reward_fn(response.trajectories, batch.ground_truths)

    # 3. Calculate advantages
    advantages = advantage_fn(rewards, num_generations=4)

    # 4. Update policy using advantage values
    ...
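To make step 3 concrete, here is a minimal sketch of the group-relative normalization commonly used in GRPO: rewards for the samples of one prompt are centered on the group mean and scaled by the group standard deviation. It assumes that the `num_generations` samples of each prompt are contiguous in the reward list. The function name, the use of the population standard deviation, and the `eps` term are assumptions, not the actual behavior of GRPOAdvantage.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], num_generations: int,
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption of this sketch
        # eps keeps the division finite when every sample got the same reward
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

When every sample in a group receives the same reward, all advantages in that group are zero, so the group contributes no learning signal.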

Reward design is crucial to how well RLHF works. A good reward function should track the task objective closely and separate good outputs from bad ones clearly enough to give the policy a usable learning signal.

GSM8K Reward

Reward functions specifically designed for evaluating GSM8K math problem solutions.

GSM8KAccuracyReward

Evaluates the correctness of GSM8K answers by extracting boxed or hash-formatted (####) answers and performing numeric/string comparison.

from twinkle.reward import GSM8KAccuracyReward

reward_fn = GSM8KAccuracyReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

The reward function:

Extracts the answer from \boxed{...} or #### ... format in the model’s completion
Extracts the ground truth answer from the reference trajectory
Performs numeric comparison (with tolerance) or exact string matching
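The three steps above can be sketched on plain strings as follows. This is an illustration under stated assumptions, not the GSM8KAccuracyReward implementation: the helper names, the regular expressions, the tolerance value, and the stripping of `,` and `$` are all hypothetical choices.

```python
import re
from typing import Optional

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")   # flat (non-nested) boxed answers only
HASHED = re.compile(r"####\s*([^\n]+)")      # GSM8K-style '#### 42' answer line


def extract_answer(text: str) -> Optional[str]:
    """Return the last boxed answer, else the last '####' answer, else None."""
    for pattern in (BOXED, HASHED):
        matches = pattern.findall(text)
        if matches:
            return matches[-1].strip()
    return None


def answers_match(predicted: Optional[str], reference: Optional[str],
                  tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as numbers, else as exact strings."""
    if predicted is None or reference is None:
        return False
    try:
        # Drop thousands separators and dollar signs before parsing
        p = float(predicted.replace(",", "").replace("$", ""))
        r = float(reference.replace(",", "").replace("$", ""))
        return abs(p - r) <= tol
    except ValueError:
        return predicted.strip() == reference.strip()
```

A reward would then be `1.0 if answers_match(extract_answer(completion), extract_answer(reference)) else 0.0`.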

GSM8KFormatReward

Checks whether the model output contains a properly formatted answer section.

from twinkle.reward import GSM8KFormatReward

reward_fn = GSM8KFormatReward()
rewards = reward_fn(trajectories, ground_truths)
# rewards: List[float], 1.0 if format is valid, 0.0 otherwise

Use GSM8KAccuracyReward and GSM8KFormatReward together as a composite reward for GRPO training on math problem solving tasks.
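One simple way to build such a composite is a weighted sum of the per-trajectory scores. The `combine_rewards` helper below is hypothetical and not part of twinkle. It relies only on the calling convention documented above, in which a reward is a callable that takes trajectories and ground truths and returns a list of floats.

```python
from typing import Callable, List, Sequence

RewardFn = Callable[[list, list], List[float]]


def combine_rewards(reward_fns: Sequence[RewardFn], weights: Sequence[float]) -> RewardFn:
    """Build a reward callable that returns the weighted sum of several rewards."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")

    def combined(trajectories: list, ground_truths: list) -> List[float]:
        # One score list per reward function, each aligned with the trajectories
        per_fn = [fn(trajectories, ground_truths) for fn in reward_fns]
        return [sum(w * s for w, s in zip(weights, scores)) for scores in zip(*per_fn)]

    return combined
```

For example, `combine_rewards([GSM8KAccuracyReward(), GSM8KFormatReward()], [1.0, 0.5])` would weight correctness twice as heavily as formatting. These weights are illustrative only.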

MultiModal Reward

Reward function for evaluating multimodal visual question answering (VQA) tasks.

MultiModalAccuracyReward

Evaluates the correctness of multimodal VQA answers with a fallback to symbolic math verification.

from twinkle.reward import MultiModalAccuracyReward

reward_fn = MultiModalAccuracyReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

The reward function:

Extracts the model’s answer from the completion text
Compares with ground truth using exact string matching
Falls back to math_verify for symbolic expression comparison when string matching fails
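The match-then-fallback flow listed above can be sketched as below. To keep the example dependency-free, the fallback is a small stand-in that compares exact rational values with `fractions.Fraction`; it is not math_verify and covers far fewer cases. The function names and the case-insensitive comparison are assumptions of this sketch.

```python
from fractions import Fraction
from typing import Callable


def numeric_equal(predicted: str, reference: str) -> bool:
    """Stand-in fallback: compare both strings as exact rational numbers."""
    try:
        return Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False


def vqa_reward(predicted: str, reference: str,
               fallback: Callable[[str, str], bool] = numeric_equal) -> float:
    """Try exact string matching first; consult the fallback only on a mismatch."""
    p = predicted.strip().lower()  # case folding is an assumption of this sketch
    r = reference.strip().lower()
    if p == r:
        return 1.0
    return 1.0 if fallback(p, r) else 0.0
```

Passing the fallback as a parameter lets a symbolic checker replace the numeric stand-in without changing the matching logic.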

Designed for visual reasoning tasks such as CLEVR, where answers may be numeric, boolean, or short text.

OlympiadBench Reward

A family of reward functions for evaluating OlympiadBench math and physics competition problems.

OlympiadBenchAccuracyReward

Evaluates answer correctness with support for LaTeX normalization, numeric tolerance, and partial matching.

from twinkle.reward import OlympiadBenchAccuracyReward

reward_fn = OlympiadBenchAccuracyReward()
rewards = reward_fn(generated_trajectories, ground_truth_trajectories)
# rewards: List[float], 1.0 for correct, 0.0 for incorrect

The reward function:

Extracts boxed answers from \boxed{...} with nested brace handling
Normalizes both prediction and ground truth (LaTeX, units, fractions)
Compares via normalized string matching or numeric comparison with tolerance
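A regular expression alone cannot match balanced braces, so nested answers such as `\boxed{\frac{1}{2}}` need a small brace counter. The sketch below shows that extraction step and a deliberately tiny normalizer. Both helpers are hypothetical: the normalizer rewrites only simple fractions and strips `$` and spaces, so unit handling and the rest of the LaTeX normalization are not covered.

```python
import re
from typing import Optional

FRAC = re.compile(r"\\[dt]?frac\{([^{}]+)\}\{([^{}]+)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed answer, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # the marker's own opening brace is already open
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Rewrite simple LaTeX fractions as (a)/(b) and strip '$' and spaces."""
    answer = FRAC.sub(r"(\1)/(\2)", answer)
    return answer.replace("$", "").replace(" ", "")
```

After normalization, two answers can be compared as strings, or parsed as numbers and compared within a tolerance.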

OlympiadBenchFormatReward

Validates the structural format of model outputs.

from twinkle.reward import OlympiadBenchFormatReward

reward_fn = OlympiadBenchFormatReward()
rewards = reward_fn(trajectories, ground_truths)
# rewards: List[float], scores based on format quality

Scoring criteria:

Presence of \boxed{...} answer
Answer positioning (should appear near the end)
Answer uniqueness and consistency
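As a sketch of how these criteria could combine into one score, the function below awards partial credit for each. The weights (0.5, 0.25, 0.25), the `tail_fraction` threshold, and the function name are hypothetical and do not reflect the actual scoring of OlympiadBenchFormatReward. Consistency between multiple boxed answers is not checked here.

```python
import re

BOXED_START = re.compile(r"\\boxed\{")


def format_score(text: str, tail_fraction: float = 0.3) -> float:
    """Score presence, end positioning, and uniqueness of a boxed answer."""
    positions = [m.start() for m in BOXED_START.finditer(text)]
    if not positions:
        return 0.0
    score = 0.5  # a boxed answer is present
    if positions[-1] >= len(text) * (1.0 - tail_fraction):
        score += 0.25  # the last boxed answer starts in the final part of the text
    if len(positions) == 1:
        score += 0.25  # exactly one boxed answer
    return score
```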

OlympiadBenchQualityReward

A composite quality reward combining multiple aspects of response quality.

from twinkle.reward import OlympiadBenchQualityReward

reward_fn = OlympiadBenchQualityReward()
rewards = reward_fn(trajectories, ground_truths)

Quality dimensions:

Reasoning structure: Detection of step-by-step reasoning patterns
Length appropriateness: Smooth penalty curve for responses that are too short or too long
Content uniqueness: Penalizes repetitive content within the response
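Each dimension can be approximated with a simple text statistic. The sketch below is one possible reading, not the OlympiadBenchQualityReward implementation. The marker list, the word-count bounds, the Gaussian-in-log-length penalty, the trigram uniqueness ratio, and the equal weighting are all assumptions made for illustration.

```python
import math
import re

STEP_MARKERS = re.compile(r"\b(first|second|then|next|therefore|thus|step \d+)\b",
                          re.IGNORECASE)


def structure_score(text: str, target_markers: int = 3) -> float:
    """Fraction of the target number of step markers found, capped at 1.0."""
    return min(len(STEP_MARKERS.findall(text)) / target_markers, 1.0)


def length_score(num_words: int, low: int = 50, high: int = 800,
                 softness: float = 0.5) -> float:
    """1.0 inside [low, high]; decays smoothly with log-distance outside it."""
    if num_words <= 0:
        return 0.0
    if num_words < low:
        gap = math.log(low / num_words)
    elif num_words > high:
        gap = math.log(num_words / high)
    else:
        return 1.0
    return math.exp(-(gap / softness) ** 2)


def uniqueness_score(text: str, n: int = 3) -> float:
    """Share of distinct word n-grams; repeated passages lower the score."""
    words = text.lower().split()
    if len(words) < n:
        return 1.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def quality_score(text: str) -> float:
    """Unweighted mean of the three dimensions, in [0, 1]."""
    num_words = len(text.split())
    return (structure_score(text) + length_score(num_words) + uniqueness_score(text)) / 3.0
```

Because the length penalty is continuous at both bounds, a response that is slightly too short or too long loses only a little reward instead of hitting a hard cutoff.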

These rewards can be used individually or combined as a composite reward for GRPO training on olympiad-level math and physics problems.