RLOOAdvantage
RLOO (Reinforcement Learning with Leave-One-Out) advantage function uses leave-one-out method to calculate baselines.
Usage Example
from twinkle.advantage import RLOOAdvantage
advantage_fn = RLOOAdvantage()
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = advantage_fn(rewards, num_generations=4)
# For each sample, the baseline is the mean of all other samples
# First sample in first group: 0.0 - mean([1.0, 0.0, 1.0]) = 0.0 - 0.667 = -0.667
# ...
How It Works
For each sample, RLOO:
- Calculates the mean reward of all other samples in the group (leave-one-out baseline)
- Advantage = sample reward - leave-one-out baseline
- Optionally normalizes the values
RLOO advantages:
- Avoids using the sample’s own information as baseline, reducing bias
- More accurate counterfactual baseline estimation
- Better performance when there are more samples
Training Example
from twinkle.advantage import RLOOAdvantage
from twinkle.model import TransformersModel
from twinkle.sampler import vLLMSampler
from twinkle.reward import MathReward
# Create components
actor = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = MathReward()
advantage_fn = RLOOAdvantage()
dataloader = ...
# Training loop
for batch in dataloader:
# 1. Sample generation (generate more samples to improve RLOO effectiveness)
response = sampler.sample(batch, num_samples=8)
# 2. Calculate rewards
rewards = reward_fn(response.trajectories, batch.ground_truths)
# 3. Calculate advantages
advantages = advantage_fn(rewards, num_generations=8)
# 4. Policy optimization
loss = actor.forward_backward(
inputs=response.inputs,
advantages=advantages
)
actor.clip_grad_and_step()
RLOO is theoretically superior but requires more samples (recommend 8 or more samples per prompt).