GRPOAdvantage

GRPOAdvantage implements the GRPO (Group Relative Policy Optimization) advantage function: each sample's advantage is its reward minus the mean reward of the group it belongs to.

Usage Example

from twinkle.advantage import GRPOAdvantage

advantage_fn = GRPOAdvantage()

# Assume 2 prompts, each generating 4 samples
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # 8 reward values
advantages = advantage_fn(rewards, num_generations=4, scale='group')

# Each advantage is the sample's reward minus its group's mean (0.5 for group 1, 0.25 for group 2):
# Group 1: [0.0-0.5, 1.0-0.5, 0.0-0.5, 1.0-0.5] = [-0.5, 0.5, -0.5, 0.5]
# Group 2: [1.0-0.25, 0.0-0.25, 0.0-0.25, 0.0-0.25] = [0.75, -0.25, -0.25, -0.25]

How It Works

GRPO groups samples (each group corresponds to multiple generations from one prompt), then within each group:

Calculate the group mean reward
Advantage for each sample = reward - group mean
Optionally normalize the advantage values
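
The steps above can be sketched in a few lines of NumPy. This is an illustrative re-implementation and not twinkle's actual code: the function name `group_relative_advantages` and the `normalize` and `eps` parameters are hypothetical, and dividing by the group standard deviation is only one common way to normalize, assumed here for the sake of the example.

```python
import numpy as np


def group_relative_advantages(rewards, num_generations, normalize=False, eps=1e-8):
    """Hypothetical sketch: reward minus group mean, optionally divided by group std."""
    r = np.asarray(rewards, dtype=np.float64)
    if r.size % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    # One row per prompt, one column per generation from that prompt.
    groups = r.reshape(-1, num_generations)

    # Steps 1 and 2: subtract each group's mean reward from its samples.
    adv = groups - groups.mean(axis=1, keepdims=True)

    # Step 3 (assumed form): scale by the group's standard deviation.
    if normalize:
        adv = adv / (groups.std(axis=1, keepdims=True) + eps)

    # Flatten back to the original sample order.
    return adv.reshape(-1)


rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards, num_generations=4)
print(advantages.tolist())
```

With normalization off, the result matches the hand-computed values in the usage example: the first group is centered on 0.5 and the second on 0.25.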

This method:

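Reduces the variance of the policy gradient estimate, because the group mean acts as a baseline, which improves training stability
Compares samples only within their own group, which matches the relative nature of human preferences
Removes the effect of a constant reward offset, since only differences from the group mean matter; the effect of reward scale is removed only when the advantages are also normalized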
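
How group-relative advantages respond to the reward's offset and scale can be checked numerically. The sketch below is self-contained and does not call twinkle: the helpers `centered` and `standardized` are hypothetical names, and standardizing by the group standard deviation is an assumed form of normalization, not necessarily what the library does.

```python
import numpy as np


def centered(rewards, num_generations):
    """Hypothetical helper: reward minus group mean."""
    groups = np.asarray(rewards, dtype=np.float64).reshape(-1, num_generations)
    return (groups - groups.mean(axis=1, keepdims=True)).reshape(-1)


def standardized(rewards, num_generations, eps=1e-8):
    """Hypothetical helper: mean-centered advantages divided by group std (assumed normalization)."""
    groups = np.asarray(rewards, dtype=np.float64).reshape(-1, num_generations)
    diff = groups - groups.mean(axis=1, keepdims=True)
    return (diff / (groups.std(axis=1, keepdims=True) + eps)).reshape(-1)


base = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
shifted = [r + 10.0 for r in base]  # same rewards with a constant offset
scaled = [r * 5.0 for r in base]    # same rewards on a larger scale

# Subtracting the group mean cancels a constant offset.
offset_invariant = np.allclose(centered(base, 4), centered(shifted, 4))

# Mean subtraction alone does not cancel a change of scale.
scale_sensitive = not np.allclose(centered(base, 4), centered(scaled, 4))

# Dividing by the group std cancels the scale as well (up to eps).
scale_invariant = np.allclose(standardized(base, 4), standardized(scaled, 4), atol=1e-6)

print(offset_invariant, scale_sensitive, scale_invariant)
```

Under these assumptions, mean subtraction alone makes the advantages insensitive to shifting every reward by a constant, while insensitivity to reward scale additionally requires normalization.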

Complete Training Example

Using the advantage function in GRPO training:

from twinkle.advantage import GRPOAdvantage
from twinkle.model import TransformersModel
from twinkle.sampler import vLLMSampler

# Create components
actor = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')
reward_fn = ...
advantage_fn = GRPOAdvantage()

# Training loop
for batch in dataloader:
    # Sample generation
    sample_response = sampler.sample(batch, num_samples=4)
    input_data = [seq.new_input_feature for response in sample_response for seq in response.sequences]
    ...
    rewards = reward_fn(...)

    # Calculate advantages (num_generations must match num_samples used for sampling)
    advantages = advantage_fn(rewards, num_generations=4)

    # Policy optimization
    loss = actor.forward_backward(
        inputs=input_data,
        advantages=advantages
    )
    actor.clip_grad_and_step()

GRPO needs no separate value (critic) model, because the group mean serves as the baseline. This makes it simple and efficient, and suitable for most RLHF training scenarios.
