trinity.algorithm.advantage_fn.grpo_advantage module

GRPO advantage computation

class trinity.algorithm.advantage_fn.grpo_advantage.GRPOAdvantageFn(epsilon: float = 1e-06)

Bases: AdvantageFn

GRPO advantage computation

__init__(epsilon: float = 1e-06) → None

classmethod default_args() → Dict

Returns: The default init arguments for the advantage function.
Return type: Dict

class trinity.algorithm.advantage_fn.grpo_advantage.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None)

Bases: GroupAdvantage

An example group advantage strategy that calculates GRPO advantages.

__init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None) → None

Initialize the GRPO advantage function.

Parameters:

epsilon (float) – A small value to avoid division by zero.
std_threshold (Optional[float]) – If provided, groups whose reward standard deviation is equal to or below this threshold are skipped.
duplicate_experiences (bool) – If True, experiences may be duplicated so that the original experience count is kept after groups are skipped. Only used when std_threshold is not None (https://hkunlp.github.io/blog/2025/Polaris).
rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).
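
The roles of epsilon and std_threshold can be illustrated with a minimal sketch of group-normalized advantages. This is not the library's implementation: the function name grpo_advantages is hypothetical, the population standard deviation is an assumption (the class may use a different estimator), and duplicate_experiences and rank_penalty are left out.

```python
from statistics import fmean, pstdev
from typing import List, Optional


def grpo_advantages(
    rewards: List[float],
    epsilon: float = 1e-6,
    std_threshold: Optional[float] = None,
) -> Optional[List[float]]:
    """Return group-normalized advantages, or None if the group is skipped."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population standard deviation
    # Skip groups whose reward spread is at or below the threshold.
    if std_threshold is not None and std <= std_threshold:
        return None
    # epsilon keeps the division finite when all rewards are equal.
    return [(r - mean) / (std + epsilon) for r in rewards]
```

With std_threshold left as None, a group of identical rewards is kept and every advantage is zero; with std_threshold=0.0 the same group is dropped.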

group_experiences(exps)

Group experiences by a certain criterion.

Parameters: exps (List[Experience]) – List of experiences to be grouped.
Returns: A dictionary where keys are group identifiers and values are lists of experiences.
Return type: Dict[str, List[Experience]]
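
The return shape described above can be sketched as follows. The grouping criterion is not stated here, so this example assumes each experience carries a string key; the Exp dataclass and its group_id field are hypothetical stand-ins for Experience, not its real attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Exp:
    """Hypothetical stand-in for Experience."""
    group_id: str  # assumed grouping key, e.g. one ID per prompt
    reward: float


def group_by_id(exps: List[Exp]) -> Dict[str, List[Exp]]:
    """Bucket experiences by their group key, preserving input order."""
    groups: Dict[str, List[Exp]] = defaultdict(list)
    for exp in exps:
        groups[exp.group_id].append(exp)
    return dict(groups)
```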

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]

Calculate advantages for a group of experiences.

Parameters:

group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

process(exps)

Process a list of experiences and return a transformed list.

Parameters: exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
Returns: A tuple containing the processed list of experiences and a dictionary of metrics.
Return type: Tuple[List[Experience], Dict]
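
One plausible way the pieces fit together is: group the experiences, compute advantages per group, then merge the results and metrics. The sketch below illustrates that flow under stated assumptions and is not the library's code: Exp, its fields, process_sketch, and the metric names are all hypothetical, and the population standard deviation is assumed.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Exp:
    """Hypothetical stand-in for Experience."""
    group_id: str
    reward: float
    advantage: float = 0.0


def process_sketch(
    exps: List[Exp], epsilon: float = 1e-6
) -> Tuple[List[Exp], Dict[str, float]]:
    # Step 1: bucket experiences by group.
    groups: Dict[str, List[Exp]] = defaultdict(list)
    for exp in exps:
        groups[exp.group_id].append(exp)

    # Step 2: normalize rewards within each group.
    processed: List[Exp] = []
    group_stds: List[float] = []
    for group in groups.values():
        rewards = [e.reward for e in group]
        mean, std = fmean(rewards), pstdev(rewards)
        for e in group:
            e.advantage = (e.reward - mean) / (std + epsilon)
        processed.extend(group)
        group_stds.append(std)

    # Step 3: aggregate metrics across groups (names are illustrative).
    metrics = {
        "group_count": float(len(groups)),
        "reward_std_mean": fmean(group_stds) if group_stds else 0.0,
    }
    return processed, metrics
```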

classmethod default_args() → dict

Returns: The default init arguments for the advantage function.
Return type: Dict