trinity.algorithm.advantage_fn package

Module contents#

class trinity.algorithm.advantage_fn.AdvantageFn[source]#

Bases: ABC

abstract classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

classmethod compute_in_trainer() → bool[source]#

Whether the advantage should be computed in the trainer loop.
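The two classmethods above define the contract every advantage function follows. A minimal hypothetical subclass is sketched below; the stand-in base class mirrors only what this page documents (the real one in `trinity.algorithm.advantage_fn` may carry more machinery), and `ConstantBaselineAdvantageFn` with its `baseline` argument is invented for illustration:

```python
from abc import ABC, abstractmethod
from typing import Dict


class AdvantageFn(ABC):
    """Stand-in for the package's AdvantageFn ABC."""

    @classmethod
    @abstractmethod
    def default_args(cls) -> Dict:
        """Default init arguments for the advantage function."""

    @classmethod
    def compute_in_trainer(cls) -> bool:
        """Whether the advantage should be computed in the trainer loop."""
        return True


class ConstantBaselineAdvantageFn(AdvantageFn):
    """Hypothetical subclass: advantage = reward - fixed baseline."""

    def __init__(self, baseline: float = 0.0) -> None:
        self.baseline = baseline

    @classmethod
    def default_args(cls) -> Dict:
        return {"baseline": 0.0}


# default_args() feeds straight into the constructor:
fn = ConstantBaselineAdvantageFn(**ConstantBaselineAdvantageFn.default_args())
```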

class trinity.algorithm.advantage_fn.GroupAdvantage[source]#

Bases: AdvantageFn, ExperienceOperator

For group-based advantage calculation.

abstract group_experiences(exps: List[Experience]) → Dict[str, List[Experience]][source]#

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

abstract calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

process(exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod compute_in_trainer() → bool[source]#

Whether the advantage should be computed in the trainer loop.
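The GroupAdvantage contract (group the batch, compute advantages per group, merge the results in process) can be sketched with plain dicts standing in for Experience objects. The field names `group_id`, `reward`, and `advantage` are assumptions, and the mean-reward baseline is only an illustrative choice:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Toy record standing in for trinity's Experience class.
Experience = dict


def group_experiences(exps: List[Experience]) -> Dict[str, List[Experience]]:
    """Group experiences by their (assumed) group_id field."""
    groups: Dict[str, List[Experience]] = defaultdict(list)
    for exp in exps:
        groups[exp["group_id"]].append(exp)
    return dict(groups)


def calculate_group_advantage(
    group_id: str, exps: List[Experience]
) -> Tuple[List[Experience], Dict]:
    """Illustrative per-group rule: advantage = reward - group mean."""
    baseline = mean(exp["reward"] for exp in exps)
    for exp in exps:
        exp["advantage"] = exp["reward"] - baseline
    return exps, {f"{group_id}/reward_mean": baseline}


def process(exps: List[Experience]) -> Tuple[List[Experience], Dict]:
    """Group, compute per group, then merge experiences and metrics."""
    out: List[Experience] = []
    metrics: Dict = {}
    for gid, group in group_experiences(exps).items():
        modified, group_metrics = calculate_group_advantage(gid, group)
        out.extend(modified)
        metrics.update(group_metrics)
    return out, metrics
```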

class trinity.algorithm.advantage_fn.PPOAdvantageFn(gamma: float = 1.0, lam: float = 1.0)[source]#

Bases: AdvantageFn

__init__(gamma: float = 1.0, lam: float = 1.0) → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.GRPOAdvantageFn(epsilon: float = 1e-06)[source]#

Bases: AdvantageFn

GRPO advantage computation

__init__(epsilon: float = 1e-06) → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group')[source]#

Bases: GroupAdvantage

An advantage class that calculates GRPO advantages.

__init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group') → None[source]#

Initialize the GRPO advantage function.

Parameters:
  • epsilon (float) – A small value to avoid division by zero.

  • std_threshold (Optional[float]) – If provided, groups whose reward standard deviation is equal to or below this threshold will be skipped.

  • duplicate_experiences (bool) – If True, allows duplicate experiences to keep the original experience count. Only used when std_threshold is not None (https://hkunlp.github.io/blog/2025/Polaris).

  • rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).

  • std_cal_level (str) – The scope for calculating the reward standard deviation used for normalization. Either ‘group’ (default; the std is calculated per group) or ‘batch’ (the std is calculated across the entire batch). The mean is always calculated per group. Calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping (https://arxiv.org/pdf/2508.08221v1).
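The core GRPO normalization described by these parameters might look like the following sketch for a single group, using plain floats instead of tensors (not the package's actual implementation):

```python
from statistics import mean, stdev
from typing import List, Optional


def grpo_group_advantages(
    rewards: List[float],
    epsilon: float = 1e-6,
    std_threshold: Optional[float] = None,
    precomputed_std: Optional[float] = None,
) -> Optional[List[float]]:
    """Center rewards by the group mean and normalize by a standard deviation.

    The std is the group's own (std_cal_level='group') unless a batch-level
    value is passed in via precomputed_std (std_cal_level='batch').
    Returns None when the group's std is at or below std_threshold,
    signalling that the group should be skipped.
    """
    mu = mean(rewards)
    group_std = stdev(rewards) if len(rewards) > 1 else 0.0
    if std_threshold is not None and group_std <= std_threshold:
        return None  # near-constant rewards carry no learning signal
    std = precomputed_std if precomputed_std is not None else group_std
    return [(r - mu) / (std + epsilon) for r in rewards]
```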

group_experiences(exps)[source]#

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience], precomputed_std: Tensor | None = None) → Tuple[List[Experience], Dict][source]#

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

  • precomputed_std (Optional[Tensor]) – If provided, used in place of the group-level standard deviation (batch-level calculation).

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

process(exps)[source]#

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.StepWiseGRPOAdvantageFn(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', **kwargs)[source]#

Bases: AdvantageFn, ExperienceOperator

An advantage function that broadcasts advantages from the last step to previous steps. Inspired by rLLM (rllm-org/rllm).

__init__(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', **kwargs) → None[source]#

Initialize the Step-wise GRPO advantage function.

Parameters:
  • epsilon (float) – A small value to avoid division by zero.

  • enable_step_norm (bool) – If True, normalize advantages by trajectory length.

  • std_cal_level (str) – The scope for calculating reward standard deviation. ‘group’ (default): Std is calculated per task group. ‘batch’: Std is calculated across all last-step rewards in the entire batch. The mean is always calculated per task group.
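The broadcast idea can be sketched as follows: each run's last-step reward yields a GRPO-style score, which is then copied to every step of that run, optionally divided by the trajectory length. Run IDs, step counts, and the helper names are invented for illustration:

```python
from statistics import mean, stdev
from typing import Dict, List


def last_step_scores(
    last_rewards: Dict[str, float], epsilon: float = 1e-6
) -> Dict[str, float]:
    """GRPO-style score from each run's last-step reward."""
    vals = list(last_rewards.values())
    mu = mean(vals)
    sd = stdev(vals) if len(vals) > 1 else 0.0
    return {run: (r - mu) / (sd + epsilon) for run, r in last_rewards.items()}


def broadcast(
    run_steps: Dict[str, int],
    scores: Dict[str, float],
    enable_step_norm: bool = False,
) -> Dict[str, List[float]]:
    """Assign each run's score to all of its steps.

    With enable_step_norm=True the score is divided by the trajectory
    length, so longer runs spread the same credit over more steps.
    """
    out: Dict[str, List[float]] = {}
    for run, n_steps in run_steps.items():
        adv = scores[run] / n_steps if enable_step_norm else scores[run]
        out[run] = [adv] * n_steps
    return out
```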

calculate_last_step_advantage(exps: Dict[str, Experience], precomputed_std: Tensor | None = None) → Tuple[Dict[str, float], Dict[str, float]][source]#

Calculate advantages from the last step of each run.

Parameters:

  • exps (Dict[str, Experience]) – One experience per run, keyed by run ID.

  • precomputed_std (Optional[Tensor]) – If provided, used as the standard deviation when normalizing the last-step rewards (batch-level calculation).

Returns:

A tuple containing the scores for each run (Dict[str, float]) and the metrics for logging (Dict[str, float]).

Return type:

Tuple[Dict[str, float], Dict[str, float]]

broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) → Dict[str, List[Experience]][source]#

Broadcast the calculated advantages to all previous steps in each run.

Parameters:
  • run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.

  • scores (Dict[str, float]) – Calculated scores for each run.

Returns:

Updated experiences with advantages broadcasted.

Return type:

Dict[str, List[Experience]]

process(exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod compute_in_trainer() → bool[source]#

Whether the advantage should be computed in the trainer loop.

classmethod default_args() → Dict[source]#

Return the default configuration for this strategy.

class trinity.algorithm.advantage_fn.REINFORCEPLUSPLUSAdvantageFn(gamma: float = 1.0)[source]#

Bases: AdvantageFn

__init__(gamma: float = 1.0) → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict
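The gamma parameter suggests a discounted-return accumulation in the REINFORCE family. A generic sketch of that backward recurrence, not necessarily this class's exact computation:

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Backward accumulation of G_t = r_t + gamma * G_{t+1}.

    With gamma = 1.0 (the default above) this reduces to reward-to-go sums.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```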

class trinity.algorithm.advantage_fn.REMAXAdvantageFn[source]#

Bases: AdvantageFn

__init__() → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.RLOOAdvantageFn[source]#

Bases: AdvantageFn

__init__() → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict
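RLOO's defining trick is the leave-one-out baseline: each sample's advantage is its reward minus the mean reward of the other samples in its group. A sketch under that standard formulation (not necessarily this class's exact tensor implementation):

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: r_i minus the mean of all other rewards.

    Requires at least two samples per group, since the baseline for each
    sample is computed from its n-1 siblings.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```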

class trinity.algorithm.advantage_fn.OPMDAdvantageFn(opmd_baseline: str = 'mean', tau: float = 1.0)[source]#

Bases: AdvantageFn

OPMD advantage computation

__init__(opmd_baseline: str = 'mean', tau: float = 1.0) → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.OPMDGroupAdvantage(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)[source]#

Bases: GroupAdvantage

OPMD Group Advantage computation

__init__(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs) → None[source]#
group_experiences(exps)[source]#

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.REINFORCEGroupAdvantage[source]#

Bases: GroupAdvantage

REINFORCE Group Advantage computation

group_experiences(exps)[source]#

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.ASYMREAdvantageFn(baseline_shift: float = -0.1)[source]#

Bases: AdvantageFn

AsymRE advantage computation

__init__(baseline_shift: float = -0.1) → None[source]#
classmethod default_args() → Dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.RECGroupedAdvantage(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None)[source]#

Bases: GroupAdvantage

An advantage class that calculates REC advantages.

__init__(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None) → None[source]#

Initialize the REC advantage function.

Parameters:
  • epsilon (float) – A small value to avoid division by zero.

  • std_normalize (Optional[bool]) – If True, normalize the advantages by the group-level reward standard deviation.

  • drop (Optional[str]) – Strategy to drop experiences. Options are “balance” or None.

group_experiences(exps)[source]#

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict[source]#
Returns:

The default init arguments for the advantage function.

Return type:

Dict