trinity.algorithm.advantage_fn package

Module contents

class trinity.algorithm.advantage_fn.AdvantageFn

Bases: ABC

abstract classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

classmethod compute_in_trainer() → bool

Whether the advantage should be computed in the trainer loop.
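The interface above can be illustrated with a minimal, self-contained sketch. The `AdvantageFn` defined here is only a stand-in that mirrors the two documented class methods; it is not the real Trinity class. `ScaledAdvantageFn`, its `scale` parameter, and the `True` default returned by `compute_in_trainer` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict


class AdvantageFn(ABC):
    """Stand-in mirroring the documented interface (not the real Trinity class)."""

    @classmethod
    @abstractmethod
    def default_args(cls) -> Dict:
        """Return the default init arguments for the advantage function."""

    @classmethod
    def compute_in_trainer(cls) -> bool:
        # Assumed default for this sketch only.
        return True


class ScaledAdvantageFn(AdvantageFn):
    """Hypothetical subclass, used only to show the pattern."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

    @classmethod
    def default_args(cls) -> Dict:
        # The keys match the __init__ keyword arguments.
        return {"scale": 1.0}


# Build an instance from its own defaults.
fn = ScaledAdvantageFn(**ScaledAdvantageFn.default_args())
```

Keeping `default_args` aligned with the `__init__` signature lets a caller construct the function from its defaults and override individual keys from configuration.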

class trinity.algorithm.advantage_fn.GroupAdvantage

Bases: AdvantageFn, ExperienceOperator

Base class for group-based advantage calculation.

abstract group_experiences(exps: List[Experience]) → Dict[str, List[Experience]]

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

abstract calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

process(exps: List[Experience]) → Tuple[List[Experience], Dict]

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod compute_in_trainer() → bool

Whether the advantage should be computed in the trainer loop.
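One plausible way the three methods fit together is sketched below: `process` groups the experiences, calls `calculate_group_advantage` once per group, and merges the results. The `Experience` dataclass, its `task_id`, `reward`, and `advantage` fields, the grouping key, and the mean-baseline rule are all assumptions for illustration; the real classes may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Experience:
    """Minimal stand-in; field names are assumptions, not the Trinity schema."""
    task_id: str
    reward: float
    advantage: float = 0.0


def group_experiences(exps: List[Experience]) -> Dict[str, List[Experience]]:
    # Assumed criterion: group by the task that produced the experience.
    groups: Dict[str, List[Experience]] = defaultdict(list)
    for exp in exps:
        groups[exp.task_id].append(exp)
    return dict(groups)


def calculate_group_advantage(
    group_id: str, exps: List[Experience]
) -> Tuple[List[Experience], Dict]:
    # Illustrative rule: advantage = reward minus the group mean reward.
    baseline = sum(e.reward for e in exps) / len(exps)
    for e in exps:
        e.advantage = e.reward - baseline
    return exps, {"reward_mean": baseline}


def process(exps: List[Experience]) -> Tuple[List[Experience], Dict]:
    out: List[Experience] = []
    metrics: Dict[str, Dict] = {}
    for group_id, group in group_experiences(exps).items():
        group, group_metrics = calculate_group_advantage(group_id, group)
        out.extend(group)
        metrics[group_id] = group_metrics
    return out, metrics


processed, metrics = process(
    [Experience("a", 1.0), Experience("a", 3.0), Experience("b", 2.0)]
)
```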

class trinity.algorithm.advantage_fn.PPOAdvantageFn(gamma: float = 1.0, lam: float = 1.0)

Bases: AdvantageFn

__init__(gamma: float = 1.0, lam: float = 1.0) → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict
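The `gamma` and `lam` parameters match the discount and smoothing factors of generalized advantage estimation (GAE), which PPO implementations commonly use. The sketch below shows textbook GAE over a single trajectory in plain Python; it assumes per-step `rewards` and `values` lists and a terminal value of zero, and it is not taken from the Trinity source.

```python
from typing import List, Tuple


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Textbook GAE for one trajectory; the value after the last step is 0."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # Value targets: advantage plus the value estimate at the same step.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

With the defaults `gamma = 1.0` and `lam = 1.0`, the estimate reduces to the undiscounted future reward minus the value estimate at each step.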

class trinity.algorithm.advantage_fn.GRPOAdvantageFn(epsilon: float = 1e-06)

Bases: AdvantageFn

GRPO advantage computation

__init__(epsilon: float = 1e-06) → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None)

Bases: GroupAdvantage

An example GroupAdvantage implementation that calculates GRPO advantages.

__init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None) → None

Initialize the GRPO advantage function.

Parameters:
  • epsilon (float) – A small value to avoid division by zero.

  • std_threshold (Optional[float]) – If provided, groups whose reward standard deviation is equal to or below this threshold are skipped.

  • duplicate_experiences (bool) – If True, experiences may be duplicated so that the original experience count is kept. Only used when std_threshold is not None (see https://hkunlp.github.io/blog/2025/Polaris).

  • rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).
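The roles of `epsilon` and `std_threshold` can be sketched with the usual GRPO normalization, in which each reward is centered on the group mean and divided by the group standard deviation. This is a simplified illustration on a plain list of rewards: the use of the population standard deviation and the `None` return value for a skipped group are assumptions, and `duplicate_experiences` and `rank_penalty` are not modeled.

```python
import statistics
from typing import List, Optional


def grpo_advantages(
    rewards: List[float],
    epsilon: float = 1e-6,
    std_threshold: Optional[float] = None,
) -> Optional[List[float]]:
    """Return normalized advantages, or None if the group is skipped."""
    mean = statistics.fmean(rewards)
    # Assumption: population standard deviation; the real code may differ.
    std = statistics.pstdev(rewards)
    if std_threshold is not None and std <= std_threshold:
        # Low-variance group: no useful learning signal, so skip it.
        return None
    # epsilon keeps the division finite when all rewards are equal.
    return [(r - mean) / (std + epsilon) for r in rewards]


skipped = grpo_advantages([1.0, 1.0, 1.0], std_threshold=0.0)
kept = grpo_advantages([0.0, 1.0])
```

Without a threshold, a group with identical rewards still yields advantages of zero, because `epsilon` prevents division by zero.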

group_experiences(exps)

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

process(exps)

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.StepWiseGRPOAdvantageFn(epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs)

Bases: AdvantageFn, ExperienceOperator

An advantage function that broadcasts advantages from the last step to previous steps. Inspired by rLLM (https://github.com/rllm-org/rllm).

__init__(epsilon: float = 1e-06, enable_step_norm: bool = False, **kwargs) → None

calculate_last_step_advantage(exps: Dict[str, Experience]) → Tuple[Dict[str, float], Dict[str, float]]

Calculate group advantage for a given group of experiences.

Parameters:

exps (Dict[str, Experience]) – One experience per run, keyed by run ID.

Returns:

A tuple containing the scores for each run and a dictionary of metrics for logging.

Return type:

Tuple[Dict[str, float], Dict[str, float]]

broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) → Dict[str, List[Experience]]

Broadcast the calculated advantages to all previous steps in each run.

Parameters:
  • run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.

  • scores (Dict[str, float]) – Calculated scores for each run.

Returns:

Updated experiences with the advantages broadcast to every step.

Return type:

Dict[str, List[Experience]]
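The broadcast step can be sketched as follows: every step of a run receives the score computed for that run. The `Experience` dataclass and its `step` and `advantage` fields are stand-ins assumed for illustration, and any effect of `enable_step_norm` is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Experience:
    """Minimal stand-in; field names are assumptions, not the Trinity schema."""
    step: int
    advantage: float = 0.0


def broadcast_advantages(
    run_exps: Dict[str, List[Experience]],
    scores: Dict[str, float],
) -> Dict[str, List[Experience]]:
    for run_id, exps in run_exps.items():
        score = scores[run_id]
        # Every step of the run shares the score of its last step.
        for exp in exps:
            exp.advantage = score
    return run_exps


runs = {
    "run-0": [Experience(0), Experience(1), Experience(2)],
    "run-1": [Experience(0)],
}
updated = broadcast_advantages(runs, {"run-0": 0.8, "run-1": -0.8})
```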

process(exps: List[Experience]) → Tuple[List[Experience], Dict]

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod compute_in_trainer() → bool

Whether the advantage should be computed in the trainer loop.

classmethod default_args() → Dict

Return the default configuration for this strategy.

class trinity.algorithm.advantage_fn.REINFORCEPLUSPLUSAdvantageFn(gamma: float = 1.0)

Bases: AdvantageFn

__init__(gamma: float = 1.0) → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.REMAXAdvantageFn

Bases: AdvantageFn

__init__() → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.RLOOAdvantageFn

Bases: AdvantageFn

__init__() → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict
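RLOO (REINFORCE leave-one-out) is usually described as scoring each sample against the mean reward of the other samples in its group. The sketch below shows that standard formulation on a plain list of rewards; it illustrates the general technique and is not taken from the Trinity implementation.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out baseline: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rewards per group")
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean reward of the other n - 1 samples.
    return [r - (total - r) / (n - 1) for r in rewards]


rloo = rloo_advantages([1.0, 2.0, 3.0])
```

Because the baseline for a sample excludes that sample's own reward, the baseline stays independent of the action being scored.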

class trinity.algorithm.advantage_fn.OPMDAdvantageFn(opmd_baseline: str = 'mean', tau: float = 1.0)

Bases: AdvantageFn

OPMD advantage computation

__init__(opmd_baseline: str = 'mean', tau: float = 1.0) → None

classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict
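The reference does not say how `opmd_baseline` and `tau` are used, so the following is only a hedged sketch of one reading: the advantage is the reward minus a group baseline, where the baseline is either the group mean or a temperature-scaled log-average-exp of the rewards. The option name `"logavgexp"` and the exact way `tau` enters are assumptions, not confirmed by this page.

```python
import math
from typing import List


def opmd_advantages(
    rewards: List[float],
    opmd_baseline: str = "mean",
    tau: float = 1.0,
) -> List[float]:
    if opmd_baseline == "mean":
        baseline = sum(rewards) / len(rewards)
    elif opmd_baseline == "logavgexp":  # assumed option name
        # tau * log(mean(exp(r / tau))), shifted by the max for stability.
        scaled = [r / tau for r in rewards]
        m = max(scaled)
        avg = sum(math.exp(s - m) for s in scaled) / len(scaled)
        baseline = tau * (m + math.log(avg))
    else:
        raise ValueError(f"unknown baseline: {opmd_baseline}")
    return [r - baseline for r in rewards]


mean_adv = opmd_advantages([0.0, 2.0])
lae_adv = opmd_advantages([1.0, 1.0], opmd_baseline="logavgexp", tau=0.5)
```

As `tau` grows, the log-average-exp baseline approaches the plain mean; as `tau` shrinks, it approaches the maximum reward in the group.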

class trinity.algorithm.advantage_fn.OPMDGroupAdvantage(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)

Bases: GroupAdvantage

OPMD Group Advantage computation

__init__(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs) → None

group_experiences(exps)

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]

calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

  • exps (List[Experience]) – List of experiences in the group.

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict