trinity.algorithm.advantage_fn package#
Submodules#
- trinity.algorithm.advantage_fn.advantage_fn module
- trinity.algorithm.advantage_fn.asymre_advantage module
- trinity.algorithm.advantage_fn.grpo_advantage module
- trinity.algorithm.advantage_fn.multi_step_grpo_advantage module
- trinity.algorithm.advantage_fn.opmd_advantage module
- trinity.algorithm.advantage_fn.ppo_advantage module
- trinity.algorithm.advantage_fn.rec_advantage module
- trinity.algorithm.advantage_fn.reinforce_advantage module
- trinity.algorithm.advantage_fn.reinforce_plus_plus_advantage module
- trinity.algorithm.advantage_fn.remax_advantage module
- trinity.algorithm.advantage_fn.rloo_advantage module
Module contents#
- class trinity.algorithm.advantage_fn.AdvantageFn[source]#
Bases:
ABC
- class trinity.algorithm.advantage_fn.GroupAdvantage[source]#
Bases:
AdvantageFn, ExperienceOperator
For group-based advantage calculation.
- abstract group_experiences(exps: List[Experience]) → Dict[str, List[Experience]][source]#
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- abstract calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- process(exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
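A concrete subclass only needs to implement the two abstract methods above; process then chains them over a whole batch. Below is a minimal sketch of that contract. The Experience field names used here (eid.task as the group key, a scalar reward, a writable advantage) are illustrative assumptions, not the verified API:

```python
from typing import Dict, List, Tuple

from trinity.algorithm.advantage_fn import GroupAdvantage


class MeanBaselineGroupAdvantage(GroupAdvantage):
    """Illustrative subclass: advantage = reward minus the group mean."""

    def group_experiences(self, exps: List) -> Dict[str, List]:
        groups: Dict[str, List] = {}
        for exp in exps:
            # Grouping by a task identifier is an assumption about Experience.
            groups.setdefault(str(exp.eid.task), []).append(exp)
        return groups

    def calculate_group_advantage(self, group_id: str, exps: List) -> Tuple[List, Dict]:
        mean_reward = sum(exp.reward for exp in exps) / len(exps)
        for exp in exps:
            exp.advantage = exp.reward - mean_reward  # assumed writable field
        return exps, {f"{group_id}/mean_reward": mean_reward}
```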
- class trinity.algorithm.advantage_fn.PPOAdvantageFn(gamma: float = 1.0, lam: float = 1.0)[source]#
Bases:
AdvantageFn
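PPOAdvantageFn exposes the usual gamma/lam knobs of Generalized Advantage Estimation. For reference, the textbook recurrence those parameters control looks like the following sketch (a generic GAE implementation, not Trinity's internal one):

```python
import torch


def gae_advantages(
    rewards: torch.Tensor,  # shape (T,): per-step rewards
    values: torch.Tensor,   # shape (T + 1,): value estimates incl. bootstrap value
    gamma: float = 1.0,
    lam: float = 1.0,
) -> torch.Tensor:
    """Textbook GAE: A_t = sum_k (gamma * lam)^k * delta_{t+k}."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages
```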
- class trinity.algorithm.advantage_fn.GRPOAdvantageFn(epsilon: float = 1e-06)[source]#
Bases:
AdvantageFn
GRPO advantage computation.
- class trinity.algorithm.advantage_fn.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group')[source]#
Bases:
GroupAdvantage
An advantage class that calculates GRPO advantages.
- __init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group') → None[source]#
Initialize the GRPO advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
std_threshold (Optional[float]) – If provided, groups with a reward standard deviation equal to or below this threshold will be skipped.
duplicate_experiences (bool) – If True, allows duplicating experiences to keep the original experience count. Only used when std_threshold is not None (https://hkunlp.github.io/blog/2025/Polaris).
rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).
std_cal_level (str) – The scope for calculating the reward standard deviation for normalization. Can be 'group' (default, std is calculated per group) or 'batch' (std is calculated across the entire batch). The mean is always calculated per group. Calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping (https://arxiv.org/pdf/2508.08221v1). The sketch below contrasts the two scopes.
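The difference between the two std_cal_level scopes is easiest to see in isolation. A minimal sketch with plain tensors standing in for grouped rewards (the function name and dict layout are illustrative):

```python
from typing import Dict

import torch


def grpo_normalize(
    rewards_by_group: Dict[str, torch.Tensor],
    std_cal_level: str = "group",
    epsilon: float = 1e-6,
) -> Dict[str, torch.Tensor]:
    """Mean is always per group; std is per group or across the whole batch."""
    if std_cal_level == "batch":
        batch_std = torch.cat(list(rewards_by_group.values())).std()
    advantages = {}
    for group_id, rewards in rewards_by_group.items():
        # Groups are assumed to hold at least two samples, so std is defined.
        std = batch_std if std_cal_level == "batch" else rewards.std()
        advantages[group_id] = (rewards - rewards.mean()) / (std + epsilon)
    return advantages
```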
- group_experiences(exps)[source]#
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience], precomputed_std: Tensor | None = None) → Tuple[List[Experience], Dict][source]#
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
precomputed_std (Optional[torch.Tensor]) – Precomputed standard deviation for batch-level calculation.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- process(exps)[source]#
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
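A hedged usage sketch: construct the advantage function and run it over one explore step's experiences. The constructor arguments match the signature documented above; exps stands for the Explorer-produced list and is not constructed here:

```python
from trinity.algorithm.advantage_fn import GRPOGroupedAdvantage

adv_fn = GRPOGroupedAdvantage(
    epsilon=1e-6,
    std_threshold=0.0,           # skip groups whose rewards have zero std
    duplicate_experiences=True,  # refill the batch to keep the original count
    std_cal_level="group",
)

# exps: List[Experience] generated by the Explorer in one explore step.
processed_exps, metrics = adv_fn.process(exps)
```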
- class trinity.algorithm.advantage_fn.StepWiseGRPOAdvantageFn(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', std_threshold: float | None = None, **kwargs)[source]#
Bases:
AdvantageFn, ExperienceOperator
An advantage function that broadcasts advantages from the last step to previous steps. Inspired by rLLM (rllm-org/rllm).
- __init__(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', std_threshold: float | None = None, **kwargs) → None[source]#
Initialize the Step-wise GRPO advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
enable_step_norm (bool) – If True, normalize advantages by trajectory length.
std_cal_level (str) – The scope for calculating reward standard deviation. 'group' (default): std is calculated per task group. 'batch': std is calculated across all last-step rewards in the entire batch. The mean is always calculated per task group.
std_threshold (Optional[float]) – If provided, task groups with a reward standard deviation equal to or below this threshold will be skipped.
- calculate_last_step_advantage(exps: Dict[str, Experience], precomputed_std: Tensor | None = None) → Tuple[Dict[str, float], Dict[str, float], bool][source]#
Calculate group advantage for a given group of experiences.
- Parameters:
exps (Dict[str, Experience]) – One experience per run, keyed by run ID.
precomputed_std (Optional[torch.Tensor]) – Precomputed standard deviation for batch-level calculation.
- Returns:
A tuple containing the scores for each run, a dictionary of metrics for logging, and a boolean indicating whether this group should be skipped.
- Return type:
Tuple[Dict[str, float], Dict[str, float], bool]
- broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) → Dict[str, List[Experience]][source]#
Broadcast the calculated advantages to all previous steps in each run.
- Parameters:
run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.
scores (Dict[str, float]) – Calculated scores for each run.
- Returns:
Updated experiences with advantages broadcasted.
- Return type:
Dict[str, List[Experience]]
- process(exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
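The broadcasting step of broadcast_advantages reduces to: take one score per run from its last step, then write that score back onto every step of the run, optionally dividing by trajectory length when enable_step_norm is set. A framework-free sketch of that logic (names are illustrative):

```python
from typing import Dict, List


def broadcast_run_scores(
    run_lengths: Dict[str, int],   # number of steps in each run
    scores: Dict[str, float],      # last-step advantage per run
    enable_step_norm: bool = False,
) -> Dict[str, List[float]]:
    """Assign each run's last-step score to all of its steps."""
    broadcasted = {}
    for run_id, n_steps in run_lengths.items():
        score = scores[run_id]
        if enable_step_norm:
            score /= n_steps  # normalize by trajectory length
        broadcasted[run_id] = [score] * n_steps
    return broadcasted
```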
- class trinity.algorithm.advantage_fn.REINFORCEPLUSPLUSAdvantageFn(gamma: float = 1.0)[source]#
Bases:
AdvantageFn
- class trinity.algorithm.advantage_fn.REMAXAdvantageFn[source]#
Bases:
AdvantageFn
- class trinity.algorithm.advantage_fn.RLOOAdvantageFn[source]#
Bases:
AdvantageFn
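RLOOAdvantageFn takes no constructor arguments because the leave-one-out baseline is parameter-free: each sample is baselined against the mean reward of the other samples in its group. A generic sketch of that computation (not Trinity's internal code):

```python
import torch


def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """A_i = r_i - mean(r_j for j != i), over a group of n >= 2 rewards."""
    n = rewards.numel()
    leave_one_out_mean = (rewards.sum() - rewards) / (n - 1)
    return rewards - leave_one_out_mean
```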
- class trinity.algorithm.advantage_fn.OPMDAdvantageFn(opmd_baseline: str = 'mean', tau: float = 1.0)[source]#
Bases:
AdvantageFn
OPMD advantage computation.
- class trinity.algorithm.advantage_fn.OPMDGroupAdvantage(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)[source]#
Bases:
GroupAdvantage
OPMD group advantage computation.
- group_experiences(exps)[source]#
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
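How opmd_baseline and tau interact is not spelled out on this page; one plausible reading, sketched below, subtracts either the plain group mean or a tau-smoothed log-average-exp baseline from each reward. This is an assumption about the semantics, not a confirmed implementation:

```python
import math

import torch


def opmd_group_advantages(
    rewards: torch.Tensor,
    opmd_baseline: str = "mean",
    tau: float = 1.0,
) -> torch.Tensor:
    """Subtract a group-level baseline from each reward (assumed semantics)."""
    if opmd_baseline == "mean":
        baseline = rewards.mean()
    else:
        # Assumed alternative: tau-smoothed log-average-exp of group rewards.
        baseline = tau * (torch.logsumexp(rewards / tau, dim=0) - math.log(rewards.numel()))
    return rewards - baseline
```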
- class trinity.algorithm.advantage_fn.REINFORCEGroupAdvantage[source]#
Bases:
GroupAdvantage
REINFORCE group advantage computation.
- group_experiences(exps)[source]#
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) β List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- class trinity.algorithm.advantage_fn.ASYMREAdvantageFn(baseline_shift: float = -0.1)[source]#
Bases:
AdvantageFn
AsymRE advantage computation.
- class trinity.algorithm.advantage_fn.RECGroupedAdvantage(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None)[source]#
Bases:
GroupAdvantage
An advantage class that calculates REC advantages.
- __init__(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None) → None[source]#
Initialize the REC advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
std_normalize (Optional[bool]) – If True, normalize the advantage with the group-level reward standard deviation.
drop (Optional[str]) – Strategy for dropping experiences. Options are 'balance' or None.
- group_experiences(exps)[source]#
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict][source]#
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]