trinity.algorithm.advantage_fn package
Submodules
- trinity.algorithm.advantage_fn.advantage_fn module
- trinity.algorithm.advantage_fn.asymre_advantage module
- trinity.algorithm.advantage_fn.grpo_advantage module
- trinity.algorithm.advantage_fn.multi_step_grpo_advantage module
- trinity.algorithm.advantage_fn.opmd_advantage module
- trinity.algorithm.advantage_fn.ppo_advantage module
- trinity.algorithm.advantage_fn.rec_advantage module
- trinity.algorithm.advantage_fn.reinforce_advantage module
- trinity.algorithm.advantage_fn.reinforce_plus_plus_advantage module
- trinity.algorithm.advantage_fn.remax_advantage module
- trinity.algorithm.advantage_fn.rloo_advantage module
Module contents
- class trinity.algorithm.advantage_fn.AdvantageFn
Bases: ABC
- class trinity.algorithm.advantage_fn.GroupAdvantage
Bases: AdvantageFn, ExperienceOperator
Base class for group-based advantage calculation.
- abstract group_experiences(exps: List[Experience]) → Dict[str, List[Experience]]
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- abstract calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- process(exps: List[Experience]) → Tuple[List[Experience], Dict]
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
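The contract described above (group the experiences, compute advantages per group, merge the results and metrics) can be sketched as follows. This is an illustrative sketch only: `ToyExperience` is a hypothetical stand-in for `Experience`, its `task_id`, `reward`, and `advantage` fields are assumptions, and the mean-reward baseline is just one possible choice rather than the library's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyExperience:
    # Hypothetical stand-in for Experience; the field names are assumptions.
    task_id: str
    reward: float
    advantage: float = 0.0


def group_experiences(exps: List[ToyExperience]) -> Dict[str, List[ToyExperience]]:
    """Group experiences by task ID (the grouping criterion assumed here)."""
    groups: Dict[str, List[ToyExperience]] = defaultdict(list)
    for exp in exps:
        groups[exp.task_id].append(exp)
    return dict(groups)


def calculate_group_advantage(
    group_id: str, exps: List[ToyExperience]
) -> Tuple[List[ToyExperience], Dict[str, float]]:
    """Assign reward-minus-group-mean as the advantage and report the mean."""
    mean = sum(e.reward for e in exps) / len(exps)
    for e in exps:
        e.advantage = e.reward - mean
    return exps, {"reward_mean": mean}


def process(exps: List[ToyExperience]) -> Tuple[List[ToyExperience], Dict[str, float]]:
    """Run the group step, then the per-group step, and merge the outputs."""
    processed: List[ToyExperience] = []
    metrics: Dict[str, float] = {}
    for group_id, group in group_experiences(exps).items():
        group, group_metrics = calculate_group_advantage(group_id, group)
        processed.extend(group)
        for name, value in group_metrics.items():
            metrics[f"{group_id}/{name}"] = value
    return processed, metrics
```

A subclass would typically only need to supply the two abstract methods; how the real `process` aggregates per-group metrics is not stated in this page, so the key prefixing above is an assumption.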
- class trinity.algorithm.advantage_fn.PPOAdvantageFn(gamma: float = 1.0, lam: float = 1.0)
Bases: AdvantageFn
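This page does not describe how `gamma` and `lam` are used. Assuming they play their usual roles in generalized advantage estimation (GAE), as the discount factor and the trace-decay parameter, a minimal sketch of that estimator looks like the following. The function name `gae` and the per-step `values` input are hypothetical and are not part of this package's API.

```python
from typing import List, Tuple


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Generalized advantage estimation over a single trajectory.

    rewards[t] is the reward at step t and values[t] is the critic's value
    estimate at step t. The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the TD errors from step t onward.
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With the defaults shown in the signature (`gamma=1.0`, `lam=1.0`), this reduces to the undiscounted return from each step minus the value estimate at that step.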
- class trinity.algorithm.advantage_fn.GRPOAdvantageFn(epsilon: float = 1e-06)
Bases: AdvantageFn
GRPO advantage computation
- class trinity.algorithm.advantage_fn.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group')
Bases: GroupAdvantage
An advantage class that calculates GRPO advantages.
- __init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group') → None
Initialize the GRPO advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
std_threshold (Optional[float]) – If provided, groups whose reward standard deviation is less than or equal to this threshold are skipped.
duplicate_experiences (bool) – If True, allows duplicate experiences to keep the original experience count. Only used when std_threshold is not None (https://hkunlp.github.io/blog/2025/Polaris).
rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).
std_cal_level (str) – The scope for calculating the reward standard deviation used for normalization: ‘group’ (default; std is calculated per group) or ‘batch’ (std is calculated across the entire batch). The mean is always calculated per group. Combining a group-level mean with a batch-level standard deviation enables more robust reward shaping (https://arxiv.org/pdf/2508.08221v1).
- group_experiences(exps)
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience], precomputed_std: Tensor | None = None) → Tuple[List[Experience], Dict]
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- process(exps)
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
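The effect of `std_cal_level` on normalization can be illustrated with a small sketch. It assumes the advantage has the form (reward - group mean) / (std + epsilon), which follows from the parameter descriptions above but is not spelled out on this page. The function name `grpo_advantages`, the plain-dict input, and the use of the population standard deviation are all assumptions made for illustration; the library may use a different estimator.

```python
import statistics
from typing import Dict, List


def grpo_advantages(
    rewards_by_group: Dict[str, List[float]],
    epsilon: float = 1e-6,
    std_cal_level: str = "group",
) -> Dict[str, List[float]]:
    """Normalize rewards with a per-group mean and a group- or batch-level std."""
    if std_cal_level not in ("group", "batch"):
        raise ValueError(f"unknown std_cal_level: {std_cal_level!r}")

    batch_std = 0.0
    if std_cal_level == "batch":
        # One std shared by every group, computed over all rewards in the batch.
        all_rewards = [r for rewards in rewards_by_group.values() for r in rewards]
        batch_std = statistics.pstdev(all_rewards)

    advantages: Dict[str, List[float]] = {}
    for group_id, rewards in rewards_by_group.items():
        mean = statistics.fmean(rewards)  # the mean is always per group
        std = batch_std if std_cal_level == "batch" else statistics.pstdev(rewards)
        advantages[group_id] = [(r - mean) / (std + epsilon) for r in rewards]
    return advantages
```

In this sketch a group whose rewards are all identical gets zero advantages under either setting, because the numerator is zero; `epsilon` only guards the division.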
- class trinity.algorithm.advantage_fn.StepWiseGRPOAdvantageFn(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', **kwargs)
Bases: AdvantageFn, ExperienceOperator
An advantage function that broadcasts advantages from the last step to previous steps. Inspired by rLLM (rllm-org/rllm).
- __init__(epsilon: float = 1e-06, enable_step_norm: bool = False, std_cal_level: str = 'group', **kwargs) → None
Initialize the Step-wise GRPO advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
enable_step_norm (bool) – If True, normalize advantages by trajectory length.
std_cal_level (str) – The scope for calculating reward standard deviation. ‘group’ (default): Std is calculated per task group. ‘batch’: Std is calculated across all last-step rewards in the entire batch. The mean is always calculated per task group.
- calculate_last_step_advantage(exps: Dict[str, Experience], precomputed_std: Tensor | None = None) → Tuple[Dict[str, float], Dict[str, float]]
Calculate group advantage for a given group of experiences.
- Parameters:
exps (Dict[str, Experience]) – One experience per run, keyed by run ID.
- Returns:
A tuple of two dictionaries: the score for each run, keyed by run ID, and the metrics for logging.
- Return type:
Tuple[Dict[str, float], Dict[str, float]]
- broadcast_advantages(run_exps: Dict[str, List[Experience]], scores: Dict[str, float]) → Dict[str, List[Experience]]
Broadcast the calculated advantages to all previous steps in each run.
- Parameters:
run_exps (Dict[str, List[Experience]]) – Experiences grouped by run ID.
scores (Dict[str, float]) – Calculated scores for each run.
- Returns:
Updated experiences, with each run's advantage broadcast to all of its steps.
- Return type:
Dict[str, List[Experience]]
- process(exps: List[Experience]) → Tuple[List[Experience], Dict]
Process a list of experiences and return a transformed list.
- Parameters:
exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.
- Returns:
A tuple containing the processed list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
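The two-stage flow described for this class (score each run from its last step, then broadcast that score to every earlier step) can be sketched as below. `ToyStep` is a hypothetical stand-in for `Experience`, and its field names are assumptions. Interpreting `enable_step_norm` as dividing the score by the number of steps in the run is also an assumption drawn from the parameter description, as is the use of the population standard deviation.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyStep:
    # Hypothetical stand-in for one step's Experience; fields are assumptions.
    reward: float
    advantage: float = 0.0


def last_step_scores(
    run_steps: Dict[str, List[ToyStep]], epsilon: float = 1e-6
) -> Dict[str, float]:
    """Normalize the last-step reward of each run within one task group."""
    last_rewards = {run_id: steps[-1].reward for run_id, steps in run_steps.items()}
    values = list(last_rewards.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return {run_id: (r - mean) / (std + epsilon) for run_id, r in last_rewards.items()}


def broadcast_advantages(
    run_steps: Dict[str, List[ToyStep]],
    scores: Dict[str, float],
    enable_step_norm: bool = False,
) -> Dict[str, List[ToyStep]]:
    """Copy each run's score onto every step of that run."""
    for run_id, steps in run_steps.items():
        advantage = scores[run_id]
        if enable_step_norm:
            # Assumed meaning of step normalization: scale by trajectory length.
            advantage = advantage / len(steps)
        for step in steps:
            step.advantage = advantage
    return run_steps
```

The sketch handles a single task group; a full `process` would first group runs by task and, with `std_cal_level='batch'`, pass in a standard deviation computed across all last-step rewards.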
- class trinity.algorithm.advantage_fn.REINFORCEPLUSPLUSAdvantageFn(gamma: float = 1.0)
Bases: AdvantageFn
- class trinity.algorithm.advantage_fn.REMAXAdvantageFn
Bases: AdvantageFn
- class trinity.algorithm.advantage_fn.RLOOAdvantageFn
Bases: AdvantageFn
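This page gives no description for `RLOOAdvantageFn`. Assuming the name refers to the standard REINFORCE leave-one-out estimator, each sample's baseline is the mean reward of the other samples for the same prompt. The helper below is a hypothetical sketch of that estimator on plain floats, not this class's actual code.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for the samples drawn from one prompt."""
    n = len(rewards)
    if n < 2:
        # No other samples to form a baseline from.
        return [0.0] * n
    total = sum(rewards)
    # Baseline for sample i is the mean of all rewards except rewards[i].
    return [r - (total - r) / (n - 1) for r in rewards]
```

Because each baseline excludes the sample it is applied to, the baseline does not depend on that sample's own reward.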
- class trinity.algorithm.advantage_fn.OPMDAdvantageFn(opmd_baseline: str = 'mean', tau: float = 1.0)
Bases: AdvantageFn
OPMD advantage computation
- class trinity.algorithm.advantage_fn.OPMDGroupAdvantage(opmd_baseline: str = 'mean', tau: float = 1.0, **kwargs)
Bases: GroupAdvantage
OPMD Group Advantage computation
- group_experiences(exps)
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- class trinity.algorithm.advantage_fn.REINFORCEGroupAdvantage
Bases: GroupAdvantage
Reinforce Group Advantage computation
- group_experiences(exps)
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]
- class trinity.algorithm.advantage_fn.ASYMREAdvantageFn(baseline_shift: float = -0.1)
Bases: AdvantageFn
AsymRE advantage computation
- class trinity.algorithm.advantage_fn.RECGroupedAdvantage(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None)
Bases: GroupAdvantage
An advantage class that calculates REC advantages.
- __init__(epsilon: float = 1e-06, std_normalize: bool | None = False, drop: str | None = None) → None
Initialize the REC advantage function.
- Parameters:
epsilon (float) – A small value to avoid division by zero.
std_normalize (Optional[bool]) – If True, normalize the advantage by the group-level reward standard deviation.
drop (Optional[str]) – Strategy to drop experiences. Options are “balance” or None.
- group_experiences(exps)
Group experiences by a certain criterion.
- Parameters:
exps (List[Experience]) – List of experiences to be grouped.
- Returns:
A dictionary where keys are group identifiers and values are lists of experiences.
- Return type:
Dict[str, List[Experience]]
- calculate_group_advantage(group_id: str, exps: List[Experience]) → Tuple[List[Experience], Dict]
Calculate advantages for a group of experiences.
- Parameters:
group_id (str) – The identifier for the group of experiences.
exps (List[Experience]) – List of experiences in the group.
- Returns:
A tuple containing the modified list of experiences and a dictionary of metrics.
- Return type:
Tuple[List[Experience], Dict]