trinity.algorithm.policy_loss_fn package#
Submodules#
- trinity.algorithm.policy_loss_fn.chord_policy_loss module
- trinity.algorithm.policy_loss_fn.cispo_policy_loss module
- trinity.algorithm.policy_loss_fn.dpo_loss module
- trinity.algorithm.policy_loss_fn.gspo_policy_loss module
- trinity.algorithm.policy_loss_fn.mix_policy_loss module
- trinity.algorithm.policy_loss_fn.opmd_policy_loss module
- trinity.algorithm.policy_loss_fn.policy_loss_fn module
- trinity.algorithm.policy_loss_fn.ppo_policy_loss module
- trinity.algorithm.policy_loss_fn.rec_policy_loss module
- trinity.algorithm.policy_loss_fn.sft_loss module
- trinity.algorithm.policy_loss_fn.sppo_loss_fn module
- trinity.algorithm.policy_loss_fn.topr_policy_loss module
Module contents#
- class trinity.algorithm.policy_loss_fn.PolicyLossFn(backend: str = 'verl')[source]#
Bases: ABC
Abstract base class for policy loss functions.
This class provides the interface for implementing different policy gradient loss functions while handling parameter name mapping between different training frameworks. (A usage sketch follows this class entry.)
- __init__(backend: str = 'verl')[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- abstract classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
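A minimal usage sketch of the shared interface, relying only on the members documented above (the constructor's backend argument, the default_args() classmethod, and the select_keys property) and using OPMDPolicyLossFn as a representative concrete subclass. The printed values depend on the installed version; this is a sketch, not a normative example:

```python
from trinity.algorithm.policy_loss_fn import OPMDPolicyLossFn

# Defaults a concrete loss function declares for its own initialization.
print(OPMDPolicyLossFn.default_args())  # exact contents depend on the class/version

# Construct a loss function for a given training backend and inspect which
# batch keys it will request, mapped to that backend's naming convention.
loss_fn = OPMDPolicyLossFn(backend="verl", tau=1.0)
print(loss_fn.select_keys)
```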
- class trinity.algorithm.policy_loss_fn.PPOPolicyLossFn(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, clip_ratio_c: float = 3.0, loss_agg_mode: str | None = 'token-mean', enable_sequence_masking: bool = False, delta_sequence_masking: float = 0.1)[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, clip_ratio_c: float = 3.0, loss_agg_mode: str | None = 'token-mean', enable_sequence_masking: bool = False, delta_sequence_masking: float = 0.1) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
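The clip_range* arguments configure a PPO-style clipped surrogate objective. As a hedged illustration of what those bounds mean (the textbook PPO-clip formulation, not necessarily this class's exact implementation, which additionally supports dual clipping via clip_ratio_c and optional sequence masking):

```python
import torch

def ppo_clip_loss(logprob, old_logprob, advantages, clip_low=0.2, clip_high=0.2):
    """Textbook PPO clipped surrogate, token-level (illustrative sketch)."""
    ratio = torch.exp(logprob - old_logprob)                       # importance sampling ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -torch.min(unclipped, clipped)                          # negated: we minimize the loss
```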
- class trinity.algorithm.policy_loss_fn.OPMDPolicyLossFn(backend: str = 'verl', tau: float = 1.0, loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', tau: float = 1.0, loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.DPOLossFn(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0)[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
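The beta and label_smoothing arguments correspond to the standard DPO objective and its conservative (label-smoothed) variant. A hedged sketch of that standard formulation, not taken from this class's source:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    """Standard DPO loss with optional label smoothing (illustrative sketch)."""
    # Log-ratio margin of the policy over the reference model (chosen vs. rejected).
    logits = (policy_chosen_logps - policy_rejected_logps) - (ref_chosen_logps - ref_rejected_logps)
    # label_smoothing > 0 yields the conservative-DPO variant that tolerates preference noise.
    return (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
            - F.logsigmoid(-beta * logits) * label_smoothing)
```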
- class trinity.algorithm.policy_loss_fn.SFTLossFn(backend: str = 'verl', loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args()[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.MIXPolicyLossFn(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None)[source]#
Bases: PolicyLossFn
Implements a mixed policy loss combining GRPO and SFT losses.
This loss function applies different loss components to each sample depending on whether it comes from an expert, as indicated by expert_mask. It combines:
- GRPO loss (self.grpo_loss_fn) for non-expert data
- SFT loss (self.sft_loss_fn) for expert data
- a weighting parameter mu that balances the two components
The per-sample weights are normalized using either experience_per_gpu or gradient_accumulation, depending on whether dynamic batch sizing is enabled, so that weighting stays consistent across batches containing the same type of experiences. (A conceptual sketch follows this class entry.)
- __init__(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
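A conceptual sketch of the combination described above, under simplifying assumptions (a plain mu-weighted sum with token-mean aggregation; no per-sample weight normalization, dynamic batch sizing, or gradient-accumulation handling). grpo_token_loss, sft_token_loss, and expert_mask are placeholders for quantities the class computes internally; this is not the class's actual implementation:

```python
import torch

def mixed_loss(grpo_token_loss, sft_token_loss, expert_mask, mu=0.1):
    """Conceptual sketch: GRPO loss on non-expert tokens, SFT loss on expert tokens, weighted by mu."""
    # expert_mask is a float tensor with 1.0 for tokens from expert data and 0.0 otherwise.
    grpo_part = (grpo_token_loss * (1 - expert_mask)).sum() / (1 - expert_mask).sum().clamp(min=1)
    sft_part = (sft_token_loss * expert_mask).sum() / expert_mask.sum().clamp(min=1)
    # One plausible weighting; the actual balance between the two terms is governed by mu.
    return (1 - mu) * grpo_part + mu * sft_part
```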
- class trinity.algorithm.policy_loss_fn.GSPOLossFn(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, loss_agg_mode: str | None = 'seq-mean-token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, loss_agg_mode: str | None = 'seq-mean-token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.TOPRPolicyLossFn(backend: str = 'verl', advantage_threshold: float = 0.0, loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', advantage_threshold: float = 0.0, loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.CISPOPolicyLossFn(backend: str = 'verl', clip_range_low: float = 1.0, clip_range_high: float = 0.28, enable_mask_clip: bool = False, mask_clip_range_low: float = 1.0, mask_clip_range_high: float = 0.28, loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', clip_range_low: float = 1.0, clip_range_high: float = 0.28, enable_mask_clip: bool = False, mask_clip_range_low: float = 1.0, mask_clip_range_high: float = 0.28, loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
- In the original paper:
no lower bound is imposed on the IS weight (clip_range_low is set to a high value); only clip_range_high is tuned. (See the sketch after this class entry.)
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
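A sketch of the note above, assuming the IS weight is clipped to [1 - clip_range_low, 1 + clip_range_high] (a PPO-style convention adopted here for illustration, not stated on this page): with the default clip_range_low = 1.0 the lower bound is 0, so the nonnegative ratio is effectively unbounded from below and only the upper bound 1 + 0.28 is active.

```python
import torch

def clip_is_weight(ratio, clip_range_low=1.0, clip_range_high=0.28):
    """Illustrative sketch: clamp the importance-sampling weight (assumed PPO-style bounds)."""
    # With clip_range_low = 1.0 the lower bound is 1 - 1.0 = 0, i.e. no effective lower
    # bound on a nonnegative ratio; only the upper bound 1 + clip_range_high bites.
    return torch.clamp(ratio, 1.0 - clip_range_low, 1.0 + clip_range_high)
```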
- class trinity.algorithm.policy_loss_fn.MIXCHORDPolicyLossFn(backend: str = 'verl', mu_warmup_steps: int = 0, mu_decay_steps: int = 0, mu_peak: float = 0.1, mu_valley: float = 0.1, enable_phi_function: bool = True, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None)[source]#
Bases: PolicyLossFn
Implements a mixed policy loss combining GRPO and SFT losses.
This loss function applies different loss components to each sample depending on whether it comes from an expert, as indicated by expert_mask. It combines:
- GRPO loss (self.grpo_loss_fn) for non-expert data
- SFT loss (self.sft_loss_fn) for expert data
The weight of the SFT loss is globally controlled by the mu_schedule function, while token-wise weights are computed using different SFT loss formulations. (A sketch of one plausible mu schedule follows this class entry.)
The per-sample weights are normalized using either experience_per_gpu or gradient_accumulation, depending on whether dynamic batch sizing is enabled, so that weighting stays consistent across batches containing the same type of experiences.
- __init__(backend: str = 'verl', mu_warmup_steps: int = 0, mu_decay_steps: int = 0, mu_peak: float = 0.1, mu_valley: float = 0.1, enable_phi_function: bool = True, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Schedule parameters: mu_warmup_steps (int), mu_decay_steps (int), mu_peak (float), mu_valley (float).
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
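One plausible shape for the mu schedule implied by these parameters, assumed for illustration only (linear warm-up from mu_valley to mu_peak over mu_warmup_steps, then linear decay back to mu_valley over mu_decay_steps, constant afterwards); the mu_schedule actually used by the class may differ:

```python
def mu_schedule(step, mu_warmup_steps=0, mu_decay_steps=0, mu_peak=0.1, mu_valley=0.1):
    """Hypothetical warm-up/decay schedule for the global SFT weight mu (illustration only)."""
    if mu_warmup_steps > 0 and step < mu_warmup_steps:
        # Linear warm-up from mu_valley to mu_peak.
        return mu_valley + (mu_peak - mu_valley) * step / mu_warmup_steps
    if mu_decay_steps > 0 and step < mu_warmup_steps + mu_decay_steps:
        # Linear decay from mu_peak back to mu_valley.
        progress = (step - mu_warmup_steps) / mu_decay_steps
        return mu_peak - (mu_peak - mu_valley) * progress
    return mu_valley
```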
- class trinity.algorithm.policy_loss_fn.SFTISLossFn(backend: str = 'verl', loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
SFT loss with importance sampling.
- __init__(backend: str = 'verl', loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args()[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.SFTPhiLossFn(backend: str = 'verl', loss_agg_mode: str = 'token-mean', cutoff_prob: float = 1.0)[source]#
Bases: PolicyLossFn
SFT loss with a transformed phi function.
- __init__(backend: str = 'verl', loss_agg_mode: str = 'token-mean', cutoff_prob: float = 1.0) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args()[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.sPPOPolicyLossFn(backend: str = 'verl', epsilon: float = 0.3, loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', epsilon: float = 0.3, loss_agg_mode: str = 'token-mean') None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.RECPolicyLossFn(backend: str = 'verl', epsilon_low: float = 0.2, epsilon_high: float = 0.2, epsilon_low_prime: float = 0.4, epsilon_high_prime: float = 0.4, clip_mode: str = 'none', weight: str = 'none', regularizer: str = 'none', regularizer_coef: float = 0.0, temp: float = 1.0)[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', epsilon_low: float = 0.2, epsilon_high: float = 0.2, epsilon_low_prime: float = 0.4, epsilon_high_prime: float = 0.4, clip_mode: str = 'none', weight: str = 'none', regularizer: str = 'none', regularizer_coef: float = 0.0, temp: float = 1.0) None[source]#
Initialize the policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
- classmethod default_args() Dict[source]#
Get default initialization arguments for this loss function.
- Returns:
The default init arguments for the policy loss function.
- Return type:
Dict
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
- class trinity.algorithm.policy_loss_fn.SAPOPolicyLossFn(backend: str = 'verl', tau_pos: float = 1.0, tau_neg: float = 1.05, loss_agg_mode: str = 'token-mean')[source]#
Bases: PolicyLossFn
- __init__(backend: str = 'verl', tau_pos: float = 1.0, tau_neg: float = 1.05, loss_agg_mode: str = 'token-mean') None[source]#
Initialize SAPO policy loss function.
- Parameters:
backend – The training framework/backend to use (e.g., “verl”)
tau_pos – Temperature for positive advantages (τ_pos), default 1.0
tau_neg – Temperature for negative advantages (τ_neg), default 1.05, should be >= tau_pos
loss_agg_mode – Mode for aggregating loss across tokens
- soft_gate_function(ratio: Tensor, advantages: Tensor) Tensor[source]#
Compute the soft gate function f_{i,t}(x).
- The soft gate function is defined as:
f_{i,t}(x) = σ(τ_{i,t} * (x - 1)) * 4 / τ_{i,t}
- where:
σ is the sigmoid function
τ_{i,t} is the asymmetric temperature (tau_pos or tau_neg)
x is the importance sampling ratio r_{i,t}(θ)
- Parameters:
ratio – Token-level importance sampling ratio r_{i,t}(θ)
advantages – Normalized advantage function Â_i (same for all tokens in a sequence)
- Returns:
The soft gate values for each token
- classmethod default_args() Dict[source]#
Get default initialization arguments for SAPO.
- Default configuration (from the SAPO paper):
tau_pos: 1.0 (temperature for positive advantages)
tau_neg: 1.05 (temperature for negative advantages)
loss_agg_mode: “token-mean” (average over tokens)
The asymmetric temperatures (tau_neg > tau_pos) help stabilize training by more aggressively suppressing updates from tokens with negative advantages.
- Returns:
Dictionary of default arguments
- property select_keys#
Returns parameter keys mapped to the specific training framework’s naming convention.
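The documented soft gate can be written directly in PyTorch. This sketch implements the formula given above, with the asymmetric temperature selected by the sign of the advantage (tau_pos for positive advantages, tau_neg otherwise); it is an illustration, not the class's own code:

```python
import torch

def soft_gate(ratio, advantages, tau_pos=1.0, tau_neg=1.05):
    """Sketch of the SAPO soft gate f(x) = sigmoid(tau * (x - 1)) * 4 / tau."""
    # Asymmetric temperature: tau_pos where the advantage is positive, tau_neg elsewhere.
    tau = torch.where(advantages > 0,
                      torch.full_like(ratio, tau_pos),
                      torch.full_like(ratio, tau_neg))
    return torch.sigmoid(tau * (ratio - 1.0)) * 4.0 / tau
```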