trinity.algorithm.policy_loss_fn.sapo_policy_loss module#

SAPO policy loss function. Soft Adaptive Policy Optimization (SAPO) is a reinforcement learning algorithm that uses a smooth, temperature-controlled soft gate instead of hard clipping.

Refer to the SAPO paper for details. https://arxiv.org/abs/2511.20347

class trinity.algorithm.policy_loss_fn.sapo_policy_loss.SAPOPolicyLossFn(backend: str = 'verl', tau_pos: float = 1.0, tau_neg: float = 1.05, loss_agg_mode: str = 'token-mean')[source]#

Bases: PolicyLossFn

__init__(backend: str = 'verl', tau_pos: float = 1.0, tau_neg: float = 1.05, loss_agg_mode: str = 'token-mean') → None[source]#

Initialize SAPO policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)
tau_pos – Temperature for positive advantages (τ_pos), default 1.0
tau_neg – Temperature for negative advantages (τ_neg), default 1.05, should be >= tau_pos
loss_agg_mode – Mode for aggregating loss across tokens

soft_gate_function(ratio: Tensor, advantages: Tensor) → Tensor[source]#

Compute the soft gate function f_{i,t}(x).

The soft gate function is defined as:

f_{i,t}(x) = σ(τ_{i,t} * (x - 1)) * 4 / τ_{i,t}

where:

σ is the sigmoid function
τ_{i,t} is the asymmetric temperature (tau_pos or tau_neg)
x is the importance sampling ratio r_{i,t}(θ)

Parameters:

ratio – Token-level importance sampling ratio r_{i,t}(θ)
advantages – Normalized advantage function Â_i (same for all tokens in a sequence)

Returns:

The soft gate values for each token

classmethod default_args() → Dict[source]#

Get default initialization arguments for SAPO.

Default configuration (from the SAPO paper):

tau_pos: 1.0 (temperature for positive advantages)
tau_neg: 1.05 (temperature for negative advantages)
loss_agg_mode: “token-mean” (average over tokens)

The asymmetric temperatures (tau_neg > tau_pos) help stabilize training by more aggressively suppressing updates from tokens with negative advantages.

Returns:: Dictionary of default arguments

property select_keys#: Returns parameter keys mapped to the specific training framework’s naming convention.

trinity.algorithm.policy_loss_fn.sapo_policy_loss module

Contents

trinity.algorithm.policy_loss_fn.sapo_policy_loss module#