trinity.algorithm.policy_loss_fn.cispo_policy_loss module

CISPO policy loss function. Refer to https://arxiv.org/abs/2506.13585 for details.

class trinity.algorithm.policy_loss_fn.cispo_policy_loss.CISPOPolicyLossFn(backend: str = 'verl', clip_range_low: float = 1.0, clip_range_high: float = 0.28, enable_mask_clip: bool = False, mask_clip_range_low: float = 1.0, mask_clip_range_high: float = 0.28, loss_agg_mode: str = 'token-mean')

Bases: PolicyLossFn

__init__(backend: str = 'verl', clip_range_low: float = 1.0, clip_range_high: float = 0.28, enable_mask_clip: bool = False, mask_clip_range_low: float = 1.0, mask_clip_range_high: float = 0.28, loss_agg_mode: str = 'token-mean') -> None

Initialize the policy loss function.

Parameters:

backend -- The training framework/backend to use (e.g., "verl")

classmethod default_args() -> Dict

In the original paper, no lower bound is imposed on the importance sampling (IS) weight: clip_range_low is set to a high value, and only clip_range_high is tuned.
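The effect of these defaults can be illustrated with a small pure-Python sketch. It is not the Trinity implementation: the helper names `clipped_is_weight`, `cispo_token_losses`, and `token_mean` are hypothetical, and it assumes the IS weight is clipped to `[1 - clip_range_low, 1 + clip_range_high]` and then treated as a constant multiplier on the token log-probability, as described in the CISPO paper.

```python
import math


def clipped_is_weight(ratio, clip_range_low=1.0, clip_range_high=0.28):
    """Clip an importance sampling ratio.

    Assumption: the bounds are [1 - clip_range_low, 1 + clip_range_high].
    """
    lower = 1.0 - clip_range_low
    upper = 1.0 + clip_range_high
    return min(max(ratio, lower), upper)


def cispo_token_losses(logprob, old_logprob, advantages,
                       clip_range_low=1.0, clip_range_high=0.28):
    """Per-token loss values for lists of floats (no autograd involved)."""
    losses = []
    for lp, old_lp, adv in zip(logprob, old_logprob, advantages):
        ratio = math.exp(lp - old_lp)
        weight = clipped_is_weight(ratio, clip_range_low, clip_range_high)
        # In an autograd framework the weight would be detached
        # (stop-gradient), so gradients flow only through lp.
        losses.append(-weight * adv * lp)
    return losses


def token_mean(losses, mask):
    """Average the loss over the tokens whose mask entry is 1."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(losses, mask)) / max(kept, 1)
```

Under this assumption, `clip_range_low = 1.0` gives a lower bound of 0. Because the ratio is always positive, that bound never binds, so only the upper bound `1 + clip_range_high` has any effect.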

property select_keys

Returns parameter keys mapped to the specific training framework's naming convention.