trinity.algorithm.policy_loss_fn.chord_policy_loss module

Implements the CHORD policy loss function.

trinity.algorithm.policy_loss_fn.chord_policy_loss.mu_schedule_function(global_step: int, mu_warmup_steps: int, mu_decay_steps: int, mu_peak: float, mu_valley: float) -> float [source]

Computes a cosine decay schedule with a warmup phase for the mu parameter.
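To illustrate the shape of such a schedule, here is a minimal sketch. It assumes a linear warmup from 0 to mu_peak over mu_warmup_steps, followed by a cosine decay from mu_peak to mu_valley over mu_decay_steps. The name mu_schedule_sketch and the linear warmup form are assumptions made for illustration, not the module's source.

```python
import math


def mu_schedule_sketch(global_step, mu_warmup_steps, mu_decay_steps, mu_peak, mu_valley):
    """Hypothetical re-implementation, for illustration only."""
    # Warmup phase (assumed linear): ramp from 0 up to mu_peak.
    if global_step < mu_warmup_steps:
        return mu_peak * global_step / mu_warmup_steps
    # Once warmup and decay are both over, stay at mu_valley.
    if global_step >= mu_warmup_steps + mu_decay_steps:
        return mu_valley
    # Cosine decay from mu_peak down to mu_valley.
    progress = (global_step - mu_warmup_steps) / mu_decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return mu_valley + (mu_peak - mu_valley) * cosine
```

With the defaults shown for MIXCHORDPolicyLossFn (both step counts 0, mu_peak equal to mu_valley), a schedule of this form is constant at mu_valley.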

class trinity.algorithm.policy_loss_fn.chord_policy_loss.SFTISLossFn(backend: str = 'verl', loss_agg_mode: str = 'token-mean') [source]

Bases: PolicyLossFn

SFT loss with importance sampling

__init__(backend: str = 'verl', loss_agg_mode: str = 'token-mean') -> None [source]

Initialize the policy loss function.

Parameters:

backend -- The training framework/backend to use (e.g., "verl")

classmethod default_args() [source]

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework's naming convention.

trinity.algorithm.policy_loss_fn.chord_policy_loss.phi_function(token_prob) [source]

The phi function down-weights tokens with extreme probabilities (close to 0 or 1). Feel free to modify this function.
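One simple function with this property is phi(p) = p * (1 - p), which is zero at p = 0 and p = 1 and largest at p = 0.5. The sketch below uses that form on plain Python floats. Both the formula and the name phi_sketch are assumptions for illustration; the module's phi_function may differ and would normally receive whatever probability type the training backend provides.

```python
def phi_sketch(token_prob: float) -> float:
    """Hypothetical weighting: small near 0 and 1, largest at 0.5."""
    # p * (1 - p) suppresses tokens the model is already very sure
    # or very unsure about, and emphasizes tokens in between.
    return token_prob * (1.0 - token_prob)


# Tokens with extreme probabilities receive less weight than mid-range ones.
weights = [phi_sketch(p) for p in (0.01, 0.5, 0.99)]
```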

class trinity.algorithm.policy_loss_fn.chord_policy_loss.SFTPhiLossFn(backend: str = 'verl', loss_agg_mode: str = 'token-mean', cutoff_prob: float = 1.0) [source]

Bases: PolicyLossFn

SFT loss with transformed phi function

__init__(backend: str = 'verl', loss_agg_mode: str = 'token-mean', cutoff_prob: float = 1.0) -> None [source]

Initialize the policy loss function.

Parameters:

backend -- The training framework/backend to use (e.g., "verl")

classmethod default_args() [source]

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework's naming convention.

class trinity.algorithm.policy_loss_fn.chord_policy_loss.MIXCHORDPolicyLossFn(backend: str = 'verl', mu_warmup_steps: int = 0, mu_decay_steps: int = 0, mu_peak: float = 0.1, mu_valley: float = 0.1, enable_phi_function: bool = True, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None) [source]

Bases: PolicyLossFn

Implements a mixed policy loss combining GRPO and SFT losses.

This loss function applies different loss components to data based on whether it comes from an expert or not, as indicated by expert_mask. It combines:

  • GRPO loss (self.grpo_loss_fn) for non-expert data

  • SFT loss (self.sft_loss_fn) for expert data

    The weight of the SFT loss is controlled globally by the mu_schedule function; the token-wise weights are calculated using different SFT loss formulas.

The per-sample weights are normalized using either experience_per_gpu or gradient_accumulation, depending on whether dynamic batch sizing is enabled, to ensure consistent weighting across different batches of the same type of experience.
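To make the combination concrete, here is a minimal sketch on plain Python lists. It assumes the two components are blended as (1 - mu) * GRPO + mu * SFT and that each component is a simple mean over its own samples. Both choices are illustrative assumptions, and the helper name mix_losses_sketch is hypothetical. Token-level aggregation and the per-GPU normalization described above are omitted.

```python
def mix_losses_sketch(per_sample_grpo, per_sample_sft, expert_mask, mu):
    """Hypothetical blend of GRPO and SFT losses selected by expert_mask."""
    # Non-expert samples contribute to the GRPO term.
    grpo_vals = [g for g, is_expert in zip(per_sample_grpo, expert_mask) if not is_expert]
    # Expert samples contribute to the SFT term.
    sft_vals = [s for s, is_expert in zip(per_sample_sft, expert_mask) if is_expert]
    # Mean over each group; an empty group contributes zero.
    grpo_loss = sum(grpo_vals) / len(grpo_vals) if grpo_vals else 0.0
    sft_loss = sum(sft_vals) / len(sft_vals) if sft_vals else 0.0
    # mu (for example, from a warmup/decay schedule) sets the SFT share.
    return (1.0 - mu) * grpo_loss + mu * sft_loss
```

Under this assumed blend, mu = 0 reduces to the GRPO term alone and mu = 1 to the SFT term alone.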

__init__(backend: str = 'verl', mu_warmup_steps: int = 0, mu_decay_steps: int = 0, mu_peak: float = 0.1, mu_valley: float = 0.1, enable_phi_function: bool = True, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, train_batch_size_usual: int = 1, train_batch_size_expert: int = 1, loss_agg_mode: str = 'token-mean', sft_loss_agg_mode: str | None = None, grpo_loss_agg_mode: str | None = None) -> None [source]

Initialize the policy loss function.

Parameters:

backend -- The training framework/backend to use (e.g., "verl")

classmethod default_args() -> Dict [source]

mu_warmup_steps: int, mu_decay_steps: int, mu_peak: float, mu_valley: float

property select_keys

Returns parameter keys mapped to the specific training framework's naming convention.