trinity.algorithm.policy_loss_fn

Submodules

trinity.algorithm.policy_loss_fn.dpo_loss module

DPO loss function.

class trinity.algorithm.policy_loss_fn.dpo_loss.DPOLossFn(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0)

Bases: PolicyLossFn

__init__(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.
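The beta and label_smoothing arguments match the usual parameters of the DPO objective. The following is a minimal pure-Python sketch of that standard objective for a single preference pair, not the Trinity implementation. The function name dpo_loss and its four log-probability arguments are hypothetical names chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair (illustrative only)."""
    # How much more the policy prefers the chosen response than the reference does.
    logits = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    # label_smoothing mixes in the loss for the flipped preference label.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(beta * logits)
        - label_smoothing * log_sigmoid(-beta * logits)
    )
```

When the policy and the reference agree exactly, the logits are zero and the loss equals log 2 regardless of beta or label_smoothing.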

trinity.algorithm.policy_loss_fn.mix_policy_loss module

Mix policy loss function.

class trinity.algorithm.policy_loss_fn.mix_policy_loss.MIXPolicyLossFn(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, repeat_times: int = 1, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, read_batch_size_usual: int = 1, read_batch_size_expert: int = 1, use_token_level_loss_in_sft: bool = True)

Bases: PolicyLossFn

Implements a mixed policy loss combining GRPO and SFT losses.

This loss function applies different loss components to data based on whether it comes from an expert or not, as indicated by is_expert_mask. It combines:

- GRPO loss (self.grpo_loss_fn) for non-expert data
- SFT loss (self.sft_loss_fn) for expert data
- Weighting parameter mu

The per-sample weights are normalized using either experience_per_gpu or gradient_accumulation, depending on whether dynamic batch sizing is enabled, so that experiences of the same type are weighted consistently across batches.
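The masking idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code. The text above only says that mu is a weighting parameter, so the convex combination (1 - mu) * GRPO + mu * SFT used here is an assumption. A simple per-group mean also stands in for the experience_per_gpu / gradient_accumulation normalization. The function name mix_policy_loss and its list arguments are hypothetical.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    """Mean that returns 0.0 for an empty group instead of raising."""
    return sum(values) / len(values) if values else 0.0


def mix_policy_loss(
    per_sample_grpo: Sequence[float],
    per_sample_sft: Sequence[float],
    is_expert_mask: Sequence[bool],
    mu: float = 0.1,
) -> float:
    """Route each sample to one loss by is_expert_mask, then blend with mu."""
    # Non-expert samples contribute only through the GRPO loss.
    grpo_part = [
        loss for loss, expert in zip(per_sample_grpo, is_expert_mask) if not expert
    ]
    # Expert samples contribute only through the SFT loss.
    sft_part = [
        loss for loss, expert in zip(per_sample_sft, is_expert_mask) if expert
    ]
    # Assumed weighting: convex combination controlled by mu.
    return (1.0 - mu) * _mean(grpo_part) + mu * _mean(sft_part)
```

With this weighting, mu = 0 reduces to GRPO on the non-expert data alone and mu = 1 reduces to SFT on the expert data alone.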

__init__(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, repeat_times: int = 1, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, read_batch_size_usual: int = 1, read_batch_size_expert: int = 1, use_token_level_loss_in_sft: bool = True) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

trinity.algorithm.policy_loss_fn.opmd_policy_loss module

OPMD policy loss function.

class trinity.algorithm.policy_loss_fn.opmd_policy_loss.OPMDPolicyLossFn(backend: str = 'verl', tau: float = 1.0)

Bases: PolicyLossFn

__init__(backend: str = 'verl', tau: float = 1.0) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

trinity.algorithm.policy_loss_fn.policy_loss_fn module

class trinity.algorithm.policy_loss_fn.policy_loss_fn.PolicyLossFnMeta(name, bases, dct)

Bases: ABCMeta

Metaclass for policy loss functions that handles parameter name mapping and filtering.

ignore_keys = {'kwargs', 'logprob', 'self'}

class trinity.algorithm.policy_loss_fn.policy_loss_fn.PolicyLossFn(backend: str = 'verl')

Bases: ABC

Abstract base class for policy loss functions.

This class provides the interface for implementing different policy gradient loss functions while handling parameter name mapping between different training frameworks.

__init__(backend: str = 'verl')

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

abstract classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.
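One way a metaclass and select_keys could cooperate is sketched below: the metaclass records the parameter names of __call__, skips the ignored ones, and the property translates the rest into backend-specific names. This is a guess at the general mechanism, not the actual Trinity code. SketchLossMeta, SketchLoss, ToyLoss, KEY_MAPPING, and the mapped names are all hypothetical. Only the ignore set mirrors the ignore_keys value shown above.

```python
import inspect
from abc import ABCMeta

# Mirrors the ignore_keys value documented for PolicyLossFnMeta.
IGNORE_KEYS = {"kwargs", "logprob", "self"}


class SketchLossMeta(ABCMeta):
    """Hypothetical metaclass: records which inputs __call__ asks for."""

    def __new__(mcs, name, bases, dct):
        cls = super().__new__(mcs, name, bases, dct)
        call = dct.get("__call__")
        if call is not None:
            params = inspect.signature(call).parameters
            cls._call_keys = [p for p in params if p not in IGNORE_KEYS]
        return cls


class SketchLoss(metaclass=SketchLossMeta):
    # Hypothetical mapping from generic names to one backend's names.
    KEY_MAPPING = {
        "verl": {"old_logprob": "old_log_probs", "action_mask": "response_mask"}
    }
    _call_keys: list = []

    def __init__(self, backend: str = "verl"):
        self.backend = backend

    @property
    def select_keys(self):
        mapping = self.KEY_MAPPING.get(self.backend, {})
        # Names without a mapping entry pass through unchanged.
        return [mapping.get(key, key) for key in self._call_keys]


class ToyLoss(SketchLoss):
    def __call__(self, logprob, old_logprob, action_mask, advantages, **kwargs):
        return 0.0
```

In this sketch, a trainer could read ToyLoss().select_keys to decide which batch fields to pass to the loss, without the loss class knowing the backend's naming.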

trinity.algorithm.policy_loss_fn.ppo_policy_loss module

PPO policy loss function.

Modified from https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/core_algos.py

class trinity.algorithm.policy_loss_fn.ppo_policy_loss.PPOPolicyLossFn(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None)

Bases: PolicyLossFn

__init__(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict
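The three clipping arguments suggest the familiar clipped-surrogate PPO objective with optionally asymmetric bounds. A minimal pure-Python sketch of that textbook objective follows. It is not the Trinity or verl implementation. The function name ppo_clip_loss, the fallback of clip_range_low and clip_range_high to clip_range, and the 0.2 default are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def ppo_clip_loss(
    logprob: Sequence[float],
    old_logprob: Sequence[float],
    advantages: Sequence[float],
    action_mask: Sequence[int],
    clip_range: float = 0.2,  # assumed default for this sketch only
    clip_range_low: Optional[float] = None,
    clip_range_high: Optional[float] = None,
) -> float:
    """Token-averaged clipped surrogate loss over the masked tokens."""
    # Assumption: unset asymmetric bounds fall back to the symmetric clip_range.
    low = clip_range if clip_range_low is None else clip_range_low
    high = clip_range if clip_range_high is None else clip_range_high

    total, count = 0.0, 0
    for lp, old_lp, adv, mask in zip(logprob, old_logprob, advantages, action_mask):
        if not mask:
            continue  # skip prompt or padding tokens
        ratio = math.exp(lp - old_lp)
        clipped = min(max(ratio, 1.0 - low), 1.0 + high)
        # Taking the larger of the two losses is the pessimistic bound.
        total += max(-adv * ratio, -adv * clipped)
        count += 1
    return total / max(count, 1)
```

For a positive advantage, a probability ratio above 1 + clip_range_high stops contributing extra gain, which is what limits the size of each policy update.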

trinity.algorithm.policy_loss_fn.sft_loss module

SFT loss function.

class trinity.algorithm.policy_loss_fn.sft_loss.SFTLossFn(backend: str = 'verl', use_token_level_loss: bool = True)

Bases: PolicyLossFn

__init__(backend: str = 'verl', use_token_level_loss: bool = True) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args()

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.
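The name use_token_level_loss suggests a choice between two reductions of the negative log-likelihood. The sketch below contrasts them in plain Python. It assumes that the flag switches between a mean over all unmasked tokens and a mean of per-sequence means. The source does not spell this out, and the function name sft_loss and its nested-list inputs are hypothetical.

```python
from typing import Sequence


def sft_loss(
    logprobs: Sequence[Sequence[float]],
    action_masks: Sequence[Sequence[int]],
    use_token_level_loss: bool = True,
) -> float:
    """Negative log-likelihood with token-level or sequence-level averaging."""
    if use_token_level_loss:
        # Every unmasked token in the batch carries equal weight.
        total = sum(
            -lp
            for seq, mask in zip(logprobs, action_masks)
            for lp, m in zip(seq, mask)
            if m
        )
        n_tokens = sum(1 for mask in action_masks for m in mask if m)
        return total / max(n_tokens, 1)

    # Every sequence carries equal weight, regardless of its length.
    per_sequence = []
    for seq, mask in zip(logprobs, action_masks):
        n_tokens = sum(1 for m in mask if m)
        nll = sum(-lp for lp, m in zip(seq, mask) if m)
        per_sequence.append(nll / max(n_tokens, 1))
    return sum(per_sequence) / max(len(per_sequence), 1)
```

The two reductions differ whenever sequence lengths differ: token-level averaging lets long responses dominate, while sequence-level averaging gives a short response the same influence as a long one.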

Module contents

class trinity.algorithm.policy_loss_fn.PolicyLossFn(backend: str = 'verl')

Bases: ABC

Abstract base class for policy loss functions.

This class provides the interface for implementing different policy gradient loss functions while handling parameter name mapping between different training frameworks.

__init__(backend: str = 'verl')

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

abstract classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

class trinity.algorithm.policy_loss_fn.PPOPolicyLossFn(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None)

Bases: PolicyLossFn

__init__(backend: str = 'verl', clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

class trinity.algorithm.policy_loss_fn.OPMDPolicyLossFn(backend: str = 'verl', tau: float = 1.0)

Bases: PolicyLossFn

__init__(backend: str = 'verl', tau: float = 1.0) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

class trinity.algorithm.policy_loss_fn.DPOLossFn(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0)

Bases: PolicyLossFn

__init__(backend: str = 'verl', beta: float = 0.1, label_smoothing: float = 0.0) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

class trinity.algorithm.policy_loss_fn.SFTLossFn(backend: str = 'verl', use_token_level_loss: bool = True)

Bases: PolicyLossFn

__init__(backend: str = 'verl', use_token_level_loss: bool = True) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args()

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.

class trinity.algorithm.policy_loss_fn.MIXPolicyLossFn(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, repeat_times: int = 1, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, read_batch_size_usual: int = 1, read_batch_size_expert: int = 1, use_token_level_loss_in_sft: bool = True)

Bases: PolicyLossFn

Implements a mixed policy loss combining GRPO and SFT losses.

This loss function applies different loss components to data based on whether it comes from an expert or not, as indicated by is_expert_mask. It combines:

- GRPO loss (self.grpo_loss_fn) for non-expert data
- SFT loss (self.sft_loss_fn) for expert data
- Weighting parameter mu

The per-sample weights are normalized using either experience_per_gpu or gradient_accumulation, depending on whether dynamic batch sizing is enabled, so that experiences of the same type are weighted consistently across batches.

__init__(backend: str = 'verl', mu: float = 0.1, clip_range: float | None = None, clip_range_low: float | None = None, clip_range_high: float | None = None, use_dynamic_bsz: bool | None = None, repeat_times: int = 1, ppo_mini_batch_size: int = 1, ppo_micro_batch_size_per_gpu: int = 1, ngpus_trainer: int = 1, read_batch_size_usual: int = 1, read_batch_size_expert: int = 1, use_token_level_loss_in_sft: bool = True) → None

Initialize the policy loss function.

Parameters:

backend – The training framework/backend to use (e.g., “verl”)

classmethod default_args() → Dict

Get default initialization arguments for this loss function.

Returns:

The default init arguments for the policy loss function.

Return type:

Dict

property select_keys

Returns parameter keys mapped to the specific training framework’s naming convention.