trinity.trainer.verl
Submodules
trinity.trainer.verl.core_algos module
Modified from core_algos.py
- class trinity.trainer.verl.core_algos.AdaptiveKLController(init_kl_coef, target_kl, horizon)[source]
Bases: KLController
Adaptive KL controller described in the paper: https://arxiv.org/pdf/1909.08593.pdf
- class trinity.trainer.verl.core_algos.FixedKLController(kl_coef)[source]
Bases: KLController
Fixed KL controller.
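A minimal usage sketch of the two controllers, assuming they follow the TRL-style interface they are modeled on (an update(current_kl, n_steps) method and a value attribute holding the current coefficient); the method and attribute names are assumptions, since they are not documented on this page.

from trinity.trainer.verl.core_algos import AdaptiveKLController, FixedKLController

adaptive = AdaptiveKLController(init_kl_coef=0.2, target_kl=6.0, horizon=10000)
fixed = FixedKLController(kl_coef=0.05)

# Assumed TRL-style update: the adaptive coefficient grows when the observed KL
# exceeds target_kl and shrinks when it falls below; the fixed one never changes.
adaptive.update(7.5, 256)        # observed KL of 7.5 over 256 steps (assumed signature)
print(adaptive.value, fixed.value)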
- trinity.trainer.verl.core_algos.compute_opmd_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, opmd_baseline: str = 'mean', tau: float = 1.0)[source]
Modified from compute_grpo_outcome_advantage
Compute advantage for OPMD, operating only on the outcome reward (a single scalar reward per response).
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
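A call sketch for compute_opmd_outcome_advantage based on the documented shapes. The index argument is assumed to carry a group id per response (responses generated from the same prompt share an id); the exact container the implementation expects may differ from the Tensor shown here.

import torch
from trinity.trainer.verl.core_algos import compute_opmd_outcome_advantage

bs, response_length = 6, 8
token_level_rewards = torch.zeros(bs, response_length)
token_level_rewards[:, -1] = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # outcome reward on the last token
eos_mask = torch.ones(bs, response_length)   # 1 on valid response tokens, 0 after [EOS]
index = torch.tensor([0, 0, 0, 1, 1, 1])     # assumed: group id per response (same prompt => same id)

advantages, returns = compute_opmd_outcome_advantage(
    token_level_rewards, eos_mask, index, opmd_baseline="mean", tau=1.0
)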
- trinity.trainer.verl.core_algos.compute_gae_advantage_return(token_level_rewards: Tensor, values: Tensor, eos_mask: Tensor, gamma: Tensor, lam: Tensor)[source]
Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
values – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length). [EOS] mask. The token after [EOS] have mask zero.
gamma – (float) discount factor used in RL
lam – (float) lambda value when computing Generalized Advantage Estimation (https://arxiv.org/abs/1506.02438)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
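A call sketch for compute_gae_advantage_return following the documented shapes, with the scalar outcome reward placed on the final response token and a fully valid eos_mask.

import torch
from trinity.trainer.verl.core_algos import compute_gae_advantage_return

bs, response_length = 4, 16
token_level_rewards = torch.zeros(bs, response_length)
token_level_rewards[:, -1] = 1.0             # scalar outcome reward on the last token
values = torch.randn(bs, response_length)    # critic predictions for each response token
eos_mask = torch.ones(bs, response_length)   # tokens after [EOS] would be masked with 0

# gamma and lam are documented as floats despite the Tensor annotation in the signature.
advantages, returns = compute_gae_advantage_return(
    token_level_rewards, values, eos_mask, gamma=1.0, lam=0.95
)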
- trinity.trainer.verl.core_algos.compute_grpo_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, epsilon: float = 1e-06)[source]
Compute advantage for GRPO, operating only on the outcome reward (a single scalar reward per response).
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
- trinity.trainer.verl.core_algos.compute_rloo_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, epsilon: float = 1e-06)[source]
Compute advantage for RLOO, based on https://arxiv.org/abs/2402.14740
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
- trinity.trainer.verl.core_algos.compute_reinforce_plus_plus_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, gamma: Tensor)[source]
Compute advantage for REINFORCE++. This implementation is based on the paper: https://arxiv.org/abs/2501.03262
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
- trinity.trainer.verl.core_algos.compute_remax_outcome_advantage(token_level_rewards: Tensor, reward_baselines: Tensor, eos_mask: Tensor)[source]
Compute advantage for ReMax, operating only on the outcome reward (a single scalar reward per response). This implementation is based on the paper: https://arxiv.org/abs/2310.10505
- Parameters:
token_level_rewards – (torch.Tensor) shape: (bs, response_length)
reward_baselines – (torch.Tensor) shape: (bs,)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
advantages – (torch.Tensor) shape: (bs, response_length)
returns – (torch.Tensor) shape: (bs, response_length)
- trinity.trainer.verl.core_algos.compute_rewards(token_level_scores, old_log_prob, ref_log_prob, kl_ratio)[source]
- trinity.trainer.verl.core_algos.compute_policy_loss(old_log_prob, log_prob, eos_mask, **kwargs)[source]
Compute policy loss for PPO / OPMD / pairwise OPMD
- trinity.trainer.verl.core_algos.compute_policy_loss_dpo(log_prob, ref_log_prob, eos_mask, loss_type='sigmoid', beta=0.1, label_smoothing=0.0)[source]
Compute policy loss for DPO (Direct Preference Optimization)
Ref: https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py#L918
- Parameters:
log_prob – (torch.Tensor) The log probabilities of the chosen responses from the policy model.
ref_log_prob – (torch.Tensor) The log probabilities of the chosen responses from the reference model.
loss_type – (str) The type of loss function to use. Default: “sigmoid”
beta – (float) A temperature parameter that controls the sharpness of the preference signal; higher values make the loss more sensitive to small differences in log probabilities. Default: 0.1
label_smoothing – (float) A parameter to encode uncertainty about the labels; adds a small amount of smoothing to the loss to avoid overconfident predictions. Default: 0.0
- Returns:
dpo_loss – a scalar torch.Tensor
chosen_diff – (torch.Tensor)
rejected_diff – (torch.Tensor)
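For reference, the sigmoid objective in the linked TRL code can be sketched as below. This is an illustration of the math, not a verified transcription of compute_policy_loss_dpo's internals; chosen_diff and rejected_diff are assumed to be the policy-vs-reference log-ratios of the chosen and rejected responses.

import torch
import torch.nn.functional as F

beta, label_smoothing = 0.1, 0.0
chosen_diff = torch.randn(4)    # assumed: log pi(y_w|x) - log pi_ref(y_w|x) per preference pair
rejected_diff = torch.randn(4)  # assumed: log pi(y_l|x) - log pi_ref(y_l|x) per preference pair

logits = chosen_diff - rejected_diff
dpo_loss = (
    -F.logsigmoid(beta * logits) * (1 - label_smoothing)
    - F.logsigmoid(-beta * logits) * label_smoothing
).mean()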
- trinity.trainer.verl.core_algos.compute_policy_loss_pairwise_opmd(old_log_prob, log_prob, token_level_scores, eos_mask, index, tau)[source]
Compute policy loss for pairwise_opmd
NOTE: NOT TESTED YET
TODO: allow using old_log_prob; for now we just discard it.
NOTE: use token_level_scores rather than token_level_rewards, because we’re not sure yet whether this algorithm is compatible with kl penalty as negative reward
- Parameters:
old_log_prob – (torch.Tensor) shape: (bs, response_length)
log_prob – (torch.Tensor) shape: (bs, response_length)
token_level_scores – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
index – (torch.Tensor) or None (when use_uid is False)
tau – float
- Returns:
opmd_loss – a scalar torch.Tensor, the pairwise_opmd loss
pg_clipfrac – (float) the fraction of the policy gradient loss being clipped
ppo_kl – (float) … (TODO, confirm that this is only used for logging stats)
- trinity.trainer.verl.core_algos.compute_policy_loss_opmd(old_log_prob, log_prob, advantages, eos_mask, tau)[source]
The OPMD counterpart of verl’s original compute_policy_loss (now renamed to compute_policy_loss_ppo)
- Parameters:
old_log_prob – (torch.Tensor) shape: (bs, response_length)
log_prob – (torch.Tensor) shape: (bs, response_length)
advantages – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
tau – float
- Returns:
opmd_loss – a scalar torch.Tensor, the OPMD loss
pg_clipfrac – (float) the fraction of the policy gradient loss being clipped
ppo_kl – (float) … (TODO, confirm that this is only used for logging stats)
- trinity.trainer.verl.core_algos.compute_policy_loss_ppo(old_log_prob, log_prob, advantages, eos_mask, cliprange)[source]
Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122
- Parameters:
old_log_prob – (torch.Tensor) shape: (bs, response_length)
log_prob – (torch.Tensor) shape: (bs, response_length)
advantages – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
cliprange – (float) The clip range used in PPO. See https://arxiv.org/abs/1707.06347
- Returns:
pg_loss – a scalar torch.Tensor, the policy gradient loss computed via PPO
pg_clipfrac – (float) the fraction of the policy gradient loss being clipped
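A call sketch for compute_policy_loss_ppo with the documented shapes. The first two outputs are the documented scalar loss and clipped fraction; any additional logging statistics (for example a ppo_kl term like the OPMD variants return) are left unnamed here, since they are not documented for this function.

import torch
from trinity.trainer.verl.core_algos import compute_policy_loss_ppo

bs, response_length = 4, 16
old_log_prob = torch.randn(bs, response_length)                    # log-probs from the rollout policy
log_prob = old_log_prob + 0.01 * torch.randn(bs, response_length)  # log-probs from the updated policy
advantages = torch.randn(bs, response_length)                      # e.g. from compute_gae_advantage_return
eos_mask = torch.ones(bs, response_length)

outputs = compute_policy_loss_ppo(old_log_prob, log_prob, advantages, eos_mask, cliprange=0.2)
pg_loss, pg_clipfrac = outputs[0], outputs[1]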
- trinity.trainer.verl.core_algos.compute_policy_loss_sft(log_prob, eos_mask)[source]
A simple way to compute the SFT loss, with an interface unified with the PG losses
- Parameters:
log_prob – (torch.Tensor) shape: (bs, response_length)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
sft_loss – a scalar torch.Tensor
pg_clipfrac – dummy value, merely for compatibility
ppo_kl – dummy value, merely for compatibility
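A minimal sketch of compute_policy_loss_sft; per the notes above, the second and third outputs are dummy values kept only so the signature matches the PG losses.

import torch
from trinity.trainer.verl.core_algos import compute_policy_loss_sft

bs, response_length = 4, 16
log_prob = torch.randn(bs, response_length)  # log-probs of the target (SFT) tokens
eos_mask = torch.ones(bs, response_length)

sft_loss, pg_clipfrac, ppo_kl = compute_policy_loss_sft(log_prob, eos_mask)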
- trinity.trainer.verl.core_algos.compute_entropy_loss(logits, eos_mask)[source]
Compute Categorical entropy loss
- Parameters:
logits – (torch.Tensor) shape: (bs, response_length, vocab_size)
eos_mask – (torch.Tensor) shape: (bs, response_length)
- Returns:
entropy – a scalar torch.Tensor
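A call sketch for compute_entropy_loss; unlike the other losses on this page, it takes full logits with a vocabulary dimension.

import torch
from trinity.trainer.verl.core_algos import compute_entropy_loss

bs, response_length, vocab_size = 4, 16, 32000
logits = torch.randn(bs, response_length, vocab_size)
eos_mask = torch.ones(bs, response_length)

entropy = compute_entropy_loss(logits, eos_mask)  # scalar tensor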
- trinity.trainer.verl.core_algos.compute_value_loss(vpreds, returns, values, eos_mask, cliprange_value)[source]
Compute the value loss. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1151
- Parameters:
vpreds (torch.FloatTensor) – Predicted values of the value head, shape (batch_size, response_length)
values (torch.FloatTensor) – Old values of value head, shape (batch_size, response_length)
returns (torch.FloatTensor) – Ground truth returns, shape (batch_size, response_length)
- Returns:
vf_loss – a scalar torch.FloatTensor, the value function loss
vf_clipfrac – (float) the ratio of value predictions being clipped
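A call sketch for compute_value_loss with the documented shapes; in practice, returns would typically come from compute_gae_advantage_return and values are the value-head predictions saved at rollout time.

import torch
from trinity.trainer.verl.core_algos import compute_value_loss

bs, response_length = 4, 16
vpreds = torch.randn(bs, response_length)   # current value-head predictions
values = torch.randn(bs, response_length)   # old value-head predictions (rollout time)
returns = torch.randn(bs, response_length)  # ground-truth returns, e.g. from GAE
eos_mask = torch.ones(bs, response_length)

vf_loss, vf_clipfrac = compute_value_loss(vpreds, returns, values, eos_mask, cliprange_value=0.5)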
- trinity.trainer.verl.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) → FloatTensor[source]
Compute KL divergence given logprob and ref_logprob. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104
- Parameters:
logprob – (torch.FloatTensor)
ref_logprob – (torch.FloatTensor)
- Returns:
the computed KL penalty (torch.FloatTensor)
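A call sketch for kl_penalty. The third argument selects the KL estimator; "kl", the default used by apply_kl_penalty further down this page, is the only value confirmed here, so other estimator strings should be treated as unverified.

import torch
from trinity.trainer.verl.core_algos import kl_penalty

bs, response_length = 4, 16
logprob = torch.randn(bs, response_length)      # per-token log-probs under the policy
ref_logprob = torch.randn(bs, response_length)  # per-token log-probs under the reference model

per_token_kl = kl_penalty(logprob, ref_logprob, "kl")  # "kl" matches apply_kl_penalty's default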
trinity.trainer.verl.dp_actor module
trinity.trainer.verl.fsdp_workers module
The main entry point to run the PPO algorithm
- class trinity.trainer.verl.fsdp_workers.ActorRolloutRefWorker(*args, **kwargs)[source]
Bases: Worker
This worker can be instantiated as a standalone actor, a standalone rollout, a standalone reference policy, or a hybrid engine, depending on config.rollout.
- set_mode(algo_type: AlgorithmType = AlgorithmType.PPO)[source]
trinity.trainer.verl.ray_trainer module
Modified from ray_trainer.py
- class trinity.trainer.verl.ray_trainer.Role(value)[source]
Bases: Enum
To create more roles dynamically, you can subclass Role and add new members
- Actor = 0
- Rollout = 1
- ActorRollout = 2
- Critic = 3
- RefPolicy = 4
- RewardModel = 5
- ActorRolloutRef = 6
- class trinity.trainer.verl.ray_trainer.AdvantageEstimator(value)[source]
Bases: str, Enum
Using an enumeration class to avoid spelling errors in adv_estimator
- GAE = 'gae'
- GRPO = 'grpo'
- REINFORCE_PLUS_PLUS = 'reinforce_plus_plus'
- REMAX = 'remax'
- RLOO = 'rloo'
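Because the enum subclasses both str and Enum, members compare equal to their plain string values, so configuration strings can be used directly; a small sketch:

from trinity.trainer.verl.ray_trainer import AdvantageEstimator

assert AdvantageEstimator.GRPO == "grpo"                 # str mixin: a member equals its value
estimator = AdvantageEstimator("reinforce_plus_plus")    # parse a config string into a member
print(estimator is AdvantageEstimator.REINFORCE_PLUS_PLUS)  # True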
- class trinity.trainer.verl.ray_trainer.ResourcePoolManager(resource_pool_spec: dict[str, list[int]], mapping: dict[~trinity.trainer.verl.ray_trainer.Role, str], resource_pool_dict: dict[str, ~verl.single_controller.ray.base.RayResourcePool] = <factory>)[source]
Bases: object
Define a resource pool specification and the mapping from roles to resource pools. Resource pools will be initialized first.
- resource_pool_spec: dict[str, list[int]]
- resource_pool_dict: dict[str, RayResourcePool]
- __init__(resource_pool_spec: dict[str, list[int]], mapping: dict[~trinity.trainer.verl.ray_trainer.Role, str], resource_pool_dict: dict[str, ~verl.single_controller.ray.base.RayResourcePool] = <factory>) None
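A minimal construction sketch based on the type hints: resource_pool_spec maps a pool name to a list of per-node GPU counts, and mapping assigns each Role to a pool name. The pool name "global_pool" and the single-node 8-GPU layout are illustrative choices, not values prescribed by this API.

from trinity.trainer.verl.ray_trainer import ResourcePoolManager, Role

resource_pool_spec = {"global_pool": [8]}  # one node with 8 GPUs in a pool named "global_pool"
mapping = {
    Role.ActorRolloutRef: "global_pool",
    Role.Critic: "global_pool",
}
resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)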
- trinity.trainer.verl.ray_trainer.apply_kl_penalty(data: DataProto, kl_ctrl: AdaptiveKLController, kl_penalty='kl')[source]
- trinity.trainer.verl.ray_trainer.compute_advantage(data: DataProto, **kwargs)[source]
Extend verl’s original compute_advantage with OPMD
- trinity.trainer.verl.ray_trainer.compute_advantage_opmd(data: DataProto, tau=1.0, opmd_baseline='mean')[source]
- trinity.trainer.verl.ray_trainer.compute_advantage_ppo(data: DataProto, adv_estimator, gamma=1.0, lam=1.0, num_repeat=1)[source]
- class trinity.trainer.verl.ray_trainer.RayPPOTrainer(config, tokenizer, role_worker_mapping: dict[~trinity.trainer.verl.ray_trainer.Role, ~typing.Type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~trinity.trainer.verl.ray_trainer.ResourcePoolManager, ray_worker_group_cls: ~verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None)[source]
Bases: object
Note that this trainer runs on the driver process on a single CPU/GPU node.
- __init__(config, tokenizer, role_worker_mapping: dict[~trinity.trainer.verl.ray_trainer.Role, ~typing.Type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~trinity.trainer.verl.ray_trainer.ResourcePoolManager, ray_worker_group_cls: ~verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None)[source]