trinity.trainer.verl

Submodules

trinity.trainer.verl.core_algos module

Modified from core_algos.py

class trinity.trainer.verl.core_algos.KLController[source]

Bases: ABC

abstract update(current_kl, n_steps)[source]

Update the KL coefficient value.

class trinity.trainer.verl.core_algos.AdaptiveKLController(init_kl_coef, target_kl, horizon)[source]

Bases: KLController

Adaptive KL controller described in the paper: https://arxiv.org/pdf/1909.08593.pdf

__init__(init_kl_coef, target_kl, horizon)[source]
update(current_kl, n_steps)[source]

Update the KL coefficient value.
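
For reference, a minimal sketch of the adaptive update rule from the cited paper (Ziegler et al., 2019), assuming the same semantics as update(current_kl, n_steps); the class in this module may differ in details:

    class AdaptiveKLSketch:
        """Illustrative re-implementation of the adaptive KL rule (not the module's code)."""

        def __init__(self, init_kl_coef: float, target_kl: float, horizon: int):
            self.value = init_kl_coef      # current KL coefficient
            self.target = target_kl        # desired KL per update
            self.horizon = horizon         # smoothing horizon in steps

        def update(self, current_kl: float, n_steps: int) -> None:
            # Proportional controller: grow the coefficient when the observed KL
            # exceeds the target, shrink it when the KL falls below the target.
            proportional_error = max(min(current_kl / self.target - 1.0, 0.2), -0.2)
            self.value *= 1.0 + proportional_error * n_steps / self.horizon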

class trinity.trainer.verl.core_algos.FixedKLController(kl_coef)[source]

Bases: KLController

Fixed KL controller.

__init__(kl_coef)[source]
update(current_kl, n_steps)[source]

Update the KL coefficient value.

trinity.trainer.verl.core_algos.get_kl_controller(kl_config)[source]
trinity.trainer.verl.core_algos.compute_opmd_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, opmd_baseline: str = 'mean', tau: float = 1.0)[source]

Modified from compute_grpo_outcome_advantage

Compute advantage for OPMD, operating only on Outcome reward (with only one scalar reward for each response).

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
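
A hedged sketch of how such a group-baselined outcome advantage can be computed, assuming that index groups responses by prompt and that opmd_baseline selects between a group-mean and a temperature-scaled log-avg-exp baseline (both assumptions; the module's definitions may differ):

    import math
    from collections import defaultdict

    import torch

    def opmd_advantage_sketch(token_level_rewards, eos_mask, index,
                              opmd_baseline="mean", tau=1.0):
        scores = token_level_rewards.sum(dim=-1)  # one scalar reward per response
        groups = defaultdict(list)
        for row, idx in enumerate(index.tolist()):
            groups[idx].append(row)
        advantages = torch.zeros_like(scores)
        for rows in groups.values():
            group_scores = scores[rows]
            if opmd_baseline == "mean":
                baseline = group_scores.mean()
            else:  # assumed "logavgexp"-style baseline
                baseline = tau * (torch.logsumexp(group_scores / tau, dim=0)
                                  - math.log(len(rows)))
            advantages[rows] = group_scores - baseline
        # broadcast the scalar advantage over response tokens
        return advantages.unsqueeze(-1) * eos_mask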

trinity.trainer.verl.core_algos.compute_gae_advantage_return(token_level_rewards: Tensor, values: Tensor, eos_mask: Tensor, gamma: Tensor, lam: Tensor)[source]

Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • values(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length). [EOS] mask; tokens after [EOS] have mask zero.

  • gamma(float) discount factor used in RL

  • lam(float) lambda value when computing Generalized Advantage Estimation (https://arxiv.org/abs/1506.02438)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
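
A minimal sketch of the GAE recursion under the documented shapes (a reverse scan over delta_t = r_t + gamma * V_{t+1} - V_t); masking and advantage-whitening details may differ from the module's implementation:

    import torch

    def gae_sketch(token_level_rewards, values, eos_mask, gamma=1.0, lam=1.0):
        bs, length = token_level_rewards.shape
        advantages = torch.zeros_like(token_level_rewards)
        last_gae = torch.zeros(bs)
        for t in reversed(range(length)):
            next_value = values[:, t + 1] if t + 1 < length else torch.zeros(bs)
            delta = token_level_rewards[:, t] + gamma * next_value - values[:, t]
            last_gae = delta + gamma * lam * last_gae
            advantages[:, t] = last_gae
        returns = advantages + values
        return advantages * eos_mask, returns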

trinity.trainer.verl.core_algos.compute_grpo_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, epsilon: float = 1e-06)[source]

Compute advantage for GRPO, operating only on Outcome reward (with only one scalar reward for each response).

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
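
A minimal sketch of the group-normalized outcome advantage this function documents, assuming index identifies responses that share a prompt; the module's handling of edge cases (e.g. singleton groups) may differ:

    from collections import defaultdict

    import torch

    def grpo_advantage_sketch(token_level_rewards, eos_mask, index, epsilon=1e-6):
        scores = token_level_rewards.sum(dim=-1)  # one scalar reward per response
        groups = defaultdict(list)
        for row, idx in enumerate(index.tolist()):
            groups[idx].append(row)
        advantages = torch.zeros_like(scores)
        for rows in groups.values():
            if len(rows) == 1:
                continue  # no group statistics for a singleton group
            group_scores = scores[rows]
            advantages[rows] = (group_scores - group_scores.mean()) / (group_scores.std() + epsilon)
        # broadcast the normalized scalar advantage over response tokens
        return advantages.unsqueeze(-1) * eos_mask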

trinity.trainer.verl.core_algos.compute_rloo_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, index: Tensor, epsilon: float = 1e-06)[source]

Compute advantage for RLOO based on https://arxiv.org/abs/2402.14740

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
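
A minimal sketch of the leave-one-out baseline described in the cited paper, under the documented interface; the module's exact implementation may differ:

    from collections import defaultdict

    import torch

    def rloo_advantage_sketch(token_level_rewards, eos_mask, index):
        scores = token_level_rewards.sum(dim=-1)  # one scalar reward per response
        groups = defaultdict(list)
        for row, idx in enumerate(index.tolist()):
            groups[idx].append(row)
        advantages = torch.zeros_like(scores)
        for rows in groups.values():
            n = len(rows)
            if n == 1:
                continue  # a leave-one-out baseline needs at least two samples
            group_scores = scores[rows]
            # baseline for each sample: mean reward of the other samples in its group
            advantages[rows] = group_scores - (group_scores.sum() - group_scores) / (n - 1)
        return advantages.unsqueeze(-1) * eos_mask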

trinity.trainer.verl.core_algos.compute_reinforce_plus_plus_outcome_advantage(token_level_rewards: Tensor, eos_mask: Tensor, gamma: Tensor)[source]

Compute advantage for REINFORCE++. This implementation is based on the paper: https://arxiv.org/abs/2501.03262

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
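
A minimal sketch of a REINFORCE++-style advantage under the documented shapes: token-level rewards are turned into discounted returns by a reverse scan and then whitened over the masked tokens; the exact whitening used by the module may differ:

    import torch

    def reinforce_plus_plus_advantage_sketch(token_level_rewards, eos_mask, gamma=1.0):
        returns = torch.zeros_like(token_level_rewards)
        running = torch.zeros(token_level_rewards.shape[0])
        for t in reversed(range(token_level_rewards.shape[1])):
            running = token_level_rewards[:, t] + gamma * running
            returns[:, t] = running
        # whiten over the masked (response) tokens
        n = eos_mask.sum()
        mean = (returns * eos_mask).sum() / n
        var = (((returns - mean) * eos_mask) ** 2).sum() / n
        advantages = (returns - mean) / torch.sqrt(var + 1e-8) * eos_mask
        return advantages, returns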

trinity.trainer.verl.core_algos.compute_remax_outcome_advantage(token_level_rewards: Tensor, reward_baselines: Tensor, eos_mask: Tensor)[source]

Compute advantage for ReMax, operating only on Outcome reward (with only one scalar reward for each response). This implementation is based on the paper: https://arxiv.org/abs/2310.10505

Parameters:
  • token_level_rewards(torch.Tensor) shape: (bs, response_length)

  • reward_baselines(torch.Tensor) shape: (bs,)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • advantages(torch.Tensor) shape: (bs, response_length)

  • returns(torch.Tensor) shape: (bs, response_length)
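
A minimal sketch consistent with the documented interface: the advantage is the scalar outcome reward minus the per-prompt reward baseline, broadcast over response tokens; this illustrates the idea rather than the module's exact code:

    import torch

    def remax_advantage_sketch(token_level_rewards, reward_baselines, eos_mask):
        scores = token_level_rewards.sum(dim=-1)                # (bs,)
        advantages = (scores - reward_baselines).unsqueeze(-1) * eos_mask
        returns = advantages.clone()
        return advantages, returns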

trinity.trainer.verl.core_algos.compute_rewards(token_level_scores, old_log_prob, ref_log_prob, kl_ratio)[source]
trinity.trainer.verl.core_algos.compute_policy_loss(old_log_prob, log_prob, eos_mask, **kwargs)[source]

Compute policy loss for PPO / OPMD / pairwise OPMD

trinity.trainer.verl.core_algos.compute_policy_loss_dpo(log_prob, ref_log_prob, eos_mask, loss_type='sigmoid', beta=0.1, label_smoothing=0.0)[source]

Compute policy loss for DPO (Direct Preference Optimization)

Ref: https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py#L918

Parameters:
  • log_prob(torch.Tensor) The log probabilities of the chosen responses from the policy model.

  • ref_log_prob(torch.Tensor) The log probabilities of the chosen responses from the reference model.

  • loss_type(str) Default: “sigmoid” The type of loss function to use.

  • beta(float) Default: 0.1 A temperature parameter that controls the sharpness of the preference signal. Higher values make the loss more sensitive to small differences in log probabilities.

  • label_smoothing(float) Default: 0.0 A parameter to encode uncertainty about the labels. Adds a small amount of smoothing to the loss to avoid overconfident predictions.

Returns:
  • dpo_loss a scalar torch.Tensor

  • chosen_diff(torch.Tensor)

  • rejected_diff(torch.Tensor)
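
For orientation, a sketch of the TRL-style sigmoid DPO loss written with explicit chosen/rejected inputs; the function above instead derives the pairs from its batch and eos_mask, so this shows the formula rather than its exact interface:

    import torch
    import torch.nn.functional as F

    def dpo_sigmoid_loss_sketch(chosen_logps, rejected_logps,
                                ref_chosen_logps, ref_rejected_logps,
                                beta=0.1, label_smoothing=0.0):
        chosen_diff = chosen_logps - ref_chosen_logps        # policy/ref log-ratio, chosen
        rejected_diff = rejected_logps - ref_rejected_logps  # policy/ref log-ratio, rejected
        logits = chosen_diff - rejected_diff
        losses = (-F.logsigmoid(beta * logits) * (1.0 - label_smoothing)
                  - F.logsigmoid(-beta * logits) * label_smoothing)
        return losses.mean(), chosen_diff, rejected_diff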

trinity.trainer.verl.core_algos.compute_policy_loss_pairwise_opmd(old_log_prob, log_prob, token_level_scores, eos_mask, index, tau)[source]

Compute policy loss for pairwise_opmd

NOTE: NOT TESTED YET

TODO: allow using old_log_prob; for now we just discard it.

NOTE: use token_level_scores rather than token_level_rewards, because we’re not sure yet whether this algorithm is compatible with kl penalty as negative reward

Parameters:
  • old_log_prob(torch.Tensor) shape: (bs, response_length)

  • log_prob(torch.Tensor) shape: (bs, response_length)

  • token_level_scores(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

  • index(torch.Tensor) or None (when use_uid is False)

  • tau(float)

Returns:
  • opmd_loss a scalar torch.Tensor, the pairwise OPMD loss

  • pg_clipfrac(float) the fraction of the policy gradient loss being clipped

  • ppo_kl(float) … (TODO: confirm that this is only used for logging stats)

trinity.trainer.verl.core_algos.compute_policy_loss_opmd(old_log_prob, log_prob, advantages, eos_mask, tau)[source]

The OPMD counterpart of verl’s original compute_policy_loss (now renamed as compute_policy_loss_ppo)

Parameters:
  • old_log_prob(torch.Tensor) shape: (bs, response_length)

  • log_prob(torch.Tensor) shape: (bs, response_length)

  • advantages(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

  • tau(float)

Returns:
  • opmd_loss a scalar torch.Tensor, the OPMD loss

  • pg_clipfrac(float) the fraction of the policy gradient loss being clipped

  • ppo_kl(float) … (TODO: confirm that this is only used for logging stats)

trinity.trainer.verl.core_algos.compute_policy_loss_ppo(old_log_prob, log_prob, advantages, eos_mask, cliprange)[source]

Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

Parameters:
  • old_log_prob(torch.Tensor) shape: (bs, response_length)

  • log_prob(torch.Tensor) shape: (bs, response_length)

  • advantages(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

  • cliprange – (float) The clip range used in PPO. See https://arxiv.org/abs/1707.06347

Returns:
  • pg_loss a scalar torch.Tensor, the policy gradient loss computed via PPO

  • pg_clipfrac(float) the fraction of the policy gradient loss being clipped
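
A minimal sketch of the PPO clipped-surrogate objective under the documented shapes; the module additionally reports ppo_kl and may use a shared masked-mean helper:

    import torch

    def ppo_policy_loss_sketch(old_log_prob, log_prob, advantages, eos_mask, cliprange=0.2):
        ratio = torch.exp(log_prob - old_log_prob)
        pg_losses1 = -advantages * ratio
        pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        pg_losses = torch.max(pg_losses1, pg_losses2)
        mask_sum = eos_mask.sum()
        pg_loss = (pg_losses * eos_mask).sum() / mask_sum
        pg_clipfrac = ((pg_losses2 > pg_losses1).float() * eos_mask).sum() / mask_sum
        return pg_loss, pg_clipfrac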

trinity.trainer.verl.core_algos.compute_policy_loss_sft(log_prob, eos_mask)[source]

A simple way to compute the SFT loss, with an interface unified with the PG loss.

Parameters:
  • log_prob(torch.Tensor) shape: (bs, response_length)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:
  • sft_loss a scalar torch.Tensor

  • pg_clipfrac dummy value, merely for compatibility

  • ppo_kl dummy value, merely for compatibility
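
A minimal sketch consistent with the description above: the negative log-likelihood averaged over response tokens, with dummy statistics returned for interface compatibility (their exact placeholder values are an assumption):

    import torch

    def sft_loss_sketch(log_prob, eos_mask):
        sft_loss = -(log_prob * eos_mask).sum() / eos_mask.sum()
        pg_clipfrac = torch.tensor(0.0)  # dummy, kept for a PG-style interface
        ppo_kl = torch.tensor(0.0)       # dummy, kept for a PG-style interface
        return sft_loss, pg_clipfrac, ppo_kl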

trinity.trainer.verl.core_algos.compute_entropy_loss(logits, eos_mask)[source]

Compute Categorical entropy loss

Parameters:
  • logits(torch.Tensor) shape: (bs, response_length, vocab_size)

  • eos_mask(torch.Tensor) shape: (bs, response_length)

Returns:

a scalar torch.Tensor

Return type:

entropy
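
A minimal sketch of the masked categorical entropy over the vocabulary dimension; the module may use a fused or memory-efficient variant:

    import torch
    import torch.nn.functional as F

    def entropy_loss_sketch(logits, eos_mask):
        probs = F.softmax(logits, dim=-1)
        # per-token entropy: log Z - sum_i p_i * logit_i
        entropy = torch.logsumexp(logits, dim=-1) - (probs * logits).sum(dim=-1)
        return (entropy * eos_mask).sum() / eos_mask.sum()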

trinity.trainer.verl.core_algos.compute_value_loss(vpreds, returns, values, eos_mask, cliprange_value)[source]

Compute the value loss. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1151

Parameters:
  • vpreds (torch.FloatTensor) – Predicted values of the value head, shape (batch_size, response_length)

  • values (torch.FloatTensor) – Old values of value head, shape (batch_size, response_length)

  • returns (torch.FloatTensor) – Ground truth returns, shape (batch_size, response_length)

Returns:
  • vf_loss a scalar torch.FloatTensor, the value function loss

  • vf_clipfrac(float) the fraction of value predictions being clipped
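
A minimal sketch of the clipped value loss from the referenced TRL implementation; constants such as the 0.5 factor follow that reference and may differ here:

    import torch

    def value_loss_sketch(vpreds, returns, values, eos_mask, cliprange_value=0.2):
        vpreds_clipped = torch.clamp(vpreds, values - cliprange_value, values + cliprange_value)
        vf_losses1 = (vpreds - returns) ** 2
        vf_losses2 = (vpreds_clipped - returns) ** 2
        mask_sum = eos_mask.sum()
        vf_loss = 0.5 * (torch.max(vf_losses1, vf_losses2) * eos_mask).sum() / mask_sum
        vf_clipfrac = ((vf_losses2 > vf_losses1).float() * eos_mask).sum() / mask_sum
        return vf_loss, vf_clipfrac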

trinity.trainer.verl.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) FloatTensor[source]

Compute KL divergence given logprob and ref_logprob. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104

Parameters:
  • logprob

  • ref_logprob

Returns:

(torch.FloatTensor) the estimated KL penalty for each token
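
A hedged sketch of common per-token KL estimators that such a kl_penalty selector typically supports; the set of names accepted by the module may differ:

    import torch

    def kl_penalty_sketch(logprob, ref_logprob, kl_penalty="kl"):
        if kl_penalty == "kl":
            return logprob - ref_logprob
        if kl_penalty == "abs":
            return (logprob - ref_logprob).abs()
        if kl_penalty == "mse":
            return 0.5 * (logprob - ref_logprob).square()
        if kl_penalty == "low_var_kl":
            # k3 estimator, see http://joschu.net/blog/kl-approx.html
            kl = ref_logprob - logprob
            return torch.clamp(kl.exp() - kl - 1.0, min=-10.0, max=10.0)
        raise NotImplementedError(kl_penalty)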

trinity.trainer.verl.dp_actor module

trinity.trainer.verl.fsdp_workers module

The main entry point to run the PPO algorithm

trinity.trainer.verl.fsdp_workers.create_device_mesh(world_size, fsdp_size)[source]
trinity.trainer.verl.fsdp_workers.get_sharding_strategy(device_mesh)[source]
class trinity.trainer.verl.fsdp_workers.ActorRolloutRefWorker(*args, **kwargs)[source]

Bases: Worker

Depending on config.rollout, this worker can be instantiated as a standalone actor, a standalone rollout, a standalone reference policy, or a hybrid engine.

__init__(config: DictConfig, role: str)[source]
init_model()[source]
setup_weight_sync_group()[source]
sync_weight()[source]
set_mode(algo_type: AlgorithmType = AlgorithmType.PPO)[source]
update_actor(data: DataProto)[source]
generate_sequences(prompts: DataProto)[source]
compute_log_prob(data: DataProto)[source]
compute_ref_log_prob(data: DataProto)[source]
save_checkpoint(local_path, hdfs_path=None, global_step=0, max_ckpt_to_keep=None)[source]
load_checkpoint(local_path, hdfs_path=None, del_local_after_load=False)[source]
clear_optimizer_state()[source]
class trinity.trainer.verl.fsdp_workers.CriticWorker(*args, **kwargs)[source]

Bases: Worker

__init__(config)[source]
init_model()[source]
compute_values(data: DataProto)[source]
update_critic(data: DataProto)[source]
save_checkpoint(local_path, hdfs_path=None, global_step=0, max_ckpt_to_keep=None)[source]
load_checkpoint(local_path, hdfs_path=None, del_local_after_load=True)[source]
clear_optimizer_state()[source]
class trinity.trainer.verl.fsdp_workers.RewardModelWorker(*args, **kwargs)[source]

Bases: Worker

Note that only reward models that are subclasses of AutoModelForTokenClassification are implemented.

__init__(config)[source]
init_model()[source]
compute_rm_score(data: DataProto)[source]

trinity.trainer.verl.ray_trainer module

Modified from ray_trainer.py

class trinity.trainer.verl.ray_trainer.Role(value)[source]

Bases: Enum

To create more roles dynamically, you can subclass Role and add new members

Actor = 0
Rollout = 1
ActorRollout = 2
Critic = 3
RefPolicy = 4
RewardModel = 5
ActorRolloutRef = 6
class trinity.trainer.verl.ray_trainer.AdvantageEstimator(value)[source]

Bases: str, Enum

Using an enumeration class to avoid spelling errors in adv_estimator

GAE = 'gae'
GRPO = 'grpo'
REINFORCE_PLUS_PLUS = 'reinforce_plus_plus'
REMAX = 'remax'
RLOO = 'rloo'
class trinity.trainer.verl.ray_trainer.ResourcePoolManager(resource_pool_spec: dict[str, list[int]], mapping: dict[~trinity.trainer.verl.ray_trainer.Role, str], resource_pool_dict: dict[str, ~verl.single_controller.ray.base.RayResourcePool] = <factory>)[source]

Bases: object

Define a resource pool specification. Resource pools are initialized first; the mapping then assigns each Role to a resource pool.

resource_pool_spec: dict[str, list[int]]
mapping: dict[Role, str]
resource_pool_dict: dict[str, RayResourcePool]
create_resource_pool()[source]
get_resource_pool(role: Role) RayResourcePool[source]

Get the resource pool assigned to the given role.

get_n_gpus() int[source]

Get the number of gpus in this cluster.

__init__(resource_pool_spec: dict[str, list[int]], mapping: dict[~trinity.trainer.verl.ray_trainer.Role, str], resource_pool_dict: dict[str, ~verl.single_controller.ray.base.RayResourcePool] = <factory>) None
trinity.trainer.verl.ray_trainer.apply_kl_penalty(data: DataProto, kl_ctrl: AdaptiveKLController, kl_penalty='kl')[source]
trinity.trainer.verl.ray_trainer.compute_response_mask(data: DataProto)[source]
trinity.trainer.verl.ray_trainer.compute_advantage(data: DataProto, **kwargs)[source]

Extend verl’s original compute_advantage with OPMD

trinity.trainer.verl.ray_trainer.compute_advantage_opmd(data: DataProto, tau=1.0, opmd_baseline='mean')[source]
trinity.trainer.verl.ray_trainer.compute_advantage_ppo(data: DataProto, adv_estimator, gamma=1.0, lam=1.0, num_repeat=1)[source]
class trinity.trainer.verl.ray_trainer.RayPPOTrainer(config, tokenizer, role_worker_mapping: dict[~trinity.trainer.verl.ray_trainer.Role, ~typing.Type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~trinity.trainer.verl.ray_trainer.ResourcePoolManager, ray_worker_group_cls: ~verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None)[source]

Bases: object

Note that this trainer runs on the driver process on a single CPU/GPU node.

__init__(config, tokenizer, role_worker_mapping: dict[~trinity.trainer.verl.ray_trainer.Role, ~typing.Type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~trinity.trainer.verl.ray_trainer.ResourcePoolManager, ray_worker_group_cls: ~verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None)[source]
init_workers()[source]

Initialize the resource pools and worker groups.

fit()[source]

The PPO training loop. The driver process only needs to call the compute functions of the worker groups through RPC to construct the PPO dataflow. The lightweight advantage computation is done on the driver process.

Module contents