Align configuration with veRL
This guide helps users familiar with veRL align the parameters and metrics in Trinity-RFT with their counterparts in veRL.
Trinity-RFT uses veRL as the training backend (trainer), covering the actor, reference, and critic models. The explorer module in Trinity-RFT is built on vLLM and replaces veRL's native rollout engine. In addition, Trinity-RFT introduces a new buffer module to enhance RFT's full-lifecycle data pipeline, which can be viewed as an enhancement of veRL's RL dataset and DataProto.
Parameter Mapping
The core parameters in veRL fall into these categories: `algorithm`, `data`, `actor_rollout_ref`, `critic`, `reward_model`, and `trainer`.
Trinity-RFT organizes its many reinforcement fine-tuning parameters into several parts according to their functions, e.g., `algorithm`, `model`, `buffer`, `explorer`, `trainer`, `monitor`, `synchronizer`, and `cluster`.
Roughly speaking, the parameters in veRL are mapped to the following modules in Trinity-RFT:
| Configuration | veRL | Trinity-RFT |
|---|---|---|
| Algorithm, e.g., advantage function | | |
| Training and evaluation tasksets | | |
| Batch size (💡 explained later) | | |
| Actor | | |
| Rollout | | |
| Critic | | |
| Reward model | | |
| Some global configurations | | |
In the following, we show how to map the parameters in veRL to the ones in Trinity-RFT. Please refer to the documentation for the detailed parameter configuration of Trinity-RFT.
Note
To match the default training setup of veRL, we set synchronizer.sync_style=fixed and synchronizer.sync_offset=0 in Trinity-RFT.
Algorithm

| veRL | Trinity-RFT | Note |
|---|---|---|
| | | Pass parameters with |
| | | Along with |
| | | Along with |
| | | Disable KL in reward by setting |
| | | Choose from |
| | | - |
💡 Detailed explanation:
Before using the args of an advantage function or policy loss function (e.g., `algorithm.kl_penalty_fn_args`), a good practice is to check the source code to make sure these parameters can be processed properly by the corresponding function.
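For example, the GRPO configuration at the end of this guide passes the KL coefficient to the KL loss function through `kl_loss_fn_args`; `algorithm.kl_penalty_fn_args` is expected to follow the same pattern:

```yaml
algorithm:
  algorithm_type: grpo
  advantage_fn: grpo        # algorithm.adv_estimator=grpo in veRL
  kl_loss_fn: low_var_kl    # actor_rollout_ref.actor.kl_loss_type=low_var_kl
  kl_loss_fn_args:
    kl_coef: 0.001          # actor_rollout_ref.actor.kl_loss_coef
```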
Data

| veRL | Trinity-RFT | Note |
|---|---|---|
| | | - |
| | | - |
| | | Taskset-specific |
| | | Taskset-specific |
| | | The number of tasks to be explored |
| | | Deprecated in veRL |
| | | - |
| | | - |
| | | Explained later |
| | - | Equivalent to |
| | | Taskset-specific |
💡 Detailed explanation:
The note "taskset-specific" means you can set different parameters for each training or evaluation task in `buffer.explorer_input.tasksets[i]` or `buffer.explorer_input.eval_tasksets[i]`.
For the parameters related to batch size, Trinity-RFT uses `buffer.batch_size` to control the number of tasks to be explored in each exploration step, and `buffer.train_batch_size` to control the number of tasks used in each gradient descent step. In most cases, controlling the following parameters can ensure the same effect as veRL:

- `buffer.batch_size` in Trinity-RFT = `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL
- `buffer.train_batch_size` in Trinity-RFT (automatically) = `actor_rollout_ref.rollout.n` * `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL
- `synchronizer.sync_interval` in Trinity-RFT = `data.train_batch_size` / `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL

Do not set `ppo_mini_batch_size`; it is set automatically to match the effect of veRL, although the values may not be the same.
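As a worked illustration, the GRPO example at the end of this guide uses `data.train_batch_size=1024`, `actor_rollout_ref.actor.ppo_mini_batch_size=256`, and `actor_rollout_ref.rollout.n=8` in veRL, which maps to:

```yaml
buffer:
  batch_size: 256         # = ppo_mini_batch_size
  train_batch_size: 2048  # = ppo_mini_batch_size * rollout.n = 256 * 8 (set automatically)
synchronizer:
  sync_interval: 4        # = data.train_batch_size / ppo_mini_batch_size = 1024 / 256
```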
If you want to filter out overlong prompts, you can set `model.enable_prompt_truncation=True` in Trinity-RFT. In this case, the corresponding experiences will not be counted in loss computation, and thus the `truncation` side does not matter anymore.
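A minimal sketch of the corresponding `model` block, mirroring `data.max_prompt_length` and `data.filter_overlong_prompts=True` from the PPO example at the end of this guide:

```yaml
model:
  max_prompt_tokens: 1024         # data.max_prompt_length
  enable_prompt_truncation: true  # data.filter_overlong_prompts=True
```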
Actor, Rollout, and Critic
This section covers the parameters for the actor and the rollout. For easy understanding, you may think of the actor in veRL (`actor_rollout_ref.actor`) as the trainer in Trinity-RFT (`trainer`), and the rollout (`actor_rollout_ref.rollout`) as the explorer (`explorer.rollout_model`).
Note
Parameters under `actor_rollout_ref.rollout` are not effective in Trinity-RFT; please set them properly in other fields instead.
Advanced veRL training configuration can be set in the `trainer.trainer_config` field. For example, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` in veRL is equivalent to `trainer.trainer_config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` in Trinity-RFT. If you want to set up parameters in the `trainer.trainer_config` dictionary, please read the source code in `trinity/common/verl_config.py` carefully!
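For instance, a minimal sketch of overriding the actor micro batch size through `trainer.trainer_config` (the same value appears in the PPO example at the end of this guide):

```yaml
trainer:
  trainer_config:
    actor_rollout_ref:
      actor:
        ppo_micro_batch_size_per_gpu: 16  # actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu in veRL
```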
| veRL | Trinity-RFT | Note |
|---|---|---|
| | | - |
| | | Such as |
| | | Eval taskset-specific: |
| | | The number of tasks to be explored in each exploration step |
| | | - |
| | | - |
| | | The sequence parallel size for the actor |
| | | The gradient clip value for the actor |
| | | If set to |
| | | - |
| | | Can be taskset-specific, like |
| | | Can be taskset-specific |
| | | Can be taskset-specific |
| | | - |
| | | Taskset-specific |
| | | Defaults to |
💡 Detailed explanation:
The note "can be taskset-specific" (take `temperature` as an example) means you can set `model.temperature` for all the tasksets, or set different values for each task in `buffer.explorer_input.taskset.rollout_args.temperature` or `buffer.explorer_input.eval_tasksets[i].rollout_args.temperature`. A concrete example is as follows:
buffer:
explorer_input:
eval_tasksets:
- name: AIME2024
storage_type: file
path: HuggingFaceH4/aime_2024
split: 'train'
repeat_times: 32
format:
prompt_key: 'question'
response_key: 'answer'
rollout_args:
temperature: 1.0
top_p: 0.7
Reward Model
Trinity-RFT supports taskset-specific reward functions as well as reward models. For custom reward functions, you can set `buffer.explorer_input.default_reward_fn_type` to select the corresponding reward function; you can also set `explorer.auxiliary_models` as reward models and use them within your workflow. For example,
buffer:
explorer_input:
default_reward_fn_type: 'custom_reward'
explorer:
auxiliary_models:
- model_path: Qwen/Qwen3-30B-A3B-Instruct-2507
engine_num: 1
tensor_parallel_size: 2
enable_thinking: false
max_prompt_tokens: 19456
max_response_tokens: 1024
max_model_len: 20480
Please refer to the configuration and workflow with LLM-as-a-judge for more details.
Trainer

| veRL | Trinity-RFT | Note |
|---|---|---|
| | | Support a chosen type and (no need to set) |
| | | - |
| | | - |
| | | Checkpoint is saved in |
| | | - |
| | | - |
| | | - |
| | | - |
| | | - |
| | | If not None, |
| | | - |
| | | - |
| | | Explained later |
| | - | Explained later |
💡 Detailed explanation:
If you want to resume training from a checkpoint, you can set `continue_from_checkpoint` to `True`, and the training will start from the latest checkpoint in the checkpoint path `<checkpoint_root_dir>/<project>/<name>/` (if any).
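A minimal sketch, assuming `continue_from_checkpoint` is a top-level key in the Trinity-RFT YAML alongside `project`, `name`, and `checkpoint_root_dir` (check the Trinity-RFT configuration reference for its exact location):

```yaml
project: verl_example
name: Qwen2-7B-Instruct_hybrid_rm
checkpoint_root_dir: ./checkpoints
continue_from_checkpoint: true  # resume from ./checkpoints/verl_example/Qwen2-7B-Instruct_hybrid_rm/ if a checkpoint exists
```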
GPU Resource Allocation
In Trinity-RFT, GPU resources are allocated manually among the explorer, auxiliary models (if any), and the trainer.

- There are `cluster.node_num` nodes in total, and each node has `cluster.gpu_per_node` GPUs.
- The number of GPUs for the explorer is `explorer.rollout_model.engine_num` * `explorer.rollout_model.tensor_parallel_size`.
- The number of GPUs for auxiliary models is the sum of `explorer.auxiliary_models[i].engine_num` * `explorer.auxiliary_models[i].tensor_parallel_size`.
- The remaining GPUs are for the trainer. A worked example is shown below.
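For example, with the GPU layout of the PPO example at the end of this guide (1 node with 8 GPUs), the allocation works out as follows:

```yaml
cluster:
  node_num: 1       # 1 node
  gpu_per_node: 8   # 8 GPUs in total
explorer:
  rollout_model:
    engine_num: 2             # explorer uses 2 * 1 = 2 GPUs
    tensor_parallel_size: 1
  auxiliary_models:
    - model_path: ${oc.env:HOME}/models/FsfairX-LLaMA3-RM-v0.1
      engine_num: 2           # auxiliary reward model uses 2 * 1 = 2 GPUs
      tensor_parallel_size: 1
# The trainer gets the remaining 8 - 2 - 2 = 4 GPUs.
```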
Metrics Mapping
Why do we see two runs for each experiment?
In Trinity-RFT, the explorer is responsible for the rollout process, while the trainer is responsible for the training process. Metrics from these two processes are calculated independently and uploaded to the monitor as separate runs. This is why you will see two runs for each experiment, distinguished by the “_explorer” or “_trainer” suffix.
Why are some metrics different from veRL?
Trinity-RFT uses vLLM as the rollout engine and veRL as the training backend. Due to precision differences between these frameworks, the log probabilities computed on the given tokens may differ. As a result, some metrics (e.g., `actor/ppo_kl` and `actor/pg_clipfrac`) may differ from those observed in veRL. However, when using the same parameters as veRL, these differences are expected to be small.
Example: PPO Training
We port the PPO training example `run_qwen2-7b_rm.sh` from veRL to Trinity-RFT.
The veRL configuration script is as follows:
gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet
train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"
# prepare model ckpt
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir $HOME/models/Qwen2-7B-Instruct &
huggingface-cli download sfairXC/FsfairX-LLaMA3-RM-v0.1 --local-dir $HOME/models/FsfairX-LLaMA3-RM-v0.1 &
wait
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.return_raw_chat=True \
actor_rollout_ref.model.path="$HOME/models/Qwen2-7B-Instruct" \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.optim.lr_warmup_steps_ratio=0.05 \
critic.model.path="$HOME/models/Qwen2-7B-Instruct" \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size_per_gpu=32 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
reward_model.enable=True \
reward_model.model.path="$HOME/models/FsfairX-LLaMA3-RM-v0.1" \
reward_model.model.use_remove_padding=True \
reward_model.model.fsdp_config.param_offload=True \
reward_model.micro_batch_size_per_gpu=32 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.val_before_train=False \
trainer.experiment_name='Qwen2-7B-Instruct_hybrid_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
The corresponding configuration of Trinity-RFT (ppo_example.yaml) is as follows:
project: verl_example
name: Qwen2-7B-Instruct_hybrid_rm
checkpoint_root_dir: ./checkpoints
algorithm:
algorithm_type: ppo
repeat_times: 1
optimizer:
lr: 1e-6
lr_warmup_steps_ratio: 0.1 # actor_rollout_ref.actor.optim.lr_warmup_steps_ratio
advantage_fn: ppo # algorithm.adv_estimator=gae
kl_penalty_fn: none # algorithm.use_kl_in_reward=False
kl_loss_fn: none # actor_rollout_ref.actor.use_kl_loss=False
model:
model_path: ${oc.env:HOME}/models/Qwen2-7B-Instruct
critic_model_path: ${oc.env:HOME}/models/Qwen2-7B-Instruct # critic.model.path
max_prompt_tokens: 1024 # data.max_prompt_length
max_response_tokens: 512 # data.max_response_length
enable_prompt_truncation: true # data.filter_overlong_prompts=True
cluster:
node_num: 1 # trainer.nnodes
gpu_per_node: 8 # trainer.n_gpus_per_node
buffer:
total_epochs: 15 # trainer.total_epochs
batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size
train_batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*1=256
explorer_input:
tasksets:
- name: gsm8k
storage_type: file
path: ${oc.env:HOME}/data/gsm8k
split: train
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
- name: math
storage_type: file
path: ${oc.env:HOME}/data/math
split: train
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
rollout_args:
temperature: 1.0
eval_tasksets:
- name: gsm8k_eval
storage_type: file
path: ${oc.env:HOME}/data/gsm8k
split: test
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
- name: math_eval
storage_type: file
path: ${oc.env:HOME}/data/math
split: test
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
explorer:
eval_interval: 5 # trainer.test_freq
eval_on_startup: false # trainer.val_before_train=False
rollout_model:
engine_num: 2 # The number of GPUs for the rollout model
tensor_parallel_size: 1 # actor_rollout_ref.rollout.tensor_model_parallel_size
gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization
auxiliary_models: # reward_model configuration
- model_path: ${oc.env:HOME}/models/FsfairX-LLaMA3-RM-v0.1
engine_num: 2 # The number of GPUs for the reward model
tensor_parallel_size: 1
synchronizer:
sync_style: fixed
sync_offset: 1
sync_interval: 4 # sync_interval = data.train_batch_size / actor_rollout_ref.actor.ppo_mini_batch_size
sync_timeout: 1200
trainer:
save_interval: 20 # trainer.save_freq
trainer_config:
actor_rollout_ref:
model:
use_remove_padding: true # actor_rollout_ref.model.use_remove_padding
enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing
actor:
ppo_micro_batch_size_per_gpu: 16 # actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
fsdp_config:
param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload
optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload
rollout:
log_prob_micro_batch_size_per_gpu: 16 # actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu
critic:
model:
use_remove_padding: true # critic.model.use_remove_padding
enable_gradient_checkpointing: true # critic.model.enable_gradient_checkpointing
fsdp_config:
param_offload: false # critic.model.fsdp_config.param_offload
optimizer_offload: false # critic.model.fsdp_config.optimizer_offload
optim:
lr: 1e-5 # critic.optim.lr
lr_warmup_steps_ratio: 0.05 # critic.optim.lr_warmup_steps_ratio
ppo_micro_batch_size_per_gpu: 32 # critic.ppo_micro_batch_size_per_gpu
trainer:
critic_warmup: 0 # trainer.critic_warmup
monitor:
monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb is the set value, console is default
The command to run this example is:
trinity run --config ppo_example.yaml
Example: GRPO Training
We port the GRPO training example `run_deepseek7b_llm_seq_balance.sh` from veRL to Trinity-RFT.
The veRL configuration script is as follows:
set -x
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm_seq_packing' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
The corresponding configuration of Trinity-RFT (grpo_example.yaml) is as follows:
project: verl_grpo_example_gsm8k
name: deepseek_llm_7b_function_rm_seq_packing
checkpoint_root_dir: ./checkpoints
algorithm:
algorithm_type: grpo
repeat_times: 8 # actor_rollout_ref.rollout.n=8
optimizer:
lr: 1e-6 # actor_rollout_ref.actor.optim.lr
advantage_fn: grpo # algorithm.adv_estimator=grpo
kl_penalty_fn: none # algorithm.use_kl_in_reward=False
kl_loss_fn: low_var_kl # actor_rollout_ref.actor.kl_loss_type=low_var_kl
kl_loss_fn_args:
kl_coef: 0.001 # actor_rollout_ref.actor.kl_loss_coef
entropy_loss_fn_args:
entropy_coef: 0 # actor_rollout_ref.actor.entropy_coeff=0
model:
model_path: deepseek-ai/deepseek-llm-7b-chat # actor_rollout_ref.model.path
max_prompt_tokens: 512 # data.max_prompt_length
max_response_tokens: 512 # data.max_response_length
enable_prompt_truncation: true # data.filter_overlong_prompts=True
cluster:
node_num: 1 # trainer.nnodes
gpu_per_node: 8 # trainer.n_gpus_per_node
buffer:
total_epochs: 15 # trainer.total_epochs
batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size
train_batch_size: 2048 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*8=2048
explorer_input:
tasksets:
- name: gsm8k
storage_type: file
path: ${oc.env:HOME}/data/gsm8k
split: train
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
eval_tasksets:
- name: gsm8k_eval
storage_type: file
path: ${oc.env:HOME}/data/gsm8k
split: test
format:
prompt_key: prompt # Check the dataset format
response_key: answer # Check the dataset format
explorer:
eval_interval: 5 # trainer.test_freq
rollout_model:
engine_num: 1
tensor_parallel_size: 2 # actor_rollout_ref.rollout.tensor_model_parallel_size
gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization
synchronizer:
sync_style: fixed
sync_offset: 1
sync_interval: 4 # data.train_batch_size / actor_rollout_ref.actor.ppo_mini_batch_size in veRL
sync_timeout: 1200
trainer:
save_interval: 20 # trainer.save_freq
use_dynamic_bsz: true # actor_rollout_ref.actor.use_dynamic_bsz=True
max_token_len_per_gpu: 24000 # actor_rollout_ref.actor.ppo_max_token_len_per_gpu
trainer_config:
actor_rollout_ref:
model:
use_remove_padding: true # actor_rollout_ref.model.use_remove_padding=True
enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing=True
actor:
fsdp_config:
param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload=False
optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
ref:
fsdp_config:
param_offload: true # actor_rollout_ref.ref.fsdp_config.param_offload=True
trainer:
critic_warmup: 0 # trainer.critic_warmup=0
monitor:
monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb is extracted, console is default
The command to run this example is:
trinity run --config grpo_example.yaml