Megatron-LM Backend
This guide walks you through training models with the Megatron-LM backend.
Note
This guide assumes you have already set up your environment from source by following Installation. If you haven’t done so, please refer to that guide first.
Step 1: Install Megatron-LM Support
Install the project in editable mode with Megatron support:
pip install -e ".[megatron]"
# or, with uv:
# uv sync --extra megatron
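If the install succeeded, megatron.core should be importable. As an optional sanity check (the version attribute is not guaranteed in every release, hence the fallback):

# Optional: confirm that megatron-core is importable in this environment
python -c "import megatron.core; print(getattr(megatron.core, '__version__', 'installed'))"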
Then, install NVIDIA’s Apex library for mixed-precision training:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" \
--config-settings "--build-option=--cuda_ext" \
--resume-retries 10 git+https://github.com/NVIDIA/apex.git
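Building Apex can take a while. To confirm that the CUDA extensions were actually compiled rather than silently skipped, you can try importing the compiled modules; this is an informal check, not an official Apex command:

# Optional: these modules exist only if Apex was built with --cpp_ext / --cuda_ext
python -c "import apex, amp_C, fused_layer_norm_cuda; print('Apex CUDA extensions OK')"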
Alternative: Use Docker
We provide a Docker setup to simplify environment management.
Build the Docker Image
Trinity-RFT provides a dedicated Dockerfile for Megatron-LM located at scripts/docker_for_megatron/Dockerfile. You can build the image using the following command:
docker build -f scripts/docker_for_megatron/Dockerfile -t trinity-rft-megatron:latest .
💡 You can customize the Dockerfile before building — for example, to add pip mirrors or set API keys.
Run the Container
docker run -it \
--gpus all \
--shm-size="64g" \
--rm \
-v $PWD:/workspace \
-v <your_data_and_checkpoints_path>:/data \
trinity-rft-megatron:latest
Replace <your_data_and_checkpoints_path> with the actual path on your machine where datasets and model checkpoints are stored.
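Before launching training, it is worth verifying from inside the container that the GPUs and the mounted data directory are visible:

# Run inside the container
nvidia-smi -L   # list the GPUs exposed by --gpus all
ls /data        # confirm your datasets/checkpoints mount is present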
Step 2: Configure and Run Training
Most configuration settings are covered in the Quick Start Guide. Here, we’ll focus only on Megatron-LM-specific settings.
Megatron Configuration Example
Below is an example of how to configure the actor, reference model, and critic to use Megatron-LM:
actor_rollout_ref:
  ...
  actor:
    strategy: megatron  # Kept for backward compatibility
    megatron:
      # Model parallelism settings
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      # Offloading (keep false unless you're memory-constrained)
      param_offload: false
      grad_offload: false
      optimizer_offload: false
      # Use MBridge for parameter import/export (optional)
      use_mbridge: false
      # Use Megatron distributed checkpointing
      use_dist_checkpointing: false
      dist_checkpointing_path: null
      # Recomputation settings (help save memory during training)
      override_transformer_config:
        recompute_granularity: full
        recompute_method: uniform
        recompute_num_layers: 1
    ...
  ref:
    megatron:
      tensor_model_parallel_size: 2
      pipeline_model_parallel_size: 1
      expert_model_parallel_size: 1
      param_offload: false
      grad_offload: false
      optimizer_offload: false
      use_mbridge: false
      use_dist_checkpointing: false
      dist_checkpointing_path: null
      override_transformer_config:
        recompute_granularity: full
        recompute_method: uniform
        recompute_num_layers: 1
  ...
critic:
  strategy: megatron
  megatron:
    tensor_model_parallel_size: 2
    pipeline_model_parallel_size: 1
    expert_model_parallel_size: 1
    param_offload: false
    grad_offload: false
    optimizer_offload: false
    use_mbridge: false
    use_dist_checkpointing: false
    dist_checkpointing_path: null
    override_transformer_config:
      recompute_granularity: full
      recompute_method: uniform
      recompute_num_layers: 1
  ...
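When choosing the parallelism sizes above, keep in mind that the number of GPUs assigned to each model must be divisible by tensor_model_parallel_size × pipeline_model_parallel_size; the quotient becomes the data-parallel size. A quick illustrative check (the shell variables here are for the arithmetic only, not Trinity-RFT settings):

# Illustrative: with 8 GPUs, TP=2 and PP=1 leave a data-parallel size of 4
NGPUS=8; TP=2; PP=1
echo "data_parallel_size = $(( NGPUS / (TP * PP) ))"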
Training Mixture-of-Experts (MoE) Models
If you’re training an MoE model like Qwen/Qwen3-30B-A3B, you’ll need to take one of the following two approaches to ensure it works properly:
1. Use MBridge (recommended): simply set use_mbridge: true in your configuration file. This enables the necessary support for MoE models directly.
2. Convert the model manually: if you prefer not to use MBridge, set use_mbridge: false. Before training, you must first convert your Hugging Face model to the MCore format using the Hugging Face to MCore converter from the verl repository; an example invocation is sketched below. After conversion, update your config with:

use_dist_checkpointing: true
dist_checkpointing_path: /PATH/TO/CONVERTED/MODEL/
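For the manual conversion route, the command looks roughly like the following. The script name and flags come from the verl repository and may change between verl versions, so verify them against verl’s current documentation before running:

# Sketch: convert a Hugging Face checkpoint to Megatron MCore format with verl's converter
# (script path and flags may differ across verl versions)
python scripts/converter_hf_to_mcore.py \
    --hf_model_path Qwen/Qwen3-30B-A3B \
    --output_path /PATH/TO/CONVERTED/MODEL/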
⚠️ Important: If you skip both steps, the MoE model may fail to load or train correctly. Make sure to follow one of the two options above.