trinity.trainer.verl.fsdp_checkpoint_manager module
- class trinity.trainer.verl.fsdp_checkpoint_manager.FSDPCheckpointManager(*args, **kwargs)[source]
Bases: FSDPCheckpointManager
An enhanced version of the original FSDP checkpoint manager that:
- Uploads model state dicts to a remote Synchronizer actor (either directly or via checkpoints).
- Offloads saving operations (model, optimizer, extra states) to background threads to avoid blocking the training loop.
This class is useful in distributed training scenarios where synchronization and non-blocking I/O are important.
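For illustration, a minimal sketch of the non-blocking save pattern described above; the helper name and the use of a plain threading.Thread around torch.save are assumptions about the approach, not the actual implementation:

    import threading

    import torch

    def save_in_background(state_dict: dict, path: str) -> threading.Thread:
        # Hypothetical helper: write a state dict to disk on a worker thread
        # so the training loop can continue immediately.
        thread = threading.Thread(target=torch.save, args=(state_dict, path))
        thread.start()
        # The caller should join() this thread before starting the next save
        # (or at shutdown) to guarantee the file is fully written.
        return thread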
- upload_state_dict(global_step: int)[source]
Uploads the full model state dictionary to the Synchronizer actor for remote access.
- Parameters:
global_step (int) – The current training step number.
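A hedged sketch of what such an upload could look like with a named Ray actor; the actor name "Synchronizer" and its set_state_dict method are assumed for illustration and are not confirmed by this page:

    import ray

    def upload_state_dict_sketch(model, global_step: int) -> None:
        # Gather the full model state dict on this rank. A real FSDP setup
        # would use FSDP's state-dict gathering utilities instead.
        state_dict = model.state_dict()
        # Look up the remote Synchronizer by name; the actor name and the
        # set_state_dict method are assumptions, not the documented API.
        synchronizer = ray.get_actor("Synchronizer")
        ray.get(synchronizer.set_state_dict.remote(state_dict, global_step))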
- save_checkpoint(local_path: str, hdfs_path: str | None = None, global_step: int = 0, max_ckpt_to_keep: int | None = None, model_state_dict_only: bool = False)[source]
Modified from verl.utils.checkpoint.fsdp_checkpoint_manager.py:save_checkpoint
Saves the model checkpoint to disk, optionally uploads it to a remote Synchronizer, and uses background threads to prevent blocking the main training loop.
Main improvements over the base class:
- Uses separate threads to save the model, optimizer, and extra states.
- Synchronizes with a remote Synchronizer actor: if the model has not been trained yet (global_step == 0) or training is resuming from a previous checkpoint, the Synchronizer is notified and the model is not saved again.
- Parameters:
local_path (str) – Local directory path to save the checkpoint.
hdfs_path (str, optional) – HDFS path for saving the checkpoint (not implemented here).
global_step (int) – Current training step.
max_ckpt_to_keep (int, optional) – Maximum number of checkpoints to keep locally.
model_state_dict_only (bool) – Whether to only save the model state dict (no optimizer, etc.).
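A usage sketch based only on the signature documented above; the manager variable and the training-step call are placeholders, since the constructor arguments are not documented on this page:

    # Assumes `manager` is an FSDPCheckpointManager wired to an FSDP-wrapped
    # model and its optimizer; construction is omitted here.
    total_steps = 1000
    save_interval = 100

    for step in range(1, total_steps + 1):
        run_training_step()  # placeholder for the actual training step
        if step % save_interval == 0:
            manager.save_checkpoint(
                local_path=f"checkpoints/global_step_{step}",
                global_step=step,
                max_ckpt_to_keep=3,           # prune older local checkpoints
                model_state_dict_only=False,  # also save optimizer/extras
            )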