Checkpoint Engine | Twinkle

CheckpointEngine

CheckpointEngine synchronizes model weights between trainer and inference processes. Its main use is in RLHF training, where it keeps the Actor models and the Rollout samplers on the same weights.

Basic Interface

from abc import ABC, abstractmethod
from typing import Any, AsyncGenerator


class CheckpointEngine(ABC):
    """Checkpoint engine base class

    The checkpoint engine handles weight synchronization between trainer and inference processes.
    """

    @abstractmethod
    def prepare(self) -> dict[str, Any]:
        """Prepare for weight synchronization"""
        ...

    @abstractmethod
    def init_process_group(self, rank: int, world_size: int, **kwargs):
        """Initialize process group"""
        ...

    @abstractmethod
    async def send_weights(self, weight_generator):
        """Send weights (called in trainer process)"""
        ...

    @abstractmethod
    def receive_weights(self) -> AsyncGenerator:
        """Receive weights (called in inference process)"""
        ...

    @abstractmethod
    def finalize(self):
        """Clean up resources"""
        ...
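
To make the call order of the interface concrete, here is a minimal sketch of a toy engine that moves weights through an in-process queue. InMemoryEngine, sync_once, and the returned metadata dict are hypothetical names for illustration only. The class is not part of Twinkle, does not subclass the real base class, and performs no GPU or NPU communication.

```python
import asyncio
from typing import Any, AsyncGenerator, Dict, Iterable, Tuple

_END = object()  # sentinel: no more weights in this sync round


class InMemoryEngine:
    """Toy engine with CheckpointEngine's method names; not part of Twinkle."""

    def prepare(self) -> Dict[str, Any]:
        # Call from inside a running event loop so the queue binds to it.
        self._queue = asyncio.Queue()
        return {"transport": "in-memory"}  # hypothetical metadata

    def init_process_group(self, rank: int, world_size: int, **kwargs) -> None:
        self.rank, self.world_size = rank, world_size

    async def send_weights(self, weight_generator: Iterable[Tuple[str, Any]]) -> None:
        # Stream (name, weight) pairs, then mark the end of the round.
        for name, weight in weight_generator:
            await self._queue.put((name, weight))
        await self._queue.put(_END)

    async def receive_weights(self) -> AsyncGenerator[Tuple[str, Any], None]:
        # Yield pairs until the sentinel arrives.
        while True:
            item = await self._queue.get()
            if item is _END:
                return
            yield item

    def finalize(self) -> None:
        self._queue = None


async def sync_once(weights: Dict[str, Any]) -> Dict[str, Any]:
    """Run one send/receive round and return what the receiver saw."""
    engine = InMemoryEngine()
    engine.prepare()
    engine.init_process_group(rank=0, world_size=1)
    await engine.send_weights(weights.items())
    received = {name: w async for name, w in engine.receive_weights()}
    engine.finalize()
    return received
```

Calling asyncio.run(sync_once({"layer.weight": 1.0})) walks through prepare, init_process_group, send_weights, receive_weights, and finalize in the order the real engines expect. A real engine runs the sender and the receivers in separate processes.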

Available Checkpoint Engines

Twinkle provides two checkpoint engine implementations:

NCCLCheckpointEngine

A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.

High-Speed Transfer: Uses NCCL for GPU-to-GPU point-to-point high-speed transfer
Zero-Copy: Direct transfer between GPU memories without going through CPU
Bucketed Transfer: Supports bucketed transfer for large models

HCCLCheckpointEngine

A checkpoint engine that uses HCCL for weight transfer between Ascend NPUs.

NPU Optimized: Weight transfer optimized specifically for Ascend NPUs
Efficient Communication: Uses HCCL for high-speed communication between NPUs
Compatible Interface: Maintains consistent interface with NCCLCheckpointEngine

How to Choose

NCCLCheckpointEngine: Suitable for GPU environments, provides the highest transfer performance
HCCLCheckpointEngine: Suitable for Ascend NPU environments
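
The choice above can be written as a small lookup. This is only a sketch: engine_name_for is a hypothetical helper, and the device type strings "cuda" and "npu" are assumptions borrowed from common PyTorch usage rather than anything Twinkle defines.

```python
def engine_name_for(device_type: str) -> str:
    """Map a device type string to the engine class name described above."""
    engines = {
        "cuda": "NCCLCheckpointEngine",  # GPU environments
        "npu": "HCCLCheckpointEngine",   # Ascend NPU environments
    }
    try:
        return engines[device_type]
    except KeyError:
        raise ValueError(f"no checkpoint engine for device type {device_type!r}") from None
```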

The checkpoint engine is a key component of RLHF training infrastructure: it ensures that trainers and samplers use consistent model weights. Synchronization currently takes one of two forms, selected by merge_and_sync:

merge_and_sync=True: The LoRA is merged into the base model, and the merged weights are synchronized
merge_and_sync=False: Only the LoRA weights are synchronized

Additionally, in multi-tenant scenarios, LoRA files are attached directly to vLLM. When merge_and_sync=False or in multi-tenant mode, vLLM must be started with enable_lora=True. When merge_and_sync=True or when training full parameters, enable_lora should be set to False.
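
The enable_lora rule above can be condensed into a small decision function. This is a sketch under stated assumptions: vllm_enable_lora, multi_tenant, and full_params are hypothetical names, and the order in which overlapping cases are resolved (full parameters first, then multi-tenant) is an assumption, since the text does not say which case wins.

```python
def vllm_enable_lora(merge_and_sync: bool,
                     multi_tenant: bool = False,
                     full_params: bool = False) -> bool:
    """Return the value to use for vLLM's enable_lora startup parameter."""
    if full_params:
        # Full-parameter training: no LoRA adapter is loaded by vLLM.
        return False
    if multi_tenant:
        # LoRA files are attached directly to vLLM.
        return True
    # merge_and_sync=True syncs merged weights; False syncs only the LoRA weights.
    return not merge_and_sync
```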

NCCLCheckpointEngine

A checkpoint engine that uses NCCL for high-speed weight transfer between GPUs.

Usage Example

from twinkle.checkpoint_engine import NCCLCheckpointEngine

# In the training process (rank 0). The function names here are illustrative.
async def send_from_trainer(model):
    engine = NCCLCheckpointEngine(bucket_size=512 << 20)  # 512 MB bucket
    engine.is_master = True
    # prepare() returns a metadata dict. It is assumed here to be the value the
    # receivers pass as master_metadata; how it reaches them depends on your launcher.
    metadata = engine.prepare()
    engine.init_process_group(rank=0, world_size=5)

    # Send weights
    await engine.send_weights(model.named_parameters())
    engine.finalize()

# In each inference process (ranks 1-4)
async def receive_in_sampler(model, rank, metadata):
    engine = NCCLCheckpointEngine(bucket_size=512 << 20)
    engine.prepare()
    engine.init_process_group(rank=rank, world_size=5, master_metadata=metadata)

    # Receive weights
    async for name, tensor in engine.receive_weights():
        model.load_state_dict({name: tensor}, strict=False)
    engine.finalize()

Features

High-Speed Transfer: Uses NCCL for GPU-to-GPU point-to-point high-speed transfer
Zero-Copy: Direct transfer between GPU memories without going through CPU
Bucketed Transfer: Supports bucketed transfer for large models

Configuration Parameters

bucket_size: Weight bucket size, controls the amount of data transferred each time. Larger buckets can improve transfer efficiency but consume more memory
timeout: Transfer timeout duration
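
To illustrate what bucket_size controls, the following sketch groups named tensors into buckets whose total size stays within a byte limit. pack_into_buckets is a hypothetical helper and a greedy simplification. It does not reproduce how NCCLCheckpointEngine packs or transfers its buckets internally.

```python
from typing import Iterable, Iterator, List, Tuple


def pack_into_buckets(sizes: Iterable[Tuple[str, int]],
                      bucket_size: int) -> Iterator[List[str]]:
    """Greedily group (name, nbytes) pairs into buckets of at most bucket_size bytes.

    A single tensor larger than bucket_size is placed in a bucket of its own.
    """
    bucket: List[str] = []
    used = 0
    for name, nbytes in sizes:
        # Close the current bucket if this tensor would overflow it.
        if bucket and used + nbytes > bucket_size:
            yield bucket
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket


MB = 1 << 20
tensor_sizes = [("a", 300 * MB), ("b", 300 * MB), ("c", 100 * MB), ("d", 600 * MB)]
buckets = list(pack_into_buckets(tensor_sizes, bucket_size=512 << 20))
# buckets == [["a"], ["b", "c"], ["d"]]
```

With this scheme, a larger bucket_size means fewer transfer rounds, but each round needs a larger staging buffer. This is the efficiency-versus-memory trade-off described for the parameter.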

NCCLCheckpointEngine is the recommended choice for GPU training, providing the highest transfer performance.

HCCLCheckpointEngine

A checkpoint engine that uses HCCL for weight transfer between Ascend NPUs.

Usage Example

from twinkle.checkpoint_engine import HCCLCheckpointEngine

engine = HCCLCheckpointEngine(bucket_size=512<<20)
# Usage is the same as NCCLCheckpointEngine

Features

NPU Optimized: Weight transfer optimized specifically for Ascend NPUs
Efficient Communication: Uses HCCL for high-speed communication between NPUs
Compatible Interface: Maintains consistent interface with NCCLCheckpointEngine

Use Cases

HCCLCheckpointEngine is specifically designed for Ascend NPU environments:

Training on Huawei Ascend NPUs
Synchronizing model weights between NPUs
Large-scale NPU cluster deployment

Environment Variables

TWINKLE_CKPT_HCCL_META_TIMEOUT_S: Controls the timeout (in seconds) for the HCCL CheckpointEngine metadata handshake channel (ZMQ REQ/REP). Default is 300. This value should be an integer greater than 0.
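
The constraints on this variable (an integer greater than 0, defaulting to 300) can be checked with a short helper before launching a job. read_meta_timeout_s is a hypothetical function written for this page. How Twinkle itself parses the variable or reacts to invalid values is not documented here, so treat the error handling as an assumption.

```python
import os
from typing import Mapping

ENV_NAME = "TWINKLE_CKPT_HCCL_META_TIMEOUT_S"


def read_meta_timeout_s(env: Mapping[str, str] = os.environ,
                        default: int = 300) -> int:
    """Return the metadata handshake timeout in seconds, validating the value."""
    raw = env.get(ENV_NAME)
    if raw is None:
        return default  # documented default
    value = int(raw)  # raises ValueError for non-integer text
    if value <= 0:
        raise ValueError(f"{ENV_NAME} must be an integer greater than 0, got {raw!r}")
    return value
```

For example, read_meta_timeout_s({"TWINKLE_CKPT_HCCL_META_TIMEOUT_S": "600"}) returns 600, and calling it with an empty mapping returns the default of 300.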

In Ascend NPU environments, HCCLCheckpointEngine provides performance comparable to NCCL.