Model | Twinkle

Supported Models

Twinkle supports any model compatible with HuggingFace Transformers or Megatron-LM. Below is a curated list of models tested with Twinkle.

Language Models

Model Family	Model IDs	Parameters	Features
Qwen 3.5	`Qwen/Qwen3.5-0.6B` ~ `Qwen/Qwen3.5-235B-A22B`	0.6B–235B	MoE, Thinking mode
Qwen 2.5	`Qwen/Qwen2.5-0.5B` ~ `Qwen/Qwen2.5-72B`	0.5B–72B	Dense
DeepSeek V4	`deepseek-ai/DeepSeek-V4`	685B MoE	Custom DSML encoding
DeepSeek R1	`deepseek-ai/DeepSeek-R1`	685B MoE	Reasoning
LLaMA 3	`meta-llama/Llama-3.3-70B-Instruct`	8B–70B	Dense
Mistral	`mistralai/Mistral-7B-v0.3`	7B	Dense
Yi	`01-ai/Yi-1.5-34B`	6B–34B	Dense
GLM-4	`THUDM/glm-4-9b-chat`	9B	Dense
InternLM 2.5	`internlm/internlm2_5-7b-chat`	7B–20B	Dense

Vision-Language Models

Model Family	Model IDs	Features
Qwen 3.5 VL	`Qwen/Qwen3.5-VL-3B` ~ `Qwen/Qwen3.5-VL-72B`	Image, Video
Qwen 2.5 VL	`Qwen/Qwen2.5-VL-7B-Instruct`	Image, Video
InternVL 2.5	`OpenGVLab/InternVL2_5-8B`	Image

Embedding Models

Model Family	Model IDs	Training Method
Qwen3 Embedding	`Qwen/Qwen3-Embedding-0.6B`	InfoNCE contrastive
GTE	`thenlper/gte-large-zh`	InfoNCE contrastive

Model Loading

Models can be loaded from ModelScope, HuggingFace, or a local path:

from twinkle.model import TransformersModel

# From ModelScope (ms:// prefix)
model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')

# From HuggingFace (hf:// prefix)
model = TransformersModel(model_id='hf://meta-llama/Llama-3.3-70B-Instruct')

# Local path
model = TransformersModel(model_id='/path/to/model')
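The three forms above differ only in their prefix. As a rough illustration of that convention (this is not Twinkle's actual resolver; `split_model_id` and the hub labels are hypothetical names, and treating an unprefixed value as a local path is an assumption of this sketch), the prefix can be separated from the identifier like this:

```python
from typing import Tuple

# Hypothetical mapping from prefix to a hub label. Only the prefixes
# themselves (ms://, hf://) come from the examples above.
_PREFIXES = {'ms://': 'modelscope', 'hf://': 'huggingface'}


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Return (source, identifier) for a model_id string."""
    for prefix, source in _PREFIXES.items():
        if model_id.startswith(prefix):
            return source, model_id[len(prefix):]
    # No recognised prefix: this sketch assumes a local path. It says
    # nothing about which hub Twinkle would pick by default.
    return 'local', model_id


print(split_model_id('ms://Qwen/Qwen3.5-4B'))
print(split_model_id('/path/to/model'))
```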

Framework Support

Framework	Class	Use Case
Transformers	`TransformersModel`	General training (SFT, RLHF, DPO)
Transformers + Multi-LoRA	`MultiLoraTransformersModel`	Multi-tenant training
Megatron-LM	`MegatronModel`	Large-scale distributed pre-training
Megatron + Multi-LoRA	`MultiLoraMegatronModel`	Large-scale multi-tenant

Precision Support

Mode	Description
`bf16`	BFloat16 mixed precision (recommended for A100/H100)
`fp16`	Float16 mixed precision (for older GPUs)
`fp8`	FP8 precision (H100 with Transformer Engine)
`no`	Full precision (debugging only)

Parallelism Strategies

Strategy	Config Key	Description
FSDP	`strategy=accelerate`	Accelerate-managed FSDP (default)
Native FSDP	`strategy=native_fsdp`	PyTorch native FSDP
Tensor Parallel	`tp_size`	Split layers across GPUs
Pipeline Parallel	`pp_size`	Split model stages
Data Parallel	`dp_size`	Replicate model, split data
Sequence Parallel	`sequence_parallel`	Split long sequences
Expert Parallel	`ep_size`	MoE expert distribution

TwinkleModel

TwinkleModel is the base class for all models in Twinkle. A Twinkle model bundles the network itself with the training components that support it (loss, optimizer, LR scheduler, template, processor, metrics). Most of the components introduced in other documents are combined and used here.

Any model that implements the TwinkleModel interface can be used with the other components of the framework:

class TwinkleModel(ABC):

    @abstractmethod
    def forward(self, *, inputs: Dict[str, Any], **kwargs):
        # Run a forward pass and return logits.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def forward_only(self, *, inputs: Dict[str, Any], **kwargs):
        # Run a forward pass in inference mode and return logits.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def calculate_loss(self, **kwargs):
        # Compute the loss using the configured Loss subclass.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def backward(self, **kwargs):
        # Run a backward pass.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def forward_backward(self, *, inputs: Dict[str, Any], **kwargs):
        # Combine forward, loss calculation, and backward; return the loss value.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def clip_grad_norm(self, max_grad_norm: float = 1.0, norm_type=2, **kwargs):
        # Clip gradients. Takes effect once gradient accumulation is complete;
        # gradient_accumulation_steps can be passed in kwargs.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def step(self, **kwargs):
        # Apply the optimizer update. Takes effect once gradient accumulation is
        # complete; gradient_accumulation_steps can be passed in kwargs.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def zero_grad(self, **kwargs):
        # Clear gradients. Takes effect once gradient accumulation is complete;
        # gradient_accumulation_steps can be passed in kwargs.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def lr_step(self, **kwargs):
        # Advance the learning-rate scheduler. Takes effect once gradient
        # accumulation is complete; gradient_accumulation_steps can be passed in kwargs.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def clip_grad_and_step(self, max_grad_norm: float = 1.0, norm_type=2, **kwargs):
        # Combine clip_grad_norm, step, zero_grad, and lr_step.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def set_loss(self, loss_cls: Union[Loss, Type[Loss], str, Callable[[InputFeature, ModelOutput, ...], torch.Tensor]], **kwargs):
        # Set the loss.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def set_optimizer(self, optimizer_cls: Union[Optimizer, Type[Optimizer], str], **kwargs):
        # Set the optimizer.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def set_lr_scheduler(self, scheduler_cls: Union[LRScheduler, Type[LRScheduler], str], **kwargs):
        # Set the learning-rate scheduler.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def save(self, name: str, output_dir: Optional[str] = None, **kwargs):
        # Save a checkpoint.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def load(self, name: str, output_dir: Optional[str] = None, **kwargs):
        # Load a checkpoint.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def get_state_dict(self, **kwargs):
        # Return the state_dict.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def apply_patch(self, patch_cls: Union[Patch, Type[Patch], str], **kwargs):
        # Apply a patch to the model.
        ...

    @abstractmethod
    def add_metric(self, metric_cls: Union[Metric, str], is_training, **kwargs):
        # Register a metric. is_training selects whether it accumulates during
        # forward (training) or forward_only (evaluation); if it is not set, the
        # metric is tracked separately for both.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def calculate_metric(self, is_training: bool, **kwargs):
        # Compute the metrics and return them.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def add_adapter_to_model(self, adapter_name: str, config_or_dir, **kwargs):
        # Add a LoRA adapter.
        ...

    @abstractmethod
    def set_template(self, template_cls: Union[Template, Type[Template], str], **kwargs):
        # Set the template.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def set_processor(self, processor_cls: Union[InputProcessor, Type[InputProcessor], str], **kwargs):
        # Set the task-specific input processor.
        # Accepts adapter_name to target a specific LoRA.
        ...

    @abstractmethod
    def get_train_configs(self, **kwargs) -> str:
        # Return the training configuration as a printable string.
        # Accepts adapter_name to target a specific LoRA.
        ...
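The combined methods and the "once gradient accumulation is complete" behavior can be pictured with a small stand-in. This is only a sketch under assumptions: `ToyModel` is a hypothetical class that records calls instead of touching tensors, and the modulo-based gating rule is a guess at how a counter of micro-batches could work, not Twinkle's implementation.

```python
from typing import List


class ToyModel:
    """Stand-in that only records which optimizer-side calls would fire."""

    def __init__(self) -> None:
        self.micro_steps = 0          # micro-batches seen so far
        self.calls: List[str] = []    # log of the operations that ran

    def forward_backward(self) -> None:
        # Stands for forward + calculate_loss + backward on one micro-batch.
        self.calls.append('forward_backward')
        self.micro_steps += 1

    def clip_grad_and_step(self, gradient_accumulation_steps: int = 1) -> bool:
        # Assumed gating rule: act only when a full accumulation window is done.
        if self.micro_steps % gradient_accumulation_steps != 0:
            return False
        # Order taken from the comment on clip_grad_and_step.
        self.calls += ['clip_grad_norm', 'step', 'zero_grad', 'lr_step']
        return True


model = ToyModel()
updates = 0
for _ in range(8):
    model.forward_backward()
    updates += model.clip_grad_and_step(gradient_accumulation_steps=4)
print(updates)  # 2
```

With this shape the training loop can call `clip_grad_and_step` on every iteration and leave the decision about when to update to the model.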

TransformersModel

This model wraps a transformers LLM and can launch and train it with FSDP2, DDP, and other strategies.

class TransformersModel:

    def __init__(self,  # noqa
                 model_cls: Optional[Union[Type[PreTrainedModel], str, Type[_BaseAutoModelClass]]] = AutoModelForCausalLM,
                 model_id: Optional[str] = None,
                 config: Optional[PretrainedConfig] = None,
                 device_mesh: Optional[DeviceMesh] = None,
                 mixed_precision: Literal['no', 'fp8', 'fp16', 'bf16'] = 'bf16',
                 strategy: Literal['accelerate', 'native_fsdp'] = 'accelerate',
                 ddp_config: Dict[str, Any] = None,
                 fsdp_config: Dict[str, Any] = None,
                 grad_scaler_config: Dict[str, Any] = None,
                 memory_efficient_init: bool = False,
                 **kwargs):
        ...

    ...

model_cls: Class used to instantiate the model; defaults to AutoModelForCausalLM.
model_id: Model ID or local path.
config: Configuration used to instantiate the model.
device_mesh: DeviceMesh describing the parallel layout.
mixed_precision: Mixed-precision mode; defaults to bf16. Keeping the default is recommended on 30-series or newer GPUs.
strategy: How the model is wrapped for multi-GPU training; defaults to the accelerate framework.
ddp_config: DDP configuration when strategy is accelerate, see:
fsdp_config: FSDP configuration when strategy is accelerate, see:
grad_scaler_config: PyTorch’s grad_scaler initialization configuration, see:
memory_efficient_init: Whether to enable memory-efficient model initialization for FSDP. When enabled, only rank 0 loads the full weights and broadcasts the sharded parameters to the other ranks, which reduces peak memory usage during initialization. Defaults to False. Note: this optimization currently applies only to transformers <= 4.57.6; with transformers >= 5.0.0 it may hurt performance.
kwargs:
- Individual configuration fields can be passed here instead of a full config object. They are forwarded to from_pretrained or from_config.
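Because memory_efficient_init is only described as beneficial for transformers <= 4.57.6, a caller may want to decide the flag from the installed version. The helper below is a hypothetical sketch (`parse_version` and `memory_efficient_init_helps` are not Twinkle APIs). Versions between 4.57.6 and 5.0.0 are not covered by the note above, so this sketch conservatively returns False for them.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '4.57.6' into (4, 57, 6); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def memory_efficient_init_helps(transformers_version: str) -> bool:
    """True only for the version range the documentation names as supported."""
    return parse_version(transformers_version) <= (4, 57, 6)


print(memory_efficient_init_helps('4.57.6'))
print(memory_efficient_init_helps('5.0.0'))
```

The result could then be passed as `memory_efficient_init=memory_efficient_init_helps(transformers.__version__)` when constructing the model.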

TransformersModel supports the @remote_class annotation and supports device_mesh, which means it can run in Ray workers.

Usage example:

from twinkle import DeviceMesh
from twinkle.dataloader import DataLoader
from twinkle.model import TransformersModel

dataloader = DataLoader(...)
model = TransformersModel(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=2, fsdp_size=2),
    remote_group='actor',
)
model.add_adapter_to_model(...)
model.set_optimizer(..., adapter_name='...')
for data in dataloader:
    model.forward_backward(...)
    # Called every iteration; the update takes effect once accumulation is complete.
    model.clip_grad_and_step(..., gradient_accumulation_steps=16)

Checkpoint and Resume

TransformersModel.save() can save either weights only or a resumable training checkpoint.

model.save(name, save_optimizer=True, consumed_train_samples=...) saves weights together with optimizer, scheduler, scaler, RNG, and trainer_state.json.
model.resume_from_checkpoint(checkpoint_dir) restores full training state (weights, optimizer, scheduler, scaler, RNG) and returns {'cur_step', 'consumed_train_samples', 'gradient_accumulation_steps'}.
model.resume_from_checkpoint(checkpoint_dir, resume_only_model=True) loads weights only and returns progress metadata without restoring optimizer state.
dataloader.resume_from_checkpoint(consumed_train_samples) skips already-consumed samples.
dataloader.get_state() returns {'consumed_train_samples': int} — the dataloader automatically tracks consumed samples, so you don’t need to maintain a counter manually.

For full-parameter training, restore model weights by constructing TransformersModel with the checkpoint path as model_id, for example TransformersModel(model_id='./output/fsdp2/last-checkpoint'), and then call resume_from_checkpoint(...) to restore optimizer state and training progress.

For end-to-end resume logic, including dataloader skipping, refer to cookbook/transformers/fsdp2.py.
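The dataloader side of resuming can be illustrated without Twinkle. The class below is a hypothetical stand-in that mirrors the two method names described above (`get_state` and `resume_from_checkpoint`); the internals are assumptions made for this sketch and not the library's implementation.

```python
from typing import Dict, Iterator, List


class ToyDataLoader:
    """Hypothetical loader that counts consumed samples, mimicking get_state()."""

    def __init__(self, samples: List[int], batch_size: int) -> None:
        self.samples = samples
        self.batch_size = batch_size
        self.consumed_train_samples = 0

    def resume_from_checkpoint(self, consumed_train_samples: int) -> None:
        # The next iteration starts after the samples that were already used.
        self.consumed_train_samples = consumed_train_samples

    def get_state(self) -> Dict[str, int]:
        return {'consumed_train_samples': self.consumed_train_samples}

    def __iter__(self) -> Iterator[List[int]]:
        start = self.consumed_train_samples
        for i in range(start, len(self.samples), self.batch_size):
            batch = self.samples[i:i + self.batch_size]
            self.consumed_train_samples += len(batch)
            yield batch


# First run: consume two batches, then stop as if the job was interrupted.
loader = ToyDataLoader(list(range(10)), batch_size=2)
it = iter(loader)
next(it)
next(it)
saved = loader.get_state()

# Second run: a fresh loader skips what the first run consumed.
resumed = ToyDataLoader(list(range(10)), batch_size=2)
resumed.resume_from_checkpoint(saved['consumed_train_samples'])
print(next(iter(resumed)))  # [4, 5]
```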

MultiLoraTransformersModel

This model inherits from TransformersModel. In addition to the base-class functionality, it can time-share multiple LoRA adapters on one base model, which is mainly used for multi-tenant training.

class MultiLoraTransformersModel:

    def __init__(self,  # noqa
                 model_cls = AutoModelForCausalLM,
                 model_id: Optional[str] = None,
                 config: Optional[PretrainedConfig] = None,
                 device_mesh: Optional[DeviceMesh] = None,
                 mixed_precision: Literal['no', 'fp8', 'fp16', 'bf16'] = 'bf16',
                 grad_scaler_config: Dict[str, Any] = None,
                 max_loras: int = 5,
                 max_r: int = 32,
                 max_length: int = 8192,
                 **kwargs):
        ...

    ...

In addition to the same parameters as the base class, this class provides several additional parameters for multi-lora configuration:

max_loras: Maximum number of loras
max_r: Maximum lora rank
max_length: Maximum supported training length

max_loras and max_r exist because Twinkle pre-allocates max_loras LoRA adapters, each sized for rank max_r, before the model is wrapped in DDP; adapters added after wrapping could not be managed by DDP. Consequently, a tenant's r must be less than or equal to max_r. When it is smaller, only part of the pre-allocated rank takes part in the computation.
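To make "only part of the rank is used" concrete, here is a small pure-Python sketch. It assumes the active part is the leading r rows of A and the leading r columns of B in a slot pre-allocated at max_r; how Twinkle actually selects or masks the unused ranks is not described in this document, so treat the slicing as an illustration only.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(a_full: Matrix, b_full: Matrix, x: List[float], r: int) -> List[float]:
    """Compute B[:, :r] @ (A[:r, :] @ x), ignoring the unused ranks."""
    a_used = a_full[:r]                      # first r rows of A (max_r x in)
    b_used = [row[:r] for row in b_full]     # first r columns of B (out x max_r)
    return matvec(b_used, matvec(a_used, x))


# Slot pre-allocated with max_r = 3, input dim 2, output dim 2.
A = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
B = [[1.0, 2.0, 9.0], [3.0, 4.0, 9.0]]
x = [1.0, 1.0]

# A tenant with r = 2 never touches the third row of A or third column of B.
print(lora_delta(A, B, x, r=2))  # [3.0, 7.0]
```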

MultiLoraTransformersModel supports the @remote_class annotation and supports device_mesh, which means it can run in Ray workers.

Tenant Lifecycle

Under the hood, MultiLoraTransformersModel uses the MultiLora manager to handle tenant LoRA slots. The key APIs:

acquire_lora

Claim an available LoRA slot for a tenant:

adapter_name = model.multi_lora.acquire_lora('tenant_a', LoraConfig(r=16, lora_alpha=32))

Raises RuntimeError if all slots are in use or config.r > max_r

release_lora

Release a tenant’s LoRA slot, resetting weights to initial state:

model.multi_lora.release_lora('tenant_a')

Context Manager

Use adapter() for scoped activation:

with model.multi_lora.adapter('tenant_a') as name:
    # inputs is keyword-only in the TwinkleModel.forward signature
    output = model.forward(inputs=inputs)

LoraTenant

Each slot is tracked as a LoraTenant dataclass:

@dataclass
class LoraTenant:
    index: int                 # Slot index (0..max_loras-1)
    adapter_name: str          # Internal name (e.g. "lora_0")
    config: LoraConfig         # Pre-allocated config (max_r)
    tenant_adapter_name: str   # User-facing tenant name (None if free)
    tenant_config: LoraConfig  # Tenant's actual config (None if free)
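The slot bookkeeping described in this section can be sketched in a few lines. Everything below is a hypothetical stand-in: `ToyLoraConfig`, `Slot`, and `ToyMultiLora` are illustrative names, returning the internal slot name from `acquire_lora` is an assumption, and so is raising KeyError for an unknown tenant. Only the two RuntimeError conditions come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyLoraConfig:
    """Stand-in for LoraConfig with only the field this sketch needs."""
    r: int


@dataclass
class Slot:
    index: int
    adapter_name: str
    tenant_adapter_name: Optional[str] = None
    tenant_config: Optional[ToyLoraConfig] = None


class ToyMultiLora:
    def __init__(self, max_loras: int = 5, max_r: int = 32) -> None:
        self.max_r = max_r
        self.slots: List[Slot] = [Slot(i, f'lora_{i}') for i in range(max_loras)]

    def acquire_lora(self, tenant: str, config: ToyLoraConfig) -> str:
        if config.r > self.max_r:
            raise RuntimeError(f'r={config.r} exceeds max_r={self.max_r}')
        for slot in self.slots:
            if slot.tenant_adapter_name is None:
                slot.tenant_adapter_name, slot.tenant_config = tenant, config
                return slot.adapter_name
        raise RuntimeError('all LoRA slots are in use')

    def release_lora(self, tenant: str) -> None:
        for slot in self.slots:
            if slot.tenant_adapter_name == tenant:
                # A real manager would also reset the slot's weights here.
                slot.tenant_adapter_name, slot.tenant_config = None, None
                return
        raise KeyError(tenant)  # assumed behavior for an unknown tenant


pool = ToyMultiLora(max_loras=2, max_r=32)
print(pool.acquire_lora('tenant_a', ToyLoraConfig(r=16)))  # lora_0
pool.acquire_lora('tenant_b', ToyLoraConfig(r=8))
pool.release_lora('tenant_a')
print(pool.acquire_lora('tenant_c', ToyLoraConfig(r=32)))  # lora_0
```

Released slots are reused first in this sketch because the search always starts from index 0.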

MegatronModel

This model wraps a Megatron LLM and can launch it with any combination of TP, DP, CP, PP, and EP.

Note: VPP support currently has known issues. Do not configure or use it for now.

class MegatronModel:

    def __init__(
        self,
        model_id: str,
        config: Optional[PretrainedConfig] = None,
        device_mesh: Optional[DeviceMesh] = None,
        mixed_precision: Literal['no', 'fp16', 'bf16'] = 'bf16',
        **kwargs,
    ):
        ...

    ...

model_id: Model id
config: Configuration for starting the model
device_mesh: DeviceMesh information
mixed_precision: Mixed-precision mode; defaults to bf16. Keeping the default is recommended on 30-series or newer GPUs.
kwargs:
- Any Megatron initialization parameter (configuration option) can be passed through kwargs.

MegatronModel supports the @remote_class annotation and supports device_mesh, which means it can run in Ray workers.

Usage example:

from twinkle import DeviceMesh
from twinkle.dataloader import DataLoader
from twinkle.model import MegatronModel

dataloader = DataLoader(...)
model = MegatronModel(
    model_id='ms://Qwen/Qwen3.5-4B',
    device_mesh=DeviceMesh.from_sizes(dp_size=2, tp_size=2, pp_size=2),
    remote_group='actor',
)
model.add_adapter_to_model(...)
model.set_optimizer('default', adapter_name='...')
for data in dataloader:
    model.forward_backward(...)
    # Called every iteration; the update takes effect once accumulation is complete.
    model.clip_grad_and_step(..., gradient_accumulation_steps=16)

Note:

Megatron models do not support the stock AdamW optimizer; only MegatronDistributedOptimizer is supported. You can pass MegatronDistributedOptimizer explicitly, and it is used by default.

Megatron models do not support other lr_schedulers; only OptimizerParamScheduler is supported. You can pass OptimizerParamScheduler explicitly, and it is used by default.

Pass the tp/cp/dp/ep/pp/sequence_parallel settings through the device_mesh parameter so that Twinkle can manage data distribution. device_mesh forwards these settings to the Megatron initialization process.
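When choosing these sizes it helps to check that they multiply out to the number of devices. The helpers below are a hypothetical sketch, not part of Twinkle: they assume the usual Megatron-style layout in which the device count equals dp × tp × pp × cp, and they leave ep out on the assumption that expert-parallel groups are formed inside the data-parallel dimension rather than adding devices. The `cp_size` parameter name is also an assumption.

```python
def required_world_size(dp_size: int = 1, tp_size: int = 1,
                        pp_size: int = 1, cp_size: int = 1) -> int:
    """Number of devices implied by the given parallel sizes."""
    return dp_size * tp_size * pp_size * cp_size


def dp_size_for(world_size: int, tp_size: int = 1,
                pp_size: int = 1, cp_size: int = 1) -> int:
    """Data-parallel size left over after model parallelism is fixed."""
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f'world_size={world_size} is not divisible by tp*pp*cp={model_parallel}')
    return world_size // model_parallel


# The usage example above (dp_size=2, tp_size=2, pp_size=2) implies 8 devices.
print(required_world_size(dp_size=2, tp_size=2, pp_size=2))  # 8
print(dp_size_for(16, tp_size=2, pp_size=2))                 # 4
```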

MultiLoraMegatronModel

This model inherits from MegatronModel. In addition to the base-class functionality, it can time-share multiple LoRA adapters on one base model, which is mainly used for multi-tenant training.

class MultiLoraMegatronModel:

    def __init__(self,  # noqa
                 model_id: str,
                 config: Optional[PretrainedConfig] = None,
                 device_mesh: Optional[DeviceMesh] = None,
                 mixed_precision: Literal['no', 'fp16', 'bf16'] = 'bf16',
                 max_loras: int = 5,
                 max_r: int = 32,
                 max_length: int = 8192,
                 **kwargs):
        ...

    ...

In addition to the same parameters as the base class, this class provides several additional parameters for multi-lora configuration:

max_loras: Maximum number of loras
max_r: Maximum lora rank
max_length: Maximum supported training length

MultiLoraMegatronModel supports the @remote_class annotation and supports device_mesh, which means it can run in Ray workers.