Runtime Modes

Twinkle supports multiple runtime modes for different deployment scenarios. The same training code runs across all modes with minimal changes.

Single GPU

The simplest mode for development and small-scale training:

from twinkle.model import TransformersModel
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta

def train():
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition'))
    dataset.set_template('Template', model_id='ms://Qwen/Qwen3.5-4B')
    dataset.encode()
    
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
    
    for batch in dataloader:
        model.forward_backward(inputs=batch)
        model.clip_grad_and_step()

if __name__ == '__main__':
    train()

Run directly:

python train.py

torchrun Mode

Distributed training using PyTorch’s native launcher. No Ray dependencies required.

import twinkle
from twinkle import DeviceMesh

# Construct device mesh: FSDP=4, DP=2
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)

# Initialize in local mode
twinkle.initialize(mode='local', global_device_mesh=device_mesh)

def train():
    # Same training code as single GPU
    ...

if __name__ == '__main__':
    train()

Launch with torchrun:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 train.py
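When debugging a torchrun launch, it can help to confirm which rank each process received. torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker. The sketch below reads them with plain Python. The helper name `read_torchrun_env` is hypothetical and not part of Twinkle, and the defaults are an assumption chosen so the script also runs without torchrun.

```python
import os


def read_torchrun_env(env=None):
    """Return rank info from the variables torchrun exports to each worker."""
    env = os.environ if env is None else env
    return {
        # Defaults describe a single-process run, so plain `python` also works.
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


info = read_torchrun_env()
print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

With the launch command above, each of the 8 workers should report a different rank and a world size of 8.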

Device Mesh Options

# FSDP + Data Parallelism
DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)

# Tensor + Pipeline Parallelism
DeviceMesh.from_sizes(tp_size=2, pp_size=4)

# Full 3D Parallelism
DeviceMesh.from_sizes(tp_size=2, pp_size=2, dp_size=2)
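Each of the three meshes above multiplies out to 8 ranks, which matches the `--nproc_per_node=8` launch shown earlier. The following sketch checks that relationship before a job starts. It assumes that the product of the parallel sizes must equal the number of launched processes. That is the usual convention for device meshes, but this page does not confirm it for Twinkle. `mesh_world_size` and `check_mesh` are hypothetical helpers, not Twinkle APIs.

```python
from math import prod


def mesh_world_size(**sizes):
    """Multiply the per-dimension sizes, e.g. fsdp_size=4, dp_size=2 gives 8."""
    for name, size in sizes.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
    return prod(sizes.values())


def check_mesh(nproc, **sizes):
    """Raise if the mesh does not account for every launched process."""
    expected = mesh_world_size(**sizes)
    if expected != nproc:
        raise ValueError(
            f"mesh covers {expected} ranks but {nproc} processes were launched"
        )
    return expected


# The three layouts listed above, each checked against an 8-process launch.
check_mesh(8, fsdp_size=4, dp_size=2)
check_mesh(8, tp_size=2, pp_size=4)
check_mesh(8, tp_size=2, pp_size=2, dp_size=2)
```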

Ray Mode

Distributed training across Ray clusters with advanced resource management:

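import twinkle
from twinkle import DeviceMesh, DeviceGroup
from twinkle.model import TransformersModel
# Import path assumed to mirror twinkle_client.sampler
from twinkle.sampler import vLLMSampler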

# Define resource groups
device_groups = [
    DeviceGroup(name='model', ranks=4, device_type='cuda'),
    DeviceGroup(name='sampler', ranks=4, device_type='cuda'),
]

# Define parallel topology
model_mesh = DeviceMesh.from_sizes(world_size=4, dp_size=4)
sampler_mesh = DeviceMesh.from_sizes(world_size=4, dp_size=4)

# Initialize Ray mode
twinkle.initialize(
    mode='ray',
    nproc_per_node=8,
    groups=device_groups,
    lazy_collect=False
)

def train():
    model = TransformersModel(
        model_id='ms://Qwen/Qwen3.5-4B',
        remote_group='model',
        device_mesh=model_mesh
    )
    
    sampler = vLLMSampler(
        model_id='ms://Qwen/Qwen3.5-4B',
        device_mesh=sampler_mesh,
        remote_group='sampler'
    )
    ...

if __name__ == '__main__':
    train()

Starting Ray Cluster

# Start head node
CUDA_VISIBLE_DEVICES=0,1 ray start --head --port=6379 --num-gpus=2

# Add worker nodes
CUDA_VISIBLE_DEVICES=2,3 ray start --address=127.0.0.1:6379 --num-gpus=2

# CPU-only node
CUDA_VISIBLE_DEVICES="" ray start --address=127.0.0.1:6379 --num-gpus=0
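Before launching, it can be useful to compare what the cluster offers with what the device groups request. The commands above register 2 + 2 + 0 = 4 GPUs, while the Ray example asks for two groups of 4 ranks. The sketch below does that arithmetic. It assumes each rank occupies one GPU, which this page does not state. `ray.init(address="auto")` and `ray.cluster_resources()` are real Ray calls, while the helper names are hypothetical.

```python
def required_gpus(group_ranks):
    """Sum the ranks requested by every group (assumes one GPU per rank)."""
    return sum(group_ranks.values())


def missing_gpus(cluster_resources, group_ranks):
    """Return how many more GPUs are needed; 0 means the cluster is large enough."""
    available = int(cluster_resources.get("GPU", 0))
    return max(0, required_gpus(group_ranks) - available)


def fetch_cluster_resources():
    """Query a running Ray cluster; needs `ray` installed and a started head node."""
    import ray

    ray.init(address="auto")
    return ray.cluster_resources()


# Group sizes copied from the Ray example; the resource dict mimics what
# the three `ray start` commands above would register (4 GPUs in total).
groups = {"model": 4, "sampler": 4}
print(missing_gpus({"GPU": 4.0}, groups))  # prints 4
```

On a live cluster, pass `fetch_cluster_resources()` instead of the hard-coded dictionary.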

Run training:

python train.py

HTTP Mode

Deploy training as an HTTP service for multi-tenant access.

Server Setup

# server.py
import twinkle
from twinkle import DeviceGroup, DeviceMesh

device_groups = [
    DeviceGroup(name='model', ranks=4, device_type='cuda'),
    DeviceGroup(name='sampler', ranks=4, device_type='cuda'),
]

twinkle.initialize(mode='http', groups=device_groups)

# Start server services
# Model cluster, Sampler cluster, Utility cluster

Start the server:

python server.py

Client Training

from twinkle_client import init_twinkle_client
from twinkle_client.model import MultiLoraTransformersModel
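from twinkle_client.sampler import vLLMSampler

# lora_config, dataloader, params, and advantages are placeholders
# that must be defined before this snippet runs.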

# Connect to server
client = init_twinkle_client(
    base_url='http://localhost:8000',
    api_key='your-api-key'
)

# Configure model
model = MultiLoraTransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
model.add_adapter_to_model('default', lora_config)
model.set_optimizer('AdamW', lr=1e-4)

# Configure sampler
sampler = vLLMSampler(model_id='ms://Qwen/Qwen3.5-4B')

# Training loop
for batch in dataloader:
    responses = sampler.sample(inputs=batch, sampling_params=params)
    model.forward_backward(inputs=responses, advantages=advantages)
    model.step()
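The loop above passes `advantages` without showing where they come from. One common choice in RL fine-tuning is to center each group's rewards on their mean and scale by their standard deviation. The sketch below illustrates that choice only as an assumption. This page does not say how Twinkle expects advantages to be computed, and `normalized_advantages` is a hypothetical helper.

```python
from statistics import fmean, pstdev


def normalized_advantages(rewards, eps=1e-6):
    """Center rewards on their mean and scale by their standard deviation."""
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.0, 0.0, 0.5, 0.5]  # made-up scores for four sampled responses
advantages = normalized_advantages(rewards)
print(advantages)
```

Responses that score above the group mean get a positive advantage, and those below it get a negative one.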

Mode Comparison

Mode	Use Case	Dependencies	Scale
Single GPU	Development, small models	None	1 GPU
torchrun	Multi-GPU training	PyTorch	Single node
Ray	Multi-node, RL training	Ray	Multi-node cluster
HTTP	Training as a Service (TaaS), multi-tenancy	Ray + FastAPI	Enterprise

Best Practices

Development: Start with single GPU mode for rapid iteration
Scaling: Move to torchrun for multi-GPU training
RL Training: Use Ray mode for model-sampler coordination
Production: Deploy HTTP mode for multi-tenant services
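The guidance above can be written as a small decision helper. This is a sketch of the rules as stated on this page, not a Twinkle API. `choose_mode` and its parameters are hypothetical, and the returned strings are the mode names from the comparison table rather than values accepted by `twinkle.initialize`.

```python
def choose_mode(num_gpus=1, num_nodes=1, needs_sampler=False, multi_tenant=False):
    """Map a deployment description to a mode name from the comparison table."""
    if multi_tenant:
        return "HTTP"  # production, multi-tenant service
    if needs_sampler or num_nodes > 1:
        return "Ray"  # model-sampler coordination or multi-node
    if num_gpus > 1:
        return "torchrun"  # multi-GPU on a single node
    return "Single GPU"  # development and small models


print(choose_mode())
print(choose_mode(num_gpus=8))
print(choose_mode(num_gpus=8, needs_sampler=True))
print(choose_mode(num_gpus=16, num_nodes=2, multi_tenant=True))
```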
