Training Middleware | Twinkle

DeviceMesh/DeviceGroup


These two classes describe hardware resource allocation and network topology. Twinkle’s data distribution and collection also rely on them.

DeviceGroup

@dataclass
class DeviceGroup:
    name: str
    ranks: Union[List[int], int]
    device_type: str
    visible_devices: Optional[str] = None  # Optional: explicitly set visible devices (e.g., "8,9")
    gpus_per_worker: int = 1

name: Name of the resource group
ranks: Ranks occupied by the group, given as a list of rank indices or as an int; CPU resources only support the int form
device_type: Hardware type, such as GPU, CPU, or NPU
visible_devices: Visible devices, used when only part of a rank’s hardware should be used
gpus_per_worker: Number of devices each worker occupies

For RL training, developers can construct several such groups and assign the corresponding models and samplers to them.
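As a minimal sketch of the RL case, the two groups below split eight ranks between an actor and a sampler. The dataclass is re-declared locally so the snippet runs without Twinkle installed; the group names and the rank layout are illustrative assumptions, not a recommended configuration.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class DeviceGroup:
    # Local stand-in that mirrors the fields documented above.
    name: str
    ranks: Union[List[int], int]
    device_type: str
    visible_devices: Optional[str] = None
    gpus_per_worker: int = 1


# Hypothetical layout: ranks 0-3 train the actor, ranks 4-7 serve sampling.
device_groups = [
    DeviceGroup(name='actor', ranks=list(range(0, 4)), device_type='cuda'),
    DeviceGroup(name='sampler', ranks=list(range(4, 8)), device_type='cuda'),
]

# In this layout the two groups do not share any rank.
overlap = set(device_groups[0].ranks) & set(device_groups[1].ranks)
```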

DeviceMesh

DeviceMesh carries component topology and distributed parallel information. This class is passed within components for data distribution and data collection.

@dataclass
class DeviceMesh:
    ...

    @staticmethod
    def from_sizes(*, world_size: int = 1, dp_size: int = 1, fsdp_size: Optional[int] = None,
                   tp_size: Optional[int] = None, pp_size: Optional[int] = None,
                   ulysses_size: Optional[int] = None, cp_size: Optional[int] = None,
                   ep_size: Optional[int] = None, etp_size: Optional[int] = None,
                   ep_fsdp_size: Optional[int] = None, vpp_size: Optional[int] = None,
                   device_type: str = 'cuda', sequence_parallel: bool = False) -> "DeviceMesh":
        ...

Constructing a DeviceMesh with from_sizes is recommended.

Parameter Reference

Parameter	Description	Default
`world_size`	Total number of processes	1
`dp_size`	Data parallel degree	1
`fsdp_size`	Fully Sharded Data Parallel degree	None
`tp_size`	Tensor parallel degree	None
`pp_size`	Pipeline parallel degree	None
`ulysses_size`	Ulysses sequence parallel degree	None
`cp_size`	Context parallel degree	None
`ep_size`	Expert parallel degree (for MoE models)	None
`etp_size`	Expert tensor parallel degree	None
`ep_fsdp_size`	FSDP degree within each EP group	None
`vpp_size`	Virtual pipeline parallel degree	None
`device_type`	Device type (`cuda`, `npu`, etc.)	`cuda`
`sequence_parallel`	Enable Megatron-style sequence parallel	False

For example:

sampler_device_mesh = DeviceMesh.from_sizes(dp_size=4)
actor_device_mesh = DeviceMesh.from_sizes(dp_size=2, pp_size=2, tp_size=2)

dataloader = DataLoader(...)
sampler = vLLMSampler(..., device_mesh=sampler_device_mesh, remote_group=...)
actor = MegatronModel(..., device_mesh=actor_device_mesh, remote_group=...)

for data in dataloader:
    sampler_output = sampler.sample(data)
    input_data = [seq.new_input_feature for response in sampler_output for seq in response.sequences]
    ...
    model_output = actor.forward(input_data)

The pseudo-code above moves data through the following steps:

dataloader fetches data -> distributes to sampler according to dp_size=4 -> collects data according to dp_size=4 -> distributes to model according to dp_size=2 -> collects output according to dp_size=2

DeviceMesh lets data flow between groups and components even when their parallel layouts differ.

The get_slice method of DeviceMesh decides which part of the data each worker receives:

batch[device_mesh.get_slice(len(batch))]

get_slice uses the current rank to determine which dp group the worker belongs to and returns the slice for the corresponding data. This happens in the DeviceMeshSampler of DataLoader and in the dispatch and collect steps of remote_class.
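The slicing idea can be sketched without Twinkle. The helper `dp_slice` below is hypothetical: unlike get_slice, it takes the dp rank as an explicit argument, and its handling of batches that do not divide evenly (ceiling-sized contiguous chunks) is an assumption rather than Twinkle's documented behavior. The example replays the flow described above: shard for a dp_size=4 sampler, collect, then reshard for a dp_size=2 actor.

```python
def dp_slice(total: int, dp_rank: int, dp_size: int) -> slice:
    """Return the contiguous slice of a batch owned by one data-parallel rank."""
    per_rank = -(-total // dp_size)  # ceiling division
    start = min(dp_rank * per_rank, total)
    return slice(start, min(start + per_rank, total))


batch = list(range(8))

# Distribute to the sampler (dp_size=4), then collect the shards in rank order.
sampler_shards = [batch[dp_slice(len(batch), r, 4)] for r in range(4)]
collected = [x for shard in sampler_shards for x in shard]

# Redistribute the collected data to the actor (dp_size=2).
actor_shards = [collected[dp_slice(len(collected), r, 2)] for r in range(2)]
```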

Expert Parallel (EP)


Expert Parallel distributes Mixture-of-Experts (MoE) model experts across multiple GPUs, allowing each rank to hold a subset of experts. This reduces per-GPU memory and enables training of large MoE models.

Overview

Concept	Description
ExpertParallelConfig	Configuration dataclass controlling EP behavior
apply_expert_parallel()	Entry point that shards experts and patches forward
shard_experts()	Evenly splits experts across EP ranks
patch_forward()	Replaces MoE block forward with EP-aware all-to-all communication

Configuration

from twinkle.model.transformers.moe.expert_parallel import ExpertParallelConfig

config = ExpertParallelConfig(
    enabled=True,                # Enable expert parallel
    router_dtype='fp32',         # Router computation dtype: 'fp32', 'bf16', 'fp16'
    keep_router_logits=True,     # Return router logits alongside hidden states
    ignore_shared_experts=False, # Skip shared expert computation (e.g. DeepSeek)
    ep_size=None,                # EP world size (consumed by TransformersModel)
)

Usage with DeviceMesh

EP is activated by setting ep_size in DeviceMesh.from_sizes(). The framework automatically calls apply_expert_parallel() during model initialization.

from twinkle.utils import DeviceMesh

# 8 GPUs: 2-way EP × 4-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=4,
    ep_size=2,
)

For combined EP + FSDP sharding on the expert parameters:

# 8 GPUs: 2-way EP with FSDP within each EP group
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ep_size=2,
    ep_fsdp_size=2,
)

Communication Pattern

The EP forward pass follows a 4-stage pipeline:

Preprocess — compute per-expert token counts and split sizes
Token Pre-All2All — permute tokens by expert assignment, then all-to-all exchange across EP ranks
Expert Compute — each rank runs its local experts on received tokens
Token Post-All2All — all-to-all exchange results back, unpermute and apply routing weights

Input tokens → Router → [preprocess] → [pre_all2all] → [local experts] → [post_all2all] → Output
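The four stages can be traced in a single process with plain lists standing in for ranks. This is a toy simulation, not Twinkle's implementation: the function name is hypothetical, routing is top-1, experts are assumed to be assigned to ranks in contiguous blocks, and the all-to-all exchange is modeled as a list transpose.

```python
from typing import Callable, List


def simulate_ep_forward(
    tokens: List[List[float]],   # tokens[r]: tokens held by EP rank r
    experts: List[List[int]],    # experts[r][i]: expert chosen for token i
    weights: List[List[float]],  # weights[r][i]: routing weight for token i
    num_experts: int,
    expert_fn: Callable[[int, float], float],
) -> List[List[float]]:
    ep_size = len(tokens)
    per_rank = num_experts // ep_size  # experts owned by each rank

    # Stages 1-2: order each rank's tokens by expert, then bucket them by the
    # rank that owns that expert. send[src][dst] is what src sends to dst.
    send = [[[] for _ in range(ep_size)] for _ in range(ep_size)]
    for src in range(ep_size):
        order = sorted(range(len(tokens[src])), key=lambda i: experts[src][i])
        for i in order:
            e = experts[src][i]
            send[src][e // per_rank].append((i, e, tokens[src][i]))

    # All-to-all: rank dst receives send[src][dst] from every src.
    recv = [[send[src][dst] for src in range(ep_size)] for dst in range(ep_size)]

    # Stage 3: each rank applies its local experts to the received tokens.
    done = [[[(i, expert_fn(e, x)) for i, e, x in chunk] for chunk in recv[dst]]
            for dst in range(ep_size)]

    # Stage 4: send results back, restore the original order, apply weights.
    outputs = []
    for src in range(ep_size):
        out = [0.0] * len(tokens[src])
        for dst in range(ep_size):
            for i, y in done[dst][src]:
                out[i] = weights[src][i] * y
        outputs.append(out)
    return outputs


result = simulate_ep_forward(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    experts=[[3, 0], [1, 2]],
    weights=[[1.0, 1.0], [0.5, 0.5]],
    num_experts=4,
    expert_fn=lambda e, x: x * (e + 1),  # toy expert: scale by (id + 1)
)
```

Each token ends up back at its original position on its original rank, regardless of which rank computed it.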

Requirements

num_experts must be divisible by ep_size
torch.distributed must be initialized
MoE blocks must define a gate/router module and experts, either as an nn.ModuleList or as tensor-style (fused) gate_up_proj/down_proj parameters; both layouts are supported
Shared experts (e.g. DeepSeek MoE) are handled automatically unless ignore_shared_experts=True
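The divisibility requirement follows from splitting experts evenly. The sketch below shows one way such a split could look; `local_expert_ids` is a hypothetical helper, and the contiguous block assignment is an assumption rather than the documented behavior of shard_experts().

```python
from typing import List


def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> List[int]:
    """Return the expert indices one EP rank would hold under an even, contiguous split."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f'num_experts={num_experts} is not divisible by ep_size={ep_size}')
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))


# 8 experts over 2 EP ranks: each rank holds 4 experts.
layout = [local_expert_ids(8, 2, r) for r in range(2)]
```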

Sequence Parallel (SP)


Sequence Parallel splits long sequences across multiple GPUs along the sequence dimension, enabling training with sequence lengths that exceed single-GPU memory. Twinkle implements Ulysses-style sequence parallel with optional derived ring attention.

Overview

Concept	Description
SequenceParallelConfig	Configuration dataclass for SP
SequenceParallelStrategy	Strategy class that wraps SP lifecycle
SequenceParallel	Core implementation handling pad/split/gather

Configuration

from twinkle.model.transformers.strategy.sequence_parallel import SequenceParallelConfig

config = SequenceParallelConfig(
    enabled=True,        # Enable sequence parallel
    ulysses_size=None,   # Ulysses SP degree (auto-derived from DeviceMesh if None)
    gather_logits=True,  # Gather logits after forward for loss computation
)

Usage with DeviceMesh

SP is activated by setting ulysses_size in DeviceMesh.from_sizes():

from twinkle.utils import DeviceMesh

# 8 GPUs: 4-way Ulysses SP × 2-way data parallel
device_mesh = DeviceMesh.from_sizes(
    world_size=8,
    dp_size=2,
    ulysses_size=4,
)

How It Works

Pad — input sequences are padded to a length divisible by SP world size
Split — padded inputs are evenly split across SP ranks along the sequence dimension
Distributed Attention — FlashAttention2 is patched to perform Ulysses all-to-all communication before/after attention computation
Gather — after forward, logits are gathered back to full sequence length for loss computation
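The pad, split, and gather steps can be illustrated on a plain token list. Both helpers are hypothetical, the pad token id is arbitrary, and trimming the padding after the gather is an assumption made here to show the round trip; the distributed attention step is omitted.

```python
from typing import List, Tuple


def pad_and_split(seq: List[int], sp_size: int, pad_id: int = 0) -> Tuple[List[List[int]], int]:
    """Pad seq to a multiple of sp_size and split it into sp_size equal shards."""
    pad = (-len(seq)) % sp_size
    padded = seq + [pad_id] * pad
    chunk = len(padded) // sp_size
    shards = [padded[r * chunk:(r + 1) * chunk] for r in range(sp_size)]
    return shards, pad


def gather(shards: List[List[int]], pad: int) -> List[int]:
    """Concatenate shards in rank order and drop the trailing padding."""
    full = [x for shard in shards for x in shard]
    return full[:len(full) - pad]


# A 6-token sequence on 4 SP ranks is padded to 8 tokens, 2 per rank.
shards, pad = pad_and_split([11, 12, 13, 14, 15, 16], sp_size=4)
restored = gather(shards, pad)
```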

Supported Attention Backends

Backend	Status
FlashAttention2	Fully supported (including packed/padding-free sequences)
SDPA	Supported (non-packed batches only)
Derived Ring Attention	Supported with FlashAttention2 only (`rp_world_size > 1`)

Qwen3.5 Linear Attention

SP automatically detects Qwen3.5 GatedDeltaNet linear attention layers and applies the Qwen3_5GatedDeltaNetUlyssesPatch for correct sequence-parallel behavior on hybrid attention architectures.

MoE Auxiliary Loss

For MoE models, SP automatically installs a forward hook that gathers router logits across SP ranks before auxiliary loss computation, ensuring correct load-balancing signals.

Key Constraints

num_key_value_heads must be divisible by ulysses_size for Ulysses; otherwise, use the derived ring attention fallback
Packed/padding-free batches require FlashAttention2
Derived ring attention requires batch_size == 1 (packed format)
torch.distributed must be initialized
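The head-count constraint comes from how Ulysses works: the all-to-all before attention trades sequence sharding for head sharding, so every rank needs a whole number of heads. The hypothetical helper below only computes the per-rank (sequence, heads) shapes under that scheme; it performs no communication.

```python
from typing import Tuple

Shape = Tuple[int, int]  # (sequence length, number of KV heads) on one rank


def ulysses_shapes(seq_len: int, num_kv_heads: int, ulysses_size: int) -> Tuple[Shape, Shape]:
    """Return per-rank shapes outside and inside attention for Ulysses SP."""
    if seq_len % ulysses_size != 0:
        raise ValueError('pad seq_len to a multiple of ulysses_size first')
    if num_kv_heads % ulysses_size != 0:
        raise ValueError('num_key_value_heads must be divisible by ulysses_size')
    outside = (seq_len // ulysses_size, num_kv_heads)  # sharded along sequence
    inside = (seq_len, num_kv_heads // ulysses_size)   # sharded along heads
    return outside, inside


outside, inside = ulysses_shapes(seq_len=8192, num_kv_heads=8, ulysses_size=4)
```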

Padding-Free Training


Padding-free (also called “packing”) training eliminates wasted computation on padding tokens by concatenating multiple sequences into a single packed batch. Twinkle supports padding-free training for both standard attention and Qwen3.5’s GatedDeltaNet linear attention.

How It Works

Instead of padding all sequences to max_length, padding-free packs multiple sequences into one row and uses position_ids to track sequence boundaries. This avoids wasted FLOPs on padding tokens.

Standard: [tok tok tok PAD PAD PAD] [tok tok PAD PAD PAD PAD]
Packed: [tok tok tok tok tok ...] ← no padding waste

Usage

Padding-free is enabled via PackingDataset or IterablePackingDataset:

from twinkle.dataset import PackingDataset

dataset = PackingDataset(
    dataset=base_dataset,
    max_length=8192,
)

The dataset automatically packs sequences and generates correct position_ids with resets at sequence boundaries.
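To make the position_ids convention concrete, the sketch below packs two sequences by hand and recovers the sequence boundaries from the position resets. Both functions are hypothetical illustrations and do not reproduce PackingDataset's packing policy.

```python
from typing import List, Tuple


def pack(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate sequences into one row; position_ids restart at 0 per sequence."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
    return input_ids, position_ids


def cu_seqlens_from_position_ids(position_ids: List[int]) -> List[int]:
    """Recover cumulative sequence lengths: every 0 marks the start of a sequence."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]


input_ids, position_ids = pack([[101, 102, 103], [201, 202]])
cu_seqlens = cu_seqlens_from_position_ids(position_ids)
```

Attention kernels that support variable-length input can use boundaries like these to keep tokens from attending across sequences.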

GatedDeltaNet Patch (Qwen3.5)

Qwen3.5 uses a hybrid architecture mixing standard attention with GatedDeltaNet linear attention. The native GatedDeltaNet implementation does not reset its linear-attention state at packed sequence boundaries.

GatedDeltaNetPaddingFreePatch fixes this by:

Patching Qwen3_5DecoderLayer.forward to pass cu_seq_lens_q (cumulative sequence lengths) to linear attention layers
Patching Qwen3_5GatedDeltaNet.forward to use flash-linear-attention kernels (causal_conv1d, chunk_gated_delta_rule) with cu_seqlens support

The patch is applied automatically when padding-free is detected on Qwen3.5 models.
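Why a recurrent layer needs the boundaries can be seen with a toy recurrence. The running sum below is a deliberately simple stand-in for a linear-attention state, not the gated delta rule; `running_state` is a hypothetical function. Without boundaries, the state of the first sequence leaks into the second.

```python
from typing import List, Optional


def running_state(xs: List[float], cu_seqlens: Optional[List[int]] = None) -> List[float]:
    """Toy recurrence h_t = h_{t-1} + x_t, reset at each sequence start if given."""
    boundaries = set(cu_seqlens[:-1]) if cu_seqlens else {0}
    state, out = 0.0, []
    for t, x in enumerate(xs):
        if t in boundaries:
            state = 0.0  # start of a new packed sequence
        state += x
        out.append(state)
    return out


xs = [1.0] * 5  # two packed sequences of length 3 and 2
leaky = running_state(xs)                           # state crosses the boundary
isolated = running_state(xs, cu_seqlens=[0, 3, 5])  # state resets at index 3
```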

Requirements

flash-linear-attention package must be installed
Only needed for Qwen3.5 models with GatedDeltaNet layers
When sequence parallel is enabled, a separate Qwen3_5GatedDeltaNetUlyssesPatch is used instead

Attention Backend Requirements

Attention Backend	Padding-Free Support
FlashAttention2	Fully supported
SDPA	Supported (incompatible with sequence parallel)
Eager	Not supported

RemoteClass


All Twinkle components that can run in Ray or over HTTP are decorated with @remote_class and @remote_function. The @remote_class decorator intercepts class construction and, in Ray mode, moves construction onto workers.

from twinkle import remote_class, remote_function

@remote_class(execute='first')
class MyComponent:

    def __init__(self, **kwargs):
        ...

    @remote_function(dispatch='slice_dp', collect='first')
    def func(self, *args, **kwargs):
        ...
        return ...

This is all the code needed to run MyComponent on workers. The two decorators behave as follows:

remote_class: Marks the class for remote execution. The class is constructed in the current process instead if Twinkle was initialized in local mode, if no remote_group is passed to the constructor, or if remote_group is the current worker.
remote_function: Marks a method of a remote_class-decorated class as executable in Ray. Its inputs and outputs are compressed and transferred by Ray.

Calling MyComponent:

import twinkle
from twinkle import DeviceGroup

device_groups = [
    DeviceGroup(
        name='default',
        ranks=4,
        device_type='cuda',
    )
]

twinkle.initialize('ray', groups=device_groups)

_my_component = MyComponent(remote_group='default')
_my_component.func(...)

This defines MyComponent, creates a group named default with 4 GPUs in the Ray cluster, and constructs MyComponent in that group.

Parameters when remote_class decorates a class:

execute: Supports first/all. With first, the class is created only on the first device (index 0) of the group, which is typical for Dataset and DataLoader. With all, it is constructed on every device.

Parameters when remote_function decorates a method:

dispatch: How input data is distributed. Supports four modes: slice/all/slice_dp/function. slice splits list inputs evenly across workers (non-list inputs are sent in full to every worker); all sends the full input to every worker; slice_dp splits the input according to the dp groups of device_mesh so that each model replica receives the correct data. function lets you supply your own distribution logic:

def _dispatcher(length, i, args, kwargs, device_mesh):
    # length: number of workers; i: the current rank
    # args, kwargs: the input data of the call
    # device_mesh: the device_mesh of the target component
    # Put the distribution logic here; this placeholder passes the full input through.
    _args_rank, _kwargs_rank = args, kwargs
    return _args_rank, _kwargs_rank

execute: Supports first/all: run on the first worker only, or on all workers
collect: How returned data is collected. Supports none/flatten/mean/sum/first/last_pp/function
- none: Return the results without processing
- flatten: Flatten the data from all workers to mimic the return structure of single-worker execution
- mean/sum: Return the average or the sum
- first: Return only the result of the first worker. Typically used when all workers need the input but produce identical output
- last_pp: Return the result of the last pipeline stage, used for pp parallelism
- function: Use a custom collection function

def _collect(all_results: List, device_mesh):
    # device_mesh: the device_mesh of the target component
    return ...

sync: Whether to execute synchronously through Ray. Default is False
lazy_collect: Default is True. Results are not collected in the driver process; collection is deferred until a worker that needs them expands them. Methods whose results are needed in the driver and are cheap to transfer, such as loss or metrics, can set this to False
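A custom dispatcher and collector can be exercised locally before wiring them into remote_function. The sketch below follows the two signatures shown above, but everything else is an assumption: the round-robin policy, the function names, and the `simulate_call` loop, which merely imitates what the framework would do across workers.

```python
from typing import Any, Callable, Dict, List, Tuple


def round_robin_dispatcher(length: int, i: int, args: Tuple, kwargs: Dict,
                           device_mesh: Any = None) -> Tuple[Tuple, Dict]:
    """Give worker i every length-th element of the first positional argument."""
    batch = args[0]
    return (batch[i::length],) + tuple(args[1:]), dict(kwargs)


def interleave_collect(all_results: List[List[Any]], device_mesh: Any = None) -> List[Any]:
    """Undo the round-robin split by interleaving the per-worker results."""
    merged = []
    longest = max((len(r) for r in all_results), default=0)
    for pos in range(longest):
        for result in all_results:
            if pos < len(result):
                merged.append(result[pos])
    return merged


def simulate_call(fn: Callable, n_workers: int, *args, **kwargs) -> List[Any]:
    """Local stand-in for dispatch -> per-worker execution -> collect."""
    results = []
    for rank in range(n_workers):
        rank_args, rank_kwargs = round_robin_dispatcher(n_workers, rank, args, kwargs)
        results.append(fn(*rank_args, **rank_kwargs))
    return interleave_collect(results)


doubled = simulate_call(lambda batch: [x * 2 for x in batch], 3, [1, 2, 3, 4, 5, 6, 7])
```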

TwinkleClient


TwinkleClient is the Python client for interacting with the Twinkle REST API. It manages sessions, training runs, and checkpoints.

Initialization

from twinkle_client.manager import TwinkleClient

client = TwinkleClient(
    base_url='http://localhost:8000',    # Or TWINKLE_SERVER_URL env var
    api_key='your-api-key',              # Or TWINKLE_SERVER_TOKEN env var
    route_prefix='/twinkle',             # API route prefix
    session_heartbeat_interval=10,       # Heartbeat interval in seconds
    session_metadata={'user': 'alice'},  # Optional session metadata
)

On init, the client:

Stores base_url and api_key in a shared context (used by all client objects)
Creates a server-side session
Starts a background heartbeat thread to keep the session alive
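A background heartbeat of this kind is commonly built on a daemon thread and an event. The `Heartbeat` class below is a hypothetical, standalone sketch of that pattern using only the standard library; it is not TwinkleClient's actual implementation, and the `send` callback stands in for whatever request keeps the session alive.

```python
import threading
import time
from typing import Callable


class Heartbeat:
    """Call send() every `interval` seconds until close() is called."""

    def __init__(self, send: Callable[[], None], interval: float):
        self._send = send
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop fires once per
        # interval and exits promptly once close() sets the event.
        while not self._stop.wait(self._interval):
            self._send()

    def close(self) -> None:
        self._stop.set()
        self._thread.join()

    @property
    def alive(self) -> bool:
        return self._thread.is_alive()


beats = []
heartbeat = Heartbeat(send=lambda: beats.append(time.monotonic()), interval=0.01)
heartbeat.start()
time.sleep(0.2)
heartbeat.close()
```

Waiting on the event rather than sleeping lets close() interrupt the wait, so shutdown does not have to sit out a full interval.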

Health Check

is_healthy = client.health_check() # Returns True/False
capabilities = client.get_server_capabilities() # Supported models

Training Runs

# List runs
runs = client.list_training_runs(limit=20, offset=0)

# List with pagination cursor
runs, cursor = client.list_training_runs_with_cursor(limit=20)

# Get specific run
run = client.get_training_run(run_id='run_abc123')

# Find by base model
qwen_runs = client.find_training_run_by_model('Qwen/Qwen3.5-4B')
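To walk through every run with limit/offset paging, a small generator is usually enough. The sketch below is hypothetical and is tested against an in-memory list; it assumes each page is a plain list and that a page shorter than `limit` marks the end, neither of which is confirmed for list_training_runs.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar('T')


def iter_all(fetch_page: Callable[[int, int], List[T]], limit: int = 20) -> Iterator[T]:
    """Yield items page by page until a short (or empty) page is returned."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return
        offset += limit


# Stand-in for a server holding 45 runs. With a real client, fetch_page might be
# lambda limit, offset: client.list_training_runs(limit=limit, offset=offset)
# (an assumption about the return type).
fake_runs = [f'run_{i}' for i in range(45)]
all_runs = list(iter_all(lambda limit, offset: fake_runs[offset:offset + limit]))
```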

Checkpoints

# List checkpoints for a run
checkpoints = client.list_checkpoints(run_id='run_abc123')

# Get checkpoint path
parsed = client.get_checkpoint_path(run_id, checkpoint_id)
# parsed.path → filesystem path
# parsed.twinkle_path → twinkle:// URI

# Get latest checkpoint (useful for resume training)
latest_path = client.get_latest_checkpoint_path(run_id)

# Delete checkpoint
client.delete_checkpoint(run_id, checkpoint_id)

Capacity & Weights Info

# LoRA capacity
capacity = client.get_capacity_info()
# capacity.max_loras, capacity.used_loras, capacity.free_loras

# Weights metadata
info = client.get_weights_info('twinkle://run_id/weights/checkpoint')
# info.base_model, info.is_lora, info.lora_rank

Cleanup

client.close() # Stops heartbeat thread (also registered via atexit)