Usage Guide | Twinkle

Training Guide

✨ What is Twinkle?

A component library for large model training. Based on PyTorch, it is simpler, more flexible, and production-ready.

🧩 Loosely Coupled Architecture · Standardized Interfaces
🚀 Multiple Runtime Modes · torchrun / Ray / HTTP
🔌 Multi-Framework Compatible · Transformers / Megatron
👥 Multi-Tenant Support · Single Base Model Deployment

Twinkle Compatibility

Twinkle and ms-swift are both model training frameworks, but they have very different characteristics. Developers can choose based on their needs.

When to Choose Twinkle

If you are a beginner in large models and want to better understand model mechanisms and training methods
If you are a large model researcher who wants to customize models or training methods
If you are good at writing training loops and want to customize the training process
If you want to provide enterprise-level or commercial training platforms

When to Choose ms-swift

If you don’t care about the training process and just want to provide a dataset to complete training
If you need more model support and dataset varieties
If you need various types of training such as Embedding, Reranker, Classification
If you need other capabilities like inference, deployment, quantization
If you need new models to be trainable as soon as they are released; ms-swift guarantees day-0 support

Model Training and Twinkle

When you find that general-purpose large models cannot meet your needs, training becomes essential:

Make the model know you: Through self-cognition training, the model can answer questions like “Who are you?” and “Who is your developer?”, becoming an AI assistant exclusively yours.
Make the model understand your business: By fine-tuning with private data, the model can learn your industry terminology, business processes, and internal knowledge base, becoming a domain expert.
Make the model think your way: Through reinforcement learning (RL), you can define reward rules to guide the model in generating outputs that match your expected format, reasoning style, or values.
Make the model stronger: Distill capabilities from large models to smaller ones, or inject new knowledge through continued pre-training, enabling the model’s capabilities to continuously evolve.

After training is complete, you can deploy the model to your own servers, publish it to ModelScope/Hugging Face to share with the community, or deploy your service using deployment frameworks like vLLM.

Existing training frameworks can be roughly divided into three categories:

Low-level frameworks (e.g., native PyTorch): Highly flexible, but require developers to build infrastructure from scratch including distributed computing, data loading, checkpointing, etc., resulting in high development costs and long cycles.
High-level frameworks (e.g., ms-swift, transformers Trainer): Ready to use out of the box—just provide the dataset and configuration to complete training—but the training process is a black box, making it difficult to customize algorithm details.
Heavy-duty frameworks (e.g., Megatron-LM): Designed for ultra-large-scale models with support for complex parallelism strategies, but have a steep learning curve and highly invasive code requirements.

Twinkle’s design goal is to find a balance among these three types of frameworks:

Retain control over the training loop: Developers can clearly see and control every step of forward, backward, and step, making it easy to debug and customize algorithms.
Provide highly cohesive component abstractions: Components like Dataset, Model, Sampler, and Loss each have their own responsibilities and can be used independently or in combination, without requiring full integration.
Hide distributed complexity: Whether using a single GPU, torchrun, or a Ray cluster, the training code remains almost identical—only the initialization parameters need to be modified.
Support production-grade deployment: Built-in capabilities for multi-tenancy, HTTP services, weight synchronization, and more, ready for building enterprise-level training platforms.

Usage Patterns

Using Only Partial Components

Developers can use only a portion of Twinkle’s components, combining them with their own existing code to complete training work. For example, using only Dataset & DataLoader:

from twinkle.dataset import PackingDataset, DatasetMeta
from twinkle.dataloader import DataLoader
from twinkle.preprocessor import SelfCognitionProcessor

def train():
    dataset_meta = DatasetMeta(
        dataset_id='ms://swift/self-cognition',
    )

    dataset = PackingDataset(dataset_meta)
    dataset.map(SelfCognitionProcessor(model_name='Twinkle Model', model_author='ModelScope Community'))
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
    dataset.encode()
    dataset.pack_dataset()

    dataloader = DataLoader(dataset, batch_size=8)
    for data in dataloader:
        print(data)
        """
        {
            "input_ids": [...],
            "position_ids": [...],
            ...
        }
        """
        break

if __name__ == '__main__':
    train()

In the code above, we use PackingDataset to load the swift/self-cognition dataset. PackingDataset bin-packs samples so that each packed sequence is close to the configured maximum length (max_length=512 here). The loop only prints the first batch; in actual use, you would replace the print with your own training code.
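To make the bin-packing idea concrete, here is a minimal sketch in plain Python. It uses a first-fit-decreasing heuristic over sequence lengths. The function name `pack_lengths` is hypothetical, and this is not necessarily the algorithm PackingDataset uses internally.

```python
from typing import List


def pack_lengths(lengths: List[int], max_length: int) -> List[List[int]]:
    """Group sample indices into bins whose total length stays within max_length.

    First-fit-decreasing: visit samples from longest to shortest and place
    each one in the first bin that still has room.
    """
    bins: List[List[int]] = []  # sample indices per bin
    remaining: List[int] = []   # free space per bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        size = lengths[idx]
        if size > max_length:
            raise ValueError(f'sample {idx} is longer than max_length')
        for b, free in enumerate(remaining):
            if size <= free:
                bins[b].append(idx)
                remaining[b] -= size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([idx])
            remaining.append(max_length - size)
    return bins


packed = pack_lengths([300, 200, 120, 400, 90], max_length=512)
print(packed)  # [[3, 4], [0, 1], [2]]
```

Each inner list holds the indices of samples that would be concatenated into one packed sequence, so most packed sequences end up close to max_length and little padding is wasted.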

All of Twinkle’s components support being used separately. Please refer to the component list in the sections below.

Single GPU

Twinkle supports running training on a single GPU. Here is an example:

from peft import LoraConfig

from twinkle import get_device_placement, get_logger
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor

logger = get_logger()


def train():
    # 1000 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
    # Set template to prepare encoding
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    # Preprocess the dataset to standard format
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    # Encode dataset
    dataset.encode()
    # Batch size = 8, all on the single GPU
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    # Use a TransformersModel
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')

    lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')

    # Add a lora to model, with name `default`
    # Comment this to use full-parameter training
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Add Optimizer for lora `default`
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    # Add LRScheduler for lora `default`
    model.set_lr_scheduler(
        scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader))
    logger.info(get_device_placement())
    # Print the training config
    logger.info(model.get_train_configs())
    logger.info(f'Total steps: {len(dataloader)}')
    for step, batch in enumerate(dataloader):
        # Do forward and backward
        model.forward_backward(inputs=batch)
        # Step
        model.clip_grad_and_step()
        if step % 20 == 0:
            # Print metric
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
    model.save('last-checkpoint')


if __name__ == '__main__':
    train()

In this training code, we build a dataset, load the Qwen/Qwen3.5-4B model, attach a LoRA adapter to all linear layers (target_modules='all-linear'), and run one pass over the data. In the logs, you can watch the loss gradually converge.

Tip — Full-Parameter Training: The example above uses LoRA for efficiency. To switch to full-parameter training, simply remove the add_adapter_to_model call (and the from peft import LoraConfig import). Everything else stays the same.
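The example above configures CosineWarmupScheduler with num_warmup_steps=5 and num_training_steps=len(dataloader). As a rough illustration of what such a schedule looks like, the sketch below assumes linear warmup followed by cosine decay to zero. The helper `cosine_warmup_lr` is hypothetical, and Twinkle's scheduler may differ in details such as the final learning rate or how the first step is counted.

```python
import math


def cosine_warmup_lr(step: int, base_lr: float, num_warmup_steps: int, num_training_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < num_warmup_steps:
        # Ramp up linearly and reach base_lr on the last warmup step.
        return base_lr * (step + 1) / num_warmup_steps
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 1000 samples with batch_size=8 give 125 batches (assuming no batch is dropped).
total_steps = 1000 // 8
for step in (0, 4, 5, 65, 124):
    print(step, cosine_warmup_lr(step, 1e-4, 5, total_steps))
```

The learning rate climbs to 1e-4 over the first five steps, passes through half of that value at the midpoint of the decay phase (step 65), and approaches zero at the end of training.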

torchrun

Twinkle supports running training in torchrun mode. In this scenario, Ray-related dependencies do not need to be installed.

from peft import LoraConfig

import twinkle
from twinkle import DeviceMesh, get_device_placement, get_logger
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor

# Construct a device_mesh, fsdp=4, dp=2
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
# use torchrun mode
twinkle.initialize(mode='local', global_device_mesh=device_mesh)

logger = get_logger()


def train():
    # 1000 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
    # Set template to prepare encoding
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    # Preprocess the dataset to standard format
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    # Encode dataset
    dataset.encode()
    # Global batch size = 8, for 8 GPUs, so 1 sample per GPU
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    # Use a TransformersModel
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')

    lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')

    # Add a lora to model, with name `default`
    # Comment this to use full-parameter training
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Add Optimizer for lora `default`
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    # Add LRScheduler for lora `default`
    model.set_lr_scheduler(
        scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader))
    logger.info(get_device_placement())
    # Print the training config
    logger.info(model.get_train_configs())
    logger.info(f'Total steps: {len(dataloader)}')
    for step, batch in enumerate(dataloader):
        # Do forward and backward
        model.forward_backward(inputs=batch)
        # Step
        model.clip_grad_and_step()
        if step % 20 == 0:
            # Print metric
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
    model.save('last-checkpoint')


if __name__ == '__main__':
    train()

In the code above, we constructed a hybrid parallel mode combining FSDP2 and DP, and used 8 GPUs for training. You can see that it is basically the same as the single-GPU training code, except that DeviceMesh is used to declare the model layout.
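To visualize what fsdp_size=4 and dp_size=2 mean for the eight processes, the sketch below maps each global rank to a position in a 2 x 4 grid. It assumes a row-major layout with dp as the outer dimension; the helper `mesh_layout` is hypothetical, and the rank ordering Twinkle's DeviceMesh actually uses may differ.

```python
from typing import Dict, Tuple


def mesh_layout(dp_size: int, fsdp_size: int) -> Dict[int, Tuple[int, int]]:
    """Map each global rank to (dp_index, fsdp_index), row-major with dp outermost."""
    return {rank: divmod(rank, fsdp_size) for rank in range(dp_size * fsdp_size)}


layout = mesh_layout(dp_size=2, fsdp_size=4)
world_size = len(layout)  # 8, which must match the number of launched processes
for rank, (dp_idx, fsdp_idx) in layout.items():
    print(f'rank {rank}: dp={dp_idx}, fsdp={fsdp_idx}')
```

Under this assumption, ranks 0 to 3 form one FSDP group and ranks 4 to 7 form the other. The product of the two sizes is 8, which is why the launch command below starts eight processes.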

When running, you need to launch training like this:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 train.py

Resume from Checkpoint

The training loops above can be extended to support checkpoint resumption. For a complete example, refer to cookbook/transformers/fsdp2.py.

Saving a Checkpoint

model.save(
    checkpoint_name,
    output_dir='./output/fsdp2',
    adapter_name=ADAPTER_NAME,
    save_optimizer=True,  # Store optimizer state
    consumed_train_samples=dataloader.get_state()['consumed_train_samples'],  # Persist training progress
)

DataLoader automatically tracks consumed samples internally — call dataloader.get_state() to retrieve the current count.

Resuming Training

from pathlib import Path

RESUME_FROM_CHECKPOINT = './output/fsdp2/last-checkpoint'
RESUME_ONLY_MODEL = False # True: weights only, skip optimizer/scheduler restoration
IGNORE_DATA_SKIP = False # True: do not skip consumed samples from trainer_state.json

if RESUME_FROM_CHECKPOINT:
    checkpoint_path = str(Path(RESUME_FROM_CHECKPOINT).expanduser().resolve())
    progress = model.resume_from_checkpoint(checkpoint_path, resume_only_model=RESUME_ONLY_MODEL)
    if not IGNORE_DATA_SKIP:
        dataloader.resume_from_checkpoint(progress['consumed_train_samples'])

How the two flags combine:

`RESUME_ONLY_MODEL`	`IGNORE_DATA_SKIP`	Effect
`False` (default)	`False` (default)	Full resume: restore weights + optimizer + scheduler + RNG, skip consumed data
`False`	`True`	Restore weights + optimizer + scheduler + RNG, but restart the dataset from the beginning
`True`	`False`	Weights only (optimizer and scheduler start fresh), but still skip consumed data
`True`	`True`	Weights only, restart the dataset from the beginning
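The flag combinations above can be summarized in a few lines of plain Python. This is only an illustration of the decision logic described in this section: `resume_plan` and `skip_consumed` are hypothetical helpers, not part of Twinkle, and the real data loader skips samples internally.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def resume_plan(resume_only_model: bool, ignore_data_skip: bool) -> Dict[str, bool]:
    """Describe what a resume would restore for a given pair of flags."""
    return {
        'restore_weights': True,
        'restore_optimizer_and_scheduler': not resume_only_model,
        'skip_consumed_samples': not ignore_data_skip,
    }


def skip_consumed(samples: Iterable, consumed_train_samples: int) -> Iterator:
    """Yield samples starting after the ones already consumed."""
    return islice(samples, consumed_train_samples, None)


print(resume_plan(resume_only_model=False, ignore_data_skip=False))
print(list(skip_consumed(range(10), consumed_train_samples=4)))  # [4, 5, 6, 7, 8, 9]
```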

LoRA / Adapter vs Full-Parameter Training

The flow above uses LoRA as the default example. For full-parameter training, the only difference is in TransformersModel initialization — use the checkpoint path as model_id instead of the base model ID:

# LoRA / adapter: base model loaded from hub, checkpoint contains only adapter weights + training state
model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
progress = model.resume_from_checkpoint(resume_path)

# Full-parameter: model weights are saved entirely in the checkpoint — use it directly as model_id
model = TransformersModel(model_id=resume_path)
progress = model.resume_from_checkpoint(resume_path)

All subsequent calls to resume_from_checkpoint and dataloader.resume_from_checkpoint are identical in both cases.

Ray Training

Ray is a commonly used scheduling middleware framework for multi-machine model training and inference scenarios. It provides additional optimizations for multi-model, multi-device execution and resource management, and supports integration with Kubernetes systems for production deployment. These characteristics make it particularly suitable for complex training scenarios such as RL and GKD.

Twinkle supports using Ray for training and sampling, and its code is almost identical to the training API above:

import os
from typing import List, Tuple, Dict, Any
from peft import LoraConfig
import twinkle
from twinkle import DeviceMesh, DeviceGroup, get_device_placement
from twinkle.advantage import GRPOAdvantage
from twinkle.checkpoint_engine import CheckpointEngineManager
from twinkle.data_format import SamplingParams
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model.megatron import MegatronModel
from twinkle.metric import CompletionRewardMetric
from twinkle.preprocessor.llm import GSM8KProcessor
from twinkle.processor import InputProcessor
from twinkle.reward import GSM8KAccuracyReward, GSM8KFormatReward
from twinkle.sampler import vLLMSampler
from twinkle.template import Template

MODEL_ID = os.environ.get('MODEL_ID', 'ms://Qwen/Qwen3.5-4B')
MODEL_GPUS = int(os.environ.get('MODEL_GPUS', 4))
SAMPLER_GPUS = int(os.environ.get('SAMPLER_GPUS',4))
NUM_GPUS = MODEL_GPUS + SAMPLER_GPUS
NUM_GENERATIONS = int(os.environ.get('NUM_GENERATIONS', 8))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 4096))
LEARNING_RATE = float(os.environ.get('LR', 1e-5))
MAX_STEPS = int(os.environ.get('MAX_STEPS', 200))
BATCH_SIZE = int(os.environ.get('BATCH_SIZE', 16)) # global prompt-level, global completion-level batch size = BATCH_SIZE * num_generations * dp_size
MINI_BATCH_SIZE = int(os.environ.get('MINI_BATCH_SIZE', 16)) # global completion-level mini-batch-size
MICRO_BATCH_SIZE = int(os.environ.get('MICRO_BATCH_SIZE', 2)) # per-device-micro-batch-size (completion-level), batch_size in forward_backward
GRADIENT_ACCUMULATION_STEPS = int(os.environ.get('GRADIENT_ACCUMULATION_STEPS', 1))
ADAPTER_NAME = 'default'

def create_gsm8k_dataset():
    dataset = Dataset(DatasetMeta('ms://modelscope/gsm8k', subset_name='main', split='train'))
    dataset.set_template('Qwen3_5Template', model_id=MODEL_ID, max_length=2048)
    dataset.map(GSM8KProcessor())
    dataset.encode(add_generation_prompt=True)
    return dataset

def compute_rewards(
    trajectories: List[Dict[str, Any]],
) -> Tuple[List[float], List[float], List[float]]:
    accuracy_reward_fn = GSM8KAccuracyReward()
    format_reward_fn = GSM8KFormatReward()
    accuracy_rewards = accuracy_reward_fn(trajectories)
    format_rewards = format_reward_fn(trajectories)
    total_rewards = [a + f for a, f in zip(accuracy_rewards, format_rewards)]
    return total_rewards, format_rewards, accuracy_rewards

def main():
    # Put the sampler and the model in separate device groups so they use different GPUs
    device_groups = [
        DeviceGroup(name='model', ranks=list(range(MODEL_GPUS)), device_type='GPU'),
        DeviceGroup(name='sampler', ranks=list(range(MODEL_GPUS, NUM_GPUS)), device_type='GPU'),
    ]
    model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS)
    sampler_mesh = DeviceMesh.from_sizes(world_size=SAMPLER_GPUS, dp_size=SAMPLER_GPUS)
    twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False)

    lora_config = LoraConfig(target_modules='all-linear', r=32, lora_alpha=64, lora_dropout=0.05)
    model = MegatronModel(model_id=MODEL_ID, device_mesh=model_mesh, remote_group='model', mixed_precision='bf16')
    model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
    model.set_optimizer('default', lr=LEARNING_RATE)
    model.set_lr_scheduler('default', lr_decay_steps=MAX_STEPS, max_lr=LEARNING_RATE)
    model.set_loss('GRPOLoss', epsilon=0.2)
    model.set_processor(InputProcessor)
    model.set_template('Qwen3_5Template', model_id=MODEL_ID)

    sampler = vLLMSampler(
        model_id=MODEL_ID,
        engine_args={
            'gpu_memory_utilization': 0.8,
            'max_model_len': 4096,
            'max_lora_rank': 32,  # same as lora_config
            'enable_lora': True,
        },
        device_mesh=sampler_mesh,
        remote_group='sampler',
    )
    sampler.set_template('Qwen3_5Template', model_id=MODEL_ID)
    ckpt_manager = CheckpointEngineManager(model=model, sampler=sampler)
    dataloader = DataLoader(
        dataset=create_gsm8k_dataset,
        batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS,
        min_batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS,
        device_mesh=model_mesh,
        remote_group='model',
    )
    advantage_fn = GRPOAdvantage()
    metrics = CompletionRewardMetric()
    sampling_params = SamplingParams(max_tokens=MAX_NEW_TOKENS, num_samples=1, logprobs=1)
    optim_step = 0
    print(get_device_placement())

    for batch in dataloader:
        if optim_step >= MAX_STEPS:
            break
        metrics.reset()
        global_prompts = batch if isinstance(batch, list) else [batch]
        # Push the latest trained weights to the sampler before sampling
        ckpt_manager.sync_weights(merge_and_sync=False)
        sampler.reset_prefix_cache()
        sample_responses = sampler.sample(
            global_prompts*NUM_GENERATIONS,
            sampling_params,
        )
        all_input_data: List[Dict[str, Any]] = []
        all_old_logps: List[List[float]] = []
        all_completion_lengths: List[int] = []

        for sample_response in sample_responses:
            for sequence in sample_response.sequences:
                all_input_data.append(sequence.new_input_feature)
                all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
                all_completion_lengths.append(len(sequence.tokens))
        total_rewards, format_rewards, accuracy_rewards = compute_rewards(
            all_input_data
        )
        metrics.accumulate(
            completion_lengths=all_completion_lengths,
            rewards={
                'total': total_rewards,
                'format': format_rewards,
                'accuracy': accuracy_rewards,
            },
        )
        advantages = advantage_fn(total_rewards, num_generations=NUM_GENERATIONS, scale='group').tolist()
        # Split completions into mini-batches and run one optim step per mini-batch.
        total_completions = len(all_input_data)
        for mb_start in range(0, total_completions, MINI_BATCH_SIZE):
            mb_end = min(mb_start + MINI_BATCH_SIZE, total_completions)
            mb_inputs = all_input_data[mb_start:mb_end]
            mb_old_logps = all_old_logps[mb_start:mb_end]
            mb_advantages = advantages[mb_start:mb_end]

            model.forward_backward(
                inputs=mb_inputs,
                old_logps=mb_old_logps,
                advantages=mb_advantages,
                micro_batch_size=MICRO_BATCH_SIZE,
            )
            model.clip_grad_and_step()
            optim_step += 1

            if optim_step >= MAX_STEPS:
                break
        log_dict = metrics.calculate()
        log_dict.update(model.calculate_metric(is_training=True))
        metrics.reset()
        print(f'[Step {optim_step}/{MAX_STEPS}] {log_dict}')

    print(f'Training completed. optim_steps={optim_step}')
    model.save('grpo-gsm8k-checkpoint')

if __name__ == '__main__':
    main()

The code above is an RL training example. It shows how the data is constructed, how the sampler and model are declared and parameterized, and how the advantages and loss are built. Nothing in this process refers to Ray explicitly. Ray mode is declared only once, during initialization:

twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False)

Developers can customize the construction and invocation methods of components like models. All Transformers and Megatron model parameters can be passed in when constructing the model.

All subsequent Ray calls and data distribution are performed implicitly. Running this script requires having Ray installed beforehand. Then run it like this:

python train.py
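The advantage step in the GRPO example can be pictured with a small standalone sketch. It assumes that the completions for one prompt are stored consecutively and that each reward is normalized by the mean and population standard deviation of its group. The function `group_advantages` is hypothetical; GRPOAdvantage may order groups or scale rewards differently.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], num_generations: int, eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other completions of the same prompt."""
    if len(rewards) % num_generations != 0:
        raise ValueError('rewards must contain a whole number of groups')
    advantages: List[float] = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        # eps avoids division by zero when every completion got the same reward.
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts with four completions each.
rewards = [1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0]
print(group_advantages(rewards, num_generations=4))
```

In the second group every completion received the same reward, so all of its advantages are zero and it contributes no learning signal. This is the situation the client-side example later in this guide detects before skipping a training step.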

Remote Training

A major feature of Twinkle is support for multi-tenant mixed training. Specifically, multiple users can use a single base model for LoRA training, which can greatly reduce server-side deployment costs.

Checkpoint resumption is also supported in client-server training. The recommended flow is to call model.resume_from_checkpoint(resume_path) to restore weights and optimizer state, then call dataloader.resume_from_checkpoint(progress['consumed_train_samples']) to skip consumed data.

Suppose we start a service using four GPUs. First, we need to start the Ray cluster:

CUDA_VISIBLE_DEVICES=0,1 ray start --head --port=6379 --num-gpus=2
CUDA_VISIBLE_DEVICES=2,3 ray start --address=127.0.0.1:6379 --num-gpus=2
CUDA_VISIBLE_DEVICES="" ray start --address=127.0.0.1:6379 --num-gpus=0

We started a Ray cluster containing three nodes:

GPUs 0 and 1 as one node
GPUs 2 and 3 as one node
CPU resources as one node

For production environments, you can start more nodes and deploy more replicas to accommodate larger user volumes. Here we only use four GPUs as an example.

Next, start the server:

twinkle-server launch -c cookbook/client/server/transformer/server_config.yaml

For details on how to write server_config.yaml, see the server configuration documentation.

The server will start three services: a sampler cluster, a model cluster, and a utility cluster.

Now you can perform client-side training:

import dotenv
dotenv.load_dotenv('.env')
import re
from twinkle.data_format import Trajectory
from twinkle.reward.base import Reward
import gc
from peft import LoraConfig
from typing import List, Tuple

from twinkle import get_logger
from twinkle.advantage import GRPOAdvantage
from twinkle.dataset import DatasetMeta
from twinkle.metric import CompletionRewardMetric
from twinkle_client import init_twinkle_client
from twinkle_client.dataloader import DataLoader
from twinkle_client.dataset import Dataset
from twinkle_client.model import MultiLoraTransformersModel
from twinkle_client.sampler import vLLMSampler

logger = get_logger()

# ========== Configuration ==========
MODEL_ID = 'ms://Qwen/Qwen3.5-4B'
NUM_GENERATIONS = 4
MAX_NEW_TOKENS = 1024
LEARNING_RATE = 1e-5
MAX_STEPS = 10
BATCH_SIZE = 2
TEMPERATURE = 1.0
SYNC_INTERVAL = 1 # Save weights for sampler every N steps
GRADIENT_ACCUMULATION_STEPS = 4


def create_countdown_dataset():
    """Create Countdown Game dataset for GRPO training."""

    dataset = Dataset(dataset_meta=DatasetMeta('ms://zouxuhong/Countdown-Tasks-3to4', data_slice=range(500)))
    dataset.set_template('Qwen3_5Template', model_id=MODEL_ID, max_length=8192)
    dataset.map('CountdownProcessor')
    dataset.encode(add_generation_prompt=True, batched=True)
    return dataset


class CountDownAccuracy(Reward):

    @staticmethod
    def countdown_accuracy_reward(completion: str, target: int, nums: List[int]) -> float:
        """Accuracy reward: checks if equation is correct."""
        try:
            match = re.search(r'<answer>(.*?)<\/answer>', completion)
            if match is None:
                return 0.0
            equation = match.group(1).strip()
            if '=' in equation:
                equation = equation.split('=')[0]
            # Every provided number must be used exactly once
            used_numbers = [int(n) for n in re.findall(r'\d+', equation)]
            if sorted(used_numbers) != sorted(nums):
                return 0.0
            # Only digits, arithmetic operators, parentheses, and whitespace are allowed
            if not re.match(r'^[\d+\-*/().\s]+$', equation):
                return 0.0
            result = eval(equation, {'__builtins__': None}, {})
            return 1.0 if abs(float(result) - float(target)) < 1e-5 else 0.0
        except Exception:  # noqa
            return 0.0

    def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]):
        rewards = []
        for trajectory in trajectories:
            messages = trajectory.get('messages', [])
            completion = ''
            # Use the last assistant message as the completion
            for msg in reversed(messages):
                if msg.get('role') == 'assistant':
                    completion = msg.get('content', '')
                    break
            user_data = trajectory.get('user_data', [{}])
            data = user_data[0] if isinstance(user_data, list) and user_data else {}
            target = data.get('target', 0)
            nums = data.get('nums', [])
            acc_reward = self.countdown_accuracy_reward(completion, target, nums)
            rewards.append(acc_reward)
        return rewards


def compute_rewards(trajectories: List[dict]) -> Tuple[List[float], List[float], List[float]]:
    """Compute format and accuracy rewards for Countdown game."""
    from twinkle.reward import FormatReward
    format_rewards = FormatReward()(trajectories, [])
    accuracy_rewards = CountDownAccuracy()(trajectories, [])
    total_rewards = [a + b for a, b in zip(accuracy_rewards, format_rewards)]
    return total_rewards, format_rewards, accuracy_rewards


def train():
    # Step 1: Initialize the Twinkle client
    client = init_twinkle_client(
        base_url='http://localhost:8000',
        api_key='',
    )

    # Step 2: Prepare dataset and dataloader
    dataset = create_countdown_dataset()
    dataloader = DataLoader(dataset=dataset, batch_size=BATCH_SIZE)

    # Step 3: Configure the training model
    model = MultiLoraTransformersModel(model_id=MODEL_ID)

    lora_config = LoraConfig(
        target_modules='all-linear',
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
    )
    model.add_adapter_to_model(
        'default',
        lora_config,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    )

    # Set GRPO loss (the key difference from SFT training)
    model.set_loss('GRPOLoss', epsilon=0.2, beta=0.0)

    # Set optimizer and LR scheduler
    model.set_optimizer('AdamW', lr=LEARNING_RATE)
    model.set_lr_scheduler(
        'CosineWarmupScheduler',
        num_warmup_steps=500,
        num_training_steps=MAX_STEPS,
    )

    # Set processor and template for encoding inputs
    model.set_processor('InputProcessor')
    model.set_template('Qwen3_5Template', model_id=MODEL_ID)

    # Step 4: Configure the sampler
    sampler = vLLMSampler(model_id=MODEL_ID)
    sampler.set_template('Qwen3_5Template', model_id=MODEL_ID)

    # Step 5: Setup metrics and advantage function
    advantage_fn = GRPOAdvantage()
    metrics = CompletionRewardMetric()

    sampling_params = {
        'max_tokens': MAX_NEW_TOKENS,
        'temperature': TEMPERATURE,
        'top_p': 0.95,
    }

    # Track the current adapter path for sampling
    current_adapter_uri = None

    step = 0
    for batch in dataloader:
        if step >= MAX_STEPS:
            break

        metrics.reset()
        prompts = batch if isinstance(batch, list) else [batch]

        # ========== 1. Save weights and update adapter_uri ==========
        # Instead of sync_weights, save the model checkpoint and pass
        # the resulting path to the sampler as adapter_uri
        if step % SYNC_INTERVAL == 0:
            logger.info(f'Step {step}: Saving weights for sampler...')
            twinkle_path = model.save(
                name=f'grpo-sampler-step-{step}',
                save_optimizer=False,
            )
            current_adapter_uri = twinkle_path
            logger.info(f'Step {step}: Saved weights to {current_adapter_uri}')

        # ========== 2. Sample completions ==========
        sample_response = sampler.sample(
            inputs=prompts,
            sampling_params=sampling_params,
            adapter_uri=current_adapter_uri,
            num_samples=NUM_GENERATIONS,
        )

        input_features = []
        old_logps_list = []
        completion_lengths = []

        sequences = sample_response.get('sequences', [])
        for seq in sequences:
            input_features.append(seq.get('new_input_feature', seq))
            old_logps_list.append(seq.get('logprobs', []))
            completion_lengths.append(len(seq.get('tokens', [])))

        if not input_features:
            logger.warning(f'Step {step}: No valid samples, skipping')
            step += 1
            continue

        # ========== 3. Compute rewards ==========
        total_rewards, format_rewards, accuracy_rewards = compute_rewards(input_features)
        metrics.accumulate(
            None,
            None,
            completion_lengths=completion_lengths,
            rewards={
                'total': total_rewards,
                'format': format_rewards,
                'accuracy': accuracy_rewards,
            })

        # ========== 4. Compute advantages ==========
        advantages = advantage_fn(
            total_rewards,
            num_generations=NUM_GENERATIONS,
            scale='group',
        ).tolist()

        frac_zero_std = (1.0 if all(abs(a) < 1e-8 for a in advantages) else 0.0)
        if frac_zero_std == 1.0:
            logger.info(f'Step {step}: All advantages are zero, skipping training')
            step += 1
            continue

        # ========== 5. Training step (GRPO) ==========
        # forward_backward with GRPO loss: passes advantages and old_logps
        # to the server-side GRPOLoss for proper policy optimization
        model.forward_backward(
            inputs=input_features,
            advantages=advantages,
            old_logps=old_logps_list,
        )

        # Gradient clipping and optimizer step
        model.clip_grad_norm(1.0)
        model.step()
        model.zero_grad()
        model.lr_step()

        gc.collect()

        # ========== 6. Log ==========
        log_dict = metrics.calculate()
        log_dict.update(model.calculate_metric())
        log_dict['train/frac_reward_zero_std'] = frac_zero_std
        logger.info(f'Step {step}: {log_dict}')
        step += 1

    # Save final checkpoint
    twinkle_path = model.save(name='grpo-countdown-final', save_optimizer=True)
    logger.info(f'Saved final checkpoint: {twinkle_path}')


if __name__ == '__main__':
    train()

Multiple developers can use a single base model from this service for parallel training and sampling. Furthermore, the training methods they use are allowed to differ. For example, User A can perform SFT, User B can perform RL, and User C can perform sampling. Similarly, Twinkle also supports Tinker-like APIs for remote training:

from tinker import types
from tqdm import tqdm
from tinker import ServiceClient
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor
from twinkle.server.common import input_feature_to_datum

# The base model to fine-tune / evaluate
base_model = 'ms://Qwen/Qwen3.5-4B'


def train():
    # Step 1: Prepare the dataset

    # Load the self-cognition dataset from ModelScope (first 500 examples)
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))

    # Apply the chat template matching the base model (max 256 tokens per sample)
    # base_model already carries the ms:// prefix, so it is passed through unchanged
    dataset.set_template('Qwen3_5Template', model_id=base_model, max_length=256)

    # Replace placeholder names with custom model/author identity
    dataset.map(SelfCognitionProcessor('twinkle model', 'twinkle team'), load_from_cache_file=False)

    # Tokenize and encode the dataset into model-ready input features
    dataset.encode(batched=True, load_from_cache_file=False)

    # Wrap the dataset into a DataLoader that yields batches of size 8
    dataloader = DataLoader(dataset=dataset, batch_size=8)

    # Step 2: Initialize the training client
    # Connect to the Twinkle server running locally
    service_client = ServiceClient(base_url='http://localhost:8000', api_key='your-api-key')
    # Create a LoRA training client for the base model (rank=16 for the LoRA adapter)
    training_client = service_client.create_lora_training_client(base_model=base_model, rank=16)

    # Step 3: Run the training loop
    for epoch in range(3):
        print(f'Epoch {epoch}')
        for step, batch in tqdm(enumerate(dataloader)):
            # Convert each InputFeature into a Datum for the Tinker API
            input_datum = [input_feature_to_datum(input_feature) for input_feature in batch]

            # Send data to server: forward + backward pass (computes gradients)
            fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy')

            # Optimizer step: update model weights with Adam
            optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

            # Wait for both operations to complete
            fwdbwd_future.result()
            optim_result = optim_future.result()
            print(f'Training Metrics: {optim_result}')

        # Save a checkpoint after each epoch
        save_future = training_client.save_state(f'twinkle-lora-{epoch}')
        save_result = save_future.result()
        print(f'Saved checkpoint to {save_result.path}')


if __name__ == '__main__':
    train()
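The loop above submits forward_backward and optim_step back to back and only then waits on the returned futures. The toy sketch below illustrates that submit-then-wait pattern with Python's standard concurrent.futures. FakeTrainingClient is a hypothetical stand-in that runs requests in submission order on one worker thread; it is not the Tinker client and makes no claim about how the real service schedules requests.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


class FakeTrainingClient:
    """Hypothetical stand-in for a remote training client (not the Tinker API)."""

    def __init__(self) -> None:
        # One worker thread executes requests in the order they were submitted.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.log: List[str] = []

    def _run(self, name: str) -> str:
        time.sleep(0.01)  # pretend to do remote work
        self.log.append(name)
        return name

    def forward_backward(self) -> Future:
        return self._pool.submit(self._run, 'forward_backward')

    def optim_step(self) -> Future:
        return self._pool.submit(self._run, 'optim_step')

    def close(self) -> None:
        self._pool.shutdown(wait=True)


client = FakeTrainingClient()
# Queue both requests first, then block on the results.
fwdbwd_future = client.forward_backward()
optim_future = client.optim_step()
fwdbwd_future.result()
optim_result = optim_future.result()
client.close()
print(client.log)  # ['forward_backward', 'optim_step']
```

Queuing both calls before waiting means the second request is already in flight while the first is being processed, so the client does not pay one extra round trip per step.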

Using ModelScope Community’s TaaS Training Service

Alongside the open-source release of the Twinkle framework, we also provide a hosted Training as a Service (TaaS) powered by ModelScope's backend services. Developers can try Twinkle's training API for free through this service. It uses the same client code as the Tinker API example above; the only difference is that the endpoint and token must be the official ModelScope values. For details, refer to the documentation for the official service.

Twinkle provides a sampling API that can be used to control the sampling process more flexibly for result validation, or to participate in the sampling workflow of RL algorithms.

For complete examples of all supported training modes, please refer to the directory.

Using Hugging Face Models

To load models from Hugging Face instead of ModelScope, simply switch the prefix:

ms://Qwen/Qwen3.5-4B -> hf://Qwen/Qwen3.5-4B

All components that accept a model_id parameter support this prefix-based routing.
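
As a rough illustration of what prefix-based routing means, the sketch below splits a model_id into a hub name and a repository path. The helper `split_hub_prefix` and the hub labels are hypothetical names invented for this guide; this is not Twinkle's actual implementation, and how Twinkle treats an ID without a prefix is not covered here.

```python
from typing import Optional, Tuple

# Hypothetical mapping from URI-style prefix to hub name (illustration only).
HUB_PREFIXES = {
    'ms://': 'modelscope',
    'hf://': 'huggingface',
}


def split_hub_prefix(model_id: str) -> Tuple[Optional[str], str]:
    """Return (hub, repo_path); hub is None when no known prefix is present."""
    for prefix, hub in HUB_PREFIXES.items():
        if model_id.startswith(prefix):
            return hub, model_id[len(prefix):]
    return None, model_id


print(split_hub_prefix('ms://Qwen/Qwen3.5-4B'))
print(split_hub_prefix('hf://Qwen/Qwen3.5-4B'))
```

Because only the prefix changes, the repository path (`Qwen/Qwen3.5-4B`) stays identical when switching hubs.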

🛠️ Twinkle✨ Modular Ecosystem

Dataset (Data loading and preprocessing)	Template (Encoding and decoding)	DataLoader (Data distribution and batching)	Preprocessor (Data ETL)	InputProcessor (Task-specific input processing)
Model (Large models, supports multiple frameworks)	Sampler (Sampler logic)	Loss (Loss functions)	Metric (Training metrics collection)	Reward (Reward function)
Advantage (Advantage function)	CheckpointEngine (Weight synchronization)	Patch (Patches for model fixes)	Module (Components, e.g., Optimizer)	Kernel (Operators)
Server (Start backend cluster)	Client (Client code)	Infra (Isolate ray and torchrun differences)	Plugin (Use hub components)	Hub (Interface with HF/MS libraries)

Twinkle’s Customizable Components

In Twinkle’s design, training via torchrun, Ray, and HTTP uses the same API and shares the same components and input/output structures. Therefore, many of its components can be customized by developers to implement new algorithms.

Below is a list of recommended components for customization:

Component Name	Base Class	Description
Loss	twinkle.loss.Loss	Used to define loss functions for model training
Metric	twinkle.metric.Metric	Used to define evaluation systems for model training
Optimizer/LRScheduler	Based on PyTorch	Used to define optimizers and LR schedulers for model training
Patch	twinkle.patch.Patch	Used to fix issues during model training
Preprocessor	twinkle.preprocessor.Preprocessor	Used for data preprocessing (ETL) and returns standard format usable by Template
Filter	twinkle.preprocessor.Filter	Used to filter raw data for reasonableness
Task Data Processor	twinkle.processor.InputProcessor	Used to convert model inputs to data required by each task and add extra fields
Model	twinkle.model.TwinkleModel	The large model itself
Sampler	twinkle.sampler.Sampler	Sampler, e.g., vLLM
Reward	twinkle.reward.Reward	Used to implement rewards for different RL training
Advantage	twinkle.advantage.Advantage	Used to implement advantage estimation for different RL training
Template	twinkle.template.Template	Used to process standard inputs and convert them to tokens required by the model
Weight Synchronization	twinkle.checkpoint_engine.CheckpointEngine	Used for weight synchronization in RL training

Components not listed in the above table, such as Dataset, DataLoader, etc., can also be customized; simply follow the base class API design.

DeviceGroup and DeviceMesh

DeviceGroup and DeviceMesh are the core concepts of Twinkle’s architecture. All code construction is based on these two designs.

import twinkle
from twinkle import DeviceMesh, DeviceGroup
device_group = [
    DeviceGroup(
        name='default',
        ranks=8,
        device_type='cuda',
    )
]

device_mesh = DeviceMesh.from_sizes(pp_size=2, tp_size=2, dp_size=2)
twinkle.initialize(mode='ray', nproc_per_node=8, groups=device_group)

After defining the device_group, you need to use twinkle.initialize to initialize resources.

DeviceGroup: Defines how many resource groups are needed for this training session. Once defined, components can run themselves remotely by selecting a resource group:

from twinkle.model import TransformersModel
model = TransformersModel(model_id='Qwen/Qwen3.5-4B', remote_group='default', device_mesh=device_mesh)
# Or
from twinkle.model import MegatronModel
model = MegatronModel(model_id='Qwen/Qwen3.5-4B', remote_group='default', device_mesh=device_mesh)

DeviceMesh specifies the topology of components like models within the resource group. It can be understood as how to perform parallelization. This affects a series of framework decisions such as data acquisition, data consumption, and data return.
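
To make the idea of a topology concrete, the sketch below enumerates the coordinates that a pp=2, tp=2, dp=2 mesh would assign to 8 ranks. It is plain Python, not Twinkle code: the helper `mesh_coordinates` is hypothetical, and the row-major axis order used here is an assumption that may differ from how DeviceMesh actually lays out ranks.

```python
from itertools import product
from math import prod


def mesh_coordinates(sizes: dict) -> dict:
    """Map each global rank to its coordinate along every parallel axis.

    Assumption: ranks are laid out row-major in the order the axes are given,
    so the last axis varies fastest.
    """
    axes = list(sizes)
    coords = product(*(range(sizes[axis]) for axis in axes))
    return {rank: dict(zip(axes, coord)) for rank, coord in enumerate(coords)}


sizes = {'pp': 2, 'dp': 2, 'tp': 2}
world_size = 8
# The product of the axis sizes must equal the number of ranks in the group.
assert prod(sizes.values()) == world_size

layout = mesh_coordinates(sizes)
for rank, coord in layout.items():
    print(rank, coord)
```

Ranks that share a dp coordinate would consume the same data shard, which is why the mesh influences data acquisition and return.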

Usage Example

from peft import LoraConfig
import twinkle
from twinkle import DeviceMesh, DeviceGroup
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor

device_group = [DeviceGroup(name='default',ranks=8,device_type='cuda')]
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
# Use mode='local' when launching with torchrun
twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_mesh)


def train():
    # 1000 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
    # Set template to prepare encoding
    dataset.set_template('Qwen3_5Template', model_id='Qwen/Qwen3.5-4B')
    # Preprocess the dataset to standard format
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    # Encode dataset
    dataset.encode()
    # Global batch size = 8 across 8 GPUs, so 1 sample per GPU
    dataloader = DataLoader(dataset=dataset, batch_size=8, min_batch_size=8)
    # Use a TransformersModel
    model = TransformersModel(model_id='Qwen/Qwen3.5-4B', remote_group='default')

    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules='all-linear'
    )

    # Add a lora to model, with name `default`
    # Comment this to use full-parameter training
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Add Optimizer for lora `default`
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    # Add LRScheduler for lora `default`
    model.set_lr_scheduler(scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5,
                           num_training_steps=len(dataloader))
    for step, batch in enumerate(dataloader):
        # Do forward and backward
        model.forward_backward(inputs=batch)
        # Step
        model.clip_grad_and_step()
        if step % 20 == 0:
            # Print metric
            metric = model.calculate_metric(is_training=True)
            print(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
    # Save the final checkpoint once the loop finishes
    model.save(f'last-checkpoint')


if __name__ == '__main__':
    train()

Start training like this:

python3 train.py

Twinkle Installation

Wheel Package Installation

You can install using pip:

pip install 'twinkle-kit'

Installation from Source

git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e .

Docker Image

You can also use our pre-built Docker image:

modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:twinkle-0.3.0

Client Installation

If you need to use Twinkle’s Client for remote training, you can use our one-click installation script:

# Mac or Linux
sh INSTALL_CLIENT.sh
# Windows, Open with PowerShell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
.\INSTALL_CLIENT.ps1

This script downloads conda (or uses an existing installation) to create a virtual environment named twinkle-client, which can be used directly for remote training.

Megatron Dependencies

If you need to install Megatron-related dependencies, you can use the following script:

sh INSTALL_MEGATRON.sh

Supported Hardware

Hardware Environment	Notes
GPU A10/A100/H100/RTX series
GPU T4/V100	Does not support bfloat16, Flash-Attention
Ascend NPU	Some operators not supported
PPU	Supported
CPU	Supports partial components like dataset, dataloader

NPU (Ascend) Quick Start Guide

This document describes how to install and use the Twinkle framework in Huawei Ascend NPU environments.

Environment Requirements

Before getting started, please ensure your system meets the following requirements:

Component	Version Requirement	Description
Python	>= 3.11, < 3.13	Twinkle framework requirement
Ascend Firmware Driver (HDK)	Latest version recommended	Hardware driver and firmware
CANN Toolkit	8.5.1 or higher	Heterogeneous Computing Architecture
PyTorch	2.7.1	Deep learning framework
torch_npu	2.7.1	Ascend PyTorch adapter plugin

Important Notes:

torch and torch_npu versions must be exactly the same (e.g., both 2.7.1)
Python 3.11 is recommended for best compatibility
CANN toolkit requires approximately 10GB+ disk space
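
The version-matching rule above can be checked programmatically. The sketch below is a minimal, hypothetical helper (`versions_match` is not part of Twinkle or torch_npu) that compares only the leading major.minor.patch numbers. Ignoring suffixes such as `+cpu` or `.post1` is an assumption made for illustration; a stricter check may be appropriate in your environment.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r'(\d+)\.(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f'Unrecognized version string: {version!r}')
    return tuple(int(part) for part in match.groups())


def versions_match(torch_version: str, torch_npu_version: str) -> bool:
    """True when both packages share the same major.minor.patch release."""
    return release_tuple(torch_version) == release_tuple(torch_npu_version)


print(versions_match('2.7.1', '2.7.1'))
print(versions_match('2.8.0', '2.7.1'))
```

In a real environment you would pass `torch.__version__` and `torch_npu.__version__` instead of literal strings.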

Supported Hardware

Twinkle currently supports the following Ascend NPU devices:

Ascend 910 series
Other compatible Ascend accelerator cards

Installation Steps

1. Install NPU Environment (Driver, CANN, torch_npu)

NPU environment installation includes Ascend driver, CANN toolkit, PyTorch, and torch_npu.

📖 Complete Installation Tutorial:

This documentation includes:

Ascend driver (HDK) installation steps
CANN toolkit installation steps
PyTorch and torch_npu installation steps
Version compatibility instructions

Recommended Version Configuration:

Python: 3.11
PyTorch: 2.7.1
torch_npu: 2.7.1
CANN: 8.5.1 or higher

2. Install Twinkle

After NPU environment configuration is complete, install the Twinkle framework from source:

git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"

3. Install vLLM and vLLM-Ascend (Optional)

If you need to use vLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.

Installation Steps:

# Step 1: Install vLLM
pip install vllm==0.14.0

# Step 2: Install vLLM-Ascend
pip install vllm-ascend==0.14.0rc1

Notes:

Install in the above order, ignoring possible dependency conflict warnings
Ensure CANN environment is activated before installation: source /usr/local/Ascend/ascend-toolkit/set_env.sh
Recommended versions are vLLM 0.14.0 and vLLM-Ascend 0.14.0rc1

4. Verify Installation

Create test script verify_npu.py:

import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # Simple test
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")

Run verification:

python verify_npu.py

If the output shows NPU available: True and no errors, installation is successful!

Note: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official images from the Ascend community.

5. Install Megatron Backend Dependencies

Recommended versions:

Megatron-LM: v0.15.3
MindSpeed: core_r0.15.3
mcore-bridge: main branch or the version already validated in your Twinkle checkout

Installation steps:

# 1. Clone Megatron-LM and pin the compatible version
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout v0.15.3
cd ..

# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout core_r0.15.3
pip install -e .
cd ..

# 3. Clone and install mcore-bridge
git clone https://github.com/modelscope/mcore-bridge.git
cd mcore-bridge
pip install -e .
cd ..

# 4. Install Twinkle if needed
cd twinkle
pip install -e ".[transformers,ray]"

Runtime environment variables:

export PYTHONPATH=$PYTHONPATH:</path/to/Megatron-LM>
export MEGATRON_LM_PATH=</path/to/Megatron-LM>
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Verification:

First run a minimal import check to make sure the current environment can resolve MindSpeed and Megatron-LM:

python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"

6. Qwen3.5/3.6 FLA and Triton-Ascend Version Compatibility

FLA Enablement Conditions

To use FLA (Flash Linear Attention) with Qwen3.5/3.6 on the transformers backend, the following conditions must be met:

Install triton-ascend
mindspeed version 26.0.0_core_r0.12.1

Triton-Ascend Version and CANN Compatibility

triton-ascend	CANN	Additional Dependencies
3.2.0	8.5.x	Do not install `triton`
3.2.1	9.0.0	`triton` must be installed

MindSpeed Version and Code Adaptation

The currently validated MindSpeed version is 26.0.0_core_r0.12.1. MindSpeed repository:

If using a higher MindSpeed version, note that the following import paths in src/twinkle/kernel/chunk_gated_delta_rule.py may need to be adjusted to match the actual code locations in MindSpeed:

from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
from mindspeed.lite.ops.triton.solve_tril import solve_tril
from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd

7. NPU Patch Environment Variable Configuration

Twinkle enables model-layer patches by default in NPU environments. The following environment variables provide fine-grained control:

Environment Variable	Description	Default
`TWINKLE_NPU_PATCH`	Master switch for all NPU optimizations	`1` (enabled)
`TWINKLE_NPU_FUSED_OPS`	Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA)	`1` (enabled)
`TWINKLE_NPU_MOE_PATCH`	Enable MoE Grouped MatMul	`1` (enabled)
`TWINKLE_NPU_FLA`	Enable Qwen3.5 Flash Linear Attention; set to `0` to force torch fallback	`1` (enabled)

Usage examples:

# Disable all NPU optimizations and fall back to native Transformers
export TWINKLE_NPU_PATCH=0

# Disable FLA only while keeping other fused operators
export TWINKLE_NPU_FLA=0

# Disable MoE patch only
export TWINKLE_NPU_MOE_PATCH=0
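
The interaction between the master switch and the per-feature switches can be sketched as follows. This is an illustration of the behavior described in the table, not Twinkle's source code: `resolve_npu_flags` is a hypothetical helper, and treating every value other than `0` as enabled is an assumption.

```python
import os

FEATURE_FLAGS = ('TWINKLE_NPU_FUSED_OPS', 'TWINKLE_NPU_MOE_PATCH', 'TWINKLE_NPU_FLA')


def resolve_npu_flags(env=None) -> dict:
    """Resolve which features are on; every switch defaults to '1' (enabled)."""
    env = os.environ if env is None else env
    # The master switch gates every individual feature.
    master_on = env.get('TWINKLE_NPU_PATCH', '1') != '0'
    return {name: master_on and env.get(name, '1') != '0' for name in FEATURE_FLAGS}


print(resolve_npu_flags({}))
print(resolve_npu_flags({'TWINKLE_NPU_FLA': '0'}))
print(resolve_npu_flags({'TWINKLE_NPU_PATCH': '0'}))
```

Calling `resolve_npu_flags()` with no argument reads the current process environment.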

Quick Start

Important Notice: The following examples are from the cookbook/ directory and have been verified in actual NPU environments. It is recommended to run scripts directly from the cookbook rather than copying and pasting code snippets.

SFT LoRA Fine-tuning

No SFT cookbook example is currently provided for NPU; this capability will be documented alongside an available cookbook example or a future NPU script.

GRPO Reinforcement Learning Training

No GRPO cookbook example is currently provided for NPU; this capability will be documented alongside an available cookbook example or a future NPU script.

More Examples

Check the cookbook/remote/tinker/ascend/ directory for remote training server-side configuration.

Parallelization Strategies

Twinkle currently supports the following verified parallelization strategies on NPU:

Parallel Type	Description	NPU Support	Verification Status
DP (Data Parallel)	Data parallelism	✅	No corresponding cookbook example
FSDP (Fully Sharded Data Parallel)	Fully sharded data parallelism	✅	No corresponding cookbook example
TP (Tensor Parallel)	Tensor parallelism (Megatron)	✅	Verified (see `cookbook/megatron/ascend/tp_npu.py`)
PP (Pipeline Parallel)	Pipeline parallelism (Megatron)	✅	Verified (see `cookbook/megatron/ascend/tp_npu.py`)
CP (Context Parallel)	Context parallelism	✅	Verified (see `cookbook/megatron/ascend/tp_moe_cp_npu.py`)
EP (Expert Parallel)	Expert parallelism (MoE)	✅	Verified (see `cookbook/megatron/ascend/tp_moe_npu.py`)

Legend:

✅ Verified: Has actual running example code
🚧 To be verified: Theoretically supported but no NPU verification example yet
❌ Not supported: Not available in current version

DP + FSDP Example

No corresponding cookbook code snippet is currently provided for NPU.

Megatron backend note: Twinkle now provides runnable NPU smoke scripts for the Megatron backend. Please follow the installation section above before running the cookbook examples, and start with cookbook/megatron/ascend/tp_npu.py before moving on to cookbook/megatron/ascend/tp_moe_npu.py and cookbook/megatron/ascend/tp_moe_cp_npu.py.

Common Issues

1. torch_npu Version Mismatch

Problem: Version incompatibility warnings or errors after installing torch_npu.

Solution:

Ensure torch and torch_npu versions are exactly the same
Check if CANN version is compatible with torch_npu

# Check current versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl

2. CANN Toolkit Version Issue

Problem: CANN version incompatible with torch_npu.

Solution:

Refer to
Install corresponding CANN toolkit version

Feature Support Status

Feature support matrix based on actual code verification:

Feature	GPU	NPU	Verification Example	Description
SFT + LoRA	✅	✅	-	No corresponding cookbook example
GRPO	✅	✅	-	No corresponding cookbook example
DP Parallelism	✅	✅	-	No corresponding cookbook example
FSDP Parallelism	✅	✅	-	No corresponding cookbook example
Ray Distributed	✅	✅	-	No corresponding cookbook example
TorchSampler	✅	✅	-	No corresponding cookbook example
vLLMSampler	✅	✅	-	No corresponding cookbook example
Full Fine-tuning	✅	✅	-	Verified available
QLoRA	✅	❌	-	Quantization operators not yet supported
DPO	✅	🚧	-	Theoretically supported, to be verified
Megatron TP/PP	✅	✅	`cookbook/megatron/ascend/tp_npu.py`	Verified with the Megatron NPU smoke script
Flash Attention	✅	⚠️	-	Some operators not supported

Legend:

✅ Verified: Has actual running example, confirmed available
🚧 To be verified: Theoretically supported but no NPU environment verification yet
⚠️ Partial support: Available but with limitations or performance differences
❌ Not supported: Not available in current version

Usage Recommendations:

Prioritize features marked as “Verified” for guaranteed stability
“To be verified” features can be attempted but may encounter compatibility issues
Refer to corresponding example code when encountering problems

Example Code

Twinkle’s verified NPU examples currently focus on the Megatron smoke path; the SFT and GRPO cookbook examples do not have corresponding files yet.

Remote Training (Tinker Protocol)

Server Configuration:
- Provides HTTP API interface
- Supports remote training and inference
- Suitable for production environment deployment

Running Examples: No corresponding command examples are provided yet.

Reference Resources

Getting Help

If you encounter issues during use:

Check Logs: Set environment variable ASCEND_GLOBAL_LOG_LEVEL=1 for detailed logs
Submit Issue:
Community Discussion:

Next Steps

📖 Read for more training examples
📖 Read for other platform installations
🚀 Browse the cookbook/ directory for complete example code
💡 Check for advanced features

Twinkle Training Service on ModelScope

Alongside the open-source release of the Twinkle framework, we also provide a hosted model training service (Training as a Service) powered by ModelScope’s backend infrastructure. Developers can use this service to experience Twinkle’s training API for free.

The model currently running on the cluster is . Below are the detailed usage instructions:

Step 1. Register a ModelScope Account and Obtain Your API Key

Developers first need to register as a ModelScope user. You can also use Twinkle✨ by deploying the service locally.

Registration link:

After registering, obtain your API-Key (i.e., the ModelScope platform access token) from this page: .

API endpoint: base_url="https://www.modelscope.cn/twinkle"

Step 2. Review the Cookbook and Customize Development

We strongly recommend that developers check out our and build upon the training code provided there for secondary development.

Sample code:

import os
from tqdm import tqdm
from tinker import types
from twinkle_client import init_tinker_client
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor
from twinkle.server.common import input_feature_to_datum

base_model = 'ms://Qwen/Qwen3.6-27B'
base_url='https://www.modelscope.cn/twinkle'
api_key=os.environ.get('MODELSCOPE_TOKEN')

# Use twinkle dataset to load the data
dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
dataset.set_template('Qwen3_5Template', model_id=base_model, max_length=256)
dataset.map(SelfCognitionProcessor('Twinkle Model', 'ModelScope Team'), load_from_cache_file=False)
dataset.encode(batched=True, load_from_cache_file=False)
dataloader = DataLoader(dataset=dataset, batch_size=8)

# Initialize Tinker client before importing ServiceClient
init_tinker_client()
from tinker import ServiceClient

service_client = ServiceClient(base_url=base_url, api_key=api_key)
training_client = service_client.create_lora_training_client(base_model=base_model[len('ms://'):], rank=16)

# Training loop: use input_feature_to_datum to transfer the input format
for epoch in range(2):
    for step, batch in tqdm(enumerate(dataloader)):
        input_datum = [input_feature_to_datum(input_feature) for input_feature in batch]

        fwdbwd_future = training_client.forward_backward(input_datum, "cross_entropy")
        optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

        fwdbwd_result = fwdbwd_future.result()
        optim_result = optim_future.result()
        print(f'Training Metrics: {optim_result}')

    result = training_client.save_state(f"twinkle-lora-{epoch}").result()
    print(f'Saved checkpoint for epoch {epoch} to {result.path}')

With the code above, you can train a self-cognition LoRA based on Qwen/Qwen3.6-27B. This LoRA will change the model’s name and creator to the names specified during training. To perform inference using this LoRA:

import os
from tinker import types

from twinkle.data_format import Message, Trajectory
from twinkle.template import Template
from twinkle import init_tinker_client

# Step 1: Initialize Tinker client
init_tinker_client()

from tinker import ServiceClient

base_model = 'Qwen/Qwen3.6-27B'
base_url = 'https://www.modelscope.cn/twinkle'

# Step 2: Define the base model and connect to the server
service_client = ServiceClient(
 base_url=base_url,
 api_key=os.environ.get('MODELSCOPE_TOKEN')
)

# Step 3: Create a sampling client by loading weights from a saved checkpoint.
# The model_path is a twinkle:// URI pointing to a previously saved LoRA checkpoint.
# The server will load the base model and apply the LoRA adapter weights.
sampling_client = service_client.create_sampling_client(
 model_path='twinkle://xxx-Qwen_Qwen3.6-35B-A3B-xxx/weights/twinkle-lora-1',
 base_model=base_model
)

# Step 4: Load the tokenizer locally to encode the prompt and decode the results
print(f'Using model {base_model}')

template = Template(model_id=f'ms://{base_model}')

trajectory = Trajectory(
 messages=[
 Message(role='system', content='You are a helpful assistant'),
 Message(role='user', content='Who are you?'),
 ]
)

input_feature = template.batch_encode([trajectory], add_generation_prompt=True)[0]

input_ids = input_feature['input_ids'].tolist()

# Step 5: Prepare the prompt and sampling parameters
prompt = types.ModelInput.from_ints(input_ids)
params = types.SamplingParams(
 max_tokens=128, # Maximum number of tokens to generate
 temperature=0.7,
 stop=['\n'] # Stop generation when a newline character is produced
)

# Step 6: Send the sampling request to the server.
# num_samples=1 generates 1 independent completion for the same prompt.
print('Sampling...')
future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=1)
result = future.result()

# Step 7: Decode and print the generated responses
print('Responses:')
for i, seq in enumerate(result.sequences):
    print(f'{i}: {repr(template.decode(seq.tokens))}')

Developers can also merge this LoRA with the base model and then deploy it using their own service, calling it through the OpenAI-compatible standard API.

The ModelScope server is currently Tinker-compatible, so please use the Tinker cookbooks. In a future version, we will support a server that works for both Twinkle and Tinker clients.

Developers can customize datasets, advantage functions, rewards, templates, and more. However, the Loss component is not currently customizable since it needs to be executed on the server side (for security reasons). If you need support for additional Loss functions, you can upload your Loss implementation to and contact us via the Q&A group or through an to have the corresponding component added to the whitelist.

Appendix: Supported Training Methods

This model is a text-only model, so multimodal tasks are not currently supported. For text-only tasks, you can train using:

Standard PT/SFT training methods, including Agentic training
Self-sampling RL algorithms such as GRPO/RLOO
Distillation methods like GKD/On-policy. Since the official ModelScope endpoint only supports a single model, the other Teacher/Student model must be prepared by the developer

The current official environment only supports LoRA training, with the following requirements:

Maximum rank = 32
modules_to_save is not supported
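
Since these limits are enforced by the hosted service, it can be convenient to check a configuration locally before submitting a job. The following is a minimal sketch under that assumption; `check_lora_request` is a hypothetical helper and not part of the Twinkle or Tinker client.

```python
from typing import List, Optional

MAX_RANK = 32  # limit stated above for the official environment


def check_lora_request(rank: int, modules_to_save: Optional[List[str]] = None) -> List[str]:
    """Return a list of problems; an empty list means the request fits the stated limits."""
    problems = []
    if rank < 1 or rank > MAX_RANK:
        problems.append(f'rank must be between 1 and {MAX_RANK}, got {rank}')
    if modules_to_save:
        problems.append('modules_to_save is not supported')
    return problems


print(check_lora_request(rank=16))
print(check_lora_request(rank=64, modules_to_save=['lm_head']))
```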

Qwen3.5 Training Best Practices

Using Qwen3.5-4B as an example, this guide demonstrates the core capability of the Twinkle framework: a single component-based codebase that carries over from single-GPU training to Client-Server mode.

1. What is Twinkle

Twinkle is a production-oriented large model training framework. Its core design is straightforward: training logic is expressed in Python code, and the runtime mode is switched via initialization parameters.

This means:

A training script written in the lab can be moved to Ray or server-based training by changing a single line
Training algorithms are open to customization
No need to maintain separate codebases to support different modes like torchrun, Ray, or HTTP
Algorithm engineers focus on training logic; the framework handles distributed communication automatically

Twinkle supports both Transformers and Megatron backends, as well as multi-tenant LoRA training — multiple users share a single base model while each trains their own adapter.

2. Local Multi-GPU Training

Overview

Training on 1–8 local GPUs or NPUs. Twinkle is built on PyTorch native interfaces and supports parallel strategies such as FSDP2 and DDP.

Full Code

from peft import LoraConfig
from tqdm import tqdm

import twinkle
from twinkle import DeviceMesh, get_device_placement, get_logger
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor

# Build device_mesh: fsdp=4, dp=2, using 8 GPUs in total
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
# Use torchrun mode
twinkle.initialize(mode='local', global_device_mesh=device_mesh)

logger = get_logger()


def eval(model):
    # Validation set: 100 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(100)))
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    dataset.encode()
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    for step, batch in tqdm(enumerate(dataloader)):
        model.forward_only(inputs=batch)
        model.calculate_loss()
    metrics = model.calculate_metric(is_training=False)
    return metrics


def train():
    # Training set: 1000 samples
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
    # Set template to prepare encoding
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B')
    # Preprocess: replace placeholders in self-cognition data
    dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community'))
    # Encode dataset
    dataset.encode()
    # Global batch size = 8; each of the 8 GPUs processes 1 sample
    dataloader = DataLoader(dataset=dataset, batch_size=8)
    # Load model
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
    model.model._no_split_modules = {'Qwen3_5DecoderLayer'}

    lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')

    # Add LoRA adapter named 'default'
    # Comment this out to switch to full-parameter training
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Configure optimizer for LoRA
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    # Configure learning rate scheduler
    model.set_lr_scheduler(
        scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader))
    logger.info(get_device_placement())
    # Print training config
    logger.info(model.get_train_configs())
    logger.info(f'Total steps: {len(dataloader)}')
    loss_metric = 99.0
    # LoRA training: ~8G * 8 GPU memory
    # Full-parameter training: ~18G * 8 GPU memory
    for step, batch in enumerate(dataloader):
        # Forward + backward pass
        model.forward_backward(inputs=batch)
        # Gradient clipping + optimizer step
        model.clip_grad_and_step()
        if step % 20 == 0:
            # Print training metrics
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
        if step > 0 and step % 40 == 0:
            # Periodic evaluation
            metrics = eval(model)
            logger.info(f'Eval metric: {metrics}')
            metrics['step'] = step
            # Save best checkpoint
            if loss_metric > float(metrics['loss']):
                model.save(f'checkpoint-{step}')
                loss_metric = float(metrics['loss'])
    model.save(f'last-checkpoint')


if __name__ == '__main__':
    train()

Launch Command

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 fsdp2.py

Key Design Notes

DeviceMesh Parallelism Strategy

device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)

A hybrid parallel strategy with 4-way FSDP sharding + 2-way data parallelism. Qwen3.5-4B weights occupy ~8GB in bf16 precision. Per the comments in the script above, per-GPU memory usage is roughly 8GB for LoRA training and roughly 18GB for full-parameter training, so 8× A100/H100 handles either mode comfortably.

Gradient Accumulation

model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)

gradient_accumulation_steps=2 updates parameters every 2 micro-batches, effectively doubling the batch size. Useful when GPU memory is constrained but a larger effective batch is desired.
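
Why accumulation behaves like a larger batch can be shown with a toy example in plain Python. The sketch assumes gradients are averaged across micro-batches before the optimizer step, which is a common convention; whether Twinkle averages or sums internally is not stated here. The helper `grad_mse` and the data are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Reference: one large batch containing all 4 samples.
full_grad = grad_mse(w, xs, ys)

# Accumulation: two micro-batches of 2 samples, averaged before a single update.
accumulation_steps = 2
micro_batch_size = 2
accumulated = 0.0
for i in range(accumulation_steps):
    start = i * micro_batch_size
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    accumulated += grad_mse(w, micro_x, micro_y) / accumulation_steps

print(full_grad, accumulated)
```

Both values agree, so the optimizer sees the same update as with a batch twice as large, while only one micro-batch has to fit in memory at a time.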

Algorithm Transparency

All key training steps — forward pass, backward pass, gradient clipping, checkpoint saving — are written directly in the main loop. Developers retain full control over the training process. The underlying distributed communication is handled by Twinkle’s infra layer; switching between Ray and torchrun has no impact on the main loop.

For complex algorithms, this transparency is especially important.

RL Training: Reinforcement Learning with Ray

Twinkle supports multiple RL algorithms, including GRPO, RLOO, GSPO, and more. Here we use GRPO (Group Relative Policy Optimization) as an example — the core RL algorithm used in DeepSeek-R1 — to show how RL training works in Ray mode.
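
The "group relative" part of GRPO can be sketched in a few lines: each completion is scored against the other completions sampled for the same prompt. The function below is a generic, simplified formulation using mean and standard deviation normalization; it is not the implementation of `twinkle.advantage.GRPOAdvantage`, and the reward values are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], group_size: int, eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    if len(rewards) % group_size != 0:
        raise ValueError('rewards must contain whole groups')
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        baseline = mean(group)        # group mean replaces a learned value model
        scale = pstdev(group) + eps   # eps avoids division by zero for uniform groups
        advantages.extend((r - baseline) / scale for r in group)
    return advantages


# Two prompts with four sampled completions each.
rewards = [1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 2.0]
advantages = group_relative_advantages(rewards, group_size=4)
print(advantages)
```

A group in which every completion receives the same reward yields zero advantage for all of its samples, so it contributes no learning signal.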

Unlike PPO, GRPO does not require training a separate value model. Instead, it estimates the advantage function using relative rewards within a sampled group, simplifying the training pipeline and reducing memory overhead. Twinkle’s Ray mode is particularly well-suited for RL algorithms that require model and sampler to run on separate devices. In the example below, 4 GPUs run model training while another 4 run vLLM sampling, coordinated through a Ray cluster:

from typing import List, Dict, Any
from peft import LoraConfig
import twinkle
from twinkle import DeviceMesh, DeviceGroup, get_device_placement, get_logger
from twinkle.advantage import GRPOAdvantage
from twinkle.checkpoint_engine import CheckpointEngineManager
from twinkle.data_format import SamplingParams
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.processor import InputProcessor
from twinkle.reward import GSM8KAccuracyReward, GSM8KFormatReward
from twinkle.sampler import vLLMSampler
from twinkle.template import Template
from twinkle.metric import CompletionRewardMetric
from twinkle.preprocessor.llm import GSM8KProcessor

logger = get_logger()

MODEL_ID = 'ms://Qwen/Qwen3.5-4B'
MODEL_GPUS = 4 # 4 GPUs for model training
SAMPLER_GPUS = 4 # 4 GPUs for vLLM sampling
NUM_GPUS = MODEL_GPUS + SAMPLER_GPUS

NUM_GENERATIONS = 8 # 8 samples per group
MAX_NEW_TOKENS = 4096
LEARNING_RATE = 1e-5
MAX_STEPS = 200
BATCH_SIZE = 16
MINI_BATCH_SIZE = 16
MICRO_BATCH_SIZE = 2
ADAPTER_NAME = 'default'

def create_gsm8k_dataset():
    dataset = Dataset(DatasetMeta('ms://modelscope/gsm8k', subset_name='main', split='train'))
    dataset.set_template('Qwen3_5Template', model_id=MODEL_ID, max_length=2048)
    dataset.map(GSM8KProcessor())
    dataset.encode(add_generation_prompt=True)
    return dataset

def compute_rewards(trajectories: List[Dict[str, Any]]):
    accuracy_reward_fn = GSM8KAccuracyReward()
    format_reward_fn = GSM8KFormatReward()
    accuracy_rewards = accuracy_reward_fn(trajectories)
    format_rewards = format_reward_fn(trajectories)
    total_rewards = [a + f for a, f in zip(accuracy_rewards, format_rewards)]
    return total_rewards, format_rewards, accuracy_rewards

def main():
    # Assign model and sampler to separate GPU groups
    device_groups = [
        DeviceGroup(name='model', ranks=list(range(MODEL_GPUS)), device_type='GPU'),
        DeviceGroup(name='sampler', ranks=list(range(MODEL_GPUS, NUM_GPUS)), device_type='GPU'),
    ]
    model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS)
    sampler_mesh = DeviceMesh.from_sizes(world_size=SAMPLER_GPUS, dp_size=SAMPLER_GPUS)

    # Initialize in Ray mode
    twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False)

    lora_config = LoraConfig(target_modules='all-linear', r=32, lora_alpha=64, lora_dropout=0.05)

    # Model deployed in the 'model' group
    model = TransformersModel(model_id=MODEL_ID, device_mesh=model_mesh, remote_group='model')
    model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
    model.set_optimizer('AdamW', lr=LEARNING_RATE)
    model.set_lr_scheduler('CosineAnnealingLR', T_max=MAX_STEPS, eta_min=0)
    model.set_loss('GRPOLoss', epsilon=0.2)
    model.set_processor(InputProcessor)
    model.set_template('Qwen3_5Template', model_id=MODEL_ID)

    # Sampler deployed in the 'sampler' group
    sampler = vLLMSampler(
        model_id=MODEL_ID,
        engine_args={
            'gpu_memory_utilization': 0.8,
            'max_model_len': 4096,
            'max_lora_rank': 32,
            'enable_lora': False,
        },
        device_mesh=sampler_mesh,
        remote_group='sampler',
    )
    sampler.set_template('Qwen3_5Template', model_id=MODEL_ID)

    ckpt_manager = CheckpointEngineManager(model=model, sampler=sampler)

    dataloader = DataLoader(
        dataset=create_gsm8k_dataset,
        batch_size=BATCH_SIZE,
        min_batch_size=BATCH_SIZE,
        device_mesh=model_mesh,
        remote_group='model',
    )

    advantage_fn = GRPOAdvantage()
    metrics = CompletionRewardMetric()
    sampling_params = SamplingParams(max_tokens=MAX_NEW_TOKENS, num_samples=1, logprobs=1)

    optim_step = 0
    logger.info(get_device_placement())

    for batch in dataloader:
        if optim_step >= MAX_STEPS:
            break
        metrics.reset()
        global_prompts = batch if isinstance(batch, list) else [batch]

        # Sync weights to sampler
        ckpt_manager.sync_weights(merge_and_sync=True)
        sampler.reset_prefix_cache()

        # Group sampling: sample NUM_GENERATIONS completions per prompt
        sample_responses = sampler.sample(
            global_prompts * NUM_GENERATIONS,
            sampling_params,
        )

        all_input_data = []
        all_old_logps = []
        all_completion_lengths = []

        for sample_response in sample_responses:
            for sequence in sample_response.sequences:
                all_input_data.append(sequence.new_input_feature)
                all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
                all_completion_lengths.append(len(sequence.tokens))

        # Compute rewards
        total_rewards, format_rewards, accuracy_rewards = compute_rewards(all_input_data)
        metrics.accumulate(
            completion_lengths=all_completion_lengths,
            rewards={
                'total': total_rewards,
                'format': format_rewards,
                'accuracy': accuracy_rewards,
            },
        )

        # GRPO advantage estimation: group-level normalization
        advantages = advantage_fn(total_rewards, num_generations=NUM_GENERATIONS, scale='group').tolist()

        # Mini-batch training
        total_completions = len(all_input_data)
        for mb_start in range(0, total_completions, MINI_BATCH_SIZE):
            mb_end = min(mb_start + MINI_BATCH_SIZE, total_completions)
            mb_inputs = all_input_data[mb_start:mb_end]
            mb_old_logps = all_old_logps[mb_start:mb_end]
            mb_advantages = advantages[mb_start:mb_end]

            model.forward_backward(
                inputs=mb_inputs,
                old_logps=mb_old_logps,
                advantages=mb_advantages,
                micro_batch_size=MICRO_BATCH_SIZE,
            )
            model.clip_grad_and_step()
            optim_step += 1

            if optim_step >= MAX_STEPS:
                break
        log_dict = metrics.calculate()
        log_dict.update(model.calculate_metric(is_training=True))
        metrics.reset()
        logger.info(f'[Step {optim_step}/{MAX_STEPS}] {log_dict}')

    logger.info(f'Training completed. optim_steps={optim_step}')
    model.save('grpo-gsm8k-checkpoint')

if __name__ == '__main__':
    main()

Since this runs on a Ray cluster, launching is simply:

python train.py

Key Design Points for GRPO Training:

Model-sampler separation: DeviceGroup splits 8 GPUs into two groups. Training and sampling run independently, allowing the sampling pipeline to fully leverage vLLM’s high throughput.
Group sampling strategy: global_prompts * NUM_GENERATIONS produces multiple completions per prompt, enabling advantage estimation via intra-group relative rewards — no separate value model needed.
Weight synchronization: ckpt_manager.sync_weights() syncs the training model weights to vLLM before each sampling step, ensuring the sampler always uses the latest policy.
Algorithm components exposed: GRPOLoss is registered on the model through set_loss, and GRPOAdvantage is called directly in the main loop. Either can be swapped for another RL algorithm component without modifying the rest of the code.
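
To make the group-relative idea concrete, here is a minimal sketch of advantage estimation in plain Python. It is not Twinkle's GRPOAdvantage: the function name, the standard-deviation normalization, and the assumption that each consecutive block of num_generations rewards belongs to one prompt are all illustrative choices, so check how GRPOAdvantage orders and scales groups before relying on it.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, num_generations, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt.

    Assumes `rewards` is ordered so that each consecutive block of
    `num_generations` entries belongs to one prompt.
    """
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu, sigma = mean(group), pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages

rewards = [1.0, 0.0, 0.0, 1.0,   # prompt A: two correct, two wrong
           2.0, 2.0, 2.0, 2.0]   # prompt B: identical rewards, no signal
adv = group_relative_advantages(rewards, num_generations=4)
print([round(a, 3) for a in adv])
```

A group whose completions all receive the same reward yields zero advantages, which is why reward functions that separate good from bad completions within a group matter for this family of algorithms.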

The core value of this pattern: the entire RL training loop — sampling, reward computation, advantage estimation, gradient update — is laid out in a visible Python main loop with no hidden magic. Differences between RL algorithms typically amount to swapping a few components.
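
The script above passes old_logps and epsilon=0.2 to the training step. A common way such values are used is a PPO-style clipped objective, sketched below for a single token. This is an assumption about the general technique, not a description of how Twinkle's GRPOLoss is implemented; the function name is hypothetical.

```python
import math

def clipped_token_objective(logp_new, logp_old, advantage, epsilon=0.2):
    """PPO-style clipped objective for one token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The policy moved far in the favoured direction: the gain is capped at 1 + epsilon.
capped = clipped_token_objective(math.log(0.9), math.log(0.5), advantage=1.0)
# The policy is unchanged since sampling: the ratio is 1.
unchanged = clipped_token_objective(math.log(0.5), math.log(0.5), advantage=-1.0)
print(capped, unchanged)
```

The clip limits how much a single batch of samples can move the policy, which is why the log-probabilities recorded at sampling time are carried into the update.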

3. Remote Training: Client-Server Architecture

When compute resources and service consumers are separated, as in enterprise training platforms or cloud serverless training services, training capabilities need to be exposed as an API.

Twinkle supports two client integration modes:

Twinkle Client: API nearly identical to local training, suitable for scenarios requiring fine-grained control
Tinker Client: Compatible with the Tinker ecosystem, with a simpler calling style

The server maintains a single base model; multiple clients can train their own LoRA adapters in parallel.
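
The single-base-model arrangement can be pictured as one frozen weight matrix shared by everyone plus a small low-rank pair per client. The sketch below uses plain Python lists; the tenant names, shapes, and the `forward` helper are illustrative and are not part of the Twinkle server.

```python
# Sketch of multi-tenant LoRA: one frozen base weight, one small (A, B) pair
# per tenant. Plain Python lists; names and shapes are illustrative.

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

BASE_W = [[1.0, 0.0], [0.0, 1.0]]  # shared by all tenants and never updated

adapters = {
    'tenant_a': {'A': [[1.0, 1.0]], 'B': [[0.5], [0.0]], 'scale': 1.0},
    'tenant_b': {'A': [[0.0, 2.0]], 'B': [[0.0], [1.0]], 'scale': 1.0},
}

def forward(x, tenant):
    """y = W x + scale * B (A x), using only the caller's adapter."""
    adapter = adapters[tenant]
    low_rank = matvec(adapter['B'], matvec(adapter['A'], x))
    base = matvec(BASE_W, x)
    return [b + adapter['scale'] * d for b, d in zip(base, low_rank)]

x = [1.0, 2.0]
out_a = forward(x, 'tenant_a')
out_b = forward(x, 'tenant_b')
print(out_a, out_b)
```

Because only the small A and B matrices differ per tenant, the expensive base weights are loaded once while each client still gets its own trainable parameters.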

3.1 Twinkle Client: Fine-Grained Control

Twinkle Client provides an API nearly identical to local training, ideal for scenarios that require fine-grained control over the training process.

import dotenv
dotenv.load_dotenv('.env')

from peft import LoraConfig

from twinkle import get_logger
from twinkle.dataset import DatasetMeta
from twinkle_client import init_twinkle_client
from twinkle_client.dataloader import DataLoader
from twinkle_client.dataset import Dataset
from twinkle_client.model import MultiLoraTransformersModel

logger = get_logger()

# Initialize the Twinkle client
client = init_twinkle_client(base_url='http://127.0.0.1:8000', api_key='EMPTY_TOKEN')

# Query existing training runs and checkpoints
runs = client.list_training_runs()
resume_path = None
for run in runs:
    logger.info(run.model_dump_json(indent=2))
    checkpoints = client.list_checkpoints(run.training_run_id)
    for checkpoint in checkpoints:
        logger.info(checkpoint.model_dump_json(indent=2))
        # Uncomment to resume from a specific checkpoint:
        # resume_path = checkpoint.twinkle_path


def train():
    # Prepare dataset
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
    dataset.set_template('Qwen3_5Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
    dataset.map('SelfCognitionProcessor', init_args={'model_name': 'twinkle model', 'model_author': 'ModelScope Community'})
    dataset.encode(batched=True)
    dataloader = DataLoader(dataset=dataset, batch_size=4)

    # Configure model
    model = MultiLoraTransformersModel(model_id='ms://Qwen/Qwen3.5-4B')

    lora_config = LoraConfig(target_modules='all-linear')
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    model.set_template('Qwen3_5Template')
    model.set_processor('InputProcessor', padding_side='right')
    model.set_loss('CrossEntropyLoss')
    model.set_optimizer('AdamW', lr=1e-4)
    model.set_lr_scheduler('LinearLR')

    # Resume from checkpoint if available
    if resume_path:
        logger.info(f'Resuming training from {resume_path}')
        model.load(resume_path, load_optimizer=True)

    logger.info(model.get_train_configs())

    for epoch in range(3):
        logger.info(f'Starting epoch {epoch}')
        for step, batch in enumerate(dataloader):
            # Forward + backward
            output = model.forward_backward(inputs=batch)

            if step % 2 == 0:
                logger.info(f'Current is step {step // 2}, loss: {output}')

            model.clip_grad_norm(1.0)
            model.step()
            model.zero_grad()
            model.lr_step()

        # Save checkpoint
        twinkle_path = model.save(name=f'twinkle-epoch-{epoch}', save_optimizer=True)
        logger.info(f'Saved checkpoint: {twinkle_path}')


if __name__ == '__main__':
    train()

Twinkle Client highlights:

API nearly identical to local training, so there is little additional learning curve
Supports checkpoint management and resume from checkpoint
Dynamically swap LoRA adapters, loss functions, and optimizer components

3.2 Tinker Client: Simple and Ready to Use

Tinker is a lightweight training API. Twinkle provides full support for the Tinker client — a few lines of code is all it takes to start training. Existing Tinker-based projects can be migrated directly to a Twinkle server.

import os
from tinker import types
from tqdm import tqdm

from twinkle import init_tinker_client
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor
from twinkle.server.common import input_feature_to_datum

# Initialize Tinker client (must be called before importing ServiceClient)
init_tinker_client()

from tinker import ServiceClient

# Base model
base_model = 'Qwen/Qwen3.5-4B'
base_url = 'https://www.modelscope.cn/twinkle'


def train():
    # Prepare dataset
    dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
    dataset.set_template('Qwen3_5Template', model_id=f'ms://{base_model}', max_length=256)
    dataset.map(SelfCognitionProcessor('Twinkle Model', 'ModelScope Team'), load_from_cache_file=False)
    dataset.encode(batched=True, load_from_cache_file=False)
    dataloader = DataLoader(dataset=dataset, batch_size=8)

    # Initialize training client
    service_client = ServiceClient(
        base_url=base_url,
        api_key=os.environ.get('MODELSCOPE_TOKEN')
    )
    training_client = service_client.create_lora_training_client(base_model=base_model, rank=16)

    # Training loop
    for epoch in range(3):
        print(f'Epoch {epoch}')
        for step, batch in tqdm(enumerate(dataloader)):
            # Convert input format
            input_datum = [input_feature_to_datum(input_feature) for input_feature in batch]

            # Remote forward + backward
            fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy')
            # Remote optimizer step
            optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

            # Wait for results
            fwdbwd_result = fwdbwd_future.result()
            optim_result = optim_future.result()
            print(f'Training Metrics: {optim_result}')

        # Save checkpoint
        save_future = training_client.save_state(f'twinkle-lora-{epoch}')
        save_result = save_future.result()
        print(f'Saved checkpoint to {save_result.path}')


if __name__ == '__main__':
    train()

Tinker Client highlights:

Minimal API surface, easy to get started
Fully compatible with the Tinker ecosystem — existing code migrates seamlessly
Supports ModelScope’s official training environment (see below)
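
The Tinker example above submits forward_backward and optim_step before waiting on either result. The sketch below imitates that submit-then-wait pattern with the standard library. FakeTrainingClient is a hypothetical stand-in, not the Tinker API, and the single-worker queue is only an assumption about how a server might keep requests in order.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeTrainingClient:
    """Hypothetical stand-in for a remote client; it is not the Tinker API."""

    def __init__(self):
        # One worker executes requests strictly in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.log = []

    def _run(self, name):
        self.log.append(name)
        return f'{name}: done'

    def forward_backward(self):
        return self._pool.submit(self._run, 'forward_backward')

    def optim_step(self):
        return self._pool.submit(self._run, 'optim_step')

    def close(self):
        self._pool.shutdown(wait=True)

client = FakeTrainingClient()
fwdbwd_future = client.forward_backward()  # returns immediately
optim_future = client.optim_step()         # queued behind forward_backward
results = [fwdbwd_future.result(), optim_future.result()]
client.close()
print(client.log, results)
```

Submitting both requests up front lets the second one be queued while the first is still running, instead of paying a full round trip between them.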

3.3 ModelScope Official Training Environment

Alongside the open-source release of Twinkle, ModelScope provides a hosted model training service (Training as a Service, TaaS) powered by its own compute infrastructure. Developers can access Twinkle’s training capabilities for free via API, without provisioning any GPUs.

How to use:

Register a ModelScope account.
Obtain your API key (the Tinker Client code above reads it from the MODELSCOPE_TOKEN environment variable).
Use the Tinker Client code above with the following endpoint:

base_url = 'https://www.modelscope.cn/twinkle'
base_model = 'Qwen/Qwen3.5-4B' # Model currently deployed in the official environment

4. Choosing the Right Training Mode

Scenario	Recommended Approach	Key Advantage
Local experimentation	Single GPU / torchrun	Code-as-config, high debugging efficiency
Large-scale distributed training	torchrun + FSDP2 / Ray	Native parallel performance, production-ready
Enterprise training platform	Twinkle Client + self-hosted server	Multi-tenant isolation, fine-grained control
Rapid prototyping	Tinker Client + ModelScope TaaS	Zero resource setup, instant access
Existing Tinker codebase	Tinker Client	Seamless migration, ecosystem compatibility

Recommendations:

If you are an algorithm researcher who frequently iterates on the training pipeline, start with torchrun mode and consider moving to a service-based setup once experiments are validated.
If you are a platform engineer building an internal training service, deploy Twinkle Server and offer both Twinkle Client and Tinker Client based on your users’ preferences.
If you just want to try Twinkle quickly, use the ModelScope official environment — get your first training run done in 5 minutes.

Twinkle’s design philosophy is to give you the building blocks, not make the decisions for you. Whether you need maximum performance at scale or maximum convenience via API, there’s a solution that fits.

Embedding Training

Twinkle supports contrastive embedding model training with InfoNCE loss, in-batch negatives, and cross-rank gathering. This guide demonstrates how to train embedding models using Twinkle.

Overview

Embedding training in Twinkle uses the following core components:

Component	Role
`InfonceLoss`	Contrastive loss with in-batch negatives
`EmbeddingMetric`	Tracks pos/neg similarity and loss
`TransformersModel`	Trainable embedding model (with LoRA or full)
`InputProcessor`	Processes anchor/positive pairs into features

Data Format

Each training sample is an (anchor, positive) pair. In the embedding feature tensor, the pairs are flattened into one sequence:

embeddings: [anchor_0, positive_0, anchor_1, positive_1, ...]
labels:     [1,        0,          1,        0,          ...]

labels=1 marks the start of a new group (anchor)
labels=0 marks positives/negatives within the group
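
The labelling convention can be read back into groups with a short loop. This is a sketch in plain Python; `split_groups` is a hypothetical helper written for this example and is not part of Twinkle.

```python
def split_groups(items, labels):
    """Split a flat list into groups; label 1 opens a new group (the anchor)."""
    groups = []
    for item, label in zip(items, labels):
        if label == 1:
            groups.append({'anchor': item, 'others': []})
        elif not groups:
            raise ValueError('the first label must be 1 (an anchor)')
        else:
            groups[-1]['others'].append(item)
    return groups

items = ['anchor_0', 'positive_0', 'anchor_1', 'positive_1', 'negative_1']
labels = [1, 0, 1, 0, 0]
groups = split_groups(items, labels)
print(groups)
```

Since a label of 1 always starts a new group, groups of different sizes can share one flat tensor without a separate length field.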

Basic Embedding Training

A minimal embedding training script with DDP:

import twinkle
from twinkle import DeviceGroup, DeviceMesh, get_logger
from twinkle.dataloader import DataLoader
from twinkle.loss import InfonceLoss
from twinkle.metric import EmbeddingMetric
from twinkle.model import TransformersModel
from twinkle.processor import InputProcessor
from twinkle.template import Qwen3_5Template

logger = get_logger()

# --- Configuration ---
MODEL_ID = 'ms://Qwen/Qwen3.5-4B'
MODEL_GPUS = 4
BATCH_SIZE = 32
LEARNING_RATE = 1e-5
TEMPERATURE = 0.07
EMB_MAX_LENGTH = 8192
TOTAL_STEPS = 1000  # placeholder (not from the original script): planned number of optimizer steps

# --- Initialize ---
device_groups = [
    DeviceGroup(name='model', ranks=list(range(MODEL_GPUS)), device_type='GPU'),
]
model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS)
twinkle.initialize(mode='ray', nproc_per_node=MODEL_GPUS, groups=device_groups)

# --- Model ---
model = TransformersModel(
    model_id=MODEL_ID,
    device_mesh=model_mesh,
    remote_group='model',
    ddp_config={'find_unused_parameters': True},
)
model.set_processor(InputProcessor)
model.set_loss(InfonceLoss, temperature=TEMPERATURE, use_batch=True)
model.set_optimizer(optimizer_cls='AdamW', lr=LEARNING_RATE)
model.set_lr_scheduler(
    scheduler_cls='CosineWarmupScheduler',
    num_warmup_steps=200,
    num_training_steps=TOTAL_STEPS,
)
model.add_metric(EmbeddingMetric, is_training=True)

# --- Template ---
template = Qwen3_5Template(
    model_id=MODEL_ID,
    max_length=EMB_MAX_LENGTH,
    enable_thinking=False,
)

# --- Training Loop ---
# `dataloader` is assumed to be a twinkle DataLoader (batch_size=BATCH_SIZE) over a dataset of
# anchor/positive pairs encoded with `template`; its construction is omitted from this script.
for step, batch in enumerate(dataloader):
    # batch: list of features with anchor/positive pairs
    model.forward_backward(inputs=batch, task='embedding')
    model.clip_grad_and_step(gradient_accumulation_steps=1)

    if step % 10 == 0:
        metric = model.calculate_metric(is_training=True)
        logger.info(f'Step {step}: {metric}')

Key Parameters

Parameter	Recommended	Description
`temperature`	0.05–0.1	Lower = sharper contrast. 0.07 keeps gradients flowing until cosine > 0.75
`use_batch`	True	Enables cross-sample in-batch negatives for better efficiency
`hard_negatives`	None or 7	Fix negative count per sample; None uses all in-batch
`find_unused_parameters`	True	Required for embedding models (only last hidden state contributes gradients)
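
To show what temperature and in-batch negatives do, here is a minimal InfoNCE sketch in plain Python. It is not Twinkle's InfonceLoss: the function names are illustrative, and it assumes that the positives of the other pairs in the batch act as negatives for each anchor.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss; positives of other pairs serve as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor
loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
loss_warm = info_nce(anchors, aligned, temperature=1.0)
print(loss_aligned, loss_swapped, loss_warm)
```

Correctly paired embeddings give a loss near zero at temperature 0.07, mismatched pairs are penalized heavily, and a higher temperature flattens the contrast so the same aligned batch still reports a noticeable loss.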

Monitoring

The EmbeddingMetric reports key training signals:

Metric	What it means
`pos_sim`	Average anchor-positive cosine similarity (target: > 0.8)
`neg_sim`	Average anchor-negative similarity (target: < 0.3)
`loss`	InfoNCE loss value
`grad_norm`	Gradient magnitude

Healthy training shows pos_sim rising and neg_sim stable or falling. If pos_sim saturates near 1.0, lower the temperature.
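
The two similarity signals can be reproduced by hand on a toy batch. The sketch below is plain Python and is not the EmbeddingMetric implementation; treating the other pairs' positives as the negatives is an assumption, and the 0.8 and 0.3 thresholds are the targets from the table above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_report(anchors, positives):
    """Return (pos_sim, neg_sim) averaged over the batch."""
    pos, neg = [], []
    for i, anchor in enumerate(anchors):
        for j, candidate in enumerate(positives):
            (pos if i == j else neg).append(cosine(anchor, candidate))
    return sum(pos) / len(pos), sum(neg) / len(neg)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
pos_sim, neg_sim = similarity_report(anchors, positives)
healthy = pos_sim > 0.8 and neg_sim < 0.3
print(round(pos_sim, 3), round(neg_sim, 3), healthy)
```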