Quick Start
Get Twinkle up and running in minutes.
What is Twinkle?
Twinkle is a production-oriented large model training framework with a component-based architecture:
- Loosely Coupled Architecture · Standardized Interfaces
- Multiple Runtime Modes · torchrun / Ray / HTTP
- Multi-Framework Compatible · Transformers / Megatron
- Multi-Tenant Support · Single Base Model Deployment
When to Choose Twinkle
- You want to understand model mechanisms and training methods
- You are a researcher who wants to customize models or training algorithms
- You prefer writing explicit training loops
- You need to build enterprise-level training platforms
When to Choose ms-swift
- You don’t need control over the training process and just want to provide data
- You need more model support and dataset varieties
- You need inference, deployment, quantization capabilities
- You need day-0 support for new models
Prerequisites
| Component | Requirement | Notes |
|---|---|---|
| Python | >= 3.11, < 3.13 | 3.11 recommended |
| PyTorch | >= 2.0 | 2.7.1 for NPU |
| GPU | A10/A100/H100/RTX | T4/V100 limited support |
| NPU | Ascend 910 series | Optional |
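The Python range in the table can be checked before installing. The following is a minimal sketch; the helper name `python_version_supported` is hypothetical and not part of Twinkle.

```python
import sys


def python_version_supported(version_info=None):
    """Return True if the interpreter is in the documented range (>= 3.11, < 3.13)."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 11) <= major_minor < (3, 13)


if __name__ == "__main__":
    status = "supported" if python_version_supported() else "outside the documented range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

Passing an explicit tuple such as `(3, 12, 0)` makes the check usable in setup scripts or CI without depending on the running interpreter.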
Installation
Install Twinkle
# Install from PyPI
pip install 'twinkle-kit'
# Or install from source
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e .
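To confirm that the installation succeeded, the installed distribution can be looked up with the standard library. This is a sketch only: `installed_version` is a hypothetical helper, and it assumes that a source install registers the same distribution name, `twinkle-kit`, as the PyPI package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # 'twinkle-kit' is the distribution name used in the pip command above.
    found = installed_version("twinkle-kit")
    print(f"twinkle-kit: {found or 'not installed'}")
```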
Install Client (Optional)
If you need to use Twinkle’s Client for remote training:
# Mac or Linux
sh INSTALL_CLIENT.sh
# Windows (PowerShell)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
.\INSTALL_CLIENT.ps1
Install Megatron Dependencies (Optional)
For ultra-large-scale model training with the Megatron backend:
sh INSTALL_MEGATRON.sh
Your First Training
Single GPU Training
from peft import LoraConfig
from twinkle import get_device_placement, get_logger
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.model import TransformersModel
from twinkle.preprocessor import SelfCognitionProcessor

logger = get_logger()


def train():
    # Load dataset (1000 samples)
    dataset = Dataset(dataset_meta=DatasetMeta(
        'ms://swift/self-cognition',
        data_slice=range(1000)
    ))
    # Set template for encoding
    dataset.set_template('Template', model_id='ms://Qwen/Qwen3.5-4B')
    # Preprocess to standard format
    dataset.map(SelfCognitionProcessor('Twinkle LLM', 'ModelScope'))
    # Encode dataset
    dataset.encode()
    # Create dataloader (global batch size = 8)
    dataloader = DataLoader(dataset=dataset, batch_size=8)

    # Load model
    model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')
    # Add LoRA adapter
    lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear')
    model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)
    # Set optimizer and scheduler
    model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
    model.set_lr_scheduler(
        scheduler_cls='CosineWarmupScheduler',
        num_warmup_steps=5,
        num_training_steps=len(dataloader)
    )

    # Training loop
    for step, batch in enumerate(dataloader):
        model.forward_backward(inputs=batch)
        model.clip_grad_and_step()
        if step % 20 == 0:
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Step {step}/{len(dataloader)}, metric: {metric}')
    model.save('last-checkpoint')


if __name__ == '__main__':
    train()
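The example configures a scheduler named `CosineWarmupScheduler` with 5 warmup steps and `len(dataloader)` training steps. How Twinkle implements it is not shown here; the sketch below uses the common formulation (linear warmup from zero, then cosine decay to zero) purely to illustrate the shape such a schedule usually has. The function `cosine_warmup_lr` is hypothetical, and the batch count assumes 1000 samples at batch size 8.

```python
import math


def cosine_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0 (a common formulation)."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    num_samples, batch_size = 1000, 8
    num_batches = math.ceil(num_samples / batch_size)  # 125 batches per pass
    for step in (0, 5, 65, 125):
        lr = cosine_warmup_lr(step, 1e-4, 5, num_batches)
        print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions the learning rate peaks at `1e-4` once warmup ends at step 5, is half that value midway through the decay, and reaches zero at the final step.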
Multi-GPU Training (8 GPUs)
To scale to multiple GPUs, initialize Twinkle with a DeviceMesh before the training code:
import twinkle
from twinkle import DeviceMesh
# Build device_mesh: fsdp=4, dp=2, using 8 GPUs
device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2)
twinkle.initialize(mode='local', global_device_mesh=device_mesh)
# ... rest of training code remains the same
Run with torchrun:
torchrun --nproc_per_node=8 train.py
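The mesh sizes must multiply to the number of processes that torchrun launches (4 × 2 = 8 here). The sketch below checks that and shows one possible mapping from a global rank to mesh coordinates. Both helpers are hypothetical, and the row-major layout with `dp` as the outer axis is an assumption for illustration; Twinkle's actual rank placement may differ.

```python
import os


def check_world_size(fsdp_size, dp_size, world_size):
    """Raise if the mesh does not cover exactly the launched processes."""
    if fsdp_size * dp_size != world_size:
        raise ValueError(
            f"mesh {dp_size}x{fsdp_size} needs {fsdp_size * dp_size} processes, "
            f"but world size is {world_size}"
        )


def mesh_coords(rank, fsdp_size, dp_size):
    """Map a global rank to (dp_index, fsdp_index), assuming a row-major layout."""
    if not 0 <= rank < fsdp_size * dp_size:
        raise ValueError(f"rank {rank} is outside the mesh")
    return divmod(rank, fsdp_size)


if __name__ == "__main__":
    # torchrun exports WORLD_SIZE and RANK to each worker; the defaults
    # below only make the script runnable with plain `python`.
    world_size = int(os.environ.get("WORLD_SIZE", "8"))
    rank = int(os.environ.get("RANK", "0"))
    check_world_size(fsdp_size=4, dp_size=2, world_size=world_size)
    print(f"rank {rank} -> (dp, fsdp) = {mesh_coords(rank, 4, 2)}")
```

Under this assumed layout, ranks 0 to 3 form the first FSDP group and ranks 4 to 7 the second, with the two groups acting as data-parallel replicas.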
Using Only Partial Components
You can use only a portion of Twinkle’s components with your existing code:
from twinkle.dataset import PackingDataset, DatasetMeta
from twinkle.dataloader import DataLoader
from twinkle.preprocessor import SelfCognitionProcessor

dataset_meta = DatasetMeta(
    dataset_id='ms://swift/self-cognition',
)
dataset = PackingDataset(dataset_meta)
dataset.map(SelfCognitionProcessor(
    model_name='Twinkle Model',
    model_author='ModelScope Community'
))
dataset.set_template('Template', model_id='ms://Qwen/Qwen3.5-4B', max_length=512)
dataset.encode()
dataset.pack_dataset()

dataloader = DataLoader(dataset, batch_size=8)
for data in dataloader:
    # Use with your custom training code
    print(data.keys())  # input_ids, position_ids, ...
    break
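The packed batches above carry `position_ids` alongside `input_ids`. To illustrate why, here is a minimal sketch of greedy sequence packing in plain Python. It is not Twinkle's `pack_dataset` implementation, and `pack_sequences` is a hypothetical name; it only shows the general idea of concatenating short samples up to `max_length` while restarting positions at each sample boundary.

```python
def pack_sequences(sequences, max_length):
    """Greedily concatenate token-id lists into packs of at most max_length tokens."""
    packs = []
    current = {"input_ids": [], "position_ids": []}
    for seq in sequences:
        seq = seq[:max_length]  # truncate samples longer than one pack
        if current["input_ids"] and len(current["input_ids"]) + len(seq) > max_length:
            # The next sample does not fit, so close the current pack.
            packs.append(current)
            current = {"input_ids": [], "position_ids": []}
        current["input_ids"].extend(seq)
        # Positions restart at 0 so each sample keeps its own positional encoding.
        current["position_ids"].extend(range(len(seq)))
    if current["input_ids"]:
        packs.append(current)
    return packs


if __name__ == "__main__":
    for pack in pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=5):
        print(pack)
```

Packing reduces padding waste when samples are much shorter than `max_length`; the restarted positions let attention implementations that support packed inputs tell the samples apart.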
Supported Hardware
| Hardware | Notes |
|---|---|
| GPU A10/A100/H100/RTX | Full support |
| GPU T4/V100 | No bfloat16, no Flash-Attention |
| Ascend NPU | Some operators not supported |
| PPU | Supported |
| CPU | Partial components (dataset, dataloader) |
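The bfloat16 limitation on T4 and V100 follows from their CUDA compute capability: bfloat16 hardware support begins with the Ampere generation (compute capability 8.0). The sketch below picks a half-precision dtype name on that basis. The helper `pick_training_dtype` is hypothetical, and how a dtype is passed to Twinkle is not covered here.

```python
# Compute capabilities of the GPUs named in the table above.
KNOWN_CAPABILITIES = {
    "V100": (7, 0),
    "T4": (7, 5),
    "A100": (8, 0),
    "A10": (8, 6),
    "H100": (9, 0),
}


def pick_training_dtype(compute_capability):
    """Choose a half-precision dtype name from a CUDA (major, minor) capability."""
    major, _minor = compute_capability
    # bfloat16 is supported in hardware from compute capability 8.0 onward.
    return "bfloat16" if major >= 8 else "float16"


if __name__ == "__main__":
    for name, capability in KNOWN_CAPABILITIES.items():
        print(f"{name}: {pick_training_dtype(capability)}")
```

On a machine with PyTorch and a CUDA device, the capability tuple can be read with `torch.cuda.get_device_capability()`.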
Next Steps
- Architecture · Understand the client-server architecture
- Components · Explore Dataset, Model, Sampler, and more
- Runtime Modes · torchrun, Ray, and HTTP deployment
- Server & Client · HTTP training service architecture
- Multi-Tenancy · Train multiple LoRAs on a shared base model
- NPU Support · Ascend NPU training guide
- Training as a Service · Deploy enterprise-grade training services
- Cookbook · FSDP, MoE, RL training examples