Multimodal SFT (VLM)

Vision-language model fine-tuning with image inputs (e.g. LaTeX OCR, Gemma4).

import twinkle
from twinkle import DeviceMesh
from twinkle.cli import CLI
from …

DPO (Preference Optimization)

Direct Preference Optimization: align models using human preference data without training a separate reward model. Supports sigmoid/hinge/IPO/SimPO/ORPO/CPO variants.

import …
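To make the variant names above concrete, here is a minimal standalone sketch of three of them (sigmoid, hinge, IPO) for a single preference pair. It is not Twinkle's implementation: the function name `preference_loss`, its parameters, and the use of summed sequence log-probabilities are assumptions for illustration, and libraries differ in details such as length normalization for IPO.

```python
import math


def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    beta=0.1, variant="sigmoid"):
    """Loss for one (chosen, rejected) pair, given sequence log-probabilities.

    Hypothetical helper for illustration only; not part of any library API.
    """
    # Margin: how much more the policy favors the chosen response over the
    # rejected one, relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))

    if variant == "sigmoid":
        # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
        x = -beta * margin
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    if variant == "hinge":
        # Zero loss once the scaled margin reaches 1.
        return max(0.0, 1.0 - beta * margin)
    if variant == "ipo":
        # Squared distance of the margin from the target 1 / (2 * beta).
        return (margin - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown variant: {variant!r}")
```

All three variants depend on the policy only through the margin against the reference model, which is why no explicit reward model appears anywhere in the computation.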

Multi-Turn RL (OpenEnv)

Multi-turn GRPO with interactive environments: the agent takes actions via tool calls and learns from episode rewards.

import twinkle
from twinkle import …

On-Policy Distillation (GKD)

Generalized Knowledge Distillation: the student generates on-policy, the teacher provides top-k logprobs, and the student learns to match the teacher’s distribution.

import twinkle …
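The matching step described above can be sketched for a single token position as follows. This is an illustrative assumption, not Twinkle's code: the helper name `topk_kl` is hypothetical, the divergence shown is a plain forward KL (GKD in general allows other divergences), and the choice to renormalize both distributions over the teacher's top-k tokens is one possible way to handle the probability mass outside the top-k.

```python
import math


def topk_kl(teacher_topk, student_logprobs):
    """KL(teacher || student) at one position, restricted to the teacher's top-k tokens.

    teacher_topk:     dict of token_id -> teacher log-probability (k entries)
    student_logprobs: dict of token_id -> student log-probability
                      (must contain every token_id in teacher_topk)
    Hypothetical helper for illustration only.
    """
    ids = list(teacher_topk)

    # Log of the total mass each model assigns to the top-k support,
    # used to renormalize both distributions over that support.
    t_norm = math.log(sum(math.exp(teacher_topk[i]) for i in ids))
    s_norm = math.log(sum(math.exp(student_logprobs[i]) for i in ids))

    kl = 0.0
    for i in ids:
        t = teacher_topk[i] - t_norm
        s = student_logprobs[i] - s_norm
        kl += math.exp(t) * (t - s)
    return kl
```

In an on-policy setup this quantity would be averaged over the positions of sequences sampled from the student, so the student is corrected on its own outputs rather than on a fixed dataset.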

Embedding Training

Train embedding models with InfoNCE contrastive loss. Supports both full-parameter and LoRA fine-tuning.

import twinkle
from twinkle import DeviceMesh
from …
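For readers unfamiliar with the loss named above, here is a minimal pure-Python sketch of InfoNCE with in-batch negatives. It is a standalone illustration, not Twinkle's implementation: the function name `info_nce`, the cosine-similarity scoring, and the default temperature of 0.05 are all assumptions.

```python
import math


def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss where queries[i] is paired with positives[i].

    Every other positive in the batch serves as a negative for query i.
    Hypothetical helper for illustration only.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every candidate.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Cross-entropy with the matching positive as the target class,
        # using a max-shifted log-sum-exp for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)
```

The loss is small when each query is closest to its own positive and grows when another item in the batch scores higher, which is what pulls matching pairs together in the embedding space.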

GRPO (Reinforcement Learning)

Group Relative Policy Optimization with vLLM sampling and custom reward functions (e.g. GSM8K math).

import twinkle
from twinkle import DeviceMesh, DeviceGroup, …
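The two ideas in this description, a custom reward function and group-relative scoring, can be sketched without any framework. Both function names below are hypothetical and nothing here reflects Twinkle's actual reward interface. The reward rule (compare the last number in the completion to the gold answer) and the use of the population standard deviation are simplifying assumptions.

```python
import re
import statistics


def math_reward(completion, gold_answer):
    """Return 1.0 if the last number in the completion equals the gold answer, else 0.0.

    Hypothetical reward for GSM8K-style problems; illustration only.
    """
    # Drop thousands separators so "1,234" is read as one number.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(gold_answer) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled completions within the group.

    Each completion is scored relative to its siblings, so no learned
    value function (critic) is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: score several sampled completions for one prompt, then normalize.
completions = ["... so the answer is 18", "... the answer is 20", "... = 18"]
rewards = [math_reward(c, "18") for c in completions]
advantages = group_relative_advantages(rewards)
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason group size and reward design matter in practice.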

EP + MoE (DeepSeek V4 / Qwen3.5 MoE)

Expert-parallel + FSDP2 for Mixture-of-Experts models like DeepSeek V4 and Qwen3.5 MoE.

import twinkle
from twinkle import DeviceMesh, Platform, get_logger
from …

Ascend NPU — Megatron on MindSpeed

Training on Huawei Ascend NPUs using the Megatron backend with MindSpeed integration. Twinkle automatically applies fused NPU operators (RMSNorm, RoPE, SwiGLU, SDPA) via …

Megatron TP Training

Tensor-parallel training via Megatron backend: ideal for large models that don’t fit on a single GPU.

from peft import LoraConfig
import twinkle
from twinkle …

SFT — Transformers FSDP2

Supervised fine-tuning with FSDP2 sharding and the Muon optimizer. Supports both full-parameter and LoRA training.

from torch.optim import Muon
import twinkle …

Shell Launch (torchrun)

The standard way to launch local multi-GPU training with torchrun:

#!/usr/bin/env bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --nproc_per_node=8 fsdp2.py \
    --model-id …
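For context on what a launch like the one above gives each process: torchrun starts one worker per GPU and passes its position to it through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows how a training script could read them. The helper name `read_torchrun_env` is hypothetical, and whether `fsdp2.py` reads these variables directly or leaves that to Twinkle is not stated here.

```python
import os


def read_torchrun_env(env=None):
    """Read the process layout that torchrun exposes to each worker.

    Falls back to single-process defaults so the same script also runs
    when started with plain `python`. Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node, usually the GPU id
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }
```

With `--nproc_per_node=8` on a single node, the eight workers would see `WORLD_SIZE=8` and ranks 0 through 7.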