Cookbook | Twinkle – LLM Training Framework by ModelScope

Multimodal SFT (VLM)

Vision-language model fine-tuning with image inputs (e.g. LaTeX OCR, Gemma4). Source excerpt: import twinkle; from twinkle import DeviceMesh; from twinkle.cli import CLI; from …

Jun 20, 2026 • 1 min read

DPO (Preference Optimization)

Direct Preference Optimization: align models using human preference data without training a separate reward model. Supports the sigmoid, hinge, IPO, SimPO, ORPO, and CPO loss variants. Source excerpt: import …

Jun 15, 2026 • 1 min read
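
To make the DPO entry above concrete, here is a minimal sketch of the sigmoid variant of the loss for a single preference pair. It uses only the Python standard library and is not Twinkle's implementation; the function name, argument names, and the default beta are assumptions chosen for illustration.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. `beta` is a hypothetical
    default, not a value taken from the cookbook.
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does.
    print(dpo_sigmoid_loss(-5.0, -15.0, -10.0, -10.0))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the chosen response.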

Multi-Turn RL (OpenEnv)

Multi-turn GRPO with interactive environments: the agent takes actions via tool calls and learns from episode rewards. Source excerpt: import twinkle; from twinkle import …

Jun 8, 2026 • 1 min read

On-Policy Distillation (GKD)

Generalized Knowledge Distillation: the student generates on-policy, the teacher provides top-k logprobs for those samples, and the student learns to match the teacher’s distribution. Source excerpt: import twinkle …

Jun 2, 2026 • 1 min read
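
The distribution-matching step described in the GKD entry above can be sketched for a single token position as follows. This is a simplified illustration, not the cookbook's code: it assumes a forward KL divergence restricted to the teacher's top-k tokens, with both distributions renormalized over that set, whereas GKD in general allows other divergences. The function name and the dict-based inputs are hypothetical.

```python
import math


def topk_forward_kl(teacher_topk_logprobs, student_logprobs):
    """KL(teacher || student) over the teacher's top-k tokens at one position.

    teacher_topk_logprobs: dict mapping token id -> teacher logprob (top-k only).
    student_logprobs: dict mapping token id -> student logprob; it must contain
    every token id that appears in the teacher's top-k.
    """
    tokens = list(teacher_topk_logprobs)
    # Renormalize both distributions over the top-k support (an assumption
    # of this sketch, since the teacher's tail mass is unavailable).
    t_log_z = math.log(sum(math.exp(teacher_topk_logprobs[t]) for t in tokens))
    s_log_z = math.log(sum(math.exp(student_logprobs[t]) for t in tokens))
    kl = 0.0
    for t in tokens:
        t_lp = teacher_topk_logprobs[t] - t_log_z
        s_lp = student_logprobs[t] - s_log_z
        kl += math.exp(t_lp) * (t_lp - s_lp)
    return kl


if __name__ == "__main__":
    teacher = {1: math.log(0.5), 2: math.log(0.3), 3: math.log(0.1)}
    student = {1: math.log(0.2), 2: math.log(0.2), 3: math.log(0.4), 4: math.log(0.2)}
    print(topk_forward_kl(teacher, student))
```

In a training loop this quantity would be averaged over the positions of sequences sampled from the student, which is what makes the procedure on-policy.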

Embedding Training

Train embedding models with InfoNCE contrastive loss. Supports both full-parameter and LoRA fine-tuning. Source excerpt: import twinkle; from twinkle import DeviceMesh; from …

May 28, 2026 • 1 min read
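
As a rough illustration of the InfoNCE loss named in the Embedding Training entry above, the sketch below computes it with in-batch negatives in plain Python. It assumes cosine similarity, that document i is the positive for query i, and a temperature of 0.05; none of these choices are confirmed by the cookbook, and the function name is hypothetical.

```python
import math


def info_nce(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss where doc_embs[i] is the positive for query_embs[i].

    All other documents in the batch act as negatives. Embeddings are lists
    of floats and are assumed to be non-zero vectors.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    queries = [normalize(v) for v in query_embs]
    docs = [normalize(v) for v in doc_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity to every document, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        # Cross-entropy against index i, using a stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    docs = [[0.9, 0.1], [0.2, 0.8]]
    print(info_nce(queries, docs))
```

The loss approaches zero when each query is much closer to its own document than to any other document in the batch, and grows when the pairs are mismatched.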

GRPO (Reinforcement Learning)

Group Relative Policy Optimization with vLLM sampling and custom reward functions (e.g. GSM8K math). Source excerpt: import twinkle; from twinkle import DeviceMesh, DeviceGroup, …

May 22, 2026 • 1 min read
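
The "group relative" part of the GRPO entry above refers to scoring each sampled completion against the other completions for the same prompt. A minimal sketch of that normalization, assuming the population standard deviation and a small epsilon for stability (both illustrative choices, as is the function name):

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation. `eps` avoids division by zero when every completion
    in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one math prompt: 1.0 if the answer is correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline comes from the group itself, no separate value model is needed; a group in which every completion earns the same reward yields zero advantages and so contributes no learning signal.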

EP + MoE (DeepSeek V4 / Qwen3.5 MoE)

Expert-parallel + FSDP2 for Mixture-of-Experts models like DeepSeek V4 and Qwen3.5 MoE. Source excerpt: import twinkle; from twinkle import DeviceMesh, Platform, get_logger; from …

May 18, 2026 • 1 min read

Ascend NPU — Megatron on MindSpeed

Training on Huawei Ascend NPUs using the Megatron backend with MindSpeed integration. Twinkle automatically applies fused NPU operators (RMSNorm, RoPE, SwiGLU, SDPA) via …

May 12, 2026 • 1 min read

Megatron TP Training

Tensor-parallel training via the Megatron backend, suited to large models that don’t fit on a single GPU. Source excerpt: from peft import LoraConfig; import twinkle; from twinkle …

May 10, 2026 • 1 min read

SFT — Transformers FSDP2

Supervised fine-tuning with FSDP2 sharding and the Muon optimizer. Supports both full-parameter and LoRA training. Source excerpt: from torch.optim import Muon; import twinkle …

May 5, 2026 • 1 min read

Shell Launch (torchrun)

The standard way to launch local multi-GPU training with torchrun:

    #!/usr/bin/env bash
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    torchrun --nproc_per_node=8 fsdp2.py \
        --model-id …

May 1, 2026 • 1 min read
