Supported Models

Twinkle supports any model compatible with HuggingFace Transformers or Megatron-LM. Below is a curated list of models tested with Twinkle.

Language Models

| Model Family | Model IDs | Parameters | Features |
| --- | --- | --- | --- |
| Qwen 3.5 | Qwen/Qwen3.5-0.6B ~ Qwen/Qwen3.5-235B-A22B | 0.6B–235B | MoE, Thinking mode |
| Qwen 2.5 | Qwen/Qwen2.5-0.5B ~ Qwen/Qwen2.5-72B | 0.5B–72B | Dense |
| DeepSeek V4 | deepseek-ai/DeepSeek-V4 | 685B MoE | Custom DSML encoding |
| DeepSeek R1 | deepseek-ai/DeepSeek-R1 | 685B MoE | Reasoning |
| LLaMA 3 | meta-llama/Llama-3.3-70B-Instruct | 8B–70B | Dense |
| Mistral | mistralai/Mistral-7B-v0.3 | 7B | Dense |
| Yi | 01-ai/Yi-1.5-34B | 6B–34B | Dense |
| GLM-4 | THUDM/glm-4-9b-chat | 9B | Dense |
| InternLM 2.5 | internlm/internlm2_5-7b-chat | 7B–20B | Dense |

Vision-Language Models

| Model Family | Model IDs | Features |
| --- | --- | --- |
| Qwen 3.5 VL | Qwen/Qwen3.5-VL-3B ~ Qwen/Qwen3.5-VL-72B | Image, Video |
| Qwen 2.5 VL | Qwen/Qwen2.5-VL-7B-Instruct | Image, Video |
| InternVL 2.5 | OpenGVLab/InternVL2_5-8B | Image |

Embedding Models

| Model Family | Model IDs | Training Method |
| --- | --- | --- |
| Qwen3 Embedding | Qwen/Qwen3-Embedding-0.6B | InfoNCE contrastive |
| GTE | thenlper/gte-large-zh | InfoNCE contrastive |
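
For readers unfamiliar with the InfoNCE objective named in the table, the following is a minimal, dependency-free sketch of the loss with in-batch negatives. It is an illustration of the general technique, not Twinkle's implementation: the function names, the use of cosine similarity, and the `temperature=0.05` default are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for that query.
    The temperature value is an illustrative assumption.
    """
    if len(queries) != len(documents):
        raise ValueError("queries and documents must have the same length")
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching document as the target class.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss approaches zero when each query is far more similar to its own document than to the others, and grows when a negative scores higher than the positive.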

Model Loading

Models can be loaded from ModelScope, HuggingFace, or a local path:

from twinkle.model import TransformersModel

# From ModelScope (ms:// prefix)
model = TransformersModel(model_id='ms://Qwen/Qwen3.5-4B')

# From HuggingFace (hf:// prefix)
model = TransformersModel(model_id='hf://meta-llama/Llama-3.3-70B-Instruct')

# Local path
model = TransformersModel(model_id='/path/to/model')
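
The prefix convention above can be illustrated with a small standalone parser. This is only a sketch of how such identifiers might be split: `parse_model_id` is a hypothetical helper, not part of Twinkle's API, and treating every unprefixed string as a local path is an assumption, since the text does not say how unprefixed hub-style IDs are resolved.

```python
def parse_model_id(model_id):
    """Split a model_id into (source, identifier).

    Hypothetical helper for illustration only. The two prefixes come from
    the examples above; the "local" fallback is an assumption.
    """
    prefixes = (("ms://", "modelscope"), ("hf://", "huggingface"))
    for prefix, source in prefixes:
        if model_id.startswith(prefix):
            return source, model_id[len(prefix):]
    return "local", model_id
```

With this sketch, `parse_model_id('ms://Qwen/Qwen3.5-4B')` returns `('modelscope', 'Qwen/Qwen3.5-4B')`, and `parse_model_id('/path/to/model')` returns `('local', '/path/to/model')`.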

Framework Support

| Framework | Class | Use Case |
| --- | --- | --- |
| Transformers | TransformersModel | General training (SFT, RLHF, DPO) |
| Transformers + Multi-LoRA | MultiLoraTransformersModel | Multi-tenant training |
| Megatron-LM | MegatronModel | Large-scale distributed pre-training |
| Megatron + Multi-LoRA | MultiLoraMegatronModel | Large-scale multi-tenant |

Precision Support

| Mode | Description |
| --- | --- |
| bf16 | BFloat16 mixed precision (recommended for A100/H100) |
| fp16 | Float16 mixed precision (for older GPUs) |
| fp8 | FP8 precision (H100 with Transformer Engine) |
| no | Full precision (debugging only) |
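
The guidance in the table can be sketched as a small selection function. This is a hypothetical helper, not a Twinkle API. It assumes the GPU is described by its CUDA compute capability, with 8.0 (A100) as the threshold for bf16 and 9.0 (H100) as the threshold for fp8; those thresholds are this example's reading of the table, not documented behavior.

```python
def suggest_precision(compute_capability, use_fp8=False, debug=False):
    """Suggest a precision mode string from the table above.

    compute_capability: (major, minor) tuple, e.g. (8, 0) for A100.
    use_fp8: opt-in flag; the caller is responsible for confirming that
             Transformer Engine is installed.
    debug: request full precision for debugging.
    All thresholds here are illustrative assumptions.
    """
    if debug:
        return "no"
    if use_fp8 and compute_capability >= (9, 0):
        return "fp8"
    if compute_capability >= (8, 0):
        return "bf16"
    return "fp16"
```

For example, `suggest_precision((8, 0))` returns `'bf16'`, while a pre-Ampere card such as a V100, `suggest_precision((7, 0))`, falls back to `'fp16'`.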

Parallelism Strategies

| Strategy | Config Key | Description |
| --- | --- | --- |
| FSDP | strategy=accelerate | Accelerate-managed FSDP (default) |
| Native FSDP | strategy=native_fsdp | PyTorch native FSDP |
| Tensor Parallel | tp_size | Split layers across GPUs |
| Pipeline Parallel | pp_size | Split model stages |
| Data Parallel | dp_size | Replicate model, split data |
| Sequence Parallel | sequence_parallel | Split long sequences |
| Expert Parallel | ep_size | MoE expert distribution |
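
When combining tensor, pipeline, and data parallelism, the sizes usually have to multiply out to the total number of GPUs. The sketch below shows that arithmetic under the common Megatron-style convention `world_size = tp_size * pp_size * dp_size`. The helper `derive_dp_size` is hypothetical, it ignores sequence and expert parallelism, and whether Twinkle derives `dp_size` this way is an assumption.

```python
def derive_dp_size(world_size, tp_size=1, pp_size=1):
    """Return the data-parallel size implied by the other sizes.

    Assumes world_size = tp_size * pp_size * dp_size. Hypothetical helper;
    sequence and expert parallelism are deliberately left out.
    """
    if min(world_size, tp_size, pp_size) < 1:
        raise ValueError("all sizes must be positive integers")
    model_parallel_size = tp_size * pp_size
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * pp_size = {model_parallel_size}"
        )
    return world_size // model_parallel_size
```

For instance, 16 GPUs with `tp_size=2` and `pp_size=4` leave room for two data-parallel replicas, so `derive_dp_size(16, tp_size=2, pp_size=4)` returns `2`.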