NPU Support

Twinkle supports training on Huawei Ascend NPUs. This guide covers installation and usage in NPU environments.

Requirements

Component      Version           Notes
Python         >= 3.11, < 3.13   3.11 recommended
Ascend HDK     Latest            Hardware driver and firmware
CANN Toolkit   8.3.RC1+          ~10GB disk space
PyTorch        2.7.1             Must match torch_npu
torch_npu      2.7.1             Must match PyTorch

Note
torch and torch_npu versions must be exactly the same (e.g., both 2.7.1).
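The Python version constraint can be checked programmatically before installing anything else. A minimal sketch (python_supported is an illustrative helper, not part of Twinkle):

```python
import sys

def python_supported(version_info=sys.version_info):
    # Twinkle requires Python >= 3.11 and < 3.13 (see the table above).
    return (3, 11) <= tuple(version_info[:2]) < (3, 13)

print(python_supported())
```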

Supported Hardware

  • Ascend 910 series
  • Other compatible Ascend accelerator cards

Installation

Install NPU Environment

Follow the torch_npu Official Installation Guide to install:

  • Ascend driver (HDK)
  • CANN toolkit
  • PyTorch and torch_npu

Install Twinkle

git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"

Install vLLM (Optional)

For vLLMSampler support:

pip install vllm==0.11.0
pip install vllm-ascend==0.11.0rc3
Note
Install in the order shown above; dependency conflict warnings during installation can be safely ignored. Before installing or running, activate the CANN environment: source /usr/local/Ascend/ascend-toolkit/set_env.sh

Verify Installation

import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")

Quick Start Examples

SFT LoRA Fine-tuning (4-card DP+FSDP)

Example: cookbook/transformers/fsdp2.py

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/transformers/fsdp2.py

GRPO Reinforcement Learning (8-card)

Example: cookbook/rl/grpo.py

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/rl/grpo.py

DP + FSDP Configuration

import numpy as np
from twinkle import DeviceMesh

# 4 cards: DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
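The mesh layout above generalizes to other card counts: device ids are laid out row-major with shape (dp, fsdp). A short sketch of this pattern (make_mesh is a hypothetical helper for illustration, not a Twinkle API):

```python
import numpy as np

def make_mesh(num_devices, dp, fsdp):
    # Row-major layout: each row is one FSDP shard group within a DP replica.
    assert dp * fsdp == num_devices, "dp * fsdp must equal the device count"
    return np.arange(num_devices).reshape(dp, fsdp)

# 4 cards, DP=2, FSDP=2 -> [[0, 1], [2, 3]], as in the example above
print(make_mesh(4, 2, 2))
```

The resulting array can be passed as the `mesh` argument of `DeviceMesh`; for 8 cards with DP=4, FSDP=2, `make_mesh(8, 4, 2)` yields a 4x2 layout.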

Parallelization Support

Strategy   Description                    NPU Status
DP         Data Parallel                  ✅ Verified
FSDP       Fully Sharded Data Parallel    ✅ Verified
TP         Tensor Parallel (Megatron)     🚧 To be verified
PP         Pipeline Parallel (Megatron)   🚧 To be verified
CP         Context Parallel               🚧 To be verified
EP         Expert Parallel (MoE)          🚧 To be verified

Feature Support Matrix

Feature             GPU   NPU   Example                          Notes
SFT + LoRA          ✅    ✅    cookbook/transformers/fsdp2.py   Verified
GRPO                ✅    ✅    cookbook/rl/grpo.py              Verified
DP Parallelism      ✅    ✅    cookbook/transformers/fsdp2.py   Verified
FSDP Parallelism    ✅    ✅    cookbook/transformers/fsdp2.py   Verified
Ray Distributed     ✅    ✅    cookbook/transformers/fsdp2.py   Verified
TorchSampler        ✅    ✅    cookbook/rl/grpo.py              Verified
vLLMSampler         ✅    ✅    cookbook/rl/grpo.py              Verified
Full Fine-tuning    ✅    🚧    -                                To be verified
QLoRA               ✅    ❌    -                                Quantization not supported
DPO                 ✅    🚧    -                                To be verified
Megatron TP/PP      ✅    🚧    -                                To be verified
Flash Attention     ✅    ⚠️    -                                Some operators not supported

Legend: ✅ Verified | 🚧 To be verified | ⚠️ Partial support | ❌ Not supported

Troubleshooting

torch_npu Version Mismatch

# Check versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
# Use the torch_npu wheel matching your Python version and platform
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
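To check the pairing in a script, compare the base release numbers while ignoring any local build tag. A minimal sketch (versions_match is an illustrative helper, not a packaged tool):

```python
def versions_match(torch_version, torch_npu_version):
    # Compare only the base release, ignoring local build tags such as '+cpu'.
    base = lambda v: v.split('+', 1)[0]
    return base(torch_version) == base(torch_npu_version)

print(versions_match("2.7.1", "2.7.1"))  # True
```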

CANN Compatibility

Check the Ascend Community version compatibility table to confirm that your driver, CANN, and torch_npu versions work together.

Debug Logging

export ASCEND_GLOBAL_LOG_LEVEL=1
python your_script.py

Resources

See the Twinkle docs for further details.