NPU Support

Twinkle supports training on Huawei Ascend NPUs. This guide covers installation and usage in NPU environments.

Requirements

Component	Version	Notes
Python	>= 3.11, < 3.13	3.11 recommended
Ascend HDK	Latest	Hardware driver and firmware
CANN Toolkit	8.3.RC1+	~10GB disk space
PyTorch	2.7.1	Must match torch_npu
torch_npu	2.7.1	Must match PyTorch

Note

The torch and torch_npu versions must match exactly (for example, both 2.7.1).
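The exact-match requirement can be checked programmatically before a job starts. The sketch below is only an illustration and not part of Twinkle: `release_tuple` and `versions_match` are hypothetical helpers, and the code assumes that build suffixes such as `+cpu` or `.post1` can be ignored when comparing.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release components of a version string.

    Assumption: suffixes such as '+cpu', '.post1' or 'rc1' are not relevant
    for the torch / torch_npu match and are dropped.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split(".")[:3])


def versions_match(torch_version: str, torch_npu_version: str) -> bool:
    """True when both packages report the same major.minor.patch release."""
    return release_tuple(torch_version) == release_tuple(torch_npu_version)
```

After importing both packages, `versions_match(torch.__version__, torch_npu.__version__)` should return `True` in a correctly set up environment.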

Supported Hardware

Ascend 910 series
Other compatible Ascend accelerator cards

Installation

Install NPU Environment

Follow the torch_npu Official Installation Guide to install:

Ascend driver (HDK)
CANN toolkit
PyTorch and torch_npu

Install Twinkle

git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"

Install vLLM (Optional)

For vLLMSampler support:

pip install vllm==0.11.0
pip install vllm-ascend==0.11.0rc3

Note

Install the packages in the order shown above; dependency conflict warnings from pip can be ignored. Activate the CANN environment first: source /usr/local/Ascend/ascend-toolkit/set_env.sh

Verify Installation

import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")

Quick Start Examples

SFT LoRA Fine-tuning (4-card DP+FSDP)

Example: cookbook/transformers/fsdp2.py

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/transformers/fsdp2.py

GRPO Reinforcement Learning (8-card)

Example: cookbook/rl/grpo.py

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/rl/grpo.py
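Both examples rely on ASCEND_RT_VISIBLE_DEVICES exposing the expected number of cards. A script could verify this early, as in the sketch below. The helpers `parse_visible_devices` and `check_card_count` are hypothetical and not part of Twinkle, and the code assumes the variable holds a plain comma-separated list of integer device IDs, as in the commands above.

```python
import os


def parse_visible_devices(value: str) -> list[int]:
    """Parse a comma-separated device list such as '0,1,2,3' into integers."""
    ids = [int(part) for part in value.split(",") if part.strip()]
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate device IDs in {value!r}")
    return ids


def check_card_count(expected: int, env=None) -> list[int]:
    """Return the visible device IDs, raising if their count differs from `expected`."""
    env = os.environ if env is None else env
    ids = parse_visible_devices(env.get("ASCEND_RT_VISIBLE_DEVICES", ""))
    if len(ids) != expected:
        raise RuntimeError(
            f"expected {expected} visible NPUs, found {len(ids)}: {ids}"
        )
    return ids
```

For instance, calling `check_card_count(4)` at the top of a 4-card script fails fast when the variable is unset or lists a different number of devices.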

DP + FSDP Configuration

import numpy as np
from twinkle import DeviceMesh

# 4 cards: DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
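The mesh array in the configuration above lists device ranks row by row: each row is one DP group, and the entries within a row form an FSDP group. The sketch below generates that layout for other card counts. `mesh_layout` is a hypothetical helper, not a Twinkle API, and it assumes ranks are assigned contiguously as in the 4-card example.

```python
def mesh_layout(world_size: int, dp_size: int) -> list[list[int]]:
    """Arrange ranks 0..world_size-1 into dp_size rows of equal length.

    Each row is one data-parallel group; the row length is the FSDP size.
    """
    if dp_size <= 0 or world_size % dp_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by dp_size={dp_size}"
        )
    fsdp_size = world_size // dp_size
    return [
        list(range(row * fsdp_size, (row + 1) * fsdp_size))
        for row in range(dp_size)
    ]


print(mesh_layout(4, 2))  # [[0, 1], [2, 3]]
print(mesh_layout(8, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Wrapping the result in `np.array(...)` yields the same array as the 4-card example, so it could be passed as `mesh` in the same way. Whether a given DP/FSDP split performs well on a particular machine still has to be checked by running the job.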

Parallelization Support

Strategy	Description	NPU Support	Status
DP	Data Parallel	✅	Verified
FSDP	Fully Sharded Data Parallel	✅	Verified
TP	Tensor Parallel (Megatron)	🚧	To be verified
PP	Pipeline Parallel (Megatron)	🚧	To be verified
CP	Context Parallel	🚧	To be verified
EP	Expert Parallel (MoE)	🚧	To be verified

Feature Support Matrix

Feature	GPU	NPU	Example	Notes
SFT + LoRA	✅	✅	cookbook/transformers/fsdp2.py	Verified
GRPO	✅	✅	cookbook/rl/grpo.py	Verified
DP Parallelism	✅	✅	cookbook/transformers/fsdp2.py	Verified
FSDP Parallelism	✅	✅	cookbook/transformers/fsdp2.py	Verified
Ray Distributed	✅	✅	cookbook/transformers/fsdp2.py	Verified
TorchSampler	✅	✅	cookbook/rl/grpo.py	Verified
vLLMSampler	✅	✅	cookbook/rl/grpo.py	Verified
Full Fine-tuning	✅	🚧	-	To be verified
QLoRA	✅	❌	-	Quantization not supported
DPO	✅	🚧	-	To be verified
Megatron TP/PP	✅	🚧	-	To be verified
Flash Attention	✅	⚠️	-	Some operators not supported

Legend: ✅ Verified | 🚧 To be verified | ⚠️ Partial support | ❌ Not supported

Troubleshooting

torch_npu Version Mismatch

# Check versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
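# The wheel filename depends on your Python version and CPU architecture
# (this one targets Python 3.11 on aarch64)
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl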

CANN Compatibility

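Consult the Ascend Community Version Compatibility Table to confirm that the installed CANN toolkit is compatible with your driver and torch_npu versions.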

Debug Logging

export ASCEND_GLOBAL_LOG_LEVEL=1
python your_script.py
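The same setting can be applied from Python when a script is launched by another program. The sketch below is an illustration under assumptions: `ascend_log_env` and `run_with_logging` are hypothetical helpers, and the level names reflect our understanding of the CANN convention (0 = debug, 1 = info, 2 = warning, 3 = error), which should be confirmed against the CANN documentation for your version.

```python
import os
import subprocess
import sys

# Assumed mapping of level names to ASCEND_GLOBAL_LOG_LEVEL values.
LOG_LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}


def ascend_log_env(level: str = "info", base=None) -> dict:
    """Return a copy of the environment with ASCEND_GLOBAL_LOG_LEVEL set."""
    env = dict(os.environ if base is None else base)
    env["ASCEND_GLOBAL_LOG_LEVEL"] = str(LOG_LEVELS[level])
    return env


def run_with_logging(script: str, level: str = "info") -> int:
    """Run a Python script in a child process with the chosen log level."""
    result = subprocess.run([sys.executable, script], env=ascend_log_env(level))
    return result.returncode
```

Calling `run_with_logging("your_script.py")` mirrors the shell commands above without changing the environment of the parent process.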
