NPU (Ascend) Quick Start Guide
This document describes how to install and use the Twinkle framework in Huawei Ascend NPU environments.
Environment Requirements
Before getting started, please ensure your system meets the following requirements:
| Component | Version Requirement | Description |
|---|---|---|
| Python | >= 3.11, < 3.13 | Twinkle framework requirement |
| Ascend Firmware Driver (HDK) | Latest version recommended | Hardware driver and firmware |
| CANN Toolkit | 8.5.1 or higher | Heterogeneous Computing Architecture |
| PyTorch | 2.7.1 | Deep learning framework |
| torch_npu | 2.7.1 | Ascend PyTorch adapter plugin |
Important Notes:
- torch and torch_npu versions must be exactly the same (e.g., both 2.7.1)
- Python 3.11 is recommended for best compatibility
- CANN toolkit requires approximately 10GB+ disk space
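The Python requirement in the table above can be checked before installing anything else. The following is a minimal sketch; the helper name `python_supported` is hypothetical and not part of Twinkle.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True when the interpreter is >= 3.11 and < 3.13."""
    return (3, 11) <= tuple(version_info[:2]) < (3, 13)


# Report whether the interpreter running this script is in range.
current = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {current} supported: {python_supported()}")
```

Running this with the interpreter you intend to use for Twinkle confirms the version range before the larger CANN and torch_npu downloads.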
Supported Hardware
Twinkle currently supports the following Ascend NPU devices:
- Ascend 910 series
- Other compatible Ascend accelerator cards
Installation Steps
1. Install NPU Environment (Driver, CANN, torch_npu)
NPU environment installation includes Ascend driver, CANN toolkit, PyTorch, and torch_npu.
📖 Complete Installation Tutorial: torch_npu Official Installation Guide
This documentation includes:
- Ascend driver (HDK) installation steps
- CANN toolkit installation steps
- PyTorch and torch_npu installation steps
- Version compatibility instructions
Recommended Version Configuration:
- Python: 3.11
- PyTorch: 2.7.1
- torch_npu: 2.7.1
- CANN: 8.5.1 or higher
2. Install Twinkle
After NPU environment configuration is complete, install the Twinkle framework from source:
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"
3. Install vLLM and vLLM-Ascend (Optional)
If you need to use vLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.
Installation Steps:
# Step 1: Install vLLM
pip install vllm==0.14.0
# Step 2: Install vLLM-Ascend
pip install vllm-ascend==0.14.0rc1
Notes:
- Install in the above order, ignoring possible dependency conflict warnings
- Ensure the CANN environment is activated before installation:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
- Recommended versions are vLLM 0.14.0 and vLLM-Ascend 0.14.0rc1
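After installation, the installed versions can be compared against the recommended ones. This sketch uses only the standard library; the helper names are hypothetical, and it assumes the distribution names match the names used in the pip commands above.

```python
from importlib import metadata

# Recommended versions taken from the notes above.
RECOMMENDED = {"vllm": "0.14.0", "vllm-ascend": "0.14.0rc1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(recommended=RECOMMENDED, lookup=installed_version):
    """Map each package to (installed, recommended, matches)."""
    report = {}
    for package, wanted in recommended.items():
        found = lookup(package)
        report[package] = (found, wanted, found == wanted)
    return report


for package, (found, wanted, ok) in check_versions().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: installed={found} recommended={wanted} [{status}]")
```

A strict string comparison is used here, so a build with a local version suffix would be reported as a mismatch even if it is functionally equivalent.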
4. Verify Installation
Create test script verify_npu.py:
import torch
import torch_npu
print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")
if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")
    # Simple test
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")
Run verification:
python verify_npu.py
If the output shows NPU available: True and no errors, installation is successful!
Note: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official images from the Ascend community.
5. Install Megatron Backend Dependencies
Recommended versions:
- Megatron-LM: v0.15.3
- MindSpeed: core_r0.15.3
- mcore-bridge: main branch or the version already validated in your Twinkle checkout
Installation steps:
# 1. Clone Megatron-LM and pin the compatible version
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout v0.15.3
cd ..
# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout core_r0.15.3
pip install -e .
cd ..
# 3. Clone and install mcore-bridge
git clone https://github.com/modelscope/mcore-bridge.git
cd mcore-bridge
pip install -e .
cd ..
# 4. Install Twinkle if needed
cd twinkle
pip install -e ".[transformers,ray]"
Runtime environment variables:
export PYTHONPATH=$PYTHONPATH:</path/to/Megatron-LM>
export MEGATRON_LM_PATH=</path/to/Megatron-LM>
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
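Before launching a job, it can help to confirm that these variables are set as intended. The sketch below only reads the three variables listed above; the function names are hypothetical, and the PYTHONPATH comparison is a plain string match (so a trailing slash would count as a different path).

```python
import os


def visible_devices(env=os.environ):
    """Parse ASCEND_RT_VISIBLE_DEVICES into a list of integer device IDs."""
    raw = env.get("ASCEND_RT_VISIBLE_DEVICES", "")
    return [int(item) for item in raw.split(",") if item.strip()]


def megatron_path_problems(env=os.environ):
    """Return human-readable problems with the Megatron-LM path settings."""
    problems = []
    path = env.get("MEGATRON_LM_PATH")
    if not path:
        problems.append("MEGATRON_LM_PATH is not set")
    elif not os.path.isdir(path):
        problems.append(f"MEGATRON_LM_PATH is not a directory: {path}")
    elif path not in env.get("PYTHONPATH", "").split(os.pathsep):
        problems.append("Megatron-LM path is missing from PYTHONPATH")
    return problems


print(f"Visible NPU devices: {visible_devices()}")
for problem in megatron_path_problems():
    print(f"Warning: {problem}")
```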
Verification:
First run a minimal import check to make sure the current environment can resolve MindSpeed and Megatron-LM:
python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"
6. Qwen3.5/3.6 FLA and Triton-Ascend Version Compatibility
FLA Enablement Conditions
To use FLA (Flash Linear Attention) with Qwen3.5/3.6 on the transformers backend, the following conditions must be met:
- Install triton-ascend
- Use MindSpeed version 26.0.0_core_r0.12.1
Triton-Ascend Version and CANN Compatibility
| triton-ascend | CANN | Additional Dependencies |
|---|---|---|
| 3.2.0 | 8.5.x | Do not install triton |
| 3.2.1 | 9.0.0 | triton must be installed |
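The table can also be expressed as a small lookup, which is convenient in setup scripts. This is a sketch only: the function name is hypothetical, "8.5.x" is interpreted as any 8.5 patch release, and CANN versions not listed in the table return None rather than a guess.

```python
from fnmatch import fnmatchcase

# Rows copied from the compatibility table above:
# (triton-ascend version, CANN version pattern, install triton as well?)
COMPATIBILITY = [
    ("3.2.0", "8.5.*", False),
    ("3.2.1", "9.0.0", True),
]


def triton_ascend_for(cann_version):
    """Return (triton_ascend_version, needs_triton), or None if not listed."""
    for triton_ascend, cann_pattern, needs_triton in COMPATIBILITY:
        if fnmatchcase(cann_version, cann_pattern):
            return triton_ascend, needs_triton
    return None


print(triton_ascend_for("8.5.1"))
```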
MindSpeed Version and Code Adaptation
The currently validated MindSpeed version is 26.0.0_core_r0.12.1. MindSpeed repository: https://gitcode.com/Ascend/MindSpeed
If using a higher MindSpeed version, note that the following import paths in src/twinkle/kernel/chunk_gated_delta_rule.py may need to be adjusted to match the actual code locations in MindSpeed:
from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
from mindspeed.lite.ops.triton.solve_tril import solve_tril
from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
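To see which of these module paths resolve in an installed MindSpeed version, a quick probe with the standard library may help. This is a sketch, not part of Twinkle; `missing_modules` is a hypothetical helper. Note that `importlib.util.find_spec` imports parent packages while searching, so it should be run in the same environment used for training.

```python
import importlib.util

# Module paths copied from the import statements above.
EXPECTED_MODULES = [
    "mindspeed.lite.ops.triton.chunk_delta_h",
    "mindspeed.lite.ops.triton.chunk_o",
    "mindspeed.lite.ops.triton.chunk_scaled_dot_kkt",
    "mindspeed.lite.ops.triton.cumsum",
    "mindspeed.lite.ops.triton.solve_tril",
    "mindspeed.lite.ops.triton.utils",
    "mindspeed.lite.ops.triton.wy_fast",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package raises instead of returning None.
            found = False
        if not found:
            missing.append(name)
    return missing


for name in missing_modules(EXPECTED_MODULES):
    print(f"not found: {name}")
```

Any path printed here is a candidate for adjustment in src/twinkle/kernel/chunk_gated_delta_rule.py. The probe only confirms that a module exists, not that it still exports the same function names.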
7. NPU Patch Environment Variable Configuration
Twinkle enables model-layer patches by default in NPU environments. The following environment variables provide fine-grained control:
| Environment Variable | Description | Default |
|---|---|---|
| TWINKLE_NPU_PATCH | Master switch for all NPU optimizations | 1 (enabled) |
| TWINKLE_NPU_FUSED_OPS | Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA) | 1 (enabled) |
| TWINKLE_NPU_MOE_PATCH | Enable MoE Grouped MatMul | 1 (enabled) |
| TWINKLE_NPU_FLA | Enable Qwen3.5 Flash Linear Attention; set to 0 to force torch fallback | 1 (enabled) |
Usage examples:
# Disable all NPU optimizations and fall back to native Transformers
export TWINKLE_NPU_PATCH=0
# Disable FLA only while keeping other fused operators
export TWINKLE_NPU_FLA=0
# Disable MoE patch only
export TWINKLE_NPU_MOE_PATCH=0
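The interaction between the master switch and the individual switches can be illustrated as follows. This is a sketch of the behavior described in the table, not Twinkle's actual parsing code: it assumes any value other than "0" counts as enabled, and `effective_flags` is a hypothetical name.

```python
import os

# Variable names and defaults copied from the table above.
DEFAULTS = {
    "TWINKLE_NPU_PATCH": "1",
    "TWINKLE_NPU_FUSED_OPS": "1",
    "TWINKLE_NPU_MOE_PATCH": "1",
    "TWINKLE_NPU_FLA": "1",
}


def effective_flags(env=os.environ):
    """Resolve each switch, letting TWINKLE_NPU_PATCH=0 disable everything.

    Assumption: any value other than "0" counts as enabled. Twinkle's real
    parsing rules are not shown in this guide.
    """
    raw = {name: env.get(name, default) != "0" for name, default in DEFAULTS.items()}
    master = raw["TWINKLE_NPU_PATCH"]
    return {name: master and enabled for name, enabled in raw.items()}


for name, enabled in effective_flags().items():
    print(f"{name}: {'enabled' if enabled else 'disabled'}")
```

Printing the resolved switches at the start of a run makes it easier to tell whether a performance difference comes from a disabled patch.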
Quick Start
Important Notice: The following examples are from the cookbook/ directory and have been verified in actual NPU environments. It is recommended to run scripts directly from the cookbook rather than copying and pasting code snippets.
SFT LoRA Fine-tuning
This guide does not currently include an SFT LoRA cookbook example for NPU. This capability is to be documented alongside an available cookbook example or a future NPU script.
GRPO Reinforcement Learning Training
This guide does not currently include a GRPO cookbook example for NPU. This capability is to be documented alongside an available cookbook example or a future NPU script.
More Examples
Check the cookbook/remote/tinker/ascend/ directory for remote training server-side configuration.
Parallelization Strategies
Twinkle currently supports the following verified parallelization strategies on NPU:
| Parallel Type | Description | NPU Support | Verification Status |
|---|---|---|---|
| DP (Data Parallel) | Data parallelism | ✅ | No corresponding cookbook example |
| FSDP (Fully Sharded Data Parallel) | Fully sharded data parallelism | ✅ | No corresponding cookbook example |
| TP (Tensor Parallel) | Tensor parallelism (Megatron) | ✅ | Verified (see cookbook/megatron/ascend/tp_npu.py) |
| PP (Pipeline Parallel) | Pipeline parallelism (Megatron) | ✅ | Verified (see cookbook/megatron/ascend/tp_npu.py) |
| CP (Context Parallel) | Context parallelism | ✅ | Verified (see cookbook/megatron/ascend/tp_moe_cp_npu.py) |
| EP (Expert Parallel) | Expert parallelism (MoE) | ✅ | Verified (see cookbook/megatron/ascend/tp_moe_npu.py) |
Legend:
- ✅ Verified: Has actual running example code
- 🚧 To be verified: Theoretically supported but no NPU verification example yet
- ❌ Not supported: Not available in current version
DP + FSDP Example
The NPU document currently does not provide a corresponding cookbook code snippet.
Megatron backend note: Twinkle now provides runnable NPU smoke scripts for the Megatron backend. Please follow the installation section above before running the cookbook examples, and start with cookbook/megatron/ascend/tp_npu.py before moving on to cookbook/megatron/ascend/tp_moe_npu.py and cookbook/megatron/ascend/tp_moe_cp_npu.py.
Common Issues
1. torch_npu Version Mismatch
Problem: Version incompatibility warnings or errors after installing torch_npu.
Solution:
- Ensure torch and torch_npu versions are exactly the same
- Check if CANN version is compatible with torch_npu
# Check current versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"
# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
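The version comparison can also be scripted. The sketch below compares only the leading numeric release (for example 2.7.1) and ignores suffixes such as `+cpu` or `.post1`; treating those suffixes as irrelevant is an assumption, and the function names are hypothetical.

```python
import re


def release_tuple(version):
    """Extract the leading numeric release, e.g. '2.7.1+cpu' -> (2, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def versions_match(torch_version, torch_npu_version):
    """Return True when both packages share the same major.minor.patch."""
    return release_tuple(torch_version) == release_tuple(torch_npu_version)


# In a real environment, pass torch.__version__ and torch_npu.__version__.
print(versions_match("2.7.1", "2.7.1"))
```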
2. CANN Toolkit Version Issue
Problem: CANN version incompatible with torch_npu.
Solution:
- Refer to Ascend Community Version Compatibility Table
- Install corresponding CANN toolkit version
Feature Support Status
Feature support matrix based on actual code verification:
| Feature | GPU | NPU | Verification Example | Description |
|---|---|---|---|---|
| SFT + LoRA | ✅ | ✅ | - | No corresponding cookbook example |
| GRPO | ✅ | ✅ | - | No corresponding cookbook example |
| DP Parallelism | ✅ | ✅ | - | No corresponding cookbook example |
| FSDP Parallelism | ✅ | ✅ | - | No corresponding cookbook example |
| Ray Distributed | ✅ | ✅ | - | No corresponding cookbook example |
| TorchSampler | ✅ | ✅ | - | No corresponding cookbook example |
| vLLMSampler | ✅ | ✅ | - | No corresponding cookbook example |
| Full Fine-tuning | ✅ | ✅ | - | Verified available |
| QLoRA | ✅ | ❌ | - | Quantization operators not yet supported |
| DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
| Megatron TP/PP | ✅ | ✅ | cookbook/megatron/ascend/tp_npu.py | Verified with the Megatron smoke script |
| Flash Attention | ✅ | ⚠️ | - | Some operators not supported |
Legend:
- ✅ Verified: Has actual running example, confirmed available
- 🚧 To be verified: Theoretically supported but no NPU environment verification yet
- ⚠️ Partial support: Available but with limitations or performance differences
- ❌ Not supported: Not available in current version
Usage Recommendations:
- Prioritize features marked as “Verified” for guaranteed stability
- “To be verified” features can be attempted but may encounter compatibility issues
- Refer to corresponding example code when encountering problems
Example Code
Twinkle’s verified NPU examples currently focus on the Megatron smoke path; the SFT and GRPO cookbook examples do not have corresponding files yet.
Remote Training (Tinker Protocol)
- Server Configuration: cookbook/remote/tinker/ascend/
- Provides HTTP API interface
- Supports remote training and inference
- Suitable for production environment deployment
Running Examples: No corresponding command examples are provided yet.
Reference Resources
- Ascend Community Official Website
- CANN Software Installation Guide
- torch_npu GitHub
- Twinkle GitHub
- Twinkle Documentation
Getting Help
If you encounter issues during use:
- Check Logs: Set the environment variable ASCEND_GLOBAL_LOG_LEVEL=1 for detailed logs
- Submit Issue: Twinkle GitHub Issues
- Community Discussion: Ascend Community Forum
Next Steps
- 📖 Read Quick Start for more training examples
- 📖 Read Installation Guide for other platform installations
- 🚀 Browse the cookbook/ directory for complete example code
- 💡 Check Twinkle Documentation for advanced features