NPU (Ascend) Quick Start Guide

This document describes how to install and use the Twinkle framework in Huawei Ascend NPU environments.

Environment Requirements

Before getting started, please ensure your system meets the following requirements:

Component	Version Requirement	Description
Python	>= 3.11, < 3.13	Twinkle framework requirement
Ascend Firmware Driver (HDK)	Latest version recommended	Hardware driver and firmware
CANN Toolkit	8.5.1 or higher	Heterogeneous Computing Architecture
PyTorch	2.7.1	Deep learning framework
torch_npu	2.7.1	Ascend PyTorch adapter plugin

Important Notes:

torch and torch_npu versions must be exactly the same (e.g., both 2.7.1)
Python 3.11 is recommended for best compatibility
CANN toolkit requires approximately 10GB+ disk space
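
The version rules above can be checked programmatically before anything else is installed. The following is a minimal sketch, not part of Twinkle: the helper names `release_tuple`, `python_supported`, and `versions_match` are hypothetical, and the sketch assumes that only the `major.minor.patch` part of a version string matters, so a build suffix such as `+cpu` or `.post1` is ignored.

```python
import re
import sys


def release_tuple(version: str) -> tuple:
    """Return the numeric release part of a version string, e.g. '2.7.1+cpu' -> (2, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def python_supported(version_info=sys.version_info) -> bool:
    """True when the interpreter satisfies the documented range >= 3.11, < 3.13."""
    return (3, 11) <= tuple(version_info[:2]) < (3, 13)


def versions_match(torch_version: str, torch_npu_version: str) -> bool:
    """True when both packages share the same major.minor.patch release.

    Assumption: suffixes after the patch number are not compared.
    """
    return release_tuple(torch_version) == release_tuple(torch_npu_version)
```

In a real environment the two strings would come from `torch.__version__` and `torch_npu.__version__`; passing them in as arguments keeps the helpers usable on machines without an NPU.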

Supported Hardware

Twinkle currently supports the following Ascend NPU devices:

Ascend 910 series
Other compatible Ascend accelerator cards

Installation Steps

1. Install NPU Environment (Driver, CANN, torch_npu)

NPU environment installation includes Ascend driver, CANN toolkit, PyTorch, and torch_npu.

📖 Complete Installation Tutorial: torch_npu Official Installation Guide

This documentation includes:

Ascend driver (HDK) installation steps
CANN toolkit installation steps
PyTorch and torch_npu installation steps
Version compatibility instructions

Recommended Version Configuration:

Python: 3.11
PyTorch: 2.7.1
torch_npu: 2.7.1
CANN: 8.5.1 or higher

2. Install Twinkle

After NPU environment configuration is complete, install the Twinkle framework from source:

git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"

3. Install vLLM and vLLM-Ascend (Optional)

If you need to use vLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.

Installation Steps:

# Step 1: Install vLLM
pip install vllm==0.14.0

# Step 2: Install vLLM-Ascend
pip install vllm-ascend==0.14.0rc1

Notes:

Install in the above order, ignoring possible dependency conflict warnings
Ensure CANN environment is activated before installation: source /usr/local/Ascend/ascend-toolkit/set_env.sh
Recommended versions are vLLM 0.14.0 and vLLM-Ascend 0.14.0rc1
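
Because the install step tells you to ignore dependency conflict warnings, it can help to confirm afterwards which versions actually ended up installed. The sketch below uses only the standard library `importlib.metadata`; the function name `pin_report` and the `RECOMMENDED` mapping are illustrative and not part of Twinkle or vLLM.

```python
from importlib import metadata

# Versions recommended in the notes above.
RECOMMENDED = {"vllm": "0.14.0", "vllm-ascend": "0.14.0rc1"}


def pin_report(installed_version=metadata.version, pins=RECOMMENDED) -> dict:
    """Map each package name to (installed version or None, whether it equals the pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, found == wanted)
    return report
```

Calling `pin_report()` with no arguments inspects the current environment; the `installed_version` parameter exists only so the lookup can be replaced when testing.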

4. Verify Installation

Create test script verify_npu.py:

import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # Simple test
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")

Run verification:

python verify_npu.py

If the output shows NPU available: True and no errors, installation is successful!

Note: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official images from the Ascend community.

5. Install Megatron Backend Dependencies

Recommended versions:

Megatron-LM: v0.15.3
MindSpeed: core_r0.15.3
mcore-bridge: main branch or the version already validated in your Twinkle checkout

Installation steps:

# 1. Clone Megatron-LM and pin the compatible version
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout v0.15.3
cd ..

# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout core_r0.15.3
pip install -e .
cd ..

# 3. Clone and install mcore-bridge
git clone https://github.com/modelscope/mcore-bridge.git
cd mcore-bridge
pip install -e .
cd ..

# 4. Install Twinkle if needed
cd twinkle
pip install -e ".[transformers,ray]"

Runtime environment variables:

# Replace /path/to/Megatron-LM with the absolute path of your Megatron-LM checkout
export PYTHONPATH=$PYTHONPATH:/path/to/Megatron-LM
export MEGATRON_LM_PATH=/path/to/Megatron-LM
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
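
A wrong path or a malformed device list is easy to miss until a job fails at startup. As a rough sketch, the settings above could be sanity-checked like this. The function `check_megatron_env` is hypothetical, it compares paths as plain strings, and it treats an unset `ASCEND_RT_VISIBLE_DEVICES` as acceptable on the assumption that all devices are then visible.

```python
import os


def check_megatron_env(env=os.environ) -> list:
    """Return a list of human-readable problems; an empty list means the settings look sane."""
    problems = []

    megatron_path = env.get("MEGATRON_LM_PATH", "")
    if not megatron_path:
        problems.append("MEGATRON_LM_PATH is not set")
    elif not os.path.isdir(megatron_path):
        problems.append(f"MEGATRON_LM_PATH is not a directory: {megatron_path}")
    elif megatron_path not in env.get("PYTHONPATH", "").split(os.pathsep):
        problems.append("MEGATRON_LM_PATH is missing from PYTHONPATH")

    devices = env.get("ASCEND_RT_VISIBLE_DEVICES", "")
    ids = [item.strip() for item in devices.split(",") if item.strip()]
    if ids and not all(item.isdigit() for item in ids):
        problems.append(f"ASCEND_RT_VISIBLE_DEVICES has non-numeric entries: {devices}")
    elif len(set(ids)) != len(ids):
        problems.append(f"ASCEND_RT_VISIBLE_DEVICES has duplicate entries: {devices}")

    return problems
```

Running `print(check_megatron_env())` in the shell where the variables were exported lists anything that looks off.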

Verification:

Run a minimal import check to confirm that the current environment can resolve MindSpeed and Megatron-LM:

python -c "import mindspeed.megatron_adaptor; from twinkle.model.megatron._mindspeed_runtime import ensure_mindspeed_adaptor_patched; ensure_mindspeed_adaptor_patched(); print('✓ Megatron backend imports are ready')"

6. Qwen3.5/3.6 FLA and Triton-Ascend Version Compatibility

FLA Enablement Conditions

To use FLA (Flash Linear Attention) with Qwen3.5/3.6 on the transformers backend, the following conditions must be met:

triton-ascend is installed
MindSpeed is at version 26.0.0_core_r0.12.1

Triton-Ascend Version and CANN Compatibility

triton-ascend	CANN	Additional Dependencies
3.2.0	8.5.x	Do not install `triton`
3.2.1	9.0.0	`triton` must be installed
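
The pairing in the table can be encoded as a small lookup so that setup scripts fail early on an unlisted combination. This is only a sketch of the table above: the function name `triton_ascend_requirements` and the returned keys are made up for illustration, and any CANN version not listed in the table is rejected rather than guessed.

```python
def triton_ascend_requirements(cann_version: str) -> dict:
    """Map a CANN version string to the triton-ascend pairing listed in the table."""
    parts = cann_version.strip().split(".")
    if parts[:2] == ["8", "5"]:
        # CANN 8.5.x: use triton-ascend 3.2.0 and do not install `triton`.
        return {"triton_ascend": "3.2.0", "install_triton": False}
    if parts[:3] == ["9", "0", "0"]:
        # CANN 9.0.0: use triton-ascend 3.2.1 together with `triton`.
        return {"triton_ascend": "3.2.1", "install_triton": True}
    raise ValueError(f"no triton-ascend pairing listed for CANN {cann_version}")
```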

MindSpeed Version and Code Adaptation

The currently validated MindSpeed version is 26.0.0_core_r0.12.1. MindSpeed repository: https://gitcode.com/Ascend/MindSpeed

If using a higher MindSpeed version, note that the following import paths in src/twinkle/kernel/chunk_gated_delta_rule.py may need to be adjusted to match the actual code locations in MindSpeed:

from mindspeed.lite.ops.triton.chunk_delta_h import chunk_gated_delta_rule_bwd_dhu, chunk_gated_delta_rule_fwd_h
from mindspeed.lite.ops.triton.chunk_o import chunk_bwd_dqkwg, chunk_bwd_dv_local, chunk_fwd_o
from mindspeed.lite.ops.triton.chunk_scaled_dot_kkt import chunk_scaled_dot_kkt_fwd
from mindspeed.lite.ops.triton.cumsum import chunk_local_cumsum
from mindspeed.lite.ops.triton.solve_tril import solve_tril
from mindspeed.lite.ops.triton.utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
from mindspeed.lite.ops.triton.wy_fast import prepare_wy_repr_bwd, recompute_w_u_fwd
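
Before editing those imports for a newer MindSpeed release, it may be useful to find out which of the modules still resolve. The sketch below relies only on the standard library `importlib.util.find_spec`; the helper `missing_modules` is hypothetical. Note that `find_spec` imports the parent packages of a dotted name, so running it against MindSpeed executes MindSpeed's package initialization.

```python
import importlib.util

# Module paths taken from the import list above.
MINDSPEED_TRITON_MODULES = [
    "mindspeed.lite.ops.triton.chunk_delta_h",
    "mindspeed.lite.ops.triton.chunk_o",
    "mindspeed.lite.ops.triton.chunk_scaled_dot_kkt",
    "mindspeed.lite.ops.triton.cumsum",
    "mindspeed.lite.ops.triton.solve_tril",
    "mindspeed.lite.ops.triton.utils",
    "mindspeed.lite.ops.triton.wy_fast",
]


def missing_modules(module_names) -> list:
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package is absent or has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

`missing_modules(MINDSPEED_TRITON_MODULES)` returns an empty list when every path still exists; any name it returns is a candidate for adjustment in src/twinkle/kernel/chunk_gated_delta_rule.py.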

7. NPU Patch Environment Variable Configuration

Twinkle enables model-layer patches by default in NPU environments. The following environment variables provide fine-grained control:

Environment Variable	Description	Default
`TWINKLE_NPU_PATCH`	Master switch for all NPU optimizations	`1` (enabled)
`TWINKLE_NPU_FUSED_OPS`	Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA)	`1` (enabled)
`TWINKLE_NPU_MOE_PATCH`	Enable MoE Grouped MatMul	`1` (enabled)
`TWINKLE_NPU_FLA`	Enable Qwen3.5 Flash Linear Attention; set to `0` to force torch fallback	`1` (enabled)

Usage examples:

# Disable all NPU optimizations and fall back to native Transformers
export TWINKLE_NPU_PATCH=0

# Disable FLA only while keeping other fused operators
export TWINKLE_NPU_FLA=0

# Disable MoE patch only
export TWINKLE_NPU_MOE_PATCH=0
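
To reason about how the switches combine, the following sketch resolves the effective state of each feature. It is not Twinkle's actual parsing code. It assumes that an unset variable means enabled (the documented default), that only the literal value `0` disables a feature, and that `TWINKLE_NPU_PATCH=0` overrides the individual flags. The name `effective_npu_flags` is hypothetical.

```python
import os

# Per-feature switches from the table above; TWINKLE_NPU_PATCH is the master switch.
NPU_FLAGS = ("TWINKLE_NPU_FUSED_OPS", "TWINKLE_NPU_MOE_PATCH", "TWINKLE_NPU_FLA")


def effective_npu_flags(env=os.environ) -> dict:
    """Resolve each feature flag, treating unset variables as '1' (enabled)."""
    master_on = env.get("TWINKLE_NPU_PATCH", "1") != "0"
    return {name: master_on and env.get(name, "1") != "0" for name in NPU_FLAGS}
```

Printing `effective_npu_flags()` before a run is a quick way to record which optimizations the environment requests.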

Quick Start

Important Notice: The examples referenced in this section come from the cookbook/ directory and have been verified in actual NPU environments. Run the scripts directly from the cookbook rather than copying and pasting code snippets.

SFT LoRA Fine-tuning

This guide does not currently include an SFT cookbook example for NPU. The capability should be documented alongside a cookbook example or NPU script once one is available.

GRPO Reinforcement Learning Training

This guide does not currently include a GRPO cookbook example for NPU. The capability should be documented alongside a cookbook example or NPU script once one is available.

More Examples

Check the cookbook/remote/tinker/ascend/ directory for remote training server-side configuration.

Parallelization Strategies

Twinkle currently supports the following parallelization strategies on NPU. The Verification Status column shows which ones have a runnable cookbook example:

Parallel Type	Description	NPU Support	Verification Status
DP (Data Parallel)	Data parallelism	✅	No corresponding cookbook example
FSDP (Fully Sharded Data Parallel)	Fully sharded data parallelism	✅	No corresponding cookbook example
TP (Tensor Parallel)	Tensor parallelism (Megatron)	✅	Verified (see `cookbook/megatron/ascend/tp_npu.py`)
PP (Pipeline Parallel)	Pipeline parallelism (Megatron)	✅	Verified (see `cookbook/megatron/ascend/tp_npu.py`)
CP (Context Parallel)	Context parallelism	✅	Verified (see `cookbook/megatron/ascend/tp_moe_cp_npu.py`)
EP (Expert Parallel)	Expert parallelism (MoE)	✅	Verified (see `cookbook/megatron/ascend/tp_moe_npu.py`)

Legend:

✅ Verified: Has actual running example code
🚧 To be verified: Theoretically supported but no NPU verification example yet
❌ Not supported: Not available in current version
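
When combining these strategies, the parallel sizes have to divide the number of visible devices. The sketch below assumes the usual Megatron-style layout, in which the world size is the product of the tensor, pipeline, context, and data parallel sizes. It is not taken from the cookbook scripts, and the function name `data_parallel_size` is illustrative. Expert parallelism is left out on the assumption that it is formed from ranks already counted here.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel size left over after TP, PP, and CP are assigned."""
    if min(world_size, tp, pp, cp) < 1:
        raise ValueError("all sizes must be positive integers")
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel
```

For the eight devices listed in `ASCEND_RT_VISIBLE_DEVICES` earlier, `data_parallel_size(8, tp=2, pp=2)` leaves a data-parallel size of 2.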

DP + FSDP Example

This guide does not currently provide a cookbook code snippet for DP + FSDP on NPU.

Megatron backend note: Twinkle now provides runnable NPU smoke scripts for the Megatron backend. Please follow the installation section above before running the cookbook examples, and start with cookbook/megatron/ascend/tp_npu.py before moving on to cookbook/megatron/ascend/tp_moe_npu.py and cookbook/megatron/ascend/tp_moe_cp_npu.py.

Common Issues

1. torch_npu Version Mismatch

Problem: Version incompatibility warnings or errors after installing torch_npu.

Solution:

Ensure torch and torch_npu versions are exactly the same
Check if CANN version is compatible with torch_npu

# Check current versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
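
The wheel filename in the command above encodes a specific Python version (`cp311`) and CPU architecture (`aarch64`), so it only installs on a matching host. A minimal sketch for checking a wheel name against the current machine follows. The helper `wheel_matches_host` is hypothetical and relies only on the standard wheel naming scheme, where the last three dash-separated fields are the Python, ABI, and platform tags.

```python
import platform
import sys


def wheel_matches_host(wheel_name: str, py_version=None, machine=None) -> bool:
    """Check a wheel's Python tag and platform tag against the host interpreter and CPU."""
    py_version = py_version or sys.version_info[:2]
    machine = machine or platform.machine()
    stem = wheel_name[: -len(".whl")] if wheel_name.endswith(".whl") else wheel_name
    fields = stem.split("-")
    if len(fields) < 5:
        raise ValueError(f"not a wheel filename: {wheel_name!r}")
    python_tag, platform_tag = fields[-3], fields[-1]
    expected_python = f"cp{py_version[0]}{py_version[1]}"
    return python_tag == expected_python and platform_tag.endswith(machine)
```

If this returns False for the wheel you downloaded, pick the torch_npu wheel whose tags match your interpreter and architecture instead.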

2. CANN Toolkit Version Issue

Problem: CANN version incompatible with torch_npu.

Solution:

Refer to the Ascend Community Version Compatibility Table
Install the CANN toolkit version that corresponds to your torch_npu version

Feature Support Status

Feature support matrix based on actual code verification:

Feature	GPU	NPU	Verification Example	Description
SFT + LoRA	✅	✅	-	No corresponding cookbook example
GRPO	✅	✅	-	No corresponding cookbook example
DP Parallelism	✅	✅	-	No corresponding cookbook example
FSDP Parallelism	✅	✅	-	No corresponding cookbook example
Ray Distributed	✅	✅	-	No corresponding cookbook example
TorchSampler	✅	✅	-	No corresponding cookbook example
vLLMSampler	✅	✅	-	No corresponding cookbook example
Full Fine-tuning	✅	✅	-	Verified available
QLoRA	✅	❌	-	Quantization operators not yet supported
DPO	✅	🚧	-	Theoretically supported, to be verified
Megatron TP/PP	✅	✅	`cookbook/megatron/ascend/tp_npu.py`	Verified with the Megatron NPU smoke script
Flash Attention	✅	⚠️	-	Some operators not supported

Legend:

✅ Verified: Has actual running example, confirmed available
🚧 To be verified: Theoretically supported but no NPU environment verification yet
⚠️ Partial support: Available but with limitations or performance differences
❌ Not supported: Not available in current version

Usage Recommendations:

Prioritize features marked as “Verified” for guaranteed stability
“To be verified” features can be attempted but may encounter compatibility issues
Refer to corresponding example code when encountering problems

Example Code

Twinkle’s verified NPU examples currently focus on the Megatron smoke path; the SFT and GRPO cookbook examples do not have corresponding files yet.

Remote Training (Tinker Protocol)

Server Configuration: cookbook/remote/tinker/ascend/
- Provides HTTP API interface
- Supports remote training and inference
- Suitable for production environment deployment

Running Examples: No corresponding command examples are provided yet.

Reference Resources

Getting Help

If you encounter issues during use:

Check Logs: Set the environment variable ASCEND_GLOBAL_LOG_LEVEL=1 (INFO level) for more detailed logs
Submit Issue: Twinkle GitHub Issues
Community Discussion: Ascend Community Forum

Next Steps

📖 Read Quick Start for more training examples
📖 Read Installation Guide for other platform installations
🚀 Browse the cookbook/ directory for complete example code
💡 Check Twinkle Documentation for advanced features
