<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>NPU | Twinkle</title><link>https://modelscope.github.io/twinkle-web/tags/npu/</link><atom:link href="https://modelscope.github.io/twinkle-web/tags/npu/index.xml" rel="self" type="application/rss+xml"/><description>NPU</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://modelscope.github.io/twinkle-web/media/logo_hu_fedc6a0bfe689b18.png</url><title>NPU</title><link>https://modelscope.github.io/twinkle-web/tags/npu/</link></image><item><title>Ascend NPU Support: Fused Operators and Flash Linear Attention</title><link>https://modelscope.github.io/twinkle-web/blog/npu-support/</link><pubDate>Fri, 05 Jun 2026 00:00:00 +0000</pubDate><guid>https://modelscope.github.io/twinkle-web/blog/npu-support/</guid><description>&lt;p&gt;Twinkle provides first-class support for &lt;strong&gt;Huawei Ascend NPU&lt;/strong&gt; through a comprehensive monkey-patching system that replaces standard CUDA operators with NPU-optimized fused kernels. This post covers the kernel architecture and the optimizations enabled.&lt;/p&gt;
&lt;h2 id="kernel-architecture"&gt;Kernel Architecture&lt;/h2&gt;
&lt;p&gt;Twinkle&amp;rsquo;s kernel module (&lt;code&gt;twinkle.kernel&lt;/code&gt;) provides a unified entry point &lt;code&gt;kernelize_model()&lt;/code&gt; that automatically detects the device and applies appropriate optimizations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;twinkle.kernel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;kernelize_model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kernelize_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;npu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# or auto-detected&lt;/span&gt;
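&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On NPU devices, the following fused operators are applied &lt;strong&gt;by default&lt;/strong&gt;, except MoE GMM, which is opt-in. Each one can be switched off through the environment variables listed under Environment Variable Control:&lt;/p&gt;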
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;NPU Implementation&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RMSNorm&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_npu.npu_rms_norm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fused normalization, ~2x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RoPE&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_npu.npu_rotary_mul&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fused rotary embedding with partial RoPE support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SwiGLU&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_npu.npu_swiglu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fused gate+up projection activation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDPA&lt;/td&gt;
&lt;td&gt;NPU-compatible &lt;code&gt;scaled_dot_product_attention&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Correct mask handling for NPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE GMM&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_npu.npu_grouped_matmul&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;EP-aware grouped matrix multiply&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLA&lt;/td&gt;
&lt;td&gt;MindSpeed Triton backend&lt;/td&gt;
&lt;td&gt;Flash Linear Attention for Qwen3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="fused-operators-in-detail"&gt;Fused Operators in Detail&lt;/h2&gt;
&lt;h3 id="rmsnorm-with-residual-parameterization"&gt;RMSNorm with Residual Parameterization&lt;/h3&gt;
&lt;p&gt;Twinkle&amp;rsquo;s &lt;code&gt;NpuRMSNorm&lt;/code&gt; detects the &lt;strong&gt;residual parameterization&lt;/strong&gt; pattern used by Qwen3.5 (where &lt;code&gt;scale = 1.0 + weight&lt;/code&gt;) at initialization time, avoiding CPU-synchronizing &lt;code&gt;Tensor.item()&lt;/code&gt; calls in the hot path:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;nn&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch_npu&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NpuRMSNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="c1"&gt;# Detect once, outside forward(); this needs the checkpoint weights, since the ones init above has mean 1.0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_residual_param&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_residual_param&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch_npu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;npu_rms_norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="ep-aware-moe-optimization"&gt;EP-Aware MoE Optimization&lt;/h3&gt;
&lt;p&gt;The MoE grouped matmul patch is &lt;strong&gt;EP-aware&lt;/strong&gt;: it activates only when &lt;code&gt;TWINKLE_NPU_GMM_PATCH=1&lt;/code&gt; is set and Expert Parallelism (EP) is enabled, so that each rank holds only a subset of experts and the weights are small and contiguous. Without EP, each rank holds &lt;strong&gt;all&lt;/strong&gt; experts, and the transpose+contiguous copy creates ~8x overhead:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;TWINKLE_NPU_GMM_PATCH not set → skip (default safe)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;TWINKLE_NPU_GMM_PATCH=1 + EP enabled → apply (efficient)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;TWINKLE_NPU_GMM_PATCH=1 + EP disabled → skip (avoid 8x overhead)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;GmmFunction&lt;/code&gt; autograd function wraps &lt;code&gt;torch_npu.npu_grouped_matmul&lt;/code&gt; with full backward support, and weights are cached with automatic invalidation when updated (full-param training bumps &lt;code&gt;_version&lt;/code&gt;, LoRA keeps it stable).&lt;/p&gt;
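&lt;p&gt;The version-based invalidation described above can be sketched in plain Python. This is a minimal illustration, not Twinkle&amp;rsquo;s implementation: &lt;code&gt;VersionedCache&lt;/code&gt;, &lt;code&gt;FakeWeight&lt;/code&gt; and &lt;code&gt;transpose&lt;/code&gt; are hypothetical names, and &lt;code&gt;FakeWeight&lt;/code&gt; only imitates the way a PyTorch tensor increments &lt;code&gt;_version&lt;/code&gt; on in-place updates.&lt;/p&gt;

```python
class VersionedCache:
    """Caches a derived value per source object, keyed on the object's _version."""

    def __init__(self, transform):
        self._transform = transform
        # Maps id(source) to (version, derived value). A real cache would also
        # need to handle id reuse after the source object is garbage collected.
        self._entries = {}

    def get(self, source):
        version = source._version
        entry = self._entries.get(id(source))
        if entry is not None and entry[0] == version:
            return entry[1]  # weight unchanged since last call: reuse
        derived = self._transform(source)
        self._entries[id(source)] = (version, derived)
        return derived


class FakeWeight:
    """Hypothetical stand-in for a tensor: in-place updates bump _version."""

    def __init__(self, rows):
        self.rows = rows
        self._version = 0

    def update_(self, rows):
        self.rows = rows
        self._version += 1


def transpose(weight):
    # Stands in for the expensive transpose+contiguous copy.
    return [list(col) for col in zip(*weight.rows)]


cache = VersionedCache(transpose)
frozen = FakeWeight([[1, 2], [3, 4]])
cache.get(frozen)  # computed once
cache.get(frozen)  # served from cache while _version is unchanged
```

&lt;p&gt;Under this scheme a frozen expert weight, as in LoRA training, is transformed once and then reused, while an optimizer step that modifies the weight in place forces a recompute on the next call.&lt;/p&gt;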
&lt;h3 id="flash-linear-attention-for-qwen35"&gt;Flash Linear Attention for Qwen3.5&lt;/h3&gt;
&lt;p&gt;Qwen3.5 introduces a hybrid architecture mixing standard attention with linear attention layers. Twinkle enables the &lt;strong&gt;FLA fast path&lt;/strong&gt; on NPU via MindSpeed&amp;rsquo;s Triton implementation of &lt;code&gt;chunk_gated_delta_rule&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Force &lt;code&gt;is_flash_linear_attention_available = True&lt;/code&gt; in transformers&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;chunk_gated_delta_rule&lt;/code&gt; with MindSpeed NPU-compatible implementation&lt;/li&gt;
&lt;li&gt;Traverse instantiated model to patch per-layer instances&lt;/li&gt;
&lt;li&gt;Disable CUDA-only &lt;code&gt;FusedRMSNormGated&lt;/code&gt; that would fail on NPU&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The MindSpeed implementation provides chunked forward/backward with WY representation, supporting variable-length sequences via &lt;code&gt;cu_seqlens&lt;/code&gt;.&lt;/p&gt;
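&lt;p&gt;Steps 2 and 3 of the list above are separate because replacing a module-level function does not reach references that layer instances have already stored. The sketch below shows that effect with plain Python objects. Every name in it (&lt;code&gt;fla_ops&lt;/code&gt;, &lt;code&gt;LinearAttentionLayer&lt;/code&gt;, &lt;code&gt;patch_instances&lt;/code&gt;, the two kernels) is a hypothetical stand-in, and it assumes the layers capture the kernel in their constructor.&lt;/p&gt;

```python
import types

# Hypothetical stand-in for a library module that exposes a CUDA-only kernel.
fla_ops = types.SimpleNamespace()


def cuda_only_kernel(x):
    raise RuntimeError("CUDA kernel is not available on this device")


def npu_kernel(x):
    # Placeholder computation; the real kernel is far more involved.
    return [2 * v for v in x]


fla_ops.chunk_gated_delta_rule = cuda_only_kernel


class LinearAttentionLayer:
    def __init__(self):
        # The layer stores its own reference to the kernel at construction time.
        self.chunk_gated_delta_rule = fla_ops.chunk_gated_delta_rule
        self.children = []

    def forward(self, x):
        return self.chunk_gated_delta_rule(x)


def patch_instances(root, replacement):
    """Walk the layer tree and rebind the stored kernel on every instance."""
    patched = 0
    stack = [root]
    while stack:
        layer = stack.pop()
        if hasattr(layer, "chunk_gated_delta_rule"):
            layer.chunk_gated_delta_rule = replacement
            patched += 1
        stack.extend(layer.children)
    return patched


model = LinearAttentionLayer()
model.children.append(LinearAttentionLayer())

fla_ops.chunk_gated_delta_rule = npu_kernel  # module-level patch: new layers only
patch_instances(model, npu_kernel)           # instance-level patch: existing layers
```

&lt;p&gt;Without the call to &lt;code&gt;patch_instances&lt;/code&gt;, the layers built before the module-level patch would still call the CUDA-only function.&lt;/p&gt;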
&lt;h2 id="environment-variable-control"&gt;Environment Variable Control&lt;/h2&gt;
&lt;p&gt;Every optimization is independently controllable:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWINKLE_NPU_PATCH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Master switch for all NPU patches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWINKLE_NPU_FUSED_OPS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fused operators (RMSNorm/RoPE/SwiGLU/SDPA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWINKLE_NPU_GMM_PATCH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;unset&lt;/td&gt;
&lt;td&gt;MoE grouped matmul (EP-aware)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWINKLE_NPU_FLA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Flash Linear Attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWINKLE_NPU_GATED_RMSNorm_FP32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Force FP32 for Gated RMSNorm&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
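&lt;p&gt;As a rough sketch of how such switches might be evaluated, the snippet below combines the master switch with the opt-in GMM flag and the EP condition. The helper names &lt;code&gt;flag&lt;/code&gt; and &lt;code&gt;should_apply_gmm_patch&lt;/code&gt; are hypothetical, and the set of accepted truthy strings is an assumption rather than Twinkle&amp;rsquo;s documented behavior.&lt;/p&gt;

```python
import os


def flag(name, default, env=None):
    """Read a 0/1 style switch; an unset variable falls back to the default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: these strings count as "on"; anything else counts as "off".
    return raw.strip().lower() in ("1", "true", "yes", "on")


def should_apply_gmm_patch(ep_enabled, env=None):
    # The master switch defaults to on and disables every patch when turned off.
    if not flag("TWINKLE_NPU_PATCH", True, env):
        return False
    # The GMM patch is opt-in, and is skipped without expert parallelism.
    return flag("TWINKLE_NPU_GMM_PATCH", False, env) and bool(ep_enabled)


# Reproduces the three cases from the EP-aware MoE section.
unset_case = should_apply_gmm_patch(True, env={})
ep_case = should_apply_gmm_patch(True, env={"TWINKLE_NPU_GMM_PATCH": "1"})
no_ep_case = should_apply_gmm_patch(False, env={"TWINKLE_NPU_GMM_PATCH": "1"})
```

&lt;p&gt;Passing &lt;code&gt;env&lt;/code&gt; explicitly keeps the decision testable without touching the process environment.&lt;/p&gt;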
&lt;h2 id="supported-model-families"&gt;Supported Model Families&lt;/h2&gt;
&lt;p&gt;The patching system automatically discovers and patches compatible model families:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen3&lt;/strong&gt; / &lt;strong&gt;Qwen3-MoE&lt;/strong&gt; — Full operator fusion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3.5&lt;/strong&gt; / &lt;strong&gt;Qwen3.5-MoE&lt;/strong&gt; — Full fusion + FLA + Gated RMSNorm&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-VL&lt;/strong&gt; — Full fusion + multimodal RoPE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic discovery&lt;/strong&gt; — Unknown models are scanned for compatible RMSNorm/RoPE/SwiGLU patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="getting-started"&gt;Getting Started&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Install NPU dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;pip install torch-npu mindspeed
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Training automatically uses NPU optimizations&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nv"&gt;ASCEND_RT_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0,1,2,3 torchrun --nproc_per_node&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt; train.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See the Twinkle documentation for detailed setup instructions.&lt;/p&gt;</description></item></channel></rss>