<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Megatron | Twinkle</title><link>https://modelscope.github.io/twinkle-web/tags/megatron/</link><atom:link href="https://modelscope.github.io/twinkle-web/tags/megatron/index.xml" rel="self" type="application/rss+xml"/><description>Megatron</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://modelscope.github.io/twinkle-web/media/logo_hu_fedc6a0bfe689b18.png</url><title>Megatron</title><link>https://modelscope.github.io/twinkle-web/tags/megatron/</link></image><item><title>Multi-LoRA: Concurrent Multi-Tenant Training on Shared GPUs</title><link>https://modelscope.github.io/twinkle-web/blog/multi-lora/</link><pubDate>Mon, 01 Jun 2026 00:00:00 +0000</pubDate><guid>https://modelscope.github.io/twinkle-web/blog/multi-lora/</guid><description>&lt;p&gt;Twinkle&amp;rsquo;s Multi-LoRA architecture enables multiple tenants to train independent LoRA adapters on a &lt;strong&gt;single shared model&lt;/strong&gt; simultaneously. This post explains the technical design, covering both the Transformers and Megatron backends.&lt;/p&gt;
&lt;h2 id="why-multi-lora"&gt;Why Multi-LoRA?&lt;/h2&gt;
&lt;p&gt;Traditional LoRA training loads a full base model per user. For a 70B model this means ~140 GB of GPU memory per tenant — an enormous waste when the frozen base weights are identical across all users. Multi-LoRA solves this by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sharing the base model&lt;/strong&gt;: All tenants share one copy of frozen base weights.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-allocating adapter slots&lt;/strong&gt;: A fixed pool of LoRA adapter slots (&lt;code&gt;max_loras × max_r&lt;/code&gt;) is allocated at initialization, avoiding runtime memory fragmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic tenant switching&lt;/strong&gt;: Tenants acquire/release adapters on-the-fly with near-zero context-switch overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="architecture-overview"&gt;Architecture Overview&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;┌──────────────────────────────────────────┐&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Shared&lt;/span&gt; &lt;span class="n"&gt;Base&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Frozen&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loaded&lt;/span&gt; &lt;span class="n"&gt;once&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──────────────────────────────────────────┤&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;MultiLora&lt;/span&gt; &lt;span class="n"&gt;Manager&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;┌────────┐&lt;/span&gt; &lt;span class="err"&gt;┌────────┐&lt;/span&gt; &lt;span class="err"&gt;┌────────┐&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Slot&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Slot&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Slot&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Free&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;└────────┘&lt;/span&gt; &lt;span class="err"&gt;└────────┘&lt;/span&gt; &lt;span class="err"&gt;└────────┘&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;├──────────────────────────────────────────┤&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Per&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LR&lt;/span&gt; &lt;span class="n"&gt;Scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gradient&lt;/span&gt; &lt;span class="n"&gt;Accumulation&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="err"&gt;└──────────────────────────────────────────┘&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;MultiLora&lt;/code&gt; class manages the lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;patch(model)&lt;/code&gt;&lt;/strong&gt; — Patches every &lt;code&gt;LoLayer&lt;/code&gt; forward method to iterate over active adapters, applying LoRA weights with proper scaling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;acquire_lora(tenant, config)&lt;/code&gt;&lt;/strong&gt; — Assigns a pre-allocated slot to a tenant with the given &lt;code&gt;LoraConfig&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;adapter(name)&lt;/code&gt;&lt;/strong&gt; — Context manager that activates a specific adapter for forward/backward passes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;release_lora(tenant)&lt;/code&gt;&lt;/strong&gt; — Restores initial weights and returns the slot to the free pool.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="transformers-backend"&gt;Transformers Backend&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;MultiLoraTransformersModel&lt;/code&gt; wraps the standard &lt;code&gt;TransformersModel&lt;/code&gt; with per-adapter isolation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MultiLoraTransformersModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Qwen/Qwen3.5-72B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_loras&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Tenant A registers their adapter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_adapter_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;all-linear&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_optimizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Tenant B registers independently&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_adapter_to_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;all-linear&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_optimizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Each tenant trains independently — gradients are isolated&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward_backward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clip_grad_and_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tenant_a&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Key design choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Optimizer Groups&lt;/strong&gt;: Each adapter has its own optimizer, LR scheduler, and gradient accumulation settings stored in an &lt;code&gt;OptimizerGroup&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context-switched forward&lt;/strong&gt;: Every &lt;code&gt;forward_backward&lt;/code&gt;, &lt;code&gt;step&lt;/code&gt;, and &lt;code&gt;zero_grad&lt;/code&gt; call is wrapped with &lt;code&gt;self.multi_adapter.adapter(name)&lt;/code&gt; to ensure gradient isolation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Independent checkpointing&lt;/strong&gt;: &lt;code&gt;save()&lt;/code&gt; extracts only the active adapter&amp;rsquo;s state dict, so tenants never see each other&amp;rsquo;s weights.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="megatron-backend"&gt;Megatron Backend&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;MultiLoraMegatronModel&lt;/code&gt; extends Megatron&amp;rsquo;s tensor/pipeline parallel training with multi-tenant support. The key challenge is that Megatron uses a &lt;strong&gt;distributed optimizer&lt;/strong&gt; that sees all parameters — but we need per-adapter gradient isolation.&lt;/p&gt;
&lt;p&gt;The solution: &lt;strong&gt;&lt;code&gt;optimizer_context&lt;/code&gt; manager&lt;/strong&gt; that temporarily replaces &lt;code&gt;named_parameters()&lt;/code&gt; on each pipeline-parallel module, filtering to only yield parameters matching the active adapter&amp;rsquo;s regex pattern:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@contextmanager&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimizer_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;\.lora_\w+\.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adapter_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;\.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;orig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_filtered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="c1"&gt;# restore original named_parameters&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This ensures the optimizer only updates the target adapter&amp;rsquo;s LoRA weights, even in a distributed setting with TP/PP sharding.&lt;/p&gt;
&lt;p&gt;Additional Megatron-specific features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per-rank optimizer checkpointing&lt;/strong&gt;: Each rank saves its own optimizer state, enabling efficient multi-GPU resume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HF + Megatron format export&lt;/strong&gt;: Save adapters in either HuggingFace PEFT format or native Megatron format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RNG state isolation&lt;/strong&gt;: Global RNG is intentionally &lt;em&gt;not&lt;/em&gt; restored when loading a tenant checkpoint to avoid silently affecting other active tenants&amp;rsquo; dropout behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="performance"&gt;Performance&lt;/h2&gt;
&lt;p&gt;By sharing base model weights across tenants, Multi-LoRA reduces GPU memory usage proportionally:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tenants&lt;/th&gt;
&lt;th&gt;Traditional (N × full model)&lt;/th&gt;
&lt;th&gt;Multi-LoRA (1 model + N adapters)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;140 GB&lt;/td&gt;
&lt;td&gt;140 GB + 0.1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;700 GB&lt;/td&gt;
&lt;td&gt;140 GB + 0.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1400 GB&lt;/td&gt;
&lt;td&gt;140 GB + 1.0 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Estimates for a 70B model with LoRA r=16.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="getting-started"&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;See the
for a complete example.&lt;/p&gt;</description></item></channel></rss>