Multi-LoRA: Concurrent Multi-Tenant Training on Shared GPUs
Twinkle’s Multi-LoRA architecture enables multiple tenants to train independent LoRA adapters on a single shared model simultaneously. This post explains the technical design, covering both the Transformers and Megatron backends.
Why Multi-LoRA?
Traditional LoRA training loads a full base model per user. For a 70B model, that is roughly 140 GB of GPU memory per tenant for the weights alone (at 16-bit precision), which is wasteful because the frozen base weights are identical across all users. Multi-LoRA solves this by:
- Sharing the base model: All tenants share one copy of frozen base weights.
- Pre-allocating adapter slots: A fixed pool of LoRA adapter slots (max_loras × max_r) is allocated at initialization, avoiding runtime memory fragmentation.
- Dynamic tenant switching: Tenants acquire/release adapters on-the-fly with near-zero context-switch overhead.
Architecture Overview
┌──────────────────────────────────────────┐
│            Shared Base Model             │
│      (Frozen weights, loaded once)       │
├──────────────────────────────────────────┤
│            MultiLora Manager             │
│  ┌────────┐  ┌────────┐  ┌────────┐      │
│  │ Slot 0 │  │ Slot 1 │  │ Slot 2 │ ...  │
│  │Tenant A│  │Tenant B│  │  Free  │      │
│  └────────┘  └────────┘  └────────┘      │
├──────────────────────────────────────────┤
│  Per-Tenant: Optimizer, LR Scheduler,    │
│  Template, Gradient Accumulation         │
└──────────────────────────────────────────┘
The MultiLora class manages the lifecycle:
- patch(model): Patches every LoraLayer forward method to iterate over active adapters, applying LoRA weights with proper scaling.
- acquire_lora(tenant, config): Assigns a pre-allocated slot to a tenant with the given LoraConfig.
- adapter(name): Context manager that activates a specific adapter for forward/backward passes.
- release_lora(tenant): Restores initial weights and returns the slot to the free pool.
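The acquire/activate/release lifecycle can be illustrated with a small bookkeeping sketch. This is not Twinkle’s MultiLora class: SlotPool and its attributes are hypothetical names, the LoraConfig argument is omitted, and no model weights are involved. It only shows how a fixed pool of slot indices might be handed out, activated inside a context manager, and returned.

```python
from contextlib import contextmanager


class SlotPool:
    """Toy stand-in for slot bookkeeping; hypothetical, not Twinkle's MultiLora."""

    def __init__(self, max_loras: int):
        self.free = list(range(max_loras))  # slot indices allocated up front
        self.owner = {}                     # tenant name -> slot index
        self.active = None                  # slot currently used by forward/backward

    def acquire_lora(self, tenant: str) -> int:
        if tenant in self.owner:
            raise ValueError(f"{tenant} already holds a slot")
        if not self.free:
            raise RuntimeError("no free adapter slots")
        slot = self.free.pop(0)
        self.owner[tenant] = slot
        return slot

    def release_lora(self, tenant: str) -> None:
        slot = self.owner.pop(tenant)
        # A real implementation would also reset the slot's weights here.
        self.free.append(slot)

    @contextmanager
    def adapter(self, tenant: str):
        # Activate the tenant's slot, then restore whatever was active before.
        previous, self.active = self.active, self.owner[tenant]
        try:
            yield self.active
        finally:
            self.active = previous


pool = SlotPool(max_loras=2)
slot_a = pool.acquire_lora('tenant_a')
slot_b = pool.acquire_lora('tenant_b')
with pool.adapter('tenant_b') as active_slot:
    inside = pool.active
pool.release_lora('tenant_a')
slot_c = pool.acquire_lora('tenant_c')  # reuses the slot tenant_a gave back
```

Because the pool never grows, a request that arrives when every slot is taken fails immediately instead of allocating new memory, which is in keeping with the pre-allocation goal described earlier.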
Transformers Backend
MultiLoraTransformersModel wraps the standard TransformersModel with per-adapter isolation:
model = MultiLoraTransformersModel(model_id='Qwen/Qwen3.5-72B', max_loras=5)
# Tenant A registers their adapter
model.add_adapter_to_model('tenant_a', LoraConfig(r=16, target_modules='all-linear'))
model.set_optimizer(optimizer_cls=Adam, lr=1e-4, adapter_name='tenant_a')
# Tenant B registers independently
model.add_adapter_to_model('tenant_b', LoraConfig(r=8, target_modules='all-linear'))
model.set_optimizer(optimizer_cls=Adam, lr=2e-4, adapter_name='tenant_b')
# Each tenant trains independently — gradients are isolated
model.forward_backward(inputs=batch_a, adapter_name='tenant_a')
model.clip_grad_and_step(adapter_name='tenant_a')
Key design choices:
- Optimizer Groups: Each adapter has its own optimizer, LR scheduler, and gradient accumulation settings stored in an OptimizerGroup.
- Context-switched forward: Every forward_backward, step, and zero_grad call is wrapped with self.multi_adapter.adapter(name) to ensure gradient isolation.
- Independent checkpointing: save() extracts only the active adapter’s state dict, so tenants never see each other’s weights.
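To make the per-adapter state idea concrete, here is a minimal sketch of how independent gradient accumulation could be tracked per tenant. The post names OptimizerGroup but does not show its fields, so every field below, as well as the ToyTrainer class, is an assumption made for illustration; no real optimizer is involved.

```python
from dataclasses import dataclass


@dataclass
class OptimizerGroup:
    """Hypothetical fields; the real OptimizerGroup is not shown in the post."""
    lr: float
    gradient_accumulation_steps: int = 1
    micro_batches: int = 0    # backward passes since the last optimizer step
    optimizer_steps: int = 0


class ToyTrainer:
    def __init__(self):
        self.groups = {}      # adapter name -> OptimizerGroup

    def set_optimizer(self, adapter_name, lr, gradient_accumulation_steps=1):
        self.groups[adapter_name] = OptimizerGroup(lr, gradient_accumulation_steps)

    def forward_backward(self, adapter_name):
        # Only the named adapter's counters change; other tenants are untouched.
        self.groups[adapter_name].micro_batches += 1

    def step(self, adapter_name) -> bool:
        group = self.groups[adapter_name]
        if group.micro_batches < group.gradient_accumulation_steps:
            return False      # still accumulating for this tenant
        group.micro_batches = 0
        group.optimizer_steps += 1
        return True


trainer = ToyTrainer()
trainer.set_optimizer('tenant_a', lr=1e-4, gradient_accumulation_steps=2)
trainer.set_optimizer('tenant_b', lr=2e-4)
trainer.forward_backward('tenant_a')
stepped_early = trainer.step('tenant_a')  # one of two micro-batches seen
trainer.forward_backward('tenant_a')
stepped_late = trainer.step('tenant_a')   # accumulation complete
```

Keying all state by adapter name means tenant_b’s counters stay at zero while tenant_a trains, mirroring the isolation the design choices above describe.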
Megatron Backend
MultiLoraMegatronModel extends Megatron’s tensor/pipeline parallel training with multi-tenant support. The key challenge is that Megatron uses a distributed optimizer that sees all parameters — but we need per-adapter gradient isolation.
The solution is an optimizer_context context manager that temporarily replaces named_parameters() on each pipeline-parallel module so that it yields only the parameters matching the active adapter’s regex pattern:
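@contextmanager
def optimizer_context(self, adapter_name: str):
    # Match only the LoRA parameters that belong to the active adapter
    pattern = re.compile(rf'\.lora_\w+\.{re.escape(adapter_name)}\.')
    originals = []
    for module in self.model:
        originals.append(module.named_parameters)
        module.named_parameters = make_filtered(module.named_parameters, pattern)
    try:
        yield
    finally:
        # Restore the original named_parameters on every module
        for module, orig in zip(self.model, originals):
            module.named_parameters = orig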
This ensures the optimizer only updates the target adapter’s LoRA weights, even in a distributed setting with TP/PP sharding.
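The make_filtered helper referenced in the snippet is not shown in the post. One possible shape for it is sketched below, assuming it simply wraps a named_parameters-style callable and drops non-matching names. FakeModule and its parameter names are made up for the demonstration; real modules would be torch.nn.Module instances holding tensors.

```python
import re
from typing import Callable, Iterator, Tuple


def make_filtered(orig: Callable[..., Iterator[Tuple[str, object]]],
                  pattern: re.Pattern) -> Callable[..., Iterator[Tuple[str, object]]]:
    """Wrap a named_parameters-style callable so it yields only matching names."""
    def filtered(*args, **kwargs):
        for name, param in orig(*args, **kwargs):
            if pattern.search(name):
                yield name, param
    return filtered


class FakeModule:
    """Stand-in for one model chunk; names and values are illustrative only."""

    def named_parameters(self):
        yield 'decoder.layers.0.linear_qkv.lora_A.tenant_a.weight', 'A_a'
        yield 'decoder.layers.0.linear_qkv.lora_B.tenant_a.weight', 'B_a'
        yield 'decoder.layers.0.linear_qkv.lora_A.tenant_b.weight', 'A_b'
        yield 'decoder.layers.0.linear_qkv.weight', 'frozen'


module = FakeModule()
pattern = re.compile(rf'\.lora_\w+\.{re.escape("tenant_a")}\.')
module.named_parameters = make_filtered(module.named_parameters, pattern)
visible = [name for name, _ in module.named_parameters()]
```

With the filter installed, only tenant_a’s two LoRA entries are visible; the base weight and tenant_b’s adapter are hidden from anything that enumerates parameters through named_parameters().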
Additional Megatron-specific features:
- Per-rank optimizer checkpointing: Each rank saves its own optimizer state, enabling efficient multi-GPU resume.
- HF + Megatron format export: Save adapters in either HuggingFace PEFT format or native Megatron format.
- RNG state isolation: Global RNG is intentionally not restored when loading a tenant checkpoint to avoid silently affecting other active tenants’ dropout behavior.
Performance
By sharing base model weights across tenants, Multi-LoRA keeps the base-model memory cost constant; each additional tenant adds only the memory of its own adapter:
| Tenants | Traditional (N × full model) | Multi-LoRA (1 model + N adapters) |
|---|---|---|
| 1 | 140 GB | 140 GB + 0.1 GB |
| 5 | 700 GB | 140 GB + 0.5 GB |
| 10 | 1400 GB | 140 GB + 1.0 GB |
Estimates for a 70B model with LoRA r=16.
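The arithmetic behind the table can be reproduced in a few lines. The two constants are taken directly from the table; the 140 GB figure is consistent with 70B parameters stored at 2 bytes each, which is an inference rather than something the post states. Optimizer state and activations are not included.

```python
BASE_GB = 140.0     # base-model weights, as listed in the table
ADAPTER_GB = 0.1    # per-adapter estimate from the table (LoRA r=16)


def traditional_gb(tenants: int) -> float:
    # One full copy of the base model per tenant.
    return tenants * BASE_GB


def multi_lora_gb(tenants: int) -> float:
    # One shared base model plus one adapter per tenant.
    return BASE_GB + tenants * ADAPTER_GB


for n in (1, 5, 10):
    saved = traditional_gb(n) - multi_lora_gb(n)
    print(f"{n:>2} tenants: {traditional_gb(n):7.1f} GB vs "
          f"{multi_lora_gb(n):6.1f} GB (saves {saved:.1f} GB)")
```

The gap widens linearly with the number of tenants because the traditional cost scales with the full model size while the shared setup scales only with the adapter size.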
Getting Started
See the Multi-LoRA DPO Cookbook for a complete example.