Blog

Sequence Parallel & Ring Attention: Training with Ultra-Long Contexts

Modern LLMs demand ever-longer context windows: 128K, 256K, even 1M tokens. A single GPU cannot hold sequences that long in memory. Twinkle’s Sequence Parallel module solves this by splitting the sequence dimension across multiple devices, combining Ulysses-style All-to-All parallelism with ZigZag Ring Attention to achieve near-linear scaling.

Ascend NPU Support: Fused Operators and Flash Linear Attention

Twinkle provides first-class support for Huawei Ascend NPU through a comprehensive monkey-patching system that replaces standard CUDA operators with NPU-optimized fused kernels. This post covers the kernel architecture and the optimizations it enables.

Two Execution Modes: torchrun (Local) vs Ray (Distributed)

Twinkle’s infra module provides a unified programming model that runs seamlessly in two modes: local (single-node via torchrun) and ray (multi-node via a Ray cluster). This post explains the architecture, the decorator-based API, and when to use each mode.

TUI & Auto-Research: An AI Agent for Training Control

Twinkle ships a terminal-based UI (TUI) powered by an embedded LLM agent that can autonomously start, monitor, pause, and debug ML training runs. This post covers the architecture of the TUI, the agent loop, and the tool system that makes “auto-research” possible.

Multi-LoRA: Concurrent Multi-Tenant Training on Shared GPUs

Twinkle’s Multi-LoRA architecture lets multiple tenants train independent LoRA adapters simultaneously on a single shared model. This post explains the technical design, covering both the Transformers and Megatron backends.

OpenEnv Integration: Connecting External Environments to RL Training

Twinkle’s envs module bridges the gap between asynchronous external environments (code sandboxes, web browsers, game engines) and synchronous RL training loops. This post explains the Env abstraction, the EnvTool adapter, and the OpenEnv WebSocket client.