Modern LLMs demand ever-longer context windows: 128K, 256K, even 1M tokens. At those lengths, a single GPU cannot hold the full sequence in memory. Twinkle’s Sequence Parallel module addresses this by splitting the sequence dimension across multiple devices, combining Ulysses-style All-to-All parallelism with ZigZag Ring Attention to achieve near-linear scaling.
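To make the "zigzag" part concrete, here is a minimal sketch of how a zigzag split typically assigns tokens to devices. It illustrates the general technique only: the function names `zigzag_shard` and `causal_work` are hypothetical and are not Twinkle's API. The sequence is cut into `2 * world_size` chunks, and each rank takes one chunk from the front and its mirror from the back, so that the causal-attention workload is balanced across ranks.

```python
def zigzag_shard(seq_len, world_size):
    """Return, for each rank, the token positions it holds under a zigzag split.

    Hypothetical helper for illustration; not part of any library.
    """
    num_chunks = 2 * world_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * world_size")
    chunk = seq_len // num_chunks
    shards = []
    for rank in range(world_size):
        # One chunk from the front of the sequence...
        front = range(rank * chunk, (rank + 1) * chunk)
        # ...paired with its mirror chunk from the back.
        mirror = num_chunks - 1 - rank
        back = range(mirror * chunk, (mirror + 1) * chunk)
        shards.append(list(front) + list(back))
    return shards


def causal_work(positions):
    """Count (query, key) pairs under a causal mask: position p attends to p + 1 keys."""
    return sum(p + 1 for p in positions)


shards = zigzag_shard(seq_len=16, world_size=4)
for rank, positions in enumerate(shards):
    print(rank, positions, causal_work(positions))
```

With a plain contiguous split, the last rank would hold only late positions and do far more causal-attention work than the first. Pairing early and late chunks evens this out, which is the usual motivation for the zigzag layout.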
Twinkle provides first-class support for Huawei Ascend NPU through a comprehensive monkey-patching system that replaces standard CUDA operators with NPU-optimized fused kernels. This post covers the kernel architecture and the optimizations it enables.
Twinkle’s infra module provides a unified programming model that runs seamlessly in two modes: local (single-node via torchrun) and ray (multi-node via Ray cluster). This post explains the architecture, the decorator-based API, and when to use each mode.
Twinkle ships a terminal-based UI (TUI) powered by an embedded LLM agent that can autonomously start, monitor, pause, and debug ML training runs. This post covers the architecture of the TUI, the agent loop, and the tool system that makes “auto-research” possible.
Twinkle’s Multi-LoRA architecture enables multiple tenants to train independent LoRA adapters on a single shared model simultaneously. This post explains the technical design, covering both the Transformers and Megatron backends.
Twinkle’s envs module bridges the gap between asynchronous external environments (code sandboxes, web browsers, game engines) and synchronous RL training loops. This post explains the Env abstraction, the EnvTool adapter, and the OpenEnv WebSocket client.
We’re excited to announce that Twinkle Training-as-a-Service (TaaS) is now available on ModelScope! Developers can try Twinkle’s training API for free, with no GPU cluster required.