Distributed Training

Sequence Parallel & Ring Attention: Training with Ultra-Long Contexts

Modern LLMs demand ever-longer context windows: 128K, 256K, even 1M tokens. A single GPU cannot hold the activations for sequences of that length in memory. Twinkle’s Sequence Parallel module solves this …

Two Execution Modes: torchrun (Local) vs Ray (Distributed)

Twinkle’s infra module provides a unified programming model that runs seamlessly in two modes: local (single-node via torchrun) and ray (multi-node via a Ray cluster). This post …