Distributed Training

Sequence Parallel & Ring Attention: Training with Ultra-Long Contexts

Modern LLMs demand ever-longer context windows: 128K, 256K, even 1M tokens. A single GPU cannot hold sequences that long in memory. Twinkle’s Sequence Parallel module solves this …

Jun 22, 2026 • 5 min read

Infrastructure

Two Execution Modes: torchrun (Local) vs Ray (Distributed)

Twinkle’s infra module provides a unified programming model that runs in two modes: local (single-node, via torchrun) and ray (multi-node, via a Ray cluster). This post …

Jun 3, 2026 • 3 min read

