<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Infrastructure | Twinkle</title><link>https://modelscope.github.io/twinkle-web/tags/infrastructure/</link><atom:link href="https://modelscope.github.io/twinkle-web/tags/infrastructure/index.xml" rel="self" type="application/rss+xml"/><description>Infrastructure</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 03 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://modelscope.github.io/twinkle-web/media/logo_hu_fedc6a0bfe689b18.png</url><title>Infrastructure</title><link>https://modelscope.github.io/twinkle-web/tags/infrastructure/</link></image><item><title>Two Execution Modes: torchrun (Local) vs Ray (Distributed)</title><link>https://modelscope.github.io/twinkle-web/blog/torchrun-ray/</link><pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate><guid>https://modelscope.github.io/twinkle-web/blog/torchrun-ray/</guid><description>&lt;p&gt;Twinkle&amp;rsquo;s &lt;code&gt;infra&lt;/code&gt; module provides a unified programming model that runs seamlessly in two modes: &lt;strong&gt;local&lt;/strong&gt; (single-node via torchrun) and &lt;strong&gt;ray&lt;/strong&gt; (multi-node via Ray cluster). This post explains the architecture, the decorator-based API, and when to use each mode.&lt;/p&gt;
&lt;h2 id="the-two-modes-at-a-glance"&gt;The Two Modes at a Glance&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Local (torchrun)&lt;/th&gt;
&lt;th&gt;Ray (Distributed)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Launch&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torchrun --nproc_per_node=N&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ray start&lt;/code&gt; + driver script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Single node, shared filesystem&lt;/td&gt;
&lt;td&gt;Multi-node cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process model&lt;/td&gt;
&lt;td&gt;One process per GPU, torch.distributed&lt;/td&gt;
&lt;td&gt;Ray actors with PlacementGroups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Quick experiments, single-machine training&lt;/td&gt;
&lt;td&gt;Production multi-node, heterogeneous resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Both modes share the &lt;strong&gt;same user code&lt;/strong&gt;: switching requires only changing the arguments to &lt;code&gt;twinkle.infra.initialize()&lt;/code&gt; (the &lt;code&gt;mode&lt;/code&gt; parameter, plus explicit &lt;code&gt;DeviceGroup&lt;/code&gt; definitions in Ray mode).&lt;/p&gt;
&lt;h2 id="initialization"&gt;Initialization&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;twinkle.infra&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;infra&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Local mode — auto-detects ranks and devices from torchrun env vars&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;infra&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Ray mode — requires explicit DeviceGroup definitions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;infra&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ray&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;DeviceGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;device_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cuda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;DeviceGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sampler&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;device_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cuda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In &lt;strong&gt;local mode&lt;/strong&gt;, Twinkle reads &lt;code&gt;WORLD_SIZE&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, and &lt;code&gt;LOCAL_RANK&lt;/code&gt; from the environment (set by torchrun) and creates a single default &lt;code&gt;DeviceGroup&lt;/code&gt; spanning all GPUs. A &lt;code&gt;DeviceMesh&lt;/code&gt; is auto-constructed with a &lt;code&gt;dp&lt;/code&gt; dimension.&lt;/p&gt;
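&lt;p&gt;As a rough illustration of the variables involved, the sketch below reads the same three values that torchrun exports. The helper name &lt;code&gt;read_torchrun_env&lt;/code&gt; and its single-process defaults are assumptions made for this example; they are not part of Twinkle&amp;rsquo;s API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import os


def read_torchrun_env(environ=None):
    """Read the rank variables that torchrun exports (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Defaults describe a plain single-process run without torchrun.
    return {
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
    }


# Simulate what rank 2 of a 4-process torchrun launch would see.
info = read_torchrun_env({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "2"})
print(info)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;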
&lt;p&gt;In &lt;strong&gt;ray mode&lt;/strong&gt;, &lt;code&gt;RayHelper.initialize()&lt;/code&gt; creates a &lt;code&gt;ResourceManager&lt;/code&gt; that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Queries Ray cluster nodes for available GPUs/NPUs&lt;/li&gt;
&lt;li&gt;Creates &lt;code&gt;PlacementGroup&lt;/code&gt; bundles — one per node — to guarantee co-located resources&lt;/li&gt;
&lt;li&gt;Maps each logical rank to a physical GPU via &lt;code&gt;visible_devices&lt;/code&gt; discovery&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="the-decorator-api"&gt;The Decorator API&lt;/h2&gt;
&lt;p&gt;Twinkle&amp;rsquo;s key abstraction is two decorators that make any class distributed-transparent:&lt;/p&gt;
&lt;h3 id="remote_class"&gt;&lt;code&gt;@remote_class&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Wraps a class so that instantiating it either runs &lt;code&gt;__init__&lt;/code&gt; locally or creates Ray actors:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@infra.remote_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;all&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_mesh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DeviceMesh&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In local mode, &lt;code&gt;__init__&lt;/code&gt; runs normally. In Ray mode, &lt;code&gt;RayHelper.create_workers()&lt;/code&gt; spawns one Ray actor per GPU rank in the specified &lt;code&gt;DeviceGroup&lt;/code&gt;, each with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Isolated &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; pointing to its assigned physical GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MASTER_ADDR&lt;/code&gt; / &lt;code&gt;MASTER_PORT&lt;/code&gt; for torch.distributed init&lt;/li&gt;
&lt;li&gt;Proper &lt;code&gt;WORLD_SIZE&lt;/code&gt; / &lt;code&gt;RANK&lt;/code&gt; environment variables&lt;/li&gt;
&lt;/ul&gt;
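&lt;p&gt;To make the per-actor environment concrete, here is a minimal sketch of how such variables could be assembled for each rank. &lt;code&gt;build_worker_env&lt;/code&gt;, its arguments, the address, and the port are all hypothetical, and the use of group-local ranks is an assumption; &lt;code&gt;RayHelper.create_workers()&lt;/code&gt; may compute these values differently.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def build_worker_env(rank, world_size, physical_gpu, master_addr, master_port):
    """Assemble the environment for one worker (hypothetical helper)."""
    return {
        # Each worker sees exactly one physical GPU.
        "CUDA_VISIBLE_DEVICES": str(physical_gpu),
        # Rendezvous address shared by every worker in the group.
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


# A four-worker group placed on physical GPUs 4..7 of one node.
# Assumption: ranks are numbered from 0 within the group.
envs = [
    build_worker_env(rank=i, world_size=4, physical_gpu=gpu,
                     master_addr="10.0.0.1", master_port=29500)
    for i, gpu in enumerate(range(4, 8))
]
print(envs[0]["CUDA_VISIBLE_DEVICES"], envs[0]["RANK"])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;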
&lt;h3 id="remote_function"&gt;&lt;code&gt;@remote_function&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Wraps methods with dispatch, execution, and collection semantics:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@infra.remote_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dispatch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;slice_dp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Three knobs control distributed behavior:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;dispatch&lt;/strong&gt; — how arguments are split across workers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;'all'&lt;/code&gt;: Every worker receives the same arguments&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'slice'&lt;/code&gt;: Arguments are evenly partitioned across workers&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'slice_dp'&lt;/code&gt;: Arguments are partitioned along the data-parallel dimension of the DeviceMesh (EP-aware, i.e. it accounts for expert parallelism)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;execute&lt;/strong&gt; — which workers run:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;'all'&lt;/code&gt;: All workers (default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'first'&lt;/code&gt;: Only the first worker&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'peer'&lt;/code&gt;: Only peer workers (for inter-group communication)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;collect&lt;/strong&gt; — how results are aggregated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;'none'&lt;/code&gt;: Return raw list of results&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'mean'&lt;/code&gt; / &lt;code&gt;'sum'&lt;/code&gt;: Reduce numerically&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'first'&lt;/code&gt;: Return first worker&amp;rsquo;s result&lt;/li&gt;
&lt;li&gt;&lt;code&gt;'last_pp'&lt;/code&gt;: Return results from the last pipeline-parallel stage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Callable&lt;/code&gt;: Custom aggregation function&lt;/li&gt;
&lt;/ul&gt;
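&lt;p&gt;The dispatch and collect options can be pictured with plain Python. The sketch below is not Twinkle&amp;rsquo;s implementation: &lt;code&gt;slice_args&lt;/code&gt;, &lt;code&gt;collect_mean&lt;/code&gt;, and the stand-in worker function are hypothetical, and it only mimics &lt;code&gt;dispatch='slice'&lt;/code&gt; followed by &lt;code&gt;collect='mean'&lt;/code&gt; on a list-shaped batch.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def slice_args(batch, num_workers):
    """Split a list into num_workers contiguous, near-equal chunks."""
    base, extra = divmod(len(batch), num_workers)
    # Worker i starts after i full chunks plus one extra item for each
    # earlier worker that received a leftover element.
    bounds = [i * base + min(i, extra) for i in range(num_workers + 1)]
    return [batch[bounds[i]:bounds[i + 1]] for i in range(num_workers)]


def collect_mean(results):
    """Average every key over the per-worker result dicts."""
    keys = results[0].keys()
    return {k: sum(r[k] for r in results) / len(results) for k in keys}


def fake_train_step(chunk):
    # Stand-in for a worker-side method: the "loss" is just the chunk mean.
    return {"loss": sum(chunk) / len(chunk)}


batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
chunks = slice_args(batch, num_workers=4)           # dispatch
results = [fake_train_step(c) for c in chunks]      # execute on every worker
print(collect_mean(results))                        # collect
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Passing a different function in place of &lt;code&gt;collect_mean&lt;/code&gt; corresponds to the &lt;code&gt;Callable&lt;/code&gt; option listed above.&lt;/p&gt;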
&lt;h2 id="lazycollect-deferred-result-aggregation"&gt;LazyCollect: Deferred Result Aggregation&lt;/h2&gt;
&lt;p&gt;A key optimization in Ray mode is &lt;strong&gt;LazyCollect&lt;/strong&gt;. Instead of blocking on &lt;code&gt;ray.get()&lt;/code&gt; immediately after each remote call, results are wrapped in a &lt;code&gt;LazyCollect&lt;/code&gt; callable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# returns LazyCollect (non-blocking)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# ... do other work ...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;actual_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# blocks only when value is needed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This enables overlapping computation and communication — the driver can dispatch work to multiple groups (model, sampler, processor) and only block when results are actually consumed.&lt;/p&gt;
&lt;p&gt;LazyCollect also supports &lt;code&gt;__iter__&lt;/code&gt; and &lt;code&gt;__len__&lt;/code&gt;, making it transparent to most consumer code.&lt;/p&gt;
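&lt;p&gt;The idea of deferring the blocking call can be sketched without Ray. In the example below, &lt;code&gt;concurrent.futures&lt;/code&gt; futures stand in for Ray object references, and &lt;code&gt;LazyResult&lt;/code&gt; is a hypothetical class written for illustration; it is not the actual &lt;code&gt;LazyCollect&lt;/code&gt; source.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;from concurrent.futures import ThreadPoolExecutor


class LazyResult:
    """Defer waiting on futures until the value is used (illustrative only)."""

    def __init__(self, futures, collect):
        self._futures = futures
        self._collect = collect
        self._value = None
        self._done = False

    def __call__(self):
        # Block on first access, then cache the collected value.
        if not self._done:
            self._value = self._collect([f.result() for f in self._futures])
            self._done = True
        return self._value

    def __iter__(self):
        return iter(self())

    def __len__(self):
        return len(self())


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(pow, 2, n) for n in (3, 4)]
    lazy = LazyResult(futures, collect=list)  # returns immediately
    # ... the driver could submit more work here ...
    values = lazy()  # blocks only now

print(values, len(lazy))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;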
&lt;h2 id="resourcemanager-gpu-allocation"&gt;ResourceManager: GPU Allocation&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ResourceManager&lt;/code&gt; handles the complexity of GPU-to-node mapping:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Node discovery&lt;/strong&gt; — Queries Ray for all live nodes and their GPU counts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PlacementGroup creation&lt;/strong&gt; — Creates one PG per node with &lt;code&gt;{GPU: N, CPU: node_cpu//2}&lt;/code&gt; bundles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU mapping&lt;/strong&gt; — Discovers actual &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; on each node to correctly map logical ranks to physical GPUs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-accelerator support&lt;/strong&gt; — Works with GPU, NPU, and other accelerators via &lt;code&gt;Platform&lt;/code&gt; abstraction. Uses &lt;code&gt;RAY_EXPERIMENTAL_NOSET_*&lt;/code&gt; env vars to prevent Ray from overriding device visibility&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU worker support&lt;/strong&gt; — Separate PlacementGroups for CPU-only processes (data processors)&lt;/li&gt;
&lt;/ol&gt;
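&lt;p&gt;Steps 2 and 3 can be sketched in a few lines of plain Python. The node dictionaries, &lt;code&gt;plan_bundles&lt;/code&gt;, and &lt;code&gt;map_ranks&lt;/code&gt; are hypothetical stand-ins chosen for this example; the real &lt;code&gt;ResourceManager&lt;/code&gt; obtains this information from Ray and may order ranks differently.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def plan_bundles(nodes):
    """One bundle per node: all of its GPUs and half of its CPUs."""
    return [{"GPU": n["gpus"], "CPU": n["cpus"] // 2} for n in nodes]


def map_ranks(nodes):
    """Map each logical rank to (node index, physical device id)."""
    mapping = {}
    rank = 0
    for node_idx, node in enumerate(nodes):
        # visible_devices lists the physical ids exposed on this node,
        # for example parsed from CUDA_VISIBLE_DEVICES="4,5,6,7".
        for device_id in node["visible_devices"]:
            mapping[rank] = (node_idx, device_id)
            rank += 1
    return mapping


nodes = [
    {"gpus": 4, "cpus": 32, "visible_devices": [0, 1, 2, 3]},
    {"gpus": 4, "cpus": 64, "visible_devices": [4, 5, 6, 7]},
]
bundles = plan_bundles(nodes)
rank_map = map_ranks(nodes)
print(bundles)
print(rank_map[5])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;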
&lt;h2 id="device-placement-visualization"&gt;Device Placement Visualization&lt;/h2&gt;
&lt;p&gt;Twinkle provides &lt;code&gt;get_device_placement()&lt;/code&gt; to render the training topology:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;╔══════════════════════════════════════════════════════════════════════════════╗
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;║ DEVICE PLACEMENT TOPOLOGY ║
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;╚══════════════════════════════════════════════════════════════════════════════╝
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;┌──────────────────────────────────────────────────────────────────────────────┐
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ◈ DeviceGroup: model │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;├──────────────────────────────────────────────────────────────────────────────┤
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ├─ Device Type : cuda │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ └─ Ranks : [0, 1, 2, 3] │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ ┌─ DeviceMesh: MyModel │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ Dimensions : dp=4 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;│ │ Parallelism: DP=4 │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;└──────────────────────────────────────────────────────────────────────────────┘
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="error-handling-and-notifications"&gt;Error Handling and Notifications&lt;/h2&gt;
&lt;p&gt;Remote functions automatically capture the &lt;strong&gt;driver-side call site&lt;/strong&gt; and attach it to any exception raised inside workers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;[twinkle driver caller: train.py:42] CUDA out of memory
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;An optional &lt;code&gt;notifier&lt;/code&gt; (e.g. DingTalk webhook) can be passed to &lt;code&gt;initialize()&lt;/code&gt; to receive alerts when any remote function fails — useful for long-running distributed jobs.&lt;/p&gt;
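&lt;p&gt;One way to attach a call site to an exception is shown below. This is a single-process sketch under stated assumptions: the decorator name &lt;code&gt;with_caller&lt;/code&gt; and the shortened message prefix are hypothetical, and Twinkle&amp;rsquo;s own mechanism for carrying the call site across Ray workers is not shown here.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;import functools
import inspect
import os


def with_caller(func):
    """Prefix exceptions with the caller's file and line (sketch)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Frame 0 is this wrapper; frame 1 is whoever called it.
        frame = inspect.stack()[1]
        site = f"{os.path.basename(frame.filename)}:{frame.lineno}"
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            raise RuntimeError(f"[driver caller: {site}] {exc}") from exc

    return wrapper


@with_caller
def train_step():
    # Stand-in for a failure raised inside a worker.
    raise MemoryError("CUDA out of memory")


try:
    train_step()
except RuntimeError as err:
    message = str(err)
    print(message)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;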
&lt;h2 id="when-to-use-which-mode"&gt;When to Use Which Mode&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use local mode when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single machine with 1-8 GPUs&lt;/li&gt;
&lt;li&gt;Quick prototyping and debugging&lt;/li&gt;
&lt;li&gt;Simple data-parallel training&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Ray mode when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-node clusters&lt;/li&gt;
&lt;li&gt;Heterogeneous resource allocation (model GPUs + sampler GPUs + CPU processors)&lt;/li&gt;
&lt;li&gt;Production training with fault-tolerance requirements&lt;/li&gt;
&lt;li&gt;Multi-model deployments (training + inference in the same cluster)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The beauty of Twinkle&amp;rsquo;s design is that your training code stays the same — only the &lt;code&gt;initialize()&lt;/code&gt; call changes.&lt;/p&gt;</description></item></channel></rss>