Server & Client

Twinkle provides a complete HTTP Server/Client architecture for deploying models as services and calling them remotely for training and inference.

Core Concepts

The architecture decouples model hosting (Server) and training logic (Client):

  • Server: Deployed with Ray Serve, hosts model weights and handles forward/backward, sampling, and weight management
  • Client: Runs locally, handles data preparation, training loop, and hyperparameter configuration

┌──────────────────┐          HTTP          ┌──────────────────────────┐
│      Client      │ ◄───────────────────► │         Server           │
│  ┌────────────┐  │                       │  ┌────────────────────┐  │
│  │  Dataset   │  │     Data + Commands   │  │    Base Model      │  │
│  │  Template  │  │ ──────────────────►   │  ├────────────────────┤  │
│  │  Optimizer │  │                       │  │ LoRA A │ LoRA B │..│  │
│  └────────────┘  │  ◄──────────────────  │  └────────────────────┘  │
│                  │  Gradients + Metrics  │                          │
└──────────────────┘                       └──────────────────────────┘

Two Model Backends

Backend      | use_megatron | Description
-------------|--------------|-------------------------------------------------------------------------
Transformers | false        | HuggingFace Transformers; suitable for most scenarios
Megatron     | true         | Megatron-LM; for ultra-large-scale models with advanced parallelization
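
Switching backends is a single flag: set use_megatron in the server config and the client code does not need to change. The YAML examples under Server Configuration below show one setup for each backend.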

Two Client Modes

Client         | Initialization      | Description
---------------|---------------------|-------------------------------------------------------------------------------------
Twinkle Client | init_twinkle_client | Native client; change "from twinkle import ..." to "from twinkle_client import ..."
Tinker Client  | init_tinker_client  | Patches the Tinker SDK so existing Tinker training code can be reused as-is
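
Both modes come down to a one-time initialization call before the rest of your imports. The sketch below condenses the fuller examples later in this section; the two modes are alternatives and (as far as these docs show) are not meant to be combined in one process.

# Option A: Twinkle Client (native API)
from twinkle_client import init_twinkle_client
init_twinkle_client()

# Option B: Tinker Client (patches the Tinker SDK)
from twinkle import init_tinker_client
init_tinker_client()
from tinker import ServiceClient, types  # existing Tinker code now targets the Twinkle server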

How to Choose

Scenario                                  | Recommendation
------------------------------------------|----------------------------------------------
Existing Twinkle local code, going remote | Twinkle Client: just change the imports
Existing Tinker code you want to reuse    | Tinker Client: only the init patch is needed
Starting a new project                    | Twinkle Client: simpler API

Server Configuration

Basic Server Setup

Create server_config.yaml:

model:
  model_id: Qwen/Qwen3.5-4B
  use_megatron: false
  torch_dtype: bfloat16

server:
  host: 0.0.0.0
  port: 8000
  num_replicas: 1

ray:
  num_gpus: 4
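
In this config, use_megatron: false selects the Transformers backend, num_replicas is the number of Ray Serve replicas hosting the model, and ray.num_gpus is the GPU budget handed to the Ray cluster.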

Start the server:

# server.py
from twinkle.server import TwinkleServer

server = TwinkleServer.from_config('server_config.yaml')
server.run()

Then launch it:

python server.py
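
Once the Ray Serve deployment is up, the server listens on 0.0.0.0:8000 per the config above, and clients connect with base_url='http://localhost:8000' as in the examples below.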

Megatron Backend

For ultra-large models with tensor/pipeline parallelism:

model:
  model_id: Qwen/Qwen3.5-72B
  use_megatron: true
  torch_dtype: bfloat16
  tensor_parallel_size: 4
  pipeline_parallel_size: 2

server:
  host: 0.0.0.0
  port: 8000

ray:
  num_gpus: 8
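
As with Megatron-LM generally, the parallel degrees must fit the GPU budget: tensor_parallel_size × pipeline_parallel_size = 4 × 2 = 8, which matches ray.num_gpus; any leftover factor would become data parallelism (here 1).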

Client Usage

Twinkle Client

import os
from twinkle_client import init_twinkle_client

# Initialize client
init_twinkle_client()

from twinkle_client import ServiceClient
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor

# Connect to server
service_client = ServiceClient(
    base_url='http://localhost:8000',
    api_key=os.environ.get('API_KEY')
)

# Create training client
training_client = service_client.create_lora_training_client(
    base_model='Qwen/Qwen3.5-4B',
    rank=16
)

# Prepare data locally
dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition'))
dataset.set_template('Template', model_id='ms://Qwen/Qwen3.5-4B')
dataset.map(SelfCognitionProcessor('My Model', 'My Team'))
dataset.encode()
dataloader = DataLoader(dataset=dataset, batch_size=8)

# Training loop: data stays local, compute runs on the server
for epoch in range(2):
    for batch in dataloader:
        # forward + backward pass on the server with a cross-entropy loss
        training_client.forward_backward(batch, "cross_entropy")
        # optimizer step on the server-side LoRA weights
        training_client.optim_step(learning_rate=1e-4)

# Save the LoRA checkpoint on the server
training_client.save_state("my-lora-checkpoint")
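
ServiceClient reads the API key from the API_KEY environment variable here, so export it in the shell before launching the script; whether the server actually enforces authentication is not shown in the config above.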

Tinker Client

For compatibility with existing Tinker code:

import os
from twinkle import init_tinker_client

# Patch Tinker SDK
init_tinker_client()

# Now use the Tinker API as usual
from tinker import ServiceClient, types

service_client = ServiceClient(
    base_url='http://localhost:8000',
    api_key=os.environ.get('API_KEY')
)

training_client = service_client.create_lora_training_client(
    base_model='Qwen/Qwen3.5-4B',
    rank=16
)

# ... rest of Tinker training code

Inference / Sampling

After training, use your LoRA for inference:

from twinkle.data_format import Message, Trajectory
from twinkle.template import Template

# Create sampling client with trained LoRA
sampling_client = service_client.create_sampling_client(
    model_path='twinkle://my-lora-checkpoint',
    base_model='Qwen/Qwen3.5-4B'
)

# Prepare prompt
template = Template(model_id='ms://Qwen/Qwen3.5-4B')
trajectory = Trajectory(
    messages=[
        Message(role='system', content='You are a helpful assistant'),
        Message(role='user', content='Who are you?'),
    ]
)

input_feature = template.encode(trajectory, add_generation_prompt=True)
# types is the Tinker SDK module (from tinker import types, as in the Tinker Client example above)
prompt = types.ModelInput.from_ints(input_feature['input_ids'].tolist())

# Sample
params = types.SamplingParams(
    max_tokens=128,
    temperature=0.7
)

result = sampling_client.sample(prompt=prompt, sampling_params=params)
print(template.decode(result.sequences[0].tokens))
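
sample returns generated sequences; result.sequences[0].tokens holds token ids, so decoding with the same Template that encoded the prompt keeps the chat formatting consistent with training.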

Cookbook Examples

Complete examples in cookbook/client/:

cookbook/client/
├── server/                         # Server configurations
│   ├── transformer/
│   │   ├── server.py
│   │   └── server_config.yaml
│   └── megatron/
│       ├── server.py
│       └── server_config.yaml
├── twinkle/                        # Twinkle Client examples
│   ├── self_host/
│   │   ├── grpo.py                 # GRPO training
│   │   ├── sample.py               # Inference
│   │   └── self_cognition.py       # SFT training
│   └── modelscope/
│       └── self_cognition.py
└── tinker/                         # Tinker Client examples
    ├── self_host/
    │   ├── lora.py
    │   ├── sample.py
    │   └── short_math_grpo.py
    └── modelscope/
        ├── sample.py
        └── self_cognition.py

Running

# 1. Start Server
python cookbook/client/server/megatron/server.py

# 2. Run Client (in another terminal)
# Tinker Client
python cookbook/client/tinker/self_host/self_cognition.py

# Or Twinkle Client
python cookbook/client/twinkle/self_host/self_cognition.py