Server and Client | Twinkle

Overview

Twinkle provides a complete HTTP Server/Client architecture that supports deploying models as services and remotely calling them through clients to complete training, inference, and other tasks. This architecture decouples model hosting (Server side) and training logic (Client side), allowing multiple users to share the same base model for training.

Core Concepts

Server side: Deployed based on Ray Serve, hosts model weights and inference/training computation. The Server is responsible for managing model loading, forward/backward propagation, weight saving, sampling inference, etc. A single Server simultaneously supports both Twinkle Client and Tinker Client connections.
Client side: Runs locally, responsible for data preparation, training loop orchestration, hyperparameter configuration, etc. The Client communicates with the Server via HTTP, sending data and commands.

Model Backends

Model loading supports three backends:

Backend	`backend` value	Description
Transformers	`transformers`	Based on HuggingFace Transformers, suitable for most scenarios
Megatron	`megatron`	Based on Megatron-LM, suitable for ultra-large-scale model training, supports more efficient parallelization strategies
Mock	`mock`	Numpy-only mock backend for CPU-only development and testing

Two Client Modes

Client	Initialization Method	Description
Twinkle Client	`init_twinkle_client`	Native client, simply change `from twinkle import` to `from twinkle_client import` to migrate local training code to remote calls
Tinker Client	`init_tinker_client`	Patches Tinker SDK, allowing existing Tinker training code to be directly reused

How to Choose

Client Mode Selection

Scenario	Recommendation
Existing Twinkle local training code, want to switch to remote	Twinkle Client — only need to change import paths
Existing Tinker training code, want to reuse	Tinker Client — only need to initialize the patch
New project	Twinkle Client — simpler API

Model Backend Selection

Scenario	Recommendation
7B/14B and other medium-small scale models	Transformers backend (`backend: transformers`)
Ultra-large-scale models requiring advanced parallelization strategies	Megatron backend (`backend: megatron`)
Rapid experimentation and prototype verification	Transformers backend (`backend: transformers`)
CPU-only development/testing	Mock backend (`backend: mock`)

Cookbook Reference

Complete runnable examples are located in the cookbook/ directory:

cookbook/
├── observability/  # Observability (Grafana + OTLP)
│   ├── docker-compose.yaml  # One-command LGTM stack
│   └── README.md
└── client/
    ├── server/  # Server startup configuration
    │   ├── transformer/  # Transformers backend
    │   │   ├── run.sh
    │   │   ├── server_config.yaml
    │   │   └── server_config_e2e.yaml
    │   ├── megatron/  # Megatron backend
    │   │   ├── run.sh
    │   │   ├── server_config.yaml
    │   │   └── server_config_4b.yaml
    │   └── mock/  # Mock backend (CPU-only quick start)
    │       └── server_config.yaml
    ├── twinkle/  # Twinkle Client examples
    │   ├── self_host/  # Self-hosted Server
    │   │   ├── dpo.py  # DPO training client
    │   │   ├── multi_modal.py  # Multi-modal training client
    │   │   ├── sample.py  # Inference sampling client
    │   │   ├── self_cognition.py  # Self-cognition training client
    │   │   └── short_math_grpo.py  # GRPO math training client
    │   └── modelscope/  # ModelScope managed service
    │       ├── dpo.py
    │       ├── multi_modal.py
    │       └── self_cognition.py
    └── tinker/  # Tinker Client examples
        ├── self_host/  # Self-hosted Server
        │   ├── dpo.py  # DPO training client
        │   ├── lora.py  # LoRA training client
        │   ├── multi_modal.py  # Multi-modal training client
        │   ├── sample.py  # Inference sampling client
        │   ├── self_cognition.py  # Self-cognition training client
        │   └── short_math_grpo.py  # GRPO math training client
        └── modelscope/  # ModelScope managed service
            ├── dpo.py
            ├── sample.py
            ├── self_cognition.py
            └── short_math_grpo.py

Running steps:

# 1. Start Server first
twinkle-server launch -c cookbook/client/server/transformer/server_config.yaml

# 2. Run Client in another terminal (Tinker Client example)
python cookbook/client/tinker/self_host/self_cognition.py

# Or use Twinkle Client
python cookbook/client/twinkle/self_host/self_cognition.py

Server

Ray Cluster Configuration

Before starting the Server, you must first start and configure the Ray nodes. Only after the Ray nodes are properly configured can the Server correctly allocate and occupy resources (GPU, CPU, etc.).

Starting Ray Nodes

A Ray cluster consists of multiple nodes, each of which can be configured with different resources. The startup steps are as follows:

1. Start the Head Node (First GPU Node)

# Stop existing Ray cluster (if any)
ray stop

# Start the Head node with GPU 0-3, 4 GPUs in total
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379

2. Start Worker Nodes

# Second GPU node, using GPU 4-7, 4 GPUs in total
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4

# CPU node (for running Processor and other CPU tasks)
ray start --address=10.28.252.9:6379 --num-gpus=0

Notes:

--head: Marks this node as the Head node (the primary node of the cluster)
--port=6379: The port the Head node listens on
--address=<IP>:<PORT>: The address for Worker nodes to connect to the Head node
--num-gpus=N: The number of GPUs available on this node
CUDA_VISIBLE_DEVICES: Restricts the GPU devices visible to this node
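
When cluster startup is scripted, the flags above can be assembled programmatically. The following is a minimal sketch: the helper name `ray_start_command` is hypothetical, and it only builds the command string (it does not launch Ray).

```python
import shlex


def ray_start_command(num_gpus, head=False, address=None, port=6379, visible_devices=None):
    """Build a `ray start` command line from the flags described above (hypothetical helper)."""
    if head == (address is not None):
        raise ValueError("pass exactly one of head=True or address=<IP>:<PORT>")
    args = ["ray", "start"]
    if head:
        # The Head node listens on the given port
        args += ["--head", f"--port={port}"]
    else:
        # Worker nodes connect to the Head node's address
        args.append(f"--address={address}")
    args.append(f"--num-gpus={num_gpus}")
    command = " ".join(shlex.quote(a) for a in args)
    if visible_devices is not None:
        # Restrict the GPU devices visible to this node
        devices = ",".join(str(d) for d in visible_devices)
        command = f"CUDA_VISIBLE_DEVICES={devices} {command}"
    return command


print(ray_start_command(4, head=True, visible_devices=[0, 1, 2, 3]))
print(ray_start_command(0, address="10.28.252.9:6379"))
```

The printed strings can be passed to a shell or a process launcher of your choice.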

3. Complete Example: 3-Node Cluster

# Stop the old cluster and start a new one
ray stop && \
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --num-gpus=4 --port=6379 && \
CUDA_VISIBLE_DEVICES=4,5,6,7 ray start --address=10.28.252.9:6379 --num-gpus=4 && \
ray start --address=10.28.252.9:6379 --num-gpus=0

This configuration starts 3 nodes:

Node 0 (Head): 4 GPUs (cards 0-3)
Node 1 (Worker): 4 GPUs (cards 4-7)
Node 2 (Worker): CPU-only node

4. Set Environment Variables

Before starting the Server, you need to set the following environment variables:

export TWINKLE_TRUST_REMOTE_CODE=0 # Whether to trust remote code (security consideration)

Node Rank in YAML Configuration

In the YAML configuration file, each component needs to occupy a separate Node.

Example configuration:

applications:
  # Model service occupies GPU 0-3 (physical card numbers)
  - name: models-Qwen3.5-4B
    route_prefix: /models/Qwen/Qwen3.5-4B
    import_path: model
    args:
      nproc_per_node: 4
      device_group:
        name: model
        ranks: 4 # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 4 # Data parallel size
        # tp_size: 1 # Tensor parallel size (optional)
        # pp_size: 1 # Pipeline parallel size (optional)
        # ep_size: 1 # Expert parallel size (optional)

  # Sampler service occupies GPU 4-5 (physical card numbers)
  - name: sampler-Qwen3.5-4B
    route_prefix: /sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      nproc_per_node: 2
      device_group:
        name: sampler
        ranks: 2 # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2 # Data parallel size

  # Processor service occupies CPU
  - name: processor
    route_prefix: /processors
    import_path: processor
    args:
      ncpu_proc_per_node: 4
      device_group:
        name: processor
        ranks: 0 # CPU index
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 4 # Data parallel size

Important notes:

The ranks configuration specifies the number of GPUs to allocate for the component
The device_mesh configuration uses parameters like dp_size, tp_size, pp_size, ep_size to define the parallelization strategy
Different components will be automatically assigned to different Nodes
Ray will automatically schedule to the appropriate Node based on resource requirements (num_gpus, num_cpus in ray_actor_options)

Startup Methods

The Server is launched via the CLI command with a YAML configuration file. Installing Twinkle registers the twinkle-server command.

Launch the Server

twinkle-server launch --config server_config.yaml

Or via the Python module:

python -m twinkle.server launch --config server_config.yaml

CLI Subcommands

Subcommand	Description
`launch`	Start the Server (blocks until shutdown)
`check-config`	Validate a config file without starting the server
`print-config`	Emit the validated, normalized config (`--format yaml|json`)
`clear persistence`	Delete persisted state from the configured backend

Common parameters:

Parameter	Description	Environment Variable
`-c, --config`	YAML configuration file path (required)	`TWINKLE_SERVER_CONFIG`
`--namespace`	Ray namespace (`launch` only)	`TWINKLE_RAY_NAMESPACE`

Examples:

# Validate config (useful in CI to catch misconfigurations)
twinkle-server check-config -c server_config.yaml

# View the fully resolved config
twinkle-server print-config -c server_config.yaml --format json

# Clear persisted state (Redis or file)
twinkle-server clear persistence -c server_config.yaml

YAML Configuration Details

The configuration file defines the complete deployment plan for the Server, including HTTP listening, application components, and resource allocation. The Server simultaneously supports both Twinkle and Tinker clients through a unified configuration file.

Complete Configuration Example (Megatron Backend)

# HTTP proxy location: EveryNode means running one proxy per Ray node (recommended for multi-node scenarios)
proxy_location: EveryNode

# HTTP listening configuration
http_options:
  host: 0.0.0.0 # Listen on all network interfaces
  port: 8000 # Service port number

# Observability: push traces/metrics/logs via OTLP
telemetry:
  enabled: true
  otlp_endpoint: http://localhost:4317

# Persistence: storage backend for ServerState (sessions, models, futures, etc.)
# mode: memory | file | redis
persistence:
  mode: file
  file_path: /tmp/twinkle_state.json

# Application list: Each entry defines a service component deployed on the Server
applications:

  # 1. TinkerCompatServer: Central API service
  # Handles client connections, training run tracking, checkpoint management, etc.
  # route_prefix uses /api/v1, compatible with both Tinker and Twinkle clients
  - name: server
    route_prefix: /api/v1
    import_path: server
    args:
      server_config:
        per_token_model_limit: 3 # Maximum number of models (adapters) per token (server-globally enforced)
      supported_models:
        - Qwen/Qwen3.5-4B
    deployments:
      - name: TinkerCompatServer
        max_ongoing_requests: 50
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1

  # 2. Model service: Hosts the base model
  # Executes forward propagation, backward propagation and other training computations
  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      backend: megatron # Model backend: transformers | megatron | mock
      model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
      max_length: 10240
      nproc_per_node: 2 # Number of GPU processes per node
      device_group: # Logical device group
        name: model
        ranks: 2 # Number of GPUs to use
        device_type: cuda
      device_mesh: # Distributed training mesh
        device_type: cuda
        dp_size: 2 # Data parallel size
      queue_config:
        rps_limit: 100 # Max requests per second
        tps_limit: 10000 # Max tokens per second per user
        max_input_tokens: 10000 # Maximum input tokens per request
      adapter_config:
        adapter_timeout: 30 # Idle adapter timeout unload time (seconds)
        adapter_max_lifetime: 36000 # Maximum adapter lifetime (seconds)
      max_loras: 1 # Maximum number of LoRA adapters per model
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 3. Sampler service: Inference sampling
  # Uses vLLM engine for inference, supports LoRA adapters
  - name: sampler-Qwen3.5-4B
    route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
    import_path: sampler
    args:
      model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
      nproc_per_node: 2 # Number of GPU processes per node
      sampler_type: vllm # Inference engine: vllm (high performance) or torch
      engine_args: # vLLM engine parameters
        max_model_len: 4096 # Maximum sequence length
        gpu_memory_utilization: 0.5 # GPU memory usage ratio (0.0-1.0)
        enable_lora: true # Support loading LoRA during inference
        logprobs_mode: processed_logprobs  # Logprobs output mode
      device_group: # Logical device group
        name: sampler
        ranks: 1 # Number of GPUs to use
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 1
      queue_config:
        rps_limit: 100 # Max requests per second
        tps_limit: 100000 # Max tokens per second
    deployments:
      - name: SamplerManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1
          runtime_env:
            env_vars:
              TWINKLE_TRUST_REMOTE_CODE: "0"

  # 4. Processor service: Data preprocessing
  # Executes tokenization, template conversion, and other preprocessing tasks on CPU
  - name: processor
    route_prefix: /api/v1/processor
    import_path: processor
    args:
      ncpu_proc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: CPU
      device_mesh:
        device_type: CPU
        dp_size: 2
    deployments:
      - name: ProcessorManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 128
        ray_actor_options:
          num_cpus: 0.1

Transformers Backend

Compared with the Megatron configuration above, the key change is the backend parameter of the Model service:

  - name: models-Qwen3.5-4B
    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
    import_path: model
    args:
      backend: transformers  # Use Transformers backend
      model_id: "ms://Qwen/Qwen3.5-4B"
      nproc_per_node: 2
      device_group:
        name: model
        ranks: 2
        device_type: cuda
      device_mesh:
        device_type: cuda
        dp_size: 2
      adapter_config:
        adapter_timeout: 1800 # Idle adapter timeout unload time (seconds)
        adapter_max_lifetime: 36000
    deployments:
      - name: ModelManagement
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
          target_ongoing_requests: 16
        ray_actor_options:
          num_cpus: 0.1

Configuration Item Description

Top-Level Fields

Field	Description
`proxy_location`	HTTP proxy location (`EveryNode` or `HeadOnly`)
`http_options`	HTTP listener config (`host`, `port`)
`telemetry`	Observability config (`enabled`, `otlp_endpoint`)
`persistence`	State persistence config (`mode`, `file_path`, `redis_url`)
`applications`	Application component list

The config file uses strict validation (extra='forbid'). Any misspelled field name will be rejected before startup. Use twinkle-server check-config -c xxx.yaml to detect errors early.

Application Components (import_path)

import_path	Description
`server`	Central management service, handles training runs and checkpoints
`model`	Model service, hosts base model for training
`processor`	Data preprocessing service, executes tokenization and template conversion on CPU
`sampler`	Inference sampling service

Model Backend (backend)

backend	Description
`transformers`	Based on HuggingFace Transformers, suitable for most scenarios
`megatron`	Based on Megatron-LM, suitable for ultra-large-scale model training
`mock`	Numpy-only mock backend for CPU-only development and testing

device_group and device_mesh

device_group: Defines logical device groups, specifying how many GPUs to use
device_mesh: Defines distributed training mesh, controls parallelization strategy

device_group:
  name: model  # Device group name
  ranks: 2 # Number of GPUs to use
  device_type: cuda # Device type: cuda / CPU

device_mesh:
  device_type: cuda
  dp_size: 2 # Data parallel size
  # tp_size: 1 # Tensor parallel size (optional)
  # pp_size: 1 # Pipeline parallel size (optional)
  # ep_size: 1 # Expert parallel size (optional)

Important configuration parameters:

Parameter	Type	Description
`ranks`	int	Number of GPUs to use for this component
`dp_size`	int	Data parallel size
`tp_size`	int (optional)	Tensor parallel size
`pp_size`	int (optional)	Pipeline parallel size
`ep_size`	int (optional)	Expert parallel size (for MoE models)
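
In the GPU examples in this document, ranks equals dp_size. A common convention in distributed training is that the product of the data, tensor, and pipeline parallel sizes equals the number of devices. This document does not state that Twinkle enforces this rule, so treat the sketch below as an assumption-based sanity check. The helper names are hypothetical, and `ep_size` is deliberately left out because expert parallelism is usually carved out of existing ranks (also an assumption here).

```python
def mesh_world_size(dp_size=1, tp_size=1, pp_size=1):
    """Number of devices implied by a mesh, assuming world size = dp * tp * pp."""
    return dp_size * tp_size * pp_size


def check_mesh(ranks, **mesh):
    """Raise ValueError if the mesh does not cover exactly `ranks` devices (assumed rule)."""
    expected = mesh_world_size(**mesh)
    if expected != ranks:
        raise ValueError(f"device_mesh implies {expected} devices, but ranks is {ranks}")
    return expected


# Mirrors the example above: ranks: 2 with dp_size: 2
print(check_mesh(2, dp_size=2))
# A hypothetical 8-GPU layout combining all three strategies
print(check_mesh(8, dp_size=2, tp_size=2, pp_size=2))
```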

telemetry

Controls the OpenTelemetry observability pipeline. See the Observability section for details.

Field	Type	Default	Description
`enabled`	bool	`false`	Whether to enable telemetry
`service_name`	str	`twinkle-server`	Reported service name
`otlp_endpoint`	str	`http://localhost:4317`	OTel Collector gRPC address
`debug`	bool	`false`	When `true`, dumps to console instead of OTLP

persistence

Storage backend for ServerState (sessions, models, futures, etc.).

Field	Type	Default	Description
`mode`	str	`memory`	`memory` / `file` / `redis`
`file_path`	str	—	Required for `file` mode, JSON file path
`redis_url`	str	—	Required for `redis` mode, e.g. `redis://localhost:6379`
`key_prefix`	str	`""`	Optional global key prefix

Environment variables:

export TWINKLE_TRUST_REMOTE_CODE=0 # Whether to trust remote code

Configuration Validation and Migration

The config file uses strict validation. The following scenarios trigger errors before startup:

Misspelled or unsupported field names
Type mismatches (e.g., passing a string for port)
Cross-field constraints not met (e.g., persistence.mode: redis without redis_url)
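
To make the three failure classes concrete, the following standard-library sketch applies them to the persistence section only. It is not Twinkle's validator: the function `validate_persistence` and the error messages are hypothetical, and the field list is taken from the persistence table above.

```python
# Field names and types from the persistence table above
ALLOWED_FIELDS = {"mode": str, "file_path": str, "redis_url": str, "key_prefix": str}
# Cross-field rule: which field each mode requires (None means no extra field)
REQUIRED_BY_MODE = {"memory": None, "file": "file_path", "redis": "redis_url"}


def validate_persistence(section):
    """Return a list of error messages for a persistence section (empty if valid)."""
    errors = []
    for key, value in section.items():
        if key not in ALLOWED_FIELDS:
            # Misspelled or unsupported field name
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, ALLOWED_FIELDS[key]):
            # Type mismatch
            errors.append(f"{key} must be {ALLOWED_FIELDS[key].__name__}")
    mode = section.get("mode", "memory")
    if not isinstance(mode, str) or mode not in REQUIRED_BY_MODE:
        errors.append(f"unsupported mode: {mode}")
    else:
        required = REQUIRED_BY_MODE[mode]
        if required is not None and required not in section:
            # Cross-field constraint not met
            errors.append(f"mode {mode} requires {required}")
    return errors


print(validate_persistence({"mode": "file", "file_path": "/tmp/twinkle_state.json"}))
print(validate_persistence({"mode": "redis"}))
print(validate_persistence({"mod": "file"}))
```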

# Validate only, do not start
twinkle-server check-config -c server_config.yaml

Migrating from old configuration:

Old Field	New Field
`use_megatron: true`	`backend: megatron`
`use_megatron: false`	`backend: transformers`

Additionally, this refactor introduces two new top-level fields — telemetry and persistence — which did not exist before. Add them as needed.
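
The field mapping in the table can be applied mechanically to an already-parsed config. A minimal sketch, assuming the Model service args have been loaded into a dict; the helper name `migrate_model_args` is hypothetical, and keeping an existing backend value when both fields are present is an assumption.

```python
def migrate_model_args(args):
    """Return a copy of a Model service's args with use_megatron replaced by backend."""
    migrated = dict(args)
    if "use_megatron" in migrated:
        use_megatron = migrated.pop("use_megatron")
        # Assumption: an explicit backend value, if already present, is kept
        migrated.setdefault("backend", "megatron" if use_megatron else "transformers")
    return migrated


old_args = {"use_megatron": True, "nproc_per_node": 2}
print(migrate_model_args(old_args))
print(migrate_model_args({"use_megatron": False}))
```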

Observability

Twinkle Server provides full observability through OpenTelemetry, covering traces, metrics, and logs.

Quick Start

1. Start the Observability Stack

The project includes a one-command Docker Compose setup based on the grafana/otel-lgtm image (bundles OTel Collector, Mimir, Tempo, Loki, and Grafana):

cd cookbook/observability
docker compose up -d

Available services after startup:

Service	URL	Purpose
Grafana	`http://localhost:3000`	Dashboards and data exploration
OTLP gRPC	`localhost:4317`	Point Twinkle’s `otlp_endpoint` here
OTLP HTTP	`localhost:4318`	Same, HTTP alternative

2. Configure the Server

Enable telemetry in server_config.yaml:

telemetry:
  enabled: true
  otlp_endpoint: http://localhost:4317

3. Install Dependencies

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

4. Launch the Server

twinkle-server launch -c server_config.yaml

5. Open Grafana

Navigate to http://localhost:3000. Default credentials: admin / admin.

telemetry Configuration Fields

Field	Type	Default	Description
`enabled`	bool	`false`	Whether to enable the telemetry pipeline
`service_name`	str	`twinkle-server`	Reported service name
`otlp_endpoint`	str	`http://localhost:4317`	OTel Collector gRPC address
`debug`	bool	`false`	When `true`, dumps spans/metrics to console instead of OTLP
`export_interval_ms`	int	`30000`	Metrics export interval (milliseconds)
`resource_attributes`	dict	`{}`	Additional resource attributes attached to all telemetry

Built-in Grafana Dashboard

The provisioned Twinkle Server Overview dashboard includes:

HTTP request rate and P95 latency per deployment (Gateway / Model / Sampler / Processor)
Active resource counts (sessions, models, sampling sessions, futures)
Task queue depth, execution P95, wait-time P95
Rate-limit rejections and task completions by status

Metric Naming Reference

Twinkle uses dot-notation OpenTelemetry metric names. Prometheus OTLP ingestion converts dots to underscores and appends _total to monotonic counters:

OpenTelemetry Name	Prometheus Name
`twinkle.http.requests.total`	`twinkle_http_requests_total`
`twinkle.http.request.duration_seconds`	`twinkle_http_request_duration_seconds_bucket`
`twinkle.queue.depth`	`twinkle_queue_depth`
`twinkle.task.execution_seconds`	`twinkle_task_execution_seconds_bucket`
`twinkle.task.wait_seconds`	`twinkle_task_wait_seconds_bucket`
`twinkle.rate_limit.rejections.total`	`twinkle_rate_limit_rejections_total`
`twinkle.tasks.total`	`twinkle_tasks_total`
`twinkle.sessions.active`	`twinkle_sessions_active`
`twinkle.models.active`	`twinkle_models_active`
`twinkle.sampling_sessions.active`	`twinkle_sampling_sessions_active`
`twinkle.futures.active`	`twinkle_futures_active`

The *.active resource gauges report absolute values. Do NOT wrap them with rate() or increase().
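
The mapping in the table can be reproduced with a small function, which is handy when writing dashboard queries. This is a sketch of the pattern visible in the table, not the actual Prometheus conversion code; the `kind` labels are hypothetical. Only the `_bucket` series is shown for histograms, although Prometheus histograms also expose `_sum` and `_count` series.

```python
def prometheus_name(otel_name, kind):
    """Map an OpenTelemetry metric name to the Prometheus name shown in the table.

    kind is one of 'counter', 'gauge', 'histogram' (labels chosen for this sketch).
    """
    name = otel_name.replace(".", "_")
    if kind == "counter" and not name.endswith("_total"):
        # Monotonic counters get a _total suffix unless they already have one
        name += "_total"
    elif kind == "histogram":
        # Histogram queries usually target the _bucket series
        name += "_bucket"
    return name


print(prometheus_name("twinkle.http.requests.total", "counter"))
print(prometheus_name("twinkle.task.wait_seconds", "histogram"))
print(prometheus_name("twinkle.sessions.active", "gauge"))
```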

Tracing

Twinkle spans are namespaced under twinkle.server.<component> (Gateway / Model / Sampler / Processor). Each request carries twinkle.session_id and trace_id correlation keys, supporting end-to-end cross-deployment tracing.

In Grafana, switch the datasource to Tempo to search traces by service name or span name.

Production Deployment

The LGTM all-in-one image in cookbook/observability is for local development and demos only. For production:

Deploy Mimir / Tempo / Loki / Grafana separately with persistent storage and replicas
Place an independent OTel Collector tier in front for sampling and routing
The telemetry config and metric names in server_config.yaml transfer without changes

Troubleshooting

Grafana shows “No data”

Confirm telemetry.enabled: true in your config
Confirm worker logs show Worker telemetry initialized
Set debug: true to verify spans appear in the console, then switch back to debug: false

Twinkle can’t reach the Collector

otlp_endpoint must be reachable from the Twinkle process. If Twinkle runs in a separate container, use the Docker network address, e.g. http://twinkle-lgtm:4317.
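
A quick way to test reachability is a plain TCP connection attempt from the same host or container that runs Twinkle. The sketch below uses only the standard library; the function names are hypothetical, and a successful connection only shows that the port is open, not that the Collector accepts OTLP data.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_port(otlp_endpoint, default_port=4317):
    """Split an endpoint such as http://host:4317 into (host, port)."""
    parts = urlsplit(otlp_endpoint)
    return parts.hostname, parts.port or default_port


def can_connect(otlp_endpoint, timeout=2.0):
    """Return True if a TCP connection to the endpoint succeeds within the timeout."""
    host, port = endpoint_host_port(otlp_endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(endpoint_host_port("http://twinkle-lgtm:4317"))
```

The endpoint must include a scheme (for example `http://`), otherwise `urlsplit` cannot extract the host.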

Resource gauges stuck at 0

Only the cleanup-leader worker pushes resource counts. If gauges remain at 0 for longer than export_interval_ms × 2 after startup, check logs for “became cleanup leader” messages

Tear Down

cd cookbook/observability
docker compose down -v # -v removes the data volume as well

Twinkle Client

Twinkle Client is the native client. Its design goal is that changing from twinkle import to from twinkle_client import is enough to migrate local training code to remote calls, without modifying the original training logic.

Initialization

from twinkle_client import init_twinkle_client

# Initialize client, connect to Twinkle Server
client = init_twinkle_client(
    base_url='http://127.0.0.1:8000', # Server address
    api_key='your-api-key' # Authentication token (can be set via environment variable TWINKLE_SERVER_TOKEN)
)

After initialization, the client object (TwinkleClient) provides the following management functions:

# Health check
client.health_check()

# List current user's training runs
runs = client.list_training_runs(limit=20)

# Get specific training run details
run = client.get_training_run(run_id='xxx')

# List checkpoints
checkpoints = client.list_checkpoints(run_id='xxx')

# Get checkpoint path (for resuming training)
path = client.get_checkpoint_path(run_id='xxx', checkpoint_id='yyy')

# Get latest checkpoint path
latest_path = client.get_latest_checkpoint_path(run_id='xxx')

Migrating from Local Code to Remote

Migration is simple: replace the import path twinkle with twinkle_client.

# Local training code (original)
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset
from twinkle.model import MultiLoraTransformersModel

# Remote training code (after migration)
# DataLoader and Dataset can be imported from either local twinkle or remote twinkle_client
from twinkle.dataloader import DataLoader # or: from twinkle_client.dataloader import DataLoader
from twinkle.dataset import Dataset # or: from twinkle_client.dataset import Dataset
from twinkle_client.model import MultiLoraTransformersModel

Training loops, data processing, and other logic do not need any modifications.

Complete Training Example (Transformers Backend)

import dotenv
dotenv.load_dotenv('.env')

from peft import LoraConfig
from twinkle import get_logger
from twinkle.dataset import DatasetMeta
from twinkle_client import init_twinkle_client

# DataLoader and Dataset can be imported from either local twinkle or remote twinkle_client
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset
from twinkle_client.model import MultiLoraTransformersModel

logger = get_logger()

base_model = 'Qwen/Qwen3.5-4B'
base_url = 'http://localhost:8000'
api_key = 'EMPTY_API_KEY'

# Step 1: Initialize client
client = init_twinkle_client(base_url=base_url, api_key=api_key)

# List available models on the server
print('Available models:')
for item in client.get_server_capabilities().supported_models:
    print('- ' + item.model_name)

# Step 2: Query existing training runs (optional, for resuming training)
runs = client.list_training_runs()
resume_path = None
for run in runs:
    logger.info(run.model_dump_json(indent=2))
    checkpoints = client.list_checkpoints(run.training_run_id)
    for checkpoint in checkpoints:
        logger.info(checkpoint.model_dump_json(indent=2))
        # Uncomment to resume from checkpoint:
        # resume_path = checkpoint.twinkle_path

# Step 3: Prepare dataset
# data_slice limits the number of samples loaded
dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))

# Set chat template to match model's input format
dataset.set_template('Qwen3_5Template', model_id=f'ms://{base_model}', max_length=512)

# Data preprocessing: Replace placeholders with custom names
dataset.map('SelfCognitionProcessor',
            init_args={'model_name': 'twinkle model', 'model_author': 'ModelScope Team'})

# Encode dataset into tokens usable by the model
dataset.encode(batched=True)
# For large datasets, use num_proc to enable multi-process parallelism:
# dataset.encode(batched=True, num_proc=8)
# When using twinkle_client.dataset, encode calls the remote server over HTTP
# with a default 600s timeout; raise it via the timeout argument if needed:
# dataset.encode(batched=True, num_proc=8, timeout=3600)

# Create DataLoader
dataloader = DataLoader(dataset=dataset, batch_size=4)

# Step 4: Configure model
model = MultiLoraTransformersModel(model_id=f'ms://{base_model}')

# Configure LoRA: apply low-rank adapters to all linear layers
lora_config = LoraConfig(target_modules='all-linear')
# gradient_accumulation_steps=2: accumulate gradients over 2 micro-batches before each optimizer step
model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2)

# Set template, processor, loss function
model.set_template('Qwen3_5Template')
model.set_processor('InputProcessor', padding_side='right')
model.set_loss('CrossEntropyLoss')

# Set optimizer (only Adam is supported if the server uses Megatron backend)
model.set_optimizer('Adam', lr=1e-4)

# Set LR scheduler (not supported if the server uses Megatron backend)
# model.set_lr_scheduler('LinearLR')

# Step 5: Resume training (optional)
start_step = 0
if resume_path:
    logger.info(f'Resuming from checkpoint {resume_path}')
    progress = model.resume_from_checkpoint(resume_path)
    dataloader.resume_from_checkpoint(progress['consumed_train_samples'])
    start_step = progress['cur_step']

# Step 6: Training loop
logger.info(model.get_train_configs().model_dump())

for epoch in range(3):
    logger.info(f'Starting epoch {epoch}')
    for cur_step, batch in enumerate(dataloader, start=start_step + 1):
        # Forward propagation + backward propagation
        model.forward_backward(inputs=batch)

        # Gradient clipping + optimizer update (equivalent to calling clip_grad_norm / step / zero_grad / lr_step in sequence)
        model.clip_grad_and_step()

        # Print metric every 2 steps (aligned with gradient_accumulation_steps)
        if cur_step % 2 == 0:
            metric = model.calculate_metric(is_training=True)
            logger.info(f'Current is step {cur_step} of {len(dataloader)}, metric: {metric.result}')

    # Step 7: Save checkpoint
    twinkle_path = model.save(
        name=f'twinkle-epoch-{epoch}',
        save_optimizer=True,
        consumed_train_samples=dataloader.get_state()['consumed_train_samples'],
    )
    logger.info(f'Saved checkpoint: {twinkle_path}')

# Step 8: Upload to ModelScope Hub (optional)
# YOUR_USER_NAME = "your_username"
# hub_model_id = f'{YOUR_USER_NAME}/twinkle-self-cognition'
# model.upload_to_hub(
#     checkpoint_dir=twinkle_path,
#     hub_model_id=hub_model_id,
#     async_upload=False
# )

For checkpoint resumption, the recommended client-side flow is:

1. Query the server for an existing checkpoint path with client.list_checkpoints(...) or client.get_latest_checkpoint_path(...).
2. Call model.resume_from_checkpoint(resume_path) to restore weights, optimizer, scheduler, RNG, and progress metadata.
3. Call dataloader.resume_from_checkpoint(progress['consumed_train_samples']) to skip already-consumed samples.

This matches the end-to-end example in cookbook/client/twinkle/self_host/self_cognition.py.
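
To illustrate what skipping already-consumed samples means in the last step of this flow, here is a toy sketch. `ToyLoader` is a hypothetical stand-in written for this example; it is not Twinkle's DataLoader and only mimics the idea of resuming from a consumed-sample count.

```python
class ToyLoader:
    """Hypothetical stand-in for a resumable dataloader (not Twinkle's DataLoader)."""

    def __init__(self, samples, batch_size):
        self.samples = list(samples)
        self.batch_size = batch_size
        self.consumed = 0  # number of samples already yielded

    def resume_from_checkpoint(self, consumed_train_samples):
        # Start the next iteration after the samples that were already used
        self.consumed = consumed_train_samples

    def __iter__(self):
        while self.consumed < len(self.samples):
            batch = self.samples[self.consumed:self.consumed + self.batch_size]
            self.consumed += len(batch)
            yield batch


# First run: consume one batch, then record progress as a checkpoint would
loader = ToyLoader(range(10), batch_size=4)
first = next(iter(loader))
progress = {"consumed_train_samples": loader.consumed}

# Second run: restore progress and continue with the remaining samples
resumed = ToyLoader(range(10), batch_size=4)
resumed.resume_from_checkpoint(progress["consumed_train_samples"])
remaining = list(resumed)
print(first, remaining)
```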

Differences with Megatron Backend

When using the Megatron backend, the main differences in client code:

# Megatron backend does not need explicit loss setting (computed internally by Megatron)
# model.set_loss('CrossEntropyLoss') # Not needed

# Optimizer and LR scheduler use Megatron built-in defaults
model.set_optimizer('default', lr=1e-4)
model.set_lr_scheduler('default', lr_decay_steps=1000, max_lr=1e-4)

The rest of the data processing, training loop, checkpoint saving, and other code remains exactly the same.

Tinker Client

The Tinker Client is suitable for scenarios with existing Tinker training code. After initializing with init_tinker_client, it patches the Tinker SDK to point to the Twinkle Server, and the rest of the code can directly reuse existing Tinker training code.

Initialization

import os

# Initialize Tinker client before importing ServiceClient
from twinkle import init_tinker_client

init_tinker_client()

# Use ServiceClient directly from tinker
from tinker import ServiceClient

service_client = ServiceClient(
    base_url='http://localhost:8000', # Server address
    api_key=os.environ.get('MODELSCOPE_TOKEN') # Recommended: set to ModelScope Token
)

# Verify connection: List available models on Server
for item in service_client.get_server_capabilities().supported_models:
    print("- " + item.model_name)

What does init_tinker_client do?

When calling init_tinker_client, the following operations are automatically executed:

Patch Tinker SDK: Bypass Tinker’s tinker:// prefix validation, allowing it to connect to standard HTTP addresses
Set Request Headers: Inject necessary authentication headers such as X-Ray-Serve-Request-Id and Authorization

After initialization, simply import ServiceClient from tinker (from tinker import ServiceClient) to connect to the Twinkle Server. Existing Tinker training code can then be used without modification.

Complete Training Example

Note: DataLoader and Dataset in Tinker compatible mode only support local twinkle imports; twinkle_client is not supported.

import os
import numpy as np
from tqdm import tqdm
from tinker import types
from twinkle import init_tinker_client
from twinkle.dataloader import DataLoader
from twinkle.dataset import Dataset, DatasetMeta
from twinkle.preprocessor import SelfCognitionProcessor
from twinkle.server.common import input_feature_to_datum

# Step 1: Initialize Tinker client before importing ServiceClient
init_tinker_client()

from tinker import ServiceClient

base_model = 'Qwen/Qwen3.5-4B'
base_url = 'http://localhost:8000'
api_key = 'EMPTY_API_KEY'

# Step 2: Prepare dataset
dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
dataset.set_template('Qwen3_5Template', model_id=f'ms://{base_model}', max_length=256)
dataset.map(SelfCognitionProcessor('twinkle model', 'ModelScope Team'), load_from_cache_file=False)
dataset.encode(batched=True, load_from_cache_file=False)
dataloader = DataLoader(dataset=dataset, batch_size=8)

# Step 3: Initialize training client
service_client = ServiceClient(base_url=base_url, api_key=api_key)

# Create LoRA training client (rank=16 specifies the LoRA adapter rank)
training_client = service_client.create_lora_training_client(base_model=base_model, rank=16)

# Step 4: Training loop
for epoch in range(3):
    print(f'Epoch {epoch}')
    for step, batch in tqdm(enumerate(dataloader)):
        # Convert Twinkle's InputFeature to Tinker's Datum format
        input_datum = [input_feature_to_datum(input_feature) for input_feature in batch]

        # Send data to Server: forward + backward propagation
        fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy')

        # Optimizer step: update model weights with Adam
        optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

        # Wait for both operations to complete
        fwdbwd_result = fwdbwd_future.result()
        optim_result = optim_future.result()

        # Compute weighted average log-loss per token for monitoring
        logprobs = np.concatenate([output['logprobs'].tolist() for output in fwdbwd_result.loss_fn_outputs])
        weights = np.concatenate([example.loss_fn_inputs['weights'].tolist() for example in input_datum])
        print(f'Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}')
        print(f'Training Metrics: {optim_result}')

    # Save a checkpoint after each epoch
    save_future = training_client.save_state(f'twinkle-lora-{epoch}')
    save_result = save_future.result()
    print(f'Saved checkpoint to {save_result.path}')
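
The per-token loss printed in the training loop is a weighted mean of negative log-probabilities. The standalone sketch below reproduces that arithmetic with made-up numbers so it can be checked by hand; the values, and the reading of a zero weight as a masked-out token, are illustrative assumptions rather than real Server output.

```python
import numpy as np

# Made-up per-token log-probabilities for two sequences (illustrative values only)
logprobs_per_seq = [[-0.5, -1.0], [-2.0, -0.25]]
# Loss weights: 0 excludes a token from the loss (e.g. a prompt token), 1 keeps it
weights_per_seq = [[0.0, 1.0], [1.0, 1.0]]

# Flatten both lists into 1-D arrays, as the training loop does
logprobs = np.concatenate(logprobs_per_seq)
weights = np.concatenate(weights_per_seq)

# Weighted mean of negative log-probabilities over the tokens that count
loss_per_token = -np.dot(logprobs, weights) / weights.sum()
print(f'Loss per token: {loss_per_token:.4f}')
```

Here the first token is excluded, so the loss is (1.0 + 2.0 + 0.25) / 3. Dividing by `weights.sum()` rather than by the token count keeps the value comparable across batches with different amounts of masking.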

Inference Sampling

Tinker-compatible mode also supports inference sampling, provided the Server has a Sampler service configured.

Sampling from Training

After training is complete, you can directly create a sampling client from the training client:

# Save current weights and create sampling client
sampling_client = training_client.save_weights_and_get_sampling_client(name='my-model')

# NOTE: `tokenizer` is not defined in this snippet. It is assumed to be a tokenizer
# matching base_model (the Tinker SDK exposes training_client.get_tokenizer() for this).

# Prepare inference input
prompt = types.ModelInput.from_ints(tokenizer.encode("English: coffee break\nPig Latin:"))
params = types.SamplingParams(
    max_tokens=20,    # Maximum number of tokens to generate
    temperature=0.0,  # Greedy sampling (deterministic output)
    stop=["\n"]       # Stop when encountering newline
)

# Generate multiple completions
result = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=8).result()

for i, seq in enumerate(result.sequences):
    print(f"{i}: {tokenizer.decode(seq.tokens)}")
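
To make the `temperature` parameter concrete, here is a generic sketch of temperature-scaled sampling over a handful of logits. It is not Twinkle's sampler: `sample_token` is a hypothetical helper, and treating `temperature == 0.0` as plain argmax is an assumption about how greedy decoding is usually implemented.

```python
import numpy as np


def sample_token(logits, temperature, rng):
    """Pick one token id from raw logits at the given temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]

# Eight draws at temperature 0 always return the same token id
greedy = [sample_token(logits, 0.0, rng) for _ in range(8)]
# Eight draws at temperature 1 can differ from one another
warm = [sample_token(logits, 1.0, rng) for _ in range(8)]
print(greedy, warm)
```

Under this reading, requesting `num_samples=8` with `temperature=0.0` yields eight identical completions; a small positive temperature is what makes multiple samples differ.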

Sampling from Checkpoint

You can also load saved checkpoints for inference:

import os
from tinker import types
from twinkle import init_tinker_client
from twinkle.data_format import Message, Trajectory
from twinkle.template import Template

# Initialize Tinker client before importing ServiceClient
init_tinker_client()

from tinker import ServiceClient

base_model = 'Qwen/Qwen3.5-4B'
base_url = 'http://localhost:8000'
api_key = 'EMPTY_API_KEY'

service_client = ServiceClient(base_url=base_url, api_key=api_key)

# Create sampling client from saved checkpoint
sampling_client = service_client.create_sampling_client(
    model_path='twinkle://run_id/weights/checkpoint_name',  # twinkle:// path of the checkpoint
    base_model=base_model
)

# Use Twinkle's Template to build multi-turn dialogue input
template = Template(model_id=f'ms://{base_model}')

trajectory = Trajectory(
    messages=[
        Message(role='system', content='You are a helpful assistant'),
        Message(role='user', content='What is your name?'),
    ]
)

input_feature = template.batch_encode([trajectory], add_generation_prompt=True)[0]
input_ids = input_feature['input_ids'].tolist()

prompt = types.ModelInput.from_ints(input_ids)
params = types.SamplingParams(
    max_tokens=50,    # Maximum number of tokens to generate
    temperature=0.2,  # Low temperature, more focused answers
)

# Generate multiple completions
print('Sampling...')
future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=8)
result = future.result()

# Decode and print each response
print('Responses:')
for i, seq in enumerate(result.sequences):
    print(f'{i}: {repr(template.decode(seq.tokens))}')

Publishing Checkpoint to ModelScope Hub

After training is complete, you can publish checkpoints to ModelScope Hub through the REST client:

rest_client = service_client.create_rest_client()

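# Publish a checkpoint identified by its tinker path.
# save_result is the result returned by training_client.save_state(...) in the training example.
# The ServiceClient must have been initialized with a valid ModelScope token as api_key.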
rest_client.publish_checkpoint_from_tinker_path(save_result.path).result()
print("Published checkpoint to ModelScope Hub")
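
Because publishing requires a real ModelScope token as `api_key`, one option is to read the token from the environment instead of hard-coding it. The sketch below is only one way to do that: `resolve_api_key` is a hypothetical helper and `MODELSCOPE_API_TOKEN` is a variable name chosen for this example, not one that Twinkle is known to read.

```python
import os


def resolve_api_key(env=os.environ, var_name='MODELSCOPE_API_TOKEN', default='EMPTY_API_KEY'):
    """Return the token from the environment, or the placeholder when it is unset or empty."""
    # var_name is a hypothetical variable name used only in this sketch
    return env.get(var_name) or default


# Falls back to the placeholder used in the examples when no token is exported
api_key = resolve_api_key()
```

The resulting `api_key` would then be passed to `ServiceClient(base_url=base_url, api_key=api_key)` exactly as in the examples above.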