This guide explains how to launch the AgentJet Swarm Server inside a Docker container. The Swarm Server is the GPU-side component responsible for gradient computation and weight updates. It exposes an OpenAI-compatible API that Swarm Clients connect to for training.

Not familiar with Swarm? Read the Swarm Introduction first.

Prerequisites

| Requirement | Detail |
| --- | --- |
| Docker | With GPU support (nvidia-container-toolkit) |
| AgentJet Docker image | ghcr.io/modelscope/agentjet:main (built from the AgentJet repository) |
| LLM model weights | Downloaded locally (e.g., Qwen2.5-7B-Instruct) |
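
Two of these prerequisites can be checked from a script before launching. The sketch below is a hypothetical helper, not part of AgentJet: the name `preflight` and its messages are illustrative. It only confirms that a `docker` executable is on the PATH and that the model directory exists; it does not verify GPU support or that the image has been pulled.

```python
import os
import shutil


def preflight(model_dir, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration only. `which` is injectable so
    the check can be exercised on a machine without Docker installed.
    """
    problems = []
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    if not os.path.isdir(model_dir):
        problems.append(f"model directory does not exist: {model_dir}")
    return problems


if __name__ == "__main__":
    # Placeholder path from the command template; replace it with your own.
    for problem in preflight("/path/to/host/Qwen/Qwen2.5-7B-Instruct"):
        print("WARNING:", problem)
```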

Command Template

Run the command below:

docker run --rm -it \
  -v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct \
  -v ./swarmlog:/workspace/log \
  -v ./swarmexp:/workspace/saved_experiments \
  -p 10086:10086 \
  --gpus=all \
  --shm-size=32GB \
  ghcr.io/modelscope/agentjet:main \
  bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"

When startup completes, the ajet-swarm overwatch TUI appears in your terminal, which indicates that the deployment succeeded.




| Flag / Argument | What it does |
| --- | --- |
| --rm | Automatically removes the container when it exits, so no stopped containers accumulate. |
| -it | Allocates an interactive TTY. Required for the ajet-swarm overwatch TUI monitor to render correctly inside the container. |
| -v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct | Model mount: mounts your local model weights directory into the container. The path inside the container must match the model field you configure in your training job. |
| -v ./swarmlog:/workspace/log | Log mount: mounts a local ./swarmlog directory to persist server logs outside the container. The VERL training log is written here. |
| -v ./swarmexp:/workspace/saved_experiments | Experiment mount: mounts a local ./swarmexp directory at /workspace/saved_experiments inside the container. |
| -p 10086:10086 | Port mapping: exposes port 10086 so that Swarm Clients on other machines can reach the server via http://<server-ip>:10086. |
| --gpus=all | Makes all host GPUs available inside the container (requires nvidia-container-toolkit). |
| --shm-size=32GB | Sets the size of shared memory (/dev/shm) inside the container to 32 GB. |
| ghcr.io/modelscope/agentjet:main | The AgentJet Docker image. |
| bash -c "..." | Runs two processes concurrently inside the container (see below). |
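
If you launch the server from a script, the flags in this table can be assembled programmatically. The following is a minimal sketch, assuming you only want to vary the model paths, the host directories, and the host-side port. `build_swarm_command` is a hypothetical helper name; only the flags, image name, and container paths come from this guide. It builds the argument list and prints it without executing anything.

```python
import shlex


def build_swarm_command(host_model_dir, container_model_dir,
                        log_dir="./swarmlog", exp_dir="./swarmexp",
                        host_port=10086,
                        image="ghcr.io/modelscope/agentjet:main"):
    """Assemble the `docker run` argv from the command template.

    Hypothetical helper: only the flags and paths come from this guide.
    """
    inner = (
        "(ajet-swarm overwatch) & "
        "(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start "
        "&>/workspace/log/swarm_server.log)"
    )
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{host_model_dir}:{container_model_dir}",
        "-v", f"{log_dir}:/workspace/log",
        "-v", f"{exp_dir}:/workspace/saved_experiments",
        # The server listens on 10086 inside the container.
        "-p", f"{host_port}:10086",
        "--gpus=all",
        "--shm-size=32GB",
        image,
        "bash", "-c", inner,
    ]


if __name__ == "__main__":
    argv = build_swarm_command(
        "/path/to/host/Qwen/Qwen2.5-7B-Instruct",
        "/Qwen/Qwen2.5-7B-Instruct",
    )
    # Print a copy-pasteable command line; nothing is executed here.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start the container; keeping the command as a list avoids shell-quoting problems with the `bash -c` argument.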




The Two Processes Inside bash -c

The command runs two processes concurrently, joined by &:

(ajet-swarm overwatch) &
(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)

| Part | What it does |
| --- | --- |
| ajet-swarm overwatch | Starts the real-time TUI monitor, which is what you see in the terminal. Displays the current server state (OFFLINE / BOOTING / ROLLING / WEIGHT_SYNCING), active episodes, and rollout statistics. |
| ajet-swarm start | Starts the Swarm Server itself: initializes the VERL training loop, the vLLM inference engine, and the FastAPI HTTP server on port 10086. |
| NO_COLOR=1 LOGURU_COLORIZE=NO | Disables ANSI color codes in the server output so that swarm_server.log is readable as plain text. |
| &>/workspace/log/swarm_server.log | Redirects both stdout and stderr of the server process to the log file, which is persisted to your host machine via the log mount. |




Concrete Example

The following example assumes the model was downloaded to the host directory /root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct and mounts it at the container directory /mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct:

docker run --rm -it \
  -v /root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct \
  -v ./swarmlog:/workspace/log \
  -v ./swarmexp:/workspace/saved_experiments \
  -p 10086:10086 \
  --gpus=all \
  --shm-size=32GB \
  ghcr.io/modelscope/agentjet:main \
  bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"

Make sure the container-side path matches whatever model path you specify in your AgentJetJob.

What Happens After Launch

Once the container starts, you will see the ajet-swarm overwatch TUI in your terminal. The server begins in OFFLINE state and transitions through:

OFFLINE → BOOTING → ROLLING → WEIGHT_SYNCING → ROLLING → ...

The server only moves to BOOTING after a Swarm Client sends it a training configuration and calls start_engine(). Until then it waits safely in OFFLINE.
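
The state cycle described above can be written down as a small transition table, which is handy if you parse states out of logs or monitor output and want to flag anything unexpected. This is an illustrative model only, not AgentJet's implementation: `ServerState`, `ALLOWED`, and `is_valid_sequence` are hypothetical names, and the real server may permit transitions that this guide does not list.

```python
from enum import Enum


class ServerState(Enum):
    OFFLINE = "OFFLINE"
    BOOTING = "BOOTING"
    ROLLING = "ROLLING"
    WEIGHT_SYNCING = "WEIGHT_SYNCING"


# Only the transitions listed in this guide. The real server may allow
# others (for example, returning to OFFLINE); those are not modeled here.
ALLOWED = {
    ServerState.OFFLINE: {ServerState.BOOTING},
    ServerState.BOOTING: {ServerState.ROLLING},
    ServerState.ROLLING: {ServerState.WEIGHT_SYNCING},
    ServerState.WEIGHT_SYNCING: {ServerState.ROLLING},
}


def is_valid_sequence(states):
    """Check that each consecutive pair of states is a documented transition."""
    return all(b in ALLOWED[a] for a, b in zip(states, states[1:]))
```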

Meanwhile, all VERL and training logs stream into ./swarmlog/swarm_server.log on your host machine.
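
Because the TUI occupies the terminal, the log file is the place to look for details. A minimal sketch for reading the last few lines from the host side follows; `tail` is a hypothetical helper, and the only thing taken from this guide is the log path.

```python
import os
from collections import deque


def tail(path, n=20):
    """Return the last `n` lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the final `n` lines while streaming.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    log_path = "./swarmlog/swarm_server.log"  # host side of the log mount
    if os.path.exists(log_path):
        print("\n".join(tail(log_path)))
    else:
        print(f"{log_path} not found; has the container been started?")
```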

Connecting a Swarm Client

From any machine (no GPU required) that can reach the server on port 10086, run your Swarm Client:

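from ajet.tuner_lib.experimental.as_swarm_client import SwarmClient
from ajet.copilot.job import AgentJetJob

# Replace <server-ip> with the address of the machine running the container.
swarm_worker = SwarmClient("http://<server-ip>:10086")

# Sends the training configuration and starts the engine, which moves the
# server out of the OFFLINE state.
swarm_worker.auto_sync_train_config_and_start_engine(
    AgentJetJob(
        algorithm="grpo",
        n_gpu=8,
        # Container-side model path (right-hand side of the -v mount).
        model="/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct",
        batch_size=32,
        num_repeat=4,
    )
)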

The model path here must be the container-side path (right-hand side of the -v mount), not the host path.
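
To avoid mixing up the two sides of the mount, the container-side path can be derived from the `-v` value itself. The sketch below is a hypothetical helper (`to_container_path` is not part of AgentJet or Docker) and assumes POSIX paths that contain no ":" characters.

```python
import posixpath


def to_container_path(mount_spec, host_path):
    """Translate a host path into the path seen inside the container.

    `mount_spec` is a `-v` value of the form "host_dir:container_dir".
    Hypothetical helper; any third ":options" field is ignored.
    """
    host_dir, container_dir = mount_spec.split(":")[:2]
    host_dir = host_dir.rstrip("/")
    host_path = host_path.rstrip("/")
    if host_path != host_dir and not host_path.startswith(host_dir + "/"):
        raise ValueError(f"{host_path} is not under {host_dir}")
    rel = host_path[len(host_dir):].lstrip("/")
    return posixpath.join(container_dir, rel) if rel else container_dir


if __name__ == "__main__":
    mount = (
        "/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct:"
        "/mnt/data_cpfs/model_cache/modelscope/hub/Qwen/Qwen/Qwen2.5-7B-Instruct"
    )
    # Prints the value to use for the model field in AgentJetJob.
    print(to_container_path(
        mount, "/root/agentjet/modelscope_cache/Qwen/Qwen2___5-7B-Instruct"))
```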

See Swarm Best Practices for full client examples.

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Server stays OFFLINE forever | No client has called start_engine() | Run your Swarm Client script to send the training config |
| Model not found error in log | Container-side model path is wrong | Verify the right-hand side of your -v flag matches the model field in AgentJetJob |
| Client cannot connect to port 10086 | Firewall or wrong IP | Check server firewall rules; use ajet-swarm overwatch --swarm-url=http://<ip>:10086 to test connectivity |
| Log file is empty | ./swarmlog directory doesn't exist on host | Create it first: mkdir -p ./swarmlog |
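
For the connectivity symptom, a plain TCP check from the client machine can separate network problems from server problems. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, and a successful connection only shows that the port is reachable, not that the Swarm Server is healthy.

```python
import socket


def can_connect(host, port=10086, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; use your server's IP address instead.
    print("reachable" if can_connect("127.0.0.1") else "not reachable")
```

If this reports the port as unreachable while the container is running, check the -p mapping and the firewall before looking at the server log.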