This article is translated from the original Chinese version.
TLDR: Tongyi open-sources AgentJet (AJet), a new-generation, cutting-edge, easy-to-use Agentic Reinforcement Learning framework. AgentJet features fully distributed Swarm Training, completely decoupling training from inference. It significantly simplifies training for single-agent and multi-agent LLM systems and enables more efficient training of complex multi-agent systems.
In AgentJet, researchers can connect multiple different LLM models into one multi-agent RL training task with very little code, achieving true non-shared-parameter Multi-Agent Reinforcement Learning (MARL). They can also run the agents participating in training on any device (such as a laptop), and dynamically add, remove, or modify agent Rollout nodes at any time. The result is a swarm training network that is unconstrained by its environment, allows bug fixes on the fly, and self-heals from external environment crashes. AgentJet is fully open-source, rich in examples, ready to use out of the box, and open to community contribution. It ships with token-level tracing debug tools and a version-by-version training performance tracking platform, and provides ready-made skills (SKILLs) for Vibe Coding developers, so tools like Claude Code can assist with your agent orchestration and training debugging in one click.
The Dilemma of Centralized Agentic LLM RL Architectures
Over the past year (2025), we witnessed the rapid development of Large Language Model agents. But as LLM agents and their supporting tools and runtimes grow more complex, both agent developers and frontier LLM reinforcement learning researchers run into all kinds of frustrating problems:
- Just as you were about to celebrate the first success of Agent training, an external API quota was unexpectedly exhausted and the run aborted.
- You made a trivial change to the reward, yet had to wait forever for training to restart, losing all progress since the last checkpoint.
- An Agent requires Docker as its runtime, but insufficient permissions keep you from starting containers, so you spend hours patching the Agent source to work around it.
- MCP tools fail (a browser MCP tool blocked by IP, a database MCP tool failing because the disk unexpectedly filled up).
- Debugging the Agent over a remote server connection is inconvenient. How nice it would be to run the Agent on your own laptop and have it participate directly in (full-parameter) Agent RL training.
When so much energy is wasted on keeping the Agent runtime stable, it becomes ever harder to make "bold" algorithmic attempts within the constraints of existing frameworks:
- Why can't we train models of different sizes simultaneously in multi-agent tasks, i.e., non-parameter-sharing multi-agent RL?
- Could a smaller model perform better if, at every gradient step, it simultaneously learned several completely different Agent workflows, or even tasks from entirely different domains?
- Why is there so little research that trains directly with complex Agents and heavyweight runtimes like opencode?
Where there are difficulties, there are solutions. Over the past year, VERL solved training/inference GPU efficiency, Agent-Lightning solved flexible access for custom agents, and Tinker proposed a semi-distributed, decoupled architecture (unfortunately limited to training LoRA models). Building on the architecture of these projects, the Tongyi EconML team has taken a critical step forward with AgentJet: a brand-new multi-agent LLM swarm distributed training mode. In this framework, any number of swarm-server nodes can host any number of models (e.g., 7B + 14B + 32B), providing vLLM (or SGLang) inference plus policy gradient updates, while any number of swarm-client nodes can host any Agent workflow and any Agent runtime.
AgentJet Swarm: The First Open-Source Swarm Distributed LLM Agent Training Framework
Previous Agentic RL training modes carried some implicit assumptions:
- First, no matter how many agents a task contains, they can only share the same fine-tunable LLM model (a shared "brain"). This is because most training backends, VERL and TRL included, typically configure only one LLM model for fine-tuning.
- Second, in the reinforcement learning sample collection stage, current training frameworks forcibly bind the agent Rollout process to the trainer: all tasks must be initiated by a single training backend, use a single model for inference, traverse tasks from the same dataset, and live inside the same operating system environment.
AgentJet Swarm pioneers a brand new distributed swarm training framework. In this framework, the entire training system consists of several nodes, divided into two categories: Swarm Server and Swarm Client:
- Swarm Server: Runs on a GPU server (or cluster), loads the LLM policy parameters being trained, maintains the training/inference CoLocate environment, provides vLLM/SGLang API interfaces (with automatic context tracking & timeline merging capabilities), and executes policy gradient calculations.
- Swarm Client: Runs on any device, reads datasets, runs reinforcement learning sampling tasks, and finally returns reward signals to the Swarm Server. It can also remotely control the Swarm Server at any time to update its training parameters, remotely start, stop, or restart training at will.
To visualize the difference between the two training modes, compare the training backend (like VERL) to an "aircraft carrier" with ample computing power but carrying only one model, and the RL inference sampling processes to "fighter jets".
- In the traditional mode, the scheduling of these "fighter jets" is welded to the "mother ship" that launched them: they cannot use the models or computing power of other "mother ships", and must be recovered by their own "mother ship" after each mission. As "carrier-based aircraft", the "weight" of their runtime is also constrained by the "mother ship", inevitably forcing you to cut the foot to fit the shoe and sink a lot of time into modifying the MCP and agent runtime environments. Worse, this "mother ship"-centric sampling is fragile: once the external environment changes (external API failure, IP rate limiting, a full disk) or internal parameters need adjusting (reward coefficients, task difficulty coefficients), the entire training run must be terminated and retried, losing all unsaved progress. In 2026, as agents grow more complex, the trouble this brings to complex training tasks is clearly unbearable.
- In contrast, AgentJet's swarm framework creates a new way to train. For one training task, researchers can deploy as many "mother ships" as needed to carry multiple LLM models, then launch "fighter jets" from any platform (workstations, servers, even your MacBook; there are no restrictions on hardware, operating system, dependencies, or programming language, as long as it can send HTTP requests) to perform RL sampling. These sampling "fighter jets" can freely use the models and compute of every "mother ship" in the swarm, and can join or leave the training task at any time. Researchers can even designate one "fighter jet" as a "super commander" that remotely controls all "aircraft carriers" in the swarm and transmits training parameters, forming a many-to-many, scalable, and flexible training system.
Next, let's use a few simple cases to demonstrate the advantages of the AgentJet swarm mode.
Flexible Swarm Training Mode
Full Parameter Training of Agentic LLM Models on a Laptop
Yes, in AgentJet swarm mode, your laptop can genuinely become a Swarm Client. Imagine this scenario: your team has deployed a Swarm Server on a remote GPU cluster, hosting a Qwen-32B model. Now open your laptop, write the Agent loop you want to train, specify the dataset path, model path, and reward function, and debugging and training can begin.
Your laptop (or workstation, Alibaba Cloud ECS, etc., no GPU required) is only responsible for orchestrating the Agent workflow: reading datasets, calling the remote Swarm Server's inference interface (base_url + api_key) to get model output, executing tool calls, computing rewards, and sending the results back to the Swarm Server.
On the other hand, all the heavy lifting (model inference, gradient calculation, parameter update) is completed by the remote GPU cluster.
What does this mean? Agent developers and LLM researchers no longer need to draw a hard boundary between "inference" and "training", nor struggle to debug workflows inside a dedicated training pipeline. You can write and modify Agent logic locally in your favorite IDE, and change agent code or reward parameters at any time without terminating training. For example, to modify the reward, just edit the code, kill the running Swarm Client process, and restart it. (The Swarm Server automatically cleans up any data debris left by the previous Swarm Client.)
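To make the division of labor concrete, here is a minimal sketch of what a Swarm Client's rollout step could look like. This is an illustration, not AgentJet's actual API: the payload shape is that of an OpenAI-compatible chat call, the model call is injected as a plain function (in practice it would be an HTTP request to the Swarm Server's base_url with the api_key), and the returned dict stands in for whatever report format the Server expects.

```python
from typing import Callable

def run_rollout(task: dict,
                call_model: Callable[[list], str],
                reward_fn: Callable[[dict, str], float]) -> dict:
    """One RL sampling step on the client: build the prompt, query the
    policy, score the answer, and package the result for the Server."""
    messages = [
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": task["prompt"]},
    ]
    answer = call_model(messages)       # normally an OpenAI-compatible HTTP call
    reward = reward_fn(task, answer)
    # Hypothetical payload the client would POST back to the Swarm Server.
    return {"task_id": task["id"], "messages": messages,
            "answer": answer, "reward": reward}

# Stub standing in for the remote Qwen-32B behind base_url + api_key.
def fake_model(messages):
    return "4"

result = run_rollout(
    {"id": "t1", "prompt": "What is 2 + 2?"},
    call_model=fake_model,
    reward_fn=lambda task, ans: 1.0 if ans.strip() == "4" else 0.0,
)
print(result["reward"])  # 1.0
```

Because the client holds no model weights, everything above runs on a CPU; swapping `fake_model` for a real HTTP call is the only change needed to train against a remote cluster.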
Because AgentJet swarm mode gives instant feedback on agent code and reward modifications within the training system, you can even let advanced coding assistants like Claude Code or Cursor take over the entire Agent-loop writing + debugging + training process, issuing HTTP commands to remotely adjust Swarm Server training parameters.
Although the two are essentially different, a comparison with Tinker in the reinforcement learning space is instructive. Being fully open-source and open, AgentJet is more controllable and flexible in this area.
| Feature | Tinker | AgentJet-Swarm |
|---|---|---|
| Open Source Nature | ❌ Closed Source | ✅ Open Source & Free |
| Pricing Model | Paid Service | ✅ Completely Free |
| Task | Various LLM Training | Specialized in Agent RL Training |
| Architecture Mode | Managed Service + Single Point Client API | ✅ Both Server and Client are scalable on demand |
| Multi-Client Participation | ❌ Not Supported | ✅ Supported |
| Training Method | LoRA Fine-tuning only | ✅ Full LLM Model Training |
| Max Model Scale | Llama 70B, Qwen 235B | ✅ Depends on user GPU cluster config |
| Communication Protocol | Proprietary API | ✅ Proprietary API + OpenAI Compatible API |
| Inference Engine Backend | Built-in (undisclosed) inference service | ✅ vLLM/SGLang optional |
External Runtime Crash? Fix it and Continue, Don't Waste a Second
This is one of the most tangible engineering benefits of the AgentJet swarm architecture. Training crashes caused by unstable external factors have become a collective memory for many Agent reinforcement learning researchers. In traditional centralized training frameworks, the Agent runtime and the training loop are tightly coupled. Once an external dependency fails (a browser MCP tool gets IP-banned by the target website, a code-sandbox Docker container is killed by OOM, or merely a third-party API rate limit is triggered), the whole training process can crash. Then you reload from the last checkpoint, lose all unsaved rollout data, and pray for better luck next time.
AgentJet Swarm cures this problem from the architectural level. Since Swarm Client and Swarm Server are completely decoupled independent processes, the crash of a Client is just "one less data provider" for the Server. The Server will continue to wait for data from other Clients, or patiently wait for the faulty Client to recover.
Specifically:
- Client crash: rollout samples already collected by the Server are not lost; they sit safely in the Server's sample buffer. Fix the problem (change the IP, restart Docker, top up the API balance), restart the Client, and it automatically resumes submitting new rollouts from where it left off.
- Partial task failure: even if some tasks in a batch fail due to runtime errors, AgentJet gracefully skips the failed samples and continues the gradient update with the successfully completed ones, wasting no effective computation.
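The graceful-skip behavior can be pictured as a simple filter over the batch before the gradient step. The snippet below is a toy illustration of that idea, not AgentJet's internal implementation; the `status` field and `min_samples` threshold are invented for the example.

```python
def usable_batch(rollouts: list, min_samples: int = 1):
    """Keep only rollouts that completed successfully so the gradient
    update proceeds on the survivors instead of aborting the step."""
    ok = [r for r in rollouts if r.get("status") == "ok" and "reward" in r]
    if len(ok) < min_samples:
        return None  # too few survivors: keep waiting for other clients
    return ok

batch = [
    {"status": "ok", "reward": 1.0},
    {"status": "error", "error": "docker OOM"},      # runtime failure
    {"status": "ok", "reward": 0.0},
    {"status": "error", "error": "API rate limit"},  # external failure
]
survivors = usable_batch(batch)
print(len(survivors))  # 2
```

The failed samples carry their error messages back to the Server's logs, while the two successful rollouts still contribute to the update.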
For Agent training tasks that rely on complex external environments (web browsing, terminal operations, database interactions), this fault tolerance is not icing on the cake; it is a necessity.
Fixing Workflow BUG? Debugging Rewards? Get Traceback in 10 Seconds
AgentJet truly unifies training, inference, and debugging. Under traditional frameworks, debugging a reward function for an Agent workflow is frustrating. You modify one line of reward logic, and then you need to: restart the entire training script -> wait for model loading (tens of seconds to minutes) -> wait for vLLM engine initialization -> wait for the first rollout to complete -> finally see the error message. The whole cycle can take 5-10 minutes, when the mistake might be nothing more than a bad indent.
In swarm mode, this pain point is completely eliminated. Because the Swarm Client is a lightweight pure CPU process, it does not need to load any model weights, and the startup time is in seconds. Your debugging cycle becomes:
- Modify workflow code or reward function in IDE (VS Code, Cursor, etc.)
- Restart Swarm Client (about 2-3 seconds)
- Client immediately connects to the already running Swarm Server and starts executing new rollouts
- See results or traceback within seconds
This means you can develop Agent training processes like any ordinary Python project: set breakpoints, inspect variables, step through code. The entire Client side is plain Python, with no Ray or any other distributed-training "black magic". AI coding assistants like Cursor and Claude Code can also join your Agent training development directly, using the Agent's instant output feedback to fix bugs automatically.
Multi-Task Cocktail Training: Need to RL train 40% Math Tasks, 30% Code Tasks, 30% Terminal Tasks simultaneously, with completely different Runtimes? No Problem!
Multi-task mixed training is a key means to improve model generalization capabilities, but it is fraught with difficulties in practice. Math tasks require a symbolic calculation verifier, code tasks require a safe Docker sandbox, and terminal tasks require a complete Linux environment and file system - the dependencies, permission requirements, and security policies of these three runtimes are completely different. Stuffing them into the same training process is troublesome and unsafe.
AgentJet swarm mode solves this naturally. Deploy one Swarm Server to host the target model, then start multiple Swarm Clients on different machines (even in different network environments), each responsible for one task type. You can then use AgentJet's "throttler" to adjust the mix of tasks, or customize the training logic and adjust the ratio dynamically during training. Each Client runs independently, fails independently, and does not interfere with the others.
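As a sketch of what such a ratio controller might do (the function and its interface are invented for illustration; AgentJet's actual "throttler" may work differently), here is a weighted task sampler that realizes a 40/30/30 math/code/terminal mix:

```python
import random

def make_task_scheduler(ratios: dict, seed: int = 0):
    """Return a sampler that draws task types according to the given
    mix, e.g. 40% math / 30% code / 30% terminal. The ratios dict can
    be swapped out mid-training to re-weight the cocktail."""
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[n] for n in names]
    def next_task_type():
        return rng.choices(names, weights=weights, k=1)[0]
    return next_task_type

next_task = make_task_scheduler({"math": 0.4, "code": 0.3, "terminal": 0.3})
draws = [next_task() for _ in range(10_000)]
# Roughly 40% of the draws are math tasks.
print(draws.count("math") / len(draws))
```

In the swarm setting, each draw decides which Client's queue the next task slot is granted to, so re-weighting the dict is all it takes to shift the training mix on the fly.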
This architecture also brings an additional benefit: Resource Isolation. Code sandbox needs Docker permissions? Configure it on Machine B, it won't affect other Client machines. Browser MCP tools need special network proxies? Configure only on the corresponding Client machine. The security boundaries and resource requirements of different tasks are naturally isolated.
One Workflow, Multiple Models: Training a single Agent workflow with two heterogeneous models together? No problem: define the reward function and start immediately!
Multi-agent collaboration is one of the frontier directions of Agent research, but existing frameworks almost always assume that all Agents share the same underlying model. This assumption is unreasonable in many scenarios: an Agent responsible for high-level planning may need a 32B large model to ensure reasoning quality, while an Agent responsible for specific execution may be sufficient with a 7B small model.
AgentJet Swarm natively supports a multi-Server, multi-model training topology. You can start several Swarm Servers across GPU machines, each hosting a model of a different size, and use one Swarm Client to orchestrate their collaboration.
In the workflow, the Client routes different inference requests to different Servers by role: the planning Agent's conversation history goes to the 32B model, the execution Agent's to the 7B model. Each model collects its own rollout samples, computes gradients independently, and updates its parameters independently, achieving true non-shared-parameter multi-agent reinforcement learning.
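Role-based routing is, at its core, a lookup from agent role to an OpenAI-compatible endpoint. The sketch below shows the idea; the `ROUTES` table, hostnames, and model names are hypothetical, not AgentJet configuration:

```python
# Each agent role points at a different Swarm Server. The URLs and
# model names here are placeholders for illustration only.
ROUTES = {
    "planner":  {"base_url": "http://gpu-a:8000/v1", "model": "qwen-32b"},
    "executor": {"base_url": "http://gpu-b:8000/v1", "model": "qwen-7b"},
}

def route(role: str, messages: list) -> dict:
    """Pick the server and model for this role and build the chat
    request the client would send; each server then trains its own
    parameters from the rollouts it served."""
    cfg = ROUTES[role]
    return {"url": cfg["base_url"] + "/chat/completions",
            "json": {"model": cfg["model"], "messages": messages}}

req = route("planner", [{"role": "user", "content": "Plan the steps."}])
print(req["json"]["model"])  # qwen-32b
```

Because each Server only ever sees the conversations routed to it, the 32B and 7B policies accumulate disjoint rollout buffers and update independently.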
This capability opens up many research directions that were previously difficult to realize:
- Heterogeneous Team Game: Models of different ability levels form teams to learn optimal strategies respectively in competitive or cooperative environments.
- Cascaded Decision Optimization: Large models are responsible for coarse-grained decisions, and small models execute fine-grained operations, jointly optimizing the entire decision chain end-to-end.
- Teacher-Student Collaborative Training: Large model acts as a teacher to provide high-quality planning, small model acts as a student to learn execution, and both evolve together through RL signals.
Efficient Training/Inference GPU Colocation Based on VERL
The flexibility of the AgentJet swarm architecture does not come at the cost of GPU utilization or large GPU bubbles. Inside the Swarm Server, AgentJet still adopts the battle-tested VERL training/inference colocation architecture: inference (rollout generation) and training (gradient updates) share the same group of GPUs, avoiding wasted GPU memory.
For researchers familiar with VERL, nearly every algorithm implemented in VERL applies to AgentJet losslessly. AgentJet adds a swarm communication layer and timeline-merging optimization on top, but the core training logic stays the same: migration costs are low and performance is preserved.
Agent-Framework Agnostic: Any OpenAI-Protocol base_url and api_key Works
AgentJet is not bound to any specific Agent framework. Whether you use LangChain, AutoGen, CrewAI, MetaGPT, or your own handwritten Agent logic based on raw HTTP requests, as long as your Agent calls LLM via OpenAI compatible API protocol (base_url + api_key), seamless access to AgentJet for training is possible.
For your Agent code, Swarm Server is no different from any other OpenAI compatible inference service. The only difference is that AgentJet silently records complete conversation context and token-level information for training in the background.
This means you can take existing, already-debugged Agent workflows straight into RL training without rewriting any inference-call logic. Even for closed-source, black-box Agents, in theory you only need to change the base_url and api_key environment variables to plug them into AgentJet for training.
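Many OpenAI-SDK-based agents already resolve their endpoint from environment variables, which is what makes the two-variable swap possible. The sketch below shows that resolution pattern; the Swarm Server address and key are placeholders, and whether a given agent honors these exact variable names depends on that agent:

```python
import os

def openai_client_config(env=os.environ) -> dict:
    """Resolve the endpoint the way many OpenAI-SDK-based agents do:
    honor OPENAI_BASE_URL / OPENAI_API_KEY when set, else default to
    the public API."""
    return {
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "api_key": env.get("OPENAI_API_KEY", ""),
    }

# Pointing an unmodified agent at a Swarm Server is just two variables
# (hypothetical address and key shown here):
cfg = openai_client_config({
    "OPENAI_BASE_URL": "http://swarm-server:8000/v1",
    "OPENAI_API_KEY": "ajet-demo-key",
})
print(cfg["base_url"])
```

From the agent's point of view nothing has changed; from AgentJet's point of view, every completion the agent requests is now a traced, trainable rollout.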
Stable, Reproducible, Version-by-Version Performance Tracking, No Worries
For a training framework, "it runs" is just the minimum requirement. "Runs correctly" and "runs stably" are what researchers really care about. AgentJet has invested heavily in engineering quality to ensure that every training result is trustworthy.
Version-by-Version Performance Tracking: We maintain a public Performance Tracking Dashboard, continuously recording AgentJet's training curves and final performance on multiple standard tasks (mathematical reasoning, code generation, tool use, etc.), across major Git versions and across different training backends (VERL, etc.). On every code update, the test bot runs the benchmarks, so any performance regression is detected immediately. This means:
- When upgrading AgentJet, you know exactly how the new version performs on the tasks you care about.
- If an update introduces a hidden bug that degrades training, we catch it immediately.
- Researchers can confidently cite AgentJet's experimental results because they are reproducible.
Token Consistency Automatic Alert & Repair: A hidden issue in Agent training is token drift: the same text may be encoded into different token sequences during inference and training, producing incorrect logprobs that pollute the policy gradient computation. AgentJet ships with automatic re-tokenization drift detection and repair, enabled by default: it verifies and corrects the token sequence before each rollout sample enters the training pipeline, eliminating this class of problems.
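To see why drift detection matters, here is a toy illustration of the check (not AgentJet's implementation): re-tokenize the decoded text, compare against the token IDs the inference engine actually emitted, and keep the engine's IDs when they disagree so the stored logprobs stay aligned with the trained sequence. The miniature vocabulary and greedy tokenizer exist only to force a drift.

```python
def detect_and_repair_drift(engine_token_ids: list, text: str, tokenize) -> tuple:
    """Compare the engine's emitted token IDs with a re-tokenization of
    the decoded text; on mismatch (drift), keep the engine's IDs as
    ground truth so logprobs stay aligned with the trained tokens."""
    retokenized = tokenize(text)
    drifted = retokenized != engine_token_ids
    return drifted, engine_token_ids

# Toy vocabulary with a merged token, so re-tokenization can disagree
# with what the engine emitted token-by-token.
VOCAB = {"foo": 1, "bar": 2, "foobar": 3}
def toy_tokenize(text: str) -> list:
    return [VOCAB.get(text.replace(" ", ""), 0)]  # greedily merges words

engine_ids = [1, 2]  # engine emitted "foo", "bar" as two tokens
drifted, fixed = detect_and_repair_drift(engine_ids, "foo bar", toy_tokenize)
print(drifted, fixed)  # True [1, 2]  (drift detected; engine IDs kept)
```

Real tokenizers drift in subtler ways (merge rules around whitespace, special tokens at message boundaries), but the repair principle is the same: the sequence the engine scored is the sequence that must enter the loss.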
High-Resolution Logs: When in-depth diagnosis of training behavior is needed, AgentJet provides token-level rollout logs, recording the ID, loss mask status, and logprob value of each token. This information is crucial for understanding model learning dynamics, troubleshooting reward signal anomalies, and verifying workflow logic correctness.
A Powerful Training Framework
For an Agent training framework, implementing a distributed architecture is far from enough. Providing a stable, easy-to-start, trustworthy training environment matters just as much. AgentJet therefore ships with, and open-sources, these core capabilities:
- Rich Tutorial Library: Provides engaging examples as tutorial material. Explore the example library to start your journey quickly.
- Timeline Automatic Merging Capability: Supports multi-agent workflows and adopts context merging technology to accelerate training by 1.5x to 10x in multi-turn (or multi-agent) conversation scenarios. (Similar to the "tree structure" processing capability mentioned in the minimax forge technical report.)
- Reliable and Reproducible: We continuously track the framework's performance on multiple different tasks + major Git versions + different training backends (data continuously aggregated), what you see is what you get, hidden bugs are discovered in seconds.
- Token Consistency Automatic Alert & Repair: By default, AgentJet automatically performs Re-tokenization drift repair based on the Token ID returned by the vLLM engine.
- Multi-Training Backend Support: Supports VERL as a training backend, with support for other backends such as TRL in progress.
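The timeline-merging capability above exploits the fact that successive turns of a rollout share a long common prefix, which only needs to be encoded once. The toy below illustrates the accounting behind that speedup (it is not AgentJet's algorithm, which per the list above handles general tree structures rather than a single shared prefix):

```python
def shared_prefix_savings(timelines: list) -> tuple:
    """Count prompt tokens before and after charging the longest common
    prefix of all timelines once instead of once per branch."""
    total = sum(len(t) for t in timelines)
    lcp = 0  # length of the longest common prefix across all timelines
    for toks in zip(*timelines):
        if len(set(toks)) == 1:
            lcp += 1
        else:
            break
    merged = total - lcp * (len(timelines) - 1)
    return total, merged

# Two turns of a multi-turn rollout sharing their first four tokens.
t1 = [101, 7, 8, 9, 55, 56]
t2 = [101, 7, 8, 9, 77, 78, 79]
total, merged = shared_prefix_savings([t1, t2])
print(total, merged)  # 13 9  (the 4-token prefix is charged once)
```

The deeper the conversation and the more agents branch off a shared context, the larger the gap between `total` and `merged`, which is where the reported 1.5x to 10x range comes from.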
Conclusion and Outlook
The core philosophy of the AgentJet swarm training framework can be summarized in one sentence: Let the flexibility of Agent training match the complexity of the Agent itself.
As Agent workflows grow more complex, rely on more external tools, and involve more heterogeneous models, the training framework should not become the bottleneck. By completely decoupling training and inference (Server) from the Agent runtime (Client), AgentJet achieves:
- Developer Friendly: Debug Agent workflows on a laptop with IDE, connect to remote swarm for instant training.
- Engineering Robustness: External runtime failures do not affect training progress, seamlessly resume after repair.
- Algorithm Flexibility: Multi-task mixed training, heterogeneous multi-model collaborative training, dynamic data ratio adjustment, everything is configurable.
- Reliable Performance: Inherits VERL's efficient CoLocate architecture, supplemented by timeline merging acceleration techniques, and version-by-version performance tracking capabilities to ensure trustworthy results.
We believe that when the training framework is no longer a limiting factor, researchers and engineers can devote more energy to truly important things - designing better Agent architectures, exploring more effective reward signals, and trying bolder multi-agent collaboration strategies.
AgentJet is fully open-sourced on GitHub. Researchers and developers in the community are welcome to try it, give feedback, and contribute. Let's push LLM Agent training into the swarm era together.
Project Address: https://github.com/modelscope/AgentJet
Performance Dashboard: https://benchmark.agentjet.top/
Official Documentation: https://modelscope.github.io/AgentJet/