Twinkle vs veRL: Two Approaches to LLM Post-Training
Reinforcement Learning from Human Feedback (RLHF) and its variants have become essential for aligning LLMs. Two excellent open-source frameworks in this space are veRL (from ByteDance Seed team) and Twinkle (from ModelScope). Both are production-ready and support diverse training scenarios. In this post, we compare their architectural philosophies and help you choose the right tool for your needs.
Overview
Both veRL and Twinkle are mature, production-ready frameworks for LLM post-training. They share many capabilities but differ in architectural philosophy:
| Aspect | veRL | Twinkle |
|---|---|---|
| Architecture | Hybrid-controller (HybridFlow) | Client-Server decoupled |
| Core Strength | RL algorithm richness | Multi-tenant unified platform |
| Backends | FSDP, Megatron-LM, vLLM, SGLang | Transformers, Megatron |
| Hardware | NVIDIA, AMD, Ascend | NVIDIA, Ascend |
| Deployment | Ray cluster | torchrun / Ray / HTTP (TaaS) |
Architecture Comparison
veRL: Hybrid-Controller Architecture
veRL implements the HybridFlow paper’s hybrid-controller design, optimizing dataflow between training and inference phases:
┌─────────────────────────────────────────────┐
│ veRL Hybrid Controller │
│ ┌────────────┐ ┌────────────┐ ┌─────────┐ │
│ │ Rollout │ │ Training │ │ Reward │ │
│ │ (vLLM/SGL) │──│ (FSDP/ │──│ Model │ │
│ │ │ │ Megatron) │ │ │ │
│ └────────────┘ └────────────┘ └─────────┘ │
│ 3D-HybridEngine: Efficient Resharding │
└─────────────────────────────────────────────┘
Key strengths:
- 3D-HybridEngine: Eliminates memory redundancy during training/generation transitions
- Rich RL algorithms: PPO, GRPO, DAPO, VAPO, REINFORCE++, RLOO, PRIME, and more
- Inference engine integration: First-class vLLM and SGLang support
- Proven at scale: Used to train Doubao-1.5-pro, achieving O1-level math performance
Twinkle: Client-Server Decoupled Architecture
Twinkle separates concerns into client (data/logic) and server (model/compute) components:
┌──────────────┐ ┌──────────────────────────┐
│ Client │ │ Server Cluster │
│ ┌────────┐ │ │ ┌─────────────────────┐ │
│ │Dataset │ │────▶│ │ Base Model │ │
│ │Template│ │ │ ├─────────────────────┤ │
│ │ Loss │ │ │ │ LoRA A │ LoRA B │...│ │
│ └────────┘ │ │ └─────────────────────┘ │
└──────────────┘ └──────────────────────────┘
Key strengths:
- Multi-tenancy: Multiple LoRA training jobs on a shared base model
- HTTP/TaaS mode: Deploy as a service, train via API calls
- Unified platform: SFT, PT, and RL on the same infrastructure
- Explicit training loop: Full control over each training step
Feature Comparison
RL Algorithms
| Algorithm | veRL | Twinkle |
|---|---|---|
| PPO | ✅ | ✅ |
| GRPO | ✅ | ✅ |
| DAPO / VAPO | ✅ | - |
| REINFORCE++ | ✅ | - |
| RLOO | ✅ | ✅ |
| GKD | ✅ | ✅ |
| Multi-turn RL | ✅ | ✅ |
Training Capabilities
| Feature | veRL | Twinkle |
|---|---|---|
| SFT | ✅ | ✅ |
| Pre-training | ✅ | ✅ |
| LoRA | ✅ | ✅ |
| VLM / Multimodal | ✅ (Qwen2.5-VL, Kimi-VL) | Planned |
| Multi-turn + Tools | ✅ | ✅ |
| Multi-tenancy | - | ✅ |
Scale & Performance
| Aspect | veRL | Twinkle |
|---|---|---|
| Max tested scale | 671B (DeepSeek), hundreds of GPUs | 72B+, Ray clusters |
| Inference engines | vLLM, SGLang, HF | vLLM, HF |
| Training backends | FSDP, FSDP2, Megatron-LM | Transformers, Megatron |
When to Choose veRL
veRL excels when:
- You need state-of-the-art RL algorithms (DAPO, VAPO, REINFORCE++)
- VLM/multimodal RL is a requirement
- You want vLLM/SGLang as your inference engine for rollouts
- You’re pushing the frontier of RL research for reasoning models
- You need proven scale (671B models, O1-level results)
When to Choose Twinkle
Twinkle excels when:
- Multi-tenancy is critical (multiple teams, concurrent training jobs)
- You need a unified SFT → RL pipeline with one infrastructure
- Training-as-a-Service (TaaS) deployment via HTTP is important
- You want explicit training loop control for custom logic
- Pre-training is part of your workflow
Code Style Comparison
veRL: Declarative Trainer
# veRL style - configure and run
from verl import DataProto
from verl.trainer.ppo import PPOTrainer
trainer = PPOTrainer(
config=config,
actor_rollout_ref=actor,
critic=critic,
reward_model=reward_fn,
)
trainer.fit()
Twinkle: Explicit Training Loop
# Twinkle style - explicit control
from twinkle import TransformersModel
model = TransformersModel(model_id=model_id)
model.add_adapter_to_model('default', lora_config)
model.set_optimizer(optimizer_cls='AdamW', lr=1e-4)
for batch in dataloader:
model.forward_backward(inputs=batch)
# Custom logic here
model.clip_grad_and_step()
Conclusion
Both veRL and Twinkle are excellent choices for LLM post-training. They represent different design philosophies:
- veRL: Optimized for RL performance and algorithm diversity, with cutting-edge research support
- Twinkle: Optimized for operational flexibility, multi-tenancy, and unified training workflows
The good news? Both are open source, actively maintained, and production-ready. Choose based on your primary use case:
| Your Priority | Recommended |
|---|---|
| Cutting-edge RL algorithms | veRL |
| VLM/multimodal training | veRL |
| Multi-tenant platform | Twinkle |
| TaaS deployment | Twinkle |
| Unified SFT+RL infra | Twinkle |
Resources
veRL:
Twinkle: