# Example Summary

## From the Dataset Perspective
This guide lists the examples from the dataset perspective, so you can easily see which datasets the examples cover.
| Dataset | Algorithm | Use Case | References |
|---|---|---|---|
| | GRPO | Regular RFT | |
| | GRPO | Asynchronous training | |
| | Multi-Step GRPO | AgentScope ReAct agent training | |
| | AsymRE | Regular RFT | |
| | CISPO | Regular RFT | |
| | GRPO | Training with prioritized tasks | |
| | GRPO | Training with reward reshaping on experiences | |
| | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | |
| | GRPO | Training a policy model as its own reward model | |
| | GRPO | Training using LoRA | |
| | OPMD | Off-policy RFT | |
| | REC | Training with group-relative REINFORCE variants | |
| | sPPO | Training with the sPPO algorithm | |
| | TOPR | Tapered off-policy RFT | |
| Math category tasks | GRPO | Training with rewards from RM-Gallery | |
| | AsymRE | Regular RFT | |
| | MIX | Training with “expert” data generated by a more advanced LLM | |
| | GRPO | Concatenated multi-turn RFT | |
| | Multi-Step GRPO | General multi-step RFT | |
| | GRPO | Concatenated multi-turn RFT | |
| | GRPO | Concatenated multi-turn RFT | |
| | Multi-Step GRPO | Multi-turn web search agent training | |
| | Multi-Step GRPO | Multi-turn email search agent training | |
| | GRPO | Regular RFT | |
| | GRPO | Training with Bayesian online task selection | |
| | GRPO | Concatenated multi-turn RFT | |
| | GRPO | Training with rewards from an LLM judge and rubrics for a non-verifiable medical QA task | |
| | GRPO | Regular RFT for tool calling | |
| | GRPO | Regular RFT for VLMs | |
| | MIX | Training with “expert” data generated by a more advanced LLM | |
| | GRPO | Regular RFT for learning to ask proactively | |
| | CHORD | Training with dynamic SFT + RL integration | |
| | CHORD | Training with dynamic SFT + RL integration | |
| | PPO | Training with a critic model | |
| | PPO | Training with Megatron-LM as the backend | |
| | PPO | Training with experience replay | |
| | SFT | Regular SFT | |
| | DPO | Training based on prepared human preferences | |
| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | |