Example Summary

From the Dataset Perspective

This guide lists the examples organized by dataset, so you can easily see which datasets the examples cover.

| Dataset | Algorithm | Use Case | References |
| --- | --- | --- | --- |
| openai/gsm8k | GRPO | Regular RFT | example, doc |
| openai/gsm8k | GRPO | Asynchronous training | example, doc |
| openai/gsm8k | Multi-Step GRPO | AgentScope ReAct agent training | example, doc |
| openai/gsm8k | AsymRE | Regular RFT | example |
| openai/gsm8k | CISPO | Regular RFT | example |
| openai/gsm8k | GRPO | Training with prioritized tasks | example, doc |
| openai/gsm8k | GRPO | Training with reward reshaping on experiences | example, doc |
| openai/gsm8k | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | example |
| openai/gsm8k | GRPO | Training a policy model as its own reward model | example |
| openai/gsm8k | GRPO | Training using LoRA | example |
| openai/gsm8k | OPMD | Off-policy RFT | example, doc |
| openai/gsm8k | REC | Training with group-relative REINFORCE variants | example |
| openai/gsm8k | sPPO | Training with the sPPO algorithm | example |
| openai/gsm8k | TOPR | Tapered off-policy RFT | example |
| Math category tasks | GRPO | Training with rewards from RM-Gallery | example |
| Math category tasks | AsymRE | Regular RFT | example |
| Math category tasks | MIX | Training with “expert” data generated by a more advanced LLM | example, doc |
| ALFWorld | GRPO | Concatenated multi-turn RFT | example, doc |
| ALFWorld | Multi-Step GRPO | General multi-step RFT | example, doc |
| SciWorld | GRPO | Concatenated multi-turn RFT | example |
| WebShop | GRPO | Concatenated multi-turn RFT | example, doc |
| callanwu/WebWalkerQA | Multi-Step GRPO | Multi-turn web search agent training | example |
| corbt/enron-emails | Multi-Step GRPO | Multi-turn email search agent training | example, doc |
| open-r1/DAPO-Math-17k-Processed | GRPO | Regular RFT | example |
| LLM360/guru-RL-92k | GRPO | Training with Bayesian online task selection | example |
| Frozen Lake | GRPO | Concatenated multi-turn RFT | example |
| anisha2102/RaR-Medicine | GRPO | Training with rewards from an LLM judge and rubrics for a non-verifiable medicine QA task | example |
| Team-ACE/ToolACE | GRPO | Regular RFT for tool calling | example |
| hiyouga/geometry3k | GRPO | Regular RFT for VLM | example |
| hiyouga/geometry3k | MIX | Training with “expert” data generated by a more advanced LLM | example |
| datajuicer/RealMedConv | GRPO | Regular RFT for learning to ask in a proactive way | example |
| datajuicer/Trinity-ToolAce-RL-split | CHORD | Training with dynamic SFT + RL integration | example |
| datajuicer/Trinity-ToolAce-SFT-split | CHORD | Training with dynamic SFT + RL integration | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training based on the critic model | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training with Megatron-LM as the backend | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training with experience replay | example |
| open-r1/Mixture-of-Thoughts | SFT | Regular SFT | example, doc |
| HumanLLMs/Human-Like-DPO-Dataset | DPO | Training based on prepared human preferences | example, doc |
| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | example, doc |