Example Summary

From the Dataset Perspective

This guide lists the examples organized by dataset, so you can easily see which datasets the examples cover.

| Dataset | Algorithm | Use Case | References |
| --- | --- | --- | --- |
| openai/gsm8k | GRPO | Regular RFT | example, doc |
| openai/gsm8k | GRPO | Asynchronous training | example, doc |
| openai/gsm8k | Multi-Step GRPO | AgentScope ReAct agent training | example, doc |
| openai/gsm8k | AsymRE | Regular RFT | example |
| openai/gsm8k | CISPO | Regular RFT | example |
| openai/gsm8k | GRPO | Training with prioritized tasks | example, doc |
| openai/gsm8k | GRPO | Training with reward reshaping on experiences | example, doc |
| openai/gsm8k | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | example |
| openai/gsm8k | GRPO | Training a policy model as its own reward model | example |
| openai/gsm8k | GRPO | Training using LoRA | example |
| openai/gsm8k | OPMD | Off-policy RFT | example, doc |
| openai/gsm8k | REC | Training with group-relative REINFORCE variants | example |
| openai/gsm8k | sPPO | Training with the sPPO algorithm | example |
| openai/gsm8k | TOPR | Tapered off-policy RFT | example |
| Math category tasks | GRPO | Training with rewards from RM-Gallery | example |
| Math category tasks | AsymRE | Regular RFT | example |
| Math category tasks | MIX | Training with “expert” data generated by a more advanced LLM | example, doc |
| ALFWorld | GRPO | Concatenated multi-turn RFT | example, doc |
| ALFWorld | Multi-Step GRPO | General multi-step RFT | example, doc |
| SciWorld | GRPO | Concatenated multi-turn RFT | example |
| WebShop | GRPO | Concatenated multi-turn RFT | example, doc |
| callanwu/WebWalkerQA | Multi-Step GRPO | Multi-turn web search agent training | example |
| corbt/enron-emails | Multi-Step GRPO | Multi-turn email search agent training | example, doc |
| open-r1/DAPO-Math-17k-Processed | GRPO | Regular RFT | example |
| LLM360/guru-RL-92k | GRPO | Training with Bayesian online task selection | example |
| Frozen Lake | GRPO | Concatenated multi-turn RFT | example |
| anisha2102/RaR-Medicine | GRPO | Training with rewards from an LLM judge and rubrics for a non-verifiable medicine QA task | example |
| Team-ACE/ToolACE | GRPO | Regular RFT for tool calling | example |
| hiyouga/geometry3k | GRPO | Regular RFT for VLM | example |
| hiyouga/geometry3k | MIX | Training with “expert” data generated by a more advanced LLM | example |
| datajuicer/RealMedConv | GRPO | Regular RFT for learning to ask in a proactive way | example |
| datajuicer/Trinity-ToolAce-RL-split | CHORD | Training with dynamic SFT + RL integration | example |
| datajuicer/Trinity-ToolAce-SFT-split | CHORD | Training with dynamic SFT + RL integration | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training based on the critic model | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training with Megatron-LM as the backend | example |
| Jiayi-Pan/Countdown-Tasks-3to4 | PPO | Training with experience replay | example |
| open-r1/Mixture-of-Thoughts | SFT | Regular SFT | example, doc |
| HumanLLMs/Human-Like-DPO-Dataset | DPO | Training based on prepared human preferences | example, doc |
| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | example, doc |