# Offline DPO and SFT
This example describes DPO and SFT based on the Qwen2.5-1.5B-Instruct model.
## Step 1: Model and Data Preparation
### Model Preparation
Download the Qwen2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:

```bash
# Using ModelScope
modelscope download Qwen/Qwen2.5-1.5B-Instruct --local_dir $MODEL_PATH/Qwen2.5-1.5B-Instruct

# Using Hugging Face
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir $MODEL_PATH/Qwen2.5-1.5B-Instruct
```
For more details on model downloading, refer to the ModelScope or Hugging Face documentation.
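
If you want to confirm that the download is complete, a minimal check with Hugging Face `transformers` (assuming it is installed and `$MODEL_PATH` is set in your environment) could look like this:

```python
# Optional sanity check: make sure the downloaded model files load correctly.
import os
from transformers import AutoConfig, AutoTokenizer

model_dir = os.path.expandvars("$MODEL_PATH/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
config = AutoConfig.from_pretrained(model_dir)
print(config.model_type, config.num_hidden_layers, tokenizer.vocab_size)
```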
### Data Preparation
For DPO, we download the Human-Like-DPO-Dataset to the local directory `$DATASET_PATH/human_like_dpo_dataset`:

```bash
# Using ModelScope
modelscope download --dataset HumanLLMs/Human-Like-DPO-Dataset --local_dir $DATASET_PATH/human_like_dpo_dataset

# Using Hugging Face
huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset --repo-type dataset --local-dir $DATASET_PATH/human_like_dpo_dataset
```
For more details on dataset downloading, refer to the ModelScope or Hugging Face documentation.
Note that this dataset provides the keys `prompt`, `chosen`, and `rejected`. If your dataset uses different key names, pass the proper keys in the `format` section of the config.
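
A quick way to confirm that the downloaded data exposes these keys is to load it with the Hugging Face `datasets` library. The sketch below assumes the data is stored as JSON files under `$DATASET_PATH/human_like_dpo_dataset` and that `$DATASET_PATH` is set in your environment; adjust the file pattern if your copy uses a different layout.

```python
# Sketch: verify the DPO dataset exposes the expected keys before training.
import os
from datasets import load_dataset

data_dir = os.path.expandvars("$DATASET_PATH/human_like_dpo_dataset")
ds = load_dataset("json", data_files=os.path.join(data_dir, "*.json"), split="train")

print(ds.column_names)        # expected to include 'prompt', 'chosen', 'rejected'
print(ds[0]["prompt"][:100])  # peek at the first prompt
```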
For SFT, we download the dataset to the local directory `/PATH/TO/SFT_DATASET/`, which usually contains message-based data.
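
Message-based data typically stores each sample as a list of role/content messages under a single key (here, `messages`). The snippet below only illustrates what such a record can look like; the file name `train.jsonl` and the example contents are hypothetical, not part of this example's dataset.

```python
# Illustration: write one message-based SFT record in JSON Lines format.
# The key must match `messages_key` in the config; the file name is hypothetical.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain supervised fine-tuning in one sentence."},
        {"role": "assistant", "content": "Supervised fine-tuning trains a model on labeled prompt-response pairs."},
    ]
}

with open("/PATH/TO/SFT_DATASET/train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```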
## Step 2: Setup Configuration
### Configuration for DPO
We use the configurations in `dpo.yaml` and `train_dpo.yaml` for this experiment. Some important settings are listed below.
We run the experiment in train mode, as there is no Explorer. To enable this mode, set `mode` to `train` and pass the data path to the trainer:
```yaml
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: dpo
  kl_loss_fn: k1
  kl_loss_fn_args:
    kl_coef: 0.1  # value of beta in DPO
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 2
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: human_like_dpo
      storage_type: file
      path: $DATASET_PATH/human_like_dpo_dataset
      format:
        prompt_type: plaintext # plaintext/messages/chatpair
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
trainer:
  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
  save_interval: 30
```
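
Here `kl_coef` plays the role of β in DPO: a larger β penalizes drifting away from the reference model more strongly. The snippet below is not Trinity-RFT's internal implementation; it is just the standard pairwise DPO loss written out to make the role of β (i.e. `kl_coef`) concrete.

```python
# Standard pairwise DPO loss, shown only to illustrate what `kl_coef` (beta) controls.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are per-sample sequence log-probabilities; beta corresponds to kl_coef."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Larger beta keeps the trained policy closer to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```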
### Configuration for SFT
We set `algorithm_type` to `sft` to run the SFT process, and modify the config file `sft.yaml` with the following changes:
```yaml
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: sft
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: /PATH/TO/MODEL/
cluster:
  node_num: 1
  gpu_per_node: 2
buffer:
  total_epochs: 5
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: <sft_dataset_name>
      storage_type: file
      path: /PATH/TO/SFT_DATASET/
      split: train
      format:
        prompt_type: messages
        messages_key: messages
trainer:
  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
  save_interval: 50
```
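
With `prompt_type: messages`, each record's `messages` list is rendered with the model's chat template before tokenization. The framework handles this internally; the sketch below merely shows what such rendering looks like with the Qwen2.5-1.5B-Instruct tokenizer from Step 1, so you can check that your data fits the expected roles.

```python
# Illustration: render a message-based record with the model's chat template.
# Trinity-RFT performs the actual templating internally during training.
import os
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    os.path.expandvars("$MODEL_PATH/Qwen2.5-1.5B-Instruct")
)
messages = [
    {"role": "user", "content": "What does SFT stand for?"},
    {"role": "assistant", "content": "Supervised fine-tuning."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```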
## Step 3: Run the Experiment
Run the DPO process with the following command:

```bash
trinity run --config examples/dpo_humanlike/dpo.yaml
```

or, for SFT:

```bash
trinity run --config /PATH/TO/sft.yaml
```