Offline DPO and SFT
This example walks through offline DPO and SFT training, using the Qwen2.5-1.5B-Instruct model.
Step 1: Model and Data Preparation
Model Preparation
Download the Qwen2.5-1.5B-Instruct model to the local directory $MODEL_PATH/Qwen2.5-1.5B-Instruct:
# Using Modelscope
modelscope download Qwen/Qwen2.5-1.5B-Instruct --local_dir $MODEL_PATH/Qwen2.5-1.5B-Instruct
# Using Huggingface
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir $MODEL_PATH/Qwen2.5-1.5B-Instruct
For more details on model downloading, refer to ModelScope or Hugging Face.
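Optionally, you can sanity-check the download by loading the model's config and tokenizer from the local directory. Below is a minimal sketch using the transformers library (an assumption; it is not required by this example):

import os
from transformers import AutoConfig, AutoTokenizer

# Verify that the downloaded files are readable from the local directory.
model_dir = os.path.join(os.environ["MODEL_PATH"], "Qwen2.5-1.5B-Instruct")

config = AutoConfig.from_pretrained(model_dir)        # reads config.json
tokenizer = AutoTokenizer.from_pretrained(model_dir)  # reads tokenizer files

print(config.model_type)             # expected: "qwen2"
print(tokenizer("hello").input_ids)  # should tokenize without errors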
Data Preparation
For DPO, we download the Human-Like-DPO-Dataset to the local directory $DATASET_PATH/human_like_dpo_dataset:
# Using Modelscope
modelscope download --dataset HumanLLMs/Human-Like-DPO-Dataset --local_dir $DATASET_PATH/human_like_dpo_dataset
# Using Huggingface
huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset --repo-type dataset --local-dir $DATASET_PATH/human_like_dpo_dataset
For more details on dataset downloading, refer to ModelScope or Hugging Face.
Note that this dataset provides the keys prompt, chosen, and rejected. If your dataset uses different keys, pass the proper keys to the config.
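A quick way to confirm the keys is to load the dataset locally. Below is a minimal sketch using the datasets library (an assumption; it also assumes the downloaded files are in JSON format, so adjust the loader otherwise):

import os
from datasets import load_dataset

# Inspect the downloaded DPO dataset and confirm the expected keys.
data_dir = os.path.join(os.environ["DATASET_PATH"], "human_like_dpo_dataset")

ds = load_dataset("json", data_dir=data_dir, split="train")
print(ds.column_names)  # expect: ['prompt', 'chosen', 'rejected']
print(ds[0]["prompt"])  # peek at one sample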
For SFT, we download the dataset to the local directory /PATH/TO/SFT_DATASET/. SFT datasets usually contain message-based (chat-format) data; a sample record is sketched below.
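For orientation, a message-based record typically takes a chat-style shape like the following (the values are purely illustrative; the field name matches the messages_key used in the SFT config below):

# Illustrative shape of one message-based SFT record (values are made up).
example_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}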
Step 2: Setup Configuration
Configuration for DPO
We use the configurations in dpo.yaml and train_dpo.yaml for this experiment. The key settings are listed below:
Since no Explorer is involved, the experiment runs in train mode. To enable this mode, set mode to train and pass the dataset path to the trainer.
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: dpo
  kl_loss_fn: k1
  kl_loss_fn_args:
    kl_coef: 0.1  # value of beta in DPO
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 2
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: human_like_dpo
      storage_type: file
      path: $DATASET_PATH/human_like_dpo_dataset
      format:
        prompt_type: plaintext  # plaintext/messages/chatpair
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
trainer:
  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
  save_interval: 30
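For reference, kl_coef sets the coefficient β in the standard DPO objective, where y_w and y_l are the chosen and rejected responses for prompt x, and π_ref is the frozen reference model:

\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]

A larger β keeps the trained policy closer to the reference model; a smaller β lets preference data pull it further away.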
Configuration for SFT
To run SFT, we set algorithm_type to sft and modify the config file sft.yaml with the following changes:
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: sft
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: /PATH/TO/MODEL/
cluster:
  node_num: 1
  gpu_per_node: 2
buffer:
  total_epochs: 5
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: <sft_dataset_name>
      storage_type: file
      path: /PATH/TO/SFT_DATASET/
      split: train
      format:
        prompt_type: messages
        messages_key: messages
trainer:
  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
  save_interval: 50
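To get a feel for how often checkpoints are written, here is a rough sizing sketch (the dataset size is made up, and we assume save_interval counts training steps):

import math

# Back-of-the-envelope: checkpoint frequency under the settings above.
num_records = 10_000   # hypothetical dataset size
batch_size = 64
total_epochs = 5
save_interval = 50     # assumed to be measured in training steps

steps_per_epoch = math.ceil(num_records / batch_size)  # 157
total_steps = steps_per_epoch * total_epochs           # 785
print(total_steps // save_interval, "checkpoints")     # -> 15 checkpoints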
Step 3: Run the Experiment
Run the DPO process with the following command:
trinity run --config examples/dpo_humanlike/dpo.yaml
or, for SFT:
trinity run --config /PATH/TO/sft.yaml