Offline DPO
This example describes offline DPO based on the Qwen2.5-1.5B-Instruct model and the Human-Like-DPO-Dataset.
Step 1: Model and Data Preparation
Model Preparation
Download the Qwen2.5-1.5B-Instruct model to the local directory $MODEL_PATH/Qwen2.5-1.5B-Instruct:
# Using Modelscope
modelscope download Qwen/Qwen2.5-1.5B-Instruct --local_dir $MODEL_PATH/Qwen2.5-1.5B-Instruct
# Using Huggingface
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir $MODEL_PATH/Qwen2.5-1.5B-Instruct
For more details on model downloading, refer to ModelScope or Hugging Face.
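Before training, you can optionally verify that the downloaded model loads correctly. The following is a minimal sanity check using the transformers library; the local path is an assumption and should be replaced with the expanded value of $MODEL_PATH/Qwen2.5-1.5B-Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed local path; replace with the expanded $MODEL_PATH/Qwen2.5-1.5B-Instruct
model_path = "/PATH/TO/MODEL/Qwen2.5-1.5B-Instruct"

# Loading the tokenizer and model weights confirms the download is complete and readable
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
print(model.config.model_type, model.num_parameters())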
Data Preparation
Download the Human-Like-DPO-Dataset to the local directory $DATASET_PATH/human_like_dpo_dataset:
# Using Modelscope
modelscope download --dataset HumanLLMs/Human-Like-DPO-Dataset --local_dir $DATASET_PATH/human_like_dpo_dataset
# Using Huggingface
huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset --repo-type dataset --local-dir $DATASET_PATH/human_like_dpo_dataset
For more details on dataset downloading, refer to ModelScope or Hugging Face.
Note that the dataset has the keys prompt, chosen and rejected. If your dataset uses different key names, pass the proper keys in the format section of the config. A quick way to check the keys is sketched below.
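The following sketch inspects the downloaded files with the datasets library and prints the column names; it assumes the dataset is stored as JSON files under the expanded $DATASET_PATH/human_like_dpo_dataset, so adjust the path and file pattern to your local layout.
import glob
from datasets import load_dataset

# Assumed local path; replace with the expanded $DATASET_PATH/human_like_dpo_dataset
data_files = glob.glob("/PATH/TO/DATASET/human_like_dpo_dataset/*.json")

# Load the raw records and check the keys expected by the DPO buffer
ds = load_dataset("json", data_files=data_files, split="train")
print(ds.column_names)        # expected: ['prompt', 'chosen', 'rejected']
print(ds[0]["prompt"][:200])  # peek at one prompt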
Step 2: Set Up the Configuration and Run the Experiment
Configuration
We use the configurations in dpo.yaml and train_dpo.yaml for this experiment. Some important settings are listed below:
We run the experiment in train mode, as there is no Explorer. To enable this mode, set mode to train and pass the dataset path to the trainer.
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: dpo
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: /PATH/TO/MODEL/
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 2
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: dpo_buffer
      storage_type: file
      path: /PATH/TO/DATASET/
      format:
        prompt_type: plaintext  # plaintext/messages/chatpair
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
trainer:
  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
  save_interval: 30
  actor_use_kl_loss: True
  actor_kl_loss_coef: 0.1  # value of beta in DPO
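For reference, actor_kl_loss_coef plays the role of the coefficient β in the standard DPO objective, where y_w and y_l denote the chosen and rejected responses, π_ref is the frozen reference model, and a larger β keeps the policy closer to the reference:
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]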
Run the Experiment
Run the RFT process with the following command:
trinity run --config examples/dpo_humanlike/dpo.yaml