diff --git a/trl/trainer/dpo_trainer.py b/trl/trainer/dpo_trainer.py
index 4b45b5617d..bf83bda8b2 100644
--- a/trl/trainer/dpo_trainer.py
+++ b/trl/trainer/dpo_trainer.py
@@ -63,7 +63,7 @@ class DPOTrainer(Trainer):
             Hugging Face transformer model with a casual language modelling head. Used for implicit reward computation and loss. If no
             reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
         beta (`float`, defaults to 0.1):
-            The beta factor in DPO loss. Higher beta means less divergence from the initial policy.
+            The beta factor in DPO loss. Higher beta means less divergence from the initial policy. For the IPO loss, beta is the regularization parameter denoted by tau in the paper.
         loss_type (`str`, defaults to `"sigmoid"`):
             The type of DPO loss to use. Either `"sigmoid"` the default DPO loss,`"hinge"` loss from SLiC paper or `"ipo"` from IPO paper.
         args (`transformers.TrainingArguments`):
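For context on the documented parameters, below is a minimal usage sketch (not part of the patch) of how `beta` and `loss_type` are passed to `DPOTrainer` at the trl version this diff targets, where both are constructor arguments. The model name and the toy preference dataset are illustrative placeholders, not from this PR.

```python
# Illustrative sketch only: shows how `beta` and `loss_type` from the docstring
# above are used. Model name and toy dataset are placeholders, not from the patch.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# DPOTrainer expects a preference dataset with "prompt", "chosen" and "rejected" columns.
train_dataset = Dataset.from_dict(
    {
        "prompt": ["What is the capital of France?"],
        "chosen": [" The capital of France is Paris."],
        "rejected": [" The capital of France is London."],
    }
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=TrainingArguments(
        output_dir="dpo-ipo",
        per_device_train_batch_size=1,
        remove_unused_columns=False,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    beta=0.1,          # with loss_type="ipo", this is the regularization strength (tau in the IPO paper)
    loss_type="ipo",   # alternatives: "sigmoid" (default DPO) or "hinge" (SLiC)
    max_length=128,
    max_prompt_length=64,
)
trainer.train()
```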