Commit a145eaf

refactor: Move CPOTrainer to experimental module (#4470)
1 parent d2dc717 commit a145eaf

File tree

15 files changed: +1423 −1355 lines


docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions
@@ -56,8 +56,6 @@
   title: Examples
 - sections:
   - sections: # Sorted alphabetically
-    - local: cpo_trainer
-      title: CPO
     - local: dpo_trainer
       title: DPO
     - local: online_dpo_trainer
@@ -105,6 +103,8 @@
       title: BEMA for Reference Model
     - local: bco_trainer
       title: BCO
+    - local: cpo_trainer
+      title: CPO
     - local: gfpo
       title: GFPO
     - local: gold_trainer

docs/source/cpo_trainer.md

Lines changed: 11 additions & 11 deletions
@@ -24,7 +24,7 @@ Below is the script to train the model:
 ```python
 # train_cpo.py
 from datasets import load_dataset
-from trl import CPOConfig, CPOTrainer
+from trl.experimental.cpo import CPOConfig, CPOTrainer
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -44,7 +44,7 @@ accelerate launch train_cpo.py
 
 ## Expected dataset type
 
-CPO requires a [preference dataset](dataset_formats#preference). The [`CPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+CPO requires a [preference dataset](dataset_formats#preference). The [`experimental.cpo.CPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 ## Example script
 
@@ -80,31 +80,31 @@ The abstract from the paper is the following:
 
 > Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3. We evaluated on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard -- making it the strongest 8B open-source model.
 
-The SimPO loss is integrated in the [`CPOTrainer`], as it's an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, just turn on `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`] and set the `simpo_gamma` to a recommended value.
+The SimPO loss is integrated in the [`experimental.cpo.CPOTrainer`], as it's an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, just turn on `loss_type="simpo"` and `cpo_alpha=0.0` in the [`experimental.cpo.CPOConfig`] and set the `simpo_gamma` to a recommended value.
 
 ### CPO-SimPO
 
-We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO GitHub](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the [`CPOConfig`].
+We also offer the combined use of CPO and SimPO, which enables more stable training and improved performance. Learn more details at [CPO-SimPO GitHub](https://github.com/fe1ixxu/CPO_SIMPO). To use this method, simply enable SimPO by setting `loss_type="simpo"` and a non-zero `cpo_alpha` in the [`experimental.cpo.CPOConfig`].
 
 ### AlphaPO
 
-The [AlphaPO -- Reward shape matters for LLM alignment](https://huggingface.co/papers/2501.03884) (AlphaPO) method by Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, [Jiwoo Hong](https://huggingface.co/JW17), Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Jason Zhu, Natesh Pillai, and S. Sathiya Keerthi is also implemented in the [`CPOTrainer`]. AlphaPO is an alternative method that applies a transformation to the reward function shape in the context of SimPO loss. The abstract from the paper is the following:
+The [AlphaPO -- Reward shape matters for LLM alignment](https://huggingface.co/papers/2501.03884) (AlphaPO) method by Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, [Jiwoo Hong](https://huggingface.co/JW17), Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Jason Zhu, Natesh Pillai, and S. Sathiya Keerthi is also implemented in the [`experimental.cpo.CPOTrainer`]. AlphaPO is an alternative method that applies a transformation to the reward function shape in the context of SimPO loss. The abstract from the paper is the following:
 
 > Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Some popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs the reward (function) shape matters. We introduce AlphaPO, a new DAA method that leverages an α-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and overoptimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B while achieving 15% to 50% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
 
-To use this loss as described in the paper, we can set the `loss_type="alphapo"` which automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`, together with `alpha` and `simpo_gamma` to recommended values in the [`CPOConfig`]. Alternatively, you can manually set `loss_type="simpo"`, `cpo_alpha=0.0`, together with `alpha` and `simpo_gamma` to recommended values. Other variants of this method are also possible, such as setting `loss_type="ipo"` and `alpha` to any non-zero value.
+To use this loss as described in the paper, we can set the `loss_type="alphapo"` which automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`, together with `alpha` and `simpo_gamma` to recommended values in the [`experimental.cpo.CPOConfig`]. Alternatively, you can manually set `loss_type="simpo"`, `cpo_alpha=0.0`, together with `alpha` and `simpo_gamma` to recommended values. Other variants of this method are also possible, such as setting `loss_type="ipo"` and `alpha` to any non-zero value.
 
 ## Loss functions
 
-The CPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`CPOConfig`]. The following loss functions are supported:
+The CPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`experimental.cpo.CPOConfig`]. The following loss functions are supported:
 
 | `loss_type=` | Description |
 | --- | --- |
 | `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact, the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
 | `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
 | `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only). |
-| `"simpo"` | The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`] and `simpo_gamma` to a recommended value. |
-| `"alphapo"` | The [AlphaPO](https://huggingface.co/papers/2501.03884) method is also implemented in the [`CPOTrainer`]. This is syntactic sugar that automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`. AlphaPO applies a transformation to the reward function shape in the context of SimPO loss when the `alpha` parameter is non-zero. |
+| `"simpo"` | The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`experimental.cpo.CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`experimental.cpo.CPOConfig`] and `simpo_gamma` to a recommended value. |
+| `"alphapo"` | The [AlphaPO](https://huggingface.co/papers/2501.03884) method is also implemented in the [`experimental.cpo.CPOTrainer`]. This is syntactic sugar that automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`. AlphaPO applies a transformation to the reward function shape in the context of SimPO loss when the `alpha` parameter is non-zero. |
 
 ### For Mixture of Experts Models: Enabling the auxiliary loss
 
@@ -116,11 +116,11 @@ To scale how much the auxiliary loss contributes to the total loss, use the hype
 
 ## CPOTrainer
 
-[[autodoc]] CPOTrainer
+[[autodoc]] experimental.cpo.CPOTrainer
     - train
     - save_model
     - push_to_hub
 
 ## CPOConfig
 
-[[autodoc]] CPOConfig
+[[autodoc]] experimental.cpo.CPOConfig
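
For reference, a minimal end-to-end sketch of the quick-start above using the post-refactor import path. The model and `processing_class` usage follow the documented snippet; the dataset name and `output_dir` are illustrative assumptions, not part of this commit.

```python
# train_cpo.py — minimal sketch using the new trl.experimental.cpo import path
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# CPOConfig and CPOTrainer now live under trl.experimental.cpo (this commit).
from trl.experimental.cpo import CPOConfig, CPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Any preference-format dataset works; this dataset name is an illustrative choice.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = CPOConfig(output_dir="Qwen2-0.5B-CPO")  # output_dir is illustrative
trainer = CPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```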

docs/source/dataset_formats.md

Lines changed: 1 addition & 1 deletion
@@ -388,7 +388,7 @@ Choosing the right dataset type depends on the task you are working on and the s
 | Trainer | Expected dataset type |
 | --- | --- |
 | [`experimental.bco.BCOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
-| [`CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
+| [`experimental.cpo.CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
 | [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
 | [`GKDTrainer`] | [Prompt-completion](#prompt-completion) |
 | [`GRPOTrainer`] | [Prompt-only](#prompt-only) |
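
As context for the table row above, a preference example with an explicit prompt looks roughly like the sketch below (values are illustrative; see the dataset formats documentation for the exact specification).

```python
# Standard-format preference example with an explicit prompt field.
standard_example = {
    "prompt": "The sky is",
    "chosen": " blue.",
    "rejected": " green.",
}

# Conversational-format equivalent; the trainer applies the chat template
# to conversational data automatically.
conversational_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
```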

docs/source/example_overview.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | File | Description |
 | --- | --- |
 | [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty, and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
-| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
+| [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`experimental.cpo.CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
 | [`trl/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model. |
 | [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
 | [`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py) | This script shows how to use [`HfPairwiseJudge`] or [`experimental.judges.OpenAIPairwiseJudge`] to judge model generations. |

docs/source/index.md

Lines changed: 3 additions & 3 deletions
@@ -26,8 +26,8 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL
 - [`RLOOTrainer`] ⚡️
 - [`OnlineDPOTrainer`] ⚡️
 - [`NashMDTrainer`] ⚡️
-- [`experimental.xpo.XPOTrainer`] 🧪 ⚡️
 - [`PPOTrainer`]
+- [`experimental.xpo.XPOTrainer`] 🧪 ⚡️
 
 ### Reward modeling
 
@@ -42,9 +42,9 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL
 - [`SFTTrainer`]
 - [`DPOTrainer`]
 - [`ORPOTrainer`]
-- [`experimental.bco.BCOTrainer`] 🧪
-- [`CPOTrainer`]
 - [`KTOTrainer`]
+- [`experimental.bco.BCOTrainer`] 🧪
+- [`experimental.cpo.CPOTrainer`] 🧪
 
 ### Knowledge distillation
 

docs/source/paper_index.md

Lines changed: 2 additions & 2 deletions
@@ -556,7 +556,7 @@ training_args = RLOOConfig(
 
 ## Contrastive Preference Optimization
 
-Papers relating to the [`CPOTrainer`]
+Papers relating to the [`experimental.cpo.CPOTrainer`]
 
 ### AlphaPO -- Reward shape matters for LLM alignment
 
@@ -565,7 +565,7 @@ Papers relating to the [`CPOTrainer`]
 AlphaPO is a new Direct Alignment Algorithms (DAAs) method that leverages an alpha-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. To reproduce the paper's setting, use this configuration:
 
 ```python
-from trl import CPOConfig
+from trl.experimental.cpo import CPOConfig
 
 # Mistral-Instruct from Table 3 of the paper
 training_args = CPOConfig(
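
The diff above truncates before the paper's Table 3 values. As a shape-only sketch of how the new import path is used with the AlphaPO loss, every numeric value below is a placeholder assumption, not the paper's setting.

```python
from trl.experimental.cpo import CPOConfig

# Placeholder values only — substitute the recommended settings from the paper.
training_args = CPOConfig(
    output_dir="alphapo-mistral-instruct",  # illustrative
    loss_type="alphapo",  # shorthand that sets loss_type="simpo" and cpo_alpha=0.0
    alpha=0.25,           # placeholder, not the paper's Table 3 value
    simpo_gamma=0.5,      # placeholder, not the paper's Table 3 value
)
```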

examples/scripts/cpo.py

Lines changed: 2 additions & 1 deletion
@@ -63,7 +63,8 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
 
-from trl import CPOConfig, CPOTrainer, ModelConfig, ScriptArguments, get_peft_config
+from trl import ModelConfig, ScriptArguments, get_peft_config
+from trl.experimental.cpo import CPOConfig, CPOTrainer
 
 
 # Enable logging in a Hugging Face Space
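
Downstream code that has to run on TRL versions from both before and after this refactor can guard the import. A small sketch, not part of this commit:

```python
# Import shim: prefer the new experimental location introduced by this commit,
# fall back to the pre-refactor top-level import on older TRL releases.
try:
    from trl.experimental.cpo import CPOConfig, CPOTrainer
except ImportError:
    from trl import CPOConfig, CPOTrainer
```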

tests/test_cpo_trainer.py renamed to tests/experimental/test_cpo_trainer.py

Lines changed: 2 additions & 2 deletions
@@ -17,9 +17,9 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer
 
-from trl import CPOConfig, CPOTrainer
+from trl.experimental.cpo import CPOConfig, CPOTrainer
 
-from .testing_utils import TrlTestCase, require_peft
+from ..testing_utils import TrlTestCase, require_peft
 
 
 class TestCPOTrainer(TrlTestCase):

tests/experimental/test_trainers_args.py

Lines changed: 42 additions & 0 deletions
@@ -17,6 +17,7 @@
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
 
 from trl.experimental.bco import BCOConfig, BCOTrainer
+from trl.experimental.cpo import CPOConfig, CPOTrainer
 from trl.experimental.xpo import XPOConfig, XPOTrainer
 
 from ..testing_utils import TrlTestCase, require_sklearn
@@ -71,6 +72,47 @@ def test_bco(self):
         assert trainer.args.min_density_ratio == 0.2
         assert trainer.args.max_density_ratio == 20.0
 
+    def test_cpo(self):
+        model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
+        training_args = CPOConfig(
+            self.tmp_dir,
+            max_length=256,
+            max_prompt_length=64,
+            max_completion_length=64,
+            beta=0.5,
+            label_smoothing=0.5,
+            loss_type="hinge",
+            disable_dropout=False,
+            cpo_alpha=0.5,
+            simpo_gamma=0.2,
+            label_pad_token_id=-99,
+            padding_value=-99,
+            truncation_mode="keep_start",
+            # generate_during_eval=True, # ignore this one, it requires wandb
+            is_encoder_decoder=True,
+            model_init_kwargs={"trust_remote_code": True},
+            dataset_num_proc=4,
+        )
+        trainer = CPOTrainer(model=model_id, args=training_args, train_dataset=dataset, processing_class=tokenizer)
+        assert trainer.args.max_length == 256
+        assert trainer.args.max_prompt_length == 64
+        assert trainer.args.max_completion_length == 64
+        assert trainer.args.beta == 0.5
+        assert trainer.args.label_smoothing == 0.5
+        assert trainer.args.loss_type == "hinge"
+        assert not trainer.args.disable_dropout
+        assert trainer.args.cpo_alpha == 0.5
+        assert trainer.args.simpo_gamma == 0.2
+        assert trainer.args.label_pad_token_id == -99
+        assert trainer.args.padding_value == -99
+        assert trainer.args.truncation_mode == "keep_start"
+        # self.assertEqual(trainer.args.generate_during_eval, True)
+        assert trainer.args.is_encoder_decoder
+        assert trainer.args.model_init_kwargs == {"trust_remote_code": True}
+        assert trainer.args.dataset_num_proc == 4
+
     @pytest.mark.parametrize("alpha_list", [False, True])
     def test_xpo(self, alpha_list):
         model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
