
Commit d478875

kashif authored and qgallouedec committed
🗿 [CPO] Add AlphaPO method via CPOTrainer (huggingface#3824)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
1 parent 5907601 commit d478875

15 files changed: +146, -28 lines

docs/source/bco_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ For a full example have a look at [`examples/scripts/bco.py`].
 ## Expected dataset type
 
 The [`BCOTrainer`] requires an [unpaired preference dataset](dataset_formats#unpaired-preference).
-The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 ## Expected model format
 The BCO trainer expects a model of `AutoModelForCausalLM`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
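
For reference, a minimal sketch contrasting the two model classes named in this hunk; the checkpoint name is a placeholder and the snippet is illustrative, not part of the commit:

```python
# Sketch of the model classes contrasted above (placeholder checkpoint).
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder, not from the commit

bco_model = AutoModelForCausalLM.from_pretrained(model_name)               # what BCOTrainer expects
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # what PPO uses for the value function
```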

docs/source/cpo_trainer.md

Lines changed: 25 additions & 7 deletions
Large diffs are not rendered by default.

docs/source/dataset_formats.md

Lines changed: 1 addition & 1 deletion
@@ -213,7 +213,7 @@ For more detailed information on tool calling, refer to the [Tool Calling sectio
 
 The [Harmony response format](https://cookbook.openai.com/articles/openai-harmony) was introduced with the [OpenAI GPT OSS models](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4). It extends the conversational format by adding richer structure for reasoning, function calls, and metadata about the model’s behavior. Key features include:
 
-- **Developer role** – Provides high-level instructions (similar to a system prompt) and lists available tools.
+- **Developer role** – Provides high level instructions (similar to a system prompt) and lists available tools.
 - **Channels** – Separate types of assistant output into distinct streams:
 
   - `analysis` – for internal reasoning, from the key `"thinking"`
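
For illustration only, here is a rough sketch of what a Harmony-style conversational example with the features above might look like; the exact field layout is an assumption rather than something taken from this commit:

```python
# Rough, assumed sketch of a Harmony-style example: the developer role carries
# the instructions, and the assistant's internal reasoning (the `analysis`
# channel) sits under the "thinking" key.
example = {
    "messages": [
        {"role": "developer", "content": "You are a concise assistant. No tools available."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "thinking": "Simple arithmetic: 2 + 2 = 4.", "content": "4"},
    ]
}
```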

docs/source/dpo_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ accelerate launch trl/scripts/dpo.py \
 
 ## Logged metrics
 
-While training and evaluating we record the following reward metrics:
+While training and evaluating, we record the following reward metrics:
 
 - `rewards/chosen`: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
 - `rewards/rejected`: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta

docs/source/grpo_trainer.md

Lines changed: 2 additions & 0 deletions
@@ -149,6 +149,8 @@ This constant is recommended to be the maximum completion length. To use this fo
 
 ## Logged metrics
 
+While training and evaluating, we record the following reward metrics:
+
 - `num_tokens`: The total number of tokens processed so far, including both prompts and completions.
 - `completions/mean_length`: The average length of generated completions.
 - `completions/min_length`: The minimum length of generated completions.

docs/source/kto_trainer.md

Lines changed: 2 additions & 2 deletions
@@ -74,7 +74,7 @@ Here are some other factors to consider when choosing a programming language for
 
 KTO requires an [unpaired preference dataset](dataset_formats#unpaired-preference). Alternatively, you can provide a *paired* preference dataset (also known simply as a *preference dataset*). In this case, the trainer will automatically convert it to an unpaired format by separating the chosen and rejected responses, assigning `label = True` to the chosen completions and `label = False` to the rejected ones.
 
-The [`KTOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+The [`KTOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 In theory, the dataset should contain at least one chosen and one rejected completion. However, some users have successfully run KTO using *only* chosen or only rejected data. If using only rejected data, it is advisable to adopt a conservative learning rate.
 
@@ -118,7 +118,7 @@ By default, they are both 1. However, if you have more of one or the other, then
 
 ## Logged metrics
 
-While training and evaluating we record the following reward metrics:
+While training and evaluating, we record the following reward metrics:
 
 - `rewards/chosen_sum`: the sum of log probabilities of the policy model for the chosen responses scaled by beta
 - `rewards/rejected_sum`: the sum of log probabilities of the policy model for the rejected responses scaled by beta
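
The paired-to-unpaired conversion described in the first KTO hunk above boils down to duplicating each row with a boolean label, as in the following sketch (an illustration using TRL's documented column names, not the trainer's internal code):

```python
# Sketch: each (prompt, chosen, rejected) row becomes two unpaired rows,
# label=True for the chosen completion and label=False for the rejected one.
def paired_to_unpaired(rows):
    unpaired = []
    for row in rows:
        unpaired.append({"prompt": row["prompt"], "completion": row["chosen"], "label": True})
        unpaired.append({"prompt": row["prompt"], "completion": row["rejected"], "label": False})
    return unpaired

paired = [{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}]
print(paired_to_unpaired(paired))
# [{'prompt': 'The sky is', 'completion': ' blue.', 'label': True},
#  {'prompt': 'The sky is', 'completion': ' green.', 'label': False}]
```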

docs/source/nash_md_trainer.md

Lines changed: 2 additions & 2 deletions
@@ -63,7 +63,7 @@ The best programming language depends on personal preference, the complexity of
 
 ## Expected dataset type
 
-Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`NashMDTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`NashMDTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 ## Usage tips
 
@@ -132,7 +132,7 @@ python examples/scripts/nash_md.py \
 
 ## Logged metrics
 
-The logged metrics are as follows:
+While training and evaluating, we record the following reward metrics:
 
 * `loss/kl`: The mean KL divergence between the model and reference data.
 * `objective/entropy`: The mean entropy of the model and reference data.

docs/source/online_dpo_trainer.md

Lines changed: 2 additions & 2 deletions
@@ -65,7 +65,7 @@ The best programming language depends on your specific needs and priorities. Som
 
 ## Expected dataset type
 
-Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (unlike offline DPO, that expects [preference dataset](dataset_formats#preference)). The [`OnlineDPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (unlike offline DPO, that expects [preference dataset](dataset_formats#preference)). The [`OnlineDPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 ## Usage tips
 
@@ -132,7 +132,7 @@ python examples/scripts/dpo_online.py \
 
 ## Logged metrics
 
-The logged metrics are as follows. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/w4apmsi9)
+While training and evaluating, we record the following reward metrics. Here is an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/w4apmsi9)
 
 * `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current model and reference model.
 * `objective/entropy`: The mean entropy of the model, indicating the randomness of the actions chosen by the model.

docs/source/orpo_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -109,7 +109,7 @@ To scale how much the auxiliary loss contributes to the total loss, use the hype
 
 ## Logged metrics
 
-While training and evaluating we record the following reward metrics:
+While training and evaluating, we record the following reward metrics:
 
 - `rewards/chosen`: the mean log probabilities of the policy model for the chosen responses scaled by beta
 - `rewards/rejected`: the mean log probabilities of the policy model for the rejected responses scaled by beta

docs/source/paper_index.md

Lines changed: 20 additions & 0 deletions
@@ -26,6 +26,26 @@ training_args = GRPOConfig(
 )
 ```
 
+## AlphaPO -- Reward shape matters for LLM alignment
+
+**📜 Paper**: https://huggingface.co/papers/2501.03884
+
+AlphaPO is a new Direct Alignment Algorithms (DAAs) method that leverages an alpha-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. To reproduce the paper's setting, use this configuration:
+
+```python
+from trl import CPOConfig
+
+# Mistral-Instruct from Table 3 of the paper
+training_args = CPOConfig(
+    loss_type="alphapo",
+    alpha=0.25,
+    beta=2.5,
+    simpo_gamma=0.1,
+    learning_rate=7e-7,
+    ...
+)
+```
+
 ## EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
 
 **📜 Paper**: https://huggingface.co/papers/2508.00180
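
Putting the pieces together, here is a minimal end-to-end sketch of how the newly added `loss_type="alphapo"` could be used with [`CPOTrainer`]. This is illustrative rather than part of the commit: the checkpoint, dataset, and output directory are placeholder assumptions, while the loss parameters mirror the paper_index configuration above.

```python
# Hedged sketch: fine-tuning with the AlphaPO loss through CPOTrainer.
# Model, dataset, and output_dir are placeholders, not taken from the commit.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# CPOTrainer expects a preference dataset with prompt/chosen/rejected columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = CPOConfig(
    output_dir="alphapo-model",
    loss_type="alphapo",  # the loss added by this commit
    alpha=0.25,           # reward-shape parameter (Table 3 of the paper)
    beta=2.5,
    simpo_gamma=0.1,
    learning_rate=7e-7,
)

trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```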
