
Conversation

@qingquansong commented Jul 26, 2025

What does this PR do?

This PR adds an AlphaPO (ICML 2025; paper: arXiv:2501.03884) trainer. AlphaPO is a new Direct Alignment Algorithm (DAA) that introduces an α-parameter to change the shape of the reward function beyond the standard log reward. When α = 0, the objective reduces to SimPO (arXiv:2405.14734). The implementation ships with:

  • AlphaPOTrainer and AlphaPOConfig
  • Mixed-precision, FSDP, and LoRA/QLoRA support (mirroring existing TRL trainers)
  • Dataset adapters for common “prompt / chosen / rejected” preference formats
  • Unit tests, docstrings, and an end-to-end example script

No breaking changes to existing APIs.
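
For reviewers, below is a minimal usage sketch of the proposed API. The class names AlphaPOTrainer and AlphaPOConfig come from this PR, but the specific argument names (e.g. alpha, beta) and the model/dataset used here are illustrative assumptions, not the final signature; see the diff and the example script for the actual interface.

```python
# Minimal usage sketch (argument names are assumptions; see the PR diff for the final API).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import AlphaPOConfig, AlphaPOTrainer  # added by this PR

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Any "prompt / chosen / rejected" preference dataset should work via the adapters.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = AlphaPOConfig(
    output_dir="alphapo-qwen2-0.5b",
    alpha=0.25,  # assumed name: reward-shape parameter; alpha = 0 recovers SimPO
    beta=2.0,    # assumed name: scale of the length-normalized reward
)

trainer = AlphaPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```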

This PR is co-authored with @lancerts, @srzhu97, and @agupta2304.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qingquansong (Author)

cc @kashif @lewtun

@kashif (Collaborator) commented Jul 26, 2025

Thanks all! I wanted to ask: would it be possible to add this loss to the CPO trainer with an alpha config (defaulting to zero) rather than a dedicated trainer?

@lancerts commented Jul 26, 2025

cpo trainer

@kashif Thanks for the comments.
For the CPO loss, from paper https://arxiv.org/pdf/2401.08417
(screenshot of the CPO preference loss equation from the paper)

For the AlphaPO loss, from paper https://arxiv.org/pdf/2501.03884
(screenshot of the AlphaPO loss equation from the paper)

Thank you for raising these points! Here are the key differences between the CPO and AlphaPO losses:

  1. Shift term (γ)

    • Original CPO: does not include a shift (γ) term.
    • HF implementation: cpo_trainer.py adds the SimPO-style shift γ, but its effect is not made clear in the CPO paper.
    • In AlphaPO, γ has an explicit influence on response length, with an ablation in the paper.
  2. Policy term ($$\pi_{\theta}$$ vs. length-normalized $$\pi_{\theta}$$)

    • CPO: computes the loss directly w.r.t. the raw policy $$\pi_{\theta}$$.
    • AlphaPO: operates on a length-normalized policy $$\pi_{\theta}(y|x)^{1/|y|}$$ (see the sketch after this list).
  3. Inclusion of α

    • CPO: does not include an explicit α coefficient.
    • AlphaPO: incorporates a tunable α parameter that reshapes the reward (α = 0 recovers SimPO).
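
To make points 2 and 3 concrete: SimPO's implicit reward is the β-scaled, length-normalized log-probability,

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_{\theta}(y \mid x),$$

and AlphaPO reshapes this quantity with α. One way to write such a reshaping (an illustrative form; the exact parameterization is in arXiv:2501.03884) is

$$r_{\alpha}(x, y) = -\frac{\beta}{\alpha} \left( \pi_{\theta}(y \mid x)^{-\alpha/|y|} - 1 \right),$$

which recovers the SimPO reward in the limit α → 0.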

Combining AlphaPO with CPO risks confusing users; providing a dedicated AlphaPO trainer offers a much clearer interface.

Please let me know if you’d like any further clarification!

@kashif (Collaborator) commented Jul 26, 2025

Yes, in keeping with the design of TRL, I was able to integrate the AlphaPO reward quite simply: main...kashif:trl:alphaPO-cpo-integration
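
For readers following along, here is a self-contained sketch of what an α-shaped pairwise loss on top of CPO/SimPO-style rewards could look like. The function name, signature, and defaults are illustrative assumptions and do not necessarily match the linked branch; the inputs are assumed to already be length-normalized (averaged) log-probabilities, as discussed later in this thread.

```python
# Illustrative sketch only: an alpha-shaped CPO/SimPO-style pairwise loss.
# Names, signature, and defaults are assumptions, not the code in the linked branch.
import torch
import torch.nn.functional as F


def alphapo_pairwise_loss(
    chosen_logps: torch.Tensor,    # length-normalized log p(y_w | x)
    rejected_logps: torch.Tensor,  # length-normalized log p(y_l | x)
    beta: float = 2.0,
    alpha: float = 0.0,            # alpha = 0 recovers the SimPO reward
    gamma: float = 0.0,            # SimPO-style target reward margin
) -> torch.Tensor:
    if alpha == 0.0:
        # alpha -> 0 limit: reward is the beta-scaled normalized log-prob (SimPO).
        chosen_rewards = beta * chosen_logps
        rejected_rewards = beta * rejected_logps
    else:
        # Alpha-shaped reward: -(beta/alpha) * (pi^(-alpha/|y|) - 1),
        # written here in terms of the normalized log-probs.
        chosen_rewards = -(beta / alpha) * (torch.exp(-alpha * chosen_logps) - 1)
        rejected_rewards = -(beta / alpha) * (torch.exp(-alpha * rejected_logps) - 1)

    # Bradley-Terry-style pairwise objective with a margin, as in SimPO/CPO.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```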

@kashif (Collaborator) commented Jul 28, 2025

We also now have a dedicated "paper index" page where we can add the configurations for AlphaPO via the CPOTrainer class: https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md

@lancerts commented Jul 29, 2025

Yes, in keeping with the design of TRL, I was able to integrate the AlphaPO reward quite simply: main...kashif:trl:alphaPO-cpo-integration

We may need to add a mode = ENUM(CPO, AlphaPO). For AlphaPO, we need to use the length-normalized logps, so average_log_prob = True (instead of False).

If the user chooses AlphaPO with average_log_prob = False, we need to surface a warning (sketched below).
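
A possible shape for that guard (a sketch only; the "alphapo" loss type value and the argument names are hypothetical):

```python
# Sketch of the suggested safeguard; names are assumptions, not the final API.
import warnings


def check_alphapo_settings(loss_type: str, average_log_prob: bool) -> None:
    if loss_type == "alphapo" and not average_log_prob:
        warnings.warn(
            "AlphaPO is defined on length-normalized log-probabilities; "
            "average_log_prob=False will not match the paper's objective. "
            "Consider setting average_log_prob=True.",
            UserWarning,
        )
```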

@kashif (Collaborator) commented Jul 31, 2025

Thanks @lancerts. By default average_log_prob = self.loss_type in ["ipo", "simpo"], so that will work with the AlphaPO settings out of the box.
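
And if AlphaPO were ever exposed as its own loss_type, the same default could simply be extended, e.g. (sketch; "alphapo" as a loss type is hypothetical here):

```python
# Sketch: extend the existing default so a hypothetical "alphapo" loss type
# also uses length-normalized log-probabilities.
average_log_prob = self.loss_type in ["ipo", "simpo", "alphapo"]
```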

@kashif (Collaborator) commented Aug 1, 2025

Closing this for now and moving the PR to #3824.

@kashif closed this Aug 1, 2025