Add AlphaPO Trainer #3776
Conversation
update md and trainer
Update config
Update md
remove nll loss computation
Remove average_log_prob flag
thanks all! I wanted to ask: would it be possible to add this loss to the CPO trainer with an alpha config (defaulting to zero) rather than a dedicated trainer?
@kashif Thanks for the comments, and thank you for raising these points! For the AlphaPO loss (see the paper: https://arxiv.org/pdf/2501.03884), here are the key differences between the CPO and AlphaPO losses:
Combining AlphaPO with CPO risks confusing users; providing a dedicated AlphaPO trainer offers a much clearer interface. Please let me know if you'd like any further clarification!
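For reference, a sketch of the two reward shapes as described in the respective papers (my reading, so treat the exact form as an assumption; $\beta$ is the reward scale and $|y|$ the response length):

$$
r_{\text{CPO}}(x, y) = \beta \log \pi_\theta(y \mid x), \qquad
r_\alpha(x, y) = -\frac{\beta}{\alpha}\left(\hat{p}_\theta(y \mid x)^{-\alpha} - 1\right)
\quad \text{with} \quad \hat{p}_\theta(y \mid x) = \pi_\theta(y \mid x)^{1/|y|},
$$

and $r_\alpha \to \beta \log \hat{p}_\theta(y \mid x)$, the SimPO reward, as $\alpha \to 0$.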
yes, so keeping with the design of TRL, I was able to integrate the AlphaPO reward quite simply: main...kashif:trl:alphaPO-cpo-integration
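A minimal sketch of what that reward computation could look like (hypothetical names, not necessarily what the linked branch does; it assumes the $\alpha$-shaped reward on length-normalized log-probs from the paper):

```python
import torch

def alphapo_reward(avg_logps: torch.Tensor, alpha: float, beta: float = 1.0) -> torch.Tensor:
    """AlphaPO-style reward on length-normalized log-probs (sketch).

    avg_logps is the per-sequence mean token log-prob, i.e. log p_hat(y|x).
    """
    if alpha == 0.0:
        # alpha = 0 recovers the SimPO-style log reward.
        return beta * avg_logps
    # -(p_hat ** (-alpha) - 1) / alpha, rewritten via p_hat ** (-alpha) = exp(-alpha * log p_hat)
    return beta * (1.0 - torch.exp(-alpha * avg_logps)) / alpha
```

The chosen and rejected rewards would then go through the usual log-sigmoid preference term, as in the existing CPO loss.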
we also now have a dedicated "papers index" page where we can add the configurations for AlphaPO via the CPOTrainer class: https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md
we may need to add a mode = ENUM(CPO, AlphaPO). For AlphaPO we need to use the length-normalized logps, so if the user chooses AlphaPO with average_log_prob = False, we need to surface a warning.
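A sketch of how that mode flag and warning could look in the config (hypothetical names such as LossMode and CPOConfigSketch; the real CPOConfig fields may differ):

```python
import warnings
from dataclasses import dataclass
from enum import Enum

class LossMode(str, Enum):
    CPO = "cpo"
    ALPHAPO = "alphapo"

@dataclass
class CPOConfigSketch:
    mode: LossMode = LossMode.CPO
    alpha: float = 0.0
    average_log_prob: bool = False  # True -> length-normalized per-token logps

    def __post_init__(self):
        if self.mode == LossMode.ALPHAPO and not self.average_log_prob:
            warnings.warn(
                "AlphaPO is defined on length-normalized log-probs; "
                "set average_log_prob=True to match the paper."
            )
```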
thanks @lancerts so by default the
closing this for now and moving the PR to #3824


What does this PR do?
This PR adds an AlphaPO trainer (ICML 2025; paper: arXiv:2501.03884). AlphaPO is a new Direct Alignment Algorithm (DAA) that introduces an α-parameter to change the shape of the reward function beyond the standard log reward. When α = 0, the objective reduces to SimPO (arXiv:2405.14734). The implementation ships with:
No breaking changes to existing APIs.
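As a quick sanity check of the α = 0 reduction to SimPO mentioned above, a standalone snippet (assuming the α-shaped reward sketched earlier in the thread):

```python
import torch

avg_logps = torch.tensor([-1.2, -0.4])  # example length-normalized log-probs
for alpha in (1e-1, 1e-2, 1e-4):
    # alpha-shaped reward with beta = 1; converges to avg_logps as alpha -> 0
    r = (1.0 - torch.exp(-alpha * avg_logps)) / alpha
    print(f"alpha={alpha:g}  reward={r.tolist()}")
```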
The PR is co-authored with @lancerts, @srzhu97, and @agupta2304.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.