
Conversation

@qingquansong commented Jul 26, 2025

What does this PR do?

This PR adds an AlphaPO (ICML 2025; paper: arXiv:2501.03884) trainer. AlphaPO is a new Direct Alignment Algorithm (DAA) that introduces an α-parameter to change the shape of the reward function beyond the standard log reward. When α = 0, the objective reduces to SimPO (arXiv:2405.14734). The implementation ships with:

  • AlphaPOTrainer and AlphaPOConfig
  • Mixed-precision, FSDP, and LoRA/QLoRA support (mirroring existing TRL trainers)
  • Dataset adapters for common “prompt / chosen / rejected” preference formats
  • Unit tests, docstrings, and an end-to-end example script

No breaking changes to existing APIs.
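
For reviewers, below is a minimal usage sketch of the proposed API. The class names AlphaPOTrainer and AlphaPOConfig come from this PR, but the specific argument names (e.g. alpha, beta) and the model/dataset used here are illustrative assumptions, not the final signature; see the diff and the example script for the actual interface.

```python
# Minimal usage sketch (argument names are assumptions; see the PR diff for the final API).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import AlphaPOConfig, AlphaPOTrainer  # added by this PR

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Any "prompt / chosen / rejected" preference dataset should work via the adapters.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = AlphaPOConfig(
    output_dir="alphapo-qwen2-0.5b",
    alpha=0.25,  # assumed name: reward-shape parameter; alpha = 0 recovers SimPO
    beta=2.0,    # assumed name: scale of the length-normalized reward
)

trainer = AlphaPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```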

This PR is co-authored with @lancerts, @srzhu97, and @agupta2304.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qingquansong (Author)

cc @kashif @lewtun

@kashif (Collaborator) commented Jul 26, 2025

Thanks all! I wanted to ask: would it be possible to add this loss to the CPO trainer with an alpha config (defaulting to zero) rather than a dedicated trainer?

@lancerts commented Jul 26, 2025

cpo trainer

@kashif Thanks for the comments.
For the CPO loss, from paper https://arxiv.org/pdf/2401.08417
(screenshot of the CPO preference loss equation from the paper)

For the AlphaPO loss, from paper https://arxiv.org/pdf/2501.03884
(screenshot of the AlphaPO loss equation from the paper)

Thank you for raising these points! Here are the key differences between the CPO and AlphaPO losses:

  1. Shift term (γ)

    • Original CPO: does not include a shift (γ) term.
    • HF implementation: cpo_trainer.py adds the SimPO-style shift γ, but its effect is not made clear in the CPO paper.
    • In AlphaPO, γ has an explicit influence on response length, with an ablation in the paper.
  2. Policy term ($$\pi_{\theta}$$ vs. length-normalized $$\pi_{\theta}$$)

    • CPO: computes the loss directly w.r.t. the raw policy $$\pi_{\theta}$$.
    • AlphaPO: operates on a length-normalized policy $$\pi_{\theta}(y|x)^{1/|y|}$$ (see the sketch after this list).
  3. Inclusion of α

    • CPO: does not include an explicit α coefficient.
    • AlphaPO: incorporates a tunable α parameter that reshapes the reward (α = 0 recovers SimPO).
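
To make points 2 and 3 concrete: SimPO's implicit reward is the β-scaled, length-normalized log-probability,

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_{\theta}(y \mid x),$$

and AlphaPO reshapes this quantity with α. One way to write such a reshaping (an illustrative form; the exact parameterization is in arXiv:2501.03884) is

$$r_{\alpha}(x, y) = -\frac{\beta}{\alpha} \left( \pi_{\theta}(y \mid x)^{-\alpha/|y|} - 1 \right),$$

which recovers the SimPO reward in the limit α → 0.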

Combining AlphaPO with CPO risks confusing users; providing a dedicated AlphaPO trainer offers a much clearer interface.

Please let me know if you’d like any further clarification!

@kashif (Collaborator) commented Jul 26, 2025

Yes, in keeping with the design of TRL, I was able to integrate the AlphaPO reward quite simply: main...kashif:trl:alphaPO-cpo-integration
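
For readers following along, here is a self-contained sketch of what an α-shaped pairwise loss on top of CPO/SimPO-style rewards could look like. The function name, signature, and defaults are illustrative assumptions and do not necessarily match the linked branch; the inputs are assumed to already be length-normalized (averaged) log-probabilities, as discussed later in this thread.

```python
# Illustrative sketch only: an alpha-shaped CPO/SimPO-style pairwise loss.
# Names, signature, and defaults are assumptions, not the code in the linked branch.
import torch
import torch.nn.functional as F


def alphapo_pairwise_loss(
    chosen_logps: torch.Tensor,    # length-normalized log p(y_w | x)
    rejected_logps: torch.Tensor,  # length-normalized log p(y_l | x)
    beta: float = 2.0,
    alpha: float = 0.0,            # alpha = 0 recovers the SimPO reward
    gamma: float = 0.0,            # SimPO-style target reward margin
) -> torch.Tensor:
    if alpha == 0.0:
        # alpha -> 0 limit: reward is the beta-scaled normalized log-prob (SimPO).
        chosen_rewards = beta * chosen_logps
        rejected_rewards = beta * rejected_logps
    else:
        # Alpha-shaped reward: -(beta/alpha) * (pi^(-alpha/|y|) - 1),
        # written here in terms of the normalized log-probs.
        chosen_rewards = -(beta / alpha) * (torch.exp(-alpha * chosen_logps) - 1)
        rejected_rewards = -(beta / alpha) * (torch.exp(-alpha * rejected_logps) - 1)

    # Bradley-Terry-style pairwise objective with a margin, as in SimPO/CPO.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```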

@kashif (Collaborator) commented Jul 28, 2025

We also now have a dedicated "paper index" page where we can add the configurations for AlphaPO via the CPOTrainer class: https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md

@lancerts commented Jul 29, 2025

Yes, in keeping with the design of TRL, I was able to integrate the AlphaPO reward quite simply: main...kashif:trl:alphaPO-cpo-integration

We may need to add a mode = ENUM(CPO, AlphaPO). For AlphaPO, we need to use the length-normalized logps, so average_log_prob = True (instead of False).

If the user chooses AlphaPO with average_log_prob = False, we need to surface a warning (sketched below).
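
A possible shape for that guard (a sketch only; the "alphapo" loss type value and the argument names are hypothetical):

```python
# Sketch of the suggested safeguard; names are assumptions, not the final API.
import warnings


def check_alphapo_settings(loss_type: str, average_log_prob: bool) -> None:
    if loss_type == "alphapo" and not average_log_prob:
        warnings.warn(
            "AlphaPO is defined on length-normalized log-probabilities; "
            "average_log_prob=False will not match the paper's objective. "
            "Consider setting average_log_prob=True.",
            UserWarning,
        )
```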

@kashif (Collaborator) commented Jul 31, 2025

Thanks @lancerts. By default average_log_prob = self.loss_type in ["ipo", "simpo"], so that will work with the AlphaPO settings out of the box.
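
And if AlphaPO were ever exposed as its own loss_type, the same default could simply be extended, e.g. (sketch; "alphapo" as a loss type is hypothetical here):

```python
# Sketch: extend the existing default so a hypothetical "alphapo" loss type
# also uses length-normalized log-probabilities.
average_log_prob = self.loss_type in ["ipo", "simpo", "alphapo"]
```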

@kashif (Collaborator) commented Aug 1, 2025

Closing this for now and moving the PR to #3824.

@kashif closed this Aug 1, 2025