Fix reported KL in PPO trainer #1180

mgerstgrasser · 2024-01-05T05:58:40Z

The PPO trainer currently computes the KL that is reported in the stats and to wandb separately from the KL used in training, and always uses the estimated KL for the reported value. That's fine when the estimated KL is also used for training (kl_penalty = 'kl'), but leads to a mismatch when using kl_penalty = 'abs', kl_penalty = 'mse', or kl_penalty = 'full'. It can also trigger the negative KL warning even in cases where the KL used in training is not, and cannot be, negative.
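To make the mismatch concrete, here is a minimal sketch of the per-token penalty for the three modes that work from sampled-token log-probs, written in plain Python rather than the trainer's tensor code. The helper name `kl_penalty_per_token` and the sample log-probabilities are made up for illustration; `'full'` is left out because it needs the full vocabulary distribution rather than a single log-prob per token.

```python
def kl_penalty_per_token(logprob, ref_logprob, mode):
    """Hypothetical helper: per-token penalty from sampled-token log-probs."""
    diff = logprob - ref_logprob
    if mode == "kl":
        # Sample-based estimate of the KL; individual terms (and their sum)
        # can be negative.
        return diff
    if mode == "abs":
        # Absolute difference; never negative.
        return abs(diff)
    if mode == "mse":
        # Half squared difference; never negative.
        return 0.5 * diff ** 2
    raise ValueError(f"unsupported mode in this sketch: {mode}")


# Illustrative log-probs for one three-token response (made-up numbers).
logprobs = [-1.2, -0.7, -2.5]
ref_logprobs = [-1.0, -0.9, -2.0]

estimated_kl = sum(
    kl_penalty_per_token(lp, ref, "kl") for lp, ref in zip(logprobs, ref_logprobs)
)
abs_penalty = sum(
    kl_penalty_per_token(lp, ref, "abs") for lp, ref in zip(logprobs, ref_logprobs)
)

print(f"estimated KL (what was always reported): {estimated_kl:.2f}")
print(f"'abs' penalty (what training used):      {abs_penalty:.2f}")
```

With these numbers the estimated KL sums to a negative value while the `'abs'` penalty is positive, so reporting the former while training on the latter would raise the negative KL warning for a quantity the optimizer never saw.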

This PR changes this so that the KL calculated during training is kept and used for stats and reporting.

Tested using the PPO example script in both single-GPU and DeepSpeed stage 2 training, leading to identical results in the kl_penalty = 'kl' case. Also verified that kl_penalty = 'abs' no longer triggers the negative KL warning.

Closes #1161.

Previously this was always reporting the estimated KL, even when using `kl_penalty = 'full'` (or `abs`, etc). Now we return the actual KL calculated in `compute_rewards()`, and report that.
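As a rough illustration of the idea (not the code in this PR), the sketch below has a simplified, list-based reward function hand back the per-token KL it computed, so the reported statistic is derived from the same values used in training. The names `compute_rewards_sketch` and `mean_kl`, the `penalty_fn` parameter, and the numbers are assumptions for the example; masking and tensor handling are omitted.

```python
def compute_rewards_sketch(scores, logprobs, ref_logprobs, kl_coef, penalty_fn):
    """Hypothetical, simplified reward computation for a batch of responses.

    Returns the per-token rewards together with the per-token KL penalties,
    so the caller can report exactly the KL that shaped the rewards.
    """
    rewards, kls = [], []
    for score, lps, ref_lps in zip(scores, logprobs, ref_logprobs):
        kl = [penalty_fn(lp, ref) for lp, ref in zip(lps, ref_lps)]
        reward = [-kl_coef * k for k in kl]  # KL penalty on every token
        reward[-1] += score                  # task score on the last token
        rewards.append(reward)
        kls.append(kl)
    return rewards, kls


def mean_kl(kls):
    """Mean over the batch of the per-sequence summed KL."""
    return sum(sum(kl) for kl in kls) / len(kls)


# One three-token response, using an 'abs'-style penalty (made-up numbers).
rewards, kls = compute_rewards_sketch(
    scores=[1.0],
    logprobs=[[-1.2, -0.7, -2.5]],
    ref_logprobs=[[-1.0, -0.9, -2.0]],
    kl_coef=0.1,
    penalty_fn=lambda lp, ref: abs(lp - ref),
)
reported_kl = mean_kl(kls)
print(f"reported KL: {reported_kl:.2f}")
```

Because the reported value is built from the returned `kls`, it cannot go negative when the chosen penalty cannot.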

younesbelkada · 2024-01-08T04:48:48Z

cc @vwxyzjn @lvwerra would you be able to have a look here whenever possible 🙏

HuggingFaceDocBuilderDev · 2024-01-08T04:50:46Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vwxyzjn · 2024-01-08T14:31:01Z

I think this is good. It's better to be more consistent. Any objections @lvwerra?

younesbelkada · 2024-01-09T05:48:19Z

thanks @vwxyzjn @lvwerra ! will merge then!

Fix reported KL in PPO trainer

d5a4417

fix test

be7fd72

lvwerra approved these changes Jan 8, 2024

younesbelkada merged commit a236c57 into huggingface:main Jan 9, 2024
9 checks passed
