DeepSpeed ZeRO-2 produces negative KL divergence #506
Comments
I was not able to reproduce the same issue; probably the related bugs were already fixed in the latest main. The report is here: https://wandb.ai/costa-huang/trl/reports/deepspeed-test--Vmlldzo1MTc3NDcw and here is one of the runs: https://wandb.ai/costa-huang/trl/runs/viz5drqj/logs?workspace=user-costa-huang, whose logs seem to indicate DeepSpeed is running as expected.
Thanks a lot for diving into this! I'm still getting a negative KL even after bumping to the latest main.
Ah, if you look closely at the KL divergence of your run (https://wandb.ai/costa-huang/trl/runs/viz5drqj?workspace=user-lewtun), one sees that it is indeed still slightly negative. Since step 0 should be a direct match between the reference & active models, it would make sense to see if we can understand why DeepSpeed is causing this difference. One possibility is that DeepSpeed is putting the active model in train mode (e.g. with dropout) while the reference model is in eval mode (no dropout).
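A quick way to test that hypothesis is to inspect the `training` flag on both models before the first PPO step. A minimal sketch, assuming the models are reachable as `ppo_trainer.model` and `ppo_trainer.ref_model` (trl's `PPOTrainer` attributes; adapt to however your script builds the models):

```python
import torch

def check_eval_mode(module, name):
    # `.training` is True when train-mode behaviour (e.g. dropout) is enabled.
    if module.training:
        print(f"{name} is in train mode; dropout will perturb its log probs")
    for sub in module.modules():
        if isinstance(sub, torch.nn.Dropout) and sub.training:
            print(f"{name} has an active dropout layer: {sub}")

check_eval_mode(ppo_trainer.model, "active model")
check_eval_mode(ppo_trainer.ref_model, "reference model")

# If either reports train mode, forcing eval mode removes the dropout noise:
ppo_trainer.model.eval()
ppo_trainer.ref_model.eval()
```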
https://wandb.ai/costa-huang/trl/runs/viz5drqj/files/requirements.txt has all the dependencies. I used your script.
Interesting. Thanks for bringing up this point. I will look into it!
An explanation could be that maybe the model is not in eval mode? In that case you could have a little bit of noise even if the models are identical.
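To see how much noise train mode alone can introduce, here is a small self-contained demonstration in plain PyTorch (a toy model, not trl): scoring the same tokens with identical weights, once in eval mode and once with dropout active, already yields a nonzero (and possibly negative) sampled KL estimate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A toy "language model": embedding -> dropout -> linear head.
model = nn.Sequential(nn.Embedding(100, 32), nn.Dropout(p=0.1), nn.Linear(32, 100))
tokens = torch.randint(0, 100, (1, 16))

def token_logprobs(net):
    logits = net(tokens)                       # (1, 16, 100)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (1, 16)

model.eval()
ref_logprobs = token_logprobs(model)           # deterministic reference pass

model.train()                                  # dropout now active
active_logprobs = token_logprobs(model)        # same weights, noisy pass

# trl-style per-token KL estimate: active log probs minus reference log probs.
kl_est = (active_logprobs - ref_logprobs).mean()
print(f"sampled KL estimate: {kl_est.item():+.4f}")  # nonzero despite identical weights
```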
Closed by #758 (the root cause of the issue was using
Hello, while testing out the DeepSpeed ZeRO-2 plugin in the sentiment example for `gpt2`, I noticed that the KL divergence starts out negative. This suggests the model parameters of the reference and active model are being sharded in a peculiar manner that produces a mismatch in the log probs. Below is a screenshot from WandB which shows the pure DDP baseline in teal vs the ZeRO-2 curve in purple:
Code to reproduce
I ran this on 2 x A100 (80GB) machines, but that's overkill for this example :)
Accelerate config
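(The config itself was not captured in this thread; a minimal Accelerate config for ZeRO-2 on two GPUs would look roughly like the sketch below, where every value is an assumption rather than the exact file used.)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```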
Script
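(The script body was also not captured; it was trl's gpt2 sentiment example, which at the time looked roughly like the condensed sketch below. Dataset handling and hyperparameters are abbreviated assumptions, not the exact script.)

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from trl.core import LengthSampler

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, log_with="wandb")
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Build short IMDB prompts to use as queries.
def build_dataset(min_len=2, max_len=8):
    ds = load_dataset("imdb", split="train")
    ds = ds.filter(lambda x: len(x["text"]) > 200)
    input_size = LengthSampler(min_len, max_len)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["text"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize)
    ds.set_format(type="torch")
    return ds

dataset = build_dataset()
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset,
    data_collator=lambda data: {k: [d[k] for d in data] for k in data[0]},
)

# Reward model: a sentiment classifier scored on query + response.
sentiment_pipe = pipeline(
    "sentiment-analysis", model="lvwerra/distilbert-imdb",
    device=ppo_trainer.accelerator.device,
)

gen_kwargs = {"min_length": -1, "top_k": 0.0, "top_p": 1.0, "do_sample": True,
              "pad_token_id": tokenizer.eos_token_id}
output_size = LengthSampler(4, 16)

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    response_tensors = []
    for query in query_tensors:
        gen_len = output_size()
        response = ppo_trainer.generate(query, max_new_tokens=gen_len, **gen_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, return_all_scores=True)
    rewards = [torch.tensor(out[1]["score"]) for out in pipe_outputs]  # POSITIVE score

    # PPO step: this is where the (negative) KL shows up in the logs.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```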
Run with
accelerate launch --config_file config.yaml gpt2_sentiment.py --log_with="wandb"
Env