We found that with OpenRLHF + DeepSpeed 0.15.0, SFT with Adam offload can train a 70B model on 8 A100 70G GPUs using ZeRO-3, whereas the same setup with DeepSpeed 0.16.4 results in OOM. You can reproduce the issue with the script https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh, switching to the 70B model and enabling Adam offload.
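For reference, a minimal reproduction sketch along the lines of that script (the model path and dataset are placeholders, and the flag names assume the current OpenRLHF `train_sft` CLI; please verify them against the linked script before running):

```bash
# Repro sketch adapted from examples/scripts/train_sft_llama.sh.
# Model/dataset below are placeholders -- substitute your 70B checkpoint and data.
deepspeed --module openrlhf.cli.train_sft \
   --pretrain meta-llama/Meta-Llama-3-70B \
   --dataset Open-Orca/OpenOrca \
   --max_len 2048 \
   --train_batch_size 128 \
   --micro_train_batch_size 1 \
   --zero_stage 3 \
   --adam_offload \
   --bf16 \
   --max_epochs 1 \
   --learning_rate 5e-6
# Trains with deepspeed==0.15.0; OOMs with deepspeed==0.16.4 on the same hardware.
```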
This looks like a serious regression: DeepSpeed 0.16.4 can no longer train 70B models in this configuration.