
[BUG] OOM when train 70B models using deepspeed 0.16.4 #7116

@hijkzzz

Description

We found that with OpenRLHF + DeepSpeed 0.15.0, SFT + Adam offload can train a 70B model on 8× A100 80G with ZeRO-3, whereas DeepSpeed 0.16.4 results in OOM. You can reproduce the issue with the script https://github.com/OpenRLHF/OpenRLHF/blob/main/examples/scripts/train_sft_llama.sh, using the 70B model + Adam offload.
This looks like a serious bug: DeepSpeed 0.16.4 cannot train 70B models.
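
For context, the training script drives DeepSpeed with a ZeRO-3 config that offloads the Adam optimizer states to CPU. Below is a minimal sketch of that kind of setup; the batch sizes and other values are assumptions for illustration, not the exact config that OpenRLHF generates.

```python
import deepspeed

# Sketch of a ZeRO-3 + Adam CPU offload config (assumed values, not the exact
# config produced by train_sft_llama.sh).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                    # ZeRO stage 3: partition params, grads, optimizer states
        "offload_optimizer": {         # Adam offload: keep optimizer states in CPU memory
            "device": "cpu",
            "pin_memory": True,
        },
        "overlap_comm": True,
    },
}

# model would be the 70B HF model, sharded across the 8 GPUs by ZeRO-3:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```

With 0.15.0 this configuration fits on the 8 GPUs; with 0.16.4 the same run hits OOM.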
