
OOM when doing Stage-1 training #7

Open
liuruijin17 opened this issue Mar 11, 2025 · 1 comment

Comments

@liuruijin17

Thanks for your great work!

I hit OOM on 8x H20 (96G) when running:

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 $(date +'%Y%m%d_%H'0000)
```

But GPU_Megatron.md says it should run. Do I need any special settings?

In a small experiment, I reduced --num-layers to 1 and ran the commands below (4 GPUs, tp4pp1, no other modifications):

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 4096 4096 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 4 iterations) memory (MB) | allocated: 1323.32421875 | max allocated: 4025.59619140625 | reserved: 7058.0 | max reserved: 7058.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 8192 8192 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 2 iterations) memory (MB) | allocated: 1553.5546875 | max allocated: 12110.4306640625 | reserved: 19228.0 | max reserved: 19228.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 16384 16384 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 1 iterations) memory (MB) | allocated: 2215.875 | max allocated: 43931.10498046875 | reserved: 65950.0 | max reserved: 66072.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 $(date +'%Y%m%d_%H'0000)
```

the log shows OOM
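As a rough sanity check (my own back-of-the-envelope, not from the repo), the max-allocated numbers above grow much faster than linearly with sequence length, which already suggests 32768 will not fit without recomputation even before going back to 40 layers:

```bash
# Back-of-the-envelope from the single-layer peaks logged above.
# This ignores tp/sequence-parallel sharding and is only meant to show the trend.
awk 'BEGIN {
  printf "4096  -> 8192 : x%.2f\n", 12110.4 / 4025.6;
  printf "8192  -> 16384: x%.2f\n", 43931.1 / 12110.4;
  # Assuming a similar ~3-3.6x jump for the next doubling:
  printf "16384 -> 32768: ~%.0f GB max allocated for a single layer\n", 43931.1 * 3.5 / 1024;
}'
```

Even for a single layer, the next doubling would blow past a 96G card under this trend.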

So when the number of layers goes back to 40, it OOMs. Is it because activation recomputation is not enabled? But the ViT and Qwen2.5-14B are frozen in stage 1, so I would not expect such a large memory footprint.
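If recomputation is the issue, the standard Megatron-LM flags for full activation recomputation are below. I have not checked whether the stage-1 script already sets them, so this is only a guess at what to try, not the repo's actual configuration (the `GPT_ARGS` variable name is just a placeholder for wherever the script collects its training arguments):

```bash
# Standard Megatron-LM activation-recomputation flags (guess; not verified
# against this repo's stage-1 script). GPT_ARGS is a placeholder variable.
GPT_ARGS="$GPT_ARGS \
    --recompute-granularity full \
    --recompute-method uniform \
    --recompute-num-layers 1"
```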

@shenyunhang
Collaborator

In stage 1, we use tp8pp1 on 8x96G and it works well.
There may be some differences in environment; for example, we use FlashAttention-3.
You can check out our environment configuration in #4.
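A quick way to check which FlashAttention build is present in your environment (assuming FlashAttention-3 was installed from the flash-attention Hopper build, which is typically importable as `flash_attn_interface`, while FlashAttention-2 is importable as `flash_attn`):

```bash
# Environment check (assumption: the FA3 Hopper build exposes flash_attn_interface).
python -c "import flash_attn_interface; print('FlashAttention-3 available')" \
  || echo "FlashAttention-3 not found"
python -c "import flash_attn; print('FlashAttention-2', flash_attn.__version__)" \
  || echo "FlashAttention-2 not found"
```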
