
OOM when doing Stage-1 training #7

Open
liuruijin17 opened this issue Mar 11, 2025 · 1 comment

Comments

@liuruijin17

Thanks for your great work!

I hit OOM on 8x H20 (96G) when running:

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 $(date +'%Y%m%d_%H'0000)
```

But GPU_Megatron.md says it should run. Do I need any special settings?

In a small experiment, I reduced --num-layers to 1 and ran the commands below (4 GPUs, tp4pp1, no other modifications):

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 4096 4096 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 4 iterations) memory (MB) | allocated: 1323.32421875 | max allocated: 4025.59619140625 | reserved: 7058.0 | max reserved: 7058.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 8192 8192 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 2 iterations) memory (MB) | allocated: 1553.5546875 | max allocated: 12110.4306640625 | reserved: 19228.0 | max reserved: 19228.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 16384 16384 $(date +'%Y%m%d_%H'0000)
```

the log shows `[Rank 1] (after 1 iterations) memory (MB) | allocated: 2215.875 | max allocated: 43931.10498046875 | reserved: 65950.0 | max reserved: 66072.0`

```bash
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 $(date +'%Y%m%d_%H'0000)
```

the log shows OOM
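As a rough sanity check (my own back-of-the-envelope, not from the repo), the max-allocated numbers above grow much faster than linearly with sequence length, which already suggests 32768 will not fit without recomputation even before going back to 40 layers:

```bash
# Back-of-the-envelope from the single-layer peaks logged above.
# This ignores tp/sequence-parallel sharding and is only meant to show the trend.
awk 'BEGIN {
  printf "4096  -> 8192 : x%.2f\n", 12110.4 / 4025.6;
  printf "8192  -> 16384: x%.2f\n", 43931.1 / 12110.4;
  # Assuming a similar ~3-3.6x jump for the next doubling:
  printf "16384 -> 32768: ~%.0f GB max allocated for a single layer\n", 43931.1 * 3.5 / 1024;
}'
```

Even for a single layer, the next doubling would blow past a 96G card under this trend.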

So when the number of layers goes back to 40, it OOMs. Is it because activation recomputation is not enabled? But the ViT and Qwen2.5-14B are frozen in stage 1, so I would not expect such a large memory footprint.
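If recomputation is the issue, the standard Megatron-LM flags for full activation recomputation are below. I have not checked whether the stage-1 script already sets them, so this is only a guess at what to try, not the repo's actual configuration (the `GPT_ARGS` variable name is just a placeholder for wherever the script collects its training arguments):

```bash
# Standard Megatron-LM activation-recomputation flags (guess; not verified
# against this repo's stage-1 script). GPT_ARGS is a placeholder variable.
GPT_ARGS="$GPT_ARGS \
    --recompute-granularity full \
    --recompute-method uniform \
    --recompute-num-layers 1"
```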

@shenyunhang
Collaborator

In stage 1, we use tp8pp1 on 8x96G and it works well.
There may be some differences in environment; for example, we use FlashAttention-3.
You can check out our environment configuration in #4.
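A quick way to check which FlashAttention build is present in your environment (assuming FlashAttention-3 was installed from the flash-attention Hopper build, which is typically importable as `flash_attn_interface`, while FlashAttention-2 is importable as `flash_attn`):

```bash
# Environment check (assumption: the FA3 Hopper build exposes flash_attn_interface).
python -c "import flash_attn_interface; print('FlashAttention-3 available')" \
  || echo "FlashAttention-3 not found"
python -c "import flash_attn; print('FlashAttention-2', flash_attn.__version__)" \
  || echo "FlashAttention-2 not found"
```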
