I hit an OOM on 8xH20 (96G) when running:
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 `date +'%Y%m%d_%H'0000`
But GPU_Megatron.md says this configuration can run. Do I need any special settings?
In small experiments, I reduced --num-layers to 1 and ran the commands below (4 GPUs, tp4pp1, no other modifications):
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 4096 4096 `date +'%Y%m%d_%H'0000`
The log shows: [Rank 1] (after 4 iterations) memory (MB) | allocated: 1323.32421875 | max allocated: 4025.59619140625 | reserved: 7058.0 | max reserved: 7058.0
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 8192 8192 `date +'%Y%m%d_%H'0000`
The log shows: [Rank 1] (after 2 iterations) memory (MB) | allocated: 1553.5546875 | max allocated: 12110.4306640625 | reserved: 19228.0 | max reserved: 19228.0
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 16384 16384 `date +'%Y%m%d_%H'0000`
The log shows: [Rank 1] (after 1 iterations) memory (MB) | allocated: 2215.875 | max allocated: 43931.10498046875 | reserved: 65950.0 | max reserved: 66072.0
bash scripts/megatron/qwen25/finetune_qwen25_14b_intern_300m_ptd_tp8pp1_stage1.sh 32768 32768 `date +'%Y%m%d_%H'0000`
The log shows OOM.
So once the number of layers goes back to 40, it will certainly run out of memory. Is this because recomputation is not enabled? Even so, the ViT and Qwen2.5-14B are frozen in stage 1, so I would not expect such a large memory footprint.
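To make the growth in the numbers above concrete, here is a rough sketch that fits a quadratic through the three reported "max allocated" values and extrapolates to a sequence length of 32768. The quadratic model is only an assumption (it is what one would expect if an S x S attention score matrix were being materialized), and the helper names `fit_quadratic` and `predict_mb` are made up for this illustration; none of this comes from the repository.

```python
# Peak "max allocated" values (MB) copied from the Rank 1 log lines above,
# keyed by the sequence length passed to the script (1 layer, 4 GPUs, tp4pp1).
peak_mb = {
    4096: 4025.59619140625,
    8192: 12110.4306640625,
    16384: 43931.10498046875,
}


def fit_quadratic(points):
    """Return (a, b, c) so that m = a + b*s + c*s**2 passes through three points.

    Uses Newton divided differences; this is an exact interpolation,
    not a statistical fit, so treat any extrapolation as a rough guess.
    """
    (s0, m0), (s1, m1), (s2, m2) = sorted(points.items())
    d01 = (m1 - m0) / (s1 - s0)      # slope between the first two points
    d12 = (m2 - m1) / (s2 - s1)      # slope between the last two points
    c = (d12 - d01) / (s2 - s0)      # second divided difference
    b = d01 - c * (s0 + s1)
    a = m0 - b * s0 - c * s0 ** 2
    return a, b, c


def predict_mb(coeffs, seq_len):
    """Evaluate the interpolated quadratic at seq_len."""
    a, b, c = coeffs
    return a + b * seq_len + c * seq_len ** 2


# Ratio of peak memory each time the sequence length doubles.
growth = [
    peak_mb[8192] / peak_mb[4096],
    peak_mb[16384] / peak_mb[8192],
]

coeffs = fit_quadratic(peak_mb)
estimate_mb = predict_mb(coeffs, 32768)
budget_mb = 96 * 1024  # assumed per-GPU budget for a 96G card

print("growth per doubling:", [round(g, 2) for g in growth])
print("extrapolated peak at 32768 (MB):", round(estimate_mb))
print("exceeds 96G budget:", estimate_mb > budget_mb)
```

Peak memory grows by more than 3x per doubling of the sequence length, which is faster than linear. Under this (assumed) quadratic model, the extrapolated peak at 32768 lands well above 96G even with a single layer, which would be consistent with the OOM reported above and points toward the attention path rather than the frozen weights.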
In stage 1, we use tp8pp1 on 8x96G and it works well.
There may be some differences in environment; for example, we use FlashAttention-3.
You can check out our environment configuration in #4.
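As a quick way to compare environments, the following sketch reports which attention-related Python modules can be imported in the current environment. The module names in `candidates` are assumptions about how these packages are usually exposed (for example, `flash_attn_interface` for a FlashAttention-3 Hopper build); they are not taken from the repository, so adjust them to match the configuration in #4.

```python
import importlib.util


def installed(module_names):
    """Map each top-level module name to True if it can be found for import."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Assumed module names; change these to whatever your setup actually uses.
candidates = ["flash_attn", "flash_attn_interface", "transformer_engine"]

for name, found in installed(candidates).items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the FlashAttention modules are missing, the training code may be falling back to a different attention implementation, which could explain a memory difference between the two setups.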
Thanks for your great work!