Description
Describe the bug
I use DeepSpeed ZeRO-2 to train a transformer-based DiT model. However, the script always hangs at a fixed step after roughly one hour of training. When I disable DeepSpeed and train with plain PyTorch DDP instead, the problem disappears. Moreover, even if I change the mixed precision from fp16 to bf16 or adjust the learning rate, the same problem occurs and the script hangs at the same training step.
I have also tried changing the model initialization and resuming training, but the script still hangs after the same number of steps. For example, if the script hangs after 14K training steps, I save a checkpoint at step 10K and resume from that checkpoint; then, after training another 14K steps, the script hangs once more.
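One way to localize a hang like this is to write a per-rank heartbeat each step, so that when the watchdog fires, the last line of each rank's file shows which rank stopped entering the collective. A minimal sketch (the `heartbeat` helper and the loop around it are illustrative, not part of the actual training script):

```python
import os

def heartbeat(log_dir, rank, step):
    """Append the current step to a per-rank file. After a hang, the rank
    whose file stops advancing first is the one that diverged before the
    allreduce (e.g. raised an exception or took a different code path)."""
    path = os.path.join(log_dir, f"rank{rank}.log")
    with open(path, "a") as f:
        f.write(f"step {step}\n")
        f.flush()
        os.fsync(f.fileno())

# Sketch of placement in the training loop (rank from torch.distributed):
# for step, batch in enumerate(loader):
#     heartbeat("/tmp/hb", rank, step)
#     loss = engine(batch)
#     engine.backward(loss)
#     engine.step()
```

Comparing `tail rank*.log` across all 8 ranks right after the timeout would show whether one rank stalled earlier than the others.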
ds_report output
[rank0]:[E217 07:24:07.938443115 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=54053, OpType=ALLREDUCE, NumelIn=497464352, NumelOut=497464352, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
[rank0]:[E217 07:24:07.938551513 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 54053, last enqueued NCCL work: 54054, last completed NCCL work: 54052.
[rank0]:[E217 07:24:08.990085728 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 0] Timeout at NCCL work: 54053, last enqueued NCCL work: 54054, last completed NCCL work: 54052.
[rank0]:[E217 07:24:08.990112551 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E217 07:24:08.990119150 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E217 07:24:08.991954167 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=54053, OpType=ALLREDUCE, NumelIn=497464352, NumelOut=497464352, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
Exception raised from checkTimeout at /opt/tiger/compile_path/src/code.byted.org/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x91 (0x7f6aa3f5fdd1 in /usr/local/lib/python3.11/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1022831 (0x7f6aa4fee831 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22a (0x7f6aa500d27a in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x22e (0x7f6aa500d92e in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x143 (0x7f6aa500f5f3 in /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xd44a3 (0x7f6a96cba4a3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: + 0x89144 (0x7f6ae9427144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #7: + 0x1097dc (0x7f6ae94a77dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
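The `Timeout(ms)=600000` in the trace above is the default 10-minute NCCL watchdog window. While debugging, raising it can help distinguish a transient slowdown from a true deadlock. A minimal sketch (the 2-hour value is an arbitrary choice for illustration):

```python
from datetime import timedelta

# Raising the watchdog window only buys observation time; if the run is
# truly deadlocked, the allreduce will still never complete.
debug_timeout = timedelta(hours=2)

# With DeepSpeed's setup helper:
#   deepspeed.init_distributed(dist_backend="nccl", timeout=debug_timeout)
# Or with plain torch:
#   torch.distributed.init_process_group("nccl", timeout=debug_timeout)

assert debug_timeout.total_seconds() * 1000 > 600000  # exceeds the default
```

Setting `NCCL_DEBUG=INFO` in the environment alongside this would also log which communicator the stuck allreduce belongs to.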
System info (please complete the following information):
- Python version: 3.11.2
- PyTorch version: 2.5.1+cu124
- DeepSpeed version: 0.16.3
Launcher context
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=8 --master_port=12345 train_deepspeed.py --xxx
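Note that `torch.distributed.launch` is deprecated in PyTorch 2.x; an equivalent launch with `torchrun` (same node/process counts and port, `--xxx` standing in for the script's own flags as above) would be:

```shell
torchrun --nnodes=1 --nproc_per_node=8 --master_port=12345 train_deepspeed.py --xxx
```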