rank error #15

zhanglina94 · 2023-11-30T06:32:20Z

Hi there,

I have a question about the training of the model.

I encountered the following error in my training.

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 241, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc],
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 359, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[139681], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).

The training processes terminated after 16 epochs of training.
I then started training again, and it stopped again when it reached epoch 40.
Why does this happen?

Best regards.

FENRlR · 2023-12-02T05:49:59Z

I really have no idea how to reproduce this error. It seems, however, that someone has already encountered a similar situation ([PS2]).

zhanglina94 · 2023-12-03T06:43:18Z

Thanks for your reply.

This error occurs when my GPU is occupied by other processes. I haven't encountered it before, and I'm not sure why it happens.

That blog post is mine. When I restart training, it runs again, but after training for a while it hits the same problem again.
