rank error #15

Open
zhanglina94 opened this issue Nov 30, 2023 · 2 comments

Comments

@zhanglina94

Hi there,

I have a question about the training of the model.

I encountered the following error in my training.

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 241, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc],
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 359, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[139681], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).
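
To make the last line of the error easier to read, here is a minimal sketch that pulls the per-rank operation type and tensor shape out of the message text. The helper name `parse_fingerprints` and the regular expressions are illustrative assumptions, not part of PyTorch or of this repository, and the message is shortened to the fields the sketch inspects. It only shows what the two ranks reported; it does not diagnose the cause.

```python
import re

# Hypothetical helper patterns (assumptions, not a PyTorch API).
RANK_RE = re.compile(r"Rank (\d+) is running collective:")
OP_RE = re.compile(r"OpType=(\w+)")
SHAPE_RE = re.compile(r"TensorShape=\[([\d,\s]*)\]")


def parse_fingerprints(message):
    """Map each rank number to the op type and tensor shape it reported."""
    matches = list(RANK_RE.finditer(message))
    result = {}
    for i, m in enumerate(matches):
        # A rank's segment runs until the next "Rank N is running collective:".
        end = matches[i + 1].start() if i + 1 < len(matches) else len(message)
        segment = message[m.end():end]
        op = OP_RE.search(segment)
        shape = SHAPE_RE.search(segment)
        result[int(m.group(1))] = {
            "op": op.group(1) if op else None,
            # None means the fingerprint carried no TensorShape field at all.
            "shape": [int(x) for x in shape.group(1).split(",") if x.strip()]
            if shape else None,
        }
    return result


# Shortened copy of the RuntimeError text from the traceback above.
MESSAGE = (
    "Detected mismatch between collectives on ranks. "
    "Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, "
    "TensorShape=[139681], TensorDtypes=Float), "
    "but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE)."
)

fingerprints = parse_fingerprints(MESSAGE)
for rank in sorted(fingerprints):
    print(rank, fingerprints[rank])
```

Both ranks report an ALLREDUCE, but only Rank 1's fingerprint includes a tensor shape, which is the difference the runtime flags as a mismatch.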

The first time, training stopped with this error after 16 epochs.
I then started training again, and when it reached epoch 40 it stopped again.
Why does this happen?

Best regards.

@FENRlR
Owner

FENRlR commented Dec 2, 2023

I really have no clue how to reproduce this error. It seems, however, that someone has already run into a similar situation before ([PS2]).

@zhanglina94
Author

Thanks for your reply,

This error occurs when my GPU is occupied by other processes. I haven't encountered it before, and I'm not sure why it occurs.

That blog post is mine. When I restart training, it runs again, but after training for a while it hits the same problem again.
