Hi there,
I have a question about training the model. I encountered the following error during training:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 241, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc],
  File "/workspace/tts/MB-iSTFT-VITS2/train.py", line 359, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/vits/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Detected mismatch between collectives on ranks. Rank 1 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[139681], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE).
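If I understand the message correctly, it means the two ranks issued different collective operations at the same point in the backward pass (as far as I know, PyTorch only performs this fingerprint check when TORCH_DISTRIBUTED_DEBUG=DETAIL is in effect). For reference, here is a minimal standalone sketch, hypothetical and not taken from this repo, of the kind of rank-dependent divergence that produces the same class of error:

```python
# Minimal sketch (hypothetical, not from MB-iSTFT-VITS2) of a rank-dependent
# divergence that triggers a CollectiveFingerPrint mismatch: the two ranks
# call all_reduce on tensors of different shapes.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Rank-dependent tensor size: rank 0 and rank 1 now disagree about the
    # shape of the tensor being all-reduced, so their collective fingerprints
    # no longer match.
    t = torch.ones(4 if rank == 0 else 8)
    dist.all_reduce(t)

    dist.destroy_process_group()


if __name__ == "__main__":
    # DETAIL mode wraps the process group so that mismatched collectives are
    # reported as "Detected mismatch between collectives on ranks" instead of
    # hanging silently. The spawned workers inherit this environment variable.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    mp.spawn(worker, args=(2,), nprocs=2)
```

I cannot see where a divergence like this would come from in train.py, which is why I am asking.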
I noticed that the training processes disappeared after 16 epochs of training. Then I tried training again, and when it reached epoch 40, it stopped again.
Why does this happen?
Best regards.