Can't train with fp16 on Nvidia P100 #15

Open
54696d21 opened this issue Jun 29, 2021 · 8 comments

@54696d21

54696d21 commented Jun 29, 2021

Training with fp16 doesn't work for me on a P100. I'll look into fixing it, but for future reference, here is the full stack trace.
torch version 1.9.0

2021-06-29 10:29:09.537741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /pytorch/aten/src/ATen/native/SpectralOps.cpp:664.)
  normalized, onesided, return_complex)
Traceback (most recent call last):
  File "train_ms.py", line 294, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/vits/train_ms.py", line 118, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits/train_ms.py", line 192, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf' 
@54696d21
Author

The problem does not occur on torch 1.6.0, the version pinned in requirements.txt.

@Selimonder

Use the nvcr.io/nvidia/pytorch:20.07-py Docker image.

@jesus-villalba

I have the same issue. Any idea where the complex number is generated?
(It works fine on 1.6, but I want to combine this with other code that requires PyTorch 1.9.)

@boltzmann-Li

boltzmann-Li commented Oct 12, 2021

I got the same issue. It's due to a bug in the PyTorch STFT function for half tensors. The workaround is to move the calculation of y_hat_mel in train.py outside the autocast block, and to cast y_hat to float on the line just above the y_hat_mel calculation.
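For reference, a minimal sketch of what that change could look like in train_ms.py. It is only a sketch: the variable names (y_hat, y_hat_mel) and the hps.data fields are taken from the stock VITS training script, and "outside autocast" is written here as a nested autocast(enabled=False) block so the surrounding loss code can stay where it is.

# Sketch only; assumes the stock VITS train_ms.py names and that
# autocast (torch.cuda.amp) and mel_spectrogram_torch (mel_processing.py)
# are already imported, as they are in the original script.
with autocast(enabled=hps.train.fp16_run):
    # ... generator forward pass that produces y_hat ...
    with autocast(enabled=False):
        # Cast y_hat to fp32 before the mel spectrogram so that
        # torch.stft never sees a half tensor.
        y_hat_mel = mel_spectrogram_torch(
            y_hat.squeeze(1).float(),
            hps.data.filter_length,
            hps.data.n_mel_channels,
            hps.data.sampling_rate,
            hps.data.hop_length,
            hps.data.win_length,
            hps.data.mel_fmin,
            hps.data.mel_fmax)
    # ... mel / feature-matching / adversarial losses as before ...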

@mogwai

mogwai commented Oct 25, 2021

@boltzmann-Li Can you create a PR so we can see that fix? I haven't managed to get it working following your instructions.

FYI, the problem hasn't been fixed in torch 1.10.0.

Is there an issue for the ComplexHalf problem?

@boltzmann-Li

> @boltzmann-Li Can you create a PR so we can see that fix? I haven't managed to get it working following your instructions.
>
> FYI, the problem hasn't been fixed in torch 1.10.0.
>
> Is there an issue for the ComplexHalf problem?

I created a pull request. It has been working for me with 3090 GPUs and torch 1.9.

@FarisHijazi

Very helpful, @boltzmann-Li.

Here are the relevant lines: https://github.com/boltzmann-Li/vits/blob/5a1f4b7afb8a822f66c0ddc75bc959a44a57d035/train_ms.py#L156-L166

@candlewill

I think a better way to solve this problem is to wrap the torch.stft call with autocast(enabled=False) inside the mel_spectrogram_torch function. Here is the code:

# mel_spectrogram_torch from mel_processing.py, with the STFT forced to fp32
import torch
from torch.cuda.amp import autocast
from librosa.filters import mel as librosa_mel_fn

# Module-level caches keyed by dtype/device, as in the original mel_processing.py
mel_basis = {}
hann_window = {}


def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
    if torch.min(y) < -1.:
        print('min value is ', torch.min(y))
    if torch.max(y) > 1.:
        print('max value is ', torch.max(y))

    global mel_basis, hann_window
    dtype_device = str(y.dtype) + '_' + str(y.device)
    fmax_dtype_device = str(fmax) + '_' + dtype_device
    wnsize_dtype_device = str(win_size) + '_' + dtype_device
    if fmax_dtype_device not in mel_basis:
        mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)

    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)), mode='reflect')
    y = y.squeeze(1)

    # torch.stft is what fails for half tensors ("fill_cuda" not implemented
    # for 'ComplexHalf'), so disable autocast locally and run it in fp32.
    with autocast(enabled=False):
        y = y.float()
        spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                          window=hann_window[wnsize_dtype_device],
                          center=center, pad_mode='reflect', normalized=False, onesided=True)

    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)

    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
    spec = spectral_normalize_torch(spec)  # defined alongside this function in mel_processing.py

    return spec

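If this route is taken, only mel_processing.py needs to change: the STFT itself runs in fp32 inside the locally disabled autocast region, while the rest of the mel computation and the callers in train_ms.py keep running under the training autocast exactly as before.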