
Cannot assign 'torch.cuda.LongTensor' as parameter 'step' (torch.nn.Parameter or None expected) #489

Closed
shoegazerstella opened this issue Aug 13, 2020 · 6 comments

Comments

@shoegazerstella

Hi,
I am trying to re-train the synthesizer model as discussed in #449 (comment), but I get the error below:

Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+
 
Traceback (most recent call last):
  File "synthesizer_train.py", line 33, in <module>
    train(**vars(args))
  File "/root/voicecloning/synthesizer/train.py", line 168, in train
    m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m, e)
  File "/root/voicecloning/synthesizer/utils/__init__.py", line 17, in data_parallel_workaround
    outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/voicecloning/synthesizer/models/tacotron.py", line 348, in forward
    self.step += 1
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 558, in __setattr__
    .format(torch.typename(value), name))
TypeError: cannot assign 'torch.cuda.LongTensor' as parameter 'step' (torch.nn.Parameter or None expected)
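The check that fires here can be reproduced with a minimal sketch (a toy module, not the project's Tacotron). Once a name is registered as an `nn.Parameter`, `Module.__setattr__` refuses to rebind it to a plain tensor. On a single GPU the in-place `self.step += 1` happens to return the same Parameter object, so it passes; under `DataParallel`, the replicas hold plain broadcast tensors for `step`, and the rebinding triggers exactly this `TypeError`:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Same registration style as the synthesizer: a non-trainable Parameter
        self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)

    def forward(self, x):
        # Explicit rebinding produces a plain Tensor, which Module.__setattr__
        # rejects for a name registered as a Parameter -- the same check that
        # fails for the DataParallel replicas above.
        self.step = self.step + 1
        return x

model = Toy()
try:
    model(torch.ones(1))
except TypeError as e:
    print(e)  # cannot assign 'torch.LongTensor' as parameter 'step' ...
```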

ghost commented Aug 13, 2020

@shoegazerstella I was not able to test parallel GPU training during development since I don't have that kind of hardware. You can add this code to the top of synthesizer_train.py to make it only run on a single GPU for now.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set to the GPU you want to use
```

@shoegazerstella (Author)

Thanks a lot!
This solves the issue, and I am now able to start training.


ghost commented Aug 13, 2020

@shoegazerstella I'd like to try fixing this; it will also make your training faster if it works. When you get a chance, could you try changing these lines in synthesizer/models/tacotron.py? Then comment out the `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` line in synthesizer_train.py and see whether multi-GPU training works.

If it doesn't fix the problem you should revert the change because I noticed a slight speed improvement with the current code.

Old

```python
self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)
self.stop_threshold = nn.Parameter(torch.tensor(stop_threshold).float(), requires_grad=False)
```

New

```python
self.register_buffer('step', torch.zeros(1, dtype=torch.long))
self.register_buffer('stop_threshold', torch.tensor(stop_threshold, dtype=torch.float32))
```

I made this change since Corentin did something similar when he converted fatchord's vocoder, but now I am wondering if it breaks multi-GPU.

https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/7760081087b57b1a953525ac0bca6213879d2cea#diff-aae6b44cd4ebc2321fee5d9ef4c851ef
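A quick sketch of why the buffer version tolerates the step counter update (a toy module, not the project's model): `Module.__setattr__` accepts any `torch.Tensor` for a registered buffer, and buffers still travel with the module via `state_dict()` and `.to(device)`:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Buffer instead of Parameter, as in the "New" snippet above
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.step = self.step + 1  # allowed: buffers accept plain tensors
        return x

model = Toy()
model(torch.ones(1))
print(int(model.step))               # 1
print('step' in model.state_dict())  # True: buffers are checkpointed too
```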

@ghost ghost reopened this Aug 13, 2020
@shoegazerstella (Author)

Awesome! It seems to be working. There's just a new warning I'm reporting in case you need it for reference:

    models_dir:      DATA/models
    save_every:      1000
    backup_every:    25000
    force_restart:   False

Checkpoint path: DATA/models/synthesizer_model/pretrained.pt
Loading training data from: DATA/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 24.888M

Starting the training of Tacotron from scratch

Using inputs from:
        DATA/SV2TTS/synthesizer/train.txt
        DATA/SV2TTS/synthesizer/mels
        DATA/SV2TTS/synthesizer/embeds
LEN METADATA 24353
Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py:211: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
| Epoch: 1/14 (41/762) | Loss: 1.793 | 0.92 steps/s | Step: 0k | 
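The RNN warning above is common with DataParallel, because the replicated weights lose their contiguous cuDNN layout. The usual remedy, sketched here on a hypothetical encoder module (not the project's code), is to call `flatten_parameters()` at the start of `forward()` so cuDNN sees one contiguous chunk again:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(80, 128, batch_first=True, bidirectional=True)

    def forward(self, x):
        # No-op on CPU; on GPU it re-compacts the (possibly replicated)
        # weights and silences the "not part of single contiguous chunk"
        # RuntimeWarning.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out

enc = Encoder()
y = enc(torch.randn(2, 10, 80))
print(y.shape)  # torch.Size([2, 10, 256]) -- bidirectional doubles hidden size
```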

Note that if I load a checkpoint created with the old code I get the error below; I had to train from scratch to make it work.

Loading weights at DATA/models/synthesizer_model/synthesizer_model.pt
Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "/root/voicecloning/synthesizer/train.py", line 105, in train
    model.load(weights_fpath, optimizer)
  File "/root/voicecloning/synthesizer/models/tacotron.py", line 497, in load
    optimizer.load_state_dict(checkpoint["optimizer_state"])
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 115, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
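This mismatch is expected: the old model exposed `step` and `stop_threshold` as parameters, so the saved optimizer state has more entries in its param group than the buffer-based model's optimizer. A minimal sketch with hypothetical modules (not the real synthesizer) reproduces it:

```python
import torch
import torch.nn as nn

class OldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # 'step' counted as a parameter -> 3 entries in the optimizer group
        self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)

class NewModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # 'step' is a buffer -> only 2 entries (weight, bias) in the group
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

old_opt = torch.optim.Adam(OldModel().parameters())
new_opt = torch.optim.Adam(NewModel().parameters())
try:
    new_opt.load_state_dict(old_opt.state_dict())
except ValueError as e:
    print(e)  # ... doesn't match the size of optimizer's group
```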


ghost commented Aug 13, 2020

Nice! I pushed the fix. About the warning, you should compare training speed for single GPU and multi-GPU to make sure it is not adding too much overhead.

Next, monitor GPU and memory usage with `nvidia-smi`; you can run `watch -n 0.5 nvidia-smi` to refresh it constantly. Adjust the batch size until your GPU memory is nearly filled (leave about 20% headroom, since memory usage increases during training). It is safe to stop and resume training.

I would recommend not getting too attached to your first few models; the time is better spent learning what works and adjusting the training schedule for maximum efficiency. Train to 20k steps, listen to the wavs, look at the plots, and try the model in the toolbox (keep your expectations low until it reaches 100k steps).


ghost commented Aug 13, 2020

Going to close this issue, please share updates in #449. Thanks @shoegazerstella !

@ghost ghost closed this as completed Aug 13, 2020
@ghost ghost mentioned this issue Feb 16, 2021