
Cannot assign 'torch.cuda.LongTensor' as parameter 'step' (torch.nn.Parameter or None expected) #489

Closed
shoegazerstella opened this issue Aug 13, 2020 · 6 comments

Comments

@shoegazerstella

Hi,
I am trying to re-train the synthesizer model as discussed in #449 (comment), but I get the error below:

Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+
 
Traceback (most recent call last):
  File "synthesizer_train.py", line 33, in <module>
    train(**vars(args))
  File "/root/voicecloning/synthesizer/train.py", line 168, in train
    m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m, e)
  File "/root/voicecloning/synthesizer/utils/__init__.py", line 17, in data_parallel_workaround
    outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/voicecloning/synthesizer/models/tacotron.py", line 348, in forward
    self.step += 1
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 558, in __setattr__
    .format(torch.typename(value), name))
TypeError: cannot assign 'torch.cuda.LongTensor' as parameter 'step' (torch.nn.Parameter or None expected)
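The check that fires here can be reproduced with a minimal sketch (a toy module, not the project's Tacotron). Once a name is registered as an `nn.Parameter`, `Module.__setattr__` refuses to rebind it to a plain tensor. On a single GPU the in-place `self.step += 1` happens to return the same Parameter object, so it passes; under `DataParallel`, the replicas hold plain broadcast tensors for `step`, and the rebinding triggers exactly this `TypeError`:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Same registration style as the synthesizer: a non-trainable Parameter
        self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)

    def forward(self, x):
        # Explicit rebinding produces a plain Tensor, which Module.__setattr__
        # rejects for a name registered as a Parameter -- the same check that
        # fails for the DataParallel replicas above.
        self.step = self.step + 1
        return x

model = Toy()
try:
    model(torch.ones(1))
except TypeError as e:
    print(e)  # cannot assign 'torch.LongTensor' as parameter 'step' ...
```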

ghost commented Aug 13, 2020

@shoegazerstella I was not able to test parallel GPU training during development since I don't have that kind of hardware. You can add this code to the top of synthesizer_train.py to make it only run on a single GPU for now.

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set to the GPU you want to use
```

@shoegazerstella (Author)

Thanks a lot!
This solves the issue, and I am now able to start training.


ghost commented Aug 13, 2020

@shoegazerstella I'd like to try fixing this; it will also make your training faster if it works. When you get a chance, could you try changing these lines in synthesizer/models/tacotron.py? Then comment out the `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` line in synthesizer_train.py and see whether multi-GPU training works.

If it doesn't fix the problem you should revert the change because I noticed a slight speed improvement with the current code.

Old

```python
self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)
self.stop_threshold = nn.Parameter(torch.tensor(stop_threshold).float(), requires_grad=False)
```

New

```python
self.register_buffer('step', torch.zeros(1, dtype=torch.long))
self.register_buffer('stop_threshold', torch.tensor(stop_threshold, dtype=torch.float32))
```

I made this change since Corentin did something similar when he converted fatchord's vocoder, but now I am wondering if it breaks multi-GPU.

https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/7760081087b57b1a953525ac0bca6213879d2cea#diff-aae6b44cd4ebc2321fee5d9ef4c851ef
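A quick sketch of why the buffer version tolerates the step counter update (a toy module, not the project's model): `Module.__setattr__` accepts any `torch.Tensor` for a registered buffer, and buffers still travel with the module via `state_dict()` and `.to(device)`:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Buffer instead of Parameter, as in the "New" snippet above
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.step = self.step + 1  # allowed: buffers accept plain tensors
        return x

model = Toy()
model(torch.ones(1))
print(int(model.step))               # 1
print('step' in model.state_dict())  # True: buffers are checkpointed too
```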

@ghost ghost reopened this Aug 13, 2020
@shoegazerstella (Author)

Awesome! It seems to be working. There's just a new warning I'm reporting in case you need it for reference:

    models_dir:      DATA/models
    save_every:      1000
    backup_every:    25000
    force_restart:   False

Checkpoint path: DATA/models/synthesizer_model/pretrained.pt
Loading training data from: DATA/SV2TTS/synthesizer/train.txt
Using model: Tacotron
Using device: cuda

Initialising Tacotron Model...

Trainable Parameters: 24.888M

Starting the training of Tacotron from scratch

Using inputs from:
        DATA/SV2TTS/synthesizer/train.txt
        DATA/SV2TTS/synthesizer/mels
        DATA/SV2TTS/synthesizer/embeds
LEN METADATA 24353
Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py:211: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
| Epoch: 1/14 (41/762) | Loss: 1.793 | 0.92 steps/s | Step: 0k | 
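The RNN warning above is common with DataParallel, because the replicated weights lose their contiguous cuDNN layout. The usual remedy, sketched here on a hypothetical encoder module (not the project's code), is to call `flatten_parameters()` at the start of `forward()` so cuDNN sees one contiguous chunk again:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(80, 128, batch_first=True, bidirectional=True)

    def forward(self, x):
        # No-op on CPU; on GPU it re-compacts the (possibly replicated)
        # weights and silences the "not part of single contiguous chunk"
        # RuntimeWarning.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return out

enc = Encoder()
y = enc(torch.randn(2, 10, 80))
print(y.shape)  # torch.Size([2, 10, 256]) -- bidirectional doubles hidden size
```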

Note that if I load a checkpoint created with the old code I get the error below; I had to train from scratch to make it work.

Loading weights at DATA/models/synthesizer_model/synthesizer_model.pt
Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "/root/voicecloning/synthesizer/train.py", line 105, in train
    model.load(weights_fpath, optimizer)
  File "/root/voicecloning/synthesizer/models/tacotron.py", line 497, in load
    optimizer.load_state_dict(checkpoint["optimizer_state"])
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 115, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
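This mismatch is expected: the old model exposed `step` and `stop_threshold` as parameters, so the saved optimizer state has more entries in its param group than the buffer-based model's optimizer. A minimal sketch with hypothetical modules (not the real synthesizer) reproduces it:

```python
import torch
import torch.nn as nn

class OldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # 'step' counted as a parameter -> 3 entries in the optimizer group
        self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)

class NewModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # 'step' is a buffer -> only 2 entries (weight, bias) in the group
        self.register_buffer('step', torch.zeros(1, dtype=torch.long))

old_opt = torch.optim.Adam(OldModel().parameters())
new_opt = torch.optim.Adam(NewModel().parameters())
try:
    new_opt.load_state_dict(old_opt.state_dict())
except ValueError as e:
    print(e)  # ... doesn't match the size of optimizer's group
```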


ghost commented Aug 13, 2020

Nice! I pushed the fix. About the warning, you should compare training speed for single GPU and multi-GPU to make sure it is not adding too much overhead.

Next, monitor GPU and memory usage with `nvidia-smi`; you can run `watch -n 0.5 nvidia-smi` to refresh it constantly. Adjust the batch size until your GPU memory is nearly filled (leave about 20% headroom, since memory usage increases during training). It is safe to stop and resume training.

I would recommend not getting too attached to your first few models; the time is better spent learning what works and adjusting the training schedule for maximum efficiency. Train to 20k steps, listen to the wavs, look at the plots, and try the model in the toolbox (keep your expectations low until it reaches 100k steps).


ghost commented Aug 13, 2020

Going to close this issue, please share updates in #449. Thanks @shoegazerstella !

@ghost ghost closed this as completed Aug 13, 2020
@ghost ghost mentioned this issue Feb 16, 2021