Inference/recipe not working properly. #640
Comments
You can check whether it is about r: you can init the model with the default r value, then change it to 2 and run the inference. |
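For illustration, a minimal sketch of that r override, assuming the Synthesizer API of Coqui TTS around v0.1.x; attribute names such as `tts_model` and `decoder.set_r`, and all paths, are assumptions here and may differ between versions.

```python
# Sketch only: load a checkpoint with its default r, then override r before inference.
# `tts_model` and `decoder.set_r` are assumed names; check your TTS version.
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(
    tts_checkpoint="output/run/checkpoint_280000.pth.tar",  # hypothetical paths
    tts_config_path="output/run/config.json",
    use_cuda=False,
)

# Force the reduction factor the fine decoder ended training with.
synth.tts_model.decoder.set_r(2)

wav = synth.tts("The quick brown fox jumps over the lazy dog.")
synth.save_wav(wav, "test_r2.wav")
```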
Ok, I have done that now. It appears that the r-value is not the issue, as it sounds the same. Is it normal for inference to sound like this at these points in training, or do you think there is another issue occurring? |
Try disabling mixed precision. Also, you can check the working released models and try copying their config for your run. Maybe I missed something while I was updating TTS to the new Trainer API. |
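As an illustration of "copying their config", something like the following pulls a released model's files so its config can be diffed against the recipe config. The ModelManager call reflects the Coqui TTS API of roughly that era; its constructor and return values are assumptions and may differ between versions.

```python
# Rough sketch: fetch the released Tacotron2-DDC model and locate its config.
# ModelManager's constructor/return values may differ across TTS versions.
from TTS.utils.manage import ModelManager

manager = ModelManager()
model_path, config_path, model_item = manager.download_model(
    "tts_models/en/ljspeech/tacotron2-DDC"
)

print("released checkpoint:", model_path)
print("released config:    ", config_path)  # diff this against your recipe config
```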
The recipe that was used already has mixed precision set to false. I tried using the pre-trained model's config for this model instead of the recipe config and attempted inference in a variety of ways with this other config to make it work, but was only able to get this.
Update: |
You need to retrain with the new config, especially if there are different audio parameters. It is not enough to change it only for inference. |
Okay, but the pretrained config has double decoder consistency disabled for some reason... so do I enable that and keep everything else the same in the config? |
You can enable it and keep the rest the same. It is disabled since the 2nd decoder is removed to reduce the model size. |
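For concreteness, a sketch of that change, treating the released config as plain JSON. The field names (`double_decoder_consistency`, `ddc_r`) are assumed from the Tacotron2-DDC recipes of that era, and the file paths are hypothetical.

```python
# Minimal sketch: re-enable double decoder consistency in the pretrained config
# and keep everything else the same. Paths and field names are assumptions.
import json

with open("pretrained_tacotron2_ddc_config.json") as f:
    cfg = json.load(f)

cfg["double_decoder_consistency"] = True  # re-enable DDC for training
cfg.setdefault("ddc_r", 6)                # coarse-decoder reduction factor, if missing

with open("train_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```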
Okay, attempting that now. At what point would you recommend attempting inference and checking audio quality? The total is 1,000 epochs. |
In general, after 20k steps, it should start producing understandable speech. |
It's now at about 100K steps (training with the pretrained model config; the only change made is setting DDC to true).
output20k.mp4
output60k.mp4
output80k.mp4 |
Why is there no image on the Tensorboard? There should be alignment images. |
But looking at the audio samples, it looks like something is broken in the model or the configs you use. Also, please share the alignment images, so we can see whether the bug is in the inference or the training code. |
I never knew the tensorboard was supposed to show alignment images; is it supposed to show spectrogram images as well? I'm not sure where I would find those images at all. This is the tensorboard command used, does it seem correct to you?: |
For two weeks I have had the same type of problems with Tacotron2-DDC inference. My models trained with version 0.1.2 look fine in Tensorboard and the audio in Tensorboard is intelligible, but the inference audio is broken. Until now I searched for errors in my settings, but the present issue description by Billy Bob makes me think that there is really a problem with inference. My understanding is that the models released in the past should work with the latest Coqui-TTS versions. Therefore I did some inference tests with the Tacotron2-DDC LJSpeech model released in April 2021. I used the following script
and started with version 0.0.12 (git checkout a53958a). It works as expected. Here are the logs, the signal-figure and the sound:
ljspeech_v0.0.12.mp4
Version 0.0.13 (git checkout f02f033) also works fine. In version 0.0.14 (git checkout 5482a0f) the following error is reported:
I was not able to debug this problem and could not check whether the inference is working. Version 0.0.15 (git checkout b8b79a5) shows no errors in the logs, but the sound is bad.
ljspeech_v0.0.15.mp4
Same results for version 0.0.15.1 (git checkout d245b5d). Versions 0.1.0 (git checkout c25a218), 0.1.1 (git checkout 676d22f), 0.1.2 (git checkout 8fbadad) and main show a warning that the … Here are the logs, the signal-figure and the sound for the latest version 0.1.2:
ljspeech_v0.1.2.mp4
I hope my report helps to solve the problem. |
Yes, it should show all of these. Why don't you just run tensorboard locally? Maybe uploading breaks things. |
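For reference, TensorBoard can be launched locally against the training output folder from Python as well; this uses the standard tensorboard package, and the log directory path here is hypothetical.

```python
# Start a local TensorBoard server pointed at the run's output folder.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "output/ljspeech-ddc-run"])  # hypothetical path
url = tb.launch()
print(f"TensorBoard running at {url}")  # alignment/spectrogram figures appear under the Images tab
```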
@mbarnig very helpful!! Thx for going under the hood. So it looks like we have something wrong after 0.15. I'll check and try to find that little 🐛 |
To complete my report, I did some inference tests in version 0.1.2 with the other released English models. Here are my findings:
GlowTTS LJSpeech
glowtts-ljspeech_v0.1.2.mp4
Tacotron2-DCA LJSpeech
I changed the …
tacotron2-dca-ljspeech_v0.1.2.mp4
I think that this audio also has some problems, but I was not able to compare it with the released version.
SpeedySpeech LJSpeech
I changed the …
Fails with a …
SC-GlowTTS VCTK
I changed the …
Fails with …
Running SpeedySpeech and SC-GlowTTS-VCTK in earlier versions also fails, but with other errors. |
I found the alignment images, and this is how it looks at 180K. Also, this is the test audio that I found for 187K:
july13training_step187k.mp4
Definitely very different audio and finally understandable! |
The alignment looks good enough. I guess the issue you experience is due to a bug in the stopnet (which decides when the model should stop). I am working on it and will release the fix soon. Until then, just keep training the model; after the release, you can continue training with the new version and fix the stopnet. |
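To illustrate what the stopnet does, here is generic Tacotron-style pseudologic written for this write-up; it is not the actual Coqui TTS implementation, and `decoder_step` is a hypothetical callable.

```python
# Illustrative only: an autoregressive decoder loop gated by a stopnet.
# A buggy stopnet either never fires (trailing babble) or fires too early
# (clipped audio), which matches the symptoms described above.
import torch

def decode(decoder_step, max_steps=10000, stop_threshold=0.5):
    frames, state = [], None
    for _ in range(max_steps):
        frame, stop_logit, state = decoder_step(state)  # one mel frame per step
        frames.append(frame)
        if torch.sigmoid(stop_logit) > stop_threshold:  # stopnet says "end of utterance"
            break
    return torch.stack(frames)
```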
@erogol Thank you for pushing the new update! I'm currently training with the same config used at the beginning of this issue, except I'm now training using your new V0.1.3 to see if the stop net was the problem and if I can finally replicate the inference quality in the current pretrained voice. |
Hello @erogol! So I have made a few interesting observations lately regarding this issue: when I run inference with the model I trained on your pretrained config, along with the stopnet fix, I definitely get significantly better results compared to the one I trained before the stopnet fix, and this difference is clear regardless of the vocoder used. That being said, when inferencing this better DDC model with the pretrained hifi-gan as the vocoder, the quality is significantly worse than with both multi-band melgan and griffin-lim, whereas from my experience it should be significantly better than griffin-lim and at least similar to multi-band melgan. Here is audio of my model inferencing with griffin-lim, multi-band melgan, and finally hifi-gan:
50k_default.mp4
50k_multiband_melgan.mp4
Loudness warning: 50k_hifigan.mp4
On the contrary, the pretrained DDC model that comes with Coqui works perfectly well with hifi-gan, but seems to have significantly worse audio quality when inferenced with multi-band melgan. The opposite trend is occurring compared to the DDC model I trained with the supposedly same config. Funnily enough, the high-pitch artifacts in the audio seem similar to what happens when I use my own DDC checkpoint with hifi-gan. Here is audio of that poor inference quality occurring with the Coqui pretrained DDC + multi-band melgan:
output.mp4
So overall, the stopnet fix definitely helped, but unfortunately there is something more going on here that is preventing my Tacotron2-DDC model from reaching the expected quality that Tacotron2-DDC + hifi-gan has to offer. |
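One way to reproduce this kind of A/B vocoder comparison is to swap vocoders on the Synthesizer at inference time. This is a sketch: the parameter names `vocoder_checkpoint`/`vocoder_config` are assumed from that era's API, all paths are hypothetical, and Griffin-Lim is used when no vocoder checkpoint is given.

```python
# Sketch: synthesize the same sentence with the trained Tacotron2-DDC model
# and different vocoders, for an A/B listen. Paths and kwarg names are assumptions.
from TTS.utils.synthesizer import Synthesizer

SENTENCE = "This is a vocoder comparison sentence."

for name, voc_ckpt, voc_cfg in [
    ("griffin_lim", "", ""),  # no vocoder -> Griffin-Lim fallback
    ("hifigan", "vocoders/hifigan/model.pth.tar", "vocoders/hifigan/config.json"),
]:
    synth = Synthesizer(
        tts_checkpoint="output/run/checkpoint_50000.pth.tar",
        tts_config_path="output/run/config.json",
        vocoder_checkpoint=voc_ckpt,
        vocoder_config=voc_cfg,
    )
    wav = synth.tts(SENTENCE)
    synth.save_wav(wav, f"50k_{name}.wav")
```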
Thanks for the update. The vocoder model should match the audio parameters of the TTS model. Have you checked? |
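One quick way to check that, using only plain JSON (the file names here are hypothetical), is to print every key in the two configs' audio sections that does not match.

```python
# Compare the "audio" section of the TTS config and the vocoder config;
# mismatches (sample_rate, fft_size, hop_length, mel ranges, stats_path, ...)
# can produce exactly this kind of screeching output.
import json

def audio_params(path):
    with open(path) as f:
        return json.load(f).get("audio", {})

tts_audio = audio_params("tacotron2_ddc_config.json")  # hypothetical file names
voc_audio = audio_params("hifigan_config.json")

for key in sorted(set(tts_audio) | set(voc_audio)):
    if tts_audio.get(key) != voc_audio.get(key):
        print(f"{key}: tts={tts_audio.get(key)!r} vocoder={voc_audio.get(key)!r}")
```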
Ah, it seems like that may be the problem; it looks like I was using a slightly different config, and I also noticed some things that might've affected training. I'm going to train from scratch again and make sure everything is correct. I'll post an update on how inference sounds with hifi-gan within the next few days. |
Ok, this time I double-checked that all audio parameters were the same across both hifi-gan and Tacotron2-DDC, made sure I was using the new stopnet code, and correctly ran the scale_stats.py computation, which I had some mishaps with before. With all of this, I trained a model with the config to about 80K steps and tried inferencing it again with hifi-gan. Unfortunately, that screeching high-pitch feature is still strongly present, and it doesn't seem to have improved in that regard. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels. |
Discussed in #639
Originally posted by BillyBobQuebec July 10, 2021
I am training Tacotron2-DDC (LJ) from scratch using the recipe provided, with no changes. Tensorboard looks good to my eyes, but the alignment and duration seem to be way off when I actually try inferencing the audio. I suspect it's a problem with the inference being run improperly, specifically the r-value it is attempting to inference with. Here is the command that I used to initiate training:
Since the recipe uses gradual training, which uses "r" as the starting value for the fine decoder (if I understand it correctly) but then changes it over time during training, I suspect it's using the starting r value during inference instead of the latest r-value the fine decoder was at during training (a sketch of how such a schedule maps steps to r follows the audio samples below). When I try to force a different r value for inference (by passing a recipe config with "r": 2, instead of "r": 6,) it gives me this error:
Here's the command used for inferencing, and here's how it sounds at different points:
280k.mp4
110k.mp4
50k.mp4
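As an illustration of the gradual-training point above: the schedule maps a global step to a reduction factor r. The triplet format [start_step, r, batch_size] and the values below are assumptions for illustration, not the recipe's actual schedule.

```python
# Illustrative only: how a gradual-training schedule maps a training step to r.
# Format and values are assumed, not taken from the actual recipe config.
GRADUAL_TRAINING = [
    [0,      6, 64],
    [10000,  4, 32],
    [50000,  3, 32],
    [130000, 2, 32],
]

def r_at_step(step, schedule=GRADUAL_TRAINING):
    """Return the reduction factor the fine decoder uses at a given training step."""
    r = schedule[0][1]
    for start_step, sched_r, _batch_size in schedule:
        if step >= start_step:
            r = sched_r
    return r

# A checkpoint saved late in training was produced with a smaller r than the
# config's starting value, which is the mismatch suspected above.
print(r_at_step(280_000))  # -> 2 under this assumed schedule
```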