
Generate audio from mag spectrogram #3

Open

tunnermann opened this issue Jul 13, 2019 · 5 comments

Comments

@tunnermann

Hey, thanks for your work in this project, it is really good.

I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Griffin-Lim gives me decent audio, but it sounds somewhat robotic, so I think your vocoder will improve it a lot.

The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 and your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.

I tried retraining the network with only n_fft changed, but the results were not good; the output had a lot of noise.

Any leads on what I might try next?

@bshall
Owner

bshall commented Jul 14, 2019

Hi @tunnermann, no problem.

I've just done a bit of testing. Passing a mel spectrogram with num_fft = 1024 to the pretrained model does result in some distortion of the audio. However, when I changed num_fft in the config.json and retrained the model from scratch I got fairly good results.
Here are some samples: samples.zip.

Did you do anything else besides changing the one line in config.json?

Also, I'd be happy to share the weights for this model with you if you'd like.

@tunnermann
Author

@bshall Thanks for your reply.

I did retrain the model with the new n_fft and got good results generating audio from wav files. Maybe my problem is in converting my linear spectrograms into mel spectrograms and feeding them to the network. I will investigate further, and also retrain the network directly on the generated spectrograms instead of spectrograms derived from the ground-truth audio.
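For reference, the linear-to-mel conversion step might look like the sketch below. This is a plain NumPy implementation of a Slaney-style mel filterbank; the parameters (n_fft=1024, 80 mel bins, fmin=40 Hz, 16 kHz) are assumptions taken from this thread, and the rest of the preprocessing (preemphasis, dB scaling, normalisation) would still have to match whatever the vocoder was trained with:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80, fmin=40.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def linear_to_mel(mag, fb):
    """Project a linear magnitude spectrogram (n_fft//2 + 1, frames)
    onto the filterbank, returning (n_mels, frames)."""
    return fb @ mag
```

Libraries like librosa provide an equivalent filterbank, but the point is that the projection is just a matrix multiply, so mismatches usually come from the surrounding scaling/normalisation rather than this step.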

Thanks again.

@bshall
Owner

bshall commented Jul 16, 2019

Yeah, that sounds like a reasonable approach. Let me know how it goes or if I can help at all. You can also try finetuning the model on the generated spectrograms. Might make experimenting a little faster.
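A fine-tuning loop along those lines could be sketched as follows. This is illustrative PyTorch only: the model class, the checkpoint layout (a `"model"` key holding the state dict), and the loss interface are placeholders, not this repository's actual API:

```python
import torch

def finetune(model, dataloader, checkpoint_path, steps=10000, lr=1e-5):
    """Resume from a pretrained checkpoint and continue training on
    (mel, audio) pairs -- e.g. generated spectrograms -- at a reduced
    learning rate. All names here are hypothetical."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step, (mels, audio) in enumerate(dataloader):
        if step >= steps:
            break
        optimizer.zero_grad()
        loss = model(audio, mels)  # assumed to return the training loss
        loss.backward()
        optimizer.step()
    return model
```

Starting from pretrained weights with a lower learning rate typically converges much faster than training from scratch, which is why it helps for quick experiments.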

@Approximetal

Approximetal commented Apr 13, 2020

Hi @bshall @tunnermann, I ran into the same problem. When I use different parameters to extract the mel spectrograms and retrain the model, the loss plateaus around 2.9 and the output has loud noise. What can I do to adjust the model to get better performance?
Here are my config parameters and audio samples. I use several datasets, including multiple languages.
"preprocessing": { "sample_rate": 16000, "num_fft": 1024, "num_mels": 80, "fmin": 40, "preemph": 0.97, "min_level_db": -100, "hop_length": 256, "win_length": 1024, "bits": 9, "num_evaluation_utterances" : 10 }, "vocoder": { "conditioning_channels": 128, "embedding_dim": 256, "rnn_channels": 896, "fc_channels": 512, "learning_rate": 1e-4, "schedule": { "step_size": 20000, "gamma": 0.5 }, "batch_size": 256, "checkpoint_interval": 10000, "num_steps": 5000000, "sample_frames": 40, "audio_slice_frames": 8 }
audio_samples.zip

@bshall
Owner

bshall commented Apr 14, 2020

Hi @Approximetal,

My guess is that a hop length of 256 is too large for a sample rate of 16 kHz. At this hop length, each frame covers 16 ms of audio. Most TTS and vocoder implementations that I've seen use either 12.5 ms or 10 ms. The ones that use a hop length of 256 typically have audio at a sample rate of 22050 Hz.

The ZeroSpeech2019 dataset is only recorded at 16kHz so my default was a hop-length of 200 (12.5ms).
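The arithmetic behind those numbers: the frame duration is just hop_length / sample_rate.

```python
def frame_ms(hop_length, sample_rate):
    """Duration of one spectrogram frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate

# hop 256 at 16 kHz   -> 16.0 ms per frame (too coarse)
# hop 200 at 16 kHz   -> 12.5 ms per frame (the default here)
# hop 256 at 22050 Hz -> ~11.6 ms per frame (why 256 works for 22.05 kHz audio)
```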

Hope that helps!
