
Generate audio from mag spectrogram #3

Open

tunnermann opened this issue Jul 13, 2019 · 5 comments

Comments

@tunnermann

Hey, thanks for your work in this project, it is really good.

I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Griffin-Lim gives me decent audio, but it sounds somewhat robotic, so I think your vocoder will improve it a lot.

The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 and your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.

I tried retraining the network with only n_fft changed, but the results were not good; the output had a lot of noise.

Any leads on what I might try next?

@bshall
Owner

bshall commented Jul 14, 2019

Hi @tunnermann, no problem.

I've just done a bit of testing. Passing a mel spectrogram with num_fft = 1024 to the pretrained model does result in some distortion of the audio. However, when I changed num_fft in the config.json and retrained the model from scratch I got fairly good results.
Here are some samples: samples.zip.

Did you do anything else besides changing the one line in config.json?

Also, I'd be happy to share the weights for this model with you if you'd like.

@tunnermann
Author

@bshall Thanks for your reply.

I did retrain the model with the new n_fft and got good results generating audio from wav files. Maybe my problem is in converting my linear spectrograms into mel spectrograms and feeding them to the network. I will investigate further, and also retrain the network directly on the generated spectrograms instead of spectrograms derived from the ground-truth audio.
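For reference, the linear-to-mel conversion step might look like the sketch below. This is a plain NumPy implementation of a Slaney-style mel filterbank; the parameters (n_fft=1024, 80 mel bins, fmin=40 Hz, 16 kHz) are assumptions taken from this thread, and the rest of the preprocessing (preemphasis, dB scaling, normalisation) would still have to match whatever the vocoder was trained with:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80, fmin=40.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def linear_to_mel(mag, fb):
    """Project a linear magnitude spectrogram (n_fft//2 + 1, frames)
    onto the filterbank, returning (n_mels, frames)."""
    return fb @ mag
```

Libraries like librosa provide an equivalent filterbank, but the point is that the projection is just a matrix multiply, so mismatches usually come from the surrounding scaling/normalisation rather than this step.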

Thanks again.

@bshall
Owner

bshall commented Jul 16, 2019

Yeah, that sounds like a reasonable approach. Let me know how it goes or if I can help at all. You can also try finetuning the model on the generated spectrograms. Might make experimenting a little faster.
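A fine-tuning loop along those lines could be sketched as follows. This is illustrative PyTorch only: the model class, the checkpoint layout (a `"model"` key holding the state dict), and the loss interface are placeholders, not this repository's actual API:

```python
import torch

def finetune(model, dataloader, checkpoint_path, steps=10000, lr=1e-5):
    """Resume from a pretrained checkpoint and continue training on
    (mel, audio) pairs -- e.g. generated spectrograms -- at a reduced
    learning rate. All names here are hypothetical."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step, (mels, audio) in enumerate(dataloader):
        if step >= steps:
            break
        optimizer.zero_grad()
        loss = model(audio, mels)  # assumed to return the training loss
        loss.backward()
        optimizer.step()
    return model
```

Starting from pretrained weights with a lower learning rate typically converges much faster than training from scratch, which is why it helps for quick experiments.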

@Approximetal

Approximetal commented Apr 13, 2020

Hi @bshall @tunnermann, I ran into the same problem. When I use different parameters to extract the mel spectrograms and retrain the model, the loss plateaus around 2.9 and the output has loud noise. What can I do to adjust the model to get better performance?
Here are my config parameters and audio samples. I use several datasets, including multiple languages.
"preprocessing": { "sample_rate": 16000, "num_fft": 1024, "num_mels": 80, "fmin": 40, "preemph": 0.97, "min_level_db": -100, "hop_length": 256, "win_length": 1024, "bits": 9, "num_evaluation_utterances" : 10 }, "vocoder": { "conditioning_channels": 128, "embedding_dim": 256, "rnn_channels": 896, "fc_channels": 512, "learning_rate": 1e-4, "schedule": { "step_size": 20000, "gamma": 0.5 }, "batch_size": 256, "checkpoint_interval": 10000, "num_steps": 5000000, "sample_frames": 40, "audio_slice_frames": 8 }
audio_samples.zip

@bshall
Owner

bshall commented Apr 14, 2020

Hi @Approximetal,

My guess is that a hop length of 256 is too large for a sample rate of 16 kHz. At this hop length, each frame covers 16 ms of audio. Most TTS and vocoder implementations that I've seen use either 12.5 ms or 10 ms. The ones that use a hop length of 256 typically have audio at a sample rate of 22050 Hz.

The ZeroSpeech2019 dataset is only recorded at 16kHz so my default was a hop-length of 200 (12.5ms).
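The arithmetic behind those numbers: the frame duration is just hop_length / sample_rate.

```python
def frame_ms(hop_length, sample_rate):
    """Duration of one spectrogram frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate

# hop 256 at 16 kHz   -> 16.0 ms per frame (too coarse)
# hop 200 at 16 kHz   -> 12.5 ms per frame (the default here)
# hop 256 at 22050 Hz -> ~11.6 ms per frame (why 256 works for 22.05 kHz audio)
```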

Hope that helps!
