Generate audio from mag spectrogram #3
Comments
Hi @tunnermann, no problem. I've just done a bit of testing, passing a mel spectrogram extracted with the smaller n_fft through the model. Did you do anything else besides changing that one line in the config? Also, I'd be happy to share the weights for this model with you if you'd like.
@bshall Thanks for your reply. I did retrain the model with the new n_fft and got good results generating audio from wav files. Maybe my problem is in converting my spectrograms into mel spectrograms and feeding them to the network. I will investigate further and also retrain the network directly on the generated spectrograms instead of spectrograms derived from the ground-truth audio. Thanks again.
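For what it's worth, projecting a magnitude spectrogram onto a mel filterbank might look something like the sketch below. This is a minimal sketch assuming librosa; the values of n_mels, and the n_fft=1024 / 16 kHz parameters taken from this thread, are assumptions and must match whatever configuration the vocoder was trained with.

```python
import numpy as np
import librosa

# Assumed parameters based on this thread: 16 kHz audio and n_fft=1024 on
# the magnitude-spectrogram side. n_mels is a placeholder; all of these
# must match the configuration the vocoder was trained with.
sr = 16000
n_fft = 1024
n_mels = 80

# mag: magnitude spectrogram from the other network,
# shape (1 + n_fft // 2, n_frames). Random data stands in here.
mag = np.abs(np.random.randn(1 + n_fft // 2, 100)).astype(np.float32)

# Build a mel filterbank and project the magnitude spectrogram onto it.
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
mel = mel_basis @ mag  # shape (n_mels, n_frames)

# Many vocoders are trained on log-mel features; clip before the log
# to avoid log(0).
log_mel = np.log(np.clip(mel, 1e-5, None))
```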
Yeah, that sounds like a reasonable approach. Let me know how it goes or if I can help at all. You could also try fine-tuning the model on the generated spectrograms; that might make experimenting a little faster.
Hi @bshall @tunnermann, I ran into the same problem. When I use different parameters to extract the mel spectrograms and retrain the model, the loss plateaus around 2.9 and the result has loud noise. What can I do to adjust the model to get better performance?
Hi @Approximetal, my guess is that a hop length of 256 is too large for a sample rate of 16 kHz. At that hop length, each frame covers 16 ms of audio. Most TTS and vocoder implementations I've seen use either 12.5 ms or 10 ms; the ones that use a hop length of 256 typically work with audio at a sample rate of 22050 Hz. The ZeroSpeech2019 dataset is only recorded at 16 kHz, so my default was a hop length of 200 (12.5 ms). Hope that helps!
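As a quick sanity check, the frame durations above follow directly from hop_length / sample_rate; a minimal sketch:

```python
# Frame duration implied by a hop length at a given sample rate:
# duration_ms = hop_length / sample_rate * 1000
for sr, hop in [(16000, 256), (16000, 200), (22050, 256)]:
    print(f"sr={sr} Hz, hop={hop}: {hop / sr * 1000:.2f} ms per frame")
# sr=16000 Hz, hop=256: 16.00 ms per frame
# sr=16000 Hz, hop=200: 12.50 ms per frame
# sr=22050 Hz, hop=256: 11.61 ms per frame
```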
Hey, thanks for your work on this project, it's really good.
I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Using Griffin-Lim gets me decent audio, but it sounds kind of robotic, so I think your vocoder would improve it a lot.
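(For reference, the Griffin-Lim baseline mentioned above might look like this minimal sketch, assuming librosa; the hop length here is a placeholder and must match the network that produced the spectrograms.)

```python
import numpy as np
import librosa
import soundfile as sf

# Assumed parameters: n_fft=1024 matches the magnitude spectrograms
# described here; hop_length is a placeholder that must match the
# network that produced them.
n_fft = 1024
hop_length = 256

# mag: predicted magnitude spectrogram, shape (1 + n_fft // 2, n_frames).
# Random data stands in here.
mag = np.abs(np.random.randn(1 + n_fft // 2, 100)).astype(np.float32)

# Griffin-Lim iteratively estimates the phase that the magnitude
# spectrogram discards, then inverts the STFT.
wav = librosa.griffinlim(mag, n_iter=60, hop_length=hop_length, win_length=n_fft)
sf.write("griffin_lim_output.wav", wav, 16000)
```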
The biggest difference between the parameters of the two networks is n_fft: my spectrograms use 1024 while your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.
I tried retraining the network changing only n_fft, but the results were not good; there was a lot of noise.
Any leads on what I might try next?