Harshness of speech #10
Comments
@mush42. I think this is related to the reports here #2 (comment). FYI we're also using the new ensemble pitch extractor as suggested in that same issue thread. I'll also attach some mel-spectrogram samples from the And their params
Thanks all for your feedback. There are a few suspects here worth investigating:
Hi @mush42, thanks for the response. Here is the config that we've used: We didn't add any filters or change the default feature extractors, other than FFT, window, f min and max, so I assume the harshness isn't caused by these. I've also tried (2), but it caused a NaN loss and training diverged, so I've turned it off as well. I'm curious about (3), but first I wanted to know if there's anything wrong with our preprocessing.
@w11wo @adriandewitts any new findings?
Hi @mush42, we haven't found a solution to this issue yet. As I've mentioned, we didn't do any audio preprocessing such as low/high-pass filters, and we've used the default configs (other than bumping the sample rate to 44.1 kHz). I was wondering if you had any solutions in mind. Also, I didn't run into this issue when I trained FastSpeech2 + MB-MelGAN; perhaps there are ideas to borrow from there?
Hi there,
Thank you all for the great work on Optispeech. @w11wo and I have been getting some great results.
I have a sound engineering background, and I’ve got a good idea of the different kinds of issues related to voice quality. I’d like to help and contribute from that perspective.
Optispeech doesn’t have the usual artefacts we’ve heard before, which has been great.
I’ve noticed the “harshness” of our trained voices. This screenshot from the Audacity spectrogram shows this. The bottom window is the original recorded voice, and the top is from Optispeech.
The screenshot points out one example of the ‘s’ sound, or sibilant. In the original, you can see a gentle rise and fall of the higher frequencies from 4-18 kHz. In Optispeech's output, the sibilance is loud across the whole spectrum, with a pronounced start and stop. You can clearly make out the sibilant sounds in the speech.
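As a rough way to put numbers on this observation, the sketch below computes, frame by frame, the fraction of spectral energy that falls in the 4-18 kHz band mentioned above. It is only an illustration, not part of Optispeech: the function name `band_energy_ratio`, the frame length, and the hop size are all assumptions, and loading the audio into a NumPy array is left out.

```python
import numpy as np


def band_energy_ratio(signal, sample_rate, f_lo=4000.0, f_hi=18000.0,
                      frame_len=2048, hop=512):
    """Per-frame fraction of spectral energy inside [f_lo, f_hi] Hz.

    Hypothetical helper for illustration; the frame_len and hop defaults
    are arbitrary choices, not values taken from any Optispeech config.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad short clips so that at least one frame is produced.
        signal = np.pad(signal, (0, frame_len - len(signal)))

    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)

    n_frames = 1 + (len(signal) - frame_len) // hop
    ratios = np.zeros(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        total = power.sum()
        if total > 0:
            ratios[i] = power[in_band].sum() / total
    return ratios


def max_frame_jump(ratios):
    """Largest frame-to-frame change in the band ratio (0.0 for one frame)."""
    return float(np.max(np.abs(np.diff(ratios)))) if len(ratios) > 1 else 0.0
```

Running both the original recording and the synthesized clip through `band_energy_ratio` at the same sample rate, then comparing the curves around a sibilant, should show whether the synthesized version jumps in and out of the band more abruptly. `max_frame_jump` gives a single crude number for that abruptness.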
Please let me know if you have any questions or if there is any way that I can help the project in general.