
Harshness of speech #10

Open
adriandewitts opened this issue Sep 23, 2024 · 5 comments

Comments

@adriandewitts

Hi there,

Thank you all for the great work on Optispeech. @w11wo and I have been getting some great results.

I have a sound engineering background, and I’ve got a good idea of the different kinds of issues related to voice quality. I’d like to help and contribute from that perspective.

Optispeech doesn’t have the usual artefacts we’ve heard before, which has been great.

I’ve noticed a certain “harshness” in our trained voices. The screenshot below, from Audacity’s spectrogram view, shows this: the bottom window is the original recorded voice, and the top is the Optispeech output.

The screenshot highlights one example of an ’s’ (sibilant) sound. In the original, you can see a gentle rise and fall of energy in the higher frequencies, roughly 4–18 kHz. Optispeech’s character is loud sibilance across the spectrum with a pronounced start and stop, and you can clearly make out the sibilant sounds in the speech.
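
To make the comparison a little less subjective, here is a minimal sketch of one way to measure energy in the sibilance band over time. The file names, the 4–18 kHz band, and the use of SciPy are my own assumptions for illustration, not anything from the project:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def band_energy_db(path, f_lo=4000, f_hi=18000):
    # Mean power (dB) per frame within a frequency band, for eyeballing sibilance.
    sr, audio = wavfile.read(path)                 # assumes a mono 16-bit PCM WAV
    audio = audio.astype(np.float32) / 32768.0
    freqs, _, spec = spectrogram(audio, fs=sr, nperseg=2048)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    energy = spec[band].mean(axis=0)               # average over the band, per frame
    return 10 * np.log10(energy + 1e-10)

# Hypothetical file names: an original recording vs. the synthesized version.
orig = band_energy_db("original.wav")
synth = band_energy_db("optispeech.wav")
print("peak band level (dB):", orig.max(), synth.max())
```

Plotting the two curves against each other shows the gentle rise and fall in the original versus the abrupt start and stop in the synthesized output.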

Please let me know if you have any questions or if there is any way that I can help the project in general.

[Screenshot: Audacity spectrogram comparison, original recording (bottom) vs. Optispeech output (top)]

@w11wo

w11wo commented Sep 23, 2024

@mush42 I think this is related to the reports here: #2 (comment). FYI, we're also using the new ensemble pitch extractor, as suggested in that same issue thread.

I'll also attach some mel-spectrogram samples from the data/ folder (result of feature extraction), in case that helps.

[Three attached mel-spectrogram images from the data/ folder]

And their params

sr = 44100, n_fft = 2048, n_mels = 80, f_min = 0, f_max = 22050
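
For anyone wanting to reproduce plots like these, a minimal sketch of computing a mel-spectrogram with those parameters using torchaudio; only sr, n_fft, n_mels, f_min, and f_max come from the params above, while the hop/window lengths, the log compression, and the file name are my own assumptions rather than the project's actual extractor:

```python
import torch
import torchaudio

# Hop/window lengths below are assumed defaults, not taken from optispeech.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    win_length=2048,
    hop_length=512,
    f_min=0.0,
    f_max=22050.0,
    n_mels=80,
)

wav, sr = torchaudio.load("sample.wav")   # hypothetical 44.1 kHz mono file
assert sr == 44100
log_mel = torch.log(mel(wav) + 1e-5)      # log-compress for plotting
print(log_mel.shape)                       # (channels, n_mels, frames)
```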

@mush42
Owner

mush42 commented Sep 23, 2024

@adriandewitts @w11wo

Thanks all for your feedback.

There are a few suspects here worth investigating:

  1. Data preprocessing: do you use any of the audio filters (low/high pass, pre-emphasis, etc.)? See the sketch after this list.
  2. The sub-discriminator: I added a new sub-discriminator (mssbcqtd) designed for HiFi-GAN, and it caused some convergence issues with a LightSpeech model.
  3. The vocoder: except for LightSpeech, I'm using the default params of Vocos/WaveNeXt; maybe we need to fine-tune those?
  4. Another cause?
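
For point 1, a minimal sketch of what such preprocessing filters typically look like, so it's clear what to check for in the pipeline; the cutoff frequency and the 0.97 pre-emphasis coefficient are common defaults I'm assuming here, not values from optispeech:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preemphasis(x, coef=0.97):
    # First-order pre-emphasis: boosts high frequencies relative to lows.
    return np.append(x[0], x[1:] - coef * x[:-1])

def highpass(x, sr, cutoff=40.0, order=4):
    # Butterworth high-pass, commonly used to strip low-frequency rumble.
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, x)
```

If either of these is applied before mel extraction and not accounted for at synthesis time, it changes the high-frequency balance the model learns, which could show up as the harshness described above.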

@w11wo

w11wo commented Sep 23, 2024

Hi @mush42, thanks for the response. Here is the config that we've used:

https://github.com/bookbot-hive/optispeech/blob/main/configs/data/feature_extractor/48khz.yaml

We didn't add any filters or change anything from the default feature extractor other than the FFT size, window, f_min, and f_max, so I assume it's not because of these.

I've also tried (2), but it caused a NaN loss and diverged, so I've turned that off as well.

I'm curious about (3), but first I wanted to know if there's anything wrong with our preprocessing.

@mush42
Owner

mush42 commented Sep 29, 2024

@w11wo @adriandewitts any new findings?

@w11wo

w11wo commented Sep 29, 2024

Hi @mush42, we haven't found a solution to this issue yet. As I've mentioned, we didn't do any audio preprocessing such as low/high pass filters, etc., and we've used the default configs (other than bumping to 44.1kHz). I was wondering if you had any solutions in mind.

Also, I didn't run into this issue when I trained FastSpeech2 + MB-MelGAN; perhaps there are ideas worth borrowing from there?
