
Harshness of speech #10

Open
adriandewitts opened this issue Sep 23, 2024 · 5 comments

Comments

@adriandewitts

Hi there,

Thank you all for the great work on Optispeech. @w11wo and I have been getting some great results.

I have a sound engineering background, and I’ve got a good idea of the different kinds of issues related to voice quality. I’d like to help and contribute from that perspective.

Optispeech doesn’t have the usual artefacts we’ve heard before, which has been great.

I’ve noticed a certain “harshness” in our trained voices. The screenshot below, from Audacity’s spectrogram view, shows this: the bottom window is the original recorded voice, and the top is the Optispeech output.

The screenshot highlights one example of an ’s’ (sibilant) sound. In the original, you can see a gentle rise and fall of energy in the higher frequencies, roughly 4–18 kHz. Optispeech’s character is loud sibilance across the spectrum with a pronounced start and stop, and you can clearly make out the sibilant sounds in the speech.
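
To make the comparison a little less subjective, here is a minimal sketch of one way to measure energy in the sibilance band over time. The file names, the 4–18 kHz band, and the use of SciPy are my own assumptions for illustration, not anything from the project:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def band_energy_db(path, f_lo=4000, f_hi=18000):
    # Mean power (dB) per frame within a frequency band, for eyeballing sibilance.
    sr, audio = wavfile.read(path)                 # assumes a mono 16-bit PCM WAV
    audio = audio.astype(np.float32) / 32768.0
    freqs, _, spec = spectrogram(audio, fs=sr, nperseg=2048)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    energy = spec[band].mean(axis=0)               # average over the band, per frame
    return 10 * np.log10(energy + 1e-10)

# Hypothetical file names: an original recording vs. the synthesized version.
orig = band_energy_db("original.wav")
synth = band_energy_db("optispeech.wav")
print("peak band level (dB):", orig.max(), synth.max())
```

Plotting the two curves against each other shows the gentle rise and fall in the original versus the abrupt start and stop in the synthesized output.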

Please let me know if you have any questions or if there is any way that I can help the project in general.

[Screenshot: Audacity spectrogram comparison, original recording (bottom) vs. Optispeech output (top)]

@w11wo

w11wo commented Sep 23, 2024

@mush42 I think this is related to the reports here: #2 (comment). FYI, we're also using the new ensemble pitch extractor, as suggested in that same issue thread.

I'll also attach some mel-spectrogram samples from the data/ folder (result of feature extraction), in case that helps.

[Three attached mel-spectrogram images from the data/ folder]

And their params

sr = 44100, n_fft = 2048, n_mels = 80, f_min = 0, f_max = 22050
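
For anyone wanting to reproduce plots like these, a minimal sketch of computing a mel-spectrogram with those parameters using torchaudio; only sr, n_fft, n_mels, f_min, and f_max come from the params above, while the hop/window lengths, the log compression, and the file name are my own assumptions rather than the project's actual extractor:

```python
import torch
import torchaudio

# Hop/window lengths below are assumed defaults, not taken from optispeech.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    win_length=2048,
    hop_length=512,
    f_min=0.0,
    f_max=22050.0,
    n_mels=80,
)

wav, sr = torchaudio.load("sample.wav")   # hypothetical 44.1 kHz mono file
assert sr == 44100
log_mel = torch.log(mel(wav) + 1e-5)      # log-compress for plotting
print(log_mel.shape)                       # (channels, n_mels, frames)
```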

@mush42
Owner

mush42 commented Sep 23, 2024

@adriandewitts @w11wo

Thanks all for your feedback.

There are a few suspects here worth investigating:

  1. Data preprocessing: do you use any of the audio filters (low/high pass, pre-emphasis, etc.)? See the sketch after this list.
  2. The sub-discriminator: I added a new sub-discriminator (mssbcqtd) designed for HiFi-GAN, and it caused some convergence issues with a LightSpeech model.
  3. The vocoder: except for LightSpeech, I'm using the default params of Vocos/WaveNeXt; maybe we need to fine-tune those?
  4. Another cause?
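
For point 1, a minimal sketch of what such preprocessing filters typically look like, so it's clear what to check for in the pipeline; the cutoff frequency and the 0.97 pre-emphasis coefficient are common defaults I'm assuming here, not values from optispeech:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def preemphasis(x, coef=0.97):
    # First-order pre-emphasis: boosts high frequencies relative to lows.
    return np.append(x[0], x[1:] - coef * x[:-1])

def highpass(x, sr, cutoff=40.0, order=4):
    # Butterworth high-pass, commonly used to strip low-frequency rumble.
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, x)
```

If either of these is applied before mel extraction and not accounted for at synthesis time, it changes the high-frequency balance the model learns, which could show up as the harshness described above.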

@w11wo

w11wo commented Sep 23, 2024

Hi @mush42, thanks for the response. Here is the config that we've used:

https://github.com/bookbot-hive/optispeech/blob/main/configs/data/feature_extractor/48khz.yaml

We didn't add any filters or change anything from the default feature extractor other than the FFT size, window, f_min, and f_max, so I assume it's not because of these.

I've also tried (2), but it caused a NaN loss and diverged, so I've turned that off as well.

I'm curious about (3), but first I wanted to know if there's anything wrong with our preprocessing.

@mush42
Owner

mush42 commented Sep 29, 2024

@w11wo @adriandewitts any new findings?

@w11wo

w11wo commented Sep 29, 2024

Hi @mush42, we haven't found a solution to this issue yet. As I've mentioned, we didn't do any audio preprocessing such as low/high pass filters, etc., and we've used the default configs (other than bumping to 44.1kHz). I was wondering if you had any solutions in mind.

Also, I didn't run into this issue when I trained FastSpeech2 + MB-MelGAN; perhaps there are ideas worth borrowing from there?
