Trim silences during synthesizer preprocess #501
Hi Bluefish (big fan of your work here!), thanks for the tip! Can you correct me if I'm wrong? Do you mean here, between `# Load the audio waveform` and `# Get the corresponding text`? Thanks in advance!
@javaintheuk Right before checking hparams.rescale. I have noticed that for a few utterances in VCTK the preprocess result will be None or an empty wav, which causes an error, so if you experience this you could follow up `preprocess_wav` with a guard like the sketch below.
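A minimal sketch of such a guard, assuming the surrounding function can skip an utterance by returning None; the import path and the skip behavior are assumptions for illustration, not code quoted from this thread:

```python
# Trim silences with the encoder's VAD-based preprocessing, then skip
# utterances that come back empty. The import path and the return-None
# skip are assumptions, not the author's exact code.
from encoder import inference as encoder

wav = encoder.preprocess_wav(wav)
if wav is None or len(wav) == 0:
    return None  # utterance was all silence; skip it
```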
Thanks for expressing your interest in this idea. I will consider submitting a pull request once I've figured out a good implementation. Anyone in the community is also welcome to contribute ideas for better preprocessing or to submit a PR.
@javaintheuk In #472, I settled on this implementation: https://github.com/blue-fish/Real-Time-Voice-Cloning/compare/1d0d650...blue-fish:d692584 Trimming silences from training data is very important when working with fatchord's tacotron1 model, so I am going to bundle it with the pytorch synthesizer.
I preprocessed LibriSpeech using
Edit: I did not notice a difference in the model between splitting at 0.2 and 0.4 seconds when VAD is applied. Now trying splitting on silences of 0.05 seconds without VAD.
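For reference, a hypothetical way to split an utterance on silences of a minimum duration using librosa; the actual implementation in the linked diff may differ, and `min_silence` and `top_db` here are illustrative parameters:

```python
import librosa

def split_on_silences(wav, sr, min_silence=0.05, top_db=40):
    # Non-silent intervals as (start, end) sample indices; anything more
    # than top_db quieter than the peak counts as silence.
    intervals = librosa.effects.split(wav, top_db=top_db)
    if len(intervals) == 0:
        return []  # the whole utterance is silence

    segments, start = [], intervals[0][0]
    for (_, prev_end), (next_start, _) in zip(intervals[:-1], intervals[1:]):
        # Cut only where the silent gap lasts at least min_silence seconds
        if (next_start - prev_end) / sr >= min_silence:
            segments.append(wav[start:prev_end])
            start = next_start
    segments.append(wav[start:intervals[-1][1]])
    return segments
```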
Thanks for the update! :)
Synthesizer preprocess generally does not trim silences from wav files in the dataset. (An exception is if the dataset has alignments, such as LibriSpeech. Those alignment files contain data that is used to trim leading and trailing silence from an utterance.)
We should apply voice activity detection (webrtcvad) to help trim excess silence from other datasets like VCTK. I notice that my synth models trained on VCTK synthesize a lot of leading and trailing silence, and I think this is the reason. All that is needed is to add this line:
wav = encoder.preprocess_wav(wav)
after librosa loads the wav.

Real-Time-Voice-Cloning/synthesizer/preprocess.py, lines 57 to 84 at commit a32962b
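Putting the proposal together, a sketch of how that line would sit in the wav-loading path of synthesizer/preprocess.py; the surrounding variable names (`wav_fpath`) and the rescale step (`hparams.rescaling_max`) are assumed from common Tacotron hparams conventions, not copied from the file:

```python
import librosa
import numpy as np

# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), sr=hparams.sample_rate)

# Proposed addition: trim excess silence (webrtcvad under the hood)
wav = encoder.preprocess_wav(wav)

# Existing rescale step that the trimming should come before
if hparams.rescale:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max
```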