New pretrained synthesizer model (tensorflow) #538
This model still has the occasional attention failure. However, this is not caused by Corentin's modifications to Rayhane's taco2. I have studied the differences line by line and concluded that no error was introduced. Rather, I think the attention problems are inherent to the SV2TTS architecture, particularly because the speaker embedding is an input to the attention mechanism. Attention is problematic even in single-speaker Tacotrons, and it gets worse in the multispeaker case due to the speaker embedding concatenation. This highlights the need for a better attention mechanism for SV2TTS.
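To make the point about the concatenation concrete, here is a minimal numpy sketch of how SV2TTS conditions attention on the speaker (shapes are illustrative only; the actual synthesizer does this in TensorFlow inside the Tacotron graph):

```python
# Minimal sketch: the speaker embedding is tiled across encoder timesteps and
# concatenated to every encoder output frame, so the attention memory (keys /
# values) carries speaker information at each step. Sizes below are made up.
import numpy as np

T_enc, enc_dim, spk_dim = 120, 512, 256
encoder_outputs = np.random.randn(T_enc, enc_dim)   # one utterance, no batch dim
speaker_embedding = np.random.randn(spk_dim)        # d-vector from the speaker encoder

tiled = np.tile(speaker_embedding, (T_enc, 1))                         # (T_enc, spk_dim)
attention_memory = np.concatenate([encoder_outputs, tiled], axis=-1)   # (T_enc, enc_dim + spk_dim)

print(attention_memory.shape)  # (120, 768): attention attends over speaker-conditioned memory
```

Because every alignment score now also depends on the speaker embedding, attention has to cope with that extra variation, which is why I think the failures get worse in the multispeaker case.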
Amazing work! Thanks @blue-fish!
Why are your Dropbox links not working?
Trained on LibriSpeech, using the current synthesizer (tensorflow). This performs similarly to the current model, with fewer random gaps appearing in the middle of synthesized utterances. It handles short input texts better too.
Download link: https://www.dropbox.com/s/3kyjgew55c4yxtf/librispeech_270k_tf.zip?dl=0
Unzip the file and move the `logs-pretrained` folder to `synthesizer/saved_models`.

I am not going to provide scripts to reproduce the training. For anyone interested, you will need to curate LibriSpeech to have more consistent prosody. This is what I did when running `synthesizer_preprocess_audio.py`:
1. Use `silence_min_duration_split=0.05`.
2. Run `encoder.preprocess_wav()` on each wav; this uses voice activity detection to trim silences (see Trim silences during synthesizer preprocess #501). Compare the lengths of the "before" and "after" wavs: if they don't match, a silence was detected and the utterance is discarded. I keep the "before" wav when the lengths match. (A rough sketch of this filtering follows the list.)
3. Edit `datasets_root/SV2TTS/synthesizer/train.txt` to include only utterances between 225 and 600 mel frames (2.8 to 7.5 sec). This leaves 48 hours of training data.
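For anyone who wants to attempt this, below is a rough Python sketch of the two filters, not my actual script. The `encoder.audio` import path, the assumption that the mel frame count is the fifth field of `train.txt`, and the 12.5 ms hop behind the 2.8 to 7.5 sec figure are things to verify against your own checkout.

```python
# Rough sketch of the curation described above (not the exact script used).
import librosa
from encoder.audio import preprocess_wav  # VAD-based trimming; import path may differ

def has_no_silence(wav_fpath, sampling_rate=16000):
    """Keep a wav only if VAD trimming does not shorten it (no silence detected)."""
    wav, _ = librosa.load(str(wav_fpath), sr=sampling_rate)
    trimmed = preprocess_wav(wav, source_sr=sampling_rate)
    # If the lengths differ, a silence was detected and trimmed -> discard the utterance.
    return len(trimmed) == len(wav)

def filter_train_txt(in_path, out_path, min_frames=225, max_frames=600):
    """Keep only metadata rows whose mel length falls within [min_frames, max_frames]."""
    with open(in_path, encoding="utf-8") as f:
        rows = [line.strip().split("|") for line in f if line.strip()]
    # rows[i][4] is assumed to be the mel frame count; check your train.txt columns.
    kept = [r for r in rows if min_frames <= int(r[4]) <= max_frames]
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines("|".join(r) + "\n" for r in kept)
```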