Replies: 3 comments 5 replies
-
I've just started training an LJSpeech model (Tacotron 2 DDC) with gruut phonemes (split into single characters by 🐸 TTS). I'll update my progress here.
-
New pull request: #561. It should be ready for training.
-
Pull request #561 has been merged!
-
TTS is the abbreviation for text-to-speech. Text is written with graphemes. These can be simple ASCII characters for English, accented characters (è é ä ü ö ï Ë à etc.) for some European languages, Greek and Cyrillic characters for other European languages, or other scripts, such as logographic characters for some Asian languages and the Arabic script.
Coqui-TTS converts the graphemes into numbers to do the computation for machine learning. Up to version 0.0.13.2 of Coqui-TTS, the graphemes are first converted into IPA phonemes with the embedded espeak-ng package. In a second step, the numbers are generated as the indices of the IPA phonemes in the character list given in the model configuration file.
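The second step can be illustrated with a minimal sketch. The symbol list below is a hypothetical, much shortened inventory and not the list from any released model configuration. The function name `phonemes_to_ids` is also made up for this example.

```python
# Hypothetical symbol inventory; a real model configuration lists its own characters.
PHONEME_SYMBOLS = ["_", " ", "a", "e", "h", "l", "o", "ə", "ʊ", "ˈ"]

# The position of each symbol in the list is the number fed to the model.
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(PHONEME_SYMBOLS)}


def phonemes_to_ids(phonemes):
    """Map a phoneme string to indices, one per character, skipping unknown symbols."""
    return [SYMBOL_TO_ID[ch] for ch in phonemes if ch in SYMBOL_TO_ID]


# Illustrative IPA string; in the pipeline described above it would come from the phonemizer.
ids = phonemes_to_ids("həlˈoʊ")
print(ids)
```

Because the numbers are list positions, a model trained with one symbol list can only be reused with a front-end that produces the same symbols in the same order.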
In versions 0.0.14 and 0.0.15 of Coqui-TTS, espeak-ng has been removed due to its restrictive license. The tool gruut, created by Michael Hansen, alias @synesthesiam, for his Rhasspy project, is considered a good candidate to replace espeak-ng.
A related pull request, prepared by Michael Hansen, has been discussed over the last few days, mainly concerning the interface between gruut and Coqui-TTS and the process of converting phonemes into numbers for the ML computation. Here are the current conclusions, as I understand them:
To provide open speech technologies for everyone, Coqui-TTS should remain a single package with embedded grapheme > phoneme > number conversion.
To ensure support for the existing released models, the embedded conversion should comply with the existing rules as far as possible. This is important for inference and for fine-tuning (or transfer learning from) existing models, to create new voices and language models.
The complexity of the new code should be low, to facilitate the maintenance of the whole package.
An optional input, controlled by a flag argument, should make it possible to enter the numbers for the ML computation directly, without conversion. This will allow developers to create their own front-end and use different speech segments for training and inference, for example diphones, syllables or whole words.
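The last conclusion can be sketched as follows. This is only an illustration of the idea: the function `text_to_ids`, the flag name `ids_given` and the stand-in phonemizer are all hypothetical and do not correspond to an existing Coqui-TTS interface.

```python
def text_to_ids(text, symbol_to_id, phonemize, ids_given=False):
    """Return model input ids for `text`.

    `ids_given` is a hypothetical flag: when True, `text` is read as
    space-separated indices and passed through without any conversion.
    """
    if ids_given:
        return [int(token) for token in text.split()]
    # Default path: grapheme -> phoneme -> index, one index per character.
    return [symbol_to_id[ch] for ch in phonemize(text) if ch in symbol_to_id]


def fake_phonemize(text):
    """Stand-in phonemizer for this demo only; it just lowercases the text."""
    return text.lower()


table = {"a": 0, "b": 1, "c": 2}
converted = text_to_ids("Cab", table, fake_phonemize)
passed_through = text_to_ids("2 0 1", table, fake_phonemize, ids_given=True)
print(converted, passed_through)
```

With such a flag, an external front-end stays free to define what each index means (a diphone, a syllable, a word), as long as training and inference use the same mapping.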
One point is still open. @erogol proposed moving the subject to the Coqui-TTS discussion forum, which I have tried to do with this thread.
The open point is the processing of phonemes specified with multiple symbols. As an example, I use the specific Luxembourgish diphthong æːɪ in the middle of the Luxembourgish word Zäit (time, Zeit, temps). This phoneme is split into three numbers (indices of æ, ː, ɪ) by Coqui-TTS and into one number (index of æːɪ) by gruut.
I am new to the ML domain, but my feeling (based on my experience with older TTS systems) is that the second conversion leads to better models and shorter training times.
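The difference between the two conversions can be made concrete with a small sketch. Both symbol lists and the transcription `tsæːɪt` are assumptions chosen for illustration, and the greedy longest-match splitter is only a stand-in for a phoneme-aware front-end, not the way gruut is implemented.

```python
# Hypothetical inventories: one entry per character vs. one entry per phoneme.
CHAR_SYMBOLS = ["_", "s", "t", "æ", "ɪ", "ː"]
PHONEME_SYMBOLS = ["_", "s", "t", "æːɪ"]


def split_per_character(phonemes, symbols):
    """One index per character, so a multi-symbol phoneme yields several indices."""
    table = {symbol: index for index, symbol in enumerate(symbols)}
    return [table[ch] for ch in phonemes]


def split_longest_match(phonemes, symbols):
    """One index per inventory entry, always taking the longest entry that matches."""
    table = {symbol: index for index, symbol in enumerate(symbols)}
    max_len = max(len(symbol) for symbol in symbols)
    ids, pos = [], 0
    while pos < len(phonemes):
        for length in range(min(max_len, len(phonemes) - pos), 0, -1):
            candidate = phonemes[pos:pos + length]
            if candidate in table:
                ids.append(table[candidate])
                pos += length
                break
        else:
            raise ValueError(f"unknown symbol at position {pos}: {phonemes[pos]!r}")
    return ids


word = "tsæːɪt"  # assumed transcription of "Zäit", for illustration only
per_character = split_per_character(word, CHAR_SYMBOLS)
per_phoneme = split_longest_match(word, PHONEME_SYMBOLS)
print(len(per_character), len(per_phoneme))
```

In this sketch the same word becomes six indices in the first case and four in the second, so the model sees a shorter input sequence and a dedicated embedding for the diphthong. Whether that actually improves quality or training time is the question the comparison is meant to answer.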
I have started to compare models created by both methods with the same dataset. I will report my findings as soon as possible.