Deep Learning Phonemes #549

mbarnig · 2021-06-07T09:57:50Z

TTS is the abbreviation for text-to-speech. Text is written with graphemes. These can be simple ASCII characters for English, accented characters (è é ä ü ö ï Ë à etc.) for some European languages, Greek and Cyrillic characters for other European languages, or pictorial symbols for Asian and Arabic languages.

Coqui-TTS converts the graphemes into numbers for the machine-learning computation. Up to version 0.0.13.2 of Coqui-TTS, the graphemes are first converted into IPA phonemes with the embedded espeak-ng package. In a second step, the numbers are generated as indices of the IPA phonemes listed in the model configuration file, as follows:

    # Phonemes definition (All IPA characters)
    _vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
    _non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
    _pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
    _suprasegmentals = "ˈˌːˑ"
    _other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
    _diacrilics = "ɚ˞ɫ"
    _phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics
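
To make the second step concrete, here is a minimal sketch of a per-character index lookup over the listing above. It is a simplification: the helper names `phoneme_to_id` and `phonemes_to_ids` are hypothetical, and the real package also handles padding, punctuation and other symbols, so the actual indices will differ.

```python
# Sketch only: symbol strings copied from the listing above.
# The helper names below are hypothetical, not part of Coqui-TTS.
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = (
    _vowels
    + _non_pulmonic_consonants
    + _pulmonic_consonants
    + _suprasegmentals
    + _other_symbols
    + _diacrilics
)

# Map every single IPA character to its position in the combined string.
phoneme_to_id = {symbol: index for index, symbol in enumerate(_phonemes)}


def phonemes_to_ids(ipa_text):
    """Return one index per character, skipping characters not in the list."""
    return [phoneme_to_id[ch] for ch in ipa_text if ch in phoneme_to_id]


if __name__ == "__main__":
    # Each character of the IPA string becomes its own index.
    print(phonemes_to_ids("æːɪ"))
```

Because the lookup works on single characters, any phoneme written with several symbols produces several indices.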

In versions 0.0.14 and 0.0.15 of Coqui-TTS, espeak-ng has been removed due to its restrictive license. The tool gruut, created by Michael Hansen (alias @synesthesiam) for his Rhasspy project, is considered a good candidate to replace espeak-ng.

A related pull request, prepared by Michael Hansen, has been discussed over the last few days, mainly concerning the interface between gruut and Coqui-TTS and the process of converting phonemes into numbers for the ML computation. Here are the current conclusions (to my understanding):

To provide open speech technologies for everyone, Coqui-TTS should remain a single package with embedded grapheme > phoneme > number conversion.
To ensure support for the existing released models, the embedded conversion should comply with the existing rules as far as possible. This is important for inference and for fine-tuning (or transfer learning) of existing models, to create new voices and language models.
The complexity of the new code should be low, to facilitate the maintenance of the whole package.
An optional input, controlled by a flag argument, should allow entering the numbers for the ML computation directly, without conversion. This will allow developers to create their own front-end and use different speech segments for training and inference, for example diphones, syllables or whole words.
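
The last conclusion could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name `prepare_model_input`, the flag `ids_are_precomputed` and the space-separated input format are hypothetical and do not describe the pull request or any existing Coqui-TTS API.

```python
def prepare_model_input(raw, symbol_to_id, ids_are_precomputed=False):
    """Return the list of integer ids fed to the model.

    Hypothetical helper: with the flag set, `raw` is read as
    space-separated ids and the embedded conversion is skipped.
    """
    if ids_are_precomputed:
        ids = [int(token) for token in raw.split()]
        # Reject ids that fall outside the symbol table.
        out_of_range = [i for i in ids if not 0 <= i < len(symbol_to_id)]
        if out_of_range:
            raise ValueError(f"ids outside the symbol table: {out_of_range}")
        return ids
    # Default path: per-character lookup, unknown characters are skipped.
    return [symbol_to_id[ch] for ch in raw if ch in symbol_to_id]


if __name__ == "__main__":
    table = {"a": 0, "b": 1}  # toy symbol table for the example
    print(prepare_model_input("ab", table))
    print(prepare_model_input("1 0 1", table, ids_are_precomputed=True))
```

A custom front-end would then only need to agree with the model on the size and meaning of the symbol table.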

One point is still open. @erogol proposed moving the subject to the Coqui-TTS discussion forum, which I have tried to do with the present thread.

The open point is the processing of phonemes specified with multiple symbols. As an example, I use the Luxembourgish diphthong æːɪ in the middle of the Luxembourgish word Zäit (time, Zeit, temps).

This phoneme is split into three numbers (the indices of æ, ː and ɪ) by Coqui-TTS, but mapped to a single number (the index of æːɪ) by gruut.
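
The difference can be illustrated with a small sketch. The two mini-inventories, the simplified transcription `tsæːɪt` and both function names are made up for this example, and the greedy longest-match tokenizer is only a stand-in: it is not how gruut or Coqui-TTS is implemented.

```python
# Hypothetical mini-inventories, for illustration only.
char_inventory = ["t", "s", "æ", "ː", "ɪ"]   # one entry per character
phoneme_inventory = ["t", "s", "æːɪ"]        # the diphthong is one entry


def encode_per_character(ipa, inventory):
    """One id per character of the IPA string."""
    ids = {symbol: index for index, symbol in enumerate(inventory)}
    return [ids[ch] for ch in ipa]


def encode_longest_match(ipa, inventory):
    """Greedy longest-match: multi-symbol phonemes get a single id."""
    ids = {symbol: index for index, symbol in enumerate(inventory)}
    max_len = max(len(symbol) for symbol in inventory)
    result = []
    pos = 0
    while pos < len(ipa):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(ipa) - pos), 0, -1):
            chunk = ipa[pos:pos + length]
            if chunk in ids:
                result.append(ids[chunk])
                pos += length
                break
        else:
            raise ValueError(f"no symbol matches at position {pos}: {ipa[pos]!r}")
    return result


if __name__ == "__main__":
    word = "tsæːɪt"  # simplified transcription of "Zäit"
    print(encode_per_character(word, char_inventory))     # [0, 1, 2, 3, 4, 0]
    print(encode_longest_match(word, phoneme_inventory))  # [0, 1, 2, 0]
```

With per-character encoding the model sees six input steps for this word, with phoneme-level encoding only four.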

I am new to the ML domain, but my feeling (based on my experience with older TTS systems) is that the second conversion leads to better models and shorter training times.

I have started to compare models created with both methods on the same dataset. I will report on my findings as soon as possible.

synesthesiam · 2021-06-07T20:45:04Z

I've just started training an LJSpeech model (Tacotron 2 DDC) with gruut phonemes (split into single characters by 🐸 TTS). I'll update my progress here.

4 replies

synesthesiam Jun 8, 2021

I trained an LJSpeech model overnight (Tacotron2 DDC) with gruut's phonemes for only 40K steps, and here are the results: https://synesthesiam.github.io/coqui-tts-tests/

My only modifications to the sample config were disabling mixed precision and altering the gradual training schedule to speed things up:

    "mixed_precision": false,
    "gradual_training": [
        [0, 6, 150],
        [1000, 4, 75],
        [5000, 3, 75],
        [10000, 2, 75]
    ],
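
For readers unfamiliar with the schedule: as far as I understand the comments in the sample config, each entry is `[first_step, r, batch_size]`, where `r` is the decoder reduction factor. The sketch below shows how such a schedule could be looked up for a given training step. The function name `schedule_at` is hypothetical and this is not the trainer's own code.

```python
import bisect

# Schedule copied from the config fragment above.
# Assumed entry layout: [first_step, r, batch_size].
schedule = [[0, 6, 150], [1000, 4, 75], [5000, 3, 75], [10000, 2, 75]]


def schedule_at(global_step, schedule):
    """Return (r, batch_size) of the last entry whose first_step <= global_step."""
    starts = [entry[0] for entry in schedule]
    index = bisect.bisect_right(starts, global_step) - 1
    if index < 0:
        raise ValueError(f"no schedule entry covers step {global_step}")
    _, r, batch_size = schedule[index]
    return r, batch_size


if __name__ == "__main__":
    for step in (0, 999, 1000, 5000, 40000):
        print(step, schedule_at(step, schedule))
```

Under this reading, the run reaches `r = 2` after 10K steps and stays there for the remaining 30K steps.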

Mixed precision resulted in a NaN loss after ~2K steps as always. This was trained on an RTX 3090 with nvidia-docker and the pytorch:21.02-py3 Docker image.

erogol Jun 8, 2021
Maintainer

I don't hear any pronunciation issues. Good work :).

What vocoder did you use?

synesthesiam Jun 8, 2021

Thanks! I used the pre-trained MelGAN vocoder with some denoising applied (a future pull request, if you're interested).

I don't know what's changed in the loss functions, but the alignment happened extremely quickly. Well done! I know it would take more training to get the intonation and cadence better, but this was certainly the fastest I've ever gotten a Tacotron 2 model off the ground 🙂

I'll go ahead and add the tests to the existing pull request, as well as remove the static phoneme map. Do you want me to keep primary/secondary stress characters out? I left them out originally to fit closer with the eSpeak phonemes I saw, but gruut has them for English and a few other languages.

erogol Jun 9, 2021
Maintainer

Good to hear about your positive experience with Tacotron2 :).

Denoising definitely sounds interesting. I guess we can add another submodule for such applications. Feel free to send a PR for that.

Any characters can stay as long as they are also in symbols.py.

Looking forward to merging the PR.

synesthesiam · 2021-06-09T18:50:22Z

New pull request: #561

Should be ready for training.

1 reply

erogol Jun 15, 2021
Maintainer

I'll give it a shot after the trainer API is done.

But this should not stop people from experimenting.

synesthesiam · 2021-06-25T13:14:04Z

Pull request #561 has been merged!

0 replies
