v0.0.14 #492

erogol · 2021-05-19T12:09:30Z

erogol
May 19, 2021
Maintainer

🐸 v0.0.14

🐞Bug Fixes

Remove breaking line from Tacotron models. (👑 @a-froghyra)

💾 Code updates

BREAKING: Coqpit integration for config management and the first 🐸TTS recipe, for LJSpeech. Check 👩‍✈️ Coqpit refactor #476.

Every model is now tied to a Python class that defines its configuration schema. This provides a better interface and makes the default values, expected value types, and mandatory fields explicit to the user.

Model-specific configs are defined under TTS/tts/configs and TTS/vocoder/configs. TTS/config/shared_configs.py hosts configs that are shared by all 🐸 TTS models. Configs shared by TTS models live under TTS/tts/configs/shared_configs.py, and those shared by vocoder models live under TTS/vocoder/configs/shared_config.py.

For example, TacotronConfig follows the inheritance chain BaseTrainingConfig -> BaseTTSConfig -> TacotronConfig.
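To make the layering concrete, here is a minimal sketch of such a chain. It uses plain standard-library dataclasses as a stand-in for Coqpit, and every field name and default below is hypothetical, chosen only for illustration; the real config classes define their own fields.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseTrainingConfig:
    # Hypothetical fields shared by every model.
    batch_size: int = 32
    epochs: int = 1000


@dataclass
class BaseTTSConfig(BaseTrainingConfig):
    # Hypothetical fields shared by TTS models only.
    text_cleaner: str = "basic_cleaners"
    use_phonemes: bool = False


@dataclass
class TacotronConfig(BaseTTSConfig):
    # Hypothetical model-specific fields.
    model: str = "tacotron"
    r: int = 2


def describe(config_cls):
    """Return {field name: (type name, default value)} for a config class."""
    return {
        f.name: (getattr(f.type, "__name__", str(f.type)), f.default)
        for f in fields(config_cls)
    }


# Override one inherited default; everything else keeps its declared default.
config = TacotronConfig(batch_size=16)
```

Calling `describe(TacotronConfig)` lists the inherited and model-specific fields together with their types and defaults, which is the kind of visibility the class-based scheme is meant to give.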

BREAKING: Remove phonemizer support due to License conflict.

This essentially deprecates the support for all the models using phonemes as input. Feel free to suggest in-place options if you are affected by this change.

Start hosting 👩‍🍳 recipes under 🐸 TTS. The first recipe is for Tacotron2-DDC with the LJSpeech dataset, under TTS/recipes/.

Please check here for more details.

Add extract_tts_spectrograms.py that supports GlowTTS and Tacotron1-2. (👑 @Edresson)
Add version.py (👑 @chmodsss)

This discussion was created from the release v0.0.14.

george-roussos · 2021-05-19T14:15:47Z

george-roussos
May 19, 2021

Regarding the removal of phonemizer and replacement suggestions, I can think of an option, which is using the G2P function of Montreal Forced Aligner.

The basic requirement is a pronunciation lexicon that will be used for the task (languages such as English and German have very big ones that also happen to be open source). It is an option that is quite stable in my personal opinion (and observation), straightforward to train and much quicker to generate pronunciations for unknown words. You can also choose to train models on different n-gram orders and call them based on word lengths.

The pipeline would be as follows:

Perform a lexicon lookup to acquire pronunciation for each word
If the word is there, grab the pronunciation and synthesize
If not, use the g2p model to generate a probable pronunciation
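The pipeline above can be sketched roughly as follows. This is only an illustration: the toy lexicon entries are made up, and `spell_out_g2p` is a hypothetical placeholder standing in for a trained G2P model (such as one trained with MFA), not an actual MFA call.

```python
from typing import Callable, Dict, List

Lexicon = Dict[str, List[str]]

# Toy lexicon with hypothetical ARPAbet-style entries (digits mark stress).
LEXICON: Lexicon = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def spell_out_g2p(word: str) -> List[str]:
    """Placeholder for a trained G2P model: one uppercase symbol per letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def phonemize(
    text: str, lexicon: Lexicon, g2p: Callable[[str], List[str]]
) -> List[List[str]]:
    """Look each word up in the lexicon; fall back to the G2P model if missing."""
    pronunciations = []
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if not word:
            continue
        pron = lexicon.get(word)
        if pron is None:
            # Unknown word: generate a probable pronunciation instead.
            pron = g2p(word)
        pronunciations.append(pron)
    return pronunciations
```

With this layout the fast path is a dictionary look-up, and the slower model is only invoked for out-of-vocabulary words.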

The results are much better than with phonemizer, because this way we can implement stress markers (which phonemizer does not support), have more fine-grained control over which model is used (and thus improve accuracy), and get better pronunciations overall (phonemizer performs rule-based guessing, and in many languages these rules change depending on phenomena such as compounding). In terms of speed: yes, guessing pronunciations for unknown words does not happen instantly, but using a transducer for the task is faster than instantiating phonemizer every single time, and when every word of the text to synthesize is in the pronunciation lexicon, the look-up is very fast.

4 replies

erogol May 19, 2021
Maintainer Author

I checked MFA before, but to me the limitations are:

No number or acronym normalization.
No contextual G2P. That is, it does not handle words that are pronounced differently depending on the context.

Do you think MFA can provide these?

I guess using raw phonemes without normalization only solves the problem for words that are hard to pronounce, like echo. Would you agree?

Can you also expand on how MFA helps us to have better control over the speech? By the lexicon lookup?

george-roussos May 19, 2021

I agree that MFA does not provide contextual G2P, but phonemizer doesn't either, or does it? Do you mean contextual as in picking the correct pronunciation for homographs? In any case, if contextual is what we are looking for (which I think is a different topic, although very much necessary), we should be looking into tagging the lexicon and by extension the text (so then we have the information for cross-referencing available) using different means.

WRT the acronym normalization, MFA does not do that either, but these can be handled during the text normalisation stage (and then the normalised text may be used for G2P). So I guess what I am trying to say is that we can leverage MFA for what it offers and extend the normalization functionality of Coqui.
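As a rough sketch of what such a normalisation stage in front of G2P could look like: the acronym table below is hypothetical, and reading numbers digit by digit is a deliberate simplification rather than how a real normaliser would verbalise them.

```python
import re

# Hypothetical acronym table; a real system would need a much larger one.
ACRONYMS = {
    "tts": "text to speech",
    "mfa": "montreal forced aligner",
}

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text: str) -> str:
    """Expand digits and known acronyms so that G2P only sees plain words."""

    def expand_number(match):
        # Simplification: read each digit separately ("42" -> "four two").
        spoken = " ".join(DIGITS[int(d)] for d in match.group())
        return " " + spoken + " "

    text = re.sub(r"\d+", expand_number, text)
    # Keep only alphabetic words; punctuation is dropped in this sketch.
    words = re.findall(r"[A-Za-z']+", text)
    return " ".join(ACRONYMS.get(w.lower(), w.lower()) for w in words)
```

The normalised string can then be passed to whichever lexicon or G2P component is in use.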

For the last point: I was talking about how we can implement things like stress markers and so further reduce the number of errors, especially in languages with primary and secondary stresses. Also, since we are able to use multiple MFA models (depending on the length of the word), it is much more likely that we will get the correct phonemic sequence, while the rule-based guessing of phonemizer is much more fixed.
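Choosing a model by word length could be sketched like this. The length thresholds and the stand-in models are hypothetical; in practice each callable would wrap a separately trained G2P model.

```python
from typing import Callable, List, Sequence, Tuple

G2P = Callable[[str], List[str]]


def make_dispatcher(models: Sequence[Tuple[int, G2P]], default: G2P) -> G2P:
    """Build a G2P function that picks a model by word length.

    `models` holds (maximum word length, model) pairs; `default` handles
    any word longer than every listed maximum.
    """
    ordered = sorted(models, key=lambda pair: pair[0])

    def dispatch(word: str) -> List[str]:
        for max_len, model in ordered:
            if len(word) <= max_len:
                return model(word)
        return default(word)

    return dispatch


# Stand-in models that only tag their input, to show which one was chosen.
def short_model(word: str) -> List[str]:
    return ["short", word]


def medium_model(word: str) -> List[str]:
    return ["medium", word]


def long_model(word: str) -> List[str]:
    return ["long", word]


g2p = make_dispatcher([(4, short_model), (8, medium_model)], default=long_model)
```

Here `g2p("frog")` is routed to the short-word model and `g2p("pronunciation")` to the default one.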

erogol May 19, 2021
Maintainer Author

Maybe MFA is not the optimal solution but can be a good start :)

Do you think all that can be taught to a NN? Maybe we can train a Seq2Seq model using the language-specific lexicon as input.

MFA uses n-gram models for G2P for unknown words, right?

In the doc they also say their method does not work well for non-phonemic languages like English if I remember correctly.

Some other libraries I have checked so far, in case they are of interest:

(I am expanding the list as we find more alternatives)

https://github.com/dmort27/epitran
https://github.com/CUNY-CL/wikipron
https://github.com/rhasspy/gruut

george-roussos May 19, 2021

Awesome!

On the seq2seq objective, I was actually going to mention it but did not -- I tried it a while ago, but, personally, the results I got were quite sub-optimal. There were mistakes in the transcription and I found the model quite prone to overfitting, probably because I did not do enough hyperparameter tuning. This may be a problem, since it may imply that the hyperparameters have to be different for every language. Also, the transcription time on CPU was slow. In the end, I abandoned it, because I did not notice any improvements. I will try to dig up the name of the library I used (it was a while ago and I have forgotten it).

MFA uses n-gram models for G2P, yeah. I am also seeing that their WER for some languages (including English) is higher; that makes sense, because English is less transparent orthography-wise. But I think it is definitely worth a try, seeing as there is a big pronunciation lexicon available for English (so many words are already present) and there is a pretrained model for English, too.

mbarnig · 2021-05-20T06:43:00Z

mbarnig
May 20, 2021

I have some experience with TTS from the past, when I explored espeak, MaryTTS and Festival to build a Luxembourgish TTS voice. Last year I published a book in French about the history of speech synthesis. I am relatively new to the domain of machine learning. I revisited my old ideas concerning the creation of an lb-TTS-voice a few weeks ago when Coqui-AI was launched. At the same time I discovered the rhasspy/larynx project developed by Michael Hansen (alias synesthesiam). Larynx uses gruut to transform text into phonemes.

I wonder why there is no close collaboration between coqui-tts and larynx/gruut. I think both Coqui-AI and Rhasspy are outstanding and I would like to congratulate and thank the authors of both projects.

Greetings from Luxembourg,

Marco Barnig

18 replies

domcross May 26, 2021

Can you be more specific? Phonetisaurus is what I assume you're referring to, which relies on pre-compiled binaries.

Exactly, I am referring to the Phonetisaurus dependency. Option 1) sounds most promising to me ;-)

mbarnig May 26, 2021

@kdavis-coqui. This license issue is complicated and tricky. My understanding now is that in all cases we need pronunciation dictionaries with the correct license, whether to use them in gruut or in DeepPhonemizer, to build a language model with a suitable license for Coqui-TTS. I googled for free pronunciation dictionaries and found only sources with a license mismatch. It seems to me that most people don't worry about the license restrictions.

I used the Luxembourgish pronunciation dictionary lb-lu.dic and the phonetic transcription spellchecker-lu.txt to build the grapheme-to-phoneme model g2p.fst with gruut. These dictionaries are licensed under the European Union Public Licence (EUPL version 1.1). The appendix states that the compatible licences according to article 5 EUPL are:

- GNU General Public License (GNU GPL) v. 2
- Open Software License (OSL) v. 2.1, v. 3.0
- Common Public License v. 1.0
- Eclipse Public License v. 1.0
- Cecill v. 2.0

It's not yet clear to me whether the EUPL licence allows me to use or share the Luxembourgish gruut phoneme model g2p.fst.

synesthesiam May 26, 2021

Most of my IPA lexicons have come from here: https://github.com/open-dict-data/ipa-dict which claims an MIT license. The Credits section mentions this cmudict-ipa as a source for English, which does show an MIT license.

kdavis-coqui May 26, 2021
Maintainer

@synesthesiam Generally in machine learning one has to consider at least 3 licenses

License of the data
License of the model trained on the data
License of the code that uses the model

The license of the data dictates the possible model licenses and the model license dictates the possible code licenses.

For your case, cmudict-ipa looks good, with MIT, but the addition of stress markers using syllabify is problematic as the syllabify repo does not specify any license, despite the link from ipa-dict saying syllabify is MIT.

If a repo doesn't specify a license, "no one may reproduce, distribute, or create derivative works from" the repo[1]. So the modified version of cmudict-ipa is not really MIT. But it appears as if the non-modified version of cmudict-ipa is MIT.

As you can see, this can get painful.

kdavis-coqui May 26, 2021
Maintainer

@mbarnig When you say

It seems to me that most people don't worry about the license restrictions.

That's definitely true, and it makes this all the more difficult; it would already be hard even if people didn't make this (understandable) mistake.

As to the EUPL, I don't have any experience with that license. So I can't say anything too helpful.

However, the EUPL does state that

If the Licensee Distributes or Communicates Derivative Works or copies thereof based upon both the Work and another work licensed under a Compatible Licence, this Distribution or Communication can be done under the terms of this Compatible Licence

So it looks like when combining something under EUPL with one of the Compatible Licences you have a clear path forward as to how to license the combined work.

thorstenMueller · 2021-05-26T18:20:24Z

thorstenMueller
May 26, 2021

Probably a stupid idea, but should we ask the maintainer of espeak whether he would be willing to adjust the license for espeak?

2 replies

kdavis-coqui May 26, 2021
Maintainer

@thorstenMueller It's not a stupid idea.

However, the maintainer can't simply re-license a repo. The maintainer would have to get permission from every contributor to the repo to re-license the repo. This usually proves impractical for repos with many contributors.

thorstenMueller May 26, 2021

Okay, thanks for the explanation @kdavis-coqui. So that will probably not happen.

cschaefer26 · 2021-05-26T18:35:55Z

cschaefer26
May 26, 2021

@kdavis-coqui Thx for pointing out the license issues, I will probably have to add a disclaimer to the phonemizer repo. However, if you are interested, I could train a model on an open-source dataset and make it available so you can test it out.

0 replies

mbarnig · 2021-05-26T20:23:39Z

mbarnig
May 26, 2021

@kdavis-coqui and other contributors. Thank you very much for your detailed explanations about the licenses.
The topic is becoming progressively clearer to me. I have learned so much in the last few days with your great communities, not only about licenses, but about machine learning in general, about coding and about sharing ideas.

0 replies

synesthesiam · 2021-05-27T00:31:19Z

synesthesiam
May 27, 2021

@kdavis-coqui Do you think lexicons derived from Wiktionary would have any licensing issues?

1 reply

kdavis-coqui May 27, 2021
Maintainer

@synesthesiam In the licensing terms[1] it states that

Modifications or additions to material that you re-use: When modifying or making additions to text that you have obtained from a Project website, you agree to license the modified or added content under CC BY-SA 3.0 or later (or, as explained above, another license when exceptionally required by the specific Project edition or feature).

However, there are exceptions under "fair use", see section g of the section linked above. I'd guess a lawyer would be best equipped to determine these exceptions.

synesthesiam · 2021-06-02T14:52:46Z

synesthesiam
Jun 2, 2021

I've created a pull request to re-enable phoneme-based TTS models using gruut 🙂

0 replies

Replies: 7 comments 25 replies