Replies: 6 comments 1 reply
-
I have a preference for the second option, since it would let us use multi-character tokens.
-
In this case, don't call it a character list, or even a multi-character list; switch to "token", as in "subwords" (a term borrowed from other NLP work). Even phonemes could be considered tokens, and your list becomes a "vocabulary".
-
The problem with the 2nd approach is tokenization. Since the vocab might contain both "a" and "a`" as separate tokens, we'd need a more advanced way to match tokens in the text. It'd be slower than our current simple tokenization routine, and I don't know how to do that ATM 😄
-
By the way, in some config.json files I see characters written with Unicode escape codes.
Should I keep them like this, or should I convert them to "real" chars (e.g. é, ç, ...)?
-
You can directly copy and paste them.
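Both spellings are equivalent once the config is parsed: a JSON parser decodes an escape like \u00e9 to the same string as the literal character. A quick check (the specific character é is just an illustration):

```python
import json

# The same character written two ways inside a JSON document:
escaped = json.loads('"\\u00e9"')  # appears as \u00e9 in the file
literal = json.loads('"é"')        # appears as the literal character

print(escaped == literal)  # True: both decode to the same string
```

So either form works; the literal form is just easier to read in the file.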
-
For the case of "a" and "a`" you can use a "longest match" rule on the sequence. I don't think we need dynamic programming to find the "best" tokenization or anything like that.
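A minimal sketch of that greedy longest-match rule (the vocabulary and `tokenize` helper here are hypothetical, not part of the 🐸TTS codebase): at each position, try the longest vocabulary entry first, so "a`" wins over "a" when both match.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, emit the
    longest vocabulary entry that matches, falling back to one char."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                tokens.append(candidate)
                i += length
                break
        else:
            # Character not in the vocab: pass it through
            # (a real implementation might map it to an <unk> token)
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"a", "a`", "b"}  # hypothetical vocabulary with a multi-character token
print(tokenize("a`ab", vocab))  # ['a`', 'a', 'b']
```

This stays linear in the text length for short tokens, so it shouldn't be much slower than plain per-character lookup.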
-
Any preference on the way to set character lists for 🐸TTS models?
We currently do
list = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
however, I also see some merit in changing it to the following
list = ['A', 'B', 'C' ... ,'x', 'y', 'z' ]
That way, we could even use multi-character tokens.
What do you think?
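The difference between the two options can be sketched like this (the names below are illustrative, not actual 🐸TTS config keys): iterating a string only ever yields single characters, while a list can hold multi-character entries.

```python
# Option 1: a string — iterating yields single characters only.
chars = "abc"
print(list(chars))  # ['a', 'b', 'c']

# Option 2: a list — entries can be multi-character tokens such as "a`".
vocab = ["a", "b", "c", "a`"]  # hypothetical vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}
print(token_to_id["a`"])  # 3
```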