Replies: 6 comments 1 reply
-
I have a preference for the second option, since it would let us use multi-character tokens.
-
In this case, don't call it a character list, or even a multi-character list; switch to "token", as in "subwords" (a term borrowed from other NLP work). Even phonemes could be considered tokens, and your list becomes a "vocabulary".
-
The problem with the 2nd approach is tokenization. Since the vocab might contain both "a" and "a`" as separate tokens, we'd need a more advanced way to match tokens in the text. It'd be slower than our current simple tokenization routine, and I don't know how to do that ATM 😄
-
By the way, in some config.json files I see characters written with Unicode escape codes.
Should I keep them like this, or should I convert them to "real" chars (e.g. é, ç, ...)?
-
You can directly copy and paste them.
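Both spellings are equivalent once the config is parsed: a JSON parser decodes an escape like \u00e9 to the same string as the literal character. A quick check (the specific character é is just an illustration):

```python
import json

# The same character written two ways inside a JSON document:
escaped = json.loads('"\\u00e9"')  # appears as \u00e9 in the file
literal = json.loads('"é"')        # appears as the literal character

print(escaped == literal)  # True: both decode to the same string
```

So either form works; the literal form is just easier to read in the file.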
-
For the case of "a" and "a`" you can use a "longest match" rule on the sequence. I don't think we need dynamic programming to find the "best" tokenization or anything like that.
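A minimal sketch of that greedy longest-match rule (the vocabulary and `tokenize` helper here are hypothetical, not part of the 🐸TTS codebase): at each position, try the longest vocabulary entry first, so "a`" wins over "a" when both match.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, emit the
    longest vocabulary entry that matches, falling back to one char."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                tokens.append(candidate)
                i += length
                break
        else:
            # Character not in the vocab: pass it through
            # (a real implementation might map it to an <unk> token)
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"a", "a`", "b"}  # hypothetical vocabulary with a multi-character token
print(tokenize("a`ab", vocab))  # ['a`', 'a', 'b']
```

This stays linear in the text length for short tokens, so it shouldn't be much slower than plain per-character lookup.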
-
Any preference on the way to set character lists for 🐸TTS models?
We currently do
list = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
however, I also see some merit in changing it to the following
list = ['A', 'B', 'C' ... ,'x', 'y', 'z' ]
That way, we could even use multi-character tokens.
What do you think?
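The difference between the two options can be sketched like this (the names below are illustrative, not actual 🐸TTS config keys): iterating a string only ever yields single characters, while a list can hold multi-character entries.

```python
# Option 1: a string — iterating yields single characters only.
chars = "abc"
print(list(chars))  # ['a', 'b', 'c']

# Option 2: a list — entries can be multi-character tokens such as "a`".
vocab = ["a", "b", "c", "a`"]  # hypothetical vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}
print(token_to_id["a`"])  # 3
```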