Properly handle inputs for IPA, character, Arpabet modelling #216
Comments
In EveryVoice this is set to false by default With language
Keeping a list of issues to raise:
Because:
We need to handle the processing (normalization, phonemization, tokenization) of these various representations of text. To do that, I propose:
- Add a `DatasetTextRepresentation` Enum field that requires users to specify the representation level of their text (Character|IPA|Arpabet).
- Add a `TargetTrainingRepresentationLevel` Enum field (characters|phones|phonological features) to models which use text.
- Use the `g2p` library, and use the `ipatok` library to tokenize output (with `unknown=True` to allow punctuation and other unknown characters through).
- Add a `character` sequence and a `phone` sequence to the filelist generated by the preprocessor if a valid g2p method for the language exists. Join these sequences with forward slashes (i.e. h/e/l/l/o h/o/w a/r/e y/o/u), since pipes are already taken in that format. Maybe allow forward slashes to be escaped?
- Allow a custom tokenizer to be defined... YAGNI.

Rough sketch to help jog my memory:
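Separately from the sketch mentioned above, here is a minimal illustration of how the two Enum fields and the slash-joined filelist sequences could look. It is a hedged example, not EveryVoice code: the member names and values of both enums are assumptions, `join_tokens` and `split_tokens` are hypothetical helper names, and backslash-escaping is only one possible answer to the open question about escaping forward slashes. It also assumes that individual tokens never contain spaces or pipes.

```python
from enum import Enum


class DatasetTextRepresentation(Enum):
    # How the user's source text is written (member names are assumed).
    characters = "characters"
    ipa_phones = "ipa_phones"
    arpabet = "arpabet"


class TargetTrainingRepresentationLevel(Enum):
    # What the model is trained on (member names are assumed).
    characters = "characters"
    phones = "phones"
    phonological_features = "phonological_features"


def _escape(token):
    # Escape the escape character first, then the separator.
    return token.replace("\\", "\\\\").replace("/", "\\/")


def join_tokens(words):
    """Join the tokens of each word with '/', and the words with a space."""
    return " ".join("/".join(_escape(tok) for tok in word) for word in words)


def split_tokens(field):
    """Inverse of join_tokens: recover the list of token lists."""
    words = []
    for chunk in field.split(" "):
        tokens, current, i = [], "", 0
        while i < len(chunk):
            ch = chunk[i]
            if ch == "\\" and i + 1 < len(chunk):
                # Escaped character: take the next character literally.
                current += chunk[i + 1]
                i += 2
            elif ch == "/":
                # Unescaped separator: close the current token.
                tokens.append(current)
                current = ""
                i += 1
            else:
                current += ch
                i += 1
        tokens.append(current)
        words.append(tokens)
    return words


words = [list("hello"), list("how"), list("are"), list("you")]
print(join_tokens(words))
```

With this scheme a literal forward slash in the text survives a round trip through the filelist field, at the cost of making backslash a reserved character as well.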