Sentencepiece #93
Please get the newest code and resolve conflicts 😄
Add support for Multilingual LibriSpeech dataset
Hi @gandroz, thank you for your contribution. I have some suggestions. Firstly,

    def __init_upoints(self):
        text = [""]
        for idx in range(1, self.num_classes):
            text.append(self.model.decode_ids([idx]))
        self.upoints = tf.strings.unicode_decode(text, "UTF-8")
        self.upoints = self.upoints.to_tensor()  # [num_classes, max_subword_length]

Are used when the
Secondly, why do you repeat the class
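The `__init_upoints` snippet above builds a padded table of Unicode code points, one row per class. A minimal plain-Python sketch of the same idea follows, assuming a hypothetical toy vocabulary (`TOY_VOCAB`) and a stand-in `toy_decode_ids` function in place of the real SentencePiece model; `build_upoints` is an illustrative name, not part of the repository. It pads with 0, which is what `RaggedTensor.to_tensor()` does by default.

```python
from typing import Callable, List


def build_upoints(
    decode_ids: Callable[[List[int]], str],
    num_classes: int,
    pad: int = 0,
) -> List[List[int]]:
    """Return a [num_classes, max_subword_length] table of code points."""
    # Index 0 is kept as an empty string, mirroring the snippet above.
    text = [""]
    for idx in range(1, num_classes):
        text.append(decode_ids([idx]))
    # Convert each subword to its Unicode code points (a ragged list).
    rows = [[ord(ch) for ch in piece] for piece in text]
    # Right-pad every row to the longest subword, like to_tensor().
    max_len = max((len(row) for row in rows), default=0)
    return [row + [pad] * (max_len - len(row)) for row in rows]


# Hypothetical toy vocabulary standing in for a trained model.
TOY_VOCAB = ["", "a", "th", "ing"]


def toy_decode_ids(ids: List[int]) -> str:
    """Stand-in for the model's decode_ids: concatenate the pieces."""
    return "".join(TOY_VOCAB[i] for i in ids)


upoints = build_upoints(toy_decode_ids, num_classes=len(TOY_VOCAB))
for row in upoints:
    print(row)
```

Each row can then be looked up by class index, so a sequence of predicted ids can be turned into code points without calling the tokenizer again.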
Hmmm, the duplicate was apparently due to a merge conflict, sorry.
@gandroz The blank is not supposed to be hard-coded to 0 (yeah the
About this file: you should name it after the corpus that you create subwords from 😄
@gandroz Please change the name of the
@usimarit Actually, it is your file, the one in your Drive, that I used to run some (unit) tests.
@gandroz Oh I see 😄 But that file contains subwords from test-clean too, so it's not quite true, you should use the
@usimarit OK, no problem. Anyway, that's just for some tests ;) not for performance.
@gandroz I just want to remove unnecessary components to keep the repo clean 😄
I wanted to add Google's SentencePiece featurizer as an alternative to the TensorFlow subword featurizer when training a Conformer.