Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sentencepiece #93

Merged
merged 10 commits into from
Jan 5, 2021
Merged

Sentencepiece #93

merged 10 commits into from
Jan 5, 2021

Conversation

gandroz
Copy link
Contributor

@gandroz gandroz commented Dec 31, 2020

I wanted to add the sentencepiece featurizer from Google instead of using subword on tensorflow when training a conformer.

@nglehuy
Copy link
Collaborator

nglehuy commented Dec 31, 2020

Please get the newest code and resolve conflicts 😄

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 2, 2021

Hi @gandroz, thank you for your contribution, I have some suggestions:

Firstly,
These lines:

def __init_upoints(self):
    text = [""]
    for idx in range(1, self.num_classes):
        text.append(self.model.decode_ids([idx]))
    self.upoints = tf.strings.unicode_decode(text, "UTF-8")
    self.upoints = self.upoints.to_tensor()  # [num_classes, max_subword_length]

Are used when the blank = 0 (which is the text = [""]). Can you either update the blank to 0 or update this function please.

Secondly, why do you repeat the class SentencePieceFeaturizer twice?

@gandroz
Copy link
Contributor Author

gandroz commented Jan 2, 2021

Hmmm the duplicate was apparently due to a merge conflict, sorry.
I also corrected the blank to use the PAD tag of sentencepiece. I noticed the blank is hard-coded to 0, maybe it could be configurable of class-dependent ?
Finally, I added some tests to be sure the featurizer is consistent with the subword featurizer.

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 2, 2021

@gandroz The blank is not suppose to be hard-coded to 0 (yeah the SubwordFeaturizer has the blank hard-coded to 0 so that the blank matches upoint's blank), only the unicode points have the blank is always 0, which is the NULL character \000. The upoints is an array which indices are the ids, if the blank is the id 3 then the upoints[3] = 0. The purpose of upoints is to support tflite to decode to Unicode Points instead of ids or string (tf.string is not supported yet in tflite) and in the future it might be used for multi-languages ASR.
Furthermore, according to the author of warp-transducer, we should use blank=0 for better performance.

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 2, 2021

About this file. You should name it as the corpus that you create subwords from 😄

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 4, 2021

@gandroz Please change the name of the vocabularies/conformer.subwords 😄

@nglehuy nglehuy self-requested a review January 4, 2021 16:53
@gandroz
Copy link
Contributor Author

gandroz commented Jan 4, 2021

@usimarit actually, it is your file which stands in your Drive that I used to perform some (unit)tests

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 4, 2021

@gandroz Oh I see 😄 But that file contains subwords from test-clean too, so it's not quite true, you should use the vocabularies/librispeech_train_4_1030.subwords and remove that vocabularies/conformer.subwords

@gandroz
Copy link
Contributor Author

gandroz commented Jan 4, 2021

@usimarit ok no problem. Anyway, that's just for some tests ;) not for performances

@nglehuy
Copy link
Collaborator

nglehuy commented Jan 5, 2021

@gandroz I just want to remove unnecessary components to keep the repo clean 😄

@nglehuy nglehuy merged commit 488b14e into TensorSpeech:main Jan 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants