Sentencepiece #93

gandroz · 2020-12-31T04:23:41Z

I wanted to add Google's SentencePiece featurizer as an alternative to the TensorFlow subword featurizer when training a Conformer.

nglehuy · 2020-12-31T09:17:08Z

Please get the newest code and resolve conflicts 😄


nglehuy · 2021-01-02T07:54:27Z

Hi @gandroz, thank you for your contribution, I have some suggestions:

Firstly,
These lines:

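def __init_upoints(self):
    # Index 0 is reserved for the blank, which decodes to the empty string.
    text = [""]
    for idx in range(1, self.num_classes):
        text.append(self.model.decode_ids([idx]))
    # Ragged tensor of Unicode code points, one row per id.
    self.upoints = tf.strings.unicode_decode(text, "UTF-8")
    # Zero-pad the rows to get a dense tensor.
    self.upoints = self.upoints.to_tensor()  # [num_classes, max_subword_length]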

They assume that the blank is 0 (that is why text starts with [""]). Could you either set the blank to 0 or update this function, please?

Secondly, why is the class SentencePieceFeaturizer defined twice?
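For the first point, one possible way to drop the assumption that the blank sits at index 0 is sketched below in plain Python rather than TensorFlow. The names build_upoints, decode_id and toy_vocab are hypothetical and not part of the PR; decode_id stands in for self.model.decode_ids([idx]), and the zero-padding mimics what to_tensor() does by default.

```python
from typing import Callable, List


def build_upoints(decode_id: Callable[[int], str],
                  num_classes: int,
                  blank: int = 0) -> List[List[int]]:
    """Return a [num_classes, max_subword_length] table of Unicode code points."""
    pieces = []
    for idx in range(num_classes):
        # The blank id maps to the empty string, wherever it sits.
        pieces.append("" if idx == blank else decode_id(idx))
    rows = [[ord(ch) for ch in piece] for piece in pieces]
    width = max((len(row) for row in rows), default=0)
    # Pad every row with 0 (the NULL character) up to the longest subword.
    return [row + [0] * (width - len(row)) for row in rows]


# Hypothetical toy vocabulary standing in for a SentencePiece model.
toy_vocab = ["<pad>", "a", "bc", "d"]
table_blank_first = build_upoints(lambda i: toy_vocab[i], len(toy_vocab), blank=0)
table_blank_last = build_upoints(lambda i: toy_vocab[i], len(toy_vocab), blank=3)
```

With blank=0 the first row is all zeros; with blank=3 the last row is all zeros and id 0 is decoded like any other piece.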

gandroz · 2021-01-02T16:16:04Z

Hmmm the duplicate was apparently due to a merge conflict, sorry.
I also corrected the blank to use the PAD tag of sentencepiece. I noticed the blank is hard-coded to 0; maybe it could be configurable or class-dependent?
Finally, I added some tests to be sure the featurizer is consistent with the subword featurizer.

nglehuy · 2021-01-02T16:52:38Z

@gandroz The blank is not supposed to be hard-coded to 0 in general (the SubwordFeaturizer does hard-code the blank to 0, so that it matches the blank in upoints). Only in the Unicode points is the blank always 0, which is the NULL character \000. upoints is an array whose indices are the ids: if the blank is id 3, then upoints[3] = 0. The purpose of upoints is to let TFLite decode to Unicode points instead of ids or strings (tf.string is not supported in TFLite yet), and in the future it might be used for multilingual ASR.
Furthermore, according to the author of warp-transducer, we should use blank=0 for better performance.
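To illustrate the indexing described here, the following is a small plain-Python sketch (not the TFLite code path) of decoding a sequence of ids through a upoints table in which the blank is id 3. The table contents and the helper name ids_to_text are made up for the example.

```python
from typing import List


def ids_to_text(ids: List[int], upoints: List[List[int]]) -> str:
    """Look up each id's code points and join them, skipping NULL (0) entries."""
    chars = []
    for token_id in ids:
        for code_point in upoints[token_id]:
            # 0 is both the padding value and the blank row, so it is dropped.
            if code_point != 0:
                chars.append(chr(code_point))
    return "".join(chars)


# Hypothetical table: ids 0..2 are subwords "a", "bc", "d"; id 3 is the blank.
upoints = [[97, 0], [98, 99], [100, 0], [0, 0]]
text = ids_to_text([0, 3, 1, 3, 2], upoints)
```

Because the blank row holds only zeros, blanks disappear from the decoded text whichever id they are assigned to.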

nglehuy · 2021-01-02T16:59:33Z

About this file: you should name it after the corpus that the subwords were created from 😄

nglehuy · 2021-01-04T16:53:06Z

@gandroz Please rename vocabularies/conformer.subwords 😄

gandroz · 2021-01-04T16:59:54Z

@usimarit Actually, it is your file, the one on your Drive, that I used to run some (unit) tests.

nglehuy · 2021-01-04T17:27:31Z

@gandroz Oh I see 😄 But that file contains subwords from test-clean too, so it's not quite right. You should use vocabularies/librispeech_train_4_1030.subwords and remove vocabularies/conformer.subwords.

gandroz · 2021-01-04T18:36:03Z

@usimarit OK, no problem. Anyway, that's just for some tests ;) not for performance.

nglehuy · 2021-01-05T02:18:29Z

@gandroz I just want to remove unnecessary components to keep the repo clean 😄

monatis and others added 3 commits December 31, 2020 06:23

Add support for Multilingual LibriSpeech dataset

43edfda

add sentencepiece featurizer

5fac7fc

add requirement

f17e6ca

monatis and others added 5 commits December 31, 2020 12:48

Remove blank char from auto-generated alphabet

d2c890a

Merge pull request TensorSpeech#92 from monatis/main

b131a7c

Add support for Multilingual LibriSpeech dataset

add sentencepiece featurizer

755a081

add requirement

4214b33

Merge branch 'sentencepiece' of https://github.com/gandroz/TensorFlowASR into sentencepiece

dbc6053

remove duplicate and correct blank

ac4ed5e

nglehuy self-requested a review January 4, 2021 16:53

remove subwords file

75282c6

nglehuy approved these changes Jan 5, 2021

nglehuy merged commit 488b14e into TensorSpeech:main Jan 5, 2021
