
Question about transducer decoding #123


Open
YoPatapon opened this issue Jan 21, 2021 · 9 comments
Labels
question Further information is requested

Comments


YoPatapon commented Jan 21, 2021

According to the theory of RNN-T, multiple tokens can be generated at each time step. For example, "cat\phi\phi\phi\phi\phi\phi" is also a valid decoding path for RNN-T in the case below. It seems like the current implementation of greedy decoding can generate at most one token per time step. Do I understand it wrong?

[image: RNN-T decoding lattice example]
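For concreteness, here is a minimal sketch of the one-token-per-frame loop in question. The `predict_net`/`joint_net` callables and the `BLANK` index are placeholders for illustration, not the repository's actual API:

```python
import numpy as np

BLANK = 0  # assumed index of the blank symbol

def greedy_v1(encoder_out, predict_net, joint_net):
    """Greedy decoding that emits AT MOST ONE token per encoder frame.

    encoder_out: sequence of T encoder frames.
    predict_net(token, state) -> (pred_out, new_state)   # hypothetical
    joint_net(enc_frame, pred_out) -> logits over vocab  # hypothetical
    """
    hyp = []
    pred_out, state = predict_net(BLANK, None)  # start-of-sequence step
    for t in range(len(encoder_out)):
        logits = joint_net(encoder_out[t], pred_out)
        k = int(np.argmax(logits))
        if k != BLANK:
            hyp.append(k)
            pred_out, state = predict_net(k, state)
        # the frame index advances regardless, so at most one token
        # can ever be emitted for frame t
    return hyp
```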

nglehuy added the question label on Jan 21, 2021

nglehuy commented Jan 21, 2021

Each time step gives an array of class probabilities. Greedy decoding takes the class with the largest probability (that's why they call it "greedy"). So the question is: from what perspective do you want to decode "multiple classes in a single time step"?

@YoPatapon (Author)

> Each time step gives an array of class probabilities. Greedy decoding takes the class with the largest probability (that's why they call it "greedy"). So the question is: from what perspective do you want to decode "multiple classes in a single time step"?

Referring to the picture below: \phi should be taken as the signal to move on to the input of the next frame. If a new token is generated, the current frame should be used again for the next decoding step, while the state of the prediction network moves forward one step, until a \phi is generated. In this case, 'c-a-t' is generated in order at the first frame, and the remaining frames all generate \phi, which is also a valid path. Do I misunderstand something?

[image: RNN-T decoding lattice showing the 'c-a-t' example path]
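For comparison, a sketch of the decoding behavior described above, reusing the placeholder `predict_net`/`joint_net` interface from the earlier sketch. The only change is the inner loop, which keeps re-scoring the same encoder frame until blank wins; the illustrative `max_symbols_per_frame` cap guards against infinite emission:

```python
def greedy_v2(encoder_out, predict_net, joint_net, max_symbols_per_frame=10):
    """Greedy decoding that may emit SEVERAL tokens per encoder frame.

    The frame index t only advances when blank wins the argmax; until
    then, the prediction network keeps stepping on the same frame.
    """
    hyp = []
    pred_out, state = predict_net(BLANK, None)  # start-of-sequence step
    for t in range(len(encoder_out)):
        for _ in range(max_symbols_per_frame):
            logits = joint_net(encoder_out[t], pred_out)
            k = int(np.argmax(logits))
            if k == BLANK:
                break  # blank consumes the frame: move on to t + 1
            hyp.append(k)
            pred_out, state = predict_net(k, state)
    return hyp
```

In this framing, the one-token-per-frame behavior is exactly the special case `max_symbols_per_frame = 1`.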


nglehuy commented Jan 23, 2021

@YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

@YoPatapon (Author)

> @YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
> Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

Sorry, I found these figures showing the valid paths of RNN-T decoding on the web, and the related blog posts are in Chinese. I believe there are English articles discussing this; I will send them to you when I find them. Here is a video on YouTube demonstrating the greedy decoding process of RNN-T which you could refer to.

@YoPatapon (Author)

> @YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
> Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

Maybe you could refer to the paper "Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer". Section 2 introduces the decoding process of RNN-T.

[image: excerpt from the paper]


nglehuy commented Jan 25, 2021

Thanks @YoPatapon, I'll look into them 😄


nglehuy commented Jan 31, 2021

@YoPatapon I added Transducer Greedy V2 that behaves like you said, but the result somehow got worse (tested on my pretrained Conformer described in the README of the conformer example directory).
[screenshot: test results comparing greedy and greedy_v2]
Check out `_perform_greedy_v2` to see the implementation.
What do you think?

@YoPatapon (Author)

> @YoPatapon I added Transducer Greedy V2 that behaves like you said, but the result somehow got worse (tested on my pretrained Conformer described in the README of the conformer example directory).
> [screenshot: test results comparing greedy and greedy_v2]
> Check out `_perform_greedy_v2` to see the implementation.
> What do you think?

I implemented batch greedy decoding v1 and v2 myself, and trained and tested on a Chinese speech recognition dataset. The results on the test set also got worse with v2, but I found that v2 helps improve the WER when decoding the training set. The v2 implementation introduces more paths in decoding compared to v1; maybe these paths are not practical in real scenarios, so they increase the difficulty of decoding even though they are allowed by the theory of RNN-T.
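To make the "more paths" point concrete, here is a small counting argument, assuming the common convention that every alignment ends with a blank on the last frame (the v1 count further assumes at most one label per frame):

```latex
% Alignments of a length-U label sequence over T frames: exactly T
% blanks, labels in order, final symbol blank.
N_{\mathrm{all}}(T, U) = \binom{T + U - 1}{U}
% Alignments reachable when at most one label may be emitted per
% frame (the v1 restriction, requires U <= T):
N_{\mathrm{v1}}(T, U) = \binom{T}{U}
% Example with T = 2, U = 2 ("ab", blank written as #):
%   all alignments: ab##, a#b#, #ab#   (3 paths)
%   v1-reachable:   a#b#               (1 path)
```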


nglehuy commented Apr 17, 2021

@YoPatapon I noticed that when I use greedy decoding v2 on English, the character ' is missing from the words that contain it.
