
Question about transducer decoding #123


Open
YoPatapon opened this issue Jan 21, 2021 · 9 comments
Labels
question Further information is requested

Comments


YoPatapon commented Jan 21, 2021

According to the theory of RNN-T, multiple tokens can be generated at each time step. For example, "cat\phi\phi\phi\phi\phi\phi" is also a valid decoding path for RNN-T in the case below. It seems like the current implementation of greedy decoding can generate at most one token per time step. Do I understand it wrong?

[image: RNN-T decoding lattice example]
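For concreteness, here is a minimal sketch of the one-token-per-frame loop in question. The `predict_net`/`joint_net` callables and the `BLANK` index are placeholders for illustration, not the repository's actual API:

```python
import numpy as np

BLANK = 0  # assumed index of the blank symbol

def greedy_v1(encoder_out, predict_net, joint_net):
    """Greedy decoding that emits AT MOST ONE token per encoder frame.

    encoder_out: sequence of T encoder frames.
    predict_net(token, state) -> (pred_out, new_state)   # hypothetical
    joint_net(enc_frame, pred_out) -> logits over vocab  # hypothetical
    """
    hyp = []
    pred_out, state = predict_net(BLANK, None)  # start-of-sequence step
    for t in range(len(encoder_out)):
        logits = joint_net(encoder_out[t], pred_out)
        k = int(np.argmax(logits))
        if k != BLANK:
            hyp.append(k)
            pred_out, state = predict_net(k, state)
        # the frame index advances regardless, so at most one token
        # can ever be emitted for frame t
    return hyp
```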

nglehuy added the question label on Jan 21, 2021

nglehuy commented Jan 21, 2021

Each time step gives an array of class probabilities. Greedy decoding takes the class with the largest probability (that's why they call it "greedy"). So the question is: from what perspective do you want to decode "multiple classes in a single time step"?

@YoPatapon (Author)

> Each time step gives an array of class probabilities. Greedy decoding takes the class with the largest probability (that's why they call it "greedy"). So the question is: from what perspective do you want to decode "multiple classes in a single time step"?

Referring to the picture below: \phi should be taken as the signal to move on to the input of the next frame. If a new token is generated, the current frame should be used again for the next decoding step, while the state of the prediction network moves forward one step, until a \phi is generated. In this case, 'c-a-t' is generated in order at the first frame, and the remaining frames all generate \phi, which is also a valid path. Do I misunderstand something?

[image: RNN-T decoding lattice showing the 'c-a-t' example path]
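For comparison, a sketch of the decoding behavior described above, reusing the placeholder `predict_net`/`joint_net` interface from the earlier sketch. The only change is the inner loop, which keeps re-scoring the same encoder frame until blank wins; the illustrative `max_symbols_per_frame` cap guards against infinite emission:

```python
def greedy_v2(encoder_out, predict_net, joint_net, max_symbols_per_frame=10):
    """Greedy decoding that may emit SEVERAL tokens per encoder frame.

    The frame index t only advances when blank wins the argmax; until
    then, the prediction network keeps stepping on the same frame.
    """
    hyp = []
    pred_out, state = predict_net(BLANK, None)  # start-of-sequence step
    for t in range(len(encoder_out)):
        for _ in range(max_symbols_per_frame):
            logits = joint_net(encoder_out[t], pred_out)
            k = int(np.argmax(logits))
            if k == BLANK:
                break  # blank consumes the frame: move on to t + 1
            hyp.append(k)
            pred_out, state = predict_net(k, state)
    return hyp
```

In this framing, the one-token-per-frame behavior is exactly the special case `max_symbols_per_frame = 1`.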


nglehuy commented Jan 23, 2021

@YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

@YoPatapon (Author)

> @YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
> Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

Sorry, I found these figures showing the valid paths of RNN-T decoding on the web, and the related blog posts are in Chinese. I believe there are English articles discussing this; I will send them to you when I find them. Here is a video on YouTube demonstrating the greedy decoding process of RNN-T which you could refer to.

@YoPatapon (Author)

> @YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this, call it something like greedy_v2, and see if it really improves the results.
> Anyway, could you give me some documents about this? I wanna know why they "reuse" the current frame.

Maybe you could refer to the paper "Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer". Section 2 introduces the decoding process of RNN-T.

[image: excerpt from the paper]


nglehuy commented Jan 25, 2021

Thanks @YoPatapon, I'll look into them 😄


nglehuy commented Jan 31, 2021

@YoPatapon I added Transducer Greedy V2 that behaves like you said, but the result somehow got worse (tested on my pretrained Conformer described in the README of the conformer example directory).
[screenshot: test results comparing greedy and greedy_v2]
Check out `_perform_greedy_v2` to see the implementation.
What do you think?

@YoPatapon (Author)

> @YoPatapon I added Transducer Greedy V2 that behaves like you said, but the result somehow got worse (tested on my pretrained Conformer described in the README of the conformer example directory).
> [screenshot: test results comparing greedy and greedy_v2]
> Check out `_perform_greedy_v2` to see the implementation.
> What do you think?

I implemented batch greedy decoding v1 and v2 myself, and trained and tested on a Chinese speech recognition dataset. The results on the test set also got worse with v2, but I found that v2 helps improve the WER when decoding the training set. The v2 implementation introduces more paths in decoding compared to v1; maybe these paths are not practical in real scenarios, so they increase the difficulty of decoding even though they are allowed by the theory of RNN-T.
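To make the "more paths" point concrete, here is a small counting argument, assuming the common convention that every alignment ends with a blank on the last frame (the v1 count further assumes at most one label per frame):

```latex
% Alignments of a length-U label sequence over T frames: exactly T
% blanks, labels in order, final symbol blank.
N_{\mathrm{all}}(T, U) = \binom{T + U - 1}{U}
% Alignments reachable when at most one label may be emitted per
% frame (the v1 restriction, requires U <= T):
N_{\mathrm{v1}}(T, U) = \binom{T}{U}
% Example with T = 2, U = 2 ("ab", blank written as #):
%   all alignments: ab##, a#b#, #ab#   (3 paths)
%   v1-reachable:   a#b#               (1 path)
```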


nglehuy commented Apr 17, 2021

@YoPatapon I noticed that when I use greedy decoding v2 on English, the character ' is missing from the words that contain it.
