Question about transducer decoding #123
Each time step yields an array of class probabilities. Greedy decoding takes the class with the largest probability (that is why it is called "greedy"). So the question is: from what perspective do you want to decode "multiple classes in a single time step"?
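For reference, a minimal sketch of this one-token-per-frame greedy loop (v1 behaviour). The `joint(enc_t, state)` and `predict(token, state)` callables and the `BLANK` id are assumed placeholders, not this repo's actual API:

```python
import numpy as np

BLANK = 0  # assumed \phi (blank) token id

def greedy_v1(enc_outputs, joint, predict, init_state):
    """At most one non-blank token per encoder frame."""
    tokens, state = [], init_state
    for enc_t in enc_outputs:            # one step per encoder frame
        logits = joint(enc_t, state)     # class scores for this frame
        k = int(np.argmax(logits))       # greedy: take the most probable class
        if k != BLANK:
            tokens.append(k)
            state = predict(k, state)    # advance the prediction network
        # the frame index always advances, even when a token was emitted
    return tokens
```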
Refer to the picture below: \phi should be taken as the signal to move on to the input of the next frame. If a new token is generated, the current frame should be used again for the next decoding step, while the state of the prediction network moves forward one step, until a \phi is generated. In this case, 'c-a-t' is generated in order in the first step, and the remaining frames all generate \phi, which is also a valid path. Do I misunderstand something?
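Under that reading, the frame index only advances on \phi. A minimal sketch of such a loop, using the same assumed `joint`/`predict` interface and `BLANK` id as the sketch above (it relies on the model eventually emitting \phi on every frame):

```python
def greedy_v2(enc_outputs, joint, predict, init_state):
    """Stay on the same frame until \phi is produced, then move to the next frame."""
    tokens, state = [], init_state
    for enc_t in enc_outputs:
        while True:
            logits = joint(enc_t, state)
            k = int(np.argmax(logits))
            if k == BLANK:
                break                     # \phi: this frame is done, advance to the next
            tokens.append(k)
            state = predict(k, state)     # prediction network steps forward on each token
    return tokens
```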
@YoPatapon I see your point. I haven't come across this theory before, but it looks reasonable. We should implement this and call it something like Greedy V2.
Sorry, I found these figures showing the valid paths of RNN-T decoding on the web, and the related blogs are in Chinese. I believe there are some English articles discussing this; I will send them to you when I find them. Here is a video on YouTube demonstrating the greedy decoding process of RNN-T that you could refer to.
Maybe you could refer to the paper "Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer".
Thanks @YoPatapon, I'll look into them 😄
@YoPatapon I added Transducer Greedy V2 that behaves as you described, but the results somehow got worse (tested on my pretrained Conformer described in the README of the conformer example directory).
I implemented batch greedy decode v1 and v2 myself and trained and tested on a Chinese speech recognition dataset. The results on the test set also got worse with v2, but I found that v2 improves the WER when decoding the training set. The v2 implementation introduces more paths in decoding compared to v1; maybe these paths are allowed by RNN-T theory but are not practical in real scenarios, which increases the difficulty of decoding.
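One common way to rein in those extra paths (a limit several RNN-T decoders apply, though not necessarily something this repo exposes) is to cap the number of non-blank emissions allowed per frame, so v2 degenerates gracefully toward v1. A hedged sketch on top of the `greedy_v2` loop above:

```python
def greedy_v2_capped(enc_outputs, joint, predict, init_state, max_symbols_per_step=3):
    """v2 decoding, but emit at most `max_symbols_per_step` tokens per frame."""
    tokens, state = [], init_state
    for enc_t in enc_outputs:
        for _ in range(max_symbols_per_step):
            logits = joint(enc_t, state)
            k = int(np.argmax(logits))
            if k == BLANK:
                break                     # \phi ends this frame early
            tokens.append(k)
            state = predict(k, state)
    return tokens
```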
@YoPatapon I noticed that when I use greedy decoding v2 on English, the character
According to the theory of RNN-T, multiple tokens can be generated at each time step. For example, "cat\phi\phi\phi\phi\phi\phi" is also a valid decoding path for RNN-T in the case below. It seems the current implementation of greedy decoding can generate at most one token per time step. Do I understand it wrong?
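For concreteness, a small self-contained check of what makes an alignment a valid RNN-T path (standard RNN-T theory, not code from this repo): it must contain exactly one \phi per encoder frame, and the non-blank symbols, read in order, must spell the target.

```python
def is_valid_rnnt_alignment(alignment, target, num_frames, blank="\u03c6"):
    """A valid RNN-T path has exactly one blank per encoder frame, and the
    non-blank symbols, read in order, spell the target sequence."""
    blanks = sum(1 for s in alignment if s == blank)
    labels = [s for s in alignment if s != blank]
    return blanks == num_frames and labels == list(target)

# "c a t φ φ φ φ φ φ": all three tokens emitted on the first frame, then one φ per frame
print(is_valid_rnnt_alignment(list("cat") + ["\u03c6"] * 6, "cat", num_frames=6))  # True
```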