Hi, I seem to have found a bug in the code.
In the `extract_features` function of `NgramTransformerDecoder`, a transpose operation is applied to `attn`, which is the output of `NgramTransformerDecoderLayer`. As can be seen from the code comment, its purpose is to change the dims from `[(1+ngram)*T, B, C]` to `[B, (1+ngram)*T, C]`. The code snippet is as follows:
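Roughly, the lines in question look like this (I am paraphrasing them here and leaving out the surrounding code, so the exact source may differ slightly):

```python
# (paraphrased excerpt from NgramTransformerDecoder.extract_features)
# [(1+ngram)*T, B, C] -> [B, (1+ngram)*T, C]
x = x.transpose(0, 1)
if attn is not None:
    attn = attn.transpose(0, 1)
```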
The variable `attn`, from `NgramTransformerDecoderLayer`, is the second result returned by its `encoder_attn` (`fairseq.modules.MultiheadAttention`). In fairseq v0.9.0, the code snippet of `MultiheadAttention`'s `forward` function is as follows:
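The relevant part is roughly this (again paraphrased, so the exact 0.9.0 source may differ slightly):

```python
# (paraphrased excerpt from fairseq 0.9.0 MultiheadAttention.forward)
if need_weights:
    # view to (bsz, self.num_heads, tgt_len, src_len), then move heads to dim 0
    attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
    # average attention weights over heads -> (bsz, tgt_len, src_len)
    attn_weights = attn_weights.mean(dim=0)
else:
    attn_weights = None

return attn, attn_weights
```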
It can be seen that the second result of the `forward` function (`attn_weights`) originally has the shape `(bsz, self.num_heads, tgt_len, src_len)`. After the `transpose` and `mean` operations, it has the shape `(bsz, tgt_len, src_len)`, which is the actual shape of the `attn` mentioned in `extract_features`, rather than the `[(1+ngram)*T, B, C]` described in the comment.
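To make the mismatch concrete, here is a tiny standalone check (the sizes are arbitrary examples I picked, not values from the model):

```python
import torch

bsz, tgt_len, src_len = 2, 5, 7  # arbitrary example sizes

# what encoder_attn actually hands back as its second return value
attn = torch.randn(bsz, tgt_len, src_len)

# the transpose in extract_features assumes a [(1+ngram)*T, B, C] layout,
# so on the real tensor it just swaps the batch and target-length dims
print(attn.transpose(0, 1).shape)  # torch.Size([5, 2, 7]) == (tgt_len, bsz, src_len)
```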
BTW, the shape and transpose of `x` in `extract_features` are correct, and `attn` is not actually used during training and inference, so I guess that is why this has not been found for 2 years. But anyone who wants to make some modification and needs to use the variable `attn`, like me, will find that it has a confusing shape caused by the `transpose` operation. It did take me some time to find the bug. Hoping the PR can be merged.