Currently, encoder-decoder models use either `head_mask` or `decoder_head_mask` to mask attention heads in the cross-attention modules, and neither choice is fully correct. Furthermore, the MHA in the cross-attention modules is part of the decoder, so its head mask should have `shape = (decoder.num_layers, decoder.num_attention_heads)`; using the encoder `head_mask` in the cross-attention module can therefore fail with a shape mismatch whenever the encoder and decoder differ in the number of layers or attention heads (see the sketch below).
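To make the shape requirement concrete, here is a minimal sketch assuming a current BART-style model from the `transformers` library; the config values and token ids are illustrative only and not taken from this issue.

```python
import torch
from transformers import BartConfig, BartModel

# Make encoder and decoder deliberately asymmetric, so that a single mask
# cannot serve both sides.
config = BartConfig(
    encoder_layers=4,
    encoder_attention_heads=8,
    decoder_layers=2,
    decoder_attention_heads=4,
    d_model=64,
    encoder_ffn_dim=128,
    decoder_ffn_dim=128,
)
model = BartModel(config)

# Encoder self-attention: one mask entry per (encoder layer, encoder head).
head_mask = torch.ones(config.encoder_layers, config.encoder_attention_heads)

# Decoder self-attention -- and cross-attention, which also lives inside the
# decoder -- need shape (decoder_layers, decoder_attention_heads) instead.
decoder_head_mask = torch.ones(config.decoder_layers, config.decoder_attention_heads)

input_ids = torch.tensor([[0, 10, 20, 2]])
outputs = model(
    input_ids=input_ids,
    decoder_input_ids=input_ids,
    head_mask=head_mask,
    decoder_head_mask=decoder_head_mask,
)
print(outputs.last_hidden_state.shape)

# Reusing the (4, 8) encoder head_mask for the cross-attention modules, as
# described in this issue, cannot match a decoder with 2 layers of 4 heads.
```

A separate head mask with the decoder's shape, dedicated to the cross-attention modules, is the natural shape-consistent fix implied above.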
My contribution: I will take care of this issue this weekend.

Reviewers: @patil-suraj @patrickvonplaten
stancld changed the title from "🐛 Fix attention head mask for cross-attention module in encoder-decoder models" to "🐛 Bug in attention head mask for cross-attention module in encoder-decoder models" on Mar 5, 2021.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.