
About key_padding_mask in multihead self attention #36

Open
YuchenHui22314 opened this issue Apr 21, 2023 · 3 comments

Comments

@YuchenHui22314

Hi!

Thank you for your implementation!

I would like to know whether there are particular reasons why this line for key_padding_mask (https://github.com/pmixer/SASRec.pytorch/blob/master/model.py#L83) is commented out. It seems that this mask is necessary to prevent the model from attending to padding positions?

Thanks again,

Sincerely

Yuchen

@pmixer
Owner

pmixer commented Apr 21, 2023


Thanks for the question. I can hardly recall the exact details on short notice, but generally:

As the padding item is 0, corresponding to an all-zero embedding as initialized in https://github.com/pmixer/SASRec.pytorch/blob/master/model.py#L36, attending to it would not affect the output.
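A rough sketch of what that zero-padding embedding looks like (illustrative only, not the repo's exact code):

```python
import torch
import torch.nn as nn

# With padding_idx=0, nn.Embedding keeps row 0 as an all-zero vector,
# so padded positions (item id 0) look up a zero embedding.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
seq = torch.tensor([[0, 0, 3, 7]])   # two leading padding tokens
print(emb(seq)[0, :2])               # the two padded positions are all zeros
```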

Also, the current PyTorch API states "If both attn_mask and key_padding_mask are supplied, their types should match." I may have hit the issue back when the docs said attn_mask and key_padding_mask could not be used at the same time, using PyTorch 1.6 for the implementation about 3 years ago, and thus had one of them commented out.

Please feel free to uncomment the line to check how the key padding mask would affect model training and inference.
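For reference, a minimal sketch of what passing both masks to nn.MultiheadAttention could look like (the tensor names and shapes here are illustrative, not the repo's; both masks are boolean so their dtypes match):

```python
import torch
import torch.nn as nn

# batch_first requires a reasonably recent PyTorch (>= 1.9).
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 5, 8)  # (batch, seq_len, embed_dim)

# Causal mask: True means "do not attend".
attn_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
# Padding mask: True marks padded key positions (right-padded here).
key_padding_mask = torch.tensor([[False, False, False, True, True],
                                 [False, False, False, False, False]])

out, _ = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
```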

@YuchenHui22314
Author

Thanks for your prompt reply!

Oh OK, I see, I will try! After careful consideration, I think that, conceptually, even setting all paddings to 0 (what the code does now) will still influence the attention mechanism, since before the attention softmax the raw attention score can be negative, zero, or positive. Therefore zero does not mean anything special to the softmax function (it should be -inf).
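For example, a quick numeric check of this point (just a sketch):

```python
import torch

scores = torch.tensor([2.0, -1.0, 0.0])   # last entry: a padded key
print(torch.softmax(scores, dim=-1))       # the padded key still gets weight

masked = scores.clone()
masked[-1] = float('-inf')                 # what key_padding_mask would do
print(torch.softmax(masked, dim=-1))       # the padded key's weight is exactly 0
```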

Thanks again!

Yuchen

@pmixer
Owner

pmixer commented Apr 24, 2023

[image: attention]

Thanks. For the attention score, yes, -inf is required as you described. What I meant is that attending to an all-zero vector equals attending to nothing; I did not mean the attention score being zero in my former reply. When you use Q·K to get attention scores and then use those scores to take the weighted sum of V, a value vector V = 0 will not affect the final output, theoretically.
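A small sketch of that last point (illustrative numbers, not the repo's code): whatever attention weight a padded position receives, an all-zero value vector adds nothing to the weighted sum.

```python
import torch

weights = torch.tensor([0.5, 0.3, 0.2])        # attention weights, sum to 1
values = torch.stack([torch.randn(4),
                      torch.randn(4),
                      torch.zeros(4)])          # last row: padded position, V = 0
out = weights @ values                          # 0.2 * 0 contributes nothing
print(torch.allclose(out, weights[:2] @ values[:2]))  # True
```

(The remaining caveat, as discussed above, is that the padded position still absorbs some of the softmax weight, so the real tokens are slightly down-weighted.)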

BTW, some of the lectures by Prof. Lee may help further clarify these details about multi-head attention; please consider checking https://www.youtube.com/@HungyiLeeNTU/search?query=attention if you are interested.
