for i in range(len(self.attention_layers)):
    seqs = torch.transpose(seqs, 0, 1)  # nn.MultiheadAttention expects (seq_len, batch, dim)
    Q = self.attention_layernorms[i](seqs)  # layernorm is applied to the query only
    mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                              attn_mask=attention_mask)
                                              # key_padding_mask=timeline_mask
                                              # need_weights=False) this arg does not work?
    seqs = Q + mha_outputs  # residual connection
    seqs = torch.transpose(seqs, 0, 1)
In the SASRec paper, Section III (Methodology), part B (Self-Attention Block), the formula uses the same embedding object for queries, keys, and values, converting it through linear projections. Why are queries normalized in the code, while keys and values are not?
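For reference, the formula in question defines the self-attention block (up to notation) as

$$\mathbf{S} = \mathrm{SA}(\hat{\mathbf{E}}) = \mathrm{Attention}(\hat{\mathbf{E}}\mathbf{W}^{Q},\ \hat{\mathbf{E}}\mathbf{W}^{K},\ \hat{\mathbf{E}}\mathbf{W}^{V}), \qquad \mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V},$$

so Q, K, and V all come from the same embedding $\hat{\mathbf{E}}$ through separate linear projections, with no layer normalization distinguishing them.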
Personally, I believe you can try Q, K, and V with or without layernorm in your experiments; it's not required either way.
That said, since the query is the projected result of the last layer, it helps to make it numerically stable for easier training.
K and V could also be layernormed, but since they are only used in the dot product that produces the attention weights for the query, I guess layernorm would not greatly affect those weights.
In summary, please try some modifications in your own experiments and draw conclusions from the results, which is the most reliable and fruitful approach. No one is forced to obey all the settings in the current implementation; some of them are empirical ("it works well, so I keep using it this way").
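If you want to run that ablation, here is a minimal sketch (my own code, not the repo's; the `use_kv_norm` flag, the `SelfAttnBlock` class, and the toy dimensions are made up for illustration) that turns K/V normalization into a switch you can flip:

```python
import torch
import torch.nn as nn

class SelfAttnBlock(nn.Module):
    """One self-attention block with optional layernorm on keys/values."""
    def __init__(self, hidden_dim, num_heads=1, dropout=0.2, use_kv_norm=False):
        super().__init__()
        self.q_norm = nn.LayerNorm(hidden_dim, eps=1e-8)
        # Optional: also normalize keys/values, to test the question above.
        self.kv_norm = nn.LayerNorm(hidden_dim, eps=1e-8) if use_kv_norm else None
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, dropout=dropout)

    def forward(self, seqs, attention_mask):
        # nn.MultiheadAttention (without batch_first) expects (seq_len, batch, dim)
        seqs = torch.transpose(seqs, 0, 1)
        Q = self.q_norm(seqs)
        KV = self.kv_norm(seqs) if self.kv_norm is not None else seqs
        mha_outputs, _ = self.attn(Q, KV, KV, attn_mask=attention_mask)
        seqs = Q + mha_outputs  # residual connection, as in the original code
        return torch.transpose(seqs, 0, 1)

# Toy usage: batch of 4 sequences of length 10, hidden size 64.
block = SelfAttnBlock(hidden_dim=64, use_kv_norm=True)
x = torch.randn(4, 10, 64)
# Boolean causal mask: True marks positions that may NOT attend.
causal_mask = ~torch.tril(torch.ones(10, 10, dtype=torch.bool))
out = block(x, causal_mask)
print(out.shape)  # torch.Size([4, 10, 64])
```

Training the model once with `use_kv_norm=True` and once with `use_kv_norm=False` on the same data is the cheapest way to see whether the extra normalization changes the attention weights (and the final metrics) in practice.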