Make sure that all attention works the same #5360
Conversation
LGTM. Just a question about the default scoring function.
@@ -478,7 +473,7 @@ def __init__(
     attention_head_size=key_value_proj_dim,
     num_attention_heads=num_heads,
     output_linear=True,
-    scoring_func="scaled_dot_product",
+    scoring_func="dot_product",
Why do we change the default? The original transformer uses scaled_dot_product, right?
I asked you that on Slack. The original transformer uses scaled_dot_product, as per the paper, but in your implementation (which matches HF), the scaling factor is forced to 1, so it doesn't scale at all. I continue to be confused about this.
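For context, here is a minimal, self-contained sketch (not from the PR) of the two scoring functions under discussion. The function names, tensor shapes, and the `scaling_factor` argument are illustrative assumptions; the point is just that forcing the scaling factor to 1 makes `scaled_dot_product` behave like plain `dot_product`.

```python
import math
from typing import Optional

import torch


def dot_product_score(query: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
    # Plain dot product between each query/key pair: q . k
    return (query * key).sum(dim=-1)


def scaled_dot_product_score(
    query: torch.Tensor, key: torch.Tensor, scaling_factor: Optional[int] = None
) -> torch.Tensor:
    # Scaled dot product: (q . k) / sqrt(scaling_factor).  The scaling factor
    # defaults to the key dimension, as in "Attention Is All You Need".
    if scaling_factor is None:
        scaling_factor = key.size(-1)
    return (query * key).sum(dim=-1) / math.sqrt(scaling_factor)


q = torch.randn(2, 64)
k = torch.randn(2, 64)

# With scaling_factor=1 the "scaled" variant degenerates to the plain dot
# product, which is why forcing the scaling factor to 1 makes the two
# scoring functions equivalent.
assert torch.allclose(
    scaled_dot_product_score(q, k, scaling_factor=1),
    dot_product_score(q, k),
)
```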
Sorry, I missed that. From what I can see, the scaling factor is set to 1 for T5Attention, not for regular SelfAttention. I believe the original T5 does the same. By default, we set a scaling factor for regular SelfAttention.
Right. This is also only for T5Attention. I believe I left it the same by default.
So I guess it's just T5 being extra?
I re-read section 2.1 of the T5 paper, and it doesn't mention this at all 🤷🏼‍♂️.
Yes, a whole lot of finicky little training details aren't mentioned in the 60+ page paper. I think we were following the HF implementation.
@@ -487,8 +482,6 @@ def __init__(
     relative_attention_num_buckets=relative_attention_num_buckets,
 )

 self.attn = Attention.by_name(self.scoring_func)(scaling_factor=1, normalize=False)
@AkshitaB, this is where the scaling factor is forced to 1.
This is for T5Attention.
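For anyone following along, here is a sketch of what that line does, assuming the standard AllenNLP `Registrable` mechanics and that the attention class named by `self.scoring_func` is registered under that name and accepts these constructor arguments, as the diff suggests:

```python
from allennlp.modules.attention import Attention

# by_name() looks up the Attention subclass registered under the given name;
# calling the returned class constructs it with the keyword arguments from
# the diff.  With scaling_factor=1, the scaled variant does no real scaling.
attention_cls = Attention.by_name("scaled_dot_product")  # e.g. the scoring_func discussed above
attn = attention_cls(scaling_factor=1, normalize=False)
```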
Addresses #5345.

- Added ScaledDotProductMatrixAttention, and converted the transformer toolkit to use it
- Made sure that Attention and MatrixAttention implementations are interchangeable
- Changed ScaledDotProductAttention to match the other Attention classes