Long T5 #179

Open · wants to merge 10 commits into base: master
Conversation

HaokunLiu

Based on @AkshitaB's work (#149), this PR extends Longformer to T5. It also adds a test checking that the Longformer T5 produces the same output as the standard T5 on short input texts, as suggested by @ibeltagy in this comment

A quick note about code style: I'm not sure whether this repo has settled on a formatter; I didn't find a dev-requirements.txt, so I kept using the black formatter with my default settings. It automatically reformats the file whenever I save, so you may notice changes like ' -> ", or long lines being broken into multiple lines. I hope it doesn't bother you too much.



class LongformerSelfAttention(nn.Module):
-     def __init__(self, config, layer_id):
+     def __init__(self, config, layer_id, bias=True, attention_dim_scale=True):
Author

The T5 attention module is slightly different from conventional ones: it has no bias terms, and it does not scale the attention scores by the attention head dimension before the softmax. See this list for more details.

Author

With the default options bias=True and attention_dim_scale=True, this just falls back to regular self-attention.
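For reference, a minimal sketch of where the two flags would take effect (illustrative names and shapes, not the PR's code):

import math
import torch
import torch.nn.functional as F
from torch import nn

class SketchSelfAttention(nn.Module):
    # Illustrative only: shows what bias and attention_dim_scale toggle.
    def __init__(self, embed_dim, num_heads, bias=True, attention_dim_scale=True):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.attention_dim_scale = attention_dim_scale
        # T5 projections carry no bias; BERT/RoBERTa-style attention keeps it.
        self.query = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.key = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.value = nn.Linear(embed_dim, embed_dim, bias=bias)

    def forward(self, hidden_states):
        bsz, seq_len, _ = hidden_states.shape
        shape = (bsz, seq_len, self.num_heads, self.head_dim)
        q = self.query(hidden_states).view(shape).transpose(1, 2)
        k = self.key(hidden_states).view(shape).transpose(1, 2)
        v = self.value(hidden_states).view(shape).transpose(1, 2)
        if self.attention_dim_scale:
            # Conventional attention scales by sqrt(head_dim); T5 skips this.
            q = q / math.sqrt(self.head_dim)
        attn = F.softmax(q @ k.transpose(-1, -2), dim=-1)
        return (attn @ v).transpose(1, 2).reshape(bsz, seq_len, -1)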

Collaborator

Please add your comment to the code.

selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000
# concat to attn_weights
- # (bsz, seq_len, num_heads, extra attention count + 2*window+1)
+ # (bsz, seq_len, num_heads, max_num_extra_indices_per_batch + 2*window+1)
Author

changed annotation to be consistent with related annotations below

Collaborator

thanks

@@ -78,28 +89,38 @@ def __init__(self, config, layer_id):
self.attention_dilation = config.attention_dilation[self.layer_id]
self.attention_mode = config.attention_mode
self.autoregressive = config.autoregressive

if hasattr(config, "relative_attention_num_buckets") and layer_id == 0:
Author

In T5, the position bias is shared across layers: the first layer computes it and then passes it on to the remaining layers.
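A hedged sketch of this sharing pattern (the layer below is a stand-in, not the PR's class; a real layer would bucket the relative positions before the embedding lookup):

import torch
from torch import nn

class SketchLayer(nn.Module):
    # Only layer 0 owns the relative-attention-bias table.
    def __init__(self, num_buckets, num_heads, layer_id):
        super().__init__()
        self.has_relative_attention_bias = layer_id == 0
        if self.has_relative_attention_bias:
            self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)

    def forward(self, hidden_states, position_bias=None):
        if position_bias is None and self.has_relative_attention_bias:
            # Layer 0 computes the bias once; zeros stand in for bucketed
            # relative positions here.
            seq_len = hidden_states.size(1)
            buckets = torch.zeros(seq_len, seq_len, dtype=torch.long)
            position_bias = self.relative_attention_bias(buckets).permute(2, 0, 1)
        # ...the attention scores would have position_bias added here...
        return hidden_states, position_bias

layers = nn.ModuleList(SketchLayer(32, 8, i) for i in range(3))
hidden, bias = torch.randn(2, 5, 16), None
for layer in layers:
    hidden, bias = layer(hidden, position_bias=bias)  # layers 1+ reuse layer 0's bias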

Collaborator

Good catch. Please write this comment in the code for more readability.

if output_attentions:
    outputs = outputs + (attn_weights,)
if self.has_relative_attention_bias:
    outputs = outputs + (position_bias,)
Author

This is equivalent to the old output form when self.has_relative_attention_bias=False.

return outputs


def relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
Author

I considered moving this to longformer_encoder_decoder, but that would create a circular import, so it has to stay here.

layer.layer[0].SelfAttention = LongformerSelfAttentionForT5(config, layer_id=i)


class LongformerT5Config(T5Config):
Author

As you can see, we are accumulating many highly similar config classes as we extend to other transformer models. If you like, we can simplify this with a mixin: a separate mixin class would hold all the Longformer-specific settings, and LongformerT5Config would inherit from both the mixin and T5Config.
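One possible shape of that mixin (a sketch only; attribute names are taken from the diff above, the defaults and cooperative super().__init__ chain are illustrative assumptions):

from transformers import T5Config

class LongformerConfigMixin:
    # Hypothetical mixin holding the Longformer-specific settings.
    def __init__(self, attention_window=None, attention_dilation=None,
                 autoregressive=False, attention_mode="sliding_chunks", **kwargs):
        super().__init__(**kwargs)  # forwards the remaining kwargs to T5Config
        self.attention_window = attention_window
        self.attention_dilation = attention_dilation
        self.autoregressive = autoregressive
        self.attention_mode = attention_mode

class LongformerT5Config(LongformerConfigMixin, T5Config):
    pass

config = LongformerT5Config(attention_window=[512] * 6, attention_dilation=[1] * 6)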

Collaborator

I don't have strong feelings about this. You decide (as long as we don't change the interface of the released code)

)
self.output = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

def forward(
Author

An alternative I considered was to let this class inherit from LongformerSelfAttention, but I eventually decided against it: the interfaces of the two classes are quite different. What we have here, i.e., making LongformerSelfAttention a member of LongformerSelfAttentionForT5, is probably less confusing than the alternative.
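A rough sketch of the composition pattern being described (the class name, the simplified forward signature, and the argument translation are illustrative, not the PR's exact code):

from torch import nn
from longformer.longformer import LongformerSelfAttention

class SketchSelfAttentionForT5(nn.Module):
    # Hold a LongformerSelfAttention as a member instead of inheriting from it.
    def __init__(self, config, layer_id):
        super().__init__()
        self.embed_dim = config.d_model
        # T5-style attention: no bias, no sqrt(head_dim) scaling.
        self.longformer_self_attn = LongformerSelfAttention(
            config, layer_id=layer_id, bias=False, attention_dim_scale=False
        )
        self.output = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        # Translate T5's calling convention, run Longformer attention, then
        # apply the bias-free output projection T5 expects.
        outputs = self.longformer_self_attn(hidden_states, attention_mask=attention_mask)
        return (self.output(outputs[0]),) + outputs[1:]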

attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)

if position_bias is None and self.has_relative_attention_bias:
Author

Since the sliding window has already put the attention scores in the form [q_(i) * k_(i-w), q_(i) * k_(i-w+1), ..., q_(i) * k_(i), ..., q_(i) * k_(i+w)], the relative positions are simply an arange.
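In code, the idea could look like the following sketch, which reuses the relative_position_bucket function shown in this diff (the window size and the broadcast comments are illustrative):

import torch

w = 4                                        # one-sided window size (illustrative)
relative_position = torch.arange(-w, w + 1)  # shape: (2*w + 1,), same for every query i
buckets = relative_position_bucket(
    relative_position, bidirectional=True, num_buckets=32, max_distance=128
)
# buckets would then index the shared relative_attention_bias embedding, giving a
# (2*w + 1, num_heads) bias that is broadcast over (bsz, seq_len, num_heads, 2*w + 1)
# and added to attn_weights before the softmax.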

Collaborator

please move this comment to the code.

Collaborator

nit: Maybe also move this block of code to a separate function

perm_global_position_bias = attn_weights.new_zeros(
    bsz, max_num_extra_indices_per_batch, seq_len, self.num_heads
)  # (bsz, max_num_extra_indices_per_batch, seq_len, num_heads)
if extra_attention_mask is not None:
Author

Global position bias is a bit more complex. We first get the memory positions from extra_attention_mask_nonzeros, then compute the query positions using arange; their difference is the relative position. But this is a "sparse" layout, one vector per global token in the batch, so we later put it back into the shape (bsz, max_num_extra_indices_per_batch, ...) using the index information from selection_padding_mask_nonzeros.
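A hedged, self-contained sketch of that bookkeeping (tiny hand-made shapes, and a zero tensor standing in for the bucketed bias lookup; not the PR's exact code):

import torch

bsz, seq_len, num_heads, max_num_extra = 2, 8, 4, 2
extra_attention_mask = torch.zeros(bsz, seq_len, dtype=torch.bool)
extra_attention_mask[0, 0] = True            # one global token in example 0
extra_attention_mask[1, :2] = True           # two global tokens in example 1
extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)

# Relative position of every query with respect to each global (memory) token.
memory_position = extra_attention_mask_nonzeros[1]               # (num_global_tokens,)
query_position = torch.arange(seq_len)                           # (seq_len,)
relative_position = memory_position[:, None] - query_position[None, :]

# The real code would bucket relative_position and look up relative_attention_bias;
# zeros stand in for that lookup here.
sparse_bias = torch.zeros(memory_position.size(0), seq_len, num_heads)

# Scatter the per-global-token vectors back into a padded per-example layout,
# mirroring selection_padding_mask_nonzeros in the diff.
num_extra = extra_attention_mask.sum(dim=1)
selection_padding_mask = torch.arange(max_num_extra)[None, :] < num_extra[:, None]
selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)
perm_global_position_bias = torch.zeros(bsz, max_num_extra, seq_len, num_heads)
perm_global_position_bias[selection_padding_mask_nonzeros] = sparse_bias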

Collaborator

didn't review this part yet.

Collaborator
@ibeltagy left a comment

Looks great, thank you.
I left a few small comments. I didn't review the global attention part yet; I will do that later, maybe today.


    base_model_name_or_path="t5-small",
)
self._run_test(
    INPUT_TEXT="It begins with the Great Hungerer. It ends in utter darkeness.",
Collaborator

:D

def test_outout(self):
    self._run_test(
        INPUT_TEXT="Hello world!",
        long_model_name_or_path="/net/nfs2.s2-research/haokunl/exp_files/model_artifacts/t5/longt5-small-4096",
Collaborator

It would be great if this test worked without the local model. One way to do so is to call create_long_model in the test to convert T5 to long, then test it. It will make the test slower but easier to run.
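A sketch of what such a test could look like. The import paths, the LongformerT5ForConditionalGeneration class name, and the create_long_model signature below are assumptions modeled on the repo's existing BART conversion script, not confirmed against this PR:

import tempfile
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

from scripts.convert_t5_to_longformerencoderdecoder import create_long_model  # assumed path
from longformer.longformer_encoder_decoder_t5 import LongformerT5ForConditionalGeneration  # assumed path


def test_output_without_local_checkpoint():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Convert t5-small on the fly instead of loading a pre-converted
        # checkpoint from a local NFS path.
        create_long_model(
            save_model_to=tmp_dir,
            base_model="t5-small",
            tokenizer_name_or_path="t5-small",
            attention_window=512,
            max_pos=4096,
        )
        tokenizer = T5Tokenizer.from_pretrained("t5-small")
        short_model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
        long_model = LongformerT5ForConditionalGeneration.from_pretrained(tmp_dir).eval()
        batch = tokenizer("Hello world!", return_tensors="pt")
        with torch.no_grad():
            short_ids = short_model.generate(**batch)
            long_ids = long_model.generate(**batch)
        # On short inputs the long model should reproduce the original T5 output.
        assert torch.equal(short_ids, long_ids)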


@@ -5,14 +5,17 @@
import torch.nn.functional as F
from longformer.diagonaled_mm_tvm import diagonaled_mm as diagonaled_mm_tvm, mask_invalid_locations
- from longformer.sliding_chunks import sliding_chunks_matmul_qk, sliding_chunks_matmul_pv
- from longformer.sliding_chunks import sliding_chunks_no_overlap_matmul_qk, sliding_chunks_no_overlap_matmul_pv
+ from longformer.sliding_chunks import (
Collaborator

It is fine that your dev environment reformatted the files. I know it doesn't change the code, but I would feel more comfortable if you ran a small test to make sure the new code produces the same output as the previous one for Longformer.
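One way such a check could look: run a fixed, seeded input through the reformatted LongformerSelfAttention and compare against a reference tensor saved with the pre-reformat code. This is only a sketch; reference.pt is a placeholder file you would generate yourself (with the same seed) before applying this PR:

import torch
from longformer.longformer import LongformerConfig, LongformerSelfAttention

torch.manual_seed(0)
config = LongformerConfig(attention_window=[256] * 12, attention_dilation=[1] * 12,
                          attention_mode="sliding_chunks", autoregressive=False)
layer = LongformerSelfAttention(config, layer_id=0).eval()
hidden_states = torch.randn(1, 1024, config.hidden_size)
attention_mask = torch.zeros(1, 1, 1, 1024)  # 0 = local attention in this repo
with torch.no_grad():
    new_output = layer(hidden_states, attention_mask=attention_mask)[0]
reference = torch.load("reference.pt")       # saved by the pre-reformat code
assert torch.allclose(new_output, reference, atol=1e-6)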

# in T5 attention_probs_dropout_prob is dropout_rate
config.attention_probs_dropout_prob = config.dropout_rate
config.attention_window = [attention_window] * config.num_hidden_layers
config.attention_dilation = [1] * config.num_hidden_layers
Contributor

When increasing the model length, we probably want to increase the number of relative position buckets (config.relative_attention_num_buckets) as well.
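For example, the conversion could bump the bucket count alongside the settings above; the scaling below is purely illustrative, not something this PR prescribes (the T5 default is 32 buckets):

# Illustrative: grow the bucket count together with the maximum length.
config.relative_attention_num_buckets = 2 * config.relative_attention_num_buckets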
