
How to deal with 3-dimensional attention_mask in LongformerSelfAttention #227

khang-nguyen2907 opened this issue Apr 6, 2022 · 0 comments
Hi,

I am building a project that needs Longformer. However, my attention_mask has shape [batch, seq_len, seq_len], not the usual [batch, seq_len]. I am really confused about how to handle this when I see these lines of code:

def forward(self, hidden_states, attention_mask=None, head_mask=None):
    '''
    The `attention_mask` is changed in BertModel.forward from 0, 1, 2 to
        -ve: no attention
         0:  local attention
        +ve: global attention
    '''
    if attention_mask is not None:
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        key_padding_mask = attention_mask < 0
        extra_attention_mask = attention_mask > 0
        remove_from_windowed_attention_mask = attention_mask != 0
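For context, here is my own minimal shape check (not code from the repo) of how I understand the usual 2-D case: BertModel.forward extends a [batch, seq_len] mask to [batch, 1, 1, seq_len], so the two squeezes above simply undo that extension.

import torch

batch, seq_len = 2, 8

# the usual per-token mask: -1 padding, 0 local attention, +1 global attention
mask_2d = torch.zeros(batch, seq_len)
mask_2d[:, 0] = 1        # e.g. global attention on the first token
mask_2d[:, -2:] = -1     # e.g. padding on the last two tokens

# BertModel.forward extends it to [batch, 1, 1, seq_len] before self-attention
extended = mask_2d[:, None, None, :]

# the two squeezes in the snippet above just recover the original [batch, seq_len]
recovered = extended.squeeze(dim=2).squeeze(dim=1)
print(recovered.shape)   # torch.Size([2, 8])

A pairwise mask of shape [batch, seq_len, seq_len] cannot be recovered this way, which is exactly where I get stuck.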

I know that when the computation reaches SelfAttention, an attention_mask of shape [batch, seq_len] is extended as [:, None, None, :], but applying squeeze() like this does not make sense for my [batch, seq_len, seq_len] attention_mask. I also read the Longformer source code in HuggingFace Transformers and ran it with my attention_mask; it also raises an error because of the attention_mask's dimensions, coming from these lines of code:
transformers/models/longformer/modeling_longformer.py#L587-L597

# values to pad for attention probs
remove_from_windowed_attention_mask = (attention_mask != 0)[:, :, None, None]

# cast to fp32/fp16 then replace 1's with -inf
float_mask = remove_from_windowed_attention_mask.type_as(query_vectors).masked_fill(
    remove_from_windowed_attention_mask, -10000.0
)
# diagonal mask with zeros everywhere and -inf inplace of padding
diagonal_mask = self._sliding_chunks_query_key_matmul(
    float_mask.new_ones(size=float_mask.size()), float_mask, self.one_sided_attn_window_size
)
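To make the dimension problem concrete, here is my own shape check (not code from the library): with the usual 2-D mask the indexing above yields a 4-D tensor, but with my pairwise mask it yields a 5-D one.

import torch

batch, seq_len = 2, 8

mask_2d = torch.zeros(batch, seq_len)            # usual per-token mask
mask_3d = torch.zeros(batch, seq_len, seq_len)   # my pairwise mask

print((mask_2d != 0)[:, :, None, None].shape)    # torch.Size([2, 8, 1, 1])
print((mask_3d != 0)[:, :, None, None].shape)    # torch.Size([2, 8, 1, 1, 8])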

If I use my attention_mask, remove_from_windowed_attention_mask ends up with shape [batch, seq_len, 1, 1, seq_len], and ValueError: too many values to unpack (expected 4) is raised when executing these lines of code:
transformers/models/longformer/modeling_longformer.py#L802-L808

def _sliding_chunks_query_key_matmul(self, query: torch.Tensor, key: torch.Tensor, window_overlap: int):
    """
    Matrix multiplication of query and key tensors using with a sliding window attention pattern. This
    implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer) with an
    overlap of size window_overlap
    """
    batch_size, seq_len, num_heads, head_dim = query.size()
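Here is a minimal reproduction of that error with my own dummy tensors (only the shapes matter here):

import torch

# shape that float_mask ends up with when the attention_mask is [batch, seq_len, seq_len]
float_mask = torch.zeros(2, 8, 1, 1, 8)
query = float_mask.new_ones(size=float_mask.size())

# the unpacking inside _sliding_chunks_query_key_matmul then fails:
batch_size, seq_len, num_heads, head_dim = query.size()
# ValueError: too many values to unpack (expected 4)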

In short, in both LongformerSelfAttention implementations I run into trouble because of my 3-dimensional attention_mask. I would be grateful if you could help.

Thanks,
Khang
