
How to deal with 3-dimensional attention_mask in LongformerSelfAttention #227

khang-nguyen2907 opened this issue Apr 6, 2022 · 0 comments
Hi,

I am building a project that needs Longformer. However, my attention_mask has shape [batch, seq_len, seq_len], not the usual [batch, seq_len]. I am really confused about how to handle this when I see these lines of code:

def forward(self, hidden_states, attention_mask=None, head_mask=None):
    '''
    The `attention_mask` is changed in BertModel.forward from 0, 1, 2 to
        -ve: no attention
         0:  local attention
        +ve: global attention
    '''
    if attention_mask is not None:
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        key_padding_mask = attention_mask < 0
        extra_attention_mask = attention_mask > 0
        remove_from_windowed_attention_mask = attention_mask != 0
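For context, here is my own minimal shape check (not code from the repo) of how I understand the usual 2-D case: BertModel.forward extends a [batch, seq_len] mask to [batch, 1, 1, seq_len], so the two squeezes above simply undo that extension.

import torch

batch, seq_len = 2, 8

# the usual per-token mask: -1 padding, 0 local attention, +1 global attention
mask_2d = torch.zeros(batch, seq_len)
mask_2d[:, 0] = 1        # e.g. global attention on the first token
mask_2d[:, -2:] = -1     # e.g. padding on the last two tokens

# BertModel.forward extends it to [batch, 1, 1, seq_len] before self-attention
extended = mask_2d[:, None, None, :]

# the two squeezes in the snippet above just recover the original [batch, seq_len]
recovered = extended.squeeze(dim=2).squeeze(dim=1)
print(recovered.shape)   # torch.Size([2, 8])

A pairwise mask of shape [batch, seq_len, seq_len] cannot be recovered this way, which is exactly where I get stuck.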

I know that when the computation reaches SelfAttention, an attention_mask of shape [batch, seq_len] is extended as [:, None, None, :], but applying squeeze() like this does not make sense for my [batch, seq_len, seq_len] attention_mask. I also read the Longformer source code in HuggingFace Transformers and ran it with my attention_mask; it also raises an error because of the attention_mask's dimensions, coming from these lines of code:
transformers/models/longformer/modeling_longformer.py#L587-L597

# values to pad for attention probs
remove_from_windowed_attention_mask = (attention_mask != 0)[:, :, None, None]

# cast to fp32/fp16 then replace 1's with -inf
float_mask = remove_from_windowed_attention_mask.type_as(query_vectors).masked_fill(
    remove_from_windowed_attention_mask, -10000.0
)
# diagonal mask with zeros everywhere and -inf inplace of padding
diagonal_mask = self._sliding_chunks_query_key_matmul(
    float_mask.new_ones(size=float_mask.size()), float_mask, self.one_sided_attn_window_size
)
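To make the dimension problem concrete, here is my own shape check (not code from the library): with the usual 2-D mask the indexing above yields a 4-D tensor, but with my pairwise mask it yields a 5-D one.

import torch

batch, seq_len = 2, 8

mask_2d = torch.zeros(batch, seq_len)            # usual per-token mask
mask_3d = torch.zeros(batch, seq_len, seq_len)   # my pairwise mask

print((mask_2d != 0)[:, :, None, None].shape)    # torch.Size([2, 8, 1, 1])
print((mask_3d != 0)[:, :, None, None].shape)    # torch.Size([2, 8, 1, 1, 8])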

If I use my attention_mask, remove_from_windowed_attention_mask ends up with shape [batch, seq_len, 1, 1, seq_len], and ValueError: too many values to unpack (expected 4) is raised when executing these lines of code:
transformers/models/longformer/modeling_longformer.py#L802-L808

def _sliding_chunks_query_key_matmul(self, query: torch.Tensor, key: torch.Tensor, window_overlap: int):
    """
    Matrix multiplication of query and key tensors using with a sliding window attention pattern. This
    implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer) with an
    overlap of size window_overlap
    """
    batch_size, seq_len, num_heads, head_dim = query.size()
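Here is a minimal reproduction of that error with my own dummy tensors (only the shapes matter here):

import torch

# shape that float_mask ends up with when the attention_mask is [batch, seq_len, seq_len]
float_mask = torch.zeros(2, 8, 1, 1, 8)
query = float_mask.new_ones(size=float_mask.size())

# the unpacking inside _sliding_chunks_query_key_matmul then fails:
batch_size, seq_len, num_heads, head_dim = query.size()
# ValueError: too many values to unpack (expected 4)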

In short, in both LongformerSelfAttention implementations I run into trouble because of my 3-dimensional attention_mask. I would be grateful if you could help.

Thanks,
Khang
