
Gemma 2 returns NaN when using default attn (sdpa) with padding #32390

Closed · 2 of 4 tasks
chanind opened this issue Aug 2, 2024 · 10 comments

@chanind (Contributor) commented Aug 2, 2024

System Info

Python 3.10
Transformers 4.43.3
Linux (Colab notebook)

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The default Gemma 2 2B attention implementation (sdpa) results in NaN for padding tokens. A simple demo can be seen below (also reproduced in this Colab notebook):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

inputs = tokenizer(["Hello I am a couch", "cats"], return_tensors="pt", padding=True).to('cuda')
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

print(outputs.logits)

This returns the following

tensor([[[-24.3121,  -8.7513,  -6.9736,  ..., -18.3960, -17.4268, -24.3171],
         [-16.8873,  -4.7767,   5.8828,  ...,  -9.4981,  -9.3307, -16.7723],
         [-18.3313,   1.3191,  -4.6598,  ...,  -2.4244,   1.6774, -18.2153],
         [-18.9110,  -5.8708, -11.7827,  ...,  -5.6606,  -4.2607, -18.8535],
         [-20.1359,  -8.4194, -15.1834,  ..., -13.0231, -11.8288, -19.9716],
         [-16.8807,   5.8885,   0.1881,  ...,  -3.7045,  -6.0659, -16.8421]],
        [[     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan],
         [     nan,      nan,      nan,  ...,      nan,      nan,      nan]]],
       device='cuda:0')

This can be fixed by changing the attn_implementation to anything except sdpa.
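For reference, a minimal sketch of loading the model with a non-sdpa implementation (eager here; flash_attention_2 should also work if the flash-attn package is installed):

from transformers import AutoModelForCausalLM

# Load Gemma 2 2B with the eager attention implementation instead of sdpa.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
    attn_implementation="eager",  # or "flash_attention_2"
)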

Expected behavior

Using padding should not result in NaN logits for normal inputs to Gemma 2 2B.

chanind added the bug label Aug 2, 2024
@qubvel (Member) commented Aug 2, 2024

Hi @chanind, thanks for reporting the issue!

This is indeed a problem with scaled_dot_product_attention in PyTorch.

The cause of the NaN is how softmax is computed over fully-masked rows in the attention mask. I hope it will be fixed in future versions of PyTorch; here is a related PR.

Also, a similar issue has been reported previously.

Besides switching to eager/flash_attention_2, you could also try:

1. Use float16 dtype:

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", device_map="auto", torch_dtype=torch.float16
)

2. Modify the attn_mask min value.

As suggested in the issue above, we can modify the attn_mask to use another min value instead of torch.finfo(dtype).min, for example torch.finfo(dtype).min / 2. To apply this, find min_dtype = torch.finfo(dtype).min in the Gemma modeling file and replace it with torch.finfo(dtype).min / 2 (a minimal sketch of the failure mode and this workaround follows below).
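For illustration, here is a minimal standalone sketch (plain PyTorch, not the transformers code path; the exact behavior can vary by sdpa backend and PyTorch version) of why a fully masked row produces NaN and why a large-but-finite bias such as torch.finfo(dtype).min / 2 avoids it:

import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 2, 4)  # (batch, heads, query_len, head_dim)
k = torch.randn(1, 1, 2, 4)
v = torch.randn(1, 1, 2, 4)

# Boolean mask: True = attend, False = masked. Row 0 is fully masked,
# which is what a padding-only query position looks like.
mask = torch.tensor([[[[False, False],
                       [True,  True]]]])

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out[0, 0, 0])  # typically NaN: the masked row's scores are all -inf, and softmax over them is 0/0
print(out[0, 0, 1])  # finite values for the normal row

# With a large-but-finite additive bias the scores stay finite, so softmax
# degrades to a uniform distribution over the masked row instead of NaN.
dtype = q.dtype
bias = torch.zeros(1, 1, 2, 2, dtype=dtype)
bias[0, 0, 0, :] = torch.finfo(dtype).min / 2
out_finite = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
print(out_finite[0, 0, 0])  # finite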

Meanwhile, we will try to fix it on our side, thanks!

ArthurZucker added a commit that referenced this issue Aug 3, 2024
@ArthurZucker (Collaborator) commented Aug 3, 2024

More than this, it's expected, as the sdpa path does not support logit soft-capping (for Gemma 2).
We do already take the sdpa bug into account when creating the mask, @qubvel, see here:

# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
        attention_mask,
        inputs_embeds=input_tensor,
        past_key_values_length=past_seen_tokens,
        is_training=self.training,
    ):
        return None

This should be propagated to Gemma2 (it was not there for some reason, my bad).

@ArthurZucker (Collaborator) commented:
Related to #31303

@qubvel (Member) commented Aug 3, 2024

@ArthurZucker thanks for the updated info!

@yaolu-zjut commented:
Hi, I have met a problem: when I fine-tune Gemma2-2b using transformers.Trainer, the lr is always 0 and grad_norm is nan:
[screenshot of training logs]
So what's wrong? I use the same code to fine-tune llama3-8b and it works well.
These are my settings:
[screenshot of training settings]

@EMZEDI commented Aug 15, 2024

Same issue here when running code that hooks the model's activations. Using float16 made it work.

@ArthurZucker (Collaborator) commented:
Hey! Make sure you are using eager or flash_attention_2, not sdpa!
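(A quick sanity check, sketched under the assumption that model and the padded inputs from the reproduction above are still in scope:)

import torch

# Confirm which attention implementation the loaded model is actually using.
print(model.config._attn_implementation)  # should print "eager" or "flash_attention_2"

# Verify that the padded batch now yields finite logits.
with torch.no_grad():
    outputs = model(**inputs)
assert not torch.isnan(outputs.logits).any()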

@Shengyun-Si commented:
> Hi, I have met a problem: when I fine-tune Gemma2-2b using transformers.Trainer, the lr is always 0 and grad_norm is nan [screenshot]. So what's wrong? I use the same code to fine-tune llama3-8b and it works well. These are my settings: [screenshot]

Hi, I have the same issue. How did you solve it? 😊

@yaolu-zjut commented:
> Hi, I have met a problem: when I fine-tune Gemma2-2b using transformers.Trainer, the lr is always 0 and grad_norm is nan [screenshot]. So what's wrong? I use the same code to fine-tune llama3-8b and it works well. These are my settings: [screenshot]
>
> Hi, I have the same issue. How did you solve it? 😊

Hi, I just use eager instead of sdpa, like this:

model = AutoModelForCausalLM.from_pretrained(
    args.prune_model_path,
    trust_remote_code=True,
    device_map=device_map,
    attn_implementation="eager",
)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
