BlenderBot RuntimeError: CUDA error: device-side assert triggered #9046

Closed
manzar96 opened this issue Dec 11, 2020 · 5 comments · Fixed by #9131

manzar96 commented Dec 11, 2020

Environment info

  • transformers version: 4.0.0
  • Platform: Linux-5.4.0-56-generic-x86_64-with-glibc2.29
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes (GTX 1060 6GB)
  • Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten

Information

Model I am using (Bert, XLNet ...): BlenderbotForConditionalGeneration ('facebook/blenderbot-90M'), along with the corresponding small tokenizer.

The problem arises when using:

I am using my own trainer implementation. I think the problem has to do with the indices of the labels. More specifically, when I use:

outputs = self.model(input_ids=inputs, attention_mask=inputs_att, labels=pad_targets, return_dict=True)

everything works fine, as the "pad_targets" are the targets that use 0 as the index for masked (padded) tokens.
However, when I use:

outputs = self.model(input_ids=inputs, attention_mask=inputs_att, labels=repl_targets, return_dict=True)

and then print outputs['loss'], the following error occurs:

RuntimeError: CUDA error: device-side assert triggered

as the "repl_targets" are the targets using the -100 as the index for masked (padded) tokens.

The aforementioned error also occurs when using the argument:
decoder_input_ids=repl_targets
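
For context: -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labelled -100 should simply not contribute to the loss. A minimal illustration in plain PyTorch (independent of transformers; shapes chosen arbitrarily for this sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 10)                    # (batch, seq_len, vocab_size)
labels = torch.tensor([[1, 2, -100, -100, -100],  # -100 marks padded positions
                       [3, 4, 5, 6, 7]])

# F.cross_entropy defaults to ignore_index=-100, so the masked
# positions contribute nothing to the loss:
loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(loss)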

The task I am working on is:
Dialogue generation on the Empathetic Dialogues dataset.

Expected behavior

I think that there is a problem with the -100 padding index, but I am not sure :)

@patrickvonplaten (Contributor)

Hey @manzar96,

It would be awesome if you could provide a full code snippet that I can copy paste and run to reproduce the error. I am not able to do so with your code above.

Thanks a lot!

manzar96 (Author) commented Dec 11, 2020

I made an example:

import torch
from transformers import BlenderbotSmallTokenizer, \
    BlenderbotForConditionalGeneration

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

model = BlenderbotForConditionalGeneration.from_pretrained('facebook/blenderbot-90M')
model.to(DEVICE)

# Two input sequences; the first is padded to length 14 with token id 0.
inputs = torch.tensor([[14, 49, 42, 626, 2727, 1063, 5, 0, 0, 0, 0, 0, 0, 0],
                       [14, 1322, 7, 1427, 13, 7, 153, 384, 5, 14,
                        18, 64, 7261, 5]], device=DEVICE)

inputs_att = torch.tensor([[1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
                           [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
                          device=DEVICE)

# Targets with -100 at the padded positions (the ignore_index convention):
repl_targets = torch.tensor([[46, 15, 3283, 20, -100, -100, -100, -100, -100, -100,
                              -100, -100, -100, -100],
                             [121, 54, 37, 53, 60, 12, 447, 10, 1427, 15, 51, 11,
                              598, 20]], device=DEVICE)

# The same targets with 0 at the padded positions:
pad_targets = torch.tensor([[46, 15, 3283, 20, 0, 0, 0, 0, 0, 0, 0, 0,
                             0, 0],
                            [121, 54, 37, 53, 60, 12, 447, 10, 1427, 15, 51, 11,
                             598, 20]], device=DEVICE)

outputs = model(input_ids=inputs, attention_mask=inputs_att,
                labels=repl_targets, return_dict=True)
import ipdb; ipdb.set_trace()

If you try printing outputs['loss'], the error occurs. However, if you replace repl_targets with pad_targets everything works fine (but then the loss does not mask the 0-padded positions, so it is not always correct to use).
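
Until this is fixed in the library, one possible user-side workaround (a sketch reusing the tensors above; the shift_tokens_right helper here replicates the shifting Bart applies internally when it derives decoder_input_ids from labels) is to build decoder_input_ids yourself with -100 replaced by the pad token, while still passing repl_targets as labels so the loss keeps ignoring the padded positions:

pad_token_id = model.config.pad_token_id

# replace the loss-masking value with a real token id before shifting
safe_targets = repl_targets.masked_fill(repl_targets == -100, pad_token_id)

def shift_tokens_right(input_ids, pad_token_id):
    # same shift Bart performs internally (see the snippet further down)
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

# passing decoder_input_ids explicitly skips the internal shift of labels;
# labels keep -100 so cross-entropy still ignores the padding
outputs = model(input_ids=inputs, attention_mask=inputs_att,
                decoder_input_ids=shift_tokens_right(safe_targets, pad_token_id),
                labels=repl_targets, return_dict=True)
print(outputs['loss'])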

@patil-suraj (Contributor)

@patrickvonplaten

This is a bug: in Bart, decoder_input_ids are prepared by shifting the labels to the right, but shift_tokens_right doesn't replace -100 with pad_token_id.

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int):
    """
    Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).
    """
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens
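
Concretely, with the repl_targets from the reproduction above, the shifted tensor still contains -100 at the padded positions, and that value is then used as an index into the decoder's embedding table, which is what trips the device-side assert on CUDA. A quick check (assuming pad_token_id is 0 for this model):

shifted = shift_tokens_right(repl_targets, pad_token_id=0)
print((shifted == -100).any())  # tensor(True) -> negative, out-of-range embedding index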

In T5 we automatically replace -100 with pad_token_id when preparing decoder_input_ids.

def _shift_right(self, input_ids):
    decoder_start_token_id = self.config.decoder_start_token_id
    pad_token_id = self.config.pad_token_id

    assert (
        decoder_start_token_id is not None
    ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"

    # shift inputs to the right
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
    shifted_input_ids[..., 0] = decoder_start_token_id

    assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined."
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids
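
One way Bart's shift_tokens_right could mirror that (a hedged sketch borrowing T5's masked_fill_; the actual change is in #9131 and may differ):

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int):
    # replace the loss-masking value with the pad token first, so that
    # no -100 survives into decoder_input_ids
    input_ids = input_ids.masked_fill(input_ids == -100, pad_token_id)
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens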

patil-suraj self-assigned this Dec 14, 2020
@patrickvonplaten (Contributor)

You're right @patil-suraj - do you want to open a PR to fix it in Bart? :-)

@patil-suraj (Contributor)

Yeah!

patrickvonplaten linked a pull request Dec 15, 2020 that will close this issue