In this snippet of code (stanford_alpaca/train.py, lines 90 to 99 at 761dc5b), from what I understand, no padding is actually added: using the "longest" mode on a single sequence is equivalent to adding no padding, as per this doc. Is that right? So the padding for each prompt is added by the data collator instead of here.
I wonder if it would be clearer to just write padding=False here, or to add a comment about it.
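For what it's worth, this is easy to verify in isolation. A minimal sketch (assuming any Hugging Face tokenizer with a pad token set; gpt2 is used here only because it is small to download):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default

# One sequence at a time, as in the snippet in question: "longest" is the
# length of that single sequence, so no pad tokens are inserted and the
# attention mask is all ones.
single = tok("a single prompt", padding="longest", return_tensors="pt")
assert single["attention_mask"].all()

# With a batch of different lengths, "longest" does pad the shorter ones.
batch = tok(["short", "a much longer prompt"], padding="longest", return_tensors="pt")
print(batch["attention_mask"])  # the shorter prompt's row has zeros where pads were added
```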
I think so. Actually, they use dynamic padding via the "DataCollatorForSupervisedDataset". My concern is: should the padding tokens be on the left rather than the right? The other repo, https://github.com/tloen/alpaca-lora, pads to the left, which makes sense for batch training.
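For reference, a rough sketch of what a dynamic right-padding collator of that kind does (the class name and the IGNORE_INDEX constant here are my own, not necessarily the repo's exact code). The key detail is that torch.nn.utils.rnn.pad_sequence only pads on the right, which is why the batches come out right-padded:

```python
from dataclasses import dataclass
from typing import Dict, List

import torch
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


@dataclass
class RightPadCollator:
    """Pads each batch to the length of its longest example (dynamic padding)."""
    pad_token_id: int

    def __call__(self, instances: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
        input_ids = [inst["input_ids"] for inst in instances]
        labels = [inst["labels"] for inst in instances]
        # pad_sequence appends padding at the end of each sequence (right side).
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            # Mask out the pad positions so attention ignores them.
            attention_mask=input_ids.ne(self.pad_token_id),
        )
```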
Agree with @srhthu. I think left padding makes more sense, but train.py uses right padding instead. I think the code they use to train Alpaca is simply not correct for batch training. See the explanation here.
My previous understanding is that batch inference with decoder models requires left padding. But at the fine-tuning stage, right-side padding is okay as long as we set the attention mask correctly and set the labels at pad positions to -100 when calculating the loss.
Is it the case that we can simply use left padding for both training and inference in generation tasks?
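For the inference side, the usual pattern is to switch the tokenizer to left padding before a batched generate call. A sketch, assuming a causal LM with the pad token set to EOS (the model name is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["Tell me a joke.", "Write a haiku about padding."]
# Left padding keeps the last real prompt token adjacent to the first
# generated token, so every row in the batch continues from the right place.
batch = tok(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=32,
        pad_token_id=tok.pad_token_id,
    )
print(tok.batch_decode(out, skip_special_tokens=True))
```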