
New tokenizer code in transformers 3.0.0 is creating an error with old code #5377

Closed
llStringll opened this issue Jun 29, 2020 · 11 comments · Fixed by #5479

Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

@llStringll

🐛 Bug

Information

Model I am using (Bert, XLNet ...): BERT and GPT-2 for a Poly-encoder implementation

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)
    Ubuntu V2.0 corpus dataset; implementing the Poly-encoder pipeline. Everything was done; I was re-training the model to verify the results of the first training run.

To reproduce

Steps to reproduce the behavior:
Happens when using some_tokenizer_fromHF.encode_plus(). Below is the eval script to test on custom input text; it is exactly the same as the one that reads from the dataset during training (simplified for eval):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_len = 128  # example value for the max sequence length

c_text = "what is your name. I am Gloid"
context = tokenizer.encode_plus(c_text,
                                text_pair=None,
                                add_special_tokens=True,
                                max_length=max_len,
                                pad_to_max_length=False)

texts = ["Obama", "Trump", "Eminem", "slender man", "Pewdiepie"]
for text in texts:
    tokenized_dict = tokenizer.encode_plus(text,
                                           text_pair=None,
                                           add_special_tokens=True,
                                           max_length=max_len,
                                           pad_to_max_length=True)

The error is -
"Truncation was not explicitely activated but max_length is provided a specific value, please use truncation=True to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior."
Repeated a total of 6 times, i.e., once for every sequence passed into encode_plus.

Expected behavior

Not to give this error, and to return the input ids, segment ids, and input masks.
The issue is completely identical to the closed issue #5155.
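
For reference, the message itself points at the fix: in v3.0.0, passing truncation=True (equivalent to the 'longest_first' strategy) together with max_length makes the truncation explicit and silences the message. A minimal sketch of the loop call above with that flag added:

tokenized_dict = tokenizer.encode_plus(text,
                                       text_pair=None,
                                       add_special_tokens=True,
                                       max_length=max_len,
                                       truncation=True,  # explicit, so no default strategy has to be picked
                                       pad_to_max_length=True)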

Environment info

  • transformers version: 3.0.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.1+cu101 (True)
  • Tensorflow version (GPU?): 2.2.0 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
@LysandreJik LysandreJik added the Core: Tokenization Internals of the library; Tokenization. label Jun 29, 2020
@LysandreJik
Member

Hi, this is not an error but a warning. If you want to disable warnings, you can use the following:

import logging

logging.basicConfig(level=logging.ERROR)
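
If you only want to quiet the transformers messages rather than all warnings, you can instead raise the level of the library's own loggers (a sketch assuming the standard transformers.* logger naming used in v3.0.0):

import logging

# Quiet only the transformers loggers; other libraries keep their warnings.
logging.getLogger("transformers").setLevel(logging.ERROR)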

@llStringll
Author

Oh, I'm sorry for writing "error" everywhere. But I want to know: is this default behaviour correct for BERT? It says by default it'll use only_first.

@LysandreJik
Member

You can read the documentation concerning that method here.

Here's the part you're probably interested in:

‘only_first’: truncate to a max length specified in max_length or to the max acceptable input length for the model if no length is provided (max_length=None). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,

Since you're only using a single sentence, it seems to be what you're looking for?
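
To make the difference concrete, here is a small sketch (the sentences are made up) contrasting the two strategies on a pair:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
first = "the first sentence in this pair is a reasonably long one"
second = "the second sentence in this pair is also a fairly long one"

only_first = tokenizer.encode_plus(first, second, max_length=16,
                                   truncation='only_first')
longest_first = tokenizer.encode_plus(first, second, max_length=16,
                                      truncation='longest_first')

# Both results fit in 16 tokens, but 'only_first' takes every removed token
# from the first sentence, while 'longest_first' trims whichever sentence is
# longest at each step, so here it trims both of them.
print(len(only_first['input_ids']), len(longest_first['input_ids']))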

@llStringll
Author

llStringll commented Jun 29, 2020

I am also concatenating multiple "context" sequences using the [SEP] token. I'm just feeding sequences into encode_plus, stripping off the [CLS] token at the beginning, and concatenating the rest with the previous one, making it
[CLS]seq1[SEP]seq2[SEP]
I assume that even earlier, in the older version, when this warning wasn't being logged to the terminal, it still used only_first, did it?

@LysandreJik
Member

Is there a reason why you're not using the encode_plus method with your pairs? The tokenizer will automatically build the pairs as the model expects them.

If you pass a single sequence and want to build them with the special tokens yourself, you can use the add_special_tokens=False flag. No need to strip the special tokens then.

Be careful: since you're building the sequences yourself, if you truncate them, some of the special tokens might be cut off.
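
For completeness, a small sketch of the two options above (the example strings are illustrative): either let encode_plus build the pair with its special tokens, or encode without special tokens and assemble the sequence yourself so nothing has to be stripped afterwards.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Option 1: pass the second text as text_pair and let the tokenizer
# produce [CLS] seq1 [SEP] seq2 [SEP] on its own.
pair = tokenizer.encode_plus("what is your name. I am Gloid",
                             text_pair="Obama",
                             add_special_tokens=True,
                             max_length=128,
                             truncation=True)

# Option 2: encode without special tokens and add them manually.
ids1 = tokenizer.encode("what is your name. I am Gloid", add_special_tokens=False)
ids2 = tokenizer.encode("Obama", add_special_tokens=False)
input_ids = [tokenizer.cls_token_id] + ids1 + [tokenizer.sep_token_id] \
            + ids2 + [tokenizer.sep_token_id]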

@llStringll
Author

llStringll commented Jun 29, 2020

This is my snippet; "texts" is a list of strings:

def __call__(self, texts):
    input_ids_list, segment_ids_list, input_masks_list = [], [], []

    for text in texts[::-1][:self.max_history]:
      tokenized_dict = self.tokenizer.encode_plus(text,
                                                  text_pair=None,
                                                  add_special_tokens=True,
                                                  max_length=self.max_len,
                                                  pad_to_max_length=False)
      input_ids, input_masks = tokenized_dict['input_ids'], tokenized_dict['attention_mask']
      segment_ids = [1] * len(input_ids)
      if len(input_ids_list) > 0:
        input_ids = input_ids[1:]
        segment_ids = segment_ids[1:]
        input_masks = input_masks[1:]
      input_ids_list.extend(input_ids)
      segment_ids_list.extend(segment_ids)
      input_masks_list.extend(input_masks)

      if len(input_ids_list) >= self.max_len:
        input_ids_list = input_ids_list[:self.max_len - 1] + [self.sep_id]
        segment_ids_list = segment_ids_list[:self.max_len]
        input_masks_list = input_masks_list[:self.max_len]
        break
    input_ids_list += [self.pad_id] * (self.max_len - len(input_ids_list))
    segment_ids_list += [0] * (self.max_len - len(segment_ids_list))
    input_masks_list += [0] * (self.max_len - len(input_masks_list))

    assert len(input_ids_list) == self.max_len
    assert len(segment_ids_list) == self.max_len
    assert len(input_masks_list) == self.max_len

    return input_ids_list, segment_ids_list, input_masks_list

I'm not truncating anything after I create my full sequence. That suggestion about using text_pair was great; I don't know why I didn't think of that. Thank you.
PS - Is this way of creating input ids correct? And in the older version, when this warning wasn't being logged to the terminal, was it using only_first even then?

@LysandreJik
Member

In that snippet, are you trying to concatenate a lot of sequences together? If you have 10 sequences in your text, do you want to have a giant input_ids_list containing all 10 sequences separated by a separator token?

@llStringll
Author

Yes, exactly, that's what I am doing, and I am then stripping off the earlier part (which ends up being the later part, because I am flipping the list too). Basically the list is a conversation, and I am making a new token list out of the most recent N words of the conversation.

@amaiya

amaiya commented Jul 2, 2020

For anyone stumbling across this issue and having problems with sentence pair classification in v3.0.0:

In v3.0.0, the default truncation strategy was changed, which causes code that used to work in v2.11.0 to break in some cases.
v2.11.0: default truncation strategy is longest_first
v3.0.0: truncation strategy appears to default to only_first

For sentence pair classification in v3.0.0, this can result in a failure to truncate the sentence pair to the supplied max_length parameter, which can break a downstream model or other code:

W0702 12:56:50.435204 140139424331584 tokenization_utils_base.py:1447] Truncation was not explicitely activated but 
`max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length.
 Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may 
want to check this is the right behavior.
E0702 12:56:50.437675 140139424331584 tokenization_utils.py:784] We need to remove 25 to truncate the input but the first
 sequence has a length 17. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 
'longest_first' or 'only_second'.

For example, the following code prints 32 in v2.11.0, but 57 in v3.0.0:

text_a = '''Debunk this: Six Corporations Control $NUMBER$% Of The Media In America'''
text_b = '''
I can't believe people are missing the two obvious flaws in this analysis. 
This infographic doesn't show that $NUMBER$ companies control $NUMBER$% of the media. '''
from transformers import *
t = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
output = t.encode_plus(text_a, text_b, max_length=32)
print(len(output['input_ids']))

The solution is to explicitly provide truncation='longest_first', as indicated in the warning:

output = t.encode_plus(text_a, text_b, max_length=32, truncation='longest_first')
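
As a quick check (reusing the t, text_a and text_b defined above), the explicit strategy brings the pair back down to the requested budget:

output = t.encode_plus(text_a, text_b, max_length=32, truncation='longest_first')
print(len(output['input_ids']))  # 32 again, matching the v2.11.0 behaviour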

@thomwolf
Member

thomwolf commented Jul 2, 2020

Good point. We will release a patch to fix this breaking change (move back to having longest_first as default) plus the one mentioned in #5447 probably tomorrow or early next week.

@amaiya

amaiya commented Jul 2, 2020

@thomwolf: Thanks - changing the default back to longest_first may also address #5460
