
New tokenizer code in transformers 3.0.0 is creating an error with old code #5377

Closed
llStringll opened this issue Jun 29, 2020 · 11 comments · Fixed by #5479

Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

@llStringll

🐛 Bug

Information

Model I am using (Bert, XLNet ...): BERT and GPT-2 for a Poly-encoder implementation

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)
    Ubuntu V2.0 corpus dataset; implementing the Poly-encoder pipeline. Everything was done; I was re-training the model to verify the results of the first training run.

To reproduce

Steps to reproduce the behavior:
Happens when using some_tokenizer_fromHF.encode_plus(). Below is the eval script to test on custom input text; it is exactly the same as the one that reads from the dataset during training (simplified for eval):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_len = 128  # example value for the max sequence length

c_text = "what is your name. I am Gloid"
context = tokenizer.encode_plus(c_text,
                                text_pair=None,
                                add_special_tokens=True,
                                max_length=max_len,
                                pad_to_max_length=False)

texts = ["Obama", "Trump", "Eminem", "slender man", "Pewdiepie"]
for text in texts:
    tokenized_dict = tokenizer.encode_plus(text,
                                           text_pair=None,
                                           add_special_tokens=True,
                                           max_length=max_len,
                                           pad_to_max_length=True)

The error is -
"Truncation was not explicitely activated but max_length is provided a specific value, please use truncation=True to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior."
Repeated a total of 6 times, i.e., once for every sequence passed into encode_plus.

Expected behavior

Not to give this error, and to return the input ids, segment ids, and input masks.
The issue is completely identical to the closed issue #5155.
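
For reference, the message itself points at the fix: in v3.0.0, passing truncation=True (equivalent to the 'longest_first' strategy) together with max_length makes the truncation explicit and silences the message. A minimal sketch of the loop call above with that flag added:

tokenized_dict = tokenizer.encode_plus(text,
                                       text_pair=None,
                                       add_special_tokens=True,
                                       max_length=max_len,
                                       truncation=True,  # explicit, so no default strategy has to be picked
                                       pad_to_max_length=True)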

Environment info

  • transformers version: 3.0.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.1+cu101 (True)
  • Tensorflow version (GPU?): 2.2.0 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
@LysandreJik LysandreJik added the Core: Tokenization Internals of the library; Tokenization. label Jun 29, 2020
@LysandreJik
Member

Hi, this is not an error but a warning. If you want to disable warnings, you can use the following:

import logging

logging.basicConfig(level=logging.ERROR)
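
If you only want to quiet the transformers messages rather than all warnings, you can instead raise the level of the library's own loggers (a sketch assuming the standard transformers.* logger naming used in v3.0.0):

import logging

# Quiet only the transformers loggers; other libraries keep their warnings.
logging.getLogger("transformers").setLevel(logging.ERROR)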

@llStringll
Author

Oh, I'm sorry for writing "error" everywhere. But I want to know: is this default behaviour correct for BERT? It says by default it'll use only_first.

@LysandreJik
Member

You can read the documentation concerning that method here.

Here's the part you're probably interested in:

‘only_first’: truncate to a max length specified in max_length or to the max acceptable input length for the model if no length is provided (max_length=None). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,

Since you're only using a single sentence, it seems to be what you're looking for?
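
To make the difference concrete, here is a small sketch (the sentences are made up) contrasting the two strategies on a pair:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
first = "the first sentence in this pair is a reasonably long one"
second = "the second sentence in this pair is also a fairly long one"

only_first = tokenizer.encode_plus(first, second, max_length=16,
                                   truncation='only_first')
longest_first = tokenizer.encode_plus(first, second, max_length=16,
                                      truncation='longest_first')

# Both results fit in 16 tokens, but 'only_first' takes every removed token
# from the first sentence, while 'longest_first' trims whichever sentence is
# longest at each step, so here it trims both of them.
print(len(only_first['input_ids']), len(longest_first['input_ids']))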

@llStringll
Author

llStringll commented Jun 29, 2020

I am also concatenating multiple "context" sequences using the [SEP] token. I'm just feeding sequences into encode_plus, stripping off the [CLS] token at the beginning, and concatenating the rest with the previous one, making it
[CLS]seq1[SEP]seq2[SEP]
I assume that even earlier, in the older version, when this warning wasn't being logged to the terminal, it still used only_first, did it?

@LysandreJik
Member

Is there a reason why you're not using the encode_plus method with your pairs? The tokenizer will automatically build the pairs as the model expects them.

If you pass a single sequence and want to build them with the special tokens yourself, you can use the add_special_tokens=False flag. No need to strip the special tokens then.

Be careful: since you're building the sequences yourself, if you truncate them, some of the special tokens might be cut off.
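
For completeness, a small sketch of the two options above (the example strings are illustrative): either let encode_plus build the pair with its special tokens, or encode without special tokens and assemble the sequence yourself so nothing has to be stripped afterwards.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Option 1: pass the second text as text_pair and let the tokenizer
# produce [CLS] seq1 [SEP] seq2 [SEP] on its own.
pair = tokenizer.encode_plus("what is your name. I am Gloid",
                             text_pair="Obama",
                             add_special_tokens=True,
                             max_length=128,
                             truncation=True)

# Option 2: encode without special tokens and add them manually.
ids1 = tokenizer.encode("what is your name. I am Gloid", add_special_tokens=False)
ids2 = tokenizer.encode("Obama", add_special_tokens=False)
input_ids = [tokenizer.cls_token_id] + ids1 + [tokenizer.sep_token_id] \
            + ids2 + [tokenizer.sep_token_id]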

@llStringll
Author

llStringll commented Jun 29, 2020

This is my snippet; "texts" is a list of strings:

def __call__(self, texts):
    input_ids_list, segment_ids_list, input_masks_list = [], [], []

    for text in texts[::-1][:self.max_history]:
      tokenized_dict = self.tokenizer.encode_plus(text,
                                                  text_pair=None,
                                                  add_special_tokens=True,
                                                  max_length=self.max_len,
                                                  pad_to_max_length=False)
      input_ids, input_masks = tokenized_dict['input_ids'], tokenized_dict['attention_mask']
      segment_ids = [1] * len(input_ids)
      if len(input_ids_list) > 0:
        input_ids = input_ids[1:]
        segment_ids = segment_ids[1:]
        input_masks = input_masks[1:]
      input_ids_list.extend(input_ids)
      segment_ids_list.extend(segment_ids)
      input_masks_list.extend(input_masks)

      if len(input_ids_list) >= self.max_len:
        input_ids_list = input_ids_list[:self.max_len - 1] + [self.sep_id]
        segment_ids_list = segment_ids_list[:self.max_len]
        input_masks_list = input_masks_list[:self.max_len]
        break
    input_ids_list += [self.pad_id] * (self.max_len - len(input_ids_list))
    segment_ids_list += [0] * (self.max_len - len(segment_ids_list))
    input_masks_list += [0] * (self.max_len - len(input_masks_list))

    assert len(input_ids_list) == self.max_len
    assert len(segment_ids_list) == self.max_len
    assert len(input_masks_list) == self.max_len

    return input_ids_list, segment_ids_list, input_masks_list

I'm not truncating anything after I create my full sequence. That suggestion about using text_pair was great; I don't know why I didn't think of that. Thank you.
PS - Is this way of creating input ids correct? And in the older version, when this warning wasn't being logged to the terminal, was it using only_first even then?

@LysandreJik
Member

In that snippet, are you trying to concatenate a lot of sequences together? If you have 10 sequences in your text, do you want to have a giant input_ids_list containing all 10 sequences separated by a separator token?

@llStringll
Author

Yes, exactly, that's what I am doing, and I am then stripping off the earlier part (which ends up being the later part, because I am flipping the list too). Basically the list is a conversation, and I am making a new token list out of the most recent N words of the conversation.

@amaiya

amaiya commented Jul 2, 2020

For anyone stumbling across this issue and having problems with sentence pair classification in v3.0.0:

In v3.0.0, the default truncation strategy was changed, which causes code that used to work in v2.11.0 to break in some cases.
v2.11.0: default truncation strategy is longest_first
v3.0.0: truncation strategy appears to default to only_first

For sentence pair classification in v3.0.0, this can result in a failure to truncate the sentence pair to the supplied max_length parameter, which can break a downstream model or other code:

W0702 12:56:50.435204 140139424331584 tokenization_utils_base.py:1447] Truncation was not explicitely activated but 
`max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length.
 Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may 
want to check this is the right behavior.
E0702 12:56:50.437675 140139424331584 tokenization_utils.py:784] We need to remove 25 to truncate the input but the first
 sequence has a length 17. Please select another truncation strategy than TruncationStrategy.ONLY_FIRST, for instance 
'longest_first' or 'only_second'.

For example, the following code prints 32 in v2.11.0, but 57 in v3.0.0:

text_a = '''Debunk this: Six Corporations Control $NUMBER$% Of The Media In America'''
text_b = '''
I can't believe people are missing the two obvious flaws in this analysis. 
This infographic doesn't show that $NUMBER$ companies control $NUMBER$% of the media. '''
from transformers import *
t = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
output = t.encode_plus(text_a, text_b, max_length=32)
print(len(output['input_ids']))

The solution is to explicitly provide truncation='longest_first', as indicated in the warning:

output = t.encode_plus(text_a, text_b, max_length=32, truncation='longest_first')
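
As a quick check (reusing the t, text_a and text_b defined above), the explicit strategy brings the pair back down to the requested budget:

output = t.encode_plus(text_a, text_b, max_length=32, truncation='longest_first')
print(len(output['input_ids']))  # 32 again, matching the v2.11.0 behaviour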

@thomwolf
Member

thomwolf commented Jul 2, 2020

Good point. We will release a patch to fix this breaking change (move back to having longest_first as default) plus the one mentioned in #5447 probably tomorrow or early next week.

@amaiya

amaiya commented Jul 2, 2020

@thomwolf: Thanks - changing the default back to longest_first may also address #5460
