New tokenizer code in transformers 3.0.0 is creating an error with old code #5377
Comments
Hi, this is not an error but a warning. If you want to disable warnings, you can use the following: import logging; logging.basicConfig(level=logging.ERROR)
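A minimal, self-contained sketch of that suggestion; the narrower "transformers" logger name is an assumption about where the warning is emitted:

```python
# Sketch of the workaround above: hide WARNING-level messages by raising the log level.
import logging

# Global option, as suggested: silence warnings from every logger.
logging.basicConfig(level=logging.ERROR)

# Narrower alternative (assumption: the warning goes through the "transformers"
# logger hierarchy): silence only messages coming from transformers.
logging.getLogger("transformers").setLevel(logging.ERROR)
```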
Oh, I'm sorry for writing "error" everywhere, but I want to know: is this default behaviour correct for BERT? It says that by default it'll use only_first.
You can read the documentation concerning that method here. Here's the part you're probably interested in: with the 'only_first' strategy, only the first sequence of a pair is truncated.
Since you're only using a single sentence, it seems to be what you're looking for?
I am also concatenating multiple "context" sequences using the [SEP] token. I'm just feeding each sequence into encode_plus, stripping off the [CLS] token at the beginning, and concatenating the rest onto the previous one, making it one long sequence.
Is there a reason why you're not using the text_pair argument? If you pass a single sequence and want to build the inputs with the special tokens yourself, you can do that, but be careful: since you're building the sequences yourself, if you truncate them, some of the special tokens might be cut off.
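A hedged sketch of that suggestion, assuming a BERT tokenizer; the model name and strings are placeholders:

```python
# Sketch only: let the tokenizer build the pair itself via text_pair
# instead of concatenating encoded sequences manually.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

context = "how do I get my wifi card working on 16.04 ?"
response = "try installing the proprietary driver from additional drivers"

encoded = tokenizer.encode_plus(
    context,
    text_pair=response,          # tokenizer builds [CLS] ... [SEP] ... [SEP]
    max_length=64,
    truncation="longest_first",  # explicit strategy, see the discussion below
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```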
This is my snippet, "texts" is a list of strings
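A rough illustration of what such a snippet might look like (not the original code; the model name and example strings are placeholders):

```python
# Illustrative reconstruction of the approach described above: encode each
# string, drop the leading [CLS], keep the trailing [SEP], and concatenate.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

texts = ["my sound card is not detected", "did you check alsamixer ?"]

input_ids = [tokenizer.cls_token_id]
for text in texts:
    ids = tokenizer.encode_plus(text)["input_ids"]  # [CLS] ... [SEP]
    input_ids.extend(ids[1:])                       # strip [CLS], keep [SEP]

print(tokenizer.convert_ids_to_tokens(input_ids))
```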
I'm not truncating anything after I create my full sequence. That suggestion about using text_pair was great, I don't know why I didn't think of that. Thank you
In that snippet, are you trying to concatenate a lot of sequences together? If you have 10 sequences in your text, do you want to end up with one giant sequence?
Yes, exactly, that's what I am doing, and I am then stripping off the earlier part, because I am flipping the list too. Basically the list is a conversation, and I am making a new token list out of the most recent N words of the conversation.
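A small sketch of that "most recent N tokens of the conversation" idea; the budget and helper name are made up for illustration:

```python
# Hypothetical helper: keep only the most recent part of a concatenated
# conversation while preserving the leading [CLS] token.
MAX_CONTEXT_TOKENS = 128  # illustrative budget

def keep_recent(input_ids, cls_id, budget=MAX_CONTEXT_TOKENS):
    body = input_ids[1:]                    # everything after [CLS]
    return [cls_id] + body[-(budget - 1):]  # most recent tokens only
```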
For anyone stumbling across this issue and having problems with sentence pair classification in v3.0.0: in v3.0.0, the default truncation strategy was changed, which causes code that used to work in v2.11.0 to break in some cases. For sentence pair classification in v3.0.0, this can result in a failure to truncate the sentence pair to the supplied max_length.
For example, the following code prints 32 in v2.11.0, but 57 in v3.0.0:

from transformers import *

text_a = '''Debunk this: Six Corporations Control $NUMBER$% Of The Media In America'''
text_b = '''
I can't believe people are missing the two obvious flaws in this analysis.
This infographic doesn't show that $NUMBER$ companies control $NUMBER$% of the media. '''

t = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
output = t.encode_plus(text_a, text_b, max_length=32)
print(len(output['input_ids']))

The solution is to explicitly provide the truncation strategy:

output = t.encode_plus(text_a, text_b, max_length=32, truncation='longest_first')
Good point. We will release a patch to fix this breaking change (move back to the previous default truncation behavior).
🐛 Bug
Information
Model I am using (Bert, XLNet ...): BERT and GPT-2 for a Poly-encoder implementation
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The task I am working on is:
The Ubuntu V2.0 corpus dataset, implementing a Poly-encoder pipeline. Everything was already done; I was re-training the model to verify the results of the first training run.
To reproduce
Steps to reproduce the behavior:
Happens when using some_tokenizer_fromHF.encode_plus(). Below is the eval script to test on custom input text; it is exactly the same as the one that reads from the dataset during training (simplified for eval).
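A rough stand-in for such an eval loop (not the original script; the model name, inputs, and max_length are illustrative). In 3.0.0 it triggers the warning quoted below because max_length is set while truncation is not:

```python
# Illustrative stand-in for the eval script described above.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "anyone knows how to mount an ntfs drive ?",
    "sudo mount -t ntfs-3g /dev/sdb1 /mnt",
]

for text in texts:
    # max_length given, truncation left unset -> warning in transformers 3.0.0
    encoded = tokenizer.encode_plus(text, max_length=32)
    input_ids = encoded["input_ids"]
    token_type_ids = encoded["token_type_ids"]
    attention_mask = encoded["attention_mask"]
```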
The error is -
"Truncation was not explicitely activated but
max_length
is provided a specific value, please usetruncation=True
to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior."Repeated a total of 6 times, i.e., for every sequence passed into encode_plus
Expected behavior
Not to give this error, and to return the input ids, segment ids, and input masks.
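A short sketch of the workaround that produces the expected outputs (assuming a BERT tokenizer; the input string is a placeholder): passing truncation explicitly silences the message and returns the three outputs mentioned above.

```python
# Sketch only: explicit truncation avoids the warning in transformers 3.0.0.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer.encode_plus("some custom input text",
                                max_length=32,
                                truncation=True)
input_ids = encoded["input_ids"]            # input ids
token_type_ids = encoded["token_type_ids"]  # segment ids
attention_mask = encoded["attention_mask"]  # input mask
```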
The issue is completely identical to the closed issue - #5155
Environment info
transformers version: 3.0.0