Problem with special tokens #1
Hi there!

How did you handle those special tokens?

additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

Just passing them as the additional_special_tokens parameter to BertTokenizer.from_pretrained doesn't seem to have any effect. When we actually tokenize the texts, these special tokens get tokenized too:

['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned', 'in', 'october', 'last', 'year', 'to', 'form', 'the', '[', 'e', '##21', ']', 'all', 'bas', '##otho', 'convention', '[', 'e', '##22', ']', '(', 'abc', ')', ',', 'crossing', 'the', 'floor', 'with', '17', 'members', 'of', 'parliament', ',', 'causing', 'constitutional', 'monarch', 'king', 'lets', '##ie', 'iii', 'to', 'dissolve', 'parliament', 'and', 'call', 'the', 'snap', 'election', '.']

The other repository you used as a reference seemed to have an issue with this too: hint-lab/bert-relation-classification#4

I'm trying to manually call

tokenizer.add_special_tokens({'additional_special_tokens': additional_special_tokens})

but when do_lower_case is true, the tokens get lowercased as well and hit [UNK] when converting to ids.

Thanks!
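For context, here is a minimal sketch of the registration step being discussed, assuming a transformers 2.x-style API; the checkpoint name, the sample sentence, and the commented-out model call are illustrative and not taken from the thread:

from transformers import BertTokenizer

additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

# Register the entity markers at load time; the same list can also be
# registered afterwards via
# tokenizer.add_special_tokens({'additional_special_tokens': additional_special_tokens}).
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    additional_special_tokens=additional_special_tokens,
)

# When registration works, the markers survive tokenization intact
# instead of being split into '[', 'e', '##11', ']'.
tokens = tokenizer.tokenize("[E11] Tom Thabane [E12] resigned in October last year.")
print(tokens)

# The markers should also map to real vocabulary ids, not the [UNK] id.
print(tokenizer.convert_tokens_to_ids(additional_special_tokens))
print(tokenizer.unk_token_id)

# Any downstream model needs its embedding matrix resized to cover the
# newly added tokens (model is a placeholder for a loaded BertModel):
# model.resize_token_embeddings(len(tokenizer))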
Comments

Hi @pvcastro, which version of the transformers library are you using? In my local environment with transformers version 2.8.0, the tokenizer works fine. I've put a screenshot below for your reference.
Indeed, it looks strange. I don't know what is happening here. Maybe you can restart the IPython kernel and run the pipeline from scratch again?
Same problem. Can you tell me the version of your tokenizers package?
My tokenizers library version is 0.5.2.
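Both versions can be confirmed directly from Python; a trivial check, not from the original thread:

import transformers
import tokenizers

# Print the installed versions of the two packages under discussion.
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)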
Maybe you have a cached tokenizer with the additional special tokens saved to it? 🤔 It only performs an assert; it doesn't save them anywhere.
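One way to rule out a stale cached tokenizer, sketched here as a suggestion rather than a confirmed fix (the checkpoint name is again an assumption), is to re-download it while bypassing the local cache and then inspect which special tokens it actually carries:

from transformers import BertTokenizer

# force_download=True bypasses any locally cached tokenizer files.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", force_download=True)

# A fresh download should list only the markers registered in this
# session (i.e., an empty list if none have been added yet).
print(tokenizer.additional_special_tokens)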
I don't think I cache the tokenizer intentionally. Maybe the transformers library does that automatically? I think you can ask this question in the transformers repository and get some support from its developers. If you figure it out later, please let me know. Thanks.
I'll do that, @mickeystroller. Would you mind running transformers-cli env on your side as well?
Below is the transformers-cli env output:
Thanks!
Hi @mickeystroller, how are you? It's been a week since I opened the issue, and there are no replies from the transformers team yet.