Problem with special tokens #1

pvcastro · 2020-05-07T10:37:49Z

Hi there!

How did you handle those special tokens?
additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

Just passing them as an 'additional_special_tokens' parameter to BertTokenizer.from_pretrained doesn't seem to have any effect. When we actually tokenize the texts, these special tokens get split into word pieces too:

['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned', 'in', 'october', 'last', 'year', 'to', 'form', 'the', '[', 'e', '##21', ']', 'all', 'bas', '##otho', 'convention', '[', 'e', '##22', ']', '(', 'abc', ')', ',', 'crossing', 'the', 'floor', 'with', '17', 'members', 'of', 'parliament', ',', 'causing', 'constitutional', 'monarch', 'king', 'lets', '##ie', 'iii', 'to', 'dissolve', 'parliament', 'and', 'call', 'the', 'snap', 'election', '.']

The other repository you used as a reference seemed to have an issue with this too:
hint-lab/bert-relation-classification#4

I'm trying to manually call
tokenizer.add_special_tokens({'additional_special_tokens': additional_special_tokens})
But when do_lower_case is true, the tokens get lowercased as well, and they hit UNK when converting to ids.
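To illustrate the mismatch described above: if lowercasing runs over the whole string, "[E11]" becomes "[e11]" and no longer matches the entry that was added to the vocabulary. The sketch below is plain Python and is not how the transformers library implements this; `lowercase_except_special` is a hypothetical helper that only shows the idea of lowercasing everything except the markers.

```python
import re

# Marker strings from this issue; everything else here is illustrative.
SPECIAL_TOKENS = ["[E11]", "[E12]", "[E21]", "[E22]"]


def lowercase_except_special(text, special_tokens=SPECIAL_TOKENS):
    """Lowercase `text` but leave each special token exactly as written."""
    # The capturing group makes re.split keep the matched markers in the result.
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    pieces = re.split(pattern, text)
    return "".join(p if p in special_tokens else p.lower() for p in pieces)


print(lowercase_except_special("[E11] Tom Thabane [E12] resigned in October"))
# [E11] tom thabane [E12] resigned in october
```

Whether the markers then map to ids other than UNK still depends on the tokenizer actually having them in its vocabulary.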

Thanks!


mickeysjm · 2020-05-07T21:50:31Z

Hi @pvcastro,

Which version of the transformers library are you using? In my local environment with transformers version 2.8.0, the tokenizer works fine. I've put a screenshot below for your reference.

pvcastro · 2020-05-07T22:03:51Z

Strange, @mickeystroller. I'm doing exactly the same thing as you are, but take a look at my results.

mickeysjm · 2020-05-07T22:12:11Z

Indeed, it looks strange. I don't know what is happening here. Maybe you can restart the IPython kernel and run this pipeline from scratch again?

pvcastro · 2020-05-07T22:14:20Z

Same problem. Can you tell me the version of your tokenizers package?
pip show tokenizers

mickeysjm · 2020-05-07T22:29:27Z

My tokenizers library version is 0.5.2.

pvcastro · 2020-05-07T22:33:10Z

Strange, I tried downgrading to 0.5.2, and even though it installed correctly, importing transformers with it doesn't work:

pvcastro · 2020-05-07T23:02:23Z

Maybe you have a cached tokenizer with the additional special tokens saved to it? 🤔
There's nothing in BertTokenizer.from_pretrained that causes these tokens to be permanently attached to the tokenizer. Here's what the super class does with them:

It only performs an assert, it doesn't save them anywhere.

mickeysjm · 2020-05-07T23:20:41Z

I don't think I cached the tokenizer intentionally. Maybe the transformers library does that automatically? I think you can ask this question in the transformers repository and get some support from its developers. If you figure it out later, please let me know.

Thanks.

pvcastro · 2020-05-08T10:24:33Z

I'll do that, @mickeystroller. Would you mind running transformers-cli env so I can add this information to the issue? They require this.

mickeysjm · 2020-05-08T10:53:03Z

Below is the transformers-cli env output:

transformers version: 2.8.0
Platform: Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
Python version: 3.7.4
PyTorch version (GPU?): 1.4.0 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: No
Using distributed or parallel set-up in script?: No

pvcastro · 2020-05-08T11:20:09Z

Thanks!
Here's the opened issue:
huggingface/transformers#4229

pvcastro · 2020-05-15T14:16:16Z

Hi @mickeystroller, how are you? It's been one week since I opened the issue, and there are no replies from the transformers team yet.
Do you mind creating a brand new conda environment, installing the latest transformers package and running this same simple test?

import transformers
from transformers import BertTokenizer
additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, additional_special_tokens=additional_special_tokens)
test_string = '[E11] Tom Thabane [E12] resigned in October last year to form the [E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .'
tokenizer.tokenize(test_string)
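To make the result of this test easier to compare across environments, a small check like the one below could be applied to the returned token list. It is only a sketch in plain Python: `split_markers` is a hypothetical helper, not part of transformers, and the "intact" list is a hand-written illustration rather than real tokenizer output.

```python
def split_markers(tokens, markers):
    """Return the markers that do not appear as a single token in `tokens`."""
    # Compare case-insensitively, since an uncased tokenizer may lowercase markers.
    seen = {tok.lower() for tok in tokens}
    return [m for m in markers if m.lower() not in seen]


markers = ["[E11]", "[E12]", "[E21]", "[E22]"]

# First few tokens of the broken output reported at the top of this issue.
broken = ['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned']
print(split_markers(broken, markers))  # every marker is reported as missing

# Hand-written example of intact markers (an assumption, not real tokenizer output).
intact = ['[e11]', 'tom', 'tha', '##bane', '[e12]', 'resigned']
print(split_markers(intact, markers[:2]))  # []
```

An empty list for the output of tokenizer.tokenize(test_string) would mean all four markers survived as single tokens.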
