Problem with BertTokenizer using additional_special_tokens #4229

Closed

pvcastro opened this issue May 8, 2020 · 2 comments

pvcastro commented May 8, 2020

🐛 Bug

Information

Model I am using: Bert (bert-base-uncased)

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. import transformers
  2. from transformers import BertTokenizer
  3. specify additional_special_tokens as ["[E11]", "[E12]", "[E21]", "[E22]"]
  4. instantiate tokenizer as tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, additional_special_tokens=additional_special_tokens)
  5. tokenize test string with the special tokens '[E11] Tom Thabane [E12] resigned in October last year to form the [E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .' using tokenizer.tokenize(test_string)
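Putting the steps together, a minimal reproduction script (a direct transcription of the steps above):

```python
from transformers import BertTokenizer

# Entity-marker tokens used by the R-BERT-style setup
additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True,
    additional_special_tokens=additional_special_tokens,
)

test_string = (
    "[E11] Tom Thabane [E12] resigned in October last year to form the "
    "[E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor "
    "with 17 members of parliament , causing constitutional monarch King "
    "Letsie III to dissolve parliament and call the snap election ."
)

print(tokenizer.tokenize(test_string))
```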

Expected behavior

I would expect the tokenization to produce regular WordPiece tokens while keeping the special tokens intact as [E11], [E12], etc. (e.g. ['[E11]', 'tom', 'tha', '##bane', '[E12]', 'resigned', ...]), but instead I get:

['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned', 'in', 'october', 'last', 'year', 'to', 'form', 'the', '[', 'e', '##21', ']', 'all', 'bas', '##otho', 'convention', '[', 'e', '##22', ']', '-', 'l', '##rb', '-', 'abc', '-', 'rr', '##b', '-', ',', 'crossing', 'the', 'floor', 'with', '17', 'members', 'of', 'parliament', ',', 'causing', 'constitutional', 'monarch', 'king', 'lets', '##ie', 'iii', 'to', 'dissolve', 'parliament', 'and', 'call', 'the', 'snap', 'election', '.']
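A quick way to check whether the markers were at least registered on the tokenizer object, independent of whether tokenize() keeps them intact (a diagnostic sketch; both properties are part of the standard PreTrainedTokenizer API):

```python
# Diagnostic: were the markers registered as special tokens?
print(tokenizer.additional_special_tokens)
# -> ['[E11]', '[E12]', '[E21]', '[E22]'] if registration succeeded
print(tokenizer.all_special_tokens)
# -> the markers plus the built-ins: '[CLS]', '[SEP]', '[PAD]', '[UNK]', '[MASK]'
```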

I'm trying to run training from https://github.com/mickeystroller/R-BERT and reported this to the author, but he seems to get the correct results, even though we're both using transformers 2.8.0:

His results:

[screenshot: tokenizer output with the special tokens kept intact]

My results:

[screenshot: tokenizer output with the special tokens split apart, as in the list above]

Environment info

  • transformers version: 2.8.0
  • Platform: Linux-4.15.0-99-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.7.5
  • PyTorch version (GPU?): 1.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes, GTX 1070 with CUDA 10.1.243
  • Using distributed or parallel set-up in script?: No

Here's the output from R-BERT's author as well:

  • transformers version: 2.8.0
  • Platform: Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

The R-BERT author seems to be using tokenizers version 0.5.2, while mine is 0.7.0. I tried downgrading to 0.5.2 to see whether I would get the same results he did, but that doesn't work: 0.5.2 is not compatible with transformers 2.8.0, as shown below. I have no idea how he was able to use the two together:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-514ca6d60059> in <module>
----> 1 import transformers
      2 from transformers import BertTokenizer

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/__init__.py in <module>
     53 from .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
     54 from .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
---> 55 from .data import (
     56     DataProcessor,
     57     InputExample,

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/__init__.py in <module>
      4 
      5 from .metrics import is_sklearn_available
----> 6 from .processors import (
      7     DataProcessor,
      8     InputExample,

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/__init__.py in <module>
      3 # module, but to preserve other warnings. So, don't check this module at all.
      4 
----> 5 from .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels
      6 from .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features
      7 from .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/glue.py in <module>
     21 
     22 from ...file_utils import is_tf_available
---> 23 from ...tokenization_utils import PreTrainedTokenizer
     24 from .utils import DataProcessor, InputExample, InputFeatures
     25 

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/tokenization_utils.py in <module>
     27 from typing import List, Optional, Sequence, Tuple, Union
     28 
---> 29 from tokenizers import AddedToken, Encoding
     30 from tokenizers.decoders import Decoder
     31 from tokenizers.implementations import BaseTokenizer

ImportError: cannot import name 'AddedToken' from 'tokenizers' (/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/tokenizers/__init__.py)
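For anyone comparing environments, a quick way to confirm which versions are actually active in the running interpreter (both packages expose a standard __version__ attribute):

```python
import tokenizers
import transformers

# Print the versions actually imported, not just what pip reports
print(transformers.__version__)  # 2.8.0 here
print(tokenizers.__version__)    # 0.7.0 here; the R-BERT author reports 0.5.2
```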
monologg (Contributor) commented

I've also run into this issue. In my case, transformers v2.8.0 handles the special tokens correctly, but from v2.9.0 it splits them into '[', 'e', '##11', ']'. It's quite weird...
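A possible workaround (a sketch only, not verified against this specific regression): register the markers through the explicit add_special_tokens API instead of the constructor, and resize the model embeddings if the vocabulary grew:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# add_special_tokens returns the number of tokens newly added to the vocab
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]}
)
print(num_added)  # 4 if none of the markers were already in the vocab

# If a model is trained on top of this tokenizer, its embedding matrix
# must be resized to cover the new token ids:
# model.resize_token_embeddings(len(tokenizer))
```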

stale bot commented Jul 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jul 17, 2020
stale bot closed this as completed on Jul 25, 2020