Problem with BertTokenizer using additional_special_tokens #4229

Closed

pvcastro opened this issue May 8, 2020 · 2 comments

pvcastro commented May 8, 2020

🐛 Bug

Information

Model I am using: Bert (bert-base-uncased)

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. import transformers
  2. from transformers import BertTokenizer
  3. specify additional_special_tokens as ["[E11]", "[E12]", "[E21]", "[E22]"]
  4. instantiate tokenizer as tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, additional_special_tokens=additional_special_tokens)
  5. tokenize test string with the special tokens '[E11] Tom Thabane [E12] resigned in October last year to form the [E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .' using tokenizer.tokenize(test_string)
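Putting the steps together, a minimal reproduction script (a direct transcription of the steps above):

```python
from transformers import BertTokenizer

# Entity-marker tokens used by the R-BERT-style setup
additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True,
    additional_special_tokens=additional_special_tokens,
)

test_string = (
    "[E11] Tom Thabane [E12] resigned in October last year to form the "
    "[E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor "
    "with 17 members of parliament , causing constitutional monarch King "
    "Letsie III to dissolve parliament and call the snap election ."
)

print(tokenizer.tokenize(test_string))
```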

Expected behavior

I would expect the tokenization to produce regular WordPiece tokens while keeping the special tokens intact as [E11], [E12], etc. (e.g. ['[E11]', 'tom', 'tha', '##bane', '[E12]', 'resigned', ...]), but instead I get:

['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned', 'in', 'october', 'last', 'year', 'to', 'form', 'the', '[', 'e', '##21', ']', 'all', 'bas', '##otho', 'convention', '[', 'e', '##22', ']', '-', 'l', '##rb', '-', 'abc', '-', 'rr', '##b', '-', ',', 'crossing', 'the', 'floor', 'with', '17', 'members', 'of', 'parliament', ',', 'causing', 'constitutional', 'monarch', 'king', 'lets', '##ie', 'iii', 'to', 'dissolve', 'parliament', 'and', 'call', 'the', 'snap', 'election', '.']
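A quick way to check whether the markers were at least registered on the tokenizer object, independent of whether tokenize() keeps them intact (a diagnostic sketch; both properties are part of the standard PreTrainedTokenizer API):

```python
# Diagnostic: were the markers registered as special tokens?
print(tokenizer.additional_special_tokens)
# -> ['[E11]', '[E12]', '[E21]', '[E22]'] if registration succeeded
print(tokenizer.all_special_tokens)
# -> the markers plus the built-ins: '[CLS]', '[SEP]', '[PAD]', '[UNK]', '[MASK]'
```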

I'm trying to run training from https://github.com/mickeystroller/R-BERT and reported this to the author, but he seems to get the correct results, even though we're both using transformers 2.8.0:

His results:

[screenshot: tokenizer output with the special tokens kept intact]

My results:

[screenshot: tokenizer output with the special tokens split apart, as in the list above]

Environment info

  • transformers version: 2.8.0
  • Platform: Linux-4.15.0-99-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.7.5
  • PyTorch version (GPU?): 1.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes, GTX 1070 with CUDA 10.1.243
  • Using distributed or parallel set-up in script?: No

Here's the output from R-BERT's author as well:

  • transformers version: 2.8.0
  • Platform: Linux-4.15.0-72-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

The R-BERT author seems to be using tokenizers version 0.5.2, while mine is 0.7.0. I tried downgrading to 0.5.2 to see whether I would get the same results he did, but that doesn't work: 0.5.2 is not compatible with transformers 2.8.0, as shown below. I have no idea how he was able to use the two together:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-514ca6d60059> in <module>
----> 1 import transformers
      2 from transformers import BertTokenizer

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/__init__.py in <module>
     53 from .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
     54 from .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
---> 55 from .data import (
     56     DataProcessor,
     57     InputExample,

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/__init__.py in <module>
      4 
      5 from .metrics import is_sklearn_available
----> 6 from .processors import (
      7     DataProcessor,
      8     InputExample,

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/__init__.py in <module>
      3 # module, but to preserve other warnings. So, don't check this module at all.
      4 
----> 5 from .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels
      6 from .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features
      7 from .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/glue.py in <module>
     21 
     22 from ...file_utils import is_tf_available
---> 23 from ...tokenization_utils import PreTrainedTokenizer
     24 from .utils import DataProcessor, InputExample, InputFeatures
     25 

/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/tokenization_utils.py in <module>
     27 from typing import List, Optional, Sequence, Tuple, Union
     28 
---> 29 from tokenizers import AddedToken, Encoding
     30 from tokenizers.decoders import Decoder
     31 from tokenizers.implementations import BaseTokenizer

ImportError: cannot import name 'AddedToken' from 'tokenizers' (/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/tokenizers/__init__.py)
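For anyone comparing environments, a quick way to confirm which versions are actually active in the running interpreter (both packages expose a standard __version__ attribute):

```python
import tokenizers
import transformers

# Print the versions actually imported, not just what pip reports
print(transformers.__version__)  # 2.8.0 here
print(tokenizers.__version__)    # 0.7.0 here; the R-BERT author reports 0.5.2
```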
monologg (Contributor) commented

I've also run into this issue. In my case, transformers v2.8.0 handles the special tokens correctly, but from v2.9.0 it splits them into '[', 'e', '##11', ']'. It's quite weird...
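A possible workaround (a sketch only, not verified against this specific regression): register the markers through the explicit add_special_tokens API instead of the constructor, and resize the model embeddings if the vocabulary grew:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# add_special_tokens returns the number of tokens newly added to the vocab
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]}
)
print(num_added)  # 4 if none of the markers were already in the vocab

# If a model is trained on top of this tokenizer, its embedding matrix
# must be resized to cover the new token ids:
# model.resize_token_embeddings(len(tokenizer))
```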

stale bot commented Jul 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jul 17, 2020
stale bot closed this as completed on Jul 25, 2020