🐛 Bug

Information

Model I am using: Bert (bert-base-uncased)

Language I am using the model on: English

The problem arises when using:

- the official example scripts: (give details below)

The task I am working on is:

- my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:
1. Import the tokenizer: `import transformers`, then `from transformers import BertTokenizer`
2. Specify the additional special tokens: `additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]`
3. Instantiate the tokenizer: `tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, additional_special_tokens=additional_special_tokens)`
4. Tokenize a test string containing the special tokens with `tokenizer.tokenize(test_string)`, where `test_string` is `'[E11] Tom Thabane [E12] resigned in October last year to form the [E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .'`

(These steps are collected into a runnable script below.)
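For convenience, here are the steps above as a single runnable script:

```python
from transformers import BertTokenizer

additional_special_tokens = ["[E11]", "[E12]", "[E21]", "[E22]"]

# Pass the entity-marker tokens at instantiation time, as in the report.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True,
    additional_special_tokens=additional_special_tokens,
)

test_string = (
    "[E11] Tom Thabane [E12] resigned in October last year to form the "
    "[E21] All Basotho Convention [E22] -LRB- ABC -RRB- , crossing the floor "
    "with 17 members of parliament , causing constitutional monarch King "
    "Letsie III to dissolve parliament and call the snap election ."
)

print(tokenizer.tokenize(test_string))
```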
Expected behavior

I would expect the tokenization to produce regular WordPiece tokens, with the special tokens kept intact as [E11], [E12], etc., but instead I get:

['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']', 'resigned', 'in', 'october', 'last', 'year', 'to', 'form', 'the', '[', 'e', '##21', ']', 'all', 'bas', '##otho', 'convention', '[', 'e', '##22', ']', '-', 'l', '##rb', '-', 'abc', '-', 'rr', '##b', '-', ',', 'crossing', 'the', 'floor', 'with', '17', 'members', 'of', 'parliament', ',', 'causing', 'constitutional', 'monarch', 'king', 'lets', '##ie', 'iii', 'to', 'dissolve', 'parliament', 'and', 'call', 'the', 'snap', 'election', '.']

I'm trying to run a training from https://github.com/mickeystroller/R-BERT and reported this to the author, but he seems to get the proper results, even though we're both using transformers 2.8.0:

His results: (screenshot in the original issue)

My results: (screenshot in the original issue)
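Not part of the original report, but a possible workaround for anyone hitting this: register the tokens through the `add_special_tokens` dict API after instantiation instead of relying on the `from_pretrained` keyword. A minimal sketch, assuming a model will consume the tokenizer afterwards:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# Register the entity markers explicitly; add_special_tokens returns the
# number of tokens that were actually added to the vocabulary.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E11]", "[E12]", "[E21]", "[E22]"]}
)
print(f"Added {num_added} special tokens")

# If tokens were added, any model fed by this tokenizer needs its embedding
# matrix resized to the new vocabulary size, e.g.:
#     model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("[E11] Tom Thabane [E12] resigned"))
# Expected when the workaround applies:
#     ['[E11]', 'tom', 'tha', '##bane', '[E12]', 'resigned']
```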
Environment info

transformers version: 2.8.0
tokenizers version: 0.7.0
Using distributed or parallel set-up in script?: No

Here's the output from R-BERT's author as well: (screenshot in the original issue; it also shows transformers version 2.8.0)

The author of R-BERT seems to be using tokenizers version 0.5.2, while mine is 0.7.0. I tried downgrading mine to 0.5.2 to see if I would get the same results he did, but that doesn't work because it's not compatible with transformers 2.8.0, as can be seen below. I have no idea how he was able to use both together:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-514ca6d60059> in <module>
----> 1 import transformers
2 from transformers import BertTokenizer
/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/__init__.py in <module>
53 from .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
54 from .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
---> 55 from .data import (
56 DataProcessor,
57 InputExample,
/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/__init__.py in <module>
4
5 from .metrics import is_sklearn_available
----> 6 from .processors import (
7 DataProcessor,
8 InputExample,
/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/__init__.py in <module>
3 # module, but to preserve other warnings. So, don't check this module at all.
4
----> 5 from .glue import glue_convert_examples_to_features, glue_output_modes, glue_processors, glue_tasks_num_labels
6 from .squad import SquadExample, SquadFeatures, SquadV1Processor, SquadV2Processor, squad_convert_examples_to_features
7 from .utils import DataProcessor, InputExample, InputFeatures, SingleSentenceClassificationProcessor
/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/data/processors/glue.py in <module>
21
22 from ...file_utils import is_tf_available
---> 23 from ...tokenization_utils import PreTrainedTokenizer
24 from .utils import DataProcessor, InputExample, InputFeatures
25
/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/transformers/tokenization_utils.py in <module>
27 from typing import List, Optional, Sequence, Tuple, Union
28
---> 29 from tokenizers import AddedToken, Encoding
30 from tokenizers.decoders import Decoder
31 from tokenizers.implementations import BaseTokenizer
ImportError: cannot import name 'AddedToken' from 'tokenizers' (/media/discoD/anaconda3/envs/fast-bert/lib/python3.7/site-packages/tokenizers/__init__.py)
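The failing import above suggests that tokenizers 0.5.2 simply doesn't expose the names transformers 2.8.0 needs at module load time. A quick sketch to check what a given environment provides (it only assumes the tokenizers package is installed):

```python
# Inspect the installed tokenizers version and whether the symbol that
# transformers 2.8.0 imports at module load time ('AddedToken') is present.
import tokenizers

print("tokenizers version:", tokenizers.__version__)
print("AddedToken available:", hasattr(tokenizers, "AddedToken"))
# With 0.5.2 installed this prints False, which matches the ImportError
# above; with 0.7.0 it prints True.
```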
I've also run into this issue.
In my case, transformers v2.8.0 splits off the special tokens correctly (they stay intact).
But from v2.9.0 onward they get broken into pieces like '[', 'e', '##11', ']'. It's quite weird...
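To quickly see which behavior a given install exhibits, a reduced version of the reproduction above can serve as a check (a minimal sketch using the same tokens as the report):

```python
import transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True,
    additional_special_tokens=["[E11]", "[E12]"],
)

print("transformers:", transformers.__version__)
print(tokenizer.tokenize("[E11] Tom Thabane [E12]"))
# Intact behavior:  ['[E11]', 'tom', 'tha', '##bane', '[E12]']
# Broken behavior:  ['[', 'e', '##11', ']', 'tom', 'tha', '##bane', '[', 'e', '##12', ']']
```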
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.