Traceback (most recent call last):
...
File "/home/leb/lang-models/scripts/train_lm.py", line 25, in train_lm
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, **cfg.data_collator_kwargs)
File "<string>", line 7, in __init__
File "/home/leb/anaconda3/envs/lang-models/lib/python3.7/site-packages/transformers/data/data_collator.py", line 333, in __post_init__
if self.mlm and self.tokenizer.mask_token is None:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'mask_token'
Expected behavior
Expected to be able to pass a tokenizers.Tokenizer as the tokenizer parameter to DataCollatorForLanguageModeling.
Hi there! I am unsure why you thought you could use a tokenizers.Tokenizer object here. The documentation clearly states it has to be a PreTrainedTokenizerBase, so either a PreTrainedTokenizer or a PreTrainedTokenizerFast. You can instantiate one with
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file=path_to_json)
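One caveat worth checking: as far as I know, instantiating PreTrainedTokenizerFast from a tokenizer file does not set the special tokens automatically, so for masked-language modeling you likely need to pass them explicitly. A minimal sketch, assuming BERT-style token strings (adjust to whatever your tokenizer was trained with):

from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling

# path_to_json points at the tokenizer.json saved by the tokenizers library
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=path_to_json,
    unk_token="[UNK]",    # assumed BERT-style special tokens; use the
    pad_token="[PAD]",    # strings that actually appear in your vocab
    mask_token="[MASK]",
)

# With mask_token set, the collator's mlm=True check passes
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)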
Sorry about that, I sometimes get confused about what to use where between the two projects.
I've also found this helpful.
One note: the documentation for PreTrainedTokenizerFast doesn't show tokenizer_file as a valid parameter to __init__, even though it is accepted.
Environment info
transformers version: 4.8.2
Who can help
Information
Model I am using (Bert, XLNet ...): Bert
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
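A minimal sketch of the failing setup, reconstructed from the traceback above (the actual train_lm.py config is not shown, so the tokenizer file name is an assumption):

from tokenizers import Tokenizer
from transformers import DataCollatorForLanguageModeling

# Load a raw tokenizer produced by the tokenizers library
tokenizer = Tokenizer.from_file("tokenizer.json")  # hypothetical file name

# Raises AttributeError: the collator defaults to mlm=True and checks
# tokenizer.mask_token, which a raw tokenizers.Tokenizer does not have
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)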