
AttributeError for DataCollatorForLanguageModeling with tokenizers.Tokenizer #12583

Closed
lewisbails opened this issue Jul 8, 2021 · 3 comments

Comments

@lewisbails
Contributor

Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.3.0-53-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

from tokenizers import Tokenizer
from transformers import AutoConfig, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = Tokenizer.from_file("my-tokenizer.json")
config = AutoConfig.from_pretrained("bert-base-cased", vocab_size=tokenizer.get_vocab_size())
model = AutoModelForMaskedLM.from_config(config)

tokenizer.enable_truncation(max_length=model.config.max_position_embeddings)
dataset = LMDataset(tokenizer, files=['train_1.txt', 'train_2.txt'])
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, **cfg.data_collator_kwargs)
Traceback (most recent call last):
...
File "/home/leb/lang-models/scripts/train_lm.py", line 25, in train_lm
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, **cfg.data_collator_kwargs)
File "<string>", line 7, in __init__
File "/home/leb/anaconda3/envs/lang-models/lib/python3.7/site-packages/transformers/data/data_collator.py", line 333, in __post_init__
  if self.mlm and self.tokenizer.mask_token is None:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'mask_token'

Expected behavior

Expected to be able to pass a tokenizers.Tokenizer as the tokenizer argument to DataCollatorForLanguageModeling.

@sgugger
Collaborator

sgugger commented Jul 8, 2021

Hi there! I am unsure why you thought you could use a tokenizers.Tokenizer object here. The documentation clearly states it has to be a PreTrainedTokenizerBase, so either a PreTrainedTokenizer or a PreTrainedTokenizerFast. You can instantiate one with

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file=path_to_json)
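
For reference, a minimal end-to-end sketch of that approach (assuming the tokenizer file defines [MASK], [PAD], and [UNK] as its special tokens; adjust these to whatever your tokenizer actually uses):

from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

# Wrapping the raw tokenizers.Tokenizer gives it the PreTrainedTokenizerBase API,
# including the mask_token attribute that the collator checks when mlm=True.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    mask_token="[MASK]",
    pad_token="[PAD]",
    unk_token="[UNK]",
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)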

@lewisbails
Contributor Author

Sorry about that, I sometimes get confused about what to use where between the two projects.
I've also found this to help me, although the documentation for PreTrainedTokenizerFast doesn't list tokenizer_file as a valid parameter to __init__.

@sgugger
Collaborator

sgugger commented Jul 8, 2021

Oh very true, it's definitely missing! Do you want to make a PR to fix it?
