
AttributeError for DataCollatorForLanguageModeling with tokenizers.Tokenizer #12583

Closed
lewisbails opened this issue Jul 8, 2021 · 3 comments

Comments

@lewisbails
Contributor

Environment info

  • transformers version: 4.8.2
  • Platform: Linux-5.3.0-53-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

from tokenizers import Tokenizer
from transformers import AutoConfig, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = Tokenizer.from_file("my-tokenizer.json")
config = AutoConfig.from_pretrained("bert-base-cased", vocab_size=tokenizer.get_vocab_size())
model = AutoModelForMaskedLM.from_config(config)

tokenizer.enable_truncation(max_length=model.config.max_position_embeddings)
dataset = LMDataset(tokenizer, files=['train_1.txt', 'train_2.txt'])
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, **cfg.data_collator_kwargs)
Traceback (most recent call last):
...
File "/home/leb/lang-models/scripts/train_lm.py", line 25, in train_lm
  data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, **cfg.data_collator_kwargs)
File "<string>", line 7, in __init__
File "/home/leb/anaconda3/envs/lang-models/lib/python3.7/site-packages/transformers/data/data_collator.py", line 333, in __post_init__
  if self.mlm and self.tokenizer.mask_token is None:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'mask_token'

Expected behavior

Expected to be able to pass a tokenizers.Tokenizer as the tokenizer argument to DataCollatorForLanguageModeling.

@sgugger
Collaborator

sgugger commented Jul 8, 2021

Hi there! I am unsure why you thought you could use a tokenizers.Tokenizer object here. The documentation clearly states it has to be a PreTrainedTokenizerBase, so either a PreTrainedTokenizer or a PreTrainedTokenizerFast. You can instantiate one with

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file=path_to_json)
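
For reference, a minimal end-to-end sketch of that approach (assuming the tokenizer file defines [MASK], [PAD], and [UNK] as its special tokens; adjust these to whatever your tokenizer actually uses):

from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

# Wrapping the raw tokenizers.Tokenizer gives it the PreTrainedTokenizerBase API,
# including the mask_token attribute that the collator checks when mlm=True.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    mask_token="[MASK]",
    pad_token="[PAD]",
    unk_token="[UNK]",
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)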

@lewisbails
Contributor Author

Sorry about that, I sometimes get confused about what to use where between the two projects.
I've also found this to help me, although the documentation for PreTrainedTokenizerFast doesn't list tokenizer_file as a valid parameter to __init__.

@sgugger
Collaborator

sgugger commented Jul 8, 2021

Oh very true, it's definitely missing! Do you want to make a PR to fix it?
