Script run_mlm_no_trainer.py error #15081
cc @sgugger
Which command are you running exactly? The logs you posted use distributed training, whereas the command you gave us (which runs successfully on my side) launches the script with plain python.
I just reran it on another machine and got the same issue. The exact command is:

```
python run_mlm_no_trainer.py --model_name_or_path=./roberta-base --dataset_name=wikitext --dataset_config_name=wikitext-2-raw-v1 --output_dir=./test_mlm_out
```

where `./roberta-base` is a local folder with a manually downloaded copy of the checkpoint.
The output was:

```
01/11/2022 11:59:36 - INFO - __main__ - ***** Running training *****
01/11/2022 11:59:36 - INFO - __main__ - Num examples = 2390
01/11/2022 11:59:36 - INFO - __main__ - Num Epochs = 3
01/11/2022 11:59:36 - INFO - __main__ - Instantaneous batch size per device = 8
01/11/2022 11:59:36 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
01/11/2022 11:59:36 - INFO - __main__ - Gradient Accumulation steps = 1
01/11/2022 11:59:36 - INFO - __main__ - Total optimization steps = 897
0%| | 0/897 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_mlm_no_trainer.py", line 566, in <module>
    main()
  File "run_mlm_no_trainer.py", line 513, in main
    outputs = model(**batch)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 1106, in forward
    return_dict=return_dict,
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 817, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1. Target sizes: [8, 1024]. Tensor sizes: [1, 514]
0%| | 0/897 [00:00<?, ?it/s]
```

Possible Solution
I have no idea what the content of your roberta-base folder is, but your addition is probably correct. It works with the official checkpoint, where the model specifies a max length the script then uses; maybe that's the part missing from your local checkpoint.
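For context, here is how run_mlm_no_trainer.py picks that max length when --max_seq_length is not passed. This is a sketch, paraphrased into a helper for illustration and based on the script around v4.14; exact lines and wording may differ between versions:

```python
# Sketch, condensed from run_mlm_no_trainer.py (~v4.14): without
# --max_seq_length, the script falls back to the tokenizer's
# model_max_length and caps it at 1024.
import logging

logger = logging.getLogger(__name__)

def pick_max_seq_length(requested_max_seq_length, tokenizer):
    # requested_max_seq_length mirrors args.max_seq_length in the script.
    if requested_max_seq_length is None:
        max_seq_length = tokenizer.model_max_length
        if max_seq_length > 1024:
            logger.warning(
                f"The tokenizer seems to have a very large model_max_length "
                f"({tokenizer.model_max_length}). Picking 1024 instead; "
                f"override with --max_seq_length."
            )
            max_seq_length = 1024
    else:
        max_seq_length = min(requested_max_seq_length, tokenizer.model_max_length)
    return max_seq_length
```

A tokenizer folder that is missing model_max_length defaults it to a huge sentinel value, so the script silently picks 1024, which then overflows the (1, 514) token_type_ids buffer that RoBERTa registers. That is exactly the 1024-vs-514 mismatch in the traceback above.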
Yeah, you are correct. The checkpoint that the official script downloaded works. There might be something mismatched in my cached roberta-base folder (manually downloaded from AWS, probably not the newest files). Thank you for pointing this out.
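For anyone hitting the same thing with a local folder, a quick sanity check is to compare what the config and tokenizer actually report. A minimal sketch, assuming `./roberta-base` is the local folder from the command above:

```python
# Sanity check for a local checkpoint: the official roberta-base ships
# max_position_embeddings=514 and tokenizer.model_max_length=512. A huge
# model_max_length here means the tokenizer files are missing that field,
# which triggers the script's 1024 fallback.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("./roberta-base")
tokenizer = AutoTokenizer.from_pretrained("./roberta-base")

print(config.max_position_embeddings)  # expect 514
print(tokenizer.model_max_length)      # expect 512; ~1e30 means the field is missing
```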
Environment info
transformers version: 4.14.0.dev0

Who can help
@patrickvonplaten @LysandreJik
Information
Model I am using: roberta-base
The problem arises when using: the official example script (run_mlm_no_trainer.py).
The task I am working on is: masked language modeling on an official dataset (wikitext).
To reproduce
Steps to reproduce the behavior:
Following the official instructions, run python run_mlm_no_trainer.py with the command given above, using a locally downloaded roberta-base; a minimal standalone reproduction is sketched below.
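The mismatch can also be reproduced without the script. A minimal sketch, assuming transformers around the version in this thread (a randomly initialized model is enough, since the shape check fails before any learned weights matter):

```python
# Sketch: feeding a 1024-token batch into RoBERTa, whose config has
# max_position_embeddings=514, trips the same token_type_ids buffer
# expansion as in the traceback above.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig.from_pretrained("roberta-base")  # max_position_embeddings == 514
model = RobertaForMaskedLM(config)  # random weights are fine for reproducing the error

batch = {
    "input_ids": torch.randint(0, config.vocab_size, (8, 1024)),
    "attention_mask": torch.ones(8, 1024, dtype=torch.long),
}
outputs = model(**batch)  # RuntimeError: expanded size (1024) vs existing size (514)
```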
Expected behavior
Training runs to completion without errors.