Script run_mlm_no_trainer.py error #15081
cc @sgugger
Which command are you running exactly? The logs you posted use distributed training, whereas the command you gave us (which runs successfully on my side) launches the script with plain python.
I just reran it on another machine and got the same issue. The exact command is:

```
python run_mlm_no_trainer.py --model_name_or_path=./roberta-base --dataset_name=wikitext --dataset_config_name=wikitext-2-raw-v1 --output_dir=./test_mlm_out
```

where `./roberta-base` is a local folder with a manually downloaded copy of the checkpoint.
The output was:

```
01/11/2022 11:59:36 - INFO - __main__ - ***** Running training *****
01/11/2022 11:59:36 - INFO - __main__ - Num examples = 2390
01/11/2022 11:59:36 - INFO - __main__ - Num Epochs = 3
01/11/2022 11:59:36 - INFO - __main__ - Instantaneous batch size per device = 8
01/11/2022 11:59:36 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
01/11/2022 11:59:36 - INFO - __main__ - Gradient Accumulation steps = 1
01/11/2022 11:59:36 - INFO - __main__ - Total optimization steps = 897
0%| | 0/897 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_mlm_no_trainer.py", line 566, in <module>
    main()
  File "run_mlm_no_trainer.py", line 513, in main
    outputs = model(**batch)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 1106, in forward
    return_dict=return_dict,
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 817, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1. Target sizes: [8, 1024]. Tensor sizes: [1, 514]
0%| | 0/897 [00:00<?, ?it/s]
```

Possible Solution
I have no idea what the content of your roberta-base folder is, but your addition is probably correct. It works with the official checkpoint, where the model specifies a max length the script then uses; maybe that's the part missing from your local checkpoint.
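For context, here is how run_mlm_no_trainer.py picks that max length when --max_seq_length is not passed. This is a sketch, paraphrased into a helper for illustration and based on the script around v4.14; exact lines and wording may differ between versions:

```python
# Sketch, condensed from run_mlm_no_trainer.py (~v4.14): without
# --max_seq_length, the script falls back to the tokenizer's
# model_max_length and caps it at 1024.
import logging

logger = logging.getLogger(__name__)

def pick_max_seq_length(requested_max_seq_length, tokenizer):
    # requested_max_seq_length mirrors args.max_seq_length in the script.
    if requested_max_seq_length is None:
        max_seq_length = tokenizer.model_max_length
        if max_seq_length > 1024:
            logger.warning(
                f"The tokenizer seems to have a very large model_max_length "
                f"({tokenizer.model_max_length}). Picking 1024 instead; "
                f"override with --max_seq_length."
            )
            max_seq_length = 1024
    else:
        max_seq_length = min(requested_max_seq_length, tokenizer.model_max_length)
    return max_seq_length
```

A tokenizer folder that is missing model_max_length defaults it to a huge sentinel value, so the script silently picks 1024, which then overflows the (1, 514) token_type_ids buffer that RoBERTa registers. That is exactly the 1024-vs-514 mismatch in the traceback above.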
Yeah, you are correct. The checkpoint that the official script downloaded works. There might be something mismatched in my cached roberta-base folder (manually downloaded from AWS, probably not the newest files). Thank you for pointing this out.
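For anyone hitting the same thing with a local folder, a quick sanity check is to compare what the config and tokenizer actually report. A minimal sketch, assuming `./roberta-base` is the local folder from the command above:

```python
# Sanity check for a local checkpoint: the official roberta-base ships
# max_position_embeddings=514 and tokenizer.model_max_length=512. A huge
# model_max_length here means the tokenizer files are missing that field,
# which triggers the script's 1024 fallback.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("./roberta-base")
tokenizer = AutoTokenizer.from_pretrained("./roberta-base")

print(config.max_position_embeddings)  # expect 514
print(tokenizer.model_max_length)      # expect 512; ~1e30 means the field is missing
```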
Environment info
transformers version: 4.14.0.dev0

Who can help
@patrickvonplaten @LysandreJik
Information
Model I am using: roberta-base
The problem arises when using: the official example script (run_mlm_no_trainer.py).
The task I am working on is: masked language modeling on an official dataset (wikitext).
To reproduce
Steps to reproduce the behavior:
Following the official instructions, run python run_mlm_no_trainer.py with the command given above, using a locally downloaded roberta-base; a minimal standalone reproduction is sketched below.
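The mismatch can also be reproduced without the script. A minimal sketch, assuming transformers around the version in this thread (a randomly initialized model is enough, since the shape check fails before any learned weights matter):

```python
# Sketch: feeding a 1024-token batch into RoBERTa, whose config has
# max_position_embeddings=514, trips the same token_type_ids buffer
# expansion as in the traceback above.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig.from_pretrained("roberta-base")  # max_position_embeddings == 514
model = RobertaForMaskedLM(config)  # random weights are fine for reproducing the error

batch = {
    "input_ids": torch.randint(0, config.vocab_size, (8, 1024)),
    "attention_mask": torch.ones(8, 1024, dtype=torch.long),
}
outputs = model(**batch)  # RuntimeError: expanded size (1024) vs existing size (514)
```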
Expected behavior
Training runs to completion without errors.