large memory usage when resuming training from a checkpoint #11317

Closed
dorost1234 opened this issue Apr 19, 2021 · 2 comments

dorost1234 commented Apr 19, 2021

Environment info

  • transformers version: 4.4.5
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8
  • Tensorflow version (GPU?): -
  • Using GPU in script?: -
  • Using distributed or parallel set-up in script?: -

Who can help

@sgugger, @patrickvonplaten, @patil-suraj

Information

Hi,
I am training a t5-base model on the MNLI dataset with batch size 128. Training from scratch works fine, but the moment I resume from a checkpoint I hit an out-of-memory error: resuming uses much more GPU memory than the original run did.
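To make this concrete, here is a minimal sketch of the pattern with a stock Seq2SeqTrainer (my actual script is a customized run_seq2seq.py; the output path, checkpoint name, and the omitted tokenization step are illustrative):

from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# MNLI from GLUE; preprocessing into text-to-text inputs is omitted here.
train_dataset = load_dataset("glue", "mnli", split="train")

args = Seq2SeqTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=128,
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)

# A fresh run fits in GPU memory:
# trainer.train()

# Resuming from a saved checkpoint is what runs out of memory
# ("checkpoint-500" stands in for whichever checkpoint exists in output/):
trainer.train(resume_from_checkpoint="output/checkpoint-500")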

Expected behavior

Resuming training from a checkpoint should take the same amount of memory as training from scratch.
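To quantify the difference, one can compare the peak GPU memory of a fresh run against a resumed run (a rough sketch; in practice these would be two separate launches of the script rather than two calls on the same trainer):

import torch

# Peak memory for a fresh run.
torch.cuda.reset_peak_memory_stats()
trainer.train()
print(f"fresh run peak:   {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")

# Peak memory when resuming from a checkpoint.
torch.cuda.reset_peak_memory_stats()
trainer.train(resume_from_checkpoint="output/checkpoint-500")
print(f"resumed run peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")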

Error stack

Traceback (most recent call last):
  File "run_seq2seq.py", line 671, in <module>
    main()
  File "run_seq2seq.py", line 629, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/users/dara/dev/codes/seq2seq/third_party/trainers/trainer.py", line 329, in train
    tr_loss += self.training_step(model, inputs)
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/transformers/trainer.py", line 1486, in training_step
    loss = self.compute_loss(model, inputs)
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/transformers/trainer.py", line 1518, in compute_loss
    outputs = model(**inputs)
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 1762, in forward
    lang=lang
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 1115, in forward
    task=task
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 752, in forward
    output_attentions=output_attentions,
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 653, in forward
    output_attentions=output_attentions,
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 518, in forward
    hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
  File "/users/dara/dev/codes/seq2seq/third_party/models/t5/modeling_t5.py", line 501, in project
    hidden_states = shape(proj_layer(key_value_states))
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/users/dara/libs/anaconda3/envs/test1/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 23.70 GiB total capacity; 21.38 GiB already allocated; 41.69 MiB free; 22.18 GiB reserved in total by PyTorch)
  0%| 

Thanks for your help and suggestions.

@LysandreJik (Member) commented

Similar issue to #11294
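If the extra usage comes from the checkpoint's tensors being materialized on the GPU next to the live model while it is restored (a common cause of this symptom), a standard PyTorch mitigation is to map the checkpoint to CPU and free it once the weights have been copied in. A sketch, not specific to Trainer (the checkpoint path is illustrative):

import torch

# Load the checkpoint into host RAM instead of GPU memory.
state_dict = torch.load(
    "output/checkpoint-500/pytorch_model.bin",
    map_location="cpu",
)
# load_state_dict copies the CPU tensors into the (GPU-resident) parameters.
model.load_state_dict(state_dict)

# Drop the temporary CPU copy before training continues.
del state_dict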

@github-actions commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
