Checklist
I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
I have included in the "Description" section below a traceback from any exceptions related to this bug.
I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (if there are none, check this box anyway).
I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
I have included in the "Environment" section below the output of pip freeze.
I have included in the "Steps to reproduce" section below a minimally reproducible example.
Description
Additional GPU memory is used even after training stops. This issue does not happen when the last epoch is also the best model so far.
Python traceback:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/allennlp/commands/train.py", line 504, in _train_worker
metrics = train_loop.run()
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/allennlp/commands/train.py", line 577, in run
return self.trainer.train()
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/allennlp/training/gradient_descent_trainer.py", line 769, in train
metrics, epoch = self._try_train()
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/allennlp/training/gradient_descent_trainer.py", line 956, in _try_train
self._load_model_state(self._best_model_filename)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/allennlp/training/gradient_descent_trainer.py", line 968, in _load_model_state
self._ddp_wrapped_model.load_state_dict(torch.load(path))
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 857, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 846, in load_tensor
loaded_storages[key] = restore_location(storage, location)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
return obj.cuda(device)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/_utils.py", line 79, in _cuda
return new_type(self.size()).copy_(self, non_blocking)
File "/home/void/miniconda3/envs/lexsiamese/lib/python3.9/site-packages/torch/cuda/__init__.py", line 606, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
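For context on the traceback: unless a map_location is given, torch.load restores each tensor onto the device it was originally saved from, so deserializing the best-model checkpoint while the trained model is still resident on the same GPU requires a second full copy of the weights there. A minimal sketch of the difference (the checkpoint filename is only a placeholder):

```python
import torch

# Default behaviour: tensors saved from a CUDA device are deserialized back onto
# that same device, on top of whatever is already allocated there. With the
# trained model still occupying the GPU, this extra copy can trigger the OOM
# seen at _cuda_deserialize / obj.cuda(device) in the traceback above.
state_dict = torch.load("best.th")

# Loading onto CPU instead avoids any new CUDA allocation; the weights can then
# be copied into the model with load_state_dict().
state_dict = torch.load("best.th", map_location=torch.device("cpu"))
```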
Related issues or possible duplicates
Environment
OS:
Python version: 3.9.7
Output of pip freeze:
Steps to reproduce
This is hard to reproduce in a minimal example because it depends on GPU memory, but I can reproduce it by training my model with a batch size large enough to take up all of the GPU memory.
In the first run, tweak the number of epochs so that the last epoch also ends up being the best epoch. In this scenario the error does not happen.
In the second run, increase the number of epochs or change the validation metric so that the last epoch is no longer the best epoch; this can be done reliably by making the model run out of patience. This will produce an out-of-memory error after training has finished.
From looking at the code, I think this is the potential problem:
allennlp/allennlp/training/gradient_descent_trainer.py
Lines 953 to 957 in 38436d8
Maybe the else condition, where the model state dict is loaded again, is causing the issue?
Edit
Here are the worker 3 logs, which confirm that it finished its entire training and validation and ran out of patience:
out_worker3.log
Hey @vikigenius, I think this is a quick fix. The issue is that when self._load_model_state(self._best_model_filename) is called, we keep the model on GPU, and so have to load the state dictionary onto the same GPU as well. But there's really no reason why we need to keep the model on GPU at that point. So what we should do is move the model to CPU and then load the state dict to CPU.
If that makes sense, would you like to make a PR to fix it? I think all of that could happen in the _load_model_state() method.
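A minimal sketch of that suggestion, written as a standalone helper rather than the actual _load_model_state() implementation (the function name is illustrative):

```python
import torch
from torch import nn


def load_best_weights_via_cpu(model: nn.Module, checkpoint_path: str) -> None:
    """Free the GPU copy of the parameters first, then deserialize the
    checkpoint straight to CPU, so torch.load() never has to allocate CUDA
    memory on top of the trained model."""
    model.cpu()  # drop the CUDA copy of the parameters
    state_dict = torch.load(checkpoint_path, map_location=torch.device("cpu"))
    model.load_state_dict(state_dict)
```

In the trainer itself, the equivalent change would be to move the wrapped model to CPU before the existing self._ddp_wrapped_model.load_state_dict(torch.load(path)) call and to pass map_location="cpu" to torch.load.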