
Eliminate init_cuda_buffer and dummy_batch #3412

Closed
stephenroller opened this issue Jan 26, 2021 · 2 comments
@stephenroller
Contributor

stephenroller commented Jan 26, 2021

[Let's put this off until we implement sharded_ddp and friends; it's not yet clear how they will interact with this. #3415]

In early implementations of distributed training, a worker that OOMed had to run a "dummy pass" afterwards so that its gradient sync stayed in step with the other workers. You can see this code here:

            oom_sync = False
        except RuntimeError as e:
            # catch out of memory exceptions during fwd/bck (skip batch)
            if 'out of memory' in str(e):
                oom_sync = True
                logging.error(
                    'Ran out of memory, skipping batch. '
                    'if this happens frequently, decrease batchsize or '
                    'truncate the inputs to the model.'
                )
                self.global_metrics.add('skipped_batches', SumMetric(1))
            else:
                raise e
        if oom_sync:
            # moved outside of the try-except because the raised exception in scope
            # actually prevents the data from being freed, which can sometimes cause
            # us to OOM during our OOM handling.
            # https://github.com/pytorch/pytorch/issues/18853#issuecomment-583779161
            # gradients are synced on backward, now this model is going to be
            # out of sync! catch up with the other workers
            self._init_cuda_buffer(8, 8, True)

    def _init_cuda_buffer(self, batchsize, maxlen, force=False):
        """
        Pre-initialize CUDA buffer by doing fake forward pass.

        This is also used in distributed mode to force a worker to sync with others.
        """
        if self.use_cuda and (force or not hasattr(self, 'buffer_initialized')):
            try:
                self._control_local_metrics(disabled=True)
                loss = 0 * self.compute_loss(self._dummy_batch(batchsize, maxlen))
                self._control_local_metrics(enabled=True)
                self._temporarily_disable_local_metrics = False
                self.backward(loss)
                self.buffer_initialized = True
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    m = (
                        'CUDA OOM: Lower batch size (-bs) from {} or lower '
                        ' max sequence length (-tr) from {}'
                        ''.format(batchsize, maxlen)
                    )
                    raise RuntimeError(m)
                else:
                    raise e

This is slightly annoying as a user: any time you add a custom field to your batches, you also have to override _dummy_batch yourself so the fake pass covers it (see the hypothetical sketch below).
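
To make that concrete, here is a purely hypothetical sketch (not ParlAI code; the base class, field names, and shapes are invented for illustration). An agent whose loss reads an extra field has to keep _dummy_batch in lockstep with it, or the fake forward pass in _init_cuda_buffer crashes on the missing field:

    import torch
    from types import SimpleNamespace


    class BaseAgent:
        def _dummy_batch(self, batchsize, maxlen):
            # stock dummy batch: fake text/label tensors only
            return SimpleNamespace(
                text_vec=torch.ones(batchsize, maxlen, dtype=torch.long),
                label_vec=torch.ones(batchsize, 2, dtype=torch.long),
            )


    class MyMultimodalAgent(BaseAgent):
        def compute_loss(self, batch):
            # the loss now also reads batch.image_features ...
            ...

        def _dummy_batch(self, batchsize, maxlen):
            # ... so every custom field must be faked here as well, or the
            # dummy pass used for OOM recovery fails
            batch = super()._dummy_batch(batchsize, maxlen)
            batch.image_features = torch.zeros(batchsize, 16)
            return batch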

DDP provides a manual sync mechanism (see DDP.join) that has the same effect as this method. We should switch to it, and then remove all instances of _init_cuda_buffer and _dummy_batch from all of our codebases.
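
For reference, a minimal sketch of DDP.join() in plain PyTorch (not ParlAI code; the tiny model, random data, and per-rank batch counts are placeholders). Each rank wraps its training loop in model.join(); a rank that stops stepping early, e.g. because it skipped or ran out of batches, has the missing gradient all-reduces shadowed so the other ranks do not hang:

    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def train(rank, world_size):
        os.environ.setdefault('MASTER_ADDR', 'localhost')
        os.environ.setdefault('MASTER_PORT', '29500')
        dist.init_process_group('gloo', rank=rank, world_size=world_size)

        model = DDP(nn.Linear(8, 8))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

        # give each rank a different number of batches, mimicking workers that
        # skip batches; without join() the shorter rank would leave the others
        # hanging in their gradient all-reduce
        with model.join():
            for _ in range(5 - rank):
                opt.zero_grad()
                loss = model(torch.randn(4, 8)).sum()
                loss.backward()  # DDP syncs gradients here
                opt.step()

        dist.destroy_process_group()


    if __name__ == '__main__':
        torch.multiprocessing.spawn(train, args=(2,), nprocs=2)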

@github-actions

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

@stephenroller
Contributor Author

Implemented in #3732
