-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Small fix for rnn_loss.py #1183
Conversation
This is another fix for a memory issue in rnnt_loss.py, similar to #1177- the problem is that a too-large zero tensor is used in the backward pass, which can lead to OOM errors in training for large batch sizes. I discovered the problem using a version of PyTorch that was built with debug info. I ran a training using large batch size (1450) and 1 job, using gdb --args.
When it stopped at the point where it was about to issue this error:
... I went to stack frame 65 which had size info:
and printed out the size info:
(I figured out these expression from looking at the types printed out for these variables.) |
No description provided.