
Fix for rnnt_loss.py #1177

Merged 6 commits into k2-fsa:master from yfyeung:yfyeung-patch-1 on Apr 26, 2023
Conversation

@yfyeung (Contributor) commented Apr 25, 2023

No description provided.

@pkufool added the "ready" label (Ready for review and trigger GitHub actions to run) on Apr 25, 2023
@pkufool added and removed the "ready" label on Apr 25, 2023
@pkufool merged commit a23383c into k2-fsa:master on Apr 26, 2023
@yfyeung deleted the yfyeung-patch-1 branch on April 26, 2023 07:33
@danpovey (Collaborator) commented:

This PR fixes an issue where a lot of memory is used in the backward pass of the simple RNNT loss.
It can cause random-seeming failures where, after some time, training ends with a message like this:

    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 5.37 GiB (GPU 0; 31.75 GiB total capacity; 17.47 GiB already allocated; 12.64 GiB free; 17.83 GiB reserved in total by PyTorch)

(Note: it remains a mystery why the failure often seems to be asking for much less memory than the device reports as free; that 12.64 GiB figure is the device_free value returned by cudaMemGetInfo(&device_free, &device_total). Possibly this has to do with other processes using the machine. Regardless, the fact is that far more memory is being used in the backward pass than really needs to be.)
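
When debugging OOM failures like the one above, it can help to compare the device-level numbers with PyTorch's allocator counters. Below is a minimal sketch (not part of this PR); torch.cuda.mem_get_info wraps cudaMemGetInfo and returns the same free/total values, while the allocated/reserved figures come from PyTorch's caching allocator. The report_cuda_memory helper name and the usage placement around backward() are illustrative, not code from k2.

    # Minimal sketch: inspect GPU memory around the backward pass when
    # debugging CUDA OOM errors.
    import torch

    def report_cuda_memory(tag: str, device: int = 0) -> None:
        """Print device-level and allocator-level memory statistics."""
        # Same free/total numbers that cudaMemGetInfo reports, in bytes.
        free, total = torch.cuda.mem_get_info(device)
        # Bytes currently held by live tensors.
        allocated = torch.cuda.memory_allocated(device)
        # Bytes reserved by PyTorch's caching allocator.
        reserved = torch.cuda.memory_reserved(device)
        gib = 1024 ** 3
        print(
            f"[{tag}] free={free / gib:.2f} GiB, total={total / gib:.2f} GiB, "
            f"allocated={allocated / gib:.2f} GiB, reserved={reserved / gib:.2f} GiB"
        )

    # Hypothetical usage around the RNN-T loss backward pass:
    #   report_cuda_memory("before backward")
    #   loss.backward()
    #   report_cuda_memory("after backward")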
