Fix RuntimeError: CUDA error: out of memory on CPU transfer (2/3) #150
Changes
- Remove non_blocking=True from all .to("cpu") calls.
- Remove .synchronize() calls (saves ~10 sec/50 iter when offloading).

Some environments, notably WSL, don't fully support memory pinning or concurrent CPU-GPU access. [1] Removing non_blocking=True from the .to("cpu") calls resolves unexpected CUDA OOM errors.
From my (limited) understanding of how non_blocking operates under the hood, this shouldn't negatively impact performance. [2] In testing, I found the bf16 timings were actually 10-30s lower than those reported in the "Different inference settings" table, but other code changes I made beforehand may have influenced that as well.
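For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the pattern. The helper name offload_to_cpu is hypothetical and does not appear in this PR; the only assumption is that the object exposes a PyTorch-style .to() method, as torch.Tensor and torch.nn.Module do.

```python
def offload_to_cpu(obj):
    """Move a tensor or module to the CPU with a plain, blocking copy.

    Hypothetical helper for illustration only; `obj` is assumed to expose
    a PyTorch-style ``.to()`` method (torch.Tensor, torch.nn.Module).
    """
    # Before this PR, the equivalent call sites looked roughly like:
    #     obj.to("cpu", non_blocking=True)
    #     torch.cuda.synchronize()
    # A blocking copy only returns once the data has arrived on the host,
    # so no explicit synchronize call is needed afterwards.
    return obj.to("cpu")
```

With PyTorch installed, offload_to_cpu(model) behaves like model.to("cpu"). The point is only that the non_blocking=True argument and the follow-up synchronize call are both gone.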
Example of Error
This is the second of 3 PRs I'm issuing to improve performance/fix errors. I've tried to keep each incremental change as small in scope as possible. PRs: 1. #149, 2. This, 3. #151
Update (2024-12-02):
Footnotes
[1] https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications
[2] https://pytorch.org/tutorials/intermediate/pinmem_nonblock.html