Fix RuntimeError: CUDA error: out of memory on CPU transfer (2/3) #150

Open
wants to merge 4 commits into main from fix_non_blocking

Conversation

Rypo

@Rypo Rypo commented Nov 28, 2024

Changes

  • Removed non_blocking=True from all .to("cpu") calls.
  • Slightly tweaked .synchronize() calls (saves ~10 sec/50 iter when offloading)

Some environments, notably WSL, don't fully support memory pinning / concurrent CPU-GPU access. 1 Removing non_blocking from .to("cpu") calls resolves the unexpected CUDA OOM errors.

From my (limited) understanding of how non_blocking operates under the hood, this shouldn't negatively impact performance. 2

In testing, I found the bf16 timings were actually 10-30 s lower than those reported in the Different inference settings table, though other code changes I made beforehand may also have influenced that.

Example of Error

import torch
device = torch.device('cuda:0')

def print_mem_free(device=None):
    mem_free, mem_total = torch.cuda.mem_get_info(device)
    print(f'Mem Free: {mem_free/(1024**3):0.2f} GB')

print_mem_free(device)
>>> Mem Free: 22.76 GB

r = torch.rand(1_000_000_000, dtype=torch.float32, device=device) # 4e9 bytes ≈ 3.73 GiB
print_mem_free(device)
>>> Mem Free: 19.03 GB

r = r.to("cpu")
torch.cuda.empty_cache()
print_mem_free(device)
>>> Mem Free: 22.76 GB

r = r.to(device)
print_mem_free(device)
>>> Mem Free: 19.03 GB

r = r.to("cpu", non_blocking=True)
>>>
    RuntimeError: CUDA error: out of memory
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
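A minimal sketch of the workaround this PR applies: use a plain blocking copy for GPU→CPU transfers. (The synchronize call here is an illustration, only needed if prior kernels might still be writing to the tensor; the CUDA branch is guarded so the snippet also runs on CPU-only machines.)

```python
import torch

# Use the GPU when available; fall back to CPU so the sketch runs anywhere.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

r = torch.rand(1_000_000, dtype=torch.float32, device=device)

if device.type == "cuda":
    # Make sure any in-flight kernels writing to `r` have finished first.
    torch.cuda.synchronize(device)

# Blocking copy: no pinned staging buffer is involved, which avoids the
# WSL "out of memory" failure even for large tensors.
r_cpu = r.to("cpu")
```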

This is the second of 3 PRs I'm issuing to improve performance/fix errors. I've tried to keep each incremental change as small in scope as possible. PRs: 1. #149, 2. This, 3. #151

Update (2024-12-02):

Footnotes

  1. https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications

  2. https://pytorch.org/tutorials/intermediate/pinmem_nonblock.html

Removes the non_blocking argument from all device-to-CPU transfers. In certain environments (e.g. WSL), large transfers throw a CUDA out-of-memory error regardless of available VRAM.

Adjusts stream synchronize for modest performance gains with cpu_offload.
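The synchronize adjustment can be sketched roughly like this (a hypothetical offload helper, not the actual repository code): synchronize the device once before a batch of offload copies instead of once per transfer.

```python
import torch
import torch.nn as nn

def offload_to_cpu(modules):
    """Move a batch of modules to CPU with a single up-front synchronize."""
    if torch.cuda.is_available():
        # One synchronize covering the whole batch, not one per module.
        torch.cuda.synchronize()
    for m in modules:
        m.to("cpu")  # blocking copies; safe on WSL

layers = [nn.Linear(8, 8) for _ in range(3)]
offload_to_cpu(layers)
```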

fixes VectorSpaceLab#90, fixes VectorSpaceLab#117
@Rypo Rypo force-pushed the fix_non_blocking branch from 2fd6a5d to 7383566 Compare December 2, 2024 22:46