Fix RuntimeError: CUDA error: out of memory on CPU transfer (2/3) #150

Open
wants to merge 4 commits into main from fix_non_blocking

Conversation

Rypo

@Rypo Rypo commented Nov 28, 2024

Changes

  • Removed non_blocking=True from all .to("cpu") calls.
  • Slightly tweaked .synchronize() calls (saves ~10 sec/50 iter when offloading)

Some environments, notably WSL, don't fully support memory pinning / concurrent CPU-GPU access. 1 Removing non_blocking from .to("cpu") calls resolves the unexpected CUDA OOM errors.

From my (limited) understanding of how non_blocking operates under the hood, this shouldn't negatively impact performance. 2

In testing, I found the bf16 timings were actually 10-30 s lower than those reported in the Different inference settings table, though other code changes I made beforehand may also have influenced that.

Example of Error

import torch
device = torch.device('cuda:0')

def print_mem_free(device=None):
    mem_free, mem_total = torch.cuda.mem_get_info(device)
    print(f'Mem Free: {mem_free/(1024**3):0.2f} GB')

print_mem_free(device)
>>> Mem Free: 22.76 GB

r = torch.rand(1_000_000_000, dtype=torch.float32, device=device) # 4e9 bytes ≈ 3.73 GiB
print_mem_free(device)
>>> Mem Free: 19.03 GB

r = r.to("cpu")
torch.cuda.empty_cache()
print_mem_free(device)
>>> Mem Free: 22.76 GB

r = r.to(device)
print_mem_free(device)
>>> Mem Free: 19.03 GB

r = r.to("cpu", non_blocking=True)
>>>
    RuntimeError: CUDA error: out of memory
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
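A minimal sketch of the workaround this PR applies: use a plain blocking copy for GPU→CPU transfers. (The synchronize call here is an illustration, only needed if prior kernels might still be writing to the tensor; the CUDA branch is guarded so the snippet also runs on CPU-only machines.)

```python
import torch

# Use the GPU when available; fall back to CPU so the sketch runs anywhere.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

r = torch.rand(1_000_000, dtype=torch.float32, device=device)

if device.type == "cuda":
    # Make sure any in-flight kernels writing to `r` have finished first.
    torch.cuda.synchronize(device)

# Blocking copy: no pinned staging buffer is involved, which avoids the
# WSL "out of memory" failure even for large tensors.
r_cpu = r.to("cpu")
```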

This is the second of 3 PRs I'm issuing to improve performance/fix errors. I've tried to keep each incremental change as small in scope as possible. PRs: 1. #149, 2. This, 3. #151

Update (2024-12-02):

Footnotes

  1. https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications

  2. https://pytorch.org/tutorials/intermediate/pinmem_nonblock.html

Removes the non_blocking argument from all device-to-CPU transfers. In certain environments (e.g. WSL), large transfers throw a CUDA out-of-memory error regardless of available VRAM.

Adjusts stream synchronize for modest performance gains with cpu_offload.
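The synchronize adjustment can be sketched roughly like this (a hypothetical offload helper, not the actual repository code): synchronize the device once before a batch of offload copies instead of once per transfer.

```python
import torch
import torch.nn as nn

def offload_to_cpu(modules):
    """Move a batch of modules to CPU with a single up-front synchronize."""
    if torch.cuda.is_available():
        # One synchronize covering the whole batch, not one per module.
        torch.cuda.synchronize()
    for m in modules:
        m.to("cpu")  # blocking copies; safe on WSL

layers = [nn.Linear(8, 8) for _ in range(3)]
offload_to_cpu(layers)
```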

fixes VectorSpaceLab#90, fixes VectorSpaceLab#117
@Rypo Rypo force-pushed the fix_non_blocking branch from 2fd6a5d to 7383566 Compare December 2, 2024 22:46