
CUDA out of memory #6

Open
anhnb206110 opened this issue Aug 23, 2024 · 3 comments

@anhnb206110

Thank you for your work on this project!

I followed your instructions to train the model on the 'male-3-casual' subject from the PeopleSnapshot dataset, without modifying any configurations in the config file. However, I encountered a CUDA out of memory error during training.

Here’s the error message I received:

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.28 GiB (GPU 0; 39.39 GiB total capacity; 28.72 GiB already allocated; 1.14 GiB free; 34.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It appears that the memory usage increases with each epoch until it runs out of memory. Could you please help me understand why this is happening and suggest any possible solutions to resolve the issue?
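For reference, the allocator hint from the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch initializes CUDA. This is only a minimal sketch; the split size below is an example value, not a recommendation from the authors:

```python
# Minimal sketch: enable the caching-allocator hint suggested by the OOM message.
# The environment variable must be set before CUDA is initialized, so set it
# before importing torch; 128 MiB is only an example split size.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the allocator picks it up
```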

@trThanhnguyen

Hi @anhnb206110, the increase in memory consumption is due to the physical-properties stage kicking in. You can find its settings in the configs/config.yaml file. Personally, I'm handling it by reducing the number of samples per pixel; note that if you do so, you'll also need to change some related code in models/intrinsic_avatar.py, lines 1392-1407.
Also, I'm not sure this is the right approach, so I hope you can share your solution as well while we wait for the authors' recommendation.

@taconite
Owner

taconite commented Sep 3, 2024

Hi,

I tried a clean installation and tested on the male-3-casual subject. Unfortunately, I did not run into any OOM issue. My current GPU has only 24 GB of VRAM (TITAN RTX); the VRAM usage peaks at ~21 GB.

Could you specify during which epochs you observe memory growth? Does the OOM happen during training or validation? (Validation happens every 2000 steps and is interleaved with training.)
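To narrow down whether the growth comes from training or validation, a small memory-logging callback could be passed to the Trainer. This is just a sketch assuming pytorch-lightning 1.9.x hooks; GPUMemoryLogger is a hypothetical helper, not part of this repo:

```python
import torch
import pytorch_lightning as pl


class GPUMemoryLogger(pl.Callback):
    """Print allocated/reserved CUDA memory so growth across epochs is visible."""

    def _report(self, tag: str) -> None:
        gib = 1024 ** 3
        print(
            f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
            f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB"
        )

    def on_train_epoch_end(self, trainer, pl_module):
        self._report(f"train epoch {trainer.current_epoch}")

    def on_validation_end(self, trainer, pl_module):
        self._report(f"validation @ step {trainer.global_step}")
```

Passing an instance via Trainer(callbacks=[GPUMemoryLogger()]) (or the project's equivalent callback hook) would show whether the numbers keep climbing after each validation pass.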

For reference, my test environment is Ubuntu 20.04 / CentOS 7.9.2009 with Python 3.10, PyTorch 1.13, and CUDA 11.6. It might be helpful to align the software versions, especially Python and PyTorch. For this project my pytorch-lightning version is 1.9.5, so it would also be good to know which pytorch-lightning version you are using.

Lastly, I have observed GPU memory leaks with pytorch-lightning in other projects. They mainly happen during inference (under torch.no_grad()), where certain tensors that should be freed after inference are somehow kept in memory. You could try disabling the validation routine during training by adding trainer.val_check_interval=null and see if the OOM issue persists.

@taconite
Owner

taconite commented Sep 3, 2024

> Hi @anhnb206110, the increase in memory consumption is due to the physical-properties stage kicking in. You can find its settings in the configs/config.yaml file. Personally, I'm handling it by reducing the number of samples per pixel; note that if you do so, you'll also need to change some related code in models/intrinsic_avatar.py, lines 1392-1407. Also, I'm not sure this is the right approach, so I hope you can share your solution as well while we wait for the authors' recommendation.

I am not sure if you are encountering the same issue as @anhnb206110 - it seems that your GPU has limited memory to run at the default SPP. A TITAN RTX (24 GB) or better is needed to run the default config. If you want to keep the full SPP while reducing VRAM usage, you can also try reducing model.secondary_shader_chunk to e.g. 80000 to trade training/inference speed for memory consumption.
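For intuition, the chunk size caps how many secondary shading queries are processed at once, so a smaller value lowers peak VRAM at the cost of more (slower) kernel launches. A generic illustration of the trade-off, not the project's actual implementation (shade_fn and the names below are hypothetical):

```python
import torch


def shade_in_chunks(shade_fn, rays: torch.Tensor, chunk_size: int = 80_000) -> torch.Tensor:
    """Apply `shade_fn` to `rays` in chunks to cap peak GPU memory.

    A smaller `chunk_size` lowers peak VRAM but issues more kernel launches,
    i.e. the same speed-for-memory trade-off as model.secondary_shader_chunk.
    """
    outputs = [
        shade_fn(rays[i : i + chunk_size])
        for i in range(0, rays.shape[0], chunk_size)
    ]
    return torch.cat(outputs, dim=0)
```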
