CUDA out of memory on a 40G GPU #11

GulpFire · 2024-06-16T03:46:52Z

Following docs/ISSUES.md and docs/SAMPLING.md set, but still out of memory. Here's my config and instruction

In configs/inference/vista.yaml, change en_and_decode_n_samples_a_time to 1

model:
  target: vwm.models.diffusion.DiffusionEngine
  params:
    input_key: img_seq
    scale_factor: 0.18215
    disable_first_stage_autocast: True
    en_and_decode_n_samples_a_time: 1
    num_frames: &num_frames 25

then run sample with

python sample.py  --low_vram

caught by

Traceback (most recent call last):
  File "/gemini/code/Vista/sample.py", line 245, in <module>
    out = do_sample(
  File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/gemini/code/Vista/sample_utils.py", line 304, in do_sample
    c, uc = get_condition(model, value_dict, num_frames, force_uc_zero_embeddings, device)
  File "/gemini/code/Vista/sample_utils.py", line 262, in get_condition
    c, uc = model.conditioner.get_unconditional_conditioning(
  File "/gemini/code/Vista/vwm/modules/encoders/modules.py", line 175, in get_unconditional_conditioning
    c = self(batch_c, force_cond_zero_embeddings)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/Vista/vwm/modules/encoders/modules.py", line 127, in forward
    emb_out = embedder(batch[embedder.input_key])
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/Vista/vwm/modules/encoders/modules.py", line 488, in forward
    out = self.encoder.encode(vid[n * n_samples: (n + 1) * n_samples])
  File "/gemini/code/Vista/vwm/models/autoencoder.py", line 470, in encode
    z = self.encoder(x)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/Vista/vwm/modules/diffusionmodules/model.py", line 540, in forward
    h = self.down[i_level].block[i_block](hs[-1], temb)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gemini/code/Vista/vwm/modules/diffusionmodules/model.py", line 119, in forward
    h = nonlinearity(h)
  File "/gemini/code/Vista/vwm/modules/diffusionmodules/model.py", line 48, in nonlinearity
    return x * torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.52 GiB (GPU 0; 39.40 GiB total capacity; 37.00 GiB already allocated; 1.62 GiB free; 37.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Segmentation fault (core dumped)

Did I miss some useful setting? Any help would be appreciated.

The text was updated successfully, but these errors were encountered:

zhangxiao696 · 2024-06-24T13:05:35Z

I also encountered the same problem, hope to receive help

shashankvkt · 2024-06-27T15:33:36Z

I faced a similar challenge.
So, I dont think this a permanent solution, but just to see the working of the methods, I decreased the resolution of generated video to 320 x 576 along with setting en_and_decode_n_samples_a_time: 1

Maybe the authors @zhangxiao696 can provide a better solution.
Thanks

zhangxiao696 · 2024-06-28T04:11:11Z

I faced a similar challenge. So, I dont think this a permanent solution, but just to see the working of the methods, I decreased the resolution of generated video to 320 x 576 along with setting en_and_decode_n_samples_a_time: 1

Maybe the authors @zhangxiao696 can provide a better solution. Thanks

I change num_frames and n_frames to 22 can works. Same as you , it's only a temporary solution

shashankvkt · 2024-06-28T07:44:43Z

@Little-Podi @kashyap7x @YTEP-ZHI, may I ask if you guys have any alternate solution?

Little-Podi · 2024-06-28T08:29:19Z

Sorry, we do not have any other useful tips to provide at the moment. Currently, only --low_vram is able to save the memory without degrading the quality. Other solutions, such as reducing the resolution and the number of frames, are likely to generate inferior results. We will try our best to solve this challenge.

roym899 · 2024-07-12T18:13:01Z

There is actually a way to further reduce memory usage without reducing the quality.

The encoder processes all frames in parallel, but this can be changed to process them in sequence instead. After this change the model works fine with less than 20GB memory.
Basically replace

Vista/vwm/modules/encoders/modules.py

Line 127 in cea9cd9

emb_out = embedder(batch[embedder.input_key])

with
https://github.com/rerun-io/hf-example-vista/blob/381b9d574befe0e9a60e9130980d8da0aec5c6ec/vista/vwm/modules/encoders/modules.py#L129-L134

LMD0311 · 2024-07-13T09:28:11Z

There is actually a way to further reduce memory usage without reducing the quality.

The encoder processes all frames in parallel, but this can be changed to process them in sequence instead. After this change the model works fine with less than 20GB memory. Basically replace

Vista/vwm/modules/encoders/modules.py

Line 127 in cea9cd9

emb_out = embedder(batch[embedder.input_key])

with
https://github.com/rerun-io/hf-example-vista/blob/381b9d574befe0e9a60e9130980d8da0aec5c6ec/vista/vwm/modules/encoders/modules.py#L129-L134

I attempted your approach with reduced resolution (64*64), num_frames=4, checkpoint and LoRA, but there is still insufficient memory on eight 24G GPUs.

roym899 · 2024-07-13T09:34:59Z

Maybe try the fork I linked and see if that works. It works fine for me 25 frames, any number of segments, full resolution. You also have to use the low memory mode if you aren't yet.

LMD0311 · 2024-07-13T09:36:41Z

Maybe try the fork I linked and see if that works. It works fine for me 25 frames, any number of segments, full resolution. You also have to use the low memory mode if you aren't yet.

Thanks for replying!

wangsdchn · 2024-08-12T12:44:44Z

There is actually a way to further reduce memory usage without reducing the quality.

The encoder processes all frames in parallel, but this can be changed to process them in sequence instead. After this change the model works fine with less than 20GB memory. Basically replace

Vista/vwm/modules/encoders/modules.py

Line 127 in cea9cd9

emb_out = embedder(batch[embedder.input_key])

with
https://github.com/rerun-io/hf-example-vista/blob/381b9d574befe0e9a60e9130980d8da0aec5c6ec/vista/vwm/modules/encoders/modules.py#L129-L134

solve my problem !!!

TianDianXin · 2024-11-13T07:44:54Z

There is actually a way to further reduce memory usage without reducing the quality.

The encoder processes all frames in parallel, but this can be changed to process them in sequence instead. After this change the model works fine with less than 20GB memory. Basically replace

Vista/vwm/modules/encoders/modules.py

Line 127 in cea9cd9

emb_out = embedder(batch[embedder.input_key])

with
https://github.com/rerun-io/hf-example-vista/blob/381b9d574befe0e9a60e9130980d8da0aec5c6ec/vista/vwm/modules/encoders/modules.py#L129-L134

Thank you for your sharing, I can successfully use 40G GPU memory for reasoning, but can not train, even if the resolution is 320*576, may I ask how you successfully test training in small GPU memory ?

SEU-zxj · 2024-12-06T05:57:29Z

@TianDianXin Seems at this time, we can only switch A100 40G to A100 80G😂

SEU-zxj · 2024-12-06T06:46:22Z

Hello, everyone!
I have a question.
Does the method proposed by @roym899 will affect the model's results? I am worried about that......

there maybe some batch normalization operations in the encoder, and encode the batch in parallel will produce different results compared with encode the batch sequentially and then concat the results......

@Little-Podi need your help😖

YTEP-ZHI · 2024-12-09T13:56:01Z

Hi @SEU-zxj, I think you can do that modification confidently. There are no batchnorm included in the model, thereby encoding the batch sequentially will NOT hurt the performance.

SEU-zxj · 2024-12-10T12:19:34Z

OK, Thanks for your reply! @YTEP-ZHI

YTEP-ZHI added the enhancement New feature or request label Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA out of memory on a 40G GPU #11

CUDA out of memory on a 40G GPU #11

GulpFire commented Jun 16, 2024

zhangxiao696 commented Jun 24, 2024

shashankvkt commented Jun 27, 2024

zhangxiao696 commented Jun 28, 2024

shashankvkt commented Jun 28, 2024

Little-Podi commented Jun 28, 2024 •

edited

Loading

roym899 commented Jul 12, 2024 •

edited

Loading

LMD0311 commented Jul 13, 2024

roym899 commented Jul 13, 2024

LMD0311 commented Jul 13, 2024

wangsdchn commented Aug 12, 2024

TianDianXin commented Nov 13, 2024

SEU-zxj commented Dec 6, 2024

SEU-zxj commented Dec 6, 2024

YTEP-ZHI commented Dec 9, 2024

SEU-zxj commented Dec 10, 2024

CUDA out of memory on a 40G GPU #11

CUDA out of memory on a 40G GPU #11

Comments

GulpFire commented Jun 16, 2024

zhangxiao696 commented Jun 24, 2024

shashankvkt commented Jun 27, 2024

zhangxiao696 commented Jun 28, 2024

shashankvkt commented Jun 28, 2024

Little-Podi commented Jun 28, 2024 • edited Loading

roym899 commented Jul 12, 2024 • edited Loading

LMD0311 commented Jul 13, 2024

roym899 commented Jul 13, 2024

LMD0311 commented Jul 13, 2024

wangsdchn commented Aug 12, 2024

TianDianXin commented Nov 13, 2024

SEU-zxj commented Dec 6, 2024

SEU-zxj commented Dec 6, 2024

YTEP-ZHI commented Dec 9, 2024

SEU-zxj commented Dec 10, 2024

Little-Podi commented Jun 28, 2024 •

edited

Loading

roym899 commented Jul 12, 2024 •

edited

Loading