
OUT OF MEMORY #7

Open
BlackPuuuudding opened this issue Apr 26, 2024 · 1 comment

Comments

@BlackPuuuudding

Why does my training run out of memory when the batch size is set to 4, and also run out of memory with batch size 2 during multi-GPU training, while the paper is able to set it to 8? I'm using the same device as the one mentioned in the paper, a 4090, and the checkpoints are SD 1.4 and interact-diffusion-v1-1.pth.
Thank you!

@jiuntian
Owner

We use this command for training:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>

We use AMP, and the batch size is set to 4 for each GPU.
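
For reference, this is just arithmetic from the flags above (not stated explicitly in the thread): --batch_size=4 per GPU across 2 GPUs gives a global batch of 4 × 2 = 8 per step, which lines up with the 8 reported in the paper, and --gradient_accumulation_step 2 presumably doubles the effective batch to 16 without raising peak memory.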

Training details are in the README.
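
If it helps, here is a minimal sketch of how AMP and gradient accumulation combine to cut per-step memory in a PyTorch training loop. It is illustrative only: the toy model, data, and hyperparameters are placeholders, not the actual main.py code.

# Minimal illustrative sketch of AMP + gradient accumulation in PyTorch
# (placeholder model/data, not InteractDiffusion's actual main.py).
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 128).to(device)                  # stand-in for the diffusion model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                    # loss scaling for fp16 stability
accum_steps = 2                                         # mirrors --gradient_accumulation_step 2

optimizer.zero_grad(set_to_none=True)
for step in range(100):
    x = torch.randn(4, 128, device=device)              # micro-batch of 4, as in --batch_size=4
    with torch.cuda.amp.autocast():                      # forward pass in mixed precision
        loss = model(x).pow(2).mean() / accum_steps      # scale loss by number of accumulation steps
    scaler.scale(loss).backward()                        # scaled gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                           # unscales gradients, then applies the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

Because each micro-batch runs its forward/backward in half precision and the optimizer only steps every accum_steps micro-batches, peak activation memory is set by the per-GPU micro-batch (4) rather than by the effective batch.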
