
lower GPU memory usage #10

Open
blackight opened this issue Jun 15, 2024 · 7 comments

@blackight

blackight commented Jun 15, 2024

It seems that the fp16 setting is not effective. I tried using fp16 manually and offloading the autoencoder and CLIP to CPU memory during DDIM denoising, and I can run a 45x512x768 video in 12 GB of GPU memory.

@wangxiang1230
Collaborator

> It seems that the fp16 setting is not effective. I tried using fp16 manually and offloading the autoencoder and CLIP to CPU memory during DDIM denoising, and I can run a 45x512x768 video in 12 GB of GPU memory.

Hi, thank you for your attention. You can merge your changes into our code, or post your improved code here to help others run the models with fewer resources.

@wangxiang1230
Collaborator

> It seems that the fp16 setting is not effective. I tried using fp16 manually and offloading the autoencoder and CLIP to CPU memory during DDIM denoising, and I can run a 45x512x768 video in 12 GB of GPU memory.

Hi, I offloaded the autoencoder and CLIP, and the GPU memory used is still ~22 GB. How do you reduce the memory further? That may be useful.

@blackight
Author

blackight commented Jun 15, 2024

    model = model.to(gpu)
    model.eval()
    model.to(torch.float16)  # add this line: cast the UNet to fp16

    # DDP removed: on my single-GPU PC it only increased memory usage
    # model = DistributedDataParallel(model, device_ids=[gpu]) if not cfg.debug else model

    ............

    # add these lines: offload the encoders to CPU and free cached memory before denoising
    clip_encoder.cpu()
    autoencoder.cpu()
    torch.cuda.empty_cache()

    video_data = diffusion.ddim_sample_loop(
        noise=noise_one,
        model=model.eval(),
        model_kwargs=model_kwargs_one,
        guide_scale=cfg.guide_scale,
        ddim_timesteps=cfg.ddim_timesteps,
        eta=0.0)

    # if the forward pass of autoencoder or clip_encoder is needed again, move them back
    clip_encoder.cuda()
    autoencoder.cuda()

My code looks like this; both fp16 in the UNet and CPU offloading are needed.
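
For reference, the same offload/reload pattern can be wrapped in a small context manager so it is harder to forget the reload step. This is a minimal sketch; the `offloaded` helper is not part of the repo, and `clip_encoder`/`autoencoder` stand for any `nn.Module`:

    import torch
    from contextlib import contextmanager

    @contextmanager
    def offloaded(*modules):
        # move the given modules to CPU and free cached GPU memory
        for m in modules:
            m.cpu()
        torch.cuda.empty_cache()
        try:
            yield
        finally:
            # restore the modules to the GPU for any later forward passes
            for m in modules:
                m.cuda()

    # usage around the sampling call above:
    # with offloaded(clip_encoder, autoencoder):
    #     video_data = diffusion.ddim_sample_loop(...)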

@wangxiang1230
Collaborator

Good, thanks for your contribution. I will add it to our code.

@zephirusgit

I haven't been able to get it to run well. With an RTX 2060 with 12 GB of VRAM, I see that it uses 21 GB of shared memory; it starts, but it sat at 0% for a while, so I stopped it. From what I see, it is designed for some kind of GPU cluster, and I had to modify it so that it does not ask for that, since there is no NCCL on Windows. (I don't have any more GPUs either.)

@wangxiang1230
Collaborator

wangxiang1230 commented Jun 16, 2024

> I haven't been able to get it to run well. With an RTX 2060 with 12 GB of VRAM, I see that it uses 21 GB of shared memory; it starts, but it sat at 0% for a while, so I stopped it. From what I see, it is designed for some kind of GPU cluster, and I had to modify it so that it does not ask for that, since there is no NCCL on Windows. (I don't have any more GPUs either.)

Hi, thank you for your attention. We noticed your problem, but since we don't have a Windows machine, we couldn't help modify the code. You can try changing max_frames to 16 or 24. We also welcome your suggestions and hope you will improve the code; we will incorporate the improved code into ours so that more people (researchers using different systems) can run the program. Thank you.
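
A minimal sketch of that change, assuming `cfg` is the same inference config object used in the snippet above (the exact field location may differ in the actual config files):

    cfg.max_frames = 16  # or 24; fewer frames per clip lowers peak GPU memory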

@blackight
Author

> I haven't been able to get it to run well. With an RTX 2060 with 12 GB of VRAM, I see that it uses 21 GB of shared memory; it starts, but it sat at 0% for a while, so I stopped it. From what I see, it is designed for some kind of GPU cluster, and I had to modify it so that it does not ask for that, since there is no NCCL on Windows. (I don't have any more GPUs either.)

You can delete the DistributedDataParallel wrapper in the code to avoid requiring a GPU cluster, or change the "nccl" backend to "gloo"; refer to https://pytorch.org/docs/stable/distributed.html.
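
A minimal sketch of the backend switch (the init_method address and world size are illustrative for a single local process; match them to the repo's actual setup code):

    import torch.distributed as dist

    # "gloo" runs on Windows; "nccl" requires NVIDIA GPUs on Linux
    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)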
