
Multi-GPU Inference Support or Video Splitting for Long Video Processing #6

Open
zRzRzRzRzRzRzR opened this issue Aug 18, 2024 · 15 comments

Comments

@zRzRzRzRzRzRzR

We are working with videos that range from 6 to 10 seconds in length, which obviously leads to Out Of Memory (OOM) errors during processing. We have access to high-performance hardware, such as multiple A100 GPUs.

  1. Is there a way to implement multi-GPU inference to handle these longer videos? If so, could you provide guidance on how to set it up?
  2. If multi-GPU inference is not supported, is there a method to split the video into smaller segments for processing? We are concerned that splitting the video might degrade the final output quality. Could you suggest the best practices to minimize quality loss in this scenario?
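
For point 2, a common generic workaround (not an API of this repository, just a hedged sketch of the idea) is to split the frame sequence into overlapping temporal chunks, enhance each chunk independently, and average the overlapping frames to hide seams. In the sketch below, enhance_chunk, chunk, and overlap are placeholder names, and the per-chunk call is assumed to return frames with the same shape it was given:

# Hedged sketch: enhance a long clip in overlapping temporal chunks and blend
# the overlaps by averaging. `enhance_chunk` is a placeholder for whatever
# per-chunk model call you use; it is NOT a function from this repository.
import torch

def enhance_in_chunks(frames: torch.Tensor, enhance_chunk, chunk: int = 32, overlap: int = 8):
    """frames: [T, C, H, W] -> enhanced frames of the same shape."""
    T = frames.shape[0]
    out = torch.zeros_like(frames, dtype=torch.float32)
    weight = torch.zeros(T, 1, 1, 1)
    start = 0
    while start < T:
        end = min(start + chunk, T)
        out[start:end] += enhance_chunk(frames[start:end]).float()
        weight[start:end] += 1.0
        if end == T:
            break
        start = end - overlap  # step forward, keeping `overlap` frames shared
    return (out / weight).to(frames.dtype)

The overlap is what limits visible seams, at the cost of extra compute. Whether this preserves quality for a diffusion-based enhancer is exactly the concern raised above, so it is worth validating on a short clip first.
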
@hejingwenhejingwen
Collaborator

I am working on processing arbitrarily long videos. The update will be released in two days.

@hejingwenhejingwen
Collaborator

Hi, please check the results here: #8

@zRzRzRzRzRzRzR
Author

Sure, I'll check this ASAP, thanks!

@zRzRzRzRzRzRzR
Author

Has any checkpoint changed? I found that I need to load the laion2b_s32b_b79k model.

@hejingwenhejingwen
Collaborator

The checkpoints are the same as the previous ones. The laion2b_s32b_b79k model is: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin
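
If you want to pre-download or sanity-check those weights, here is a minimal sketch using the open_clip package (this only fetches and builds the model; the repo itself may still expect the .bin file at its own configured local path):

# Minimal sketch (assumes the `open_clip` package is installed): downloads the
# ViT-H-14 weights tagged laion2b_s32b_b79k from the Hugging Face Hub and builds
# the model, i.e. the same checkpoint the FrozenOpenCLIPEmbedder uses.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")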

@zRzRzRzRzRzRzR
Author

/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-08-20 13:25:17,553 - video_to_video - INFO - checkpoint_path: ./ckpts/venhancer_paper.pt
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/open_clip/factory.py:88: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=map_location)
2024-08-20 13:25:37,486 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  load_dict = torch.load(cfg.model_path, map_location='cpu')
2024-08-20 13:25:55,391 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-08-20 13:25:55,392 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-08-20 13:26:16,092 - video_to_video - INFO - input video path: inputs/000000.mp4
2024-08-20 13:26:16,093 - video_to_video - INFO - text: Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere
2024-08-20 13:26:16,156 - video_to_video - INFO - input frames length: 49
2024-08-20 13:26:16,156 - video_to_video - INFO - input fps: 8.0
2024-08-20 13:26:16,156 - video_to_video - INFO - target_fps: 24.0
2024-08-20 13:26:16,311 - video_to_video - INFO - input resolution: (480, 720)
2024-08-20 13:26:16,312 - video_to_video - INFO - target resolution: (1320, 1982)
2024-08-20 13:26:16,312 - video_to_video - INFO - noise augmentation: 250
2024-08-20 13:26:16,312 - video_to_video - INFO - scale s is set to: 8
2024-08-20 13:26:16,399 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:108: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(enabled=True):
2024-08-20 13:27:19,605 - video_to_video - INFO - step: 0
2024-08-20 13:30:12,020 - video_to_video - INFO - step: 1
2024-08-20 13:33:04,956 - video_to_video - INFO - step: 2
2024-08-20 13:35:58,691 - video_to_video - INFO - step: 3
2024-08-20 13:38:51,254 - video_to_video - INFO - step: 4
2024-08-20 13:41:44,150 - video_to_video - INFO - step: 5
2024-08-20 13:44:37,017 - video_to_video - INFO - step: 6
2024-08-20 13:47:30,037 - video_to_video - INFO - step: 7
2024-08-20 13:50:22,838 - video_to_video - INFO - step: 8
2024-08-20 13:53:15,844 - video_to_video - INFO - step: 9
2024-08-20 13:56:08,657 - video_to_video - INFO - step: 10
2024-08-20 13:59:01,648 - video_to_video - INFO - step: 11
2024-08-20 14:01:54,541 - video_to_video - INFO - step: 12
2024-08-20 14:04:47,488 - video_to_video - INFO - step: 13
2024-08-20 14:10:13,637 - video_to_video - INFO - sampling, finished.

So slow. Is this normal, running on a single A100?

@hejingwenhejingwen
Collaborator

Sadly, it is normal. It makes sense because you are processing high-resolution and high-frame-rate videos.
Multi-GPU inference may help, but don't expect too much :(
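
For reference, a quick sanity check on the timestamps in the log above (step 0 at 13:27:19, step 13 at 14:04:47) matches the observed wall time:

# Timestamps copied from the log above; a rough per-step estimate only.
from datetime import datetime

t0  = datetime.strptime("13:27:19", "%H:%M:%S")   # step 0
t13 = datetime.strptime("14:04:47", "%H:%M:%S")   # step 13
per_step = (t13 - t0) / 13
print(per_step)        # ~173 s per diffusion step
print(per_step * 15)   # ~43 min for the 15 fixed "fast" solver steps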

@zRzRzRzRzRzRzR
Author

zRzRzRzRzRzRzR commented Aug 20, 2024

Sadly, it is normal. It makes sense because you are processing high-resolution and high-frame-rate videos. Multi-GPU inference may help, but don't expect too much :(

How do I configure that? I did not see it in the README. And by the way, is it absolutely necessary to set the prompt to be the same as the one used to generate the video in CogVideoX?

@hejingwenhejingwen
Collaborator

Multi-GPU inference is not supported right now, but we are working on it.
VEnhancer is trained mostly with short captions, so I am not sure it can understand long captions. It may generate unpleasant textures (not sure) if you provide too many words. More importantly, the CLIP model we use has a maximum context of 77 tokens.
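
To see that limit concretely, a small check with the open_clip tokenizer (assuming that is the tokenizer behind the embedder) shows that anything past the 77-token context is silently truncated rather than rejected:

# The CLIP text tokenizer has a fixed 77-token context; longer prompts are cut off.
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-H-14")
long_prompt = "Wide-angle aerial shot at dawn, soft morning light casting long shadows, " * 10
tokens = tokenizer([long_prompt])
print(tokens.shape)   # torch.Size([1, 77]) -- tokens beyond the context window are dropped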

@zRzRzRzRzRzRzR
Author

Oh, that’s an issue because CogVideoX supports long text, typically exceeding 77 words, usually around 150-220 words.

I’d like to know how to reproduce your rendered video. How should the prompt be written, given that the original video prompt is longer than 77 words?

@hejingwenhejingwen
Collaborator

I only adopt the first sentence. For example: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope.

The results I present in the README are not produced by the released VEnhancer checkpoint. The released VEnhancer has powerful generative ability and is more suitable for lower-quality, lower-resolution AIGC videos. But CogVideoX can already produce good videos, so I used another checkpoint that just enhances temporal consistency and removes unpleasant textures.
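
A tiny illustrative helper (hypothetical, not part of either repository) for trimming a long CogVideoX caption down to its first sentence before passing it to VEnhancer:

# Hypothetical helper: keep only the first sentence of a long caption.
def first_sentence(caption: str) -> str:
    head = caption.split(". ")[0].strip()
    return head if head.endswith(".") else head + "."

prompt = first_sentence(
    "The camera follows behind a white vintage SUV with a black roof rack as it speeds up "
    "a steep dirt road surrounded by pine trees on a steep mountain slope. "
    "Additional descriptive sentences would follow here."   # placeholder tail
)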

@zRzRzRzRzRzRzR
Author

So, with the released version, it's possible to reproduce the results if I only use the first sentence of the prompt?
I'm currently writing the quick start guide for this and preparing to post it in the CogVideoX community, so I need to confirm this issue :)

@hejingwenhejingwen
Collaborator

hejingwenhejingwen commented Aug 20, 2024

The released ckpt; up_scale=3; noise_aug=200; target_fps=24, prompt="A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea"

It will produce results like this:

A.detailed.wooden.toy.ship.with.intricately.carved.masts.and.sails.is.seen.gliding.smoothly.over.a.plush.blue.carpet.mp4

If you are happy with this, you can use the above parameters.
Actually, up_scale can be set to 2 if you cannot wait, but the quality will degrade. Besides, fps >= 16 is already very smooth, so you can also lower the target_fps to 16. noise_aug controls the refinement strength; it depends on the user's preference.
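
For completeness, this is roughly how those settings could be passed to the inference script. The flag names below are an assumption on my part, mirroring the parameter names discussed above; please verify them against enhance_a_video.py (or the README) before using them:

# Assumed invocation; flag names may differ from the script's actual CLI --
# check enhance_a_video.py in the VEnhancer repo for the exact argument names.
import subprocess

subprocess.run([
    "python", "enhance_a_video.py",
    "--input_path", "inputs/000000.mp4",
    "--prompt", "A detailed wooden toy ship with intricately carved masts and sails is seen "
                "gliding smoothly over a plush, blue carpet that mimics the waves of the sea",
    "--up_scale", "3",
    "--noise_aug", "200",
    "--target_fps", "24",
], check=True)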

@zRzRzRzRzRzRzR
Author

@hejingwenhejingwen
Collaborator

hejingwenhejingwen commented Aug 20, 2024

https://github.com/THUDM/CogVideo/pull/143/files#diff-9e657cda0980a4aee4b86550d3640347df4f55f3ac3a827132471681fdc7f52c

Does this guide work? (I tested it and it works for me.) If it is OK, I will push it.

- up_scale is recommended to be set to 3 or 4, or to 2 if the resolution of the input video is already high. The target resolution is limited to around 2K and below.
- The noise_aug value depends on the input video quality. Lower quality needs higher noise levels, which correspond to stronger refinement. 250~300 is for very low-quality videos; for good videos, use <= 200.
- If you want fewer steps, change solver_mode to "normal" first, then reduce the number of steps. The "fast" solver_mode has a fixed number of steps (15).

These are my comments. Thanks for your work!
