This repository contains the official PyTorch implementation of Ground-A-Video.
[ICLR 2024] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
Hyeonho Jeong,
Jong Chul Ye
Ground-A-Video is the first groundings-driven video editing framework, specifically designed for Multi-Attribute Video Editing.
Ground-A-Video is the first framework to integrate spatially-continuous and spatially-discrete conditions.
Ground-A-Video neither neglects nor confuses edits, and it preserves non-target regions.
[TL;DR] Stable Diffusion 3D + ControlNet 3D + GLIGEN 3D + Optical Flow Smoothing = Ground-A-Video
Full abstract
We introduce a novel groundings-guided video-to-video translation framework called Ground-A-Video. Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without the aforementioned shortcomings. Central to our method is the introduction of cross-frame gated attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
- [11/11/2023] The paper is currently under review.
- [01/15/2024] Code released!
git clone https://github.com/Ground-A-Video/Ground-A-Video.git
cd Ground-A-Video
conda create -n groundvideo python=3.8
conda activate groundvideo
pip install -r requirements.txt
Important: Ensure that you download the model weights before executing the scripts.
git lfs install
git clone https://huggingface.co/gligen/gligen-inpainting-text-box
git clone https://huggingface.co/ground-a-video/unet3d_ckpts
git clone https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth
These commands will place the pretrained GLIGEN, 3D UNet, and ControlNet weights at:
Ground-A-Video/gligen-inpainting-text-box/diffusion_pytorch_model.bin
Ground-A-Video/unet3d_ckpts/diffusion_pytorch_model.bin
Ground-A-Video/unet3d_ckpts/config.json
Ground-A-Video/control_v11f1p_sd15_depth/diffusion_pytorch_model.bin
Ground-A-Video/control_v11f1p_sd15_depth/config.json
Alternatively, you can manually download the weights through the Hugging Face web interface at the repository URLs listed above.
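Either way, you can run a quick sanity check from the repository root to confirm that the expected weight files are in place (a minimal sketch using standard shell commands):

```bash
# List the downloaded checkpoints and configs; each file should exist and be non-empty
ls -lh gligen-inpainting-text-box/diffusion_pytorch_model.bin \
       unet3d_ckpts/diffusion_pytorch_model.bin \
       unet3d_ckpts/config.json \
       control_v11f1p_sd15_depth/diffusion_pytorch_model.bin \
       control_v11f1p_sd15_depth/config.json
```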
The input video frames should be stored in `video_images`, organized by each video's name.
Pre-computed groundings for each video, including bounding box coordinates and the corresponding text annotations, are provided in configuration files located at `video_configs/{video_name}.yaml`.
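For reference, the expected on-disk layout looks roughly like the sketch below (frame file names are illustrative; only the `rabbit_watermelon` example is shown):

```
Ground-A-Video/
├── video_images/
│   └── rabbit_watermelon/        # one sub-folder of frames per video
│       ├── 00.jpg                # frame file names are illustrative
│       ├── 01.jpg
│       └── ...
└── video_configs/
    └── rabbit_watermelon.yaml    # bounding boxes and text annotations for this video
```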
Ground-A-Video is designed to be a training-free framework. To run the inference script, use the following command:
python main.py --config video_configs/rabbit_watermelon.yaml --folder outputs/rabbit_watermelon
- `--config`: Specifies the path to the configuration file. Modify the config files under `video_configs` as needed.
- `--folder`: Designates the directory where output videos will be saved.
- `--clip_length`: Sets the number of input video frames. Default is 8.
- `--denoising_steps`: Defines the number of denoising steps. Default is 50.
- `--ddim_inv_steps`: Determines the number of steps for per-frame DDIM inversion and Null-text Optimization. Default is 20.
- `--guidance_scale`: Sets the CFG scale. Default is 12.5.
- `--flow_smooth_threshold`: Threshold for optical flow guided smoothing. Default is 0.2.
- `--controlnet_conditioning_scale`: Sets the conditioning scale for ControlNet. Default is 1.0.
- `--nti`: Whether to perform Null-text Optimization after DDIM Inversion. Default is False. (If your CUDA version is 11.4, you can set this to True. If your CUDA version is 12.2 or higher, set it to False: the code uses fp16 dtypes, and on CUDA 12.2 or higher, gradient backpropagation during NTI raises errors.)

An example invocation combining these options is sketched below.
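For example, a run that spells out several of these options explicitly might look like the following sketch (the values shown are simply the documented defaults; adjust them for your own videos and configs):

```bash
python main.py \
    --config video_configs/rabbit_watermelon.yaml \
    --folder outputs/rabbit_watermelon \
    --clip_length 8 \
    --denoising_steps 50 \
    --ddim_inv_steps 20 \
    --guidance_scale 12.5 \
    --flow_smooth_threshold 0.2 \
    --controlnet_conditioning_scale 1.0
```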
Results: side-by-side comparisons of input videos and output videos.
If you like our work, please cite our paper.
@article{jeong2023ground,
title={Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models},
author={Jeong, Hyeonho and Ye, Jong Chul},
journal={arXiv preprint arXiv:2310.01107},
year={2023}
}
- Ground-A-Video builds upon these great open-source projects:
diffusers, Stable Diffusion, GLIGEN, ControlNet, GLIP, RAFT.
Thank you for open-sourcing!