This repository contains the official PyTorch implementation of Ground-A-Video.
[ICLR 2024] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
Hyeonho Jeong,
Jong Chul Ye
Ground-A-Video is the first groundings-driven video editing framework, specifically designed for Multi-Attribute Video Editing.
Ground-A-Video is the first framework to integrate spatially-continuous and spatially-discrete conditions.
Ground-A-Video neither neglects nor confuses edits, and it preserves non-target regions.
[TL;DR] Stable Diffusion 3D + ControlNet 3D + GLIGEN 3D + Optical Flow Smoothing = Ground-A-Video
Full abstract
We introduce a novel groundings-guided video-to-video translation framework called Ground-A-Video. Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without the aforementioned shortcomings. Central to our method is the introduction of cross-frame gated attention which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
- [11/11/2023] The paper is currently under review.
- [01/15/2024] Code released!
git clone https://github.com/Ground-A-Video/Ground-A-Video.git
cd Ground-A-Video
conda create -n groundvideo python=3.8
conda activate groundvideo
pip install -r requirements.txt
Important: Ensure that you download the model weights before executing the scripts.
git lfs install
git clone https://huggingface.co/gligen/gligen-inpainting-text-box
git clone https://huggingface.co/ground-a-video/unet3d_ckpts
git clone https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth
These commands will place the pretrained GLIGEN, 3D UNet, and ControlNet weights at:
Ground-A-Video/gligen-inpainting-text-box/diffusion_pytorch_model.bin
Ground-A-Video/unet3d_ckpts/diffusion_pytorch_model.bin
Ground-A-Video/unet3d_ckpts/config.json
Ground-A-Video/control_v11f1p_sd15_depth/diffusion_pytorch_model.bin
Ground-A-Video/control_v11f1p_sd15_depth/config.json
Alternatively, you can manually download the weights through the Hugging Face web interface at the repository URLs listed above.
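Either way, you can run a quick sanity check from the repository root to confirm that the expected weight files are in place (a minimal sketch using standard shell commands):

```bash
# List the downloaded checkpoints and configs; each file should exist and be non-empty
ls -lh gligen-inpainting-text-box/diffusion_pytorch_model.bin \
       unet3d_ckpts/diffusion_pytorch_model.bin \
       unet3d_ckpts/config.json \
       control_v11f1p_sd15_depth/diffusion_pytorch_model.bin \
       control_v11f1p_sd15_depth/config.json
```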
The input video frames should be stored in `video_images`, organized by each video's name.
Pre-computed groundings for each video, including bounding box coordinates and the corresponding text annotations, are provided in configuration files located at `video_configs/{video_name}.yaml`.
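For reference, the expected on-disk layout looks roughly like the sketch below (frame file names are illustrative; only the `rabbit_watermelon` example is shown):

```
Ground-A-Video/
├── video_images/
│   └── rabbit_watermelon/        # one sub-folder of frames per video
│       ├── 00.jpg                # frame file names are illustrative
│       ├── 01.jpg
│       └── ...
└── video_configs/
    └── rabbit_watermelon.yaml    # bounding boxes and text annotations for this video
```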
Ground-A-Video is designed to be a training-free framework. To run the inference script, use the following command:
python main.py --config video_configs/rabbit_watermelon.yaml --folder outputs/rabbit_watermelon
- `--config`: Specifies the path to the configuration file. Modify the config files under `video_configs` as needed.
- `--folder`: Designates the directory where output videos will be saved.
- `--clip_length`: Sets the number of input video frames. Default is 8.
- `--denoising_steps`: Defines the number of denoising steps. Default is 50.
- `--ddim_inv_steps`: Determines the number of steps for per-frame DDIM inversion and Null-text Optimization. Default is 20.
- `--guidance_scale`: Sets the CFG scale. Default is 12.5.
- `--flow_smooth_threshold`: Threshold for optical flow guided smoothing. Default is 0.2.
- `--controlnet_conditioning_scale`: Sets the conditioning scale for ControlNet. Default is 1.0.
- `--nti`: Whether to perform Null-text Optimization after DDIM Inversion. Default is False. (If your CUDA version is 11.4, you can set this to True. If your CUDA version is 12.2 or higher, set it to False: the code uses fp16 dtypes, and on CUDA 12.2 or higher, gradient backpropagation during NTI raises errors.)

An example invocation combining these options is sketched below.
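For example, a run that spells out several of these options explicitly might look like the following sketch (the values shown are simply the documented defaults; adjust them for your own videos and configs):

```bash
python main.py \
    --config video_configs/rabbit_watermelon.yaml \
    --folder outputs/rabbit_watermelon \
    --clip_length 8 \
    --denoising_steps 50 \
    --ddim_inv_steps 20 \
    --guidance_scale 12.5 \
    --flow_smooth_threshold 0.2 \
    --controlnet_conditioning_scale 1.0
```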
Results: side-by-side comparisons of input videos and output videos.
If you like our work, please cite our paper.
@article{jeong2023ground,
title={Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models},
author={Jeong, Hyeonho and Ye, Jong Chul},
journal={arXiv preprint arXiv:2310.01107},
year={2023}
}
- Ground-A-Video builds upon these great open-source projects:
diffusers, Stable Diffusion, GLIGEN, ControlNet, GLIP, RAFT.
Thank you for open-sourcing!