
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models (ICLR 2024)

This repository contains the official PyTorch implementation of Ground-A-Video.
[ICLR 2024] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
Hyeonho Jeong, Jong Chul Ye

Project Website | arXiv

Abstract

Ground-A-Video is the first groundings-driven video editing framework, specially designed for multi-attribute video editing.
Ground-A-Video is the first framework to integrate spatially-continuous and spatially-discrete conditions.
Ground-A-Video neither neglects nor confuses the intended edits, while preserving non-target regions.
[TL;DR] Stable Diffusion 3D + ControlNet 3D + GLIGEN 3D + Optical Flow Smoothing = Ground-A-Video

Full abstract

We introduce a novel groundings-guided video-to-video translation framework called Ground-A-Video. Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without the aforementioned shortcomings. Central to our method is the introduction of cross-frame gated attention, which incorporates groundings information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency.
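To make the optical-flow-guided latent smoothing idea concrete, below is a minimal sketch, assuming inverted latents for consecutive frames and a flow field already resized to latent resolution: the previous frame's latent is warped along the flow and blended in where the flow magnitude falls below a threshold. The function names, the blending rule, and the threshold semantics are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent_prev, flow):
    """Warp the previous frame's latent with a dense flow field
    (assumed to be given at latent resolution, in pixel units). Sketch only."""
    b, c, h, w = latent_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()          # (h, w, 2), x-then-y
    new_pos = grid + flow.permute(0, 2, 3, 1)             # follow the flow
    # normalize coordinates to [-1, 1] for grid_sample
    new_pos[..., 0] = 2.0 * new_pos[..., 0] / (w - 1) - 1.0
    new_pos[..., 1] = 2.0 * new_pos[..., 1] / (h - 1) - 1.0
    return F.grid_sample(latent_prev, new_pos, align_corners=True)

def flow_guided_smoothing(latents, flows, threshold=0.2):
    """Blend each frame's inverted latent with its flow-warped predecessor
    in regions where motion is small (flow magnitude below `threshold`)."""
    smoothed = [latents[0]]
    for t in range(1, len(latents)):
        warped = warp_with_flow(smoothed[-1], flows[t - 1])
        mag = flows[t - 1].norm(dim=1, keepdim=True)      # (b, 1, h, w)
        mask = (mag < threshold).float()                  # 1 = near-static region
        smoothed.append(mask * warped + (1 - mask) * latents[t])
    return torch.stack(smoothed)

# Toy example: 8 frames of 4x64x64 latents with random flow fields.
frames = [torch.randn(1, 4, 64, 64) for _ in range(8)]
flows = [torch.randn(1, 2, 64, 64) * 0.1 for _ in range(7)]
out = flow_guided_smoothing(frames, flows, threshold=0.2)
print(out.shape)  # torch.Size([8, 1, 4, 64, 64])
```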

News

  • [11/11/2023] The paper is under review.
  • [01/15/2024] Code released!

Teaser

| Input Video | Video Groundings | Depth Map | Optical Flow | Output Video |
| --- | --- | --- | --- | --- |
| "A man is walking a dog on the road." | man, dog, road | by ZoeDepth | by RAFT-large | "Iron Man is walking a sheep on the lake." |
| "A rabbit is eating a watermelon on the table." | rabbit, watermelon, table | by ZoeDepth | by RAFT-large | "A squirrel is eating an orange on the grass, under the aurora." |

Setup

Requirements

git clone https://github.com/Ground-A-Video/Ground-A-Video.git
cd Ground-A-Video

conda create -n groundvideo python=3.8
conda activate groundvideo
pip install -r requirements.txt
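As a quick, optional sanity check (not part of the official setup), you can confirm that PyTorch was installed with CUDA support, since inference runs in fp16 on the GPU (see the --nti note below):

```python
import torch
# Prints the PyTorch version, the CUDA version it was built against, and GPU availability.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```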

Weights

Important: Ensure that you download the model weights before executing the scripts

git lfs install
git clone https://huggingface.co/gligen/gligen-inpainting-text-box
git clone https://huggingface.co/ground-a-video/unet3d_ckpts
git clone https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth

These commands will place the pretrained GLIGEN, UNet3D, and ControlNet weights at:

  • Ground-A-Video/gligen-inpainting-text-box/diffusion_pytorch_model.bin
  • Ground-A-Video/unet3d_ckpts/diffusion_pytorch_model.bin
  • Ground-A-Video/unet3d_ckpts/config.json
  • Ground-A-Video/control_v11f1p_sd15_depth/diffusion_pytorch_model.bin
  • Ground-A-Video/control_v11f1p_sd15_depth/config.json

Alternatively, you can manually download the weights through the Hugging Face web interface of the repositories listed above.
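Either way, once the checkpoints are on disk they can be loaded with standard diffusers APIs. The snippet below is a hedged illustration that assumes the local folder layout above and that diffusers is installed via requirements.txt; it is not the repository's own loading code (in particular, the 3D UNet in unet3d_ckpts uses a custom class defined in this repo).

```python
import torch
from diffusers import ControlNetModel

# Load the depth ControlNet from the locally cloned folder (or the Hub ID).
controlnet = ControlNetModel.from_pretrained(
    "./control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
print(sum(p.numel() for p in controlnet.parameters()) / 1e6, "M params")
```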

Data

The input video frames should be stored in video_images, organized by video name.
Pre-computed groundings for each video, including bounding-box coordinates and the corresponding text annotations, are provided in configuration files located at video_configs/{video_name}.yaml.
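For a quick look at what a grounding config contains, you can load one with PyYAML (assumed to be available alongside the requirements). Nothing about the schema is assumed here beyond the file being valid YAML; the keys printed depend on the actual config:

```python
import yaml  # PyYAML

with open("video_configs/rabbit_watermelon.yaml") as f:
    cfg = yaml.safe_load(f)

# Print the top-level keys and a peek at each value, whatever the schema is.
for key, value in cfg.items():
    print(f"{key}: {str(value)[:80]}")
```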

Usage

Inference

Ground-A-Video is designed to be a training-free framework. To run the inference script, use the following command:

python main.py --config video_configs/rabbit_watermelon.yaml --folder outputs/rabbit_watermelon

Arguments

  • --config: Specifies the path to the configuration file. Modify the config files under video_configs as needed
  • --folder: Designates the directory where output videos will be saved
  • --clip_length: Sets the number of input video frames. Default is 8.
  • --denoising_steps: Defines the number of denoising steps. Default is 50.
  • --ddim_inv_steps: Determines the number of steps for per-frame DDIM inversion and Null-text Optimization. Default is 20. (A sketch of a single DDIM inversion step follows this list.)
  • --guidance_scale: Sets the CFG scale. Default is 12.5.
  • --flow_smooth_threshold: Threshold for optical flow guided smoothing. Default is 0.2.
  • --controlnet_conditioning_scale: Sets the conditioning scale for ControlNet. Default is 1.0.
  • --nti: Whether to perform Null-text Optimization after DDIM Inversion. Default is False.
    (If your CUDA version is 11.4, you can set this to True. If your CUDA version is 12.2 or higher, leave it as False: the code runs in fp16, and on CUDA 12.2 or higher the gradient backpropagation required by Null-text Optimization raises errors.)
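For readers unfamiliar with the inversion that --ddim_inv_steps controls, here is a minimal sketch of the standard deterministic DDIM inversion update applied per frame. It uses toy tensors and a stand-in noise predictor; this is the textbook formulation, not this repository's exact implementation (which also incorporates groundings and, optionally, Null-text Optimization).

```python
import torch

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step (textbook form): estimate x_0 from
    the noise prediction, then re-noise to the next (noisier) timestep.
    `eps` is the model's noise prediction at x_t."""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    return alpha_bar_next.sqrt() * x0_pred + (1 - alpha_bar_next).sqrt() * eps

# Toy run over a 20-step inversion schedule with a random "noise predictor".
latent = torch.randn(1, 4, 64, 64)
alpha_bars = torch.linspace(0.9999, 0.02, 20)  # stand-in cumulative alphas (clean -> noisy)
for i in range(len(alpha_bars) - 1):
    eps = torch.randn_like(latent)             # a real run would call the UNet here
    latent = ddim_inversion_step(latent, eps, alpha_bars[i], alpha_bars[i + 1])
print(latent.shape)
```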

Results

Input Videos | Output Videos

Citation

If you like our work, please cite our paper.

@article{jeong2023ground,
  title={Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models},
  author={Jeong, Hyeonho and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2310.01107},
  year={2023}
}

Acknowledgement
