A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing, CVPR 2024
Maomao Li, Yu Li, Tianyu Yang, Yunfei Liu, Dongxu Yue, Zhihui Lin, Dong Xu
Reconstruction comparison between DDIM and STEM inversion. DDIM inversion in existing video editing methods usually exploits only 1-frame or 2-frame context to invert each frame, so we also design a more aggressive inflated DDIM inversion that uses all-frame context as reference. Here, we use the standard DDIM reconstruction procedure to compare video reconstructions: both our STEM inversion and the inflated DDIM variant can exploit context from the entire video, yet the resource-consuming latter yields inferior performance.
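To make the resource gap concrete, here is a back-of-the-envelope count of attention interactions. The frame count and latent resolution below are illustrative assumptions, not figures from the paper; only the 256 bases come from the method's default setting:

```python
# Rough cost comparison (illustrative numbers only, not measured results).
frames = 40                       # assumed clip length
tokens_per_frame = 64 * 64        # assumed latent resolution per frame (e.g. 512x512 image -> 64x64 latent)
all_tokens = frames * tokens_per_frame

# Inflated DDIM inversion: every token attends to every token of every frame.
pairwise_cost = all_tokens ** 2

# STEM inversion: every token interacts with a fixed set of K bases instead.
num_bases = 256
stem_cost = all_tokens * num_bases

print(f"all-frame attention pairs : {pairwise_cost:,}")   # ~2.7e10
print(f"token-basis interactions  : {stem_cost:,}")       # ~4.2e7
print(f"ratio                     : {pairwise_cost / stem_cost:.0f}x")  # 640x under these assumptions
```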
TL;DR: STEM inversion is an efficient video inversion method for text-guided video editing.
Full abstract:
We present a video inversion approach for zero-shot video editing, which aims to model the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply the typical 2D DDIM inversion or a naive spatial-temporal DDIM inversion before editing, which leverages a time-varying representation for each frame to derive its noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video features in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame then applies this fixed, global representation for inversion, which is friendlier to temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion achieves consistent improvements on two state-of-the-art video editing methods.
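As a rough intuition for the EM formulation described above, the E-step assigns each spatial-temporal token a soft responsibility over a small basis set, and the M-step updates each basis as the responsibility-weighted average of the tokens; every frame is then re-expressed with the same fixed, global bases. Below is a minimal PyTorch-style sketch of such a loop; it is our own illustration with assumed tensor shapes, similarity kernel, initialization, and normalization, not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def estimate_bases(z, num_bases=256, n_iters=5):
    """EM-style estimation of a compact basis set for a whole video (sketch only).

    z: (N, C) flattened spatial-temporal features of all frames
       (N = frames * height * width of the latent).
    Returns bases mu of shape (num_bases, C) and the low-rank
    re-expression of z through those bases.
    """
    N, C = z.shape
    # Initialize the bases, e.g. from randomly sampled tokens (assumed choice).
    mu = z[torch.randperm(N)[:num_bases]].clone()                 # (K, C)

    for _ in range(n_iters):
        # E-step: soft responsibility of each token for each basis.
        resp = F.softmax(z @ mu.t(), dim=-1)                      # (N, K)
        # M-step: update each basis as the responsibility-weighted
        # average of the tokens (weights normalized over tokens).
        weights = resp / (resp.sum(dim=0, keepdim=True) + 1e-6)   # (N, K)
        mu = weights.t() @ z                                      # (K, C)
        mu = F.normalize(mu, dim=-1)

    # Re-express every token with the fixed, global bases.
    z_hat = F.softmax(z @ mu.t(), dim=-1) @ mu                    # (N, C)
    return mu, z_hat
```

Under this view, every frame interacts with the same small set of bases rather than with time-varying per-frame context, which is what yields the fixed, global representation mentioned in the abstract.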
Illustration of the proposed STEM inversion method. We estimate a more compact representation (a set of bases) to represent the whole input video.
- 2023.12.11 Paper is released!
- 2024.05.01 The code based on TokenFlow editing is released!
- TODO: Release the STEM inversion code
- TODO: Release the code based on FateZero editing
Prepare the Conda environment using the following commands:
git clone https://github.com/STEM-Inv/stem-inv
cd STEM-Inv
cd TokenFlow-Edit
conda create -n stem-tf python=3.9
conda activate stem-tf
pip install -r requirements.txt
We provide demo source videos in the data folder. The corresponding config for STEM inversion and editing is in the configs folder.
Below are the instructions for performing video editing on the provided source videos.
You can run the following command to perform the inversion and editing process at once:
bash run_editing.sh
The inversion results are saved in Stem_Inv_Latents/base_256_iter_5, and the editing results are saved in STEM_TF_results.
If you are only interested in the reconstruction results of STEM inversion, please run:
bash run_inversion.sh
Note that our default setting uses 256 bases to represent the whole input video, with 5 iterations for EM algorithm convergence. You can also try other configurations by modifying the values of "num_bases" and "n_iters" in line 88 of tokenflow_utils.py.
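For intuition only, these two values correspond to the basis count and EM iteration count in the sketch given earlier; a hypothetical call reusing that sketch (not the actual tokenflow_utils.py API) might look like this:

```python
import torch

# Hypothetical usage of the estimate_bases sketch above (not the repository's API).
# 256 bases and 5 EM iterations mirror the default setting described in the note.
video_feats = torch.randn(40 * 64 * 64, 320)             # assumed (frames*H*W, channels) features
mu, recon = estimate_bases(video_feats, num_bases=256, n_iters=5)
print(mu.shape, recon.shape)                              # (256, 320) and (163840, 320)
```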
Source prompt: A man is playing tennis. Target prompt: Spider-Man is playing tennis.
@inproceedings{li2024video,
title={A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing},
author={Li, Maomao and Li, Yu and Yang, Tianyu and Liu, Yunfei and Yue, Dongxu and Lin, Zhihui and Xu, Dong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={7528--7537},
year={2024}
}
This is the official code of STEM Inversion. All copyrights of the demo images and audio belong to community users. Feel free to contact us if you would like them removed.
The code is built upon the repositories below; we thank all the contributors for open-sourcing their work.