[arXiv] [Project Page] [Dataset]
Jianhong Bai1*, Menghan Xia2†, Xintao Wang2, Ziyang Yuan3, Xiao Fu4,
Zuozhu Liu1, Haoji Hu1, Pengfei Wan2, Di Zhang2
(*Work done during an internship at KwaiVGI, Kuaishou Technology; † corresponding author)
1Zhejiang University, 2Kuaishou Technology, 3Tsinghua University, 4CUHK.
TL;DR: We propose SynCamMaster, an efficient method to lift pre-trained text-to-video models for open-domain multi-camera video generation from diverse viewpoints.
teaser_video_compressed.mp4
- [2024.12.10]: Released the project page and the SynCamVideo Dataset.
The SynCamVideo Dataset is a multi-camera synchronized video dataset rendered with Unreal Engine 5. It consists of 1,000 different scenes, each captured by 36 cameras, resulting in a total of 36,000 videos. SynCamVideo features 50 different animals as the "main subjects" and uses 20 different locations from Poly Haven as backgrounds. In each scene, 1-2 subjects are selected from the 50 animals and move along a predefined trajectory, the background is randomly chosen from the 20 locations, and the 36 cameras simultaneously record the subjects' movements.
The cameras in each scene are placed on a hemispherical surface at a distance of 3.5m - 9m from the scene center. To ensure the rendered videos have minimal domain shift from real-world videos, we constrain the elevation of each camera to 0° - 45° and the azimuth to 0° - 360°. Each camera is randomly sampled within these constraints, rather than reusing the same set of camera positions across scenes. The figure below shows an example, where the red star indicates the center point of the scene (slightly above the ground), and the videos are rendered from the synchronized cameras to capture the movements of the main subjects (a goat and a bear in this case).
The SynCamVideo Dataset can be used to train multi-camera synchronized video generation models, inspiring applications in areas such as filmmaking and multi-view data generation for downstream tasks.
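For intuition, the camera placement described above can be mimicked with a few lines of NumPy. This is only an illustrative sketch, not the rendering pipeline's code; the `sample_camera` helper, the z-up convention, and the scene-center value are assumptions made for illustration.

import numpy as np

def sample_camera(center, dist_range=(3.5, 9.0), elev_range=(0.0, 45.0), azim_range=(0.0, 360.0)):
    """Sample one camera position on a hemisphere around the scene center (illustrative only)."""
    d = np.random.uniform(*dist_range)                 # distance to the scene center in meters
    elev = np.deg2rad(np.random.uniform(*elev_range))  # elevation above the ground plane
    azim = np.deg2rad(np.random.uniform(*azim_range))  # azimuth around the vertical axis
    offset = d * np.array([np.cos(elev) * np.cos(azim),
                           np.cos(elev) * np.sin(azim),
                           np.sin(elev)])               # z-up convention assumed
    return center + offset

# e.g. 36 synchronized cameras for one scene, each later oriented to look at the center
center = np.array([0.0, 0.0, 0.5])  # hypothetical scene center slightly above the ground
camera_positions = [sample_camera(center) for _ in range(36)]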
SynCamVideo
├── train
│   ├── videos                    # training videos
│   │   ├── scene1                # one scene
│   │   │   ├── xxx.mp4           # synchronized 100-frame videos at 480x720 resolution
│   │   │   └── ...
│   │   ├── ...
│   │   └── scene1000
│   │       ├── xxx.mp4
│   │       └── ...
│   └── cameras                   # training cameras
│       ├── scene1                # one scene
│       │   └── xxx.json          # extrinsic parameters corresponding to the videos
│       ├── ...
│       └── scene1000
│           └── xxx.json
└── val
    └── cameras                   # validation cameras
        ├── Hemi36_4m_0           # distance=4m, elevation=0°
        │   └── Hemi36_4m_0.json  # 36 cameras: distance=4m, elevation=0°, azimuth=i * 10°
        ├── ...
        └── Hemi36_7m_45
            └── Hemi36_7m_45.json
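As a usage sketch (not part of the release), the training split can be traversed by pairing each video with the camera file in the corresponding scene folder. This assumes the camera JSON shares the video's filename, as the matching xxx.mp4 / xxx.json entries in the tree suggest; the exact JSON schema is left to the dataset files themselves.

from pathlib import Path
import json

root = Path("SynCamVideo")

# Pair each training video with its camera file, following the layout above.
for scene_dir in sorted((root / "train" / "videos").glob("scene*")):
    cam_dir = root / "train" / "cameras" / scene_dir.name
    for video_path in sorted(scene_dir.glob("*.mp4")):
        cam_path = cam_dir / (video_path.stem + ".json")
        extrinsics = json.loads(cam_path.read_text())  # extrinsic parameters for this video
        # ... pass (video_path, extrinsics) to your data loader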
- Camera Visualization
python vis_cam.py --pose_file_path ./val/cameras/Hemi36_4m_0/Hemi36_4m_0_transforms.json --num_cameras 36
The visualization script is modified from CameraCtrl; thanks to the authors for their inspiring work.
Note: The model we used in our paper is an internal research-purpose T2V model, not CogVideoX. Due to company policy restrictions, we are unable to open-source the model used in the paper. We therefore migrated SynCamMaster to CogVideoX to validate the effectiveness of our method. Because of the differences in the base T2V model, you may not be able to achieve the same results as demonstrated in the demo.
Our environment setup is identical to CogVideoX. You can refer to their configuration to complete your environment setup.
conda create -n syncammaster python=3.10
conda activate syncammaster
pip install -r requirements.txt
TODO: upload the pre-trained checkpoints.
The following code showcases the core components of SynCamMaster, namely the camera encoder, the multi-view attention layer, and a linear projector within each transformer block, as illustrated in Fig. 2 of our paper.
from einops import rearrange, repeat

# 1. add the camera pose feature to each view's hidden states
pose = rearrange(pose, "b v d -> (b v) 1 d")
pose_embedding = self.cam_encoder(pose)
norm_hidden_states = norm_hidden_states + pose_embedding

# 2. multi-view attention: tokens of the same frame attend across views
norm_hidden_states = rearrange(norm_hidden_states, "(b v) (f s) d -> (b f) (v s) d", f=frame_num, v=view_num)
norm_encoder_hidden_states = rearrange(norm_encoder_hidden_states, "(b v) n d -> b (v n) d", v=view_num)
norm_encoder_hidden_states = repeat(norm_encoder_hidden_states, "b n d -> (b f) n d", f=frame_num)
attn_hidden_states, _ = self.attn_syncam(
    hidden_states=norm_hidden_states,
    encoder_hidden_states=norm_encoder_hidden_states,
    image_rotary_emb=image_rotary_emb_view,
)

# 3. project back with a residual connection
attn_hidden_states = self.projector(attn_hidden_states)
attn_hidden_states = rearrange(attn_hidden_states, "(b f) (v s) d -> (b v) (f s) d", f=frame_num, v=view_num)
hidden_states = hidden_states + gate_msa * attn_hidden_states
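The snippet assumes three extra modules on each transformer block: self.cam_encoder, self.attn_syncam, and self.projector. Below is a minimal sketch of how they might be instantiated; the layer widths, the pose dimensionality (a flattened 3x4 extrinsic matrix), and the zero initialization of the projector are assumptions for illustration, not the released configuration.

import torch.nn as nn

hidden_dim, pose_dim = 3072, 12  # example sizes; pose assumed to be a flattened 3x4 extrinsic matrix

# camera encoder: maps each view's pose to an embedding added onto the hidden states
cam_encoder = nn.Sequential(
    nn.Linear(pose_dim, hidden_dim),
    nn.SiLU(),
    nn.Linear(hidden_dim, hidden_dim),
)

# projector after the multi-view attention, zero-initialized so that the block
# initially behaves like the base T2V model before fine-tuning
projector = nn.Linear(hidden_dim, hidden_dim)
nn.init.zeros_(projector.weight)
nn.init.zeros_(projector.bias)

# attn_syncam would be an attention layer mirroring the block's spatial attention
# (e.g. initialized from its pre-trained weights) so multi-view attention starts from a good prior.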
python syncammaster_inference.py --model_path THUDM/CogVideoX-2b
Feel free to explore these outstanding related works, including but not limited to:
GCD: synthesize large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
CVD: multi-view video generation with multiple camera trajectories.
SV4D: multi-view consistent dynamic 3D content generation.
Additionally, check out our "MasterFamily" projects:
3DTrajMaster: control multiple entity motions in 3D space (6DoF) for text-to-video generation.
StyleMaster: enable artistic video generation and translation with a reference style image.
We thank Jinwen Cao, Yisong Guo, Haowen Ji, Jichao Wang, and Yi Wang from Kuaishou Technology for their invaluable help in constructing the SynCamVideo-Dataset. We thank Guanjun Wu and Jiangnan Ye for their help on running 4DGS.
Please leave us a star ⭐ and cite our paper if you find our work helpful.
@misc{bai2024syncammaster,
title={SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints},
author={Jianhong Bai and Menghan Xia and Xintao Wang and Ziyang Yuan and Xiao Fu and Zuozhu Liu and Haoji Hu and Pengfei Wan and Di Zhang},
year={2024},
eprint={2412.07760},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.07760},
}