
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Version | CC BY-NC-SA 4.0 License | Hugging Face Model

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Jiahao Lu*, Tianyu Huang*, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu
arXiv, 2024

Align3R estimates temporally consistent video depth, dynamic point clouds, and camera poses from monocular videos. Watch the video

🚀 Quick Start

🛠️ Installation

1. Clone this repo:
git clone git@github.com:jiah-cloud/Align3R.git
2. Install dependencies:
conda create -n align3r python=3.11 cmake=3.14.0
conda activate align3r 
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia  # use the correct version of cuda for your system
pip install -r requirements.txt
# Optional: you can also install additional packages to:
# - add support for HEIC images
# - add pyrender, used to render depthmap in some datasets preprocessing
# - add required packages for visloc.py
pip install -r requirements_optional.txt
3. Compile the CUDA kernels for RoPE (as in CroCo v2):
cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
4. Install the monocular depth estimation models Depth Pro and Depth Anything V2:
# Depth Pro
cd third_party/ml-depth-pro
pip install -e .
source get_pretrained_models.sh
# Depth Anything V2
pip install transformers==4.41.2
5. Download the corresponding model weights:

🔥🔥🔥 We have uploaded our model weights to Hugging Face; you can now download them via Align3R (Depth Pro) and Align3R (Depth Anything V2).

# DUSt3R
wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth

# Align3R 
# If you cannot download the weights with the following commands, please download them manually.
gdown --fuzzy https://drive.google.com/file/d/1-qhRtgH7rcJMYZ5sWRdkrc2_9wsR1BBG/view?usp=sharing
gdown --fuzzy https://drive.google.com/file/d/1PPmpbASVbFdjXnD3iea-MRIHGmKsS8Vh/view?usp=sharing

# Depth Pro
cd third_party/ml-depth-pro
source get_pretrained_models.sh

# Raft
gdown --fuzzy https://drive.google.com/file/d/1KJxQ7KPuGHlSftsBCV1h2aYpeqQv3OI-/view?usp=drive_link -O models/
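
If the gdown links are blocked in your region, the checkpoints can also be fetched programmatically from the Hugging Face Hub. The snippet below is a minimal Python sketch using huggingface_hub; the repo IDs are placeholders, so substitute the identifiers shown on the Align3R (Depth Pro) and Align3R (Depth Anything V2) model pages.

# Hedged sketch: download Align3R checkpoints from the Hugging Face Hub.
# The repo_id values are placeholders, not verified identifiers -- copy the
# real ones from the Align3R model pages.
from huggingface_hub import hf_hub_download

checkpoints = [
    ("<align3r-depth-pro-repo>", "align3r_depthpro.pth"),
    ("<align3r-depth-anything-v2-repo>", "align3r_depthanything.pth"),
]

for repo_id, filename in checkpoints:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    print(f"Downloaded {filename} to {local_path}")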

🔧 Dataset Preparation

To train Align3R, first download the training datasets: PointOdyssey, Spring, TartanAir, Virtual KITTI 2, and SceneFlow (FlyingThings3D, Driving, Monkaa).

Then use the following script to preprocess the training datasets:

bash datasets_preprocess/preprocess_trainingset.sh

After preprocessing, our folder structure is as follows:

├── data
    ├── PointOdyssey_proc
    │   ├── train
    │   └── val
    ├── spring_proc
    │   └── train
    ├── Tartanair_proc
    ├── vkitti_2.0.3_proc
    └── SceneFlow
        ├── FlyingThings3D_proc
        │   ├── TRAIN
        │   │   ├── A
        │   │   ├── B
        │   │   └── C
        │   └── TEST
        │       ├── A
        │       ├── B
        │       └── C
        ├── Driving_proc
        │   ├── 35mm_focallength
        │   └── 15mm_focallength
        └── Monkaa_proc
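
After preprocessing, it can save time to verify that the expected directories exist before launching training. The following optional Python sketch checks the layout shown above; the data root "data" is assumed to sit in the repository root, so adjust the path to your setup.

# Optional sanity check for the preprocessed training-data layout above.
# The root path is an assumption; change it if your data lives elsewhere.
from pathlib import Path

root = Path("data")
expected = [
    "PointOdyssey_proc/train",
    "PointOdyssey_proc/val",
    "spring_proc/train",
    "Tartanair_proc",
    "vkitti_2.0.3_proc",
    "SceneFlow/FlyingThings3D_proc/TRAIN",
    "SceneFlow/FlyingThings3D_proc/TEST",
    "SceneFlow/Driving_proc",
    "SceneFlow/Monkaa_proc",
]

for rel in expected:
    status = "ok" if (root / rel).is_dir() else "MISSING"
    print(f"{status:7s} {root / rel}")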

To evaluate, download the following datasets: Bonn, DAVIS, MPI Sintel, and TUM dynamics.

For Bonn and TUM dynamics, use the following script to preprocess the data:

bash datasets_preprocess/preprocess_testset.sh

Our folder structure is as follows:

├── data
    ├── bonn
    │   └── rgbd_bonn_dataset
    ├── davis
    │   └── DAVIS
    │       ├── JPEGImages
    │       │   ├── 480P
    │       │   └── 1080P
    │       ├── Annotations
    │       │   ├── 480P
    │       │   └── 1080P
    │       └── ImageSets
    │           ├── 480P
    │           └── 1080P
    ├── MPI-Sintel
    │   ├── MPI-Sintel-training_images
    │   │   └── training
    │   │       └── final
    │   └── MPI-Sintel-depth-training
    │       └── training
    │           ├── camdata_left
    │           └── depth
    └── tum

To generate monocular depth maps, you should use the following script:

cd third_party/ml-depth-pro
bash infer.sh
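
infer.sh wraps Depth Pro's batch inference over the evaluation images. If you only want a depth map for a single image from Python, the sketch below follows the API described in the ml-depth-pro repository; treat the exact function names as assumptions and check them against your installed version.

# Hedged sketch of single-image inference with Depth Pro, following the API
# documented in ml-depth-pro; verify the names against your installation.
import depth_pro

# Create the model and its preprocessing transform (uses the checkpoint
# fetched by get_pretrained_models.sh).
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the focal length in pixels if EXIF provides it.
image, _, f_px = depth_pro.load_rgb("example.jpg")
prediction = model.infer(transform(image), f_px=f_px)

depth = prediction["depth"]              # metric depth in meters
focal_px = prediction["focallength_px"]  # estimated focal length
print(depth.shape, float(focal_px))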

🌟 Training

Please download the pretrained DUSt3R weights before training.

bash train.sh

🎇 Demo

You can run the demo on any video. The input path can be either an mp4 video or an image folder.

bash demo.sh
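
The demo expects either an mp4 file or a folder of frames. If your footage is in another format, or you want to subsample frames before running the demo, a small OpenCV sketch like the one below (paths and frame stride are arbitrary choices) turns a video into an image folder.

# Minimal sketch: dump video frames into an image folder for the demo.
# The paths and the frame stride are illustrative; adjust as needed.
import os
import cv2

video_path = "input.mp4"
out_dir = "input_frames"
stride = 2  # keep every 2nd frame

os.makedirs(out_dir, exist_ok=True)
cap = cv2.VideoCapture(video_path)
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % stride == 0:
        cv2.imwrite(os.path.join(out_dir, f"{saved:05d}.jpg"), frame)
        saved += 1
    idx += 1
cap.release()
print(f"Wrote {saved} frames to {out_dir}")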

🎇 Evaluation

Video Depth

bash depth_test.sh

Please set --dust3r_dynamic_model_path, --output_postfix, --dataset_name, and --depth_prior_name accordingly.

Sintel
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthpro --dataset_name=sintel --eval --output_postfix="results/sintel_depth_ours_depthpro" 

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthanything --dataset_name=sintel --eval --output_postfix="results/sintel_depth_ours_depthanything" 

PointOdyssey
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthpro --dataset_name=PointOdyssey --eval --output_postfix="results/PointOdyssey_depth_ours_depthpro" 

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthanything --dataset_name=PointOdyssey --eval --output_postfix="results/PointOdyssey_depth_ours_depthanything" 

FlyingThings3D
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthpro --dataset_name=FlyingThings3D --eval --output_postfix="results/FlyingThings3D_depth_ours_depthpro" 

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthanything --dataset_name=FlyingThings3D --eval --output_postfix="results/FlyingThings3D_depth_ours_depthanything" 

Bonn
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthpro --dataset_name=bonn --eval --output_postfix="results/Bonn_depth_ours_depthpro" 

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthanything --dataset_name=bonn --eval --output_postfix="results/Bonn_depth_ours_depthanything" 

TUM dynamics
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthpro --dataset_name=tum --eval --output_postfix="results/tum_depth_ours_depthpro" 

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/depth_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --align_with_lad  --depth_max=70 --depth_prior_name=depthanything --dataset_name=tum --eval --output_postfix="results/tum_depth_ours_depthanything"
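
The --align_with_lad flag indicates that predicted depth is aligned to the ground truth before the metrics are computed (the name suggests a least-absolute-deviation fit). As a rough illustration of the evaluation idea only, and not the exact procedure in tool/depth_test.py, the sketch below uses an ordinary least-squares scale-and-shift alignment and reports the common AbsRel and delta<1.25 metrics.

# Illustrative only: scale/shift alignment plus two standard depth metrics.
# This substitutes least squares for the LAD fit and is NOT depth_test.py.
import numpy as np

def align_scale_shift(pred, gt, mask):
    # Solve min_{s,t} || s*pred + t - gt ||^2 over valid pixels.
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

def depth_metrics(pred, gt, max_depth=70.0):
    mask = (gt > 0) & (gt < max_depth)
    aligned = np.clip(align_scale_shift(pred, gt, mask), 1e-6, None)
    p, g = aligned[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return abs_rel, delta1

# Synthetic example just to show the call signature.
gt = np.random.uniform(1.0, 60.0, size=(240, 320))
pred = 0.5 * gt + 0.1 * np.random.randn(240, 320)
print(depth_metrics(pred, gt))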

Camera Pose

We find that the flow loss proposed in MonST3R is crucial for pose estimation, so we have incorporated it into our implementation. We sincerely thank the authors of MonST3R for sharing the code for their outstanding work.

bash pose_test.sh

Please set --dust3r_dynamic_model_path, --output_postfix, --dataset_name, and --depth_prior_name accordingly.

Sintel
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --output_postfix="results/sintel_pose_ours_depthpro" --dataset_name=sintel --depth_prior_name=depthpro --start_frame=0 --interval_frame=3000 --mode=eval_pose --scene_graph_type=swinstride-5-noncyclic

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --output_postfix="results/sintel_pose_ours_depthanything" --dataset_name=sintel --depth_prior_name=depthanything --start_frame=0 --interval_frame=3000 --mode=eval_pose --scene_graph_type=swin-5-noncyclic

Bonn
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --output_postfix="results/bonn_pose_ours_depthpro" --dataset_name=bonn --depth_prior_name=depthpro --start_frame=0 --interval_frame=30 --mode=eval_pose --scene_graph_type=swin-5-noncyclic

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --output_postfix="results/bonn_pose_ours_depthanything" --dataset_name=bonn --depth_prior_name=depthanything --start_frame=0 --interval_frame=30 --mode=eval_pose --scene_graph_type=swin-5-noncyclic

TUM dynamics
# Depth Pro
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthpro.pth" --output_postfix="results/tum_pose_ours_depthpro" --dataset_name=tum --depth_prior_name=depthpro --start_frame=0 --interval_frame=30 --mode=eval_pose --scene_graph_type=swin-5-noncyclic

# Depth Anything V2
CUDA_VISIBLE_DEVICES='0' python tool/pose_test.py --dust3r_dynamic_model_path="align3r_depthanything.pth" --output_postfix="results/tum_pose_ours_depthanything" --dataset_name=tum --depth_prior_name=depthanything --start_frame=0 --interval_frame=30 --mode=eval_pose --scene_graph_type=swin-5-noncyclic
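
The --scene_graph_type values above (swin-5, swinstride-5, noncyclic) control which frame pairs are passed to the network. Purely to illustrate the idea, and not the repository's actual implementation, the sketch below enumerates non-cyclic sliding-window pairs for a given window size.

# Conceptual sketch of non-cyclic sliding-window ("swin") frame pairing;
# it mirrors the idea behind --scene_graph_type, not the actual code.
def swin_pairs(num_frames, window=5):
    pairs = []
    for i in range(num_frames):
        for j in range(i + 1, min(i + window + 1, num_frames)):
            pairs.append((i, j))
    return pairs

print(swin_pairs(8, window=3))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (1, 4), ..., (6, 7)]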

🎥 Visualization

Please use viser to visualize the point cloud results; the visualization code can be acquired from MonST3R. Thanks for their excellent work!

python viser/visualizer_monst3r.py --data path/dataset/video --init_conf --fg_conf_thre 1.0  --no_mask
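
If you only need a quick look at a point cloud without the full MonST3R visualizer, a bare-bones viser script looks roughly like the sketch below. The exact method location differs across viser releases (newer versions expose it under server.scene), so treat this as a starting point rather than a drop-in tool.

# Rough sketch: serve a random point cloud with viser. The MonST3R-based
# visualizer above does far more; this only shows the basic viser calls.
import time
import numpy as np
import viser

server = viser.ViserServer()  # starts a local web viewer
scene = getattr(server, "scene", server)  # handle both old and new viser APIs

points = np.random.uniform(-1.0, 1.0, size=(10_000, 3)).astype(np.float32)
colors = np.random.randint(0, 255, size=(10_000, 3)).astype(np.uint8)
scene.add_point_cloud(name="/cloud", points=points, colors=colors, point_size=0.01)

while True:  # keep the server alive so the browser can connect
    time.sleep(1.0)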

📜 Citation

If you find our work useful, please cite:

@article{lu2024align3r,
  title={Align3R: Aligned Monocular Depth Estimation for Dynamic Videos},
  author={Lu, Jiahao and Huang, Tianyu and Li, Peng and Dou, Zhiyang and Lin, Cheng and Cui, Zhiming and Dong, Zhen and Yeung, Sai-Kit and Wang, Wenping and Liu, Yuan},
  journal={arXiv preprint arXiv:2412.03079},
  year={2024}
}

🤝 Acknowledgements

Our code is based on DUSt3R, MonST3R, Depth Pro, Depth Anything V2, and ControlNet. Our visualization code is acquired from MonST3R. We thank the authors for their excellent work!
