This is the reference PyTorch implementation for training and testing MVS depth estimation models using the method described in
SimpleRecon: 3D Reconstruction Without 3D Convolutions
Mohamed Sayed, John Gibson, Jamie Watson, Victor Adrian Prisacariu, Michael Firman, and Clément Godard
Paper, ECCV 2022 (arXiv pdf), Supplemental Material, Project Page, Video
This code is for non-commercial use; please see the license file for terms. If you do find any part of this codebase helpful, please cite our paper using the BibTex below and link this repo. Thanks!
- Overview
- Setup
- Models
- Speed
- TODOs:
- Running out of the box!
- ScanNetv2 Dataset
- Frame Tuples
- Testing and Evaluation
- Point Cloud Fusion
- Mesh Metrics
- Training
- Other training and testing options
- Visualization
- Notation for Transformation Matrices
- World Coordinate System
- Bug Fixes
- Acknowledgements
- BibTeX
- License
SimpleRecon takes as input posed RGB images, and outputs a depth map for a target image.
Assuming a fresh Anaconda distribution, you can install dependencies with:
conda env create -f simplerecon_env.yml
We ran our experiments with PyTorch 1.10, CUDA 11.3, Python 3.9.7 and Debian GNU/Linux 10.
Download a pretrained model into the weights/ folder.
We provide the following models:
--config | Model | Abs Diff↓ | Sq Rel↓ | delta < 1.05↑ | Chamfer↓ | F-Score↑ |
---|---|---|---|---|---|---|
hero_model.yaml | Metadata + Resnet Matching | 0.0885 | 0.0125 | 73.16 | 5.81 | 0.671 |
dot_product_model.yaml | Dot Product + Resnet Matching | 0.0941 | 0.0139 | 70.48 | 6.29 | 0.642 |
hero_model is the one we use in the paper as Ours.
--config | Model | Inference Speed (--batch_size 1) | Inference GPU memory | Approximate training time |
---|---|---|---|---|
hero_model | Hero, Metadata + Resnet | 130ms / 70ms (speed optimized) | 2.6GB / 5.7GB (speed optimized) | 36 hours |
dot_product_model | Dot Product + Resnet | 80ms | 2.6GB | 36 hours |
With larger batches speed increases considerably. With batch size 8 on the non-speed optimized model, the latency drops to ~40ms.
- Simple scan for folks to quickly try the code, instead of downloading the ScanNetv2 test scenes. DONE
- ScanNetv2 extraction. DONE
- FPN model weights.
- Tutorial on how to use Scaniverse data. At present there is no publicly available way of exporting scans from Scaniverse. You'll have to use ios-logger; NeuralRecon have a good tutorial on this, and a dataloader that accepts the processed format is at datasets/arkit_dataset.py. UPDATE: There is now a quick readme, data_scripts/IOS_LOGGER_ARKIT_README.md, on how to process and run inference on an ios-logger scan using the script at data_scripts/ios_logger_preprocessing.py.
We've now included two scans for people to try out immediately with the code. You can download these scans from here.
Steps:
- Download weights for the hero_model into the weights directory.
- Download the scans and unzip them to a directory of your choosing.
- Modify the value for the option dataset_path in configs/data/vdr_dense.yaml to the base path of the unzipped vdr folder.
- You should be able to run it! Something like this will work:
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/vdr_dense.yaml \
--num_workers 8 \
--batch_size 2 \
--fast_cost_volume \
--run_fusion \
--depth_fuser open3d \
--fuse_color \
--dump_depth_visualization;
This will output meshes, quick depth viz, and scores when benchmarked against LiDAR depth under OUTPUT_PATH.
This command uses vdr_dense.yaml, which will generate depths for every frame and fuse them into a mesh. In the paper we report scores with fused keyframes instead, and you can run those using vdr_default.yaml. You can also use dense_offline tuples by instead using vdr_dense_offline.yaml.
See the section below on testing and evaluation. Make sure to use the correct config flags for datasets.
Please follow the instructions here to download the dataset. This dataset is quite big (>2TB), so make sure you have enough space, especially for extracting files.
Once downloaded, use this script to export raw sensor data to images and depth files.
We've written a quick tutorial and included modified scripts to help you with downloading and extracting ScanNetv2. You can find them at data_scripts/scannet_wrangling_scripts/.
You should change the dataset_path config argument for ScanNetv2 data configs at configs/data/ to match where your dataset is.
The codebase expects ScanNetv2 to be in the following format:
dataset_path
    scans_test (test scans)
        scene0707
            scene0707_00_vh_clean_2.ply (gt mesh)
            sensor_data
                frame-000261.pose.txt
                frame-000261.color.jpg
                frame-000261.color.512.png (optional, image at 512x384)
                frame-000261.color.640.png (optional, image at 640x480)
                frame-000261.depth.png (full res depth, stored scale *1000)
                frame-000261.depth.256.png (optional, depth at 256x192 also scaled)
            scene0707.txt (scan metadata and intrinsics)
        ...
    scans (val and train scans)
        scene0000_00
            (see above)
        scene0000_01
        ....
In this example scene0707.txt should contain the scan's metadata and intrinsics:
colorHeight = 968
colorToDepthExtrinsics = 0.999263 -0.010031 0.037048 ........
colorWidth = 1296
depthHeight = 480
depthWidth = 640
fx_color = 1170.187988
fx_depth = 570.924255
fy_color = 1170.187988
fy_depth = 570.924316
mx_color = 647.750000
mx_depth = 319.500000
my_color = 483.750000
my_depth = 239.500000
numColorFrames = 784
numDepthFrames = 784
numIMUmeasurements = 1632
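As a rough illustration of how such a metadata file could be read, here is a minimal sketch that parses the `key = value` lines and builds a 3x3 pinhole intrinsics matrix for the color camera. The function names are hypothetical and are not part of this codebase, and treating `mx_color`/`my_color` as the principal point is an assumption based on the usual ScanNet convention.

```python
def parse_scan_metadata(text):
    """Parse 'key = value' lines into a dict of strings."""
    meta = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        meta[key.strip()] = value.strip()
    return meta


def color_intrinsics(meta):
    """Build a 3x3 pinhole matrix from the color camera fields.

    Assumption: mx_color / my_color are the principal point in pixels.
    """
    fx, fy = float(meta["fx_color"]), float(meta["fy_color"])
    cx, cy = float(meta["mx_color"]), float(meta["my_color"])
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


# A shortened copy of the example metadata above.
example = """colorHeight = 968
colorWidth = 1296
fx_color = 1170.187988
fy_color = 1170.187988
mx_color = 647.750000
my_color = 483.750000
"""

meta = parse_scan_metadata(example)
K = color_intrinsics(meta)
print(K)
```

The dataset classes in this repo do their own parsing; this sketch is only meant to make the file format concrete.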
frame-000261.pose.txt should contain pose in the form:
-0.384739 0.271466 -0.882203 4.98152
0.921157 0.0521417 -0.385682 1.46821
-0.0587002 -0.961035 -0.270124 1.51837
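A pose file like this can be read into a 4x4 homogeneous matrix with a few lines of Python. The sketch below is an illustration only: the function names are hypothetical, the bottom row `0 0 0 1` is appended when only three rows are present (an assumption about how you may want to handle that case), and non-finite entries are treated as an invalid pose, which is how lost tracking commonly shows up in ScanNet pose files.

```python
import math


def parse_pose(text):
    """Parse whitespace-separated rows into a 4x4 list-of-lists matrix."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) == 3:
        # Assumption: a 3x4 [R | t] block is completed with the homogeneous row.
        rows.append([0.0, 0.0, 0.0, 1.0])
    if len(rows) != 4 or any(len(r) != 4 for r in rows):
        raise ValueError("expected a 3x4 or 4x4 pose matrix")
    return rows


def is_valid_pose(pose):
    """A pose is usable only if every entry is a finite number."""
    return all(math.isfinite(v) for row in pose for v in row)


good = parse_pose("""-0.384739 0.271466 -0.882203 4.98152
0.921157 0.0521417 -0.385682 1.46821
-0.0587002 -0.961035 -0.270124 1.51837
""")
lost = parse_pose("-inf -inf -inf -inf\n" * 4)

print(is_valid_pose(good), is_valid_pose(lost))
```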
frame-000261.color.512.png and frame-000261.color.640.png are precached resized versions of the original image to save load and compute time during training and testing. frame-000261.depth.256.png is also a precached resized version of the depth map.
All resized precached versions of depth and images are nice to have but not required. If they don't exist, the full resolution versions will be loaded, and downsampled on the fly.
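The fallback described above can be sketched as follows. This is not the loader used in this repo; `pick_color_image` is a hypothetical helper that returns the precached file when it exists and otherwise the full resolution image together with a flag saying a resize is still needed.

```python
import tempfile
from pathlib import Path


def pick_color_image(sensor_dir, frame_id, width):
    """Return (path, needs_resize) for the requested image width."""
    sensor_dir = Path(sensor_dir)
    cached = sensor_dir / f"frame-{frame_id}.color.{width}.png"
    if cached.exists():
        return cached, False
    # No precached copy: load the full resolution image and downsample later.
    return sensor_dir / f"frame-{frame_id}.color.jpg", True


with tempfile.TemporaryDirectory() as tmp:
    sensor = Path(tmp)
    # Only the full resolution image and the 512 wide cache exist here.
    (sensor / "frame-000261.color.jpg").touch()
    (sensor / "frame-000261.color.512.png").touch()
    path_512, resize_512 = pick_color_image(sensor, "000261", 512)
    path_640, resize_640 = pick_color_image(sensor, "000261", 640)

print(path_512.name, resize_512)
print(path_640.name, resize_640)
```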
By default, we estimate a depth map for each keyframe in a scan. We use DeepVideoMVS's heuristic for keyframe separation and construct tuples to match. We use the depth maps at these keyframes for depth fusion. For each keyframe, we associate a list of source frames that will be used to build the cost volume. We also use dense tuples, where we predict a depth map for each frame in the data, and not just at specific keyframes; these are mostly used for visualization.
We generate and export a list of tuples across all scans that act as the dataset's elements. We've precomputed these lists and they are available at data_splits under each dataset's split. For ScanNet's test scans they are at data_splits/ScanNetv2/standard_split. Our core depth numbers are computed using data_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs.txt.
Here's a quick taxonomy of the type of tuples for test:
- default: a tuple for every keyframe following DeepVideoMVS where all source frames are in the past. Used for all depth and mesh evaluation unless stated otherwise. For ScanNet use data_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs.txt.
- dense_offline: a tuple for every frame in the scan where source frames can be both in the past and future relative to the current frame. These are useful when a scene is captured offline, and you want the best accuracy possible. With online tuples, the cost volume will contain empty regions as the camera moves away and all source frames lag behind; however with offline tuples, the cost volume is full on both ends, leading to a better scale (and metric) estimate.
- dense: an online tuple (like default) for every frame in the scan where all source frames are in the past. For ScanNet this would be data_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs_dense.txt.
- offline: an offline tuple for every keyframe in the scan.
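To make the online/offline distinction concrete, here is a deliberately simplified sketch that picks source frames by frame index alone. The actual scripts use a modified DeepVideoMVS keyframe buffer driven by camera poses, so `source_frames` below is a hypothetical stand-in, not the selection logic of this repo.

```python
def source_frames(target, num_frames, num_sources, offline):
    """Pick source frame indices for a target frame.

    online  (offline=False): only frames before the target.
    offline (offline=True):  nearest frames on either side of the target.
    """
    if offline:
        candidates = [i for i in range(num_frames) if i != target]
        # Nearest in time first; ties go to the earlier frame.
        candidates.sort(key=lambda i: (abs(i - target), i))
        return sorted(candidates[:num_sources])
    return list(range(max(0, target - num_sources), target))


online = source_frames(10, 20, 4, offline=False)
both_sides = source_frames(10, 20, 4, offline=True)
print(online)      # every source frame lags behind the target
print(both_sides)  # sources bracket the target in time
```

With the online variant all sources trail the target, which is the situation the text describes as leaving part of the cost volume empty when the camera moves forward.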
For the train and validation sets, we follow the same tuple augmentation strategy as in DeepVideoMVS and use the same core generation script.
If you'd like to generate these tuples yourself, you can use the scripts at data_scripts/generate_train_tuples.py for train tuples and data_scripts/generate_test_tuples.py for test tuples. These follow the same config format as test.py and will use whatever dataset class you build to read pose information.
Example for test:
# default tuples
python ./data_scripts/generate_test_tuples.py \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 16
# dense tuples
python ./data_scripts/generate_test_tuples.py \
--data_config configs/data/scannet_dense_test.yaml \
--num_workers 16
Examples for train:
# train
python ./data_scripts/generate_train_tuples.py \
--data_config configs/data/scannet_default_train.yaml \
--num_workers 16
# val
python ./data_scripts/generate_val_tuples.py \
--data_config configs/data/scannet_default_val.yaml \
--num_workers 16
These scripts will first check each frame in the dataset to make sure it has an existing RGB frame, an existing depth frame (if appropriate for the dataset), and an existing and valid pose file. They will save these valid_frames in a text file in each scan's folder, but if the directory is read-only, they will skip saving a valid_frames file and generate tuples anyway.
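The check described above could look roughly like the sketch below. It is an assumption-laden illustration: the helper names and the `valid_frames.txt` filename are hypothetical, file names follow the ScanNetv2 layout shown earlier, and a pose counts as valid when all of its entries are finite.

```python
import math
import tempfile
from pathlib import Path


def pose_is_valid(pose_path):
    """True if the pose file exists and holds only finite numbers."""
    try:
        values = [float(v) for v in Path(pose_path).read_text().split()]
    except (OSError, ValueError):
        return False
    return len(values) > 0 and all(math.isfinite(v) for v in values)


def find_valid_frames(sensor_dir, frame_ids, need_depth=True):
    """Keep frames that have RGB, depth (if required), and a valid pose."""
    sensor_dir = Path(sensor_dir)
    valid = []
    for frame_id in frame_ids:
        stem = f"frame-{frame_id}"
        has_rgb = (sensor_dir / f"{stem}.color.jpg").exists()
        has_depth = not need_depth or (sensor_dir / f"{stem}.depth.png").exists()
        if has_rgb and has_depth and pose_is_valid(sensor_dir / f"{stem}.pose.txt"):
            valid.append(frame_id)
    return valid


def save_valid_frames(scan_dir, valid, filename="valid_frames.txt"):
    """Try to cache the list; carry on if the directory is read-only."""
    try:
        (Path(scan_dir) / filename).write_text("\n".join(valid))
        return True
    except OSError:
        return False


identity = "1 0 0 0\n0 1 0 0\n0 0 1 0\n0 0 0 1"
with tempfile.TemporaryDirectory() as tmp:
    sensor = Path(tmp)
    # 000000 is complete, 000001 has a non-finite pose, 000002 lacks depth.
    for fid, pose, with_depth in [("000000", identity, True),
                                  ("000001", "-inf -inf -inf -inf", True),
                                  ("000002", identity, False)]:
        (sensor / f"frame-{fid}.color.jpg").touch()
        (sensor / f"frame-{fid}.pose.txt").write_text(pose)
        if with_depth:
            (sensor / f"frame-{fid}.depth.png").touch()
    valid = find_valid_frames(sensor, ["000000", "000001", "000002"])
    saved = save_valid_frames(sensor, valid)

print(valid, saved)
```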
You can use test.py for inferring and evaluating depth maps and fusing meshes.
All results will be stored at a base results folder (results_path) at:
opts.output_base_path/opts.name/opts.dataset/opts.frame_tuple_type/
where opts is the options class. For example, when opts.output_base_path is ./results, opts.name is HERO_MODEL, opts.dataset is scannet, and opts.frame_tuple_type is default, the output directory will be
./results/HERO_MODEL/scannet/default/
Make sure to set --output_base_path to a directory suitable for you to store results.
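The path layout above amounts to joining four option values. The sketch below reproduces it with a hypothetical `results_path` helper (the repo builds the path internally) and lists the subfolders mentioned elsewhere in this README.

```python
import posixpath


def results_path(output_base_path, name, dataset, frame_tuple_type):
    """Join the four option values into the base results folder."""
    return posixpath.join(output_base_path, name, dataset, frame_tuple_type)


base = results_path("./results", "HERO_MODEL", "scannet", "default")
print(base)

# Subfolders this README mentions under results_path.
subfolders = {
    "scores": posixpath.join(base, "scores"),
    "meshes": posixpath.join(base, "meshes"),
    "depths": posixpath.join(base, "depths"),
    "quick_viz": posixpath.join(base, "viz", "quick_viz"),
}
for label, path in subfolders.items():
    print(label, path)
```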
--frame_tuple_type is the type of image tuple used for MVS. A selection should be provided in the data_config file you used.
By default test.py will attempt to compute depth scores for each frame and provide both frame averaged and scene averaged metrics. The script will save these scores (per scene and totals) under results_path/scores.
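The two averages differ when scenes have different numbers of frames. Here is a small sketch with made-up numbers (not results from the paper) showing the distinction; the function names are hypothetical.

```python
def frame_average(per_scene):
    """Average over every frame, so long scenes weigh more."""
    values = [v for scores in per_scene.values() for v in scores]
    return sum(values) / len(values)


def scene_average(per_scene):
    """Average each scene first, then average the scene means."""
    means = [sum(scores) / len(scores) for scores in per_scene.values()]
    return sum(means) / len(means)


# Made-up per-frame absolute depth errors for two scenes of unequal length.
abs_diff = {
    "scene_a": [0.10, 0.10, 0.10, 0.10],
    "scene_b": [0.20],
}

frame_avg = frame_average(abs_diff)
scene_avg = scene_average(abs_diff)
print(round(frame_avg, 3), round(scene_avg, 3))
```

In this toy case the frame average is pulled toward the longer scene, while the scene average treats both scenes equally.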
We've done our best to ensure that a torch batching bug through the matching encoder is fixed for (<10^-4) accurate testing by disabling image batching through that encoder. Run --batch_size 4 at most if in doubt, and if you're looking to get as stable as possible numbers and avoid PyTorch gremlins, use --batch_size 1 for comparison evaluation.
If you want to use this for speed, set --fast_cost_volume to True. This will enable batching through the matching encoder and will enable an einops optimized feature volume.
# Example command to just compute scores
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--batch_size 4;
# If you'd like to get a super fast version use:
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--fast_cost_volume \
--batch_size 2;
This script can also be used to perform a few different auxiliary tasks, including:
TSDF Fusion
To run TSDF fusion provide the --run_fusion flag. You have two choices for fusers:
- --depth_fuser ours (default) will use our fuser, whose meshes are used in most visualizations and for scores. This fuser does not support color. We've provided a custom branch of scikit-image with our custom implementation of measure.marching_cubes that allows single-walled meshes. We use single-walled meshes for evaluation. If this isn't important to you, you can set export_single_mesh to False for the call to export_mesh in test.py.
- --depth_fuser open3d will use the open3d depth fuser. This fuser supports color and you can enable this by using the --fuse_color flag.
By default, depth maps will be clipped to 3m for fusion and a TSDF resolution of 0.04m³ will be used, but you can change that by changing both --max_fusion_depth and --fusion_resolution.
You can optionally ask for predicted depths used for fusion to be masked when no valid MVS information exists using --mask_pred_depths. This is not enabled by default.
You can also fuse the best guess depths from the cost volume before the cost volume encoder-decoder that introduces a strong image prior. You can do this by using --fusion_use_raw_lowest_cost.
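As a rough picture of what a "raw lowest cost" depth means, the sketch below takes, for every pixel, the depth plane with the smallest matching cost. It uses plain nested lists and a hypothetical `lowest_cost_depth` function; it assumes lower values mean better matches and does not reproduce the tensor layout or implementation used in this repo.

```python
def lowest_cost_depth(cost_volume, depth_planes):
    """Pick the depth of the lowest-cost plane at each pixel.

    cost_volume[d][y][x] is the matching cost of plane d at pixel (y, x).
    depth_planes[d] is the depth in metres hypothesised by plane d.
    """
    height = len(cost_volume[0])
    width = len(cost_volume[0][0])
    depth = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            costs = [plane[y][x] for plane in cost_volume]
            best = costs.index(min(costs))
            depth[y][x] = depth_planes[best]
    return depth


# Toy volume: 3 depth planes over a 1x2 image, with made-up costs.
planes = [0.5, 1.0, 2.0]
volume = [
    [[0.9, 0.2]],  # costs for the 0.5 m plane
    [[0.1, 0.8]],  # costs for the 1.0 m plane
    [[0.5, 0.7]],  # costs for the 2.0 m plane
]
raw_depth = lowest_cost_depth(volume, planes)
print(raw_depth)
```

Such a per-pixel winner uses only multi-view matching evidence, which is why it lacks the image prior the encoder-decoder adds afterwards.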
Meshes will be stored under results_path/meshes/.
# Example command to fuse depths to get meshes
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--run_fusion \
--batch_size 8;
Cache depths
You can optionally store depths by providing the --cache_depths flag. They will be stored at results_path/depths.
# Example command to compute scores and cache depths
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--cache_depths \
--batch_size 8;
# Example command to fuse depths to get color meshes
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--run_fusion \
--depth_fuser open3d \
--fuse_color \
--batch_size 4;
Quick viz
There are other scripts for deeper visualizations of output depths and fusion, but for quick export of depth map visualization you can use --dump_depth_visualization. Visualizations will be stored at results_path/viz/quick_viz/.
# Example command to output quick depth visualizations
CUDA_VISIBLE_DEVICES=0 python test.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_default_test.yaml \
--num_workers 8 \
--dump_depth_visualization \
--batch_size 4;
We also allow point cloud fusion of depth maps using the fuser from 3DVNet's repo.
# Example command to fuse depths into point clouds.
CUDA_VISIBLE_DEVICES=0 python pc_fusion.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_dense_test.yaml \
--num_workers 8 \
--batch_size 4;
Change configs/data/scannet_dense_test.yaml to configs/data/scannet_default_test.yaml to use keyframes only if you don't want to wait too long.
We use TransformerFusion's mesh evaluation for our main results table but set the seed to a fixed value for consistency when randomly sampling meshes. We also report mesh metrics using NeuralRecon's evaluation in the supplemental material.
For point cloud evaluation, we use TransformerFusion's code but load in a point cloud in place of sampling a mesh's surface.
By default models and tensorboard event files are saved to ~/tmp/tensorboard/<model_name>. This can be changed with the --log_dir flag.
We train with a batch_size of 16 with 16-bit precision on two A100s on the default ScanNetv2 split.
Example command to train with two GPUs:
CUDA_VISIBLE_DEVICES=0,1 python train.py --name HERO_MODEL \
--log_dir logs \
--config_file configs/models/hero_model.yaml \
--data_config configs/data/scannet_default_train.yaml \
--gpus 2 \
--batch_size 16;
The code supports any number of GPUs for training.
You can specify which GPUs to use with the CUDA_VISIBLE_DEVICES environment variable.
All our training runs were performed on two NVIDIA A100s.
Different dataset
You can train on a custom MVS dataset by writing a new dataloader class which inherits from GenericMVSDataset at datasets/generic_mvs_dataset.py. See the ScannetDataset class in datasets/scannet_dataset.py or indeed any other class in datasets for an example.
To finetune, simply load a checkpoint (not resume!) and train from there:
CUDA_VISIBLE_DEVICES=0 python train.py --config_file configs/models/hero_model.yaml \
--data_config configs/data/scannet_default_train.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt
Change the data configs to whatever dataset you want to finetune to.
See options.py for the range of other training options, such as learning rates and ablation settings, and testing options.
Other than quick depth visualization in the test.py script, there are two scripts for visualizing depth output.
The first is visualization_scripts/visualize_scene_depth_output.py. This will produce a video with color images of the reference and source frames, depth prediction, cost volume estimate, GT depth, and estimated normals from depth. The script assumes you have cached depth output using test.py and accepts the same command template format as test.py:
# Example command to get visualizations for dense frames
CUDA_VISIBLE_DEVICES=0 python ./visualization_scripts/visualize_scene_depth_output.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--data_config configs/data/scannet_dense_test.yaml \
--num_workers 8;
where OUTPUT_PATH is the base results directory for SimpleRecon (what you used for test to begin with). You could optionally run ./visualization_scripts/generate_gt_min_max_cache.py before this script to get a scene average for the min and max depth values used for colormapping; if those aren't available, the script will use 0m and 5m for colormapping min and max.
The second allows a live visualization of meshing. This script will use cached depth maps if available, otherwise it will use the model to predict them before fusion. The script will iteratively load in a depth map, fuse it, save a mesh file at this step, and render this mesh alongside a camera marker for the birdseye video, and from the point of view of the camera for the fpv video.
# Example command to get live visualizations for mesh reconstruction
CUDA_VISIBLE_DEVICES=0 python visualize_live_meshing.py --name HERO_MODEL \
--output_base_path OUTPUT_PATH \
--config_file configs/models/hero_model.yaml \
--load_weights_from_checkpoint weights/hero_model.ckpt \
--data_config configs/data/scannet_dense_test.yaml \
--num_workers 8;
By default the script will save meshes to an intermediate location, and you can optionally load those meshes to save time when visualizing the same meshes again by passing --use_precomputed_partial_meshes. All intermediate meshes will have had to be computed on the previous run for this to work.
This repo uses the notation "cam_T_world" to denote a transformation from world to camera points (extrinsics). The intention is to make it so that the coordinate frame names would match on either side of the variable when used in multiplication:
cam_points = cam_T_world @ world_points
world_T_cam denotes camera pose (from cam to world coords). ref_T_src denotes a transformation from a source to a reference view.
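The naming convention can be checked on a toy example. The sketch below uses plain Python lists rather than tensors, and the helper functions (`matmul`, `apply`, `invert_rigid`) are hypothetical, written here only to show how the frame names line up across a multiplication.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply(T, point):
    """Apply a 4x4 transform to a homogeneous point [x, y, z, 1]."""
    return [sum(T[i][j] * point[j] for j in range(4)) for i in range(4)]


def invert_rigid(T):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Extrinsics: a 90 degree rotation about z plus a translation.
cam_T_world = [[0.0, -1.0, 0.0, 1.0],
               [1.0, 0.0, 0.0, 2.0],
               [0.0, 0.0, 1.0, 3.0],
               [0.0, 0.0, 0.0, 1.0]]
world_T_cam = invert_rigid(cam_T_world)  # camera pose

world_point = [1.0, 0.0, 0.0, 1.0]
cam_point = apply(cam_T_world, world_point)   # cam <- world
round_trip = apply(world_T_cam, cam_point)    # world <- cam

# Chaining: the inner frame names cancel, ref <- world <- src.
ref_T_world = cam_T_world
world_T_src = [[1.0, 0.0, 0.0, 0.5],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]]
ref_T_src = matmul(ref_T_world, world_T_src)

print(cam_point, round_trip)
```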
This repo is geared towards ScanNet, so while its functionality should allow for any coordinate system (signaled via input flags), the model weights we provide assume a ScanNet coordinate system. This is important since we include ray information as part of metadata. Other datasets used with these weights should be transformed to the ScanNet system. The dataset classes we include will perform the appropriate transforms.
Initially this repo spat out tuple files for default DVMVS style keyframes with 9 extra frames, 25599 in total, for the ScanNetv2 test set. There was a minor bug with handling lost tracking that's now fixed. This repo should now mimic the DVMVS keyframe buffer exactly, with 25590 keyframes for testing. The only effect this bug had was the inclusion of 9 extra frames; all the other tuples were exactly the same as those of DVMVS. The offending frames are in these scans:
scan previous count new count
--------------------------------------
scene0711_00 393 392
scene0727_00 209 208
scene0736_00 1023 1022
scene0737_00 408 407
scene0751_00 165 164
scene0775_00 220 219
scene0791_00 227 226
scene0794_00 141 140
scene0795_00 102 101
The tuple files for default test have been updated. Since this is a small (~3e-4) difference in extra frames scored, the scores are unchanged.
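The numbers above can be sanity-checked with a few lines of arithmetic. The counts are copied from the table; nothing else is assumed.

```python
# (previous count, new count) per scan, copied from the table above.
counts = {
    "scene0711_00": (393, 392),
    "scene0727_00": (209, 208),
    "scene0736_00": (1023, 1022),
    "scene0737_00": (408, 407),
    "scene0751_00": (165, 164),
    "scene0775_00": (220, 219),
    "scene0791_00": (227, 226),
    "scene0794_00": (141, 140),
    "scene0795_00": (102, 101),
}

extra_frames = sum(prev - new for prev, new in counts.values())
old_total = 25599
new_total = old_total - extra_frames
fraction = extra_frames / old_total

print(extra_frames, new_total, f"{fraction:.1e}")
```

Each affected scan loses exactly one frame, and the removed fraction is on the order of 3e-4 of the evaluated keyframes.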
We thank Aljaž Božič of TransformerFusion, Jiaming Sun of Neural Recon, and Arda Düzçeker of DeepVideoMVS for quickly providing useful information to help with baselines and for making their codebases readily available, especially on short notice.
The tuple generation scripts make heavy use of a modified version of DeepVideoMVS's Keyframe buffer (thanks again Arda and co!).
The PyTorch point cloud fusion module at torch_point_cloud_fusion is borrowed from 3DVNet's repo. Thanks Alexander Rich!
We'd also like to thank Niantic's infrastructure team for quick actions when we needed them. Thanks folks!
Mohamed is funded by a Microsoft Research PhD Scholarship (MRL 2018-085).
If you find our work useful in your research please consider citing our paper:
@inproceedings{sayed2022simplerecon,
title={SimpleRecon: 3D Reconstruction Without 3D Convolutions},
author={Sayed, Mohamed and Gibson, John and Watson, Jamie and Prisacariu, Victor and Firman, Michael and Godard, Cl{\'e}ment},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022},
}
Copyright © Niantic, Inc. 2022. Patent Pending. All rights reserved. Please see the license file for terms.