This repo contains a simplified implementation of the very cool 'dense prediction transformer' (DPT) depth estimation model from isl-org/MiDaS, with the intention of removing the magic from the original code. Most of the changes come from eliminating dependencies as well as adjusting the code to more directly represent the model architecture as described in the preprint: "Vision Transformers for Dense Prediction". It also supports Depth-Anything V1 and Depth-Anything V2 models, which use the same DPT structure.
While the focus of this implementation is on readability, there are also some performance improvements with MiDaS v3.1 models (40-60% on my GPU at least) due to caching of positional encodings, at the cost of higher VRAM usage (this can be disabled).
The purpose of this repo is to provide an easy-to-follow code base for understanding how the DPT & image encoder models are structured. The scripts found in the simple_examples folder are a good starting point if you'd like to better understand how to make use of the DPT models. The run_image.py demo script is a good example of a more practical use of the models.
To understand the model structure, there's a written walkthrough explaining each of the DPT components. It's also worth checking out the code implementation of the DPT module; I'd recommend comparing it to the information in the original preprint, particularly figure 1 of the paper.
To get a better sense of what these models are actually doing internally, check out the experiments.
This repo includes two demo scripts, run_image.py and run_video.py. To use these scripts, you'll need to first have Python (v3.10+) installed, then set up a virtual environment and install some additional requirements.
First create and activate a virtual environment (do this inside the repo folder after cloning/downloading it):
```bash
# For linux or mac:
python3 -m venv .env
source .env/bin/activate

# For windows (cmd):
python -m venv .env
.env\Scripts\activate.bat
```
Then install the requirements (or you could install them manually from the requirements.txt file):
```bash
pip install -r requirements.txt
```
Additional info for GPU usage
If you're using Windows and want to use an Nvidia GPU, or if you're on Linux and don't have a GPU, you'll need to use a slightly different install command to match your hardware setup. You can use the PyTorch install guide to figure out the command to use. For example, for GPU use on Windows it may look something like:
```bash
# Do this first if you already installed from the requirements.txt file:
pip3 uninstall torch

pip3 install torch --index-url https://download.pytorch.org/whl/cu121
```
Note: With the Windows install as-is, you may get an error about a missing c10.dll dependency. Downloading and installing this mysterious .exe file seems to fix the problem.
Before you can run a model, you'll need to download its weights.
This repo supports the BEiT and SwinV2 models from MiDaS v3.1 which can be downloaded from the isl-org/MiDaS releases page. Additionally, DINOv2 models are supported from Depth-Anything V1 and Depth-Anything V2, which can be downloaded from the LiheYoung/Depth-Anything and Depth-Anything/Depth-Anything-V2 repos on Hugging Face, respectively.
After downloading a model file, you can place it in the model_weights folder of this repo, or otherwise just keep note of the file path, since you'll need to provide this when running the demo scripts. If you do place the file in the model_weights folder, it will auto-load when running the scripts.
Direct download links
The table below includes direct download links to all of the supported models. Note: These are all links to other repos, none of these files belong to MuggledDPT!
| Model | Size (MB) |
| --- | --- |
| depth-anything-v2-vit-small | 99 |
| depth-anything-v2-vit-base | 390 |
| depth-anything-v2-vit-large | 1340 |
| depth-anything-v1-vit-small | 99 |
| depth-anything-v1-vit-base | 390 |
| depth-anything-v1-vit-large | 1340 |
| swin2-tiny-256 | 164 |
| swin2-base-384 | 416 |
| swin2-large-384 | 840 |
| beit-base-384 | 456 |
| beit-large-384 | 1340 |
| beit-large-512 | 1470 |
Here is an example of using the model to generate an inverse depth map from an image:
```python
import cv2
from lib.make_dpt import make_dpt_from_state_dict

# Load image & model
img_bgr = cv2.imread("/path/to/image.jpg")
model_config_dict, dpt_model, dpt_imgproc = make_dpt_from_state_dict("/path/to/model.pth")

# Process data
img_tensor = dpt_imgproc.prepare_image_bgr(img_bgr)
inverse_depth_prediction = dpt_model.inference(img_tensor)
```
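The prediction is a tensor of relative inverse-depth values rather than a displayable image, so it needs to be normalized before viewing. Here's a rough sketch of that post-processing step using plain NumPy (this normalization is my own illustration, not a helper from this repo), assuming the prediction has already been moved to the CPU as a NumPy array:

```python
import numpy as np

def inverse_depth_to_uint8(prediction: np.ndarray) -> np.ndarray:
    """Scale an inverse-depth array into the 0-255 range for display.
    Closest points (largest inverse-depth values) map to 255 (white)."""
    pred_min, pred_max = prediction.min(), prediction.max()
    normalized = (prediction - pred_min) / max(pred_max - pred_min, 1e-8)
    return (normalized * 255).astype(np.uint8)

# Dummy data standing in for a real model prediction
dummy_prediction = np.array([[0.2, 0.5], [0.8, 1.4]], dtype=np.float32)
depth_image = inverse_depth_to_uint8(dummy_prediction)
```

The resulting uint8 array can be displayed with cv2.imshow directly, or passed through cv2.applyColorMap for a colored visualization.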
The run_image.py script will run the depth prediction model on a single image. To use the script, make sure you've activated the virtual environment (from the installation step) and then, from the repo folder, use:

```bash
python run_image.py
```
You can also add `--help` to the end of this command to see a list of additional flags you can set when running this script. One especially interesting flag is `-b`, which allows for processing images at higher resolutions.
If you don't provide an image path (using the `-i` flag), you will be asked to provide one when you run the script, and likewise for a path to the model weights. Afterwards, a window will pop up with various sliders that can be used to modify the depth visualization. These let you adjust the contrast of the depth visualization, as well as remove a plane-of-best-fit, which can often remove the 'floor' from the depth prediction. You can press `s` to save the current depth image.
The run_video.py script will run the depth prediction model on individual frames from a video. To use the script, again make sure you're in the activated virtual environment and then, from the repo folder, use:

```bash
python run_video.py
```
As with the image script, you can add `--help` to the end of this command to see a list of additional flags you can set. For example, you can use a webcam as input with the `--use_webcam` flag. It's possible to record video results (per-frame) using the `--allow_recording` flag.
When processing video, depth predictions are made asynchronously (i.e. only when the GPU is ready to do more processing). This leads to faster playback/interaction, but the depth results may appear choppy. You can force synchronous playback using the `-sync` flag or by toggling the option within the UI (this also gives more accurate inference timing results).
Note: The original DPT implementation is not designed for consistency across video frames, so the results can be very noisy looking. If you actually need video depth estimation, consider Consistent Depth of Moving Objects in Video and the listed related works.
The DPT models output results which are related to the multiplicative inverse (i.e. `1/d`) of the true depth! As a result, the closest part of an image will have the largest reported value from the DPT model, and the furthest part will have the smallest. Additionally, the reported values are not distributed linearly, which will make the results look distorted if interpreted geometrically (e.g. as a 3D model).
If you happen to know what the true minimum and maximum depth values are for a given image, you can compute the true depth from the DPT result using:

depth = 1 / (V_norm × (1/d_min − 1/d_max) + 1/d_max)

Where d_min and d_max are the known minimum and maximum (respectively) true depth values, and V_norm is the DPT result normalized to be between 0 and 1 (a.k.a. the normalized inverse depth).
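As an illustration (this helper isn't part of the repo), the conversion from normalized inverse depth back to true depth can be written as:

```python
import numpy as np

def true_depth_from_normalized(v_norm: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Convert normalized inverse depth (0 = farthest, 1 = closest)
    into true depth, given the known scene depth limits."""
    inverse_depth = v_norm * (1.0 / d_min - 1.0 / d_max) + 1.0 / d_max
    return 1.0 / inverse_depth

# Sanity check: the closest point (v_norm = 1) should map to d_min,
# the farthest point (v_norm = 0) should map to d_max
v_norm = np.array([1.0, 0.0, 0.5])
depths = true_depth_from_normalized(v_norm, d_min=2.0, d_max=10.0)
```

Note how a v_norm of 0.5 does not land halfway between d_min and d_max, which reflects the non-linear distribution of the raw predictions mentioned above.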
For more information, please see the results explainer.
The code in this repo is based on code from the following sources.

isl-org/MiDaS:
```bibtex
@article{Ranftl2022,
    author  = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
    title   = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
    journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    year    = "2022",
    volume  = "44",
    number  = "3"
}

@article{Ranftl2021,
    author  = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
    title   = {Vision Transformers for Dense Prediction},
    journal = {ICCV},
    year    = {2021},
}

@article{birkl2023midas,
    title   = {MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation},
    author  = {Reiner Birkl and Diana Wofk and Matthias M{\"u}ller},
    journal = {arXiv preprint arXiv:2307.14460},
    year    = {2023}
}
```
rwightman/pytorch-image-models (aka timm, specifically v0.6.12):
```bibtex
@misc{rw2019timm,
    author       = {Ross Wightman},
    title        = {PyTorch Image Models},
    year         = {2019},
    publisher    = {GitHub},
    journal      = {GitHub repository},
    doi          = {10.5281/zenodo.4414861},
    howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
```
LiheYoung/Depth-Anything (v1):
```bibtex
@inproceedings{depth_anything_v1,
    title     = {Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
    author    = {Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
    booktitle = {CVPR},
    year      = {2024}
}
```
DepthAnything/Depth-Anything-V2:
```bibtex
@article{depth_anything_v2,
    title   = {Depth Anything V2},
    author  = {Yang, Lihe and Kang, Bingyi and Huang, Zilong and Zhao, Zhen and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
    journal = {arXiv:2406.09414},
    year    = {2024}
}
```
facebookresearch/dinov2:
```bibtex
@misc{oquab2023dinov2,
    title   = {DINOv2: Learning Robust Visual Features without Supervision},
    author  = {Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
    journal = {arXiv:2304.07193},
    year    = {2023}
}
```
TODOs
- Inevitable bugfixes