Harnessing Diffusion Models for Visual Perception with Meta Prompts
Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang
- ⏳ Pose estimation training code and model.
- Jan. 31st, 2024: Release semantic segmentation training code and model.
- Jan. 6th, 2024: Release depth estimation training code and model.
Clone this repo and run `sh install.sh`. Then download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the `checkpoints` folder, as in the sketch below.
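A minimal end-to-end setup sketch. The repo URL and the checkpoint filename are assumptions (stable-diffusion v1-5 checkpoints have been hosted in several places), so substitute the actual repo address and whichever v1-5 `.ckpt` you use:

```sh
# Clone the repo and run the provided install script.
git clone https://github.com/fudan-zvg/meta-prompts.git   # URL assumed; use the actual repo address
cd meta-prompts
sh install.sh

# Place a stable-diffusion v1-5 checkpoint where the code expects it.
# Filename and source path are assumptions; any SD v1-5 .ckpt should work.
mkdir -p checkpoints
mv /path/to/v1-5-pruned-emaonly.ckpt checkpoints/
```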
MetaPrompts obtains 0.223 RMSE on the NYUv2 depth estimation benchmark and 1.928 RMSE on the KITTI Eigen split, setting a new state of the art.
| NYUv2 | RMSE | d1 | d2 | d3 | REL |
|---|---|---|---|---|---|
| MetaPrompts | 0.223 | 0.976 | 0.997 | 0.999 | 0.061 |
| KITTI | RMSE | d1 | d2 | d3 | REL |
|---|---|---|---|---|---|
| MetaPrompts | 1.928 | 0.981 | 0.998 | 1.000 | 0.047 |
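For reference, these are the standard depth-estimation metrics (not specific to this repo): REL is the mean absolute relative error, and d1/d2/d3 are threshold accuracies, i.e. the fraction of pixels whose prediction $\hat{d}_p$ is within a factor of $1.25^i$ of the ground truth $d^*_p$:

$$
\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{p}\left(\hat{d}_p-d^{*}_p\right)^2},\qquad
\mathrm{REL}=\frac{1}{N}\sum_{p}\frac{\lvert\hat{d}_p-d^{*}_p\rvert}{d^{*}_p},\qquad
d_i=\frac{1}{N}\left\lvert\left\{p:\max\!\left(\frac{\hat{d}_p}{d^{*}_p},\frac{d^{*}_p}{\hat{d}_p}\right)<1.25^{i}\right\}\right\rvert
$$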
Please check depth.md for detailed instructions on training and inference.
MetaPrompts obtains 56.8 mIoU on the ADE20K semantic segmentation benchmark and 87.3 mIoU on Cityscapes, setting a new state of the art.
| ADE20K | Head | Crop Size | mIoU | mIoU (ms+flip) |
|---|---|---|---|---|
| MetaPrompts | UPerNet | 512x512 | 55.83 | 56.81 |
| Cityscapes | Head | Crop Size | mIoU | mIoU (ms+flip) |
|---|---|---|---|---|
| MetaPrompts | UPerNet | 1024x1024 | 85.98 | 87.26 |
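For reference, mIoU is the intersection-over-union averaged over the $C$ semantic classes, and "ms+flip" denotes multi-scale plus horizontal-flip test-time augmentation:

$$
\mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FP}_c+\mathrm{FN}_c}
$$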
Please check segmentation.md for detailed instructions on training and inference.
This project is released under the MIT License.
This code is based on stable-diffusion, mmsegmentation, LAVT, VPD, ViTPose, mmpose, and MIM-Depth-Estimation.
If you find our work useful in your research, please consider citing:

    @article{wan2023harnessing,
      title={Harnessing Diffusion Models for Visual Perception with Meta Prompts},
      author={Wan, Qiang and Huang, Zilong and Kang, Bingyi and Feng, Jiashi and Zhang, Li},
      journal={arXiv preprint arXiv:2312.14733},
      year={2023}
    }