This repository provides the PyTorch implementation of DCTNet: Depth-Cooperated Trimodal Network for Video Salient Object Detection [arXiv].
News: The new RDVS dataset, whose video depth is realistic rather than synthesized, is coming!!!
Overview of DCTNet. (a) shows the overall architecture; (b) and (c) show the details of MAM and RFM, respectively.
Effectiveness of leveraging depth to assist VSOD. OF denotes optical flow, and GT represents ground truth.
Columns (e) and (f) show predictions from our full model (with depth) and its variant (without depth), respectively.
- Python 3.7
- PyTorch 1.6.0
- Torchvision 0.7.0
- CUDA 9.2
- Ubuntu 16.04
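To quickly confirm that the installed versions match the ones above, a short sanity check such as the following can be run (this snippet is only a convenience sketch, not part of the repository):

```python
# Quick environment sanity check (convenience sketch, not part of the repository).
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected: 1.6.0
print("Torchvision:", torchvision.__version__)  # expected: 0.7.0
print("CUDA runtime:", torch.version.cuda)      # expected: 9.2
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # training uses 2 GPUs
```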
- Download the pre-trained ResNet34 backbone to './model/resnet/pre_train/'.
- Download the training dataset (containing DAVIS16, DAVSOD, FBMS and DUTS-TR) from Baidu Drive (PSW: 7yer) and save it at './dataset/train/*'.
- Follow the instructions of RAFT to prepare the optical flow maps and the instructions of DPT to prepare the synthetic depth maps. (Both the optical flow maps and the synthetic depth maps are also available from our dataset link.)
- Download the pre-trained RGB, depth and flow stream models from Baidu Drive (PSW: 8lux) to './checkpoints/'.
- The training of the entire DCTNet is carried out on two NVIDIA TITAN X GPUs.
- Run `CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py` in the terminal (see the distributed-training sketch after this list).
- The pretraining code for the individual streams can be derived from `train.py`. We provide `pretrain_depth.py`, which can also be modified to pretrain the other two streams. Moreover, for the RGB/spatial stream we additionally use DUTS-TR for pretraining.
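For reference, the launch command above starts one process per GPU and passes a `--local_rank` argument to each process. The following is a minimal sketch of the distributed wiring such a command assumes; the tiny model and random tensors are placeholders and do not reflect the actual `train.py`:

```python
# Minimal DDP wiring sketch for the torch.distributed.launch command above.
# The tiny model and random tensors are placeholders for DCTNet and its dataloader.
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")                   # one process per GPU

model = nn.Conv2d(3, 1, 3, padding=1).cuda(args.local_rank)  # placeholder, not DCTNet
model = DDP(model, device_ids=[args.local_rank])

data = TensorDataset(torch.randn(16, 3, 64, 64), torch.randn(16, 1, 64, 64))
sampler = DistributedSampler(data)            # shards samples across the two processes
loader = DataLoader(data, batch_size=4, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(2):
    sampler.set_epoch(epoch)                  # reshuffle shards every epoch
    for x, y in loader:
        x, y = x.cuda(args.local_rank), y.cuda(args.local_rank)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```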
- Download the test data (containing DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS) from Baidu Drive (PSW: 8uh3) and save it at './dataset/test/*'.
- Download the trained model from Baidu Drive (PSW: lze1) and modify `model_path` in `test.py` to its saving path.
- Run `python test.py` in the terminal.
In addition to the model reported in the paper, we also provide versions trained with two more training-set combinations: DAVIS + FBMS and DAVIS + DAVSOD.
Models marked with "*" are traditional methods. MGAN and FSNet are trained and fine-tuned on DAVIS and FBMS; the comparison results are below. Download the corresponding trained model from Baidu Drive (PSW: l3q2).
| Datasets | Metrics | MSTM* | STBP* | SFLR* | SCOM* | MGAN | FSNet | Ours |
|---|---|---|---|---|---|---|---|---|
| DAVIS | maxF | 0.395 | 0.485 | 0.698 | 0.746 | 0.893 | 0.907 | 0.912 |
| | S-measure | 0.566 | 0.651 | 0.771 | 0.814 | 0.913 | 0.920 | 0.924 |
| | MAE | 0.174 | 0.105 | 0.060 | 0.055 | 0.022 | 0.020 | 0.014 |
| DAVSOD | maxF | 0.347 | 0.408 | 0.482 | 0.473 | 0.662 | 0.685 | 0.691 |
| | S-measure | 0.530 | 0.563 | 0.622 | 0.603 | 0.757 | 0.773 | 0.782 |
| | MAE | 0.214 | 0.165 | 0.136 | 0.219 | 0.079 | 0.072 | 0.068 |
| FBMS | maxF | 0.500 | 0.595 | 0.660 | 0.797 | 0.909 | 0.888 | 0.909 |
| | S-measure | 0.613 | 0.627 | 0.699 | 0.794 | 0.912 | 0.890 | 0.916 |
| | MAE | 0.177 | 0.152 | 0.117 | 0.079 | 0.026 | 0.041 | 0.024 |
| SegTrack-V2 | maxF | 0.526 | 0.640 | 0.745 | 0.764 | 0.840 | 0.806 | 0.826 |
| | S-measure | 0.643 | 0.735 | 0.804 | 0.815 | 0.895 | 0.870 | 0.887 |
| | MAE | 0.114 | 0.061 | 0.037 | 0.030 | 0.024 | 0.025 | 0.034 |
| VOS | maxF | 0.567 | 0.526 | 0.546 | 0.690 | 0.743 | 0.659 | 0.764 |
| | S-measure | 0.657 | 0.576 | 0.624 | 0.712 | 0.807 | 0.703 | 0.831 |
| | MAE | 0.144 | 0.163 | 0.145 | 0.162 | 0.069 | 0.103 | 0.061 |
SSAV, PCSA and TENet are trained and fine-tuned on DAVIS and DAVSOD; the comparison results are below. Download the corresponding trained model from Baidu Drive (PSW: srwu).
| Datasets | Metrics | MSTM* | STBP* | SFLR* | SCOM* | SSAV | PCSA | TENet | Ours |
|---|---|---|---|---|---|---|---|---|---|
| DAVIS | maxF | 0.395 | 0.485 | 0.698 | 0.746 | 0.861 | 0.880 | 0.894 | 0.904 |
| | S-measure | 0.566 | 0.651 | 0.771 | 0.814 | 0.893 | 0.902 | 0.905 | 0.917 |
| | MAE | 0.174 | 0.105 | 0.060 | 0.055 | 0.028 | 0.022 | 0.021 | 0.016 |
| DAVSOD | maxF | 0.347 | 0.408 | 0.482 | 0.473 | 0.603 | 0.656 | 0.648 | 0.695 |
| | S-measure | 0.530 | 0.563 | 0.622 | 0.603 | 0.724 | 0.741 | 0.753 | 0.778 |
| | MAE | 0.214 | 0.165 | 0.136 | 0.219 | 0.092 | 0.086 | 0.078 | 0.069 |
| FBMS | maxF | 0.500 | 0.595 | 0.660 | 0.797 | 0.865 | 0.837 | 0.887 | 0.883 |
| | S-measure | 0.613 | 0.627 | 0.699 | 0.794 | 0.879 | 0.868 | 0.910 | 0.886 |
| | MAE | 0.177 | 0.152 | 0.117 | 0.079 | 0.040 | 0.040 | 0.027 | 0.032 |
| SegTrack-V2 | maxF | 0.526 | 0.640 | 0.745 | 0.764 | 0.798 | 0.811 | ** | 0.839 |
| | S-measure | 0.643 | 0.735 | 0.804 | 0.815 | 0.851 | 0.866 | ** | 0.886 |
| | MAE | 0.114 | 0.061 | 0.037 | 0.030 | 0.023 | 0.024 | ** | 0.014 |
| VOS | maxF | 0.567 | 0.526 | 0.546 | 0.690 | 0.742 | 0.747 | ** | 0.772 |
| | S-measure | 0.657 | 0.576 | 0.624 | 0.712 | 0.819 | 0.828 | ** | 0.837 |
| | MAE | 0.144 | 0.163 | 0.145 | 0.162 | 0.074 | 0.065 | ** | 0.058 |
- The saliency maps can be downloaded from Baidu Drive (PSW: wfqc).
- Evaluation Toolbox: We use the standard evaluation toolbox from the DAVSOD benchmark (a simplified sketch of the MAE and max F-measure computations is given below for reference).
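For a quick sanity check of predicted saliency maps, the snippet below sketches per-map MAE and max F-measure using standard definitions (beta^2 = 0.3). It is not the DAVSOD toolbox itself, which also computes S-measure and aggregates statistics over each dataset; the numbers in the tables come from that toolbox.

```python
# Simplified per-map MAE and max F-measure (standard definitions, beta^2 = 0.3).
# Sanity-check sketch only; the reported numbers come from the DAVSOD toolbox.
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0,1] saliency map and a binary GT mask."""
    return float(np.mean(np.abs(pred - gt)))

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Max F-measure over uniformly sampled binarization thresholds."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best

# Usage: pred and gt are HxW arrays scaled to [0, 1].
pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float32)
print("MAE:", mae(pred, gt), "maxF:", max_f_measure(pred, gt))
```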
We have constructed a new RGB-D VSOD dataset whose depth is realistic rather than synthesized. See the links for the new RDVS dataset and the paper.
Please cite our paper if you find this work useful:
@inproceedings{lu2022depth,
title={Depth-Cooperated Trimodal Network for Video Salient Object Detection},
author={Lu, Yukang and Min, Dingyao and Fu, Keren and Zhao, Qijun},
booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
year={2022},
organization={IEEE}
}