Dingyuan Zhang 1,* ,
Dingkang Liang 1,* ,
Zichang Tan 2,
Xiaoqing Ye 2,
Cheng Zhang 1,
Jingdong Wang 2,
Xiang Bai 1,†
1 Huazhong University of Science and Technology,
2 Baidu Inc.
* Equal contribution, † Corresponding author.
This repository represents the official implementation of the paper titled "Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression".
By leveraging history object queries as high-quality foreground priors, modeling their 3D motion information, and interacting them with image tokens through the attention mechanism, ToC3D can allocate more computing resources to important foreground tokens while minimizing information loss, leading to a more efficient ViT-based multi-view 3D detector.
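The core idea can be sketched as follows. This is a minimal, hypothetical illustration of score-based token compression (the function name, plain-list interface, and softmax merging are assumptions for clarity; in ToC3D the scores come from history object queries interacting with image tokens via attention):

```python
import math

def compress_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top-scoring tokens; merge the rest into one summary token.

    tokens: list of N feature vectors (lists of floats); scores: N floats.
    Hypothetical sketch -- not the repo's actual implementation.
    """
    n = len(tokens)
    num_keep = max(1, int(n * keep_ratio))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep_idx, drop_idx = order[:num_keep], order[num_keep:]
    kept = [tokens[i] for i in keep_idx]
    if drop_idx:
        # Softmax-weighted average of the pruned tokens preserves some of
        # their information instead of discarding them outright.
        w = [math.exp(scores[i]) for i in drop_idx]
        z = sum(w)
        dim = len(tokens[0])
        summary = [
            sum(w[j] * tokens[drop_idx[j]][d] for j in range(len(drop_idx))) / z
            for d in range(dim)
        ]
        kept.append(summary)
    return kept
```

Only the most "foreground-like" tokens then pass through the expensive ViT blocks at full resolution, which is where the speedup comes from.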
This project is built upon StreamPETR, and the preparation roughly follows StreamPETR's.
- Follow the StreamPETR setups.
- Install timm and detectron2.
We use the following environment:
```
torch           1.10.1+cu111
torchvision     0.11.2+cu111
mmcls           0.25.0
mmcv-full       1.6.0
mmdet           2.28.2
mmdet3d         1.0.0rc6
mmsegmentation  0.30.0
timm            0.9.7
```
Exactly the same as the StreamPETR data preparation. After the preparation, the data folder should look like:
```
data/
├── nuscenes
│   ├── maps
│   ├── nuscenes2d_temporal_infos_train.pkl
│   ├── nuscenes2d_temporal_infos_val.pkl
│   ├── samples
│   ├── sweeps
│   └── v1.0-trainval
```
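As a quick sanity check, a small script like the following (hypothetical helper, paths assumed to match the layout above) can confirm the data folder is complete before training:

```python
import os

# Entries expected under data/nuscenes after StreamPETR-style preparation.
REQUIRED = [
    "maps",
    "nuscenes2d_temporal_infos_train.pkl",
    "nuscenes2d_temporal_infos_val.pkl",
    "samples",
    "sweeps",
    "v1.0-trainval",
]

def missing_entries(root="data/nuscenes"):
    """Return the required entries that are absent (empty list = ready)."""
    return [name for name in REQUIRED if not os.path.exists(os.path.join(root, name))]
```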
As the baseline is trained with EVA-02 pretrained weights, we need to prepare them here.
Follow the instructions from StreamPETR to download the Objects365 weights and convert them. Finally, put the weights into the ckpts/ folder:
```
ckpts/
└── eva02_L_coco_det_sys_o365_remapped.pth
```
Note: the performance of trained models can vary with the environment and machine, so we provide our training logs and weights here.
| Model | Logs | Weight |
|---|---|---|
| ToC3D_fast | ToC3D_fast.log | OneDrive |
| ToC3D_faster | ToC3D_faster.log | OneDrive |
| ToC3D_fast (1600 resolution) | ToC3D_fast_1600.log | OneDrive |
| ToC3D_faster (1600 resolution) | ToC3D_faster_1600.log | OneDrive |
The basic commands are the same as the StreamPETR.
Run the following commands:

- ToC3D-Fast & ToC3D-Faster

  ```shell
  ./tools/dist_test.sh projects/configs/ToC3D/ToC3D_fast.py <ckpt> <num_gpus> --eval mAP    # Fast version
  ./tools/dist_test.sh projects/configs/ToC3D/ToC3D_faster.py <ckpt> <num_gpus> --eval mAP  # Faster version
  ```

  where `<ckpt>` is the path of the checkpoint and `<num_gpus>` is the number of GPUs used for inference.

- High input resolution (1600 x 800)

  ```shell
  ./tools/dist_test.sh projects/configs/ToC3D_1600_resolution/ToC3D_fast_1600.py <ckpt> <num_gpus> --eval mAP    # Fast version
  ./tools/dist_test.sh projects/configs/ToC3D_1600_resolution/ToC3D_faster_1600.py <ckpt> <num_gpus> --eval mAP  # Faster version
  ```
To accurately measure the inference speed, we first warm up the model with 200 samples and then measure the inference time.
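The warmup-then-measure protocol can be sketched as below (a hypothetical helper, not the repo's timing code, which lives in the test-speed configs; `model_fn` stands in for one inference call):

```python
import time

def measure_latency(model_fn, samples, num_warmup=200):
    """Average per-sample latency, ignoring the first num_warmup samples.

    Warmup lets CUDA kernels compile and caches fill, so the reported
    latency reflects steady-state inference rather than startup cost.
    """
    total, counted = 0.0, 0
    for i, sample in enumerate(samples):
        start = time.perf_counter()
        model_fn(sample)
        elapsed = time.perf_counter() - start
        if i >= num_warmup:  # only time the post-warmup samples
            total += elapsed
            counted += 1
    return total / max(counted, 1)
```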
Run the following commands:

```shell
./tools/dist_test.sh projects/configs/test_speed_ToC3D/stream_petr_eva_vit_l.py <ckpt> 1 --eval mAP  # baseline (StreamPETR)
./tools/dist_test.sh projects/configs/test_speed_ToC3D/ToC3D_ratio755.py <ckpt> 1 --eval mAP         # Fast version
./tools/dist_test.sh projects/configs/test_speed_ToC3D/ToC3D_ratio543.py <ckpt> 1 --eval mAP         # Faster version
```
Run the following commands:

```shell
./tools/dist_test.sh projects/configs/token_vis_ToC3D/ToC3D_fast.py ckpts/<ckpt> 1 --eval mAP    # Fast version
./tools/dist_test.sh projects/configs/token_vis_ToC3D/ToC3D_faster.py ckpts/<ckpt> 1 --eval mAP  # Faster version
```
The visualization results are saved to `token_vis/` by default. You can specify the number of visualized samples, the id of the first sample, and the output path by changing the config like:
```python
model = dict(
    type='Petr3D',
    ...
    vis_num_sample=<number of samples>,
    vis_start_id=<id of the first sample>,
    vis_out_path=<output path>,
    ...
)
```
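For readers unfamiliar with mmcv-style configs, the override semantics can be mimicked with a plain recursive dict merge (a simplified sketch; `merge_config` is a hypothetical name, and the real Config class handles files and inheritance):

```python
def merge_config(base, override):
    """Recursively merge override into base: nested dicts are merged,
    scalar values are replaced. Returns a new dict; base is untouched."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge_config(out[key], value)
        else:
            out[key] = value
    return out

# Example: override only the visualization fields, keep the rest.
base = dict(type='Petr3D', vis_num_sample=10, vis_start_id=0, vis_out_path='token_vis/')
cfg = merge_config(base, dict(vis_num_sample=5, vis_out_path='my_vis/'))
```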
The basic commands are the same as the StreamPETR.
Our training pipeline contains the following steps:

- Train the official StreamPETR as the pretrained model.
- Apply our method to StreamPETR, then finetune the model with the pretrained weights loaded.
- Train StreamPETR with ViT-L (loading EVA-02 pretrained weights):

  ```shell
  ./tools/dist_train.sh projects/configs/StreamPETR/stream_petr_eva_vit_l.py 8 --work-dir <path to your work dir>
  ./tools/dist_train.sh projects/configs/StreamPETR/stream_petr_eva_vit_l_1600.py 8 --work-dir <path to your work dir>  # higher input resolution (1600 x 800)
  ```

- Find the weights in `<path to your work dir>`, rename them to `streampetr_eva_vit_l_48e.pth` (`streampetr_eva_vit_l_1600_24e.pth` for the higher input resolution version), and put them into the `ckpts/` folder.
Since our method is finetuned from StreamPETR, we also directly finetune StreamPETR without ToC3D for a fair comparison:

```shell
./tools/dist_train.sh projects/configs/baseline_finetuned/stream_petr_eva_vit_l_finetuned.py 8 --work-dir <path to your work dir>
./tools/dist_train.sh projects/configs/baseline_finetuned/stream_petr_eva_vit_l_1600_finetuned.py 8 --work-dir <path to your work dir>  # higher input resolution
```
Run the following command:
```shell
./tools/dist_train.sh projects/configs/ToC3D/ToC3D_fast.py 8 --work-dir <path to your work dir>    # Fast version
./tools/dist_train.sh projects/configs/ToC3D/ToC3D_faster.py 8 --work-dir <path to your work dir>  # Faster version
```
For higher input resolution, run:
```shell
./tools/dist_train.sh projects/configs/ToC3D_1600_resolution/ToC3D_fast_1600.py 8 --work-dir <path to your work dir>    # Fast version
./tools/dist_train.sh projects/configs/ToC3D_1600_resolution/ToC3D_faster_1600.py 8 --work-dir <path to your work dir>  # Faster version
```
Note: the performance of trained models can vary with the environment and machine, so we provide our training logs and weights.
- Release Paper
- Release Code
- Release logs
- Release weights
```bibtex
@inproceedings{zhang2024makevitbasedmultiview3d,
  title={Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression},
  author={Dingyuan Zhang and Dingkang Liang and Zichang Tan and Xiaoqing Ye and Cheng Zhang and Jingdong Wang and Xiang Bai},
  booktitle={European Conference on Computer Vision},
  year={2024},
}
```
We thank these great works and open-source codebases: MMDetection3d, StreamPETR, Dynamic ViT, Evo-ViT.