Official PyTorch implementation of Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception, Xiao Wang, Wentao Wu, Chenglong Li, Zhicheng Zhao, Zhe Chen, Yukai Shi, Jin Tang, AAAI-2024 [arXiv] [Poster]
Understanding vehicles in images is important for various applications such as intelligent transportation and self-driving systems. Existing vehicle-centric works typically pre-train models on large-scale classification datasets and then fine-tune them for specific downstream tasks. However, they neglect the specific characteristics of vehicle perception in different tasks and may thus lead to sub-optimal performance. To address this issue, we propose a novel vehicle-centric pre-training framework called VehicleMAE, which incorporates structural information, including the spatial structure from vehicle profile information and the semantic structure from informative high-level natural language descriptions, for effective masked vehicle appearance reconstruction. Specifically, we explicitly extract the sketch lines of vehicles as a form of spatial structure to guide vehicle reconstruction. More comprehensive knowledge, distilled from the large CLIP model based on the similarity between paired/unpaired vehicle image-text samples, is further taken into consideration to help achieve a better understanding of vehicles. We build a large-scale dataset, termed Autobot1M, to pre-train our model; it contains about 1M vehicle images and 12,693 text descriptions. Extensive experiments on four vehicle-based downstream tasks fully validate the effectiveness of our VehicleMAE.
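As a rough illustration of this objective, the snippet below sketches one pre-training step that combines MAE-style masked pixel reconstruction, sketch-line (spatial structure) reconstruction, and CLIP-based semantic alignment. All names here (`pretrain_step`, `extract_sketch`, the model's outputs) are illustrative assumptions and the semantic term is a simplification of the paired/unpaired similarity distillation described in the paper; this is not the interface of this repository.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, clip_model, images, texts, extract_sketch, mask_ratio=0.75):
    # Assumed model outputs: reconstructed pixel patches, reconstructed sketch
    # patches, a pooled encoder feature projected to CLIP's embedding dimension,
    # and the binary mask over patches (1 = masked).
    pred_pix, pred_sketch, img_feat, mask = model(images, mask_ratio=mask_ratio)

    target_pix = model.patchify(images)                      # [B, N, patch_dim]
    target_sketch = model.patchify(extract_sketch(images))   # vehicle sketch lines

    # Reconstruction losses are averaged over masked patches only, as in MAE
    loss_pix = (F.mse_loss(pred_pix, target_pix, reduction='none').mean(-1) * mask).sum() / mask.sum()
    loss_sketch = (F.mse_loss(pred_sketch, target_sketch, reduction='none').mean(-1) * mask).sum() / mask.sum()

    # Semantic guidance: align the encoder feature with CLIP's embeddings of the
    # paired image and text (simplified from the paper's distillation scheme)
    with torch.no_grad():
        clip_img = clip_model.encode_image(images)
        clip_txt = clip_model.encode_text(texts)
    loss_sem = (1 - F.cosine_similarity(img_feat, clip_img, dim=-1)).mean() + \
               (1 - F.cosine_similarity(img_feat, clip_txt, dim=-1)).mean()

    return loss_pix + loss_sketch + loss_sem
```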
- A video tutorial for this work can be found by clicking the image below:
Configure the environment according to the content of the requirements.txt file.
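For example, with pip inside a fresh Python environment (assuming a CUDA-enabled PyTorch build appropriate for your system is already installed):

```bash
pip install -r requirements.txt
```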
Baidu Netdisk link: download
Extraction code: tpds
| Pre-trained Model | ViT-Base |
| --- | --- |
| Pre-trained checkpoint | download |
| Extraction code | 6zkx |
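A minimal sketch of loading the downloaded ViT-Base checkpoint for fine-tuning is shown below. The file name `vehiclemae_vit_base.pth`, the `'model'` key, and the use of `timm` are assumptions made for illustration; adjust them to match the actual checkpoint format and this repository's model definition.

```python
import torch
import timm  # used here only to instantiate a standard ViT-Base/16 backbone

# Hypothetical file name; use the checkpoint downloaded from the link above
checkpoint = torch.load('vehiclemae_vit_base.pth', map_location='cpu')
state_dict = checkpoint.get('model', checkpoint)  # weights may be nested under a 'model' key

# ViT-Base/16 encoder without a classification head; add a task-specific head for fine-tuning
encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
missing, unexpected = encoder.load_state_dict(state_dict, strict=False)
print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
```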
```bash
# To pre-train VehicleMAE on a single GPU, run:
CUDA_VISIBLE_DEVICES=0 python main.py

# To pre-train VehicleMAE on multiple GPUs, run:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
```
We used full fine-tuning to test the pre-trained model on four downstream tasks. The results are shown in the table below.
| Method | Dataset | VAR mA | VAR Acc | VAR F1 | V-Reid mAP | V-Reid R1 | VFR Acc | VPS mIoU | VPS mAcc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scratch | - | 84.67 | 80.86 | 84.90 | 35.3 | 57.3 | 24.8 | 49.36 | 59.22 |
| MoCov3 | ImageNet-1K | 90.38 | 93.88 | 95.33 | 75.5 | 94.4 | 91.3 | 73.17 | 78.60 |
| DINO | ImageNet-1K | 89.92 | 91.09 | 93.11 | 64.3 | 91.5 | - | 68.43 | 73.37 |
| IBOT | ImageNet-1K | 89.51 | 90.17 | 92.37 | 68.9 | 92.6 | 81.1 | 66.03 | 71.06 |
| MAE | ImageNet-1K | 89.69 | 93.60 | 95.08 | 76.7 | 95.8 | 91.2 | 69.54 | 75.36 |
| MAE | Autobot1M | 90.19 | 94.06 | 95.43 | 75.5 | 95.4 | 91.3 | 69.00 | 75.36 |
| VehicleMAE | Autobot1M | 92.21 | 94.91 | 96.17 | 85.6 | 97.9 | 94.5 | 73.29 | 80.22 |
The four downstream tasks are vehicle attribute recognition (VAR), vehicle re-identification (V-Reid), vehicle fine-grained recognition (VFR), and vehicle part segmentation (VPS).
If you find this work helpful for your research, please cite the following paper and give us a star.
```bibtex
@misc{wang2023structural,
      title={Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception},
      author={Xiao Wang and Wentao Wu and Chenglong Li and Zhicheng Zhao and Zhe Chen and Yukai Shi and Jin Tang},
      year={2023},
      eprint={2312.09812},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
If you have any problems with this work, please leave an issue.