Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering (paper)
This repository is the official implementation of CMSA-MTPT
for the visual question answering task in the medical domain. Our model achieves 56.1 accuracy on open-ended questions and 77.3 on closed-ended questions on the VQA-RAD dataset. As of 2021-05-28, the proposed model achieves state-of-the-art results
on the VQA-RAD dataset. For details, please refer to the link.
The main contributor of this code is Guanqi Chen link. This repository is based on and inspired by @Jin-Hwa Kim's work and @Aizo-ai's work. We sincerely thank them for sharing their code.
Please cite this paper in your publications if it helps your research:
@inproceedings{gong2021cross,
author = {Haifan Gong and
Guanqi Chen and
Sishuo Liu and
Yizhou Yu and
Guanbin Li},
title = {Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical
Visual Question Answering},
booktitle = {{ICMR} '21: International Conference on Multimedia Retrieval, Taipei,
Taiwan, August 21-24, 2021},
pages = {456--460},
publisher = {{ACM}},
year = {2021},
doi = {10.1145/3460426.3463584},
}
You may also cite this work if it helps your research:
@article{gong2022vqamix,
title={VQAMix: Conditional Triplet Mixup for Medical Visual Question Answering},
author={Haifan Gong and Guanqi Chen and Mingzhi Mao and Zhen Li and Guanbin Li},
journal={IEEE Transactions on Medical Imaging},
year={2022}
}
Note: you should replace the original ImageNet-pretrained encoder with the multi-task pretrained encoder provided in the drive, or with one trained by yourself.
Overview of the proposed medical VQA model. Our method consists of four components (shown in different colors in the figure): an image feature extractor, a question encoder, a cross-modal self-attention (CMSA) module, and an answer predictor.
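As a rough illustration of how such a fusion module can work, here is a minimal PyTorch sketch of self-attention over concatenated image and question features. The dimensions, layer choices, and class name are assumptions for illustration only, not the implementation used in this repository.

```python
# Minimal sketch of a cross-modal self-attention block (illustrative only):
# image and question features are concatenated into one token sequence and
# fused with scaled dot-product self-attention. Dimensions, layer choices,
# and names are assumptions, not the exact implementation in this repository.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalSelfAttention(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, img_feats, ques_feats):
        # img_feats:  (B, N_img, dim) -- flattened spatial image features
        # ques_feats: (B, N_q,   dim) -- encoded question tokens
        tokens = torch.cat([img_feats, ques_feats], dim=1)  # (B, N_img + N_q, dim)
        q, k, v = self.query(tokens), self.key(tokens), self.value(tokens)
        attn = F.softmax(torch.bmm(q, k.transpose(1, 2)) / self.scale, dim=-1)
        fused = torch.bmm(attn, v)
        return tokens + fused  # residual connection


if __name__ == "__main__":
    cmsa = CrossModalSelfAttention(dim=1024)
    img = torch.randn(2, 49, 1024)   # e.g. a 7x7 feature map, flattened
    ques = torch.randn(2, 12, 1024)  # e.g. 12 question tokens
    print(cmsa(img, ques).shape)     # torch.Size([2, 61, 1024])
```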
Multi-Task Pre-Training: the model is jointly trained with an image understanding task and a question-image compatibility task. Depending on the dataset-specific image understanding task, the decoder can be selected as a fully convolutional network or a fully connected network.
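For intuition, the following is a minimal sketch of such a joint objective, assuming a segmentation-style image understanding head and a binary question-image matching head. The function names, loss terms, and weighting are illustrative assumptions, not the repository's actual pre-training code.

```python
# Rough sketch of a joint multi-task pre-training objective (illustrative only).
# The loss terms, decoder choice, and weighting are assumptions; see the paper
# for the actual pre-training setup.
import torch
import torch.nn.functional as F


def multitask_loss(seg_logits, seg_target, match_logits, match_target, alpha=1.0):
    """Combine an image-understanding loss with a question-image
    compatibility (matching) loss."""
    # Dataset-specific image understanding task, here a segmentation decoder
    # (fully convolutional head); a classification head would use the same
    # cross-entropy on class logits instead.
    understanding = F.cross_entropy(seg_logits, seg_target)
    # Binary question-image compatibility: does the question match the image?
    compatibility = F.binary_cross_entropy_with_logits(match_logits, match_target)
    return understanding + alpha * compatibility


if __name__ == "__main__":
    seg_logits = torch.randn(2, 3, 64, 64)         # (B, classes, H, W)
    seg_target = torch.randint(0, 3, (2, 64, 64))  # per-pixel labels
    match_logits = torch.randn(2)                  # one score per image-question pair
    match_target = torch.tensor([1.0, 0.0])        # matched / mismatched
    print(multitask_loss(seg_logits, seg_target, match_logits, match_target))
```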
Requirements: torch 1.0.1, torchvision 0.4.0a0, numpy 1.19.1, CUDA 9.1, GPU: GTX 1080
The processed data can be downloaded via the link with the extraction code tkm8. The downloaded file should be extracted to the data_RAD/ directory.
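As a quick sanity check after extraction, a snippet like the one below can be used. Note that the annotation file name is an assumption based on the usual VQA-RAD release and may differ from the actual archive contents.

```python
# Quick sanity check of the extracted data (illustrative only).
# The file name "data_RAD/trainset.json" is an assumption; adjust the path
# to match the actual archive contents.
import json

with open("data_RAD/trainset.json", "r") as f:
    samples = json.load(f)

print("type:", type(samples), "entries:", len(samples))
```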
The pretrained models are available at Baidu Drive (extraction code: 163k) or Google Drive.
The dataset for multi-task pre-training is available at Baidu Drive (extraction code: gow6) or Google Drive.
Run train.sh for training and test.sh for evaluation. The resulting JSON file can be found in the results/ directory.
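To inspect the produced predictions, a small script such as the one below can be used; the output file names and JSON structure are assumptions and may differ from what test.sh actually writes.

```python
# Inspect the prediction files produced by evaluation (illustrative only).
# The file names and JSON structure under results/ are assumptions; adjust
# them to match the actual output of test.sh.
import glob
import json

for path in glob.glob("results/*.json"):
    with open(path, "r") as f:
        predictions = json.load(f)
    print(path, "-", len(predictions), "predictions")
```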
MIT License
For up-to-date results, please refer to https://github.com/haifangong/VQAMix. If you have any problems, do not hesitate to contact us at haifangong@outlook.com. HCP Lab homepage: https://www.sysuhcp.com/