This repo contains the code for our paper "Rethinking Data Augmentation for Robust Visual Question Answering".
Our code builds on CSS-VQA; many thanks to its authors!
Make sure you are on a machine with an NVIDIA GPU and about 100 GB of disk space.
Python 3.6 with
h5py == 3.1.0
torch == 1.9.0
click == 7.1.2
numpy == 1.19.2
tqdm == 4.60.0
transformers == 4.8.2
clip == 1.0
- Download data
bash tools/download.sh
- Download faster rcnn features
Download feature1.zip and feature2.zip from Google Drive, then unzip and merge them into data/rcnn_feature/.
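The unzip-and-merge step can be scripted. The following is a minimal sketch, assuming both archives have already been downloaded locally. The helper name `merge_feature_zips` is hypothetical, and flattening the archive folders into a single directory is an assumption about what the loaders expect, not something this repo documents.

```python
import shutil
import zipfile
from pathlib import Path


def merge_feature_zips(zip_paths, target_dir="data/rcnn_feature"):
    """Extract every file from each archive into target_dir.

    Assumption: folders inside the archives are flattened, and file names
    are unique across archives. A name clash raises instead of overwriting.
    Returns the number of files written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = 0
    for zip_path in zip_paths:
        with zipfile.ZipFile(str(zip_path)) as archive:
            for member in archive.infolist():
                if member.is_dir():
                    continue
                out_path = target / Path(member.filename).name
                if out_path.exists():
                    raise FileExistsError(str(out_path))
                # Stream the member to disk so large feature files
                # are never held fully in memory.
                with archive.open(member) as src, open(str(out_path), "wb") as dst:
                    shutil.copyfileobj(src, dst)
                written += 1
    return written
```

Usage would look like `merge_feature_zips(["feature1.zip", "feature2.zip"])`, run from the repo root.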
- Download Images (For CLIP-based Filtering)
Create an Images folder and download the COCO images.
train2014: http://images.cocodataset.org/zips/train2014.zip
val2014: http://images.cocodataset.org/zips/val2014.zip
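If you prefer to fetch the images from Python, here is a rough sketch using only the standard library. The function name `download_coco_split` is hypothetical, and the location of the Images folder (relative to the repo root) is an assumption; adjust it to whatever the feature extraction script expects.

```python
import urllib.request
import zipfile
from pathlib import Path

# Standard COCO 2014 image archives.
COCO_ZIPS = {
    "train2014": "http://images.cocodataset.org/zips/train2014.zip",
    "val2014": "http://images.cocodataset.org/zips/val2014.zip",
}


def download_coco_split(split, image_root="Images"):
    """Download one COCO archive (if not already present) and unpack it.

    The archive is stored as <image_root>/<split>.zip and extracted
    in place. Returns the image_root path.
    """
    root = Path(image_root)
    root.mkdir(parents=True, exist_ok=True)
    archive_path = root / "{}.zip".format(split)
    if not archive_path.exists():
        urllib.request.urlretrieve(COCO_ZIPS[split], str(archive_path))
    with zipfile.ZipFile(str(archive_path)) as archive:
        archive.extractall(str(root))
    return root
```

Call `download_coco_split("train2014")` and `download_coco_split("val2014")` once each; both archives are large, so expect the downloads to take a while.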
- Process data
bash tools/process.sh
Data processing results may be inconsistent across Python versions. To use our pretrained models, download the processed results from here and move them to the data folder.
- Download extra data to train CSS (ID & OOD Teacher in KDDAug)
Download the *hintscore.json files from here and move them to the data folder.
- Create an aug_data folder to save the augmented data.
- For convenience, process the original dataset with the following steps:
- Prepare Original IQA Triplets
- Prepare Faster RCNN Detection Data
- Extract Nouns from Questions
Run command:
python process_original_dataset.py --dataset cpv2
python process_original_dataset.py --dataset v2
Example data after processing:
{
# IQA triplets
'q_id': 9001,
'img_id': 9,
'question': 'What color are the dishes?',
'answer_text': ['pink and yellow'],
'scores': [0.9],
# Faster RCNN Detection Results
'objects': ['broccoli', 'donut', 'container', 'meat', 'container', 'bowl', 'food'],
'attributes': ['green', '', 'plastic', '', '', '', ''],
# Meaningful nouns in Question
'nouns': ['dish']
}
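A quick sanity check on processed records can catch version-related inconsistencies early. The sketch below is not part of this repo: `check_record` is a hypothetical helper, and the two length invariants it tests (answers paired with scores, objects paired with attributes) are inferred from the example above rather than documented.

```python
REQUIRED_KEYS = ("q_id", "img_id", "question", "answer_text", "scores",
                 "objects", "attributes", "nouns")


def check_record(record):
    """Return a list of problems found in one processed record."""
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    # Assumed invariant: one score per answer.
    if len(record["answer_text"]) != len(record["scores"]):
        problems.append("answer_text and scores differ in length")
    # Assumed invariant: one attribute slot (possibly empty) per detected object.
    if len(record["objects"]) != len(record["attributes"]):
        problems.append("objects and attributes differ in length")
    return problems


example = {
    "q_id": 9001,
    "img_id": 9,
    "question": "What color are the dishes?",
    "answer_text": ["pink and yellow"],
    "scores": [0.9],
    "objects": ["broccoli", "donut", "container", "meat", "container", "bowl", "food"],
    "attributes": ["green", "", "plastic", "", "", "", ""],
    "nouns": ["dish"],
}
print(check_record(example))  # prints []
```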
- Extract question features (for generating paraphrased questions).
CUDA_VISIBLE_DEVICES=0 python extract_question_feature.py --dataset cpv2
CUDA_VISIBLE_DEVICES=0 python extract_question_feature.py --dataset v2
- Extract CLIP features for images (For CLIP-based Filtering).
CUDA_VISIBLE_DEVICES=0 python extract_clip_feature.py --dataset cpv2
CUDA_VISIBLE_DEVICES=0 python extract_clip_feature.py --dataset v2
- Yes/No Questions.
python generate_yesno.py --dataset cpv2
python generate_yesno.py --dataset v2
- Other Questions
python generate_other.py --dataset cpv2
python generate_other.py --dataset v2
- Color Questions
python generate_color.py --dataset cpv2
python generate_color.py --dataset v2
- Number Questions
python generate_number.py --dataset cpv2
python generate_number.py --dataset v2
- Paraphrasing Questions
CUDA_VISIBLE_DEVICES=0 python generate_paraphrasing.py --dataset cpv2
CUDA_VISIBLE_DEVICES=0 python generate_paraphrasing.py --dataset v2
CUDA_VISIBLE_DEVICES=0 python divide.py --dataset cpv2 --ratio 1.0
CUDA_VISIBLE_DEVICES=0 python divide.py --dataset v2 --ratio 1.0
ratio denotes the high-quality ratio.
Notice: even if ratio is set to 1.0, the code still generates a low_quality_dataset.pkl file.
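To illustrate why an (empty) low-quality file still appears at ratio 1.0, here is a minimal sketch of a ratio-based split. It is not the logic of divide.py: how samples are scored is not described here, so the `scores` argument, the rounding rule, and the file name high_quality_dataset.pkl are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def divide_by_ratio(samples, scores, ratio):
    """Split samples into (high, low); the top `ratio` fraction by score is high."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    cut = int(round(len(samples) * ratio))
    high = [samples[i] for i in order[:cut]]
    low = [samples[i] for i in order[cut:]]
    return high, low


def save_splits(high, low, out_dir):
    """Write both splits unconditionally, so the low file exists even if empty."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, data in (("high_quality_dataset.pkl", high),  # hypothetical name
                       ("low_quality_dataset.pkl", low)):
        with open(str(out / name), "wb") as handle:
            pickle.dump(data, handle)
```

With `ratio=1.0`, `divide_by_ratio` returns every sample in the first list and an empty second list, and `save_splits` still writes low_quality_dataset.pkl.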
- Pretrain a teacher model (CSS). Download one from CSS-VQA, or train a new LMH-CSS model with the command:
CUDA_VISIBLE_DEVICES=0 python main.py --dataset [cpv2/v2] --mode q_v_debias --debias learned_mixin --topq 1 --topv -1 --qvp 5 --output lmh_css --seed 2048
- Assign new answers.
# number
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name number --split high --teacher_path []
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name number --split low --teacher_path []
# other
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name other --split high --teacher_path []
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name other --split low --teacher_path []
# color
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name color --split high --teacher_path []
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name color --split low --teacher_path []
# paraphrasing
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name paraphrasing --split high --teacher_path []
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name paraphrasing --split low --teacher_path []
# yesno
CUDA_VISIBLE_DEVICES=0 python assign_answer.py --dataset [cpv2/v2] --name yesno --split low --teacher_path []
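The nine commands above differ only in --name and --split, so they can be generated in a loop. This is a convenience sketch, not a script shipped with the repo; `build_assign_commands` and `run_assign_commands` are hypothetical names, and the flags are copied verbatim from the commands above.

```python
import os
import subprocess

# (name, splits) pairs taken from the commands above.
# The commands list only the low split for yesno.
ASSIGN_JOBS = [
    ("number", ("high", "low")),
    ("other", ("high", "low")),
    ("color", ("high", "low")),
    ("paraphrasing", ("high", "low")),
    ("yesno", ("low",)),
]


def build_assign_commands(dataset, teacher_path):
    """Return one argument list per assign_answer.py invocation."""
    commands = []
    for name, splits in ASSIGN_JOBS:
        for split in splits:
            commands.append([
                "python", "assign_answer.py",
                "--dataset", dataset,
                "--name", name,
                "--split", split,
                "--teacher_path", teacher_path,
            ])
    return commands


def run_assign_commands(dataset, teacher_path, gpu="0"):
    """Run every command in order on the given GPU, stopping at the first failure."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for command in build_assign_commands(dataset, teacher_path):
        subprocess.run(command, check=True, env=env)
```

For example, `run_assign_commands("cpv2", "path/to/teacher")` runs all splits for VQA-CP v2, where the teacher path is whatever you would pass to --teacher_path.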
Merge all augmented data and save it to [cpv2/v2]_all_aug_dataset.pkl.
python merge.py --dataset [cpv2/v2]
Apply CLIP-based filtering and save the result to [cpv2/v2]_total_aug_dataset.pkl.
CUDA_VISIBLE_DEVICES=0 python filter.py --ratio 0.1 --dataset [cpv2/v2]
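For intuition, similarity-based filtering can be sketched as follows. This is not the implementation in filter.py: it assumes one precomputed image feature and one text feature per augmented sample, ranks samples by cosine similarity, and keeps the top fraction. Whether --ratio is the fraction kept or the fraction removed is not stated here, so the `keep_ratio` parameter is an assumption, as is the function name `clip_filter`. Plain Python lists are used to keep the sketch dependency-free; real feature arrays would normally be handled with vectorized operations.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clip_filter(image_feats, text_feats, keep_ratio):
    """Return the sorted indices of the samples kept after filtering.

    Sample i is scored by the cosine similarity between image_feats[i]
    and text_feats[i]; the ceil(N * keep_ratio) best-scoring samples stay.
    """
    scores = [cosine(img, txt) for img, txt in zip(image_feats, text_feats)]
    n_keep = math.ceil(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```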
- Train Backbone models
UpDn
Run command:
CUDA_VISIBLE_DEVICES=0 python main.py --dataset cpv2 --mode updn --debias none --output updn --seed 0
Alternatively, download our pretrained UpDn model from here.
LMH-CSS+
Download our pretrained LMH-CSS+ model from here.
- Finetune on Augmented dataset
Use [cpv2/v2]_all_aug_dataset.pkl if aug_name is set to all.
Use [cpv2/v2]_total_aug_dataset.pkl (after CLIP-based filtering) if aug_name is set to total.
CUDA_VISIBLE_DEVICES=0 python aug_main.py --backbone ./path/to/model --aug_name all --dataset cpv2 --output [] --seed 0
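The mapping from --aug_name to the augmented dataset file can be written down as a small helper, which is handy for checking that the expected file exists before launching a run. The function name `augmented_dataset_file` is hypothetical, and it returns only the file name, since the directory holding these files is not specified here.

```python
def augmented_dataset_file(dataset, aug_name):
    """Return the augmented dataset file name for a dataset and aug_name.

    "all" refers to the merged data, "total" to the data after
    CLIP-based filtering.
    """
    if dataset not in ("cpv2", "v2"):
        raise ValueError("dataset must be 'cpv2' or 'v2'")
    if aug_name not in ("all", "total"):
        raise ValueError("aug_name must be 'all' or 'total'")
    return "{}_{}_aug_dataset.pkl".format(dataset, aug_name)
```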
Our KDDAug model is available here