This directory contains the files for training and evaluating the vision-language pre-trained model VinVL. Most of the files are copied from the VinVL GitHub repo. We mainly modified or added the following files:
```
|- Oscar/
   |- oscar/
      |- run_gqa_prompt_mlm.py
      |- run_gqa_prompt_itm.py
      |- run_gqa_prompt_zero_few.py
      |- run_vqa_prompt_mlm.py
      |- run_vqa_prompt_itm.py
      |- utils/
         |- task_utils.py
```
- Refer to INSTALL for installation.
Please follow the steps below to configure the GQA data:
- Refer to DOWNLOAD to download the pre-processed GQA dataset. The downloaded data should contain the following files:
```
|- [DATA_ROOT]/gqa/
   |- gqa_bal_qla_train.json
   |- gqa_bal_qla_val.json
   |- gqa_all_qla_train.json
   |- gqa_all_qla_val.json
   |- gqa_all_qla_submission.json
   ...
```
- Download the corresponding declaration files and put them in the `gqa/` directory. The declaration files are downloaded from Baidu Yun (PSW:8888) (`data/vinvl/gqa/*_declarative.json`). These files contain one declarative sentence per line, which is used for later data loading and processing (a small inspection sketch follows the tree below). Please put these `*_declarative.json` files into the `gqa/` directory, resulting in the following directory tree:
```
|- [DATA_ROOT]/gqa/
   |- gqa_bal_qla_train.json
   |- gqa_bal_qla_val.json
   |- gqa_all_qla_train.json
   |- gqa_all_qla_val.json
   |- gqa_all_qla_submission.json
   |- gqa_bal_qla_train_declarative.json       # newly added
   |- gqa_bal_qla_val_declarative.json         # newly added
   |- gqa_all_qla_train_declarative.json       # newly added
   |- gqa_all_qla_val_declarative.json         # newly added
   |- gqa_all_qla_submission_declarative.json  # newly added
   ...
```
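For a quick sanity check, the declarative files can be peeked at with a few lines of Python. This is only a sketch: the per-line record layout is an assumption (the files are only described as holding one declarative sentence per line), so adapt the parsing to the actual data:
```python
import json

# Hypothetical inspection of a declarative file. The docs only state that each
# line carries one declarative sentence; whether a line is a plain JSON string
# or a keyed record is an assumption to verify against the downloaded files.
path = "[DATA_ROOT]/gqa/gqa_bal_qla_train_declarative.json"
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # look at the first few lines only
            break
        record = json.loads(line)
        print(type(record), record)
```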
Please follow the steps below to configure the VQA v2.0 data:
- Refer to DOWNLOAD to download the pre-processed VQA v2.0 dataset. The downloaded data should contain the following files:
```
|- [DATA_ROOT]/vqa/
   |- train2014_qla_mrcnn.json
   |- val2014_qla_mrcnn.json
   ...
```
- Download the corresponding declaration files and put them in the `vqa/` directory. The declaration files are downloaded from Baidu Yun (PSW:8888) (`data/vinvl/vqa/*_declarative.json`). Please put these `*_declarative.json` files into the `vqa/` directory, resulting in the following directory tree:
```
|- [DATA_ROOT]/vqa/
   |- train2014_qla_mrcnn.json
   |- val2014_qla_mrcnn.json
   |- train2014_declarative.json  # newly added
   |- val2014_declarative.json    # newly added
   ...
```
Please refer to DOWNLOAD to download the pre-trained VinVL base model (`checkpoint-2000000`). We also provide the model checkpoint in Baidu Yun (PSW:8888) (`data/model/vinvl/checkpoint-2000000`). Assume that `checkpoint-2000000` is placed in the directory `[MODEL_ROOT]`, resulting in `[MODEL_ROOT]/checkpoint-2000000/`.
Please follow the steps below to reproduce the GQA results (we take the balanced split as an example).
We first utilize the adapted masked language model (MLM) task for GQA fine-tuning:
- Training (MLM): Run the following code to train VinVL-DPT (MLM) on the balanced split:
```bash
python oscar/run_gqa_prompt_mlm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
If successful, the overall accuracy will reach up to ~62.7%. We also provide the fine-tuned model in Baidu Yun (PSW:8888) (`data/model/vinvl/vinvl_bal_mlm`).
- Validation (MLM): Evaluate on the GQA validation set using the fine-tuned model we provide (or the model in the `output_dir` `gqa_mlm`):
```bash
python oscar/run_gqa_prompt_mlm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_mlm \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Note that the `model_name_or_path` and `--do_val` arguments have been changed compared to the training stage.
- Testing and Submission (MLM): Test the fine-tuned model and submit the result file to the online evaluation website. Run the following code:
```bash
python oscar/run_gqa_prompt_mlm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_mlm \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Note that the `--do_test` argument has been changed compared to the validation stage.
Then, we apply the adapted image-text matching (ITM) task to solve the VQA problem. To achieve this, we need to obtain the top-k candidate answers predicted by the MLM task. Specifically, we pre-generate the prediction results of the MLM task:
- Pre-generate top-k results for training and validation:
```bash
python oscar/run_gqa_prompt_mlm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_train --do_generate --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
- Pre-generate top-k results for submission:
```bash
python oscar/run_gqa_prompt_mlm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 --output_dir gqa_mlm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_test --do_generate --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Note that the `--do_generate` argument has been added. In this way, three result files will be saved in `model_name_or_path`, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`. The files have the following data format:
```
{
    "[QID]": (np.ndarray([topk, ], np.int16),    # top-k answer indices
              np.ndarray([topk, ], np.float16)), # top-k answer scores
    ...
}
```
We also provide the result files in the fine-tuned checkpoint `vinvl_bal_mlm`.
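As a quick sanity check, the result files can be loaded with standard `pickle`. The snippet below is only a sketch built from the format above; decoding the indices through the `label2ans` mapping is an assumption based on the `--label2ans_file` argument, so verify it against the actual files:
```python
import pickle

# Load the pre-generated top-k MLM predictions (format shown above).
with open("data/model/vinvl/vinvl_bal_mlm/stage1.pkl", "rb") as f:
    topk = pickle.load(f)

# Each entry maps a question id to (top-k answer indices, top-k answer scores).
qid, (indices, scores) = next(iter(topk.items()))
print(qid, indices.shape, indices.dtype, scores.dtype)

# Presumably the indices can be decoded into answer strings through the
# label2ans mapping passed to the scripts (an assumption, verify locally).
with open("[DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl", "rb") as f:
    label2ans = pickle.load(f)
print([label2ans[int(i)] for i in indices])
```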
- Training (ITM): Equipped with the pre-generated top-k answers, we can apply ITM by running the following code:
```bash
python oscar/run_gqa_prompt_itm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_mlm/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Note that we need to load the checkpoint from the MLM task. We also provide the checkpoint in Baidu Yun (PSW:8888) (`data/model/vinvl/vinvl_bal_itm/`).
- Validation (ITM): Once the model is fine-tuned via ITM, we can validate it with the following code:
```bash
python oscar/run_gqa_prompt_itm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_val --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Note that the pre-generated result files, i.e., `stage1.pkl`, `stage1_eval.pkl`, and `stage1_submission.pkl`, should be copied to `data/model/vinvl/vinvl_bal_itm/` so that the code has access to the MLM results; a copy sketch follows below.
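A minimal sketch of that copy step, assuming the three files were generated into the MLM checkpoint directory `data/model/vinvl/vinvl_bal_mlm/` as described above:
```python
import shutil
from pathlib import Path

# Copy the pre-generated MLM results next to the ITM checkpoint so that
# run_gqa_prompt_itm.py can find them in its model_name_or_path.
src = Path("data/model/vinvl/vinvl_bal_mlm")
dst = Path("data/model/vinvl/vinvl_bal_itm")
for name in ("stage1.pkl", "stage1_eval.pkl", "stage1_submission.pkl"):
    shutil.copy(src / name, dst / name)
```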
- Testing and Submission (ITM): Please make sure that `stage1_submission.pkl` has been pre-generated or downloaded and placed in `model_name_or_path`. Then run the following code for testing:
```bash
python oscar/run_gqa_prompt_itm.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path data/model/vinvl/vinvl_bal_itm/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 2 --output_dir gqa_itm \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 1 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_test --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 2
```
Please follow the steps below to reproduce the results on VQA v2.0:
We first utilize the masked language model (MLM) task to fine-tune the model:
- Training (MLM): Run the following code to train VinVL-DPT (MLM):
```bash
python oscar/run_vqa_prompt_mlm.py -j 4 \
--img_feature_dim 2054 --max_img_seq_length 50 \
--data_label_type mask --img_feature_type faster_r-cnn \
--data_dir [DATA_ROOT]/vqa --model_type bert \
--model_name_or_path [MODEL_ROOT]/checkpoint-2000000 \
--task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 25 \
--output_dir vqa_mlm --label_file [DATA_ROOT]/vqa/trainval_ans2label.pkl \
--save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
--drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
--img_feat_format pt --classifier linear --cls_hidden_scale 3 \
--txt_data_dir [DATA_ROOT]/vqa
```
We also provide the checkpoint in Baidu Yun (PSW:8888) (`data/model/vinvl/vqa_mlm/`). Then, we pre-generate the top-k results of the MLM task via the following code:
```bash
python oscar/run_vqa_prompt_mlm.py -j 4 \
--img_feature_dim 2054 --max_img_seq_length 50 \
--data_label_type mask --img_feature_type faster_r-cnn \
--data_dir [DATA_ROOT]/vqa --model_type bert \
--model_name_or_path data/model/vinvl/vqa_mlm/ \
--task_name vqa_text --do_train --do_generate --do_lower_case --max_seq_length 158 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 25 \
--output_dir vqa_mlm --label_file [DATA_ROOT]/vqa/trainval_ans2label.pkl \
--save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
--drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
--img_feat_format pt --classifier linear --cls_hidden_scale 3 \
--txt_data_dir [DATA_ROOT]/vqa
```
Note that the `model_name_or_path` and `--do_generate` arguments have been changed. In this way, two result files are generated and saved in `model_name_or_path`, i.e., `stage1.pkl` and `stage1_eval.pkl`.
- Training (ITM): Run the following code to train the image-text matching (ITM) task for VQA:
```bash
python oscar/run_vqa_prompt_itm.py -j 4 \
--img_feature_dim 2054 --max_img_seq_length 50 \
--data_label_type mask --img_feature_type faster_r-cnn \
--data_dir [DATA_ROOT]/vqa --model_type bert \
--model_name_or_path data/model/vinvl/vqa_mlm/ \
--task_name vqa_text --do_train --do_lower_case --max_seq_length 158 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 32 \
--learning_rate 5e-05 --num_train_epochs 6 \
--output_dir vqa_itm --label_file [DATA_ROOT]/vqa/trainval_ans2label.pkl \
--save_epoch 1 --seed 88 --evaluate_during_training --logging_steps 4000 \
--drop_out 0.3 --weight_decay 0.05 --warmup_steps 0 --loss_type bce \
--img_feat_format pt --classifier linear --cls_hidden_scale 3 \
--txt_data_dir [DATA_ROOT]/vqa
```
We also provide the fine-tuned checkpoint in Baidu Yun (PSW:8888) (`data/model/vinvl/vqa_itm/`).
In the zero-shot and few-shot settings, zero or only a few samples (1~128) are used to fine-tune the model. Run the following code to split a `[K]`-shot training set for fine-tuning and evaluate on the whole validation set:
```bash
python oscar/run_gqa_prompt_zero_few.py \
-j 4 --img_feature_dim 2054 --max_img_seq_length 45 \
--data_dir [DATA_ROOT]/gqa/ \
--model_type bert \
--model_name_or_path [MODEL_ROOT]/checkpoint-2000000/ \
--task_name gqa --do_lower_case --max_seq_length 165 \
--per_gpu_eval_batch_size 32 --per_gpu_train_batch_size 1 \
--learning_rate 5e-05 --num_train_epochs 25 --output_dir gqa_subset \
--label_file [DATA_ROOT]/gqa/trainval_testdev_all_ans2label.pkl \
--img_feature_type faster_r-cnn --data_label_type all --train_data_type bal \
--eval_data_type bal \
--label2ans_file [DATA_ROOT]/gqa/trainval_testdev_all_label2ans.pkl \
--loss_type xe --save_epoch 10 --seed 88 --evaluate_during_training \
--logging_steps 4000 --drop_out 0.3 --do_train --weight_decay 0.05 --warmup_steps 0 \
--gradient_accumulation_steps 1 \
--num_examples [K] --subset_seed 0
```