This is the code for our paper by the same name. Link in the title.
This project was done for Stanford's CS 224N and CS 230.
Our model architecture is inspired by the winning entry of the 2017 VQA Challenge, which follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge".
License: MIT
This project uses code provided here.
We used the preprocessing and base code provided by the above link and then performed an extensive architecture and hyperparameter search.
Model | Validation Accuracy (%) | Training Time |
---|---|---|
Reported Model | 63.15 | 12-18 hours (Tesla K40) |
Our A3x2 Model | 64.78 | 4 hours (AWS g3.8xlarge instance, 2x M60) |
The accuracy was calculated using the VQA evaluation metric.
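For readers unfamiliar with the VQA evaluation metric, here is a minimal sketch of its commonly quoted simplified form: a predicted answer scores min(number of human annotators who gave that answer / 3, 1). The function names below are hypothetical and are not taken from this repository. The official evaluation script also normalizes answers (punctuation, articles, number words) and averages over subsets of annotators, which this sketch does not do.

```python
from collections import Counter


def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy for one question: min(#matches / 3, 1).

    Only lowercasing and whitespace stripping are applied here; the
    official script performs more extensive answer normalization.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions, all_human_answers):
    """Average the per-question scores and express the result in percent."""
    scores = [vqa_accuracy(p, h) for p, h in zip(predictions, all_human_answers)]
    return 100.0 * sum(scores) / len(scores)
```

Under this definition, an answer given by at least three annotators receives full credit, and an answer given by one or two receives partial credit.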
Check out our paper for the full implementation details and hyperparameter search. ArXiv link coming soon.
Make sure you are on a machine with an NVIDIA GPU, Python 2.7+, and about 70 GB of free disk space.
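Before downloading, it may help to confirm that the disk has enough room. The sketch below is not part of this repository; the helper names are hypothetical, and it assumes a Unix-like system because it relies on `os.statvfs`. It checks free space only and does not verify the GPU.

```python
import os

REQUIRED_GB = 70  # approximate disk requirement quoted in this README


def free_gb(path="."):
    """Return the free space, in GiB, available to the current user at `path`."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize / float(1024 ** 3)


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """True if `path` has at least `required_gb` GiB free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    if has_enough_space("."):
        print("Enough free disk space for the data.")
    else:
        print("Only %.1f GiB free; about %d GB is needed." % (free_gb("."), REQUIRED_GB))
```

Run it from the repository root so that the check applies to the disk that will hold the data/ directory.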
All data should be downloaded to a data/ directory in the root directory of this repository.
The easiest way to download the data is to run the provided script `tools/download.sh` from the repository root. If the script does not work, examine it and adapt its steps to your needs. Then run `tools/process.sh` from the repository root to convert the data to the correct format.
Run `python main.py` to start training. By default, this trains the best-performing model, A3x2. Other model variations can be run using the models flag. The training and validation scores are printed every epoch, and the best model is saved under the directory "saved_models". The default flags should reproduce the result in the table above.
Certain pretrained models are available upon request.
Please use the citation found at:
http://dblp.uni-trier.de/rec/bibtex/journals/corr/abs-1803-07724