Image-Captioning

An image captioning model that combines a pre-trained VGG-16 image encoder with an LSTM-based language decoder.

Here I build an image captioning model by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.

This is an implementation of the following paper:

Show and tell: A neural image caption generator, O. Vinyals, A. Toshev, S. Bengio, D. Erhan, CVPR, 2015.

which requires the steps detailed below.

1. Setup Image Encoder

Here I load the pre-trained VGG-16 model with weights trained on ImageNet. I also remove the final softmax layer, so the network ends at the fc2 layer, which produces a 4096-dimensional feature encoding for a given image i.
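A minimal sketch of this step, assuming the Keras implementation of VGG-16 that ships with TensorFlow (this README does not name the framework; the layer name "fc2" matches Keras' naming). The function name `build_encoder` is hypothetical and chosen only for illustration.

```python
def build_encoder(weights="imagenet"):
    """Return a model mapping a 224x224x3 image batch to its 4096-d fc2 activations.

    Assumption: TensorFlow/Keras is the framework. Passing weights=None
    builds the same architecture without downloading ImageNet weights.
    """
    # Imported inside the function so this sketch can be loaded
    # even where TensorFlow is not installed.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model

    # include_top=True keeps the fully connected layers fc1, fc2 and the
    # final softmax layer (named "predictions" in Keras).
    vgg = VGG16(weights=weights, include_top=True)

    # Cut the network at fc2, which drops the softmax classifier.
    return Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)
```

With this sketch, calling `build_encoder().predict(batch)` on images prepared with `tensorflow.keras.applications.vgg16.preprocess_input` should return an array of shape `(batch_size, 4096)`.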

2. Setup Language Decoder

Next comes the language decoder. The image encoding is passed as the initial hidden state of the first LSTM cell (i.e., h0 = xi, where xi is the encoding of image i). However, this only works if the hidden state has dimension 4096, which is far too large. To get a representation of more reasonable size, I insert a linear layer that projects the encoding from 4096 to 300 dimensions.
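The projection and the first decoder step can be sketched in plain NumPy. This is an illustration under stated assumptions, not the code from this repository: the weights are random stand-ins for learned parameters, the cell state is initialised to zeros (the README does not say how it is set), the word embedding size of 300 is assumed, and all names (`project_image`, `lstm_step`, and so on) are hypothetical.

```python
import numpy as np

FEATURE_DIM = 4096  # size of the fc2 image encoding
HIDDEN_DIM = 300    # size of the LSTM hidden state after projection
EMBED_DIM = 300     # assumed word embedding size (not stated in the README)

rng = np.random.default_rng(0)

# Random stand-ins for the learned linear projection 4096 -> 300.
W_proj = rng.normal(scale=0.01, size=(HIDDEN_DIM, FEATURE_DIM))
b_proj = np.zeros(HIDDEN_DIM)


def project_image(x):
    """Linear layer mapping a 4096-d image encoding to a 300-d vector."""
    return W_proj @ x + b_proj


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(word_vec, h_prev, c_prev, W, U, b):
    """One standard LSTM step; gates are stacked as [input, forget, output, candidate]."""
    z = W @ word_vec + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


x_i = rng.normal(size=FEATURE_DIM)  # stand-in for a real fc2 encoding
h0 = project_image(x_i)             # projected encoding used as initial hidden state
c0 = np.zeros(HIDDEN_DIM)           # assumption: zero initial cell state

# Random stand-ins for the LSTM parameters and the start-token embedding.
W = rng.normal(scale=0.01, size=(4 * HIDDEN_DIM, EMBED_DIM))
U = rng.normal(scale=0.01, size=(4 * HIDDEN_DIM, HIDDEN_DIM))
b = np.zeros(4 * HIDDEN_DIM)
start_vec = rng.normal(size=EMBED_DIM)

h1, c1 = lstm_step(start_vec, h0, c0, W, U, b)
```

Projecting before the LSTM keeps the recurrent weight matrix `U` at 1200 x 300 instead of 16384 x 4096, which is the practical reason a 4096-dimensional hidden state is avoided.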