-
Clone the repository by using
git clone
-
Download the required dataset (
Flickr8k
) from here -
Extract the dataset, and put it into the repository folder
-
The repository contains various files, execute
model.py
for creating, training the model by usingpython model.py
4.1 Update variables, that contain paths to different files and folders ( details mentioned in files ) in
data_process.py
,encoder.py
. -
For generating captions, execute
generate_caption.py
by usingpython generate_caption.py
5.1 Update the path of image for which caption is to be generated in
generate_caption.py
Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web. The description should not only contain the objects present in the image, but also how these objects relate to each other, and the activities they are involved in. Hence a language is also required other than the visual understanding of the image.
This paper provides a joint model that takes an image I as input, and is trained to maximize the likelihood p(S|I) of producing a target sequence of words S = {S1, S2, . . .} where each word St comes from a given dictionary, that describes the image adequately.
The inspiration for this paper was recent advancements in Machine Language Translation, where we train a Recurrent Neural Network, based on encoder
and decoder
. The encoder extracts the information (in a vector
) of the input sentence ( say in English language ), and then the decoder predicts the target sentence ( say in French language ) based on the encoder output vector.
In machine translation, state-of-art results can be directly obtained by maximizing the probability of correct translation, given the input sequence in end to end fashion both at training and inference time. RNN's are used to encode variable length sequence inputs to a fixed dimension vector ( which stores the semantic information of the sequence ) and to decode this vector to generate the desired output sequence.
Following this approach, the captions can be decoded from the image as :
where θ are the parameters of our model, I is an image, and S its correct transcription. Since S represents any sentence, its length is unbounded. Thus, it is common to apply the chain rule to model the joint probability over S0, . . . , SN , where N is the length of this particular example as :
It is natural to model p(St|I, S0, . . . , St−1) with a Recurrent Neural Network (RNN), where the variable number of words we condition upon up to t − 1 is expressed by a fixed length hidden state or memory ht. We choose to implement LSTM ( Long-Short Term Memory ) net, which has its own state-of-art performance for such NLP problems.
For image representation we use Convolutional Neural Networks. They are widely used and are currently state-of-art for object recognition and detection tasks. In this paper we will use pre-trained model InceptionV3 (paper can be found here) to extarct various features of the images. We will neglect the last layer of the model ( which is softmax prediction used in image classification task ) because we need a dense vector of image storing all the information of the image.
In this implementation pre-trained, non-trainable model on InceptionV3 is used.
For generating the desired output sequence, we use a particular form of Recurrent Neural Networks called Long-Short Term Memory (LSTM's) net.
The LSTM model is trained to predict each word of the sentence after it has seen the image as well as all preceding words as defined by p(St|I, S0, . . . , St−1) . The image is sent into the model as initial_states
of LSTM. There are 2 initial_states in LSTM cell, hidden state ( used for softmaxe prediction ) and a cell state ( used for memorizing old data ).
- Encoding the image, to get a vector representation of the image.
- Obtaining a dense vector from encoder output with dimensions matching the LSTM hidden states dimensions.
- Initialising decoder's initial state as the dense vector containing image information.
During training, the input data and output data of decoder differs in one time-step, so that given a sequence, decoder should predict the next word in the sequence.
This model has been trained on kaggle (notebook can be found here), for 25 epochs, and it took approximately 30 seconds per epoch to train (using GPU).
The loss function varies with epochs as follows:
Predicted Caption : man wearing a red headband and a leather outfit standing outside
Predicted Caption : girl and a boy and a dog are walking next to a dog seen from a dock
Predicted Caption : are another man on a boat