This deliverable details the results of task T2.4, in which we have compiled and trained a library of models for enriching images with meta-data using recent advances in state-of-the-art deep learning techniques.
Deep-learning is a subset of machine learning techniques (Bishop 2006) which aims to perform simultaneous feature extraction and model training from labelled or unlabelled examples. Typically models are either feedforward or recurrent neural networks consisting of multiple layers of alternating linear-transformation followed by non-linear activation functions, hence the term "deep" (Goodfellow et al. 2016).
The roots of deep-learning can be traced back to the early days of computer science, with the first multi-layer neural networks going back to at least as early as Ivakhnenko & Lapa (1967). However it has not been until recent years that advances in data collection, parallel architectures and computational power have enabled large scale deep-learning, performing tasks such as image recognition (Krizhevsky et al. 2012), language modeling (Sutskever et al. 2011) and machine translation (Sutskever et el. 2014) close to human-level performance.
More formally, given input and output data
Such that
It is possible to prove that a simple "two-layered" perceptron is sufficient to express the multitude of all such functions, given conditions on differentiability and compactness of the domain of the function (Cybenko 1989):
However in principle, any number of "layers" may be applied to approach the task of finding a map between the two spaces:
A text-book application of deep-learning, and indeed machine learning, is computer vision (Goodfellow et al. 2016). Computer vision is the research field which pursues the goal (among others) of replicating the functionality of the human visual system using computer modelling and programs. In the early days of computer vision, methods were often based on painstakingly hand-crafted feature engineering using heuristics and extensive trial and error (Lowe 1999). Krizhevsky et al. (2012) showed that these features may be learnt more easily, achieving vastly superior performance, in a data driven way using neural networks with several important innovations over traditional neural networks:
- Many data points
$$(x_i, y_i)$$ should be seen in training. This number of data points may only be handled by a "stochastic gradient descent algorithm" and parallel computation on a graphical processing unit (GPU), with data-parallelism over "mini-batches" of data points. - Many more layers of alternating linear functions and non-linearities should be learnt than the traditional two or three layers.
- Explosion in the number of parameters should be restricted through the extensive use of "convolutional" layers in the initial stages of the neural network, i.e. tiled layerwise spatial filtering with learnt filter coefficients and expansion of the RGB channel space.
- Gradients should be forced to be well behaved through these layers using a "rectified linear-unit" non-linearity:
$$\sigma(x) = \text{max}(0, x)$$ - Available data should be augmented using random affine transformations to avoid overfitting on the available training data.
In analogy to computer vision, natural language processing (NLP) is the research field which pursues the goal (among others) of replicating human linguistic ability using computer modelling and programs. As for computer vision, in the early days of NLP, methods were often based on painstakingly hand-crafted feature engineering using heuristics and extensive trial and error ([Liddy (2001)). Work on deep learning showed that these features may be learnt more easily, achieving superior performance, in a data driven way using recurrent neural networks (RNNs). Critical here are the following features in learning:
- Some technique must be used to combat vanishing/ exploding gradients encountered in e.g. traditional Elman RNNs. The prototypical method to achieve this is using a gated recurrent architecture, such as the LSTM (Hochreiter & Schidhüber 2001).
- The use of attentional mechanisms is known to improve performance and learning (Graves et al. 2014).
- As in computer vision, mini-batch learning with stochastic gradient descent must be used to combat the large dataset sizes.
- Parallelism across mutliple GPUs is necessary to tackle the large size of linguistic datasets.
- Various dropout techniques (ensembling methods applied to neural networks) may be applied in the case of smaller dataset sizes.
Fashion is in many senses the prototypical and paradigmatic application of deep learning and computer vision in industry. The essence of fashion products is typically only captured by high resolution photos (or indeed videos), with companies such as Zalando SE (FB partner) displaying as many as seven high quality images of products as well as short video snippets of models demonstrating the products. The desideratum for deep-learning powered computer vision, that data be plentifully available, is additional exemplified in e-commerce companies such as Zalando SE, insofar as typically of the order of
The business case for deep-learning in fashion stems from the need to process and categorize this vast amount of product data and glean algorithms to further recommend products to suitable customers, intelligently index product data for search and perform high level tasks typically performed at great cost by domain experts, such as outfit construction.
In this deliverable we provide a schematic codebase and illustrative applications of deep-learning applied to fashion images. The final months of the work package will be used applying this schematic to the data obtained in T2.1 and T2.2. Using data from the historical Zalando catalogue already provided in the Fashion Project, we have trained a deep neural networks to perform:
-
Colour classification
-
Brand classification
-
Clothing type classification
-
Image annotation
Our computer vision architecture is a classical 50 layered residual network ("resnet50") (He et al. 2015). Residual networks have become the accepted architecture for performing image classification in computer vision due to their smooth gradients, accurate performance and modularity. See the image below for a schematic of a 34-layered residual network ("resnet34").
For image annotation we use a gated recurrent unit (GRU) architecture, with initial hidden state initialized by the output of a resnet50:
Code is available here for download: https://github.com/zalandoresearch/fb-model-library
A typical call to train a model image library takes the form:
python classify.py \
--mode train \
--model_type <brand/color/pool> \
--checkpoint <save-path> \
--lr <learning-rate> \
--n_epochs <number-of-epochs> \
> <save-path>/log
Visualization of training-progress as displayed in the log file may be invoked by:
python classify.py \
--mode plot \
--checkpoint <save-path>
The trained model may be tested by:
python classify.py \
--mode test \
--checkpoint <save-path>
At the highest level, the classification model, which is written in the Pytorch framework, is summarised as follows:
class Classify(torch.nn.Module):
def __init__(self, n_classes, finetune=False):
torch.nn.Module.__init__(self)
self.cnn = resnet()
self.linear = torch.nn.Linear(2048, n_classes)
self.finetune = finetune
def forward(self, x):
if self.finetune:
return self.linear(self.cnn.features(x))
else:
return self.linear(self.cnn.features(x).detach())
Hence the nuts and bolts of the convolutional neural network are contained in the function resnet
.
Training proceeds by:
python annotate.py \
--mode train \
--catalog <dataset> \
--checkpoint <save-path> \
--lr <learning-rate> \
--n_epochs <number-of-epochs> \
> <save-path>/log
The core model in this task is:
class Annotator(torch.nn.Module):
def __init__(self, n_vocab):
torch.nn.Module.__init__(self)
self.rnn = torch.nn.GRU(64, 512, 1, batch_first=False)
self.embed = torch.nn.Embedding(n_vocab, 64)
self.project = torch.nn.Linear(512, n_vocab)
self.init = torch.nn.Linear(2048, 512)
def forward(self, x, y, hidden=None):
if hidden is None:
hidden = self.init(x)
hidden = hidden[None, :, :]
hidden, _ = self.rnn(self.embed(y), hidden)
return self.project(hidden), hidden
def annotate(self, x, start, finish):
last = torch.LongTensor([start])
y = [start]
while True:
if len(y) == 1:
out, hidden = self(x[None, :], last[None, :])
else:
out, hidden = self(x[None, :], last[None, :], hidden)
last = out.squeeze().topk(1)[1]
y.append(last.item())
if y[-1] == finish:
break
return y
Bishop, C. M. (2006). Machine learning and pattern recognition. Information Science and Statistics. Springer, Heidelberg. [url]
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). Cambridge: MIT press. [url]
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on (Vol. 2, pp. 1150-1157)
Ivakhnenko, A. G., & Lapa, V. G. (1967). Cybernetics and forecasting techniques. [url]
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). [url]
Sutskever, I., Martens, J., & Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 1017-1024). [url]
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112). [url]
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314. [url]
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). [url]
Liddy, E. D. (2001). Natural language processing.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines. arXiv preprint arXiv:1410.5401.