- `./download_models.sh`: downloads the pretrained VGG19 model provided by [9].
- `./download_images.sh`: downloads the content and style images.
- `python3 neural_image_style_transfer.py`: runs the experiments; the code is built upon [9].
Since Gatys et al. proposed an image style transfer algorithm using convolutional neural networks (CNNs) in 2015 ([1.1], [1.2]), many subsequent works have extended their method and produced various results. Jing et al. ([2]) suggest dividing these methods into three categories: Maximum-Mean-Discrepancy (MMD)-based descriptive methods, Markov-Random-Field (MRF)-based descriptive methods, and generative methods. In this project we focus on MMD-based methods, which achieve style transfer by gradient descent on an output image. We reproduce the results of [1.1], implement the "activation shift" idea from [3] to improve transfer quality, and then give further insight into the arguments in [4] through both mathematical proofs and experiments.
We briefly review the algorithm in [1.1] in order to set up the mathematical notation, which is mainly adapted from [4] and [5] and will be used throughout this project.
Given a content image $x_c$ and a style image $x_s$, we would like to generate an output image $x_o$ consisting of the content of $x_c$ and the style of $x_s$. $x_o$ shares the same height and width as $x_c$ and is initialized with the same pixel values as $x_c$. Let $x_c$, $x_s$, and $x_o$ be passed through a pretrained VGG19 CNN ([6]). We denote the heights of the feature maps of $x_c$, $x_s$, and $x_o$ in layer $l$ of the CNN by $h_c^l$, $h_s^l$, and $h_o^l$, respectively, and the widths by $w_c^l$, $w_s^l$, and $w_o^l$. Let $m_z^l = h_z^l \times w_z^l$ for all $z \in \{c, s, o\}$, and let $n^l$ be the number of filters in layer $l$. We then denote the rearranged feature maps of $x_c$, $x_s$, and $x_o$ in layer $l$ by matrices $P^l \in M_{n^l \times m_c^l}$, $S^l \in M_{n^l \times m_s^l}$, and $F^l \in M_{n^l \times m_o^l}$, respectively. $P^l_{i,\,w_c^l k + q + 1}$, with $0 \le k < h_c^l$ and $0 \le q < w_c^l$, is the value at the $(k+1)$-th row and $(q+1)$-th column of the $i$-th feature map of $x_c$ in layer $l$; $S^l$ and $F^l$ are viewed similarly. We minimize the loss function $L = L_c + L_s$ by backpropagation through the CNN and gradient descent on $x_o$ for hundreds of epochs, where

$$L_c = \sum_l \frac{a^l}{n^l m_c^l} \sum_{i=1}^{n^l} \sum_{k=1}^{m_c^l} \left(F^l_{ik} - P^l_{ik}\right)^2$$

is the content loss,

$$L_s = \sum_l \frac{b^l}{(n^l)^2} \sum_{i=1}^{n^l} \sum_{j=1}^{n^l} \left(G^l_{ij} - A^l_{ij}\right)^2$$

is the style loss,

$$A^l = \frac{1}{m_s^l} S^l (S^l)^T \qquad\text{and}\qquad G^l = \frac{1}{m_o^l} F^l (F^l)^T$$

are the averaged Gramian matrices, and $a^l$ and $b^l$ are user-specified weights.
In what follows, we omit the superscript $l$ when it is not needed.
The meaning of the Gramian matrices is clearly pointed out in [7], section 1: $A_{ij} = \frac{1}{m_s} \sum_{k=1}^{m_s} S_{ik} S_{jk}$ indicates how often the $i$-th and $j$-th features appear together; $G$ is understood similarly.
The content loss takes care of "where" a feature appears in a feature map. The style loss, on the other hand, emphasizes "how often" distinct features appear together but does not care "where" they co-occur.
- Though [1.1] assumes that the heights and widths of $x_c$ and $x_s$ are identical, they are allowed to differ; there is no such requirement in [5].
- In the literature, the Gramian matrix comes in two forms: $A = SS^T$ (most works) and $A = S^TS$ ([5]). The former is more consistent with the APIs of deep learning libraries, while the latter is the standard form ([10]). We use the former.
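For concreteness, the following is a minimal PyTorch sketch of the per-layer content and style losses defined above, using the $A = SS^T$ form. The function and argument names are ours (this is not Gatys' implementation [9]); the total losses $L_c$ and $L_s$ are obtained by summing these terms over the chosen layers with weights $a^l$ and $b^l$.

```python
import torch

def gram(feat):
    # averaged Gramian A = (1/m) X X^T for a rearranged feature map of shape (n, m)
    n, m = feat.shape
    return feat @ feat.t() / m

def content_loss(F, P, a=1.0):
    # F, P: (n, m_c) rearranged feature maps of x_o and x_c in one layer l
    n, m_c = F.shape
    return a / (n * m_c) * ((F - P) ** 2).sum()

def style_loss(F, S, b=1.0):
    # F: (n, m_o) features of x_o, S: (n, m_s) features of x_s; m_o and m_s may differ
    n = F.shape[0]
    return b / n ** 2 * ((gram(F) - gram(S)) ** 2).sum()
```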
In this project we use 3 content images and 3 style images as inputs and get 3 × 3 = 9 output images in each experiment.
Gatys kindly provides his implementation of [1.1] on [9]. All the extensions below are built upon Gatys' code.
In this part, we simply run Gatys' code to generate style-transferred images. It can be observed that there is "ghosting" in some results, as stated in [8].
Risser et al. argue that ghosting occurs because multiple sets of pixel values can produce similar feature maps when passed through the CNN, and some of these sets look like ghosting ([8]). Novak et al. give a related argument in [3], section 3.3, and suggest that an "activation shift" can reduce the ambiguity among candidate sets of pixel values. Their modification is the following: instead of letting

$$A = \frac{1}{m_s} S S^T \qquad\text{and}\qquad G = \frac{1}{m_o} F F^T,$$

let

$$A = \frac{1}{m_s} (S + sU)(S + sU)^T \qquad\text{and}\qquad G = \frac{1}{m_o} (F + sU)(F + sU)^T,$$

where $s$ is a user-specified scalar and $U$ is the all-ones matrix (with its size varying to meet the need).
The explanation of this modification is provided in [3], and we do not restate it here. We simply implement it and examine its performance as $s$ varies. It can be observed that the ghosting disappears as $|s|$ increases: activation shift removes ghosting.
Gramian matrix with activation shift. Value of s from top to bottom: -600, -500, -400, -300, -200, -100, 0, 100, 200, 300, 400, 500, 600.
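The modification itself is a one-line change to the Gramian computation. A minimal sketch (our naming, not the code of [9]):

```python
import torch

def shifted_gram(feat, s):
    # (1/m) (X + sU)(X + sU)^T: add the scalar shift s to every activation
    # before forming the averaged Gramian; s = 0 recovers the original loss of [1.1]
    n, m = feat.shape
    shifted = feat + s
    return shifted @ shifted.t() / m

def shifted_style_loss(F, S, s, b=1.0):
    # per-layer style loss with activation shift, following [3]
    n = F.shape[0]
    return b / n ** 2 * ((shifted_gram(F, s) - shifted_gram(S, s)) ** 2).sum()
```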
Li et al. and Risser et al. regard each column $S_{\cdot k}$ of $S$ and $F_{\cdot k}$ of $F$ as generated from "style" probability distributions $D_s$ and $D_o$, respectively ([4], [8]). Minimizing the Gramian-matrix-based style loss $L_s$ is one way to match $D_o$ to $D_s$.
Li et al. further argue that minimizing $L_s$ can be interpreted as minimizing the MMD with a quadratic kernel. We slightly modify their proof and present it here.
$$
\begin{aligned}
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(G_{ij}-A_{ij}\right)^2
&=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\Big(\tfrac{1}{m_o}FF^T\Big)_{ij}-\Big(\tfrac{1}{m_s}SS^T\Big)_{ij}\right)^2\\
&=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{m_o}\sum_{k=1}^{m_o}F_{ik}F_{jk}-\frac{1}{m_s}\sum_{k=1}^{m_s}S_{ik}S_{jk}\right)^2\\
&=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\Big(\frac{1}{m_o}\sum_{k=1}^{m_o}F_{ik}F_{jk}\Big)^2+\Big(\frac{1}{m_s}\sum_{k=1}^{m_s}S_{ik}S_{jk}\Big)^2-2\Big(\frac{1}{m_o}\sum_{k=1}^{m_o}F_{ik}F_{jk}\Big)\Big(\frac{1}{m_s}\sum_{k=1}^{m_s}S_{ik}S_{jk}\Big)\right)\\
&=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}F_{ik_1}F_{jk_1}F_{ik_2}F_{jk_2}+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}S_{ik_1}S_{jk_1}S_{ik_2}S_{jk_2}-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}F_{ik_1}F_{jk_1}S_{ik_2}S_{jk_2}\right)\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}F_{ik_1}F_{jk_1}F_{ik_2}F_{jk_2}+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}S_{ik_1}S_{jk_1}S_{ik_2}S_{jk_2}-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}F_{ik_1}F_{jk_1}S_{ik_2}S_{jk_2}\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\Big(\frac{1}{n}\sum_{i=1}^{n}F_{ik_1}F_{ik_2}\Big)^2+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\Big(\frac{1}{n}\sum_{i=1}^{n}S_{ik_1}S_{ik_2}\Big)^2-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\Big(\frac{1}{n}\sum_{i=1}^{n}F_{ik_1}S_{ik_2}\Big)^2\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\Big(\tfrac{1}{n}F_{\cdot k_1}^T F_{\cdot k_2}\Big)^2+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\Big(\tfrac{1}{n}S_{\cdot k_1}^T S_{\cdot k_2}\Big)^2-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\Big(\tfrac{1}{n}F_{\cdot k_1}^T S_{\cdot k_2}\Big)^2\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}K(F_{\cdot k_1},F_{\cdot k_2},2)+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}K(S_{\cdot k_1},S_{\cdot k_2},2)-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}K(F_{\cdot k_1},S_{\cdot k_2},2),\qquad(1)
\end{aligned}
$$

where $K(v,u,p)=\left(\frac{1}{n}v^T u\right)^p$ is the averaged power kernel.
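As a sanity check of equation (1), the following standalone snippet (random tensors, arbitrary small sizes; all names are ours) confirms numerically that the two sides agree:

```python
import torch

torch.manual_seed(0)
n, m_o, m_s = 8, 20, 30                      # arbitrary small sizes
F = torch.randn(n, m_o, dtype=torch.double)  # stands in for a rearranged feature map of x_o
S = torch.randn(n, m_s, dtype=torch.double)  # stands in for a rearranged feature map of x_s

# Gramian side: (1/n^2) * sum_ij (G_ij - A_ij)^2
G = F @ F.t() / m_o
A = S @ S.t() / m_s
gram_side = ((G - A) ** 2).sum() / n ** 2

# MMD side with the quadratic kernel K(v, u, 2) = ((1/n) v^T u)^2
K_ff = (F.t() @ F / n) ** 2                  # (m_o, m_o) -- the memory bottleneck mentioned below
K_ss = (S.t() @ S / n) ** 2                  # (m_s, m_s)
K_fs = (F.t() @ S / n) ** 2                  # (m_o, m_s)
mmd_side = K_ff.sum() / m_o**2 + K_ss.sum() / m_s**2 - 2 * K_fs.sum() / (m_o * m_s)

print(torch.allclose(gram_side, mmd_side))   # True: both sides of (1) agree
```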
Theoretically we can use any positive number as the power $p$ of the kernel, but computing the MMD directly requires $S^TS$ and $F^TF$, which are of size $m_s \times m_s$ and $m_o \times m_o$. This exhausts memory. Thus, we need to convert the MMD side back to the Gramian side if we want to use a power kernel with a different $p$. We now show that the MMD with an integer-power kernel is easy to convert. The following is the general version of equation (1).
Let p be a nonnegative integer.
$$
\begin{aligned}
&\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}K(F_{\cdot k_1},F_{\cdot k_2},p)+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}K(S_{\cdot k_1},S_{\cdot k_2},p)-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}K(F_{\cdot k_1},S_{\cdot k_2},p)\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\Big(\tfrac{1}{n}F_{\cdot k_1}^T F_{\cdot k_2}\Big)^p+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\Big(\tfrac{1}{n}S_{\cdot k_1}^T S_{\cdot k_2}\Big)^p-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\Big(\tfrac{1}{n}F_{\cdot k_1}^T S_{\cdot k_2}\Big)^p\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\Big(\frac{1}{n}\sum_{i=1}^{n}F_{ik_1}F_{ik_2}\Big)^p+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\Big(\frac{1}{n}\sum_{i=1}^{n}S_{ik_1}S_{ik_2}\Big)^p-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\Big(\frac{1}{n}\sum_{i=1}^{n}F_{ik_1}S_{ik_2}\Big)^p\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\Big(\prod_{q=1}^{p}F_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}F_{i_q k_2}\Big)+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\Big(\prod_{q=1}^{p}S_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}S_{i_q k_2}\Big)-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\Big(\prod_{q=1}^{p}F_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}S_{i_q k_2}\Big)\\
&=\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\left(\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}\Big(\prod_{q=1}^{p}F_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}F_{i_q k_2}\Big)+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}\Big(\prod_{q=1}^{p}S_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}S_{i_q k_2}\Big)-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}\Big(\prod_{q=1}^{p}F_{i_q k_1}\Big)\Big(\prod_{q=1}^{p}S_{i_q k_2}\Big)\right)\\
&=\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\left(\Big(\frac{1}{m_o}\sum_{k=1}^{m_o}\prod_{q=1}^{p}F_{i_q k}\Big)^2+\Big(\frac{1}{m_s}\sum_{k=1}^{m_s}\prod_{q=1}^{p}S_{i_q k}\Big)^2-2\Big(\frac{1}{m_o}\sum_{k=1}^{m_o}\prod_{q=1}^{p}F_{i_q k}\Big)\Big(\frac{1}{m_s}\sum_{k=1}^{m_s}\prod_{q=1}^{p}S_{i_q k}\Big)\right)\\
&=\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\left(\frac{1}{m_o}\sum_{k=1}^{m_o}\prod_{q=1}^{p}F_{i_q k}-\frac{1}{m_s}\sum_{k=1}^{m_s}\prod_{q=1}^{p}S_{i_q k}\right)^2.\qquad(2)
\end{aligned}
$$
Equation (2) can be interpreted as follows: MMD with a power-$p$ kernel is equivalent to the mean squared error of the frequencies of co-occurrence of all ordered $p$-tuples of features. The case $p = 3$ is also mentioned in [3], section 4.5.
Although we can write down the generalized Gramian side for any nonnegative integer $p$, a direct implementation still runs out of memory for $p \ge 2$ ($p = 2$ is feasible only through the ordinary Gramian matrices of equation (1)). Unfortunately, only $p = 1$ and $p = 2$ are practical.
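For illustration, equation (2) can be checked numerically on tiny random tensors, where the $n^p$-term sum is still affordable (all names and sizes below are ours):

```python
import itertools
import torch

torch.manual_seed(0)
n, m_o, m_s, p = 5, 7, 9, 3                  # tiny sizes so the n^p sum is affordable
F = torch.randn(n, m_o, dtype=torch.double)
S = torch.randn(n, m_s, dtype=torch.double)

# MMD side with the averaged power kernel K(v, u, p) = ((1/n) v^T u)^p
K_ff = (F.t() @ F / n) ** p
K_ss = (S.t() @ S / n) ** p
K_fs = (F.t() @ S / n) ** p
mmd = K_ff.sum() / m_o**2 + K_ss.sum() / m_s**2 - 2 * K_fs.sum() / (m_o * m_s)

# generalized Gramian side of equation (2): sum over all n^p index tuples (i_1, ..., i_p)
gen = torch.zeros((), dtype=torch.double)
for idx in itertools.product(range(n), repeat=p):
    f = F[list(idx), :].prod(dim=0).mean()   # (1/m_o) * sum_k prod_q F_{i_q k}
    s = S[list(idx), :].prod(dim=0).mean()   # (1/m_s) * sum_k prod_q S_{i_q k}
    gen = gen + (f - s) ** 2
gen = gen / n ** p

print(torch.allclose(mmd, gen))              # True: the two sides of (2) agree
```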
For $p = 1$, we can further simplify equation (2).
$$
\begin{aligned}
\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\left(\frac{1}{m_o}\sum_{k=1}^{m_o}\prod_{q=1}^{p}F_{i_q k}-\frac{1}{m_s}\sum_{k=1}^{m_s}\prod_{q=1}^{p}S_{i_q k}\right)^2
&=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{m_o}\sum_{k=1}^{m_o}F_{ik}-\frac{1}{m_s}\sum_{k=1}^{m_s}S_{ik}\right)^2\\
&=\frac{1}{n}\sum_{i=1}^{n}\big(\operatorname{mean}(F_{i\cdot})-\operatorname{mean}(S_{i\cdot})\big)^2.\qquad(3)
\end{aligned}
$$
MMD with a linear kernel is equivalent to the mean squared error of the frequencies of occurrence of each individual feature.
This is much easier to compute than what equation (2) suggests in general. We call this approach the "mean vector" approach, as opposed to the original Gramian matrix.
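A minimal sketch of the resulting per-layer mean-vector style loss (our naming; tensor shapes as before):

```python
import torch

def mean_vector_style_loss(F, S, b=1.0):
    # equation (3): MMD with a linear kernel reduces to the squared error between
    # the per-filter mean activations ("mean vectors") of x_o and x_s
    n = F.shape[0]
    return b / n * ((F.mean(dim=1) - S.mean(dim=1)) ** 2).sum()
```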
In this part we show that the style loss using mean vectors does capture some aspects of the style.
By substituting $F$ and $S$ in equation (2) with $F + sU$ and $S + sU$, we have

$$
\begin{aligned}
&\frac{1}{n^p}\sum_{i_1=1}^{n}\cdots\sum_{i_p=1}^{n}\left(\frac{1}{m_o}\sum_{k=1}^{m_o}\prod_{q=1}^{p}(F_{i_q k}+s)-\frac{1}{m_s}\sum_{k=1}^{m_s}\prod_{q=1}^{p}(S_{i_q k}+s)\right)^2\\
&=\frac{1}{m_o^2}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_o}K(F_{\cdot k_1}+sU,F_{\cdot k_2}+sU,p)+\frac{1}{m_s^2}\sum_{k_1=1}^{m_s}\sum_{k_2=1}^{m_s}K(S_{\cdot k_1}+sU,S_{\cdot k_2}+sU,p)-\frac{2}{m_o m_s}\sum_{k_1=1}^{m_o}\sum_{k_2=1}^{m_s}K(F_{\cdot k_1}+sU,S_{\cdot k_2}+sU,p),\qquad(4)
\end{aligned}
$$

where $U$ here denotes the all-ones column vector of length $n$.
Letting $p = 2$, we find that the left-hand side is the style loss using Gramian matrices with activation shift $s$, and the right-hand side is the MMD with a "shifted" quadratic kernel. Thus we can reinterpret the conclusion of part 1: the style loss using MMD with a shifted quadratic kernel removes ghosting.
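The $p = 2$ case of equation (4) can again be checked numerically on small random tensors (a standalone sketch with arbitrary sizes and shift):

```python
import torch

torch.manual_seed(0)
n, m_o, m_s, s = 8, 20, 30, 100.0            # arbitrary sizes and an arbitrary shift
F = torch.randn(n, m_o, dtype=torch.double)
S = torch.randn(n, m_s, dtype=torch.double)
Fs, Ss = F + s, S + s                        # activation shift applied elementwise

# LHS of (4) with p = 2: style loss using Gramian matrices with activation shift s
lhs = ((Fs @ Fs.t() / m_o - Ss @ Ss.t() / m_s) ** 2).sum() / n ** 2

# RHS of (4): MMD with the "shifted" quadratic kernel
rhs = ((Fs.t() @ Fs / n) ** 2).sum() / m_o**2 \
    + ((Ss.t() @ Ss / n) ** 2).sum() / m_s**2 \
    - 2 * ((Fs.t() @ Ss / n) ** 2).sum() / (m_o * m_s)

print(torch.allclose(lhs, rhs))              # True: matches equation (4) for p = 2
```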
In part 2, we mentioned that minimizing the Gramian-matrix-based style loss is one way to match $D_o$ to $D_s$. There are other ways to describe $D_o$ and $D_s$, and therefore other ways to match them. In this part, we try two such descriptions, the variance vector and the covariance matrix, whose style losses are, respectively,

$$L_s = \sum_l \frac{b^l}{n^l} \sum_{i=1}^{n^l} \big( \operatorname{var}(F^l_{i\cdot}) - \operatorname{var}(S^l_{i\cdot}) \big)^2
\qquad\text{and}\qquad
L_s = \sum_l \frac{b^l}{(n^l)^2} \sum_{i=1}^{n^l} \sum_{j=1}^{n^l} \big( \operatorname{cov}(D_o^l)_{ij} - \operatorname{cov}(D_s^l)_{ij} \big)^2.$$
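Minimal per-layer sketches of these two style losses (our naming; `torch.cov` treats rows as variables and columns as observations, which matches our $(n, m)$ layout):

```python
import torch

def variance_vector_style_loss(F, S, b=1.0):
    # squared error between the per-filter activation variances of x_o and x_s
    n = F.shape[0]
    return b / n * ((F.var(dim=1, unbiased=False) - S.var(dim=1, unbiased=False)) ** 2).sum()

def covariance_style_loss(F, S, b=1.0):
    # squared error between the covariance matrices of D_o and D_s,
    # estimated from the columns of F and S
    n = F.shape[0]
    return b / n ** 2 * ((torch.cov(F, correction=0) - torch.cov(S, correction=0)) ** 2).sum()
```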
We now examine their performance. It can be observed that the style loss using covariance matrices produces good results without ghosting, while the one using variance vectors easily results in ghosting. It can also be seen that some styles are easy to transfer well (introducing little ghosting) while others are not; for example, Vincent van Gogh's "The Starry Night" is a relatively easy one.
- We can also add activation shifts to mean vectors, variance vectors, and covariance matrices, but if we write down the mathematical formulas of the style losses and do some algebra, we find that the activation shifts eventually cancel out (a quick numerical check is given after this list). Thus activation shifts make sense only for Gramian matrices.
- Style losses using linear combinations ([4], section 4.2) of mean vectors, Gramian matrices, variance vectors, and covariance matrices are possible, but the goal of this project is to examine the ability of each probability description to transfer style, so we do not run such tests.
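A quick numerical illustration of the cancellation claim above (standalone, random tensors; all names are ours):

```python
import torch

torch.manual_seed(0)
n, m = 6, 50
F = torch.randn(n, m, dtype=torch.double)
S = torch.randn(n, m, dtype=torch.double)
s = 123.0                                    # an arbitrary activation shift

# mean vectors: the shift cancels in the difference mean(F_i.) - mean(S_i.)
d0 = F.mean(dim=1) - S.mean(dim=1)
d1 = (F + s).mean(dim=1) - (S + s).mean(dim=1)
print(torch.allclose(d0, d1))                                # True

# variances and covariances are invariant to a constant shift, so s vanishes entirely
print(torch.allclose(F.var(dim=1), (F + s).var(dim=1)))      # True
print(torch.allclose(torch.cov(F), torch.cov(F + s)))        # True
```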
We see in the experiments above that two style losses produce good results: the one using shifted Gramian matrices and the one using covariance matrices. We regard the latter as the more elegant approach, since the former requires choosing one more parameter, namely the activation shift $s$. However, there are some cases in which the former outperforms the latter.
When doing this project, we found that striking a balance among the loss weights $a^l$ and $b^l$ is nontrivial. There is no solid theory to guide the tuning, and we feel that the balance is struck only occasionally. This may be why people seek other approaches, such as feed-forward methods, to achieve neural image style transfer.
[1.1] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. 2016.
[1.2] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. 2015. Preprint version of [1.1].
[2] Y. Jing, Y. Yang, Z. Feng, J. Ye, and M. Song. Neural style transfer: a review. 2017.
[3] R. Novak and Y. Nikulin. Improving the neural algorithm of artistic style. 2016.
[4] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. 2017.
[5] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. 2016.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014.
[7] Y. Nikulin and R. Novak. Exploring the neural algorithm of artistic style. 2016.
[8] E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. 2017.
[9] PyTorch implementation of [1.1] by Gatys. https://github.com/leongatys/PytorchNeuralStyleTransfer
[C.1] Ludwig van Beethoven. https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Beethoven.jpg/1200px-Beethoven.jpg
[C.2] Notre-Dame de Paris. http://www.newlooktravel.tn/photo/3-80-NewLookTravel_Paris_Cathedrale-Notre-Dame-nuit.jpg
[C.3] Opening of symphony no. 5 by Ludwig van Beethoven. https://www.bostonglobe.com/arts/music/2012/11/17/beethoven-fifth-symphony-warhorse-for-all-times/UjG7gHBTD5z5qB843gYZZO/story.html
[S.1] The starry night by Vincent van Gogh. https://raw.githubusercontent.com/leongatys/PytorchNeuralStyleTransfer/master/Images/vangogh_starry_night.jpg
[S.2] Fawkes by Randal Roberts. https://www.allofthisisforyou.com/images/2012-fawkes-randal-roberts_580.jpg
[S.3] Ice. http://eskipaper.com/images/ice-5.jpg
[10] Gramian matrix. https://en.wikipedia.org/wiki/Gramian_matrix