- Introduction
- Dataset
- Model Architectures
- Loss Functions
- Metrics
- Common Hyper-parameters
- SRCNN
- SRResNet
This is a comparative study of different models on the computer vision task of super-resolution. Super-resolution is the task of upscaling a low-resolution image into a high-resolution one by a given factor (2× in this study).
I used the Oxford-IIIT Pet Dataset to train the models.
I used two separate model architectures for this:
- A vanilla Super Resolution Convolutional Neural Network (SRCNN).
- A Super Resolution Residual Network (SRResNet).
I paired each of them with three different loss functions to compare the results:
- MSE Loss
- Perceptual Loss with a pre-trained VGG-16
- MSE Loss with Weighted Perceptual Loss.
Finally, the metrics I used to evaluate these models were:
- Peak Signal-to-Noise Ratio (PSNR)
- Structural Similarity Index Measure (SSIM)
I used the Oxford-IIIT Pet Dataset for training the models. It contains roughly 7,400 images of cats and dogs spanning 37 breeds, with about 200 images per breed.
SRCNN Model Architecture: the low-resolution input is first upsampled with bicubic interpolation, then refined at the target resolution by the convolutional stack:
```
Sequential(
  (0): Conv2d(3, 64, kernel_size=(9, 9), stride=(1, 1), padding=(4, 4))
  (1): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Conv2d(32, 3, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (4): Sigmoid()
)
```
Model summary with a (1, 3, 300, 300) input:
```
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
SRCNN [1, 3, 600, 600] --
├─Sequential: 1-1 [1, 3, 600, 600] --
│ └─Conv2d: 2-1 [1, 64, 600, 600] 15,616
│ └─Conv2d: 2-2 [1, 32, 600, 600] 51,232
│ └─Conv2d: 2-3 [1, 32, 600, 600] 9,248
│ └─Conv2d: 2-4 [1, 3, 600, 600] 2,403
│ └─Sigmoid: 2-5 [1, 3, 600, 600] --
==========================================================================================
Total params: 78,499
Trainable params: 78,499
Non-trainable params: 0
Total mult-adds (G): 28.26
==========================================================================================
```
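For concreteness, here is a minimal sketch of how such a module could be defined; the class and attribute names are my reconstruction from the printout, and (as in the printed Sequential) no activations appear between the convolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Sketch reconstructed from the printed summary (names are illustrative)."""

    def __init__(self, upscale_factor: int = 2):
        super().__init__()
        self.upscale_factor = upscale_factor
        self.layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4),   # 15,616 params
            nn.Conv2d(64, 32, kernel_size=5, padding=2),  # 51,232 params
            nn.Conv2d(32, 32, kernel_size=3, padding=1),  # 9,248 params
            nn.Conv2d(32, 3, kernel_size=5, padding=2),   # 2,403 params
            nn.Sigmoid(),                                 # outputs in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bicubic interpolation brings the input to the target resolution;
        # the convolutions then refine it (300x300 -> 600x600 for factor 2).
        x = F.interpolate(x, scale_factor=self.upscale_factor,
                          mode="bicubic", align_corners=False)
        return self.layers(x)
```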
SRResNet Model Architecture: all convolutions run at the input resolution, and upsampling happens at the end through a sub-pixel (PixelShuffle) layer:
```
Sequential(
  (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (1): Sequential(
    (0): ResidualBlock(
      (conv1): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1))
      (relu): ReLU(inplace=True)
    )
    (1): ResidualBlock(
      (conv1): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1))
      (relu): ReLU(inplace=True)
    )
    (2): ResidualBlock(
      (conv1): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1))
      (relu): ReLU(inplace=True)
    )
    (3): ResidualBlock(
      (conv1): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1))
      (conv2): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1))
      (relu): ReLU(inplace=True)
    )
  )
  (2): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): Sequential(
    (0): SubPixelConv(
      (conv): Conv2d(32, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (relu): ReLU(inplace=True)
      (pixle_shuffle): PixelShuffle(upscale_factor=2)
    )
  )
  (4): Conv2d(32, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): Sigmoid()
)
```
Model summary with a (1, 3, 300, 300) input:
```
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
SRResNet [1, 3, 600, 600] --
├─Sequential: 1-1 [1, 3, 600, 600] --
│ └─Conv2d: 2-1 [1, 64, 300, 300] 4,864
│ └─Sequential: 2-2 [1, 64, 300, 300] --
│ │ └─ResidualBlock: 3-1 [1, 64, 300, 300] 4,192
│ │ └─ResidualBlock: 3-2 [1, 64, 300, 300] 4,192
│ │ └─ResidualBlock: 3-3 [1, 64, 300, 300] 4,192
│ │ └─ResidualBlock: 3-4 [1, 64, 300, 300] 4,192
│ └─Conv2d: 2-3 [1, 32, 300, 300] 18,464
│ └─Sequential: 2-4 [1, 32, 600, 600] --
│ │ └─SubPixelConv: 3-5 [1, 32, 600, 600] 36,992
│ └─Conv2d: 2-5 [1, 3, 600, 600] 867
│ └─Sigmoid: 2-6 [1, 3, 600, 600] --
==========================================================================================
Total params: 77,955
Trainable params: 77,955
Non-trainable params: 0
Total mult-adds (G): 7.25
==========================================================================================
```
Note: I kept both models at roughly the same parameter count (≈78K each, per the summaries above) to make the comparison fairer.
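For reference, a minimal sketch of the two custom blocks from the summary above. The parameter counts match the printout; the forward passes and the identity skip connection are my assumptions based on standard residual and sub-pixel designs:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1x1-conv bottleneck block: 64 -> 32 -> 64 channels (4,192 params)."""

    def __init__(self, channels: int = 64, hidden: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return x + out  # identity skip (assumed; standard for residual blocks)

class SubPixelConv(nn.Module):
    """Conv expands 32 -> 128 channels (36,992 params); PixelShuffle folds the
    extra channels into a 2x larger spatial grid, giving 32 channels at 2x size."""

    def __init__(self, in_channels: int = 32, upscale_factor: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels * upscale_factor ** 2,
                              kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.pixle_shuffle = nn.PixelShuffle(upscale_factor)  # name as printed

    def forward(self, x):
        return self.pixle_shuffle(self.relu(self.conv(x)))
```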
MSE Loss: vanilla pixel-wise mean-squared loss between the upscaled image and the high-resolution reference.
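For completeness (my notation, matching the perceptual loss below): with $\hat{y}_i$ and $y_i$ the $i$-th pixel of the upscaled and reference images and $N$ the total number of pixels,
$$\mathcal{L}_{MSE}=\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2$$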
Perceptual Loss: feature-wise mean-squared loss, computed on feature maps extracted from several layers of a pretrained classification model.
$$\mathcal{L}_{Perceptual}=\sum_{l}\frac{1}{h^{(l)}\times w^{(l)}\times c^{(l)}}||\mathbf{f}_r^{(l)}-\mathbf{f}_u^{(l)}||^2_2$$ where $\mathbf{f}_r^{(l)}$ and $\mathbf{f}_u^{(l)}$ are the feature maps of the reference and the upscaled image at layer $l$, and $h^{(l)}$, $w^{(l)}$, $c^{(l)}$ are the height, width, and number of channels of that layer's output.
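A minimal sketch of how this can be computed with a frozen VGG-16, using the feature-extraction layer indices (3, 8, 15, 29) reported in the results below; the names are mine, and I omit the ImageNet mean/std normalization the original pipeline may have applied:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor; gradients still flow through the activations
# back to the upscaled image.
_vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(upscaled: torch.Tensor, reference: torch.Tensor,
                    layers=(3, 8, 15, 29)) -> torch.Tensor:
    loss, f_u, f_r = 0.0, upscaled, reference
    for i, layer in enumerate(_vgg):
        f_u, f_r = layer(f_u), layer(f_r)
        if i in layers:
            # F.mse_loss averages over h * w * c, matching the formula above.
            loss = loss + F.mse_loss(f_u, f_r)
        if i == max(layers):
            break
    return loss
```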
MSE + Weighted Perceptual Loss: a weighted combination of the two, $$\mathcal{L}_{Perceptual\text{-}MSE}=\mathcal{L}_{MSE}+\lambda\mathcal{L}_{Perceptual}$$ where $\lambda$ is the perceptual loss weight.
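In code this is a one-liner on top of the sketch above, with $\lambda$ = 0.001 as reported in the results:

```python
import torch.nn.functional as F

LAMBDA = 1e-3  # perceptual loss weight (lambda) from the experiments below

def combined_loss(upscaled, reference):
    # Pixel-wise MSE plus the weighted feature-space term.
    return F.mse_loss(upscaled, reference) + LAMBDA * perceptual_loss(upscaled, reference)
```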
PSNR quantifies the level of distortion or noise introduced by the image-processing pipeline by comparing the pixel-level error against the maximum possible signal; higher values indicate a reconstruction closer to the reference.
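For reference, PSNR is derived from the pixel-wise MSE; here $\mathrm{MAX}$ is the largest possible pixel value (1.0 in this setup, since the models end in a Sigmoid):
$$\mathrm{PSNR}=10\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathcal{L}_{MSE}}\right)$$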
Unlike traditional metrics, SSIM is designed to provide a more perceptually relevant measure by taking into account changes in structural information, luminance, and contrast. For two image patches $\mathbf{x}$ and $\mathbf{y}$:
$$\mathrm{SSIM}(\mathbf{x},\mathbf{y})=\frac{(2\mu_{\mathbf{x}}\mu_{\mathbf{y}}+C_1)(2\sigma_{\mathbf{x}\mathbf{y}}+C_2)}{(\mu_{\mathbf{x}}^2+\mu_{\mathbf{y}}^2+C_1)(\sigma_{\mathbf{x}}^2+\sigma_{\mathbf{y}}^2+C_2)}$$ where,
- $\mu_{\mathbf{x}}$ and $\mu_{\mathbf{y}}$ are the average intensities of $\mathbf{x}$ and $\mathbf{y}$.
- $\sigma_{\mathbf{x}}^2$ and $\sigma_{\mathbf{y}}^2$ are the variances of $\mathbf{x}$ and $\mathbf{y}$.
- $\sigma_{\mathbf{x}\mathbf{y}}$ is the covariance of $\mathbf{x}$ and $\mathbf{y}$.
- $C_1$ and $C_2$ are small constants that stabilize the division.
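Both metrics are available off the shelf; here is a small sketch using scikit-image with hypothetical stand-in arrays (the original evaluation code may differ):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical stand-ins for a reference image and a model output:
# (H, W, 3) floats in [0, 1].
rng = np.random.default_rng(0)
hr = rng.random((300, 300, 3))
sr = np.clip(hr + rng.normal(scale=0.05, size=hr.shape), 0.0, 1.0)

print("PSNR:", peak_signal_noise_ratio(hr, sr, data_range=1.0))
print("SSIM:", structural_similarity(hr, sr, data_range=1.0, channel_axis=-1))
```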
Before delving into the performance of each model, let's look at the hyper-parameters common to all runs (a preprocessing sketch follows the list):
- Upscale Factor: 2
- Image size: (None, 3, 300, 300). I performed random cropping and padding to bring all images to the same size.
- Optimizer: Adam
- Learning Rate: 1e-4
- Epochs: 10
- Batch Size: 32
- Train, Val, Test Split: 0.8, 0.1, 0.1
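A sketch of how the cropping/padding and LR/HR pairing could look; the exact transform, sizes, and ordering are my assumptions (I read the stated (3, 300, 300) size as the low-resolution network input, so the HR target is 600×600, matching the model summaries):

```python
from torchvision import transforms
from torchvision.transforms import functional as TF

UPSCALE = 2
LR_SIZE = 300                 # stated model input size
HR_SIZE = LR_SIZE * UPSCALE   # 600x600 target, matching the summaries above

# Random-crop (padding smaller images as needed) to a fixed HR size.
crop = transforms.Compose([
    transforms.RandomCrop(HR_SIZE, pad_if_needed=True),
    transforms.ToTensor(),  # floats in [0, 1], matching the models' Sigmoid output
])

def make_pair(img):
    hr = crop(img)  # high-resolution target
    # Synthesize the low-resolution input by bicubic downscaling.
    lr = TF.resize(hr, [LR_SIZE, LR_SIZE],
                   interpolation=transforms.InterpolationMode.BICUBIC)
    return lr, hr
```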
SRCNN Results:
MSE Loss:
- PSNR: 26.1032 dB
- SSIM: 0.7731
Perceptual Loss:
- Perceptual Model: VGG-16
- Feature Extraction Layers: (3, 8, 15, 29)
- PSNR: 21.6850 dB
- SSIM: 0.7605
MSE + Weighted Perceptual Loss:
- Perceptual Model: VGG-16
- Feature Extraction Layers: (3, 8, 15, 29)
- Perceptual Loss Weight ($\lambda$): 0.001. I chose this value so that the MSE loss and the perceptual loss contribute approximately equal weight.
- PSNR: 26.0233 dB
- SSIM: 0.7797
SRResNet Results:
MSE Loss:
- PSNR: 28.2696 dB
- SSIM: 0.7989
Perceptual Loss:
- Perceptual Model: VGG-16
- Feature Extraction Layers: (3, 8, 15, 29)
- PSNR: 22.7536 dB
- SSIM: 0.5199
MSE + Weighted Perceptual Loss:
- Perceptual Model: VGG-16
- Feature Extraction Layers: (3, 8, 15, 29)
- Perceptual Loss Weight ($\lambda$): 0.001
- PSNR: 28.8381 dB
- SSIM: 0.8170