Skip to content
Alexandru Dinu edited this page Jan 4, 2019 · 27 revisions

The models are inspired from [1],[2]. They are just experiments I have tried and do not fully implement the architectures described in the aforementioned papers.

Training

The input consists of 720p images from the YouTube-8M dataset (credit goes to gsssrao for the downloader and frame generator scripts). The entire dataset that was used for training consists of 121,827 frames.

The images are padded to 1280x768 (i.e. 24,24 height pad), so that they can be split into 60 128x128 patches. The model only gets to see a singular patch per forward pass (i.e. there are 60 forward passes and optimization steps for an image). The loss is computed (per patch) as MSELoss(orig_patch_ij, out_patch_ij), and we have an average loss per image.

Models and architecture

Model Latent size Compressed size
cae_16x8x8_zero_pad_bin 16x8x8 7.5KB
cae_16x8x8_refl_pad_bin 16x8x8 7.5KB
cae_16x16x16_zero_pad_bin 16x16x16 30KB
cae_32x32x32_zero_pad_bin 32x32x32 240KB

All models implement stochastic binarization [2], that is, the encoded representation is in binary format. The number of bits per patch is given by the latent shape, and the compressed size will be 60 * bits_per_patch / 8 / 1024 KB.

The benefits of stochastic binarization, as mentioned in [2], are:

(1) bit vectors are trivially serializable/deserializable for image transmission over the wire,

(2) control of the network compression rate is achieved simply by putting constraints on the bit allowance

(3) a binary bottleneck helps force the network to learn efficient representations compared to standard floating-point layers, which may have many redundant bit patterns that have no effect on the output.

Since the best performing model is cae_32x32x32_zero_pad_bin, we only describe its (high-level) architecture. Output shapes and block descriptions can be found in the code.

encoder:
x => conv1 --> conv2 --> enc_block1 --> (+) --> enc_block2 --> (+) --> enc_block3 --> (+) --> conv3 (tanh) => enc
                 |-----------------------^   |------------------^   |------------------^

decoder:
enc => up_conv1 --> dec_block1 --> (+) --> dec_block2 --> (+) --> dec_block3 --> (+) --> up_conv2 --> up_conv3 (tanh) => x
           |------------------------^   |------------------^   |------------------^

Note: Currently, smoothing (i.e. linear interpolation in smoothing.py) is used in order to account for the between-patches noisy areas due to padding (this still needs further investigation).

References

[1] https://arxiv.org/abs/1703.00395

[2] http://arxiv.org/abs/1511.06085

Clone this wiki locally