April 2023

Paper: G. E. Hinton et al.: Transforming Auto-encoders (2011)

Link: https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf

A discussion paper about apparent fundamental limitations of CNN layers and how to overcome them with a new type of layer called a "capsule". The approach is interesting in that it should yield a model that is interpretable by design, but it requires specialized training data (image pairs related by a known transformation): fairly easy to render and generate synthetically, yet very difficult to label on real-world pictures. The paper stays at a high-level, abstract discussion, without going into implementation details or providing concrete results or limitations.
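
To make the "capsule" concrete, here is a minimal PyTorch sketch of a single translation capsule in the spirit of the paper: recognition units infer a position and a presence probability for one visual entity, and generation units redraw that entity at an externally supplied shifted position. Layer sizes and names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TranslationCapsule(nn.Module):
    """One capsule (sketch): recognition units infer (x, y) and a presence
    probability p for its visual entity; generation units redraw the entity
    at the shifted position (x + dx, y + dy), gated by p.
    Layer sizes are illustrative, not from the paper."""

    def __init__(self, in_dim, hidden=10):
        super().__init__()
        self.recognise = nn.Sequential(nn.Linear(in_dim, hidden), nn.Sigmoid())
        self.to_xy = nn.Linear(hidden, 2)                              # inferred position
        self.to_p = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # presence probability
        self.generate = nn.Sequential(nn.Linear(2, hidden), nn.Sigmoid(),
                                      nn.Linear(hidden, in_dim))

    def forward(self, image_flat, dxdy):
        h = self.recognise(image_flat)
        xy = self.to_xy(h)
        p = self.to_p(h)
        # Generation units see the *shifted* pose; the output is gated by presence.
        return p * self.generate(xy + dxdy)
```

The full transforming auto-encoder sums the outputs of many such capsules and is trained to reconstruct the shifted image from (image, known shift) pairs, which is exactly the specialized training data mentioned above.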

Related (to-explore):

Video: WelcomeAIOverlords: How DeepMind uses Graph Networks for fluid simulation

Link: https://www.youtube.com/watch?v=JSed7OBasXs

Provides a high-level view to get an appreciation for the work. It doesn't go into much technical detail, especially about how the graph edges are efficiently rebuilt as the system evolves and particles move around, which is what enables inter-particle interactions. The interview with Jonathan Godwin also stays high-level, but does point out that the model's weakness is simulating large rigid bodies: graph networks struggle to pass information between the two ends of a large rigid body quickly enough.
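
For intuition, below is a minimal NumPy sketch (function names and the MLP interfaces are assumptions, not DeepMind's implementation) of the two pieces the video glosses over: rebuilding the neighbourhood graph as particles move, and one message-passing step over it. Information travels only one edge per step, which is why large rigid bodies are hard for this kind of model.

```python
import numpy as np

def build_edges(positions, radius):
    """Connect particles closer than `radius`; O(n^2) purely for clarity
    (a real simulator would use a spatial hash or k-d tree)."""
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    src, dst = np.nonzero((d < radius) & (d > 0))
    return src, dst

def message_passing_step(node_feats, positions, src, dst, edge_mlp, node_mlp):
    """One graph-network block: an edge update from the two endpoint nodes,
    then a node update from the sum of incoming messages.
    `edge_mlp` and `node_mlp` are assumed to be learned callables."""
    rel = positions[src] - positions[dst]                  # relative displacement feature
    messages = edge_mlp(np.concatenate(
        [node_feats[src], node_feats[dst], rel], axis=-1))
    agg = np.zeros((node_feats.shape[0], messages.shape[-1]))
    np.add.at(agg, dst, messages)                          # aggregate per receiving particle
    return node_mlp(np.concatenate([node_feats, agg], axis=-1))
```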

Related (to-explore):

Video: 3Blue1Brown: Mathematical foundations of Convolution

Link: https://www.youtube.com/watch?v=KuXjwB4LzSA

A video with great visualizations for building a mathematical appreciation of the subject. It also gives a teaser on how the FFT can be used to speed up convolution, with links to other great videos.
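
As a quick check on that teaser, the NumPy snippet below (sizes are illustrative) shows that convolution computed via the FFT matches direct convolution up to floating-point error, while scaling as O(n log n) instead of O(n·m).

```python
import numpy as np

x = np.random.randn(1024)
k = np.random.randn(128)

# Direct convolution: O(n * m)
direct = np.convolve(x, k, mode="full")

# FFT-based convolution: pointwise product in the frequency domain, O(n log n)
n = len(x) + len(k) - 1
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, fft_based)
```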

Paper: G. E. Hinton et al.: Knowledge Distillation (2015)

Link: https://arxiv.org/abs/1503.02531

Goal: Transfer knowledge from a larger model (say an ensemble) to a smaller model, effectively and efficiently.

Core idea: A "student" model learns from the "soft" labels generated by a larger "teacher" model, instead of the ground-truth ("hard") labels (see the loss sketch after the list below).

  • Having the student model see a richer view of the similarity structure through soft labels seems to help transfer inductive biases (think: generalizing assumptions) from the teacher to the student.
  • Useful for compressing knowledge from an ensemble model into a single model
  • The student model was observed to learn and generalize well from soft labels using only a fraction of the training data (e.g. 3% of the original train set)
  • The main limitations are the computational cost of training the large teacher and generating soft labels, plus the storage cost of those soft labels
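
As referenced above, a minimal PyTorch sketch of the distillation loss, assuming logits from both models; the temperature T and mixing weight alpha here are illustrative values, not the paper's recommendations.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Blend of the soft-label term (KL divergence between temperature-softened
    teacher and student distributions) and the usual hard-label cross-entropy.
    The T**2 factor keeps the soft term's gradient magnitudes comparable
    across temperatures, as noted in the paper."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```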

Further reading:

Paper: A. Dosovitskiy et al.: Vision Transformers (ViT) (2020)

Link: https://arxiv.org/abs/2010.11929

  • The input image is split into small patches of fixed size (see the sketch after this list)
  • Each patch is projected into D dimensions using a trainable linear transformation
  • Alternatively, the image could be passed through CNN layers first and patches could be formed from the CNN feature maps
  • The patch embeddings (+ position embeddings) are fed into the transformer encoder, prepended with a learnable [CLS] token embedding (similar to BERT)
  • The model is pre-trained only on a classification task (unlike BERT, which has two pre-training tasks)
  • TODO Experiments section needs a better look
  • Can the model handle images of different aspect ratios?
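
As referenced in the list above, a minimal PyTorch sketch of the patch-embedding front end, assuming ViT-Base-like sizes (16×16 patches, D = 768); the stride-16 convolution is just a convenient way to implement the per-patch linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into fixed-size patches, linearly project each patch,
    prepend a learnable [CLS] token, and add position embeddings.
    The result is what gets fed into the transformer encoder."""

    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A stride-`patch` conv is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                               # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)     # (B, n_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos    # (B, n_patches + 1, dim)
```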

Further reading: