Link: https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf
A discussion paper on apparent fundamental limitations of CNN layers and how to overcome them with a new type of layer called a "capsule". This is interesting in that it should yield a model that is interpretable by design, but it requires specialized training data: fairly easy to render and generate synthetically, yet very difficult to label on real-world pictures. The paper only offers a high-level, abstract discussion, without going into implementation details or providing concrete results or limitations.
Related (to-explore):
- Paper: V. Mazzia et al., Efficient-CapsNet (2021)
- Article: Kalman filter's use in vehicle position estimation
Link: https://www.youtube.com/watch?v=JSed7OBasXs
Provides a high-level view to get an appreciation for the work. Doesn't go into much technical detail, especially about how they efficiently update the graph edges as the system evolves and particles move around, so as to capture inter-particle interactions. The interview with Jonathan Godwin also stays high-level, but does point out that the model's weakness is simulating large rigid bodies: graph networks struggle to pass information between the two ends of a large rigid body quickly enough.
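The video doesn't cover the edge-update question, but the 2020 paper linked under Related below builds the interaction graph by connecting all particles within a fixed connectivity radius and recomputing the edges at every step. A minimal sketch of that idea, assuming a k-d tree for the neighbour query (the radius value and array shapes here are illustrative, not the paper's settings):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_edges(positions, radius):
    # Connect every pair of particles closer than `radius`; the k-d tree keeps
    # the neighbour query near O(N log N) instead of the naive O(N^2) scan.
    tree = cKDTree(positions)
    pairs = tree.query_pairs(r=radius, output_type="ndarray")  # (E, 2) unique undirected pairs
    # Message passing needs directed sender -> receiver edges, so duplicate both ways.
    senders = np.concatenate([pairs[:, 0], pairs[:, 1]])
    receivers = np.concatenate([pairs[:, 1], pairs[:, 0]])
    return senders, receivers

positions = np.random.rand(1000, 3)                 # hypothetical particle positions
senders, receivers = build_edges(positions, 0.05)   # rebuilt from scratch every timestep
```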
Related (to-explore):
- Relevant paper (2020): https://arxiv.org/abs/2002.09405
- Original papers (2018) on Graph Networks from DeepMind: https://arxiv.org/abs/1806.01242, https://arxiv.org/abs/1806.01261
Link: https://www.youtube.com/watch?v=KuXjwB4LzSA
A video with great visualizations for building a mathematical appreciation of the subject. Also gives a teaser on how the FFT can be used to speed up this operation, with links to other great videos.
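Assuming the operation in question is convolution (the classic target of FFT speedups), a minimal sketch of the trick: zero-pad, multiply the spectra, and invert, dropping the cost from O(N*M) to roughly O(N log N).

```python
import numpy as np

def fft_convolve(x, k):
    # Convolution in the time domain is multiplication in the frequency domain:
    # pad both signals to the full output length, multiply spectra, invert.
    n = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

x, k = np.random.randn(4096), np.random.randn(512)
direct = np.convolve(x, k)        # O(N*M) direct convolution
fast = fft_convolve(x, k)         # O(N log N) via the FFT
print(np.allclose(direct, fast))  # True, up to floating-point error
```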
Link: https://arxiv.org/abs/1503.02531
Goal: Transfer knowledge from a larger model (say an ensemble) to a smaller model, effectively and efficiently.
Core idea: A "student" model learns from the "soft" labels generated by a larger "teacher" model, instead of the ground-truth ("hard") labels (a loss sketch follows the notes below).
- Having the student model see a richer view of the similarity structure through soft labels seems to help transfer inductive biases (think: generalizing assumptions) from the teacher to the student.
- Useful for compressing knowledge from an ensemble model to a single model
- The student model was observed to learn and generalize well on a fraction of the training data (e.g. 3% of the original train set) when trained with soft labels
- The main limitations are the computational cost of training the large teacher and generating the soft labels, plus the storage cost of keeping those soft labels
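A minimal PyTorch sketch of a distillation objective in the spirit of the paper: a temperature-softened KL term against the teacher, mixed with ordinary cross-entropy on the hard labels. The temperature T, the mixing weight alpha, and their values here are illustrative choices, not the paper's recommended settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.9):
    # Soft term: KL divergence between the teacher's and student's
    # temperature-softened distributions, scaled by T^2 so its gradient
    # magnitudes stay comparable to the hard term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```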
Further reading:
- S. Abnar et al., follow-up on Knowledge Distillation: https://arxiv.org/abs/2006.00555
Link: https://arxiv.org/abs/2010.11929
- Input image is split into a number of small patches of fixed size
- Each patch is projected into D dimensions using a trainable linear transformation
- Alternatively, the image could be passed through CNN layers first, and patches could be formed from the CNN feature maps
- The patch embeddings (+ position embeddings) are fed into the transformer encoder, prepended with a [CLS] token embedding (learnable, similar to BERT); see the sketch after these notes
- Model is pre-trained only on a classification task (unlike BERT, which uses two pre-training tasks)
- TODO Experiments section needs a better look
- Can the model handle images of different aspect ratios?
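A minimal sketch of the patch-embedding front end described above. Sizes follow the ViT-Base/16 configuration; the conv-as-per-patch-linear-projection trick and the zero-initialised parameters are implementation choices for illustration, not necessarily the paper's code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into PxP patches, project each to D dims, prepend [CLS],
    and add learnable position embeddings."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch is exactly a trainable
        # linear projection applied to each flattened patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the [CLS] token
        return x + self.pos_embed              # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                            # (2, 197, 768), ready for the encoder
```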
Further reading:
- H. Touvron et al., Distilled Vision Transformers: https://arxiv.org/abs/2012.12877
- M. Dehghani et al., Scaling to 22B parameters: https://arxiv.org/abs/2302.05442