Merge pull request #80 from WolodjaZ/master

Added Vision_Transformers_provably_learn_spatial_structure paper
MI2DataLab · Nov 28, 2024 · 552f1c9
2 parents 2ed63a7 + fe4ac28
commit 552f1c9

Showing 3 changed files with 10 additions and 1 deletion.
diff --git a/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md b/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md
@@ -0,0 +1,9 @@
+# Vision Transformers provably learn spatial structure
+
+## Abstract
+
+A key question about vision transformers (ViTs) is how they achieve performance comparable to CNNs without built-in spatial inductive biases. The paper explains theoretically how ViTs learn spatial structure through gradient-descent optimization. The authors prove that, while minimizing their training objective, ViTs implicitly learn to group related image patches (“patch association”) through a three-phase process. This learned structure enables efficient transfer learning and helps explain ViTs’ practical success. Experiments on CIFAR-10/100, SVHN, and ImageNet demonstrate that ViTs can learn general spatial relationships even when image patches are randomly permuted, achieving competitive performance with a simplified positional attention mechanism (68.9% vs 71.9% accuracy on ImageNet).
+
+## Source paper
+
+[Vision Transformers provably learn spatial structure](https://arxiv.org/abs/2210.09221)
diff --git a/...provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf b/...provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf
diff --git a/README.md b/README.md
@@ -16,7 +16,7 @@ Join us at https://meet.drwhy.ai.
 * 28.10 - [Adversarial examples vs. context consistency defense for object detection](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_10_28_Adversarial_attacks_against_object_detection.md) - Hubert Baniecki
 * 04.11 - [Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_04_Unlocking_the_Power_of_Spatial_and_Temporal_Information_in_Medical_Multimodal_Pre-training) - Bartosz Kochański
 * 18.11 - User study: Visual Counterfactual Explanations for Improved Model Understanding - Bartek Sobieski
-* 25.11 - Vision Transformers provably learn spatial structure - Vladimir Zaigrajew
+* 25.11 - [Vision Transformers provably learn spatial structure](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure) - Vladimir Zaigrajew
 * 02.12 - Null-text Inversion for Editing Real Images using Guided Diffusion Models - Dawid Płudowski
 * 09.12 - Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training - Tymoteusz Kwieciński
 * 20.01 - Connecting counterfactual and attributions modes of explanation - Jan Jakubik