diff --git a/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md b/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md
new file mode 100644
index 0000000..a60b19c
--- /dev/null
+++ b/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md
@@ -0,0 +1,9 @@
+# Vision Transformers provably learn spatial structure
+
+## Abstract
+
+A key question about vision transformers (ViTs) is how they achieve performance comparable to CNNs without built-in spatial inductive biases. The paper gives a theoretical explanation of how ViTs learn spatial structure through gradient descent optimization. The authors prove that ViTs implicitly learn to group related image patches (“patch association”) through a three-phase process while minimizing their training objective. This learned structure enables efficient transfer learning and helps explain ViTs’ practical success. Experiments on CIFAR-10/100, SVHN, and ImageNet demonstrate that ViTs can learn general spatial relationships even when image patches are randomly permuted, achieving competitive performance with a simplified positional attention mechanism (68.9% vs 71.9% accuracy on ImageNet).
+
+## Source paper
+
+[Vision Transformers provably learn spatial structure](https://arxiv.org/abs/2210.09221)
diff --git a/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf b/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf
new file mode 100644
index 0000000..ccba4bf
Binary files /dev/null and b/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf differ
diff --git a/README.md b/README.md
index 5b0038c..144e11a 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ Join us at https://meet.drwhy.ai.
 * 28.10 - [Adversarial examples vs. context consistency defense for object detection](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_10_28_Adversarial_attacks_against_object_detection.md) - Hubert Baniecki
 * 04.11 - [Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_04_Unlocking_the_Power_of_Spatial_and_Temporal_Information_in_Medical_Multimodal_Pre-training) - Bartosz Kochański
 * 18.11 - User study: Visual Counterfactual Explanations for Improved Model Understanding - Bartek Sobieski
-* 25.11 - Vision Transformers provably learn spatial structure - Vladimir Zaigrajew
+* 25.11 - [Vision Transformers provably learn spatial structure](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure) - Vladimir Zaigrajew
 * 02.12 - Null-text Inversion for Editing Real Images using Guided Diffusion Models - Dawid Płudowski
 * 09.12 - Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training - Tymoteusz Kwieciński
 * 20.01 - Connecting counterfactual and attributions modes of explanation - Jan Jakubik