Merge pull request #80 from WolodjaZ/master
Added Vision_Transformers_provably_learn_spatial_structure paper
sobieskibj authored Nov 28, 2024
2 parents 2ed63a7 + fe4ac28 commit 552f1c9
Showing 3 changed files with 10 additions and 1 deletion.
@@ -0,0 +1,9 @@
# Vision Transformers provably learn spatial structure

## Abstract

A key question about vision transformers (ViTs) is how they achieve performance comparable to CNNs without built-in spatial inductive biases. The paper gives a theoretical explanation of how ViTs learn spatial structure through gradient descent. The authors prove that, while minimizing their training objective, ViTs implicitly learn to group related image patches ("patch association") through a three-phase process. This learned structure enables efficient transfer learning and helps explain ViTs' practical success. Experiments on CIFAR-10/100, SVHN, and ImageNet demonstrate that ViTs can learn general spatial relationships even when image patches are randomly permuted, and that a simplified positional attention mechanism achieves competitive accuracy (68.9% vs. 71.9% on ImageNet).

## Source paper

[Vision Transformers provably learn spatial structure](https://arxiv.org/abs/2210.09221)
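
The simplified positional attention mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's exact formulation: attention weights come from a learned position-to-position logit matrix `P` and ignore patch content entirely; the function and variable names here are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def positional_attention(X, P, W_v):
    """Content-independent attention sketch.

    X:   (n_patches, d) patch embeddings
    P:   (n_patches, n_patches) learned positional logits
    W_v: (d, d) value projection

    The mixing weights depend only on P, never on X, so permuting
    patch content leaves the attention pattern itself unchanged.
    """
    A = softmax(P, axis=-1)      # (n, n) row-stochastic attention weights
    return A @ (X @ W_v)         # (n, d) mixed value vectors

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
P = rng.normal(size=(n, n))
W_v = rng.normal(size=(d, d))
out = positional_attention(X, P, W_v)
```

Because `P` is learned per position pair, such a layer can recover which patches belong together even when the input patch order is scrambled, which is the intuition behind the permuted-patch experiments.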
Binary file not shown.
README.md: 2 changes (1 addition, 1 deletion)
@@ -16,7 +16,7 @@ Join us at https://meet.drwhy.ai.
* 28.10 - [Adversarial examples vs. context consistency defense for object detection](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_10_28_Adversarial_attacks_against_object_detection.md) - Hubert Baniecki
* 04.11 - [Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_04_Unlocking_the_Power_of_Spatial_and_Temporal_Information_in_Medical_Multimodal_Pre-training) - Bartosz Kochański
* 18.11 - User study: Visual Counterfactual Explanations for Improved Model Understanding - Bartek Sobieski
- * 25.11 - Vision Transformers provably learn spatial structure - Vladimir Zaigrajew
+ * 25.11 - [Vision Transformers provably learn spatial structure](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure) - Vladimir Zaigrajew
* 02.12 - Null-text Inversion for Editing Real Images using Guided Diffusion Models - Dawid Płudowski
* 09.12 - Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training - Tymoteusz Kwieciński
* 20.01 - Connecting counterfactual and attributions modes of explanation - Jan Jakubik
