Commit
Merge pull request #80 from WolodjaZ/master
Added Vision_Transformers_provably_learn_spatial_structure paper
Showing 3 changed files with 10 additions and 1 deletion.
9 changes: 9 additions & 0 deletions
2024/2024_11_25_Vision_Transformers_provably_learn_spatial_structure/README.md
@@ -0,0 +1,9 @@
# Vision Transformers provably learn spatial structure

## Abstract

A key question about vision transformers (ViTs) is how they achieve performance comparable to CNNs without built-in spatial inductive biases. The paper explains theoretically how ViTs learn spatial structure through gradient descent. The authors prove that, while minimizing the training objective, ViTs implicitly learn to group related image patches (“patch association”) in a three-phase process. This learned structure enables efficient transfer learning and helps explain ViTs’ practical success. Experiments on CIFAR-10/100, SVHN, and ImageNet show that ViTs can learn general spatial relationships even when image patches are randomly permuted, and that a simplified positional attention mechanism achieves competitive performance (68.9% vs 71.9% accuracy on ImageNet).
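
The positional attention idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "positional attention" means attention weights computed from patch positions alone, independent of patch content; the names `positional_attention`, `pos_scores`, and `permute_scores` are hypothetical and not taken from the paper. It also assumes the same fixed permutation is applied to every image. Under those assumptions, relabeling the positions consistently only reorders the outputs, which is one way to see why such a model has no built-in notion of spatial adjacency.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_attention(patches, pos_scores):
    """Mix patches using weights that depend only on patch positions.

    patches: list of D patch vectors (lists of floats).
    pos_scores: D x D matrix; pos_scores[i][j] is a hypothetical learned
    score for how strongly position i attends to position j.
    """
    dim = len(patches[0])
    out = []
    for i in range(len(patches)):
        weights = softmax(pos_scores[i])  # no dependence on patch content
        mixed = [sum(w * p[k] for w, p in zip(weights, patches))
                 for k in range(dim)]
        out.append(mixed)
    return out


def permute(items, perm):
    """Reorder items so that new index a holds old item perm[a]."""
    return [items[p] for p in perm]


def permute_scores(scores, perm):
    """Apply the same relabeling to both axes of the score matrix."""
    return [[scores[pi][pj] for pj in perm] for pi in perm]


# Toy data (illustrative only): 4 patches of dimension 3.
rng = random.Random(0)
D, dim = 4, 3
patches = [[rng.random() for _ in range(dim)] for _ in range(D)]
pos_scores = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(D)]
perm = [2, 0, 3, 1]  # one fixed permutation of patch positions

out = positional_attention(patches, pos_scores)
out_perm = positional_attention(permute(patches, perm),
                                permute_scores(pos_scores, perm))
```

Here `out_perm` matches `permute(out, perm)` up to floating-point rounding: the permuted problem is the original one with positions renamed, so a model that learns `pos_scores` from data could in principle recover the same patch groupings on permuted inputs.
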
## Source paper

[Vision Transformers provably learn spatial structure](https://arxiv.org/abs/2210.09221)
Binary file added (BIN, +2.89 MB): ...provably_learn_spatial_structure/Vision_Transformers_provably_learn_spatial_structure.pdf
Binary file not shown.