Self-attention in O(N) while still letting all the tokens talk to each other #587
IlyaGazman started this conversation in General
After watching your video, I am stuck on the idea that some parts of the attention algorithm can run in linear time.
Specifically, I am referring to the part where you talk about the mathematical trick in self-attention:
https://youtu.be/kCc8FmEb1nY?si=kfnkvp8iOLUKOgW7&t=2542
If f(n) is the average of all the elements of array A up to index n (1-indexed), then f(n) = (f(n-1) * (n-1) + A[n]) / n.
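For concreteness, here is a minimal sketch (my own code, not from the video) checking that this recurrence reproduces the O(N²) lower-triangular averaging from the lecture in a single linear pass:

```python
import torch

torch.manual_seed(0)
N = 8
A = torch.randn(N)

# O(N^2) version from the lecture: row-normalized lower-triangular
# weights, so row n averages A[0..n].
wei = torch.tril(torch.ones(N, N))
wei = wei / wei.sum(dim=1, keepdim=True)
avg_matmul = wei @ A

# O(N) version using the recurrence f(n) = (f(n-1) * (n-1) + A[n]) / n,
# with 1-indexed n and f(0) = 0.
avg_linear = torch.empty(N)
f = 0.0
for n in range(1, N + 1):
    f = (f * (n - 1) + A[n - 1].item()) / n
    avg_linear[n - 1] = f

print(torch.allclose(avg_matmul, avg_linear))  # True
```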
I know this trick is at the heart of self-attention, but can this linearity be carried through the rest of the computation? If not, where does it break?