Self-attention in O(N) while still letting all the tokens talk to each other #587
IlyaGazman started this conversation in General
After watching your video, I am stuck on the idea that some parts of the attention algorithm can run in linear time.
Specifically, I am referring to the part where you talk about the mathematical trick in self-attention:
https://youtu.be/kCc8FmEb1nY?si=kfnkvp8iOLUKOgW7&t=2542
If f(n) is the average of all the elements of array A up to index n (1-indexed), then f(n) = (f(n-1) * (n-1) + A[n]) / n.
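For concreteness, here is a minimal sketch (my own code, not from the video) checking that this recurrence reproduces the O(N²) lower-triangular averaging from the lecture in a single linear pass:

```python
import torch

torch.manual_seed(0)
N = 8
A = torch.randn(N)

# O(N^2) version from the lecture: row-normalized lower-triangular
# weights, so row n averages A[0..n].
wei = torch.tril(torch.ones(N, N))
wei = wei / wei.sum(dim=1, keepdim=True)
avg_matmul = wei @ A

# O(N) version using the recurrence f(n) = (f(n-1) * (n-1) + A[n]) / n,
# with 1-indexed n and f(0) = 0.
avg_linear = torch.empty(N)
f = 0.0
for n in range(1, N + 1):
    f = (f * (n - 1) + A[n - 1].item()) / n
    avg_linear[n - 1] = f

print(torch.allclose(avg_matmul, avg_linear))  # True
```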
I know this trick is at the heart of self-attention, but can this linearity be carried through the rest of the computation? If not, where does it break?