
question about attention patterns #28

Closed

SUDA-HLT-ywfang opened this issue Dec 5, 2023 · 6 comments

Comments

@SUDA-HLT-ywfang

Hi!
In Figure 5 of the blog, it seems like tokens of the current iteration attend to tokens from previous iterations. For example, the token at position 6 in red attends to the token at position 5 in green.
But in Jacobi decoding, isn't it supposed to attend to tokens from the current iteration? That is, the token at position 6 in red should attend to the token at position 5 in red.

@Viol2000
Collaborator

Viol2000 commented Dec 6, 2023

Hi, thanks for your interest!
We are a bit different from Jacobi decoding. Also, the numbers in Figure 5 show relative positions (assuming the current input is position 0).

@SUDA-HLT-ywfang
Author

SUDA-HLT-ywfang commented Dec 6, 2023

Thank you for your reply! I'm still a little bit confused.

  1. Without the basic form of Jacobi decoding, how do you guarantee that lookahead decoding produces exactly the same results as autoregressive decoding?
  2. From my understanding, if the sequence is [a, b, c, d, e] and "c" is position 0, then the input is [a, b, c] and "e" is at position 2. Is that right?

@Viol2000
Collaborator

Viol2000 commented Dec 6, 2023

Hi,

  1. We use a verification branch to guarantee that the output is the same as autoregressive decoding. For example, in Figure 5 we verify two speculations: deep blue 0 + upper blue 1, 2, 3 and deep blue 0 + lower blue 1, 2, 3. This verification is similar to speculative decoding: we compare the softmax output of deep blue 0 with upper blue 1. If it matches, we accept upper blue 1 as the next token and go on to compare upper blue 1's output with upper blue 2, and so on. (A minimal sketch of this check follows below the list.)
  2. Yes. If the sequence is [a, b, c, d, e] and 'c' is the current input at position 0, note that a and b are not part of the input (they are stored in the KV cache). Then d is position 1 and e is position 2.
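A minimal sketch of this greedy accept/reject rule, assuming a HuggingFace-style causal LM interface; the function and argument names are illustrative, not the repository's actual API:

```python
import torch

def verify_speculation(model, context_ids, speculated_ids):
    """Greedy verification of one speculated n-gram (illustrative sketch).

    context_ids:    accepted tokens so far, ending with the current input token
    speculated_ids: one candidate continuation, e.g. [blue 1, blue 2, blue 3]
    Returns the prefix of speculated_ids that matches what greedy
    autoregressive decoding would have produced.
    """
    # One forward pass over the context plus the speculation.
    input_ids = torch.tensor([context_ids + speculated_ids])
    logits = model(input_ids).logits[0]  # shape: (seq_len, vocab_size)

    accepted = []
    # logits[len(context_ids) - 1] predicts the first speculated token.
    start = len(context_ids) - 1
    for i, token in enumerate(speculated_ids):
        predicted = int(torch.argmax(logits[start + i]))
        if predicted != token:
            break               # mismatch: stop and fall back to `predicted`
        accepted.append(token)  # match: identical to the autoregressive choice
    return accepted
```

This only illustrates the accept/reject rule in isolation; in the actual method the verification branch is processed together with the lookahead branch in a single forward pass.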

@SUDA-HLT-ywfang
Author

In Figure 5, position 6 in red actually attends to position 5 in green (the red arrow) instead of position 5 in red (the green arrow). Why is that, considering that position 5 in red is the latest iteration's result? Is it so that you can get a more accurate trajectory with attention arranged like this?
[screenshot: Figure 5 with the red and green arrows annotated]

@Viol2000
Collaborator

Viol2000 commented Dec 8, 2023

Hi @FrankCast1e , my idea is that red 6 is generated by the sequence: some token 3, orange 4, green 5. This gives a strong local relation if those tokens at positions 3, 4, 5 can form an n-gram phrase. In the next step, we can then use orange 4, green 5, and red 6 to generate the next token and form another meaningful n-gram. If you instead use red 5 as the previous token of red 6, I don't think it makes much sense, because red 6 has no relationship with red 5, and it may not produce a meaningful n-gram.
Also, if you made red 5 the previous token of red 6, which token would then be the previous token of red 5? I think that would need careful investigation and would amount to an alternative design.
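To make the n-gram intuition above concrete, here is a hedged sketch of collecting candidate n-grams from a lookahead window, where each n-gram takes one token per kept iteration while moving one relative position to the right (e.g. orange 4, green 5, red 6). The window layout and names are assumptions for illustration, not the actual lookahead decoding code:

```python
def collect_ngrams(window, n=3):
    """Collect candidate n-grams from a lookahead window (illustrative).

    window[i][j] holds the token produced at kept iteration i for relative
    position j, with larger i meaning a newer iteration (orange -> green -> red).
    Each n-gram is read diagonally: one token per iteration, shifting one
    position to the right each step, so red 6 follows green 5 follows orange 4.
    """
    num_iters = len(window)      # assumes at least n iterations are kept
    num_pos = len(window[0])
    ngrams = []
    for j in range(num_pos - n + 1):
        gram = tuple(window[num_iters - n + k][j + k] for k in range(n))
        ngrams.append(gram)
    return ngrams

# Toy example: three kept iterations over relative positions 4..7.
window = [
    ["o4", "o5", "o6", "o7"],   # oldest kept iteration ("orange")
    ["g4", "g5", "g6", "g7"],   # middle iteration ("green")
    ["r4", "r5", "r6", "r7"],   # newest iteration ("red")
]
print(collect_ngrams(window))   # [('o4', 'g5', 'r6'), ('o5', 'g6', 'r7')]
```

Collected n-grams like these would then serve as the speculations checked by the verification branch in later steps.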

@SUDA-HLT-ywfang
Author

Thank you very much for your explanation! I totally get the idea right now.
