
question about attention patterns #28

Closed

SUDA-HLT-ywfang opened this issue Dec 5, 2023 · 6 comments

Comments

@SUDA-HLT-ywfang

Hi!
In Figure 5 of the blog, it seems like tokens of the current iteration attend to tokens from previous iterations. For example, the token at position 6 in red attends to the token at position 5 in green.
But in Jacobi decoding, isn't it supposed to attend to tokens from the current iteration? That is, the token at position 6 in red should attend to the token at position 5 in red.

@Viol2000
Collaborator

Viol2000 commented Dec 6, 2023

Hi, thanks for your interest!
We are a bit different from Jacobi decoding. Also, the numbers in Figure 5 show relative positions (assuming the current input is position 0).

@SUDA-HLT-ywfang
Author

SUDA-HLT-ywfang commented Dec 6, 2023

Thank you for your reply! I'm still a little bit confused.

  1. Without the basic form of Jacobi decoding, how do you guarantee that lookahead decoding produces exactly the same results as autoregressive decoding?
  2. From my understanding, if the sequence is [a, b, c, d, e] and "c" is position 0, then the input is [a, b, c] and "e" is at position 2. Is that right?

@Viol2000
Collaborator

Viol2000 commented Dec 6, 2023

Hi,

  1. We use a verification branch to guarantee that the output is the same as autoregressive decoding. For example, in Figure 5 we verify two speculations: deep blue 0 + upper blue 1, 2, 3 and deep blue 0 + lower blue 1, 2, 3. This verification is similar to speculative decoding: we compare the softmax output of deep blue 0 with upper blue 1. If it matches, we accept upper blue 1 as the next token and go on to compare upper blue 1's output with upper blue 2, and so on. (A minimal sketch of this check follows below the list.)
  2. Yes. If the sequence is [a, b, c, d, e] and 'c' is the current input at position 0, note that a and b are not part of the input (they are stored in the KV cache). Then d is position 1 and e is position 2.
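A minimal sketch of this greedy accept/reject rule, assuming a HuggingFace-style causal LM interface; the function and argument names are illustrative, not the repository's actual API:

```python
import torch

def verify_speculation(model, context_ids, speculated_ids):
    """Greedy verification of one speculated n-gram (illustrative sketch).

    context_ids:    accepted tokens so far, ending with the current input token
    speculated_ids: one candidate continuation, e.g. [blue 1, blue 2, blue 3]
    Returns the prefix of speculated_ids that matches what greedy
    autoregressive decoding would have produced.
    """
    # One forward pass over the context plus the speculation.
    input_ids = torch.tensor([context_ids + speculated_ids])
    logits = model(input_ids).logits[0]  # shape: (seq_len, vocab_size)

    accepted = []
    # logits[len(context_ids) - 1] predicts the first speculated token.
    start = len(context_ids) - 1
    for i, token in enumerate(speculated_ids):
        predicted = int(torch.argmax(logits[start + i]))
        if predicted != token:
            break               # mismatch: stop and fall back to `predicted`
        accepted.append(token)  # match: identical to the autoregressive choice
    return accepted
```

This only illustrates the accept/reject rule in isolation; in the actual method the verification branch is processed together with the lookahead branch in a single forward pass.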

@SUDA-HLT-ywfang
Author

In Figure 5, position 6 in red actually attends to position 5 in green (the red arrow) instead of position 5 in red (the green arrow). Why is that, considering that position 5 in red is the latest iteration's result? Is it so that you can get a more accurate trajectory with attention arranged like this?
[screenshot: Figure 5 with the red and green arrows annotated]

@Viol2000
Collaborator

Viol2000 commented Dec 8, 2023

Hi @FrankCast1e , my idea is that red 6 is generated by the sequence: some token 3, orange 4, green 5. This gives a strong local relation if those tokens at positions 3, 4, 5 can form an n-gram phrase. In the next step, we can then use orange 4, green 5, and red 6 to generate the next token and form another meaningful n-gram. If you instead use red 5 as the previous token of red 6, I don't think it makes much sense, because red 6 has no relationship with red 5, and it may not produce a meaningful n-gram.
Also, if you made red 5 the previous token of red 6, which token would then be the previous token of red 5? I think that would need careful investigation and would amount to an alternative design.
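To make the n-gram intuition above concrete, here is a hedged sketch of collecting candidate n-grams from a lookahead window, where each n-gram takes one token per kept iteration while moving one relative position to the right (e.g. orange 4, green 5, red 6). The window layout and names are assumptions for illustration, not the actual lookahead decoding code:

```python
def collect_ngrams(window, n=3):
    """Collect candidate n-grams from a lookahead window (illustrative).

    window[i][j] holds the token produced at kept iteration i for relative
    position j, with larger i meaning a newer iteration (orange -> green -> red).
    Each n-gram is read diagonally: one token per iteration, shifting one
    position to the right each step, so red 6 follows green 5 follows orange 4.
    """
    num_iters = len(window)      # assumes at least n iterations are kept
    num_pos = len(window[0])
    ngrams = []
    for j in range(num_pos - n + 1):
        gram = tuple(window[num_iters - n + k][j + k] for k in range(n))
        ngrams.append(gram)
    return ngrams

# Toy example: three kept iterations over relative positions 4..7.
window = [
    ["o4", "o5", "o6", "o7"],   # oldest kept iteration ("orange")
    ["g4", "g5", "g6", "g7"],   # middle iteration ("green")
    ["r4", "r5", "r6", "r7"],   # newest iteration ("red")
]
print(collect_ngrams(window))   # [('o4', 'g5', 'r6'), ('o5', 'g6', 'r7')]
```

Collected n-grams like these would then serve as the speculations checked by the verification branch in later steps.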

@SUDA-HLT-ywfang
Author

Thank you very much for your explanation! I totally get the idea right now.
