
Questions on the attention mask, and whether to accept the last element of guess_results when all guess_tokens are accepted #32

Open
YingHH1 opened this issue Dec 7, 2023 · 6 comments

Comments

YingHH1 commented Dec 7, 2023

It was mentioned in #14 that yellow 7 can see orange 1-4, green 5 and red 6. However, as I understood it, it was orange 4, green 5, red 6 and yellow 7 that form a 4-gram, so orange 1-3 is irrelevant here and should be masked. Or am I misunderstanding something?

On a different question: if all of guess_tokens match guess_results[0:-1], should the lookahead step also accept the last element, guess_results[-1] (since the result is then a complete sentence)?
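
To make the second question concrete, here is a rough sketch of the acceptance loop I have in mind (just my own illustration, not the repo's code, and assuming guess_results[i] is the model's greedy output given the verified context plus guess_tokens[:i], so that guess_results is one element longer than guess_tokens):

```python
# Rough sketch of the acceptance step as I understand it (not the repo's code).
# Assumption: guess_results[i] is the model's greedy output given the verified
# context plus guess_tokens[:i], so len(guess_results) == len(guess_tokens) + 1.
def accept(guess_tokens, guess_results):
    accepted = []
    for tok, out in zip(guess_tokens, guess_results):
        if tok != out:
            # First mismatch: "out" is still a valid next token produced by
            # the model at this position, so emit it and stop.
            accepted.append(out)
            return accepted
        accepted.append(tok)
    # All guess tokens matched. My question: can we also take the bonus token
    # guess_results[-1], which the model produced after seeing the whole guess?
    accepted.append(guess_results[-1])
    return accepted
```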

Many thanks for the help

hsm1997 commented Dec 7, 2023

it was orange 4, green 5, red 6 and yellow 7 that form a 4-gram,

Yes, but orange 1-3 form another 3-gram before this 4-gram, so they should also be attended to by yellow 7, so that yellow 7 sees a complete sentence.

should the lookahead step also accept the last element, guess_results[-1]?

I think this could possibly work as well. But conceptually, guess_results is only used to verify guess_tokens: the tokens to be accepted should be chosen from the "verified guess tokens", not from the "tokens used for verification".

YingHH1 commented Dec 8, 2023

Yes, but orange 1-3 form another 3-gram before this 4-gram, so they should also be attended to by yellow 7, so that yellow 7 sees a complete sentence.

But I do not see why orange 1-4 should have any connections at all. They are parts of different 4-grams, and orange 1-3 is not a 3-gram (n-grams are formed across the different colours), if I am not mistaken. When I inspect an example, the tokens in orange 1-4 do not form a coherent phrase.

hsm1997 commented Dec 8, 2023

n-grams are formed across the different colours

I guess that since the author assigns each token a specific number in blog figure 5, that number stands for the token's "expected position index" within the context.

the tokens in orange 1-4 do not form a coherent phrase

I see that as the decoding process goes on, orange 1-4 gradually form a coherent phrase, for example when steps=15:
[Screenshot, 2023-12-08: decoded output at step 15, where orange 1-4 read as a coherent phrase]

YingHH1 commented Dec 8, 2023

Thank you very much for the response.

I still have trouble understanding why orange 1-4 should have connections. I guess this is because we use a causal mask in the first context-decoding step (where a conventional triangular mask is used, so the orange tokens can see their preceding tokens); is this the reason?

If so, why not do the same for green 1-5 and red 1-5 so that they can also see their preceding tokens (i.e. six lower triangles in the mask, as opposed to the current three under the orange tokens)?

hsm1997 commented Dec 11, 2023

Orange 1-4 are "guessed" to form a 4-gram, but the "collected 4-grams" are indeed generated in an autoregressive pattern. And if you

do the same for green 1-5 and red 1-5

there would be no autoregressive pattern within the "guess decoding" process, and the probability of an n-gram guess being right might decrease (p.s. just a personal guess here :-)).
Besides, there should not be six lower triangles in the mask. For example, green 5 cannot attend to orange 1-4 and green 1-4 at the same time.
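
To make the pattern concrete, here is a toy enumeration of who attends to whom, following the numbering in blog figure 5 (just my own reading of the mask, not the repo's mask-building code):

```python
# Toy illustration of the lookahead-branch attention rule as I read it from
# blog figure 5 (my own sketch, not the repo's mask construction).
# "level" is the branch row (1 = green, 2 = red, 3 = yellow) and "label" is
# the token's expected position index; every token also attends to blue 0.
LEVEL_NAMES = ["orange", "green", "red", "yellow"]

def attends_to(level, label):
    # The orange prefix, up to this token's diagonal orange ancestor ...
    targets = [("orange", p) for p in range(1, label - level + 1)]
    # ... plus the diagonal ancestors on the intermediate levels.
    for l in range(1, level):
        targets.append((LEVEL_NAMES[l], label - level + l))
    return targets

print(attends_to(3, 7))  # yellow 7 -> orange 1-4, green 5, red 6
print(attends_to(2, 6))  # red 6    -> orange 1-4, green 5
print(attends_to(1, 5))  # green 5  -> orange 1-4 (not green 1-4)
```

So every branch token attends to one full "sentence": the orange prefix plus its own diagonal ancestors, which is the autoregressive pattern mentioned above.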

YingHH1 commented Dec 11, 2023

I think I am slowly grasping what is happening here now. We need blue 0 and orange 1-4 to build connections so that their corresponding 4-grams (five collected 4-grams in this case) are all relevant to the prompt context. Otherwise, some of the 4-grams are useless, as they have almost no connection to the prompt context (even though they are coherent 4-grams). Thus, connecting blue 0 and orange 1-4 in an autoregressive manner can lead to a better acceptance rate, since they are the first tokens of the collected 4-grams.

I guess this is the reason why we want blue 0 and orange 1-4 to form a sentence :)
