Average Attention or Attention on last token? #14
Comments
Now I get it. Essentially, you can't do the filtering at the first step if you want to reproduce the performance (without KV cache). But at the following steps, you can only access the last token. This is the reason for last-token-guided ranking.
Hi @feiyu12138, I'm actually having the same question. Would you mind elaborating on what you mean by "you can't do filtering at first step if you wanna reproduce the performance (w/o cache)"?
Hi,
I'm not yet familiar with the internals of LLM inference and the KV cache (I only have a high-level idea), so I haven't really followed you here. Let's assume the KV cache is not used (which should make things simpler?). Say we prune the tokens at layer K. Then all image tokens are available across layers 1 to K - 1, regardless of which answer token is being generated, right? In that case I don't see why pruning at step 1 would affect step 2 and beyond. I know it's definitely not your job to answer my question, so I do appreciate your time and the discussion.
You are right about "all image tokens available across layer 1 to K - 1". However, if the second token has a different attention distribution over the visual tokens, then generating the second token may require a different set of tokens at layers K to 32. But if you didn't store all the KV states, the KV entries you need at layers K to 32 are not available.
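To make the point above concrete, here is a minimal sketch in plain Python. It is only an illustration of the argument in this comment, not code from the FastV repository: the attention values, the number of image tokens, and the helper name `top_k` are all made up for the example. It assumes that the KV entries kept at layers K and above are exactly the top-k image tokens chosen at the first decoding step.

```python
def top_k(scores, k):
    """Return the set of positions with the k largest scores (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return set(order[:k])

# Made-up attention from each newly generated token to 6 image tokens at layer K.
step1_attn = [0.30, 0.05, 0.25, 0.10, 0.20, 0.10]
step2_attn = [0.05, 0.35, 0.25, 0.05, 0.10, 0.20]

k = 3
# Assumption: after step 1, layers K and above only cache KV for these positions.
cached_positions = top_k(step1_attn, k)
# Positions step 2 would select if it were allowed to re-rank the image tokens.
wanted_positions = top_k(step2_attn, k)

# Image tokens step 2 would want but whose KV was never stored at layers >= K.
missing = sorted(wanted_positions - cached_positions)

print("cached:", sorted(cached_positions))
print("wanted:", sorted(wanted_positions))
print("missing from cache:", missing)
```

If `missing` is non-empty, the second step cannot use its own preferred tokens at the deeper layers unless the full KV states were kept or the earlier layers are recomputed.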
@chenllliang Is it true that the first generated token's forward pass won't have vision tokens pruned? If so, I'm really confused, since at least when evaluating aokvqa the max new tokens is set to 1. I would appreciate confirmation and thoughts from the authors too.
@zjysteven, FastV works differently with and without KV cache; I'll explain separately.
@feiyu12138 thanks for your clear explanation as well!
That makes perfect sense and clears up my confusion. Thank you both!
Hi,
The paper and the comments state that, to rerank the visual tokens, the average attention is calculated across all tokens. However, the code actually calculates the attention from the last token only, which differs from the description. Would you mind clarifying which strategy is better? Thank you very much!
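For readers comparing the two strategies, the difference can be sketched in plain Python as follows. This is a hedged illustration, not the repository's implementation: the function names, the `mode` parameter, the nested-list attention layout `[head][query][key]`, and the toy values are assumptions. "average" here means the mean over all query positions, and "last" means the row of the final token only; both are averaged over heads first.

```python
def visual_token_scores(attn, img_start, img_end, mode="last"):
    """Score image tokens from attention weights laid out as attn[head][query][key]."""
    num_heads = len(attn)
    seq_len = len(attn[0])
    # Average the attention maps over heads.
    head_avg = [
        [sum(attn[h][q][k] for h in range(num_heads)) / num_heads
         for k in range(seq_len)]
        for q in range(seq_len)
    ]
    if mode == "last":
        # Attention that the last token pays to each image token.
        return head_avg[-1][img_start:img_end]
    if mode == "average":
        # Mean attention each image token receives over all query positions.
        return [
            sum(head_avg[q][k] for q in range(seq_len)) / seq_len
            for k in range(img_start, img_end)
        ]
    raise ValueError("mode must be 'last' or 'average'")


def top_k_visual_tokens(scores, k):
    """Indices (relative to the image span) of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(order[:k])


# Toy causal attention: one head, 5 tokens, image tokens at positions 1..3.
toy_attn = [[
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0, 0.0],
    [0.1, 0.2, 0.3, 0.4, 0.0],
    [0.1, 0.1, 0.1, 0.6, 0.1],
]]

last_pick = top_k_visual_tokens(visual_token_scores(toy_attn, 1, 4, "last"), 1)
avg_pick = top_k_visual_tokens(visual_token_scores(toy_attn, 1, 4, "average"), 1)
print("last-token ranking keeps:", last_pick)
print("average ranking keeps:", avg_pick)
```

On this toy input the two strategies keep different image tokens, which is why the discrepancy between the description and the code matters.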