Fix incorrect causal attention mask caused by M-Rope #15474
base: master
Conversation
…tten masking will break. See ggml-org#13694. This also allows us to remove some previous workarounds
This is my first PR. Thanks for your help! @FMayran @ggerganov @ngxson
The modification in llama-graph.cpp deserves a comment regarding why we add n_tokens (because we want to start with the second component of pos, since the first component is the linearly increasing position, irrelevant to actual positional encoding).
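For readers skimming the diff, here is a minimal sketch of the layout being discussed, as I understand the PR's convention; the exact section count and ordering are assumptions to check against the actual code, not a verbatim excerpt:

```cpp
// Illustration only (not a patch): for an M-RoPE model, the ubatch position
// buffer holds several sections of n_tokens entries each, stored back to back.
// In this PR the first section is the linearly increasing per-token index used
// for the causal mask, and the M-RoPE components (t, h, w, ...) follow, so
// "pos[i + n_tokens]" reads the first component that actually feeds RoPE.
#include <cstdint>
#include <vector>

int32_t pos_component(const std::vector<int32_t> & pos, size_t n_tokens,
                      size_t i /* token index */, size_t section /* 0 = causal index */) {
    return pos[i + section * n_tokens];
}
```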
Yes, will add comments tomorrow.
BTW: have you tested MiMoVL0528 with this PR? I forgot to test it.
Added some comments as suggested by @FMayran
Note: this PR's output matches transformers and vLLM if the prompt is given in exactly the same way. However, in the current tests the prompt for vLLM is "image before text", which is not what llama.cpp currently does. For details, see:
@ggerganov This PR contains an important fix and llama.cpp's output has significantly improved with it. What do we need to do to start the review process?
```cpp
        return 1; // for M-RoPE, the whole image is 1 in temporal dimension
    }
    return image_tokens->n_tokens();
}
```
Removing this will cause unexpected behavior
Could you elaborate? What unexpected behavior?
What is your suggested change?
Ah, I understand what you mean now. Thanks. Before my fix, removing this would indeed lead to sudden jumps, but after my fix the first dimension of pos reverts to its original meaning: the auto-increasing "serial number" of each token, so it no longer jumps.
Looking at your comment again, I think I misunderstood you, so I'm replying again.
(I can't see your posted image due to proxy settings, so sorry if your image contains the answer.) Actually, I was also trying to preserve this behavior (repeating positions for all image tokens), but couldn't find a way to fix causal attention masking properly. It looks like the code assumes the positions are increasing without repetition?
This patch is like the missing glasses for Qwen VL. I was looking for vision issues because it kept referring to arms as legs at times, and when pushed for detailed descriptions the whole anatomy was more Lovecraftian horror than human. Looks like I found the right one: that problem appears to be fixed by this patch. Continuing the chat with a second message containing a different image doesn't really work, but that is probably a different issue.
Thanks for the info! I'm not an expert in LLMs (nor in the llama.cpp codebase), so I'm mainly targeting vLLM/transformers parity first, then trying to understand whether it's the "correct" way. Could you please provide more information about your use case, including the second message that you think "is a different issue"? If it's really another issue, I'll try to investigate and make another PR. @Hoernchen
The usage was classifying the poses of people, including visibility of limbs/hands. I thought it can't be that hard, right? Well. Pose models like mediapipe/vitpose/sapiens tend to hallucinate depending on cropping/clothes, like to place imaginary hands somewhere, and do not offer reliable visibility scores, so the idea was to give a "proper" vision model a shot. The vision models can do better when prompted like "find the shoulders, follow the arms to the hands, are the hands visible/part of the image/obscured". That obviously only works if the image is not somehow garbled. It's even more obvious if you try to "help" the model using a depth image (or two images, color+depth) generated using https://github.com/DepthAnything/Depth-Anything-V2. This is basically a quick, poor man's "segmentation" without having to rely on actual accurate segmentation, and the color gradient the model describes around the "shoulders" does not match the image but appears to come from a different part of the image.

As for the second message: I don't know. Using llama.cpp with Qwen VL works if you start a new chat with your instructions and one image to describe, but if you continue the conversation and add another image to the second message, it fails to describe it. I don't know if that is just because the model is stubborn, because the second image is garbled again, or maybe because there is no second image and the model just hallucinates completely. That is unfortunately hard to debug. Adding multiple images to the first prompt at once appears to have issues as well; I don't recall if this was Qwen or Gemma 3, but it led to getting descriptions of "all" 5 to 13 images - with 8 actual images! Some of those descriptions didn't match any image, some appeared to be repeats.
Thanks for the explanation! I understand the basic idea now. I wonder if you could post an example here (or send it to me via email) so that I can first run it with vLLM and transformers, to know whether it's a limitation of the model or just a llama.cpp issue. Actually, the exact format of the conversation (including the order of messages) matters a lot, so without the detailed messages it's hard to say.
@ngxson: what is currently needed for this to be merged? We're currently forced to use a build of llama.cpp built from this PR's code, eagerly awaiting a proper release. If needed, I am willing to help.
I just had a look inside transformers' implementation. What it does is the following for text located after an image: token_pos = llm_pos_ids_list[-1].max() + 1, where llm_pos_ids_list[-1] is essentially [expand(0..t), expand(0..h), expand(0..w)] + previous_token_pos. For images, the time part is just a tensor of zeros. For videos, it is frame_nb * seconds_per_frame * config.vision_config.tokens_per_second.

In other words, first text pos = max(h, w, t) + previous_token_pos. This is consistent with their documentation of the function ("Here we calculate the text start position_ids as the max vision position_ids plus 1."), but not with what @ngxson said. So, even though it looks wrong, transformers' implementation is +max(h, w, t).

Edit:
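A compact sketch of that rule (my reading of transformers' `get_rope_index`, written in C++ for illustration; the function and variable names here are made up):

```cpp
#include <algorithm>
#include <cstdint>

// After a vision block, the next text token starts at max(t, h, w) + 1,
// where t/h/w are the largest M-RoPE components assigned inside that block.
int64_t next_text_pos_after_image(int64_t start_pos, int64_t grid_h, int64_t grid_w) {
    const int64_t max_t = start_pos;                // a single image occupies one time step
    const int64_t max_h = start_pos + grid_h - 1;   // rows 0..grid_h-1
    const int64_t max_w = start_pos + grid_w - 1;   // cols 0..grid_w-1
    return std::max({max_t, max_h, max_w}) + 1;     // "max vision position_ids plus 1"
}
```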
Yes! It's described in Qwen2-VL's paper, section 2.1:
However, it's not in Qwen2.5-VL's paper so I forgot it. I only read Qwen2-VL's paper once very quickly to "get a basic idea", and Qwen2.5-VL's paper several times during implementation. Thanks for the reminder!
Maybe we just need to change
@rujialiu
OK, I have a demonstration of a working fix which preserves token pos numbering, by adding a map to the ubatch that links the ubatch token position and the position in the KV cache. It is ugly but it works. Can someone (@ngxson @rujialiu) check if it is fine for you? My fork:
I don't have time to test it thoroughly, but after reading the code I'd say I like your fork more than my own PR 😄 @FMayran
@Googulator Could you test @FMayran's fork above? It looks quite promising and I wish it would be used instead of mine.
Tested @FMayran's vs @rujialiu's vs the mainline version:
(note: backticks in the output were changed to forward ticks to avoid breaking GitHub's formatting)
Observations:
Thanks @Googulator! Also, could you check FMayran's with --temp 0? I'm curious why it's "not perfectly identical". BTW: are you testing my branch as-is (which is based on an earlier version of
@FMayran @rujialiu Thanks for your work again! I've tested Qwen2.5-VL 7B Q8_0 with the CPU version, just like @Googulator, using temp 0.0. See the results:

Reference vLLM results:

```json
[
  {"bbox_2d": [195, 678, 460, 837], "color": "red"},
  {"bbox_2d": [312, 547, 508, 792], "color": "green"},
  {"bbox_2d": [629, 708, 700, 775], "color": "black"}
]
```

rujialiu version (#15474):

```json
[
  {"bbox_2d": [168, 679, 434, 837], "label": "red rectangle"},
  {"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
[
  {"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
  {"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [602, 708, 671, 775], "label": "black rectangle"}
]
```

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix): some compiler warnings here:

```json
[
  {"bbox_2d": [168, 679, 462, 837], "color": "red"},
  {"bbox_2d": [284, 575, 480, 792], "color": "green"},
  {"bbox_2d": [601, 708, 672, 775], "color": "black"}
]
[
  {"bbox_2d": [168, 679, 462, 837], "color": "red"},
  {"bbox_2d": [284, 575, 480, 792], "color": "green"},
  {"bbox_2d": [601, 708, 672, 775], "color": "black"}
]
```

There are no significant differences in the results, but 'label' vs 'color' is a little bit unexpected.

Another test using the F16 GGUF, rujialiu version (#15474):

```json
[
  {"bbox_2d": [168, 679, 434, 837], "label": "red rectangle"},
  {"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [602, 708, 672, 775], "label": "black rectangle"}
]
```

Another test using an AMD 7900XTX GPU (ROCm 6.3):

```json
[
  {"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
  {"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
[
  {"bbox_2d": [168, 679, 434, 865], "label": "red rectangle"},
  {"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [599, 708, 672, 803], "label": "black rectangle"}
]
```

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix):

```json
[
  {"bbox_2d": [168, 679, 462, 837], "color": "red"},
  {"bbox_2d": [284, 575, 480, 792], "color": "green"},
  {"bbox_2d": [601, 708, 672, 775], "color": "black"}
]
```

Notes:
@broadbit-hu Thanks for the tests again! Did you change |
I changed the order of the image and prompt arguments in the command line, but I think it doesn’t matter (same results).
Does your test differ from mine? 3 weeks ago:

```json
[
  {"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
  {"bbox_2d": [284, 547, 480, 792], "label": "green rectangle"},
  {"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
[
  {"bbox_2d": [166, 678, 462, 838], "color": "red"},
  {"bbox_2d": [310, 546, 509, 794], "color": "green"},
  {"bbox_2d": [629, 706, 700, 802], "color": "black"}
]
```
There are no significant differences in the results between batch size 1 and 2048 in my test, and there's no problem with temperature 0.
A non-zero (e.g., 0.6) temperature produces very strange coordinates with @rujialiu's version.

rujialiu version (#15474):

```json
[
  {"bbox_2d": [628, 703, 700, 777], "label": "rectangle"},
  {"bbox_2d": [167, 677, 436, 838], "label": "rectangle"},
  {"bbox_2d": [283, 545, 481, 793], "label": "rectangle"}
]
[
  {"bbox_2d": [169, 680, 436, 837], "label": "red rectangle"},
  {"bbox_2d": [284, 576, 481, 796], "label": "green rectangle"},
  {"bbox_2d": [598, 708, 674, 777], "label": "black rectangle"}
]
```

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix):

```json
[
  {"bbox_2d": [169, 674, 464, 840], "color": "red"},
  {"bbox_2d": [311, 573, 508, 795], "color": "green"},
  {"bbox_2d": [604, 706, 671, 775], "color": "black"}
]
[
  {"bbox_2d": [169, 674, 461, 838], "color": "red"},
  {"bbox_2d": [285, 575, 481, 795], "color": "green"},
  {"bbox_2d": [599, 708, 671, 776], "color": "black"}
]
```
Yeah, changing the order of the CLI parameters is not helpful. The actual order is hard-coded into the code. You need to follow my instructions in #13694 (comment).
Yes, because I changed the code 😄
Oh! That's not good. I will definitely look into this. I hope just adding the

BTW: Colors are lost, but the coordinates are reasonable (it's a permutation 😄).
OK, I've thought about it carefully. If we aim to 100% reproduce the behavior of transformers/vLLM/the official paper, then @FMayran's approach seems mandatory, because the "position" values sent to the LLM are not increasing. Here is an example. Suppose we have a 2x2 (merged token count) image and there are ~100 tokens before the image, so the last text token has position 99. Then the (t, h, w) position values are roughly:

text tokens: (0, 0, 0), (1, 1, 1), ..., (99, 99, 99)
image tokens: (100, 100, 100), (100, 100, 101), (100, 101, 100), (100, 101, 101)
next text token: (102, 102, 102), ...

Now it's clear: we cannot use this non-increasing sequence of values to check for causality. My PR reduced the incorrect causal attention masking, but did not fully eliminate it.

Conclusion: I propose to use @FMayran's solution instead of mine. However, we'd better figure out a way to avoid changing ubatch, or at least change it in a gentler way. Now I'm feeling more knowledgeable. Could you @ngxson give some advice when you're less busy?
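To make the example concrete, here is a small standalone sketch (illustration only, not llama.cpp code) that generates those (t, h, w) values and shows why a plain `p0 > p1` comparison cannot mask future tokens inside the image chunk:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct pos3 { int t, h, w; };

int main() {
    std::vector<pos3> pos;
    int next = 0;

    // 100 text tokens: all three components increase together
    for (int i = 0; i < 100; ++i) { pos.push_back({next, next, next}); ++next; }

    // one 2x2 (merged) image: t repeats, h/w span the grid
    const int gh = 2, gw = 2;
    for (int r = 0; r < gh; ++r)
        for (int c = 0; c < gw; ++c)
            pos.push_back({next, next + r, next + c});
    next += std::max(gh, gw); // the next text token would land at 102

    // the four image tokens all share t == 100, so "p0 > p1" never fires between
    // them: every image token can see every other one, including future ones
    for (size_t i = 100; i < pos.size(); ++i)
        printf("token %3zu: t=%d h=%d w=%d\n", i, pos[i].t, pos[i].h, pos[i].w);
    printf("next text token position: %d\n", next);
    return 0;
}
```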
My knowledge about this part was from the illustration in Qwen's paper, so I could be wrong. But in any case, the next token should not be
```cpp
    if (image_tokens->use_mrope_pos) {
        return 1; // for M-RoPE, the whole image is 1 in temporal dimension
    }
```
So based on what has been discussed so far, here we should return max(h, w).
However, I don't know how it will impact the rest of the code. Make sure to test it.
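A standalone sketch of what this amounts to (a hypothetical helper, deliberately written outside the mtmd API; the real change would go into the `mtmd_image_tokens_get_n_pos` function shown above):

```cpp
#include <algorithm>
#include <cstdint>

// How many positions an image should advance the counter by:
// - M-RoPE: the largest grid extent, max(h, w), since the temporal extent of a
//   single image is 1 and the following text starts right after the largest axis
// - otherwise: one position per image token
int64_t n_pos_consumed_by_image(bool use_mrope_pos, int64_t grid_h, int64_t grid_w) {
    if (use_mrope_pos) {
        return std::max(grid_h, grid_w);
    }
    return grid_h * grid_w;
}
```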
Hmmm... I think I have come up with an ugly way to avoid changing

Edit: Failed. Still need to change
That was actually my first solution, until I realised that pos in the cache has two meanings:
1. the number of tokens that came before it (its "age", used for causal masking);
2. the base value for the positional encoding of subsequent tokens.
In the code, "n_past" clearly assumes that the number of tokens past (the "age") is also the base position for positional encoding, which is wrong, at least for Qwen2.5-VL. The mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) function is just a hotfix of that meaning, but it can't solve the underlying issue, as it does not, and cannot, fix both meanings: we either break causal masking, or break the encoding of the next chunk. Also, the cache's pos is a single value per token, as opposed to multiple, so no solution based on having multiple pos per token in the ubatch will fix that (we have lost access to the multiple pos of the ubatch at this point, because of the lack of linkage between ubatch and cache, which I was forced to add in my fix). If we add the linkage like I did, then pos is not necessary anymore :-).

More generally, I think it is important to keep the link between tokens in the ubatch and those tokens inserted in the cache until all of them have been decoded, not only for Qwen but for all models with positional encoding. Indeed, causal masking should only depend on the position in the ubatch, not on positional encoding, and not on anything other than the ubatch in the cache (the point is to mask future tokens, and these necessarily come from the ubatch we are processing, unless we want to rewrite past computations).
@broadbit-hu warnings due to struct initialization should be gone in my new commit. The kv_position_of_token should probably be moved to the "data" class for storage, with preallocation instead of my lazy-allocation scheme, but this is not a serious matter.
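For readers who don't want to read the whole fork, this is roughly the shape of the linkage (my paraphrase, not the fork's actual code; names and types are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Per-ubatch side table: which KV cell each token of the current ubatch went to.
// With this, the causal mask can be decided from "where in the ubatch is this
// token" instead of from the (possibly repeating) RoPE position.
struct ubatch_kv_link {
    std::vector<int32_t> kv_cell_of_token; // kv_cell_of_token[i] = cell of ubatch token i

    // true if cached cell j holds a token of this ubatch that comes after token i
    bool is_future(size_t i, int32_t j) const {
        for (size_t k = i + 1; k < kv_cell_of_token.size(); ++k) {
            if (kv_cell_of_token[k] == j) {
                return true;
            }
        }
        return false; // an earlier token of this ubatch, or an older cached token
    }
};
```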
@ggerganov After several weeks of discussion and joint effort from many people, here is a summary of the current situation to save your time. No need to check previous messages in this PR (except @FMayran's last long message, because it didn't exist when I was writing this). The root cause is that the meaning of the word "position" is "polymorphic" and multi-functional:
1. it is the value fed to positional encoding (RoPE);
2. it increases steadily, without jumps;
3. it never repeats;
and it is also what the causal attention mask is computed from.
However, in MLLMs points 2 and 3 above are no longer true: the position can repeat and can "jump". In this PR, I tried to maintain a linearly increasing pos sequence for M-RoPE (so it becomes 5D under the hood) and use it for the causal attention mask calculation. The result seems to be much better (see all the tests above), but it is still incorrect according to the paper, and @FMayran's code produces similar and sometimes better results than mine. It looks like a proper solution, except that it adds a temporary

Also, I wonder whether some kind of internal assumption (increasing? unique?) of the KV cache is already violated. Should the KV cache use a "strictly increasing position" instead? Anyway, here is the only way (if we can't change

Can I take this route? Or do you prefer @FMayran's approach (which is much cleaner and simpler), while trying to make the change more "future-proof" (e.g. adding a general-purpose auxiliary structure in

Now I think we know what logic to implement; the real challenge is how to do it with
Yes! That is very annoying.
Exactly!
Definitely! BTW: I was writing a long reply (see above) before seeing your message, so it may seem to repeat your idea, but it's actually because we're thinking similarly 😄
Could you write down a specific example of positions and indices that we need to track, so I can better understand the current limitation of the

For example: 3 text tokens, followed by 5 image embeddings, followed by 4 text tokens.
Thanks! @ggerganov
Thanks. I still don't understand why this is incompatible with the existing logic. We can process this using 3 separate batches:
Yes, but within the image chunk, when calculating the attention mask, the masking code does:

```cpp
const llama_pos p1 = ubatch->pos[i];

const uint64_t idst = n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii);

for (uint32_t j = 0; j < n_kv; ++j) {
    if (cells.is_empty(j)) {
        continue;
    }

    // mask the token if not the same sequence
    if (!cells.seq_has(j, seq_id)) {
        continue;
    }

    const llama_pos p0 = cells.pos_get(j);

    // mask future tokens
    if (causal_attn && p0 > p1) {
        continue;
```

Suppose p1 represents the first image token and p0 represents the second image token (a future token). Since p0 == p1 == 5, it never hits the `continue` that masks future tokens, so the future token stays visible. (I'm talking about the problem in
Got it. I think the approach in this PR would work, until

Edit: to clarify, what I imagine is that with the new logic, we will process the tokens above like this:

```
# batch 1 (5 text tokens)
pos: 0 1 2 3 4
     0 1 2 3 4

# batch 2 (6 image tokens)
pos: 5 6 7 8 9 10
     5 5 5 5 5 5
   + [any additional vision positions if needed]

# batch 3 (3 text tokens)
pos: 11 12 13
     7  8  9
```

I.e. the first
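A tiny illustration of how those two tracks could be produced (not llama.cpp API; the offset applied after the image follows the max(h, w) rule discussed above, which would give 8 9 10 for a 2x3 grid rather than the 7 8 9 sketched here - exactly the detail still being settled):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> causal_idx; // strictly increasing, drives the causal mask
    std::vector<int> rope_pos;   // temporal M-RoPE component, repeats over an image

    int idx = 0, pos = 0;

    auto add_text = [&](int n) {
        for (int i = 0; i < n; ++i) { causal_idx.push_back(idx++); rope_pos.push_back(pos++); }
    };
    auto add_image = [&](int grid_h, int grid_w) {
        for (int i = 0; i < grid_h * grid_w; ++i) { causal_idx.push_back(idx++); rope_pos.push_back(pos); }
        pos += std::max(grid_h, grid_w); // next text continues after the largest grid axis
    };

    add_text(5);        // batch 1
    add_image(2, 3);    // batch 2: 6 image tokens
    add_text(3);        // batch 3

    for (size_t i = 0; i < causal_idx.size(); ++i) {
        printf("%2d %2d\n", causal_idx[i], rope_pos[i]);
    }
    return 0;
}
```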
That's exactly what we need! @ggerganov Actually, in @FMayran's branch we already have the second line (but modified). Maybe the best thing to do is wait until the new logic is ready outside
Now Qwen3-VL is released. 2D grounding now uses relative coordinates, and M-RoPE becomes MRoPE-Interleave. Transformers PR: huggingface/transformers#40795
I was facing a similar issue where Qwen 2.5-VL 3B was giving terrible OCR performance compared to the transformers implementation (see #16334). This PR fixed the issue; I would love to see it merged to mainline!
May I ask how we can help to launch/accelerate the implementation of this specific improvement? By searching for the text
I'll have a look into this next week. I'm already aware that the next position after the image must be increased by
Hi, if my understanding is correct, this PR will also be helpful for #16600. +1 for merging when ready.
Yes! But first we need to add an internal token index (for the causal check only, i.e. the first line in @ggerganov's illustration) that's independent of the current

Also, do we have a way to seriously validate the output (e.g. with, for example,
Fixes #13694