
Conversation

rujialiu

Fixes #13694

Append mrope positions to the traditional llm pos, otherwise causal atten masking will break. See ggml-org#13694. This also allows us to remove some previous workarounds
@rujialiu
Author

This is my first PR. Thanks for your help! @FMayran @ggerganov @ngxson

@FMayran

FMayran commented Aug 21, 2025

The modification in llama-graph.cpp deserves a comment regarding why we add n_tokens (--> because we want to start with the second component of pos, since the first component is the linearly increasing position, irrelevant to actual positional encoding).
Same in mtmd-helper (both times): we can say that the first n_tokens of pos are always linearly increasing, and the rest holds the positional-encoding components.
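
For illustration, a minimal sketch of the pos layout this comment refers to, with hypothetical names (the real llama.cpp code indexes batch.pos directly rather than through a helper like this):

#include <cstdint>

using llama_pos = int32_t;

// Hypothetical view of batch.pos under the convention described above:
//   [0,          n_tokens) -> linearly increasing token position (used for causal masking)
//   [n_tokens, ...        ) -> the M-RoPE position sections (temporal, height, width, ...)
struct mrope_pos_view {
    const llama_pos * pos;      // batch.pos
    int64_t           n_tokens; // number of tokens in the batch

    // first section: the plain sequential position, irrelevant to the actual positional encoding
    llama_pos token_index(int64_t i) const { return pos[i]; }

    // M-RoPE sections start at offset n_tokens, hence the "+ n_tokens" discussed above
    // (section 0 = temporal, 1 = height, 2 = width, ...)
    llama_pos mrope(int64_t i, int section) const {
        return pos[n_tokens + (int64_t) section * n_tokens + i];
    }
};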

@rujialiu
Author

rujialiu commented Aug 21, 2025 via email

@rujialiu
Author

Added some comments as suggested by @FMayran
Thanks @broadbit-hu for the "dependency on batch size" observation and for intensively testing this PR. It's now ready for review @ggerganov @ngxson

@rujialiu
Author

rujialiu commented Aug 24, 2025

Note: this PR's output matches transformers and vllm when given exactly the same prompt.

However, in the current tests the prompts for vllm put the image before the text, but mtmd-cli currently inserts the image after the text, so the result looks slightly worse than vllm's. If we move the image before the text in mtmd-cli, the result is nearly identical to vllm's.

For details, see:
#13694 (comment)

@broadbit-hu

@ggerganov This PR contains an important fix and llama.cpp has improved significantly with it. What do we need to do to start the approval process?

return 1; // for M-RoPE, the whole image is 1 in temporal dimension
}
return image_tokens->n_tokens();
}
Collaborator

Removing this will cause unexpected behavior

@rujialiu
Author

rujialiu commented Aug 25, 2025 via email

@ngxson
Collaborator

ngxson commented Aug 25, 2025

Removing this causes whatever tokens come after the image (for example, text tokens) to suddenly jump in position.

Given example of a sequence: 10 text tokens, 5 image tokens then 1 text token

The last 1 text token should have position 10+1 = 11, not 10+5

[image attachment]

@rujialiu
Author

rujialiu commented Aug 26, 2025

Removing this causes whatever tokens come after the image (for example, text tokens) to suddenly jump in position.

Ah I understand what you mean now. Thanks.

Before my fix, removing this would indeed lead to sudden jumps, but after my fix the first dimension of pos reverts to its original meaning: the auto-incrementing "serial number" of each token, so it no longer jumps.
This also allows me to remove another workaround in llama-batch.cpp's check (see my comment above).

@rujialiu
Author

Looking at your comment again, I think I misunderstood you. So I'm replying again.

Given example of a sequence: 10 text tokens, 5 image tokens then 1 text token

The last 1 text token should have position 10+1 = 11, not 10+5

(I can't see your posted image due to proxy settings so sorry if your image contains the answer)
Why should it be 10+1 instead of 10+5? Is it llama.cpp's convention, or something followed by all other inference software (vllm, sglang, transformers, etc.)? I tried, but couldn't find any reference...

Actually I was also trying to preserve this behavior (repeating the position for all image tokens), but couldn't find a way to fix causal attention masking properly. It looks like the code assumes the positions are increasing without repetition?

@Hoernchen

This patch is like the missing glasses for Qwen VL. I was looking for vision issues because it kept referring to arms as legs at times, and when pushed for detailed descriptions the whole anatomy was more Lovecraftian horror than human. Looks like I found the right one: that problem appears to be fixed by this patch. Continuing the chat with a second message containing a different image doesn't really work, but that is probably a different issue.

@rujialiu
Author

rujialiu commented Sep 2, 2025

This patch is like the missing glasses for Qwen VL. I was looking for vision issues because it kept referring to arms as legs at times, and when pushed for detailed descriptions the whole anatomy was more Lovecraftian horror than human. Looks like I found the right one: that problem appears to be fixed by this patch. Continuing the chat with a second message containing a different image doesn't really work, but that is probably a different issue.

Thanks for the info! I'm not an expert in LLMs (nor in the llama.cpp codebase), so I'm mainly targeting vLLM/transformers parity first, then trying to understand whether it's the "correct" way.

Could you please provide more information about your use case, including the second message that you think "is a different issue"? If it's really another issue, I'll try to investigate and make another PR. @Hoernchen

@Hoernchen

The usage was classifying the poses of people including visibility of limbs/hands. I thought it can't be that hard, right? Well.

Pose models like mediapipe/vitpose/sapiens tend to hallucinate depending on cropping/clothes and like to place imaginary hands somewhere and do not offer reliable visibility scores, so the idea was to give a "proper" vision model a shot. The vision models can be better when prompted like "find the shoulders, follow the arms, to the hands, are the hands visible/part of the image/obscured". That obviously only works if the image is not somehow garbled. It's even more obvious if you try to "help" the model using a depth image (or two images, color+depth) generated using https://github.com/DepthAnything/Depth-Anything-V2 . This is basically a quick, poor man's "segmentation" without having to rely upon actual accurate segmentation, and the color gradient the model describes around the "shoulders" does not match the image but appears to be from a different part of the image.

As for the second message, I don't know. Using llama.cpp with Qwen VL works if you start a new chat with your instructions and one image to describe, but if you continue the conversation and add another image to the second message it fails to describe it. I don't know if that is just because the model is stubborn, because the second image is garbled again, or maybe because there is no second image and the model just hallucinates completely. That is unfortunately hard to debug.

Adding multiple images to the first prompt at once appears to have issues as well; I don't recall whether this was Qwen or Gemma 3, but it led to descriptions of "all" 5 to 13 images - with 8 actual images! Some of those descriptions didn't match any image, and some appeared to be repeats.

@rujialiu
Author

rujialiu commented Sep 3, 2025

Thanks for the explanation! I understand the basic ideas now. I wonder if you could post an example here (or send it to me via email) so that I can first run it with vLLM and transformers, to know whether it's a limitation of the model or just a llama.cpp issue.

Actually the exact format of the conversation (including the order of messages) matters a lot, so without the detailed messages it's hard to say.

@Googulator

@ngxson: what is currently needed for this to be merged? We're currently forced to use a build of llama.cpp built from this PR's code, eagerly awaiting a proper release. If needed, I am willing to help.

@FMayran

FMayran commented Sep 10, 2025

@rujialiu

Looking at your comment again, I think I misunderstood you. So I'm replying again.

Given example of a sequence: 10 text tokens, 5 image tokens then 1 text token
The last 1 text token should have position 10+1 = 11, not 10+5

(I can't see your posted image due to proxy settings so sorry if your image contains the answer) Why should it be 10+1 instead of 10+5? Is it llama.cpp's convention, or something followed by all other inference software (vllm, sglang, transformers, etc.)? I tried, but couldn't find any reference...

Actually I was also trying to preserve this behavior (repeating the position for all image tokens), but couldn't find a way to fix causal attention masking properly. It looks like the code assumes the positions are increasing without repetition?

I just had a look inside transformers' implementation, in the function Qwen2_5_VLModel.get_rope_index().

What it does is the following for text located after an image: token_pos = llm_pos_ids_list[-1].max() + 1, where llm_pos_ids_list[-1] is essentially [expand(0..t), expand(0..h), expand(0..w)] + previous_token_pos. For images, the time part is just a tensor of zeros. For videos, it is frame_nb * seconds_per_frame * config.vision_config.tokens_per_second.

In other words, first text pos = max(h,w,t) + previous_token_pos. This is consistent with their documentation of the function " Here we calculate the text start position_ids as the max vision position_ids plus 1." but not what @ngxson said

So, even though it looks wrong, transformers' implementation is +max(h,w,t)

Edit:
What we want is for the causal mask to behave "as if" there was a linearly increasing pos, but without setting these pos values in the kv cache, to match with the transformers implementation. Right now, I have trouble understanding how the mask is computed in llama_kv_cache::set_input_kq_mask(), as the pos of an element of the batch p1 is compared to what seems to be a value of the cache p0 (but not to other pos of the current batch)
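
For reference, a minimal sketch of the "+ max(t, h, w)" rule described above (a hypothetical helper, not actual transformers or llama.cpp code; prev_pos is the position of the last token before the image, and t/h/w are the image grid sizes in merged tokens):

#include <algorithm>
#include <cstdint>

using llama_pos = int32_t;

// Position of the first text token after an image chunk, following the
// "first text pos = max(h,w,t) + previous_token_pos" rule quoted above.
static llama_pos first_text_pos_after_image(llama_pos prev_pos, int t, int h, int w) {
    return prev_pos + std::max({t, h, w});
}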

@rujialiu
Author

In other words, first text pos = max(h,w,t) + previous_token_pos. This is consistent with their documentation of the function " Here we calculate the text start position_ids as the max vision position_ids plus 1." but not what @ngxson said

Yes! It's described in Qwen2-VL's paper, section 2.1:

In scenarios where the model’s input encompasses multiple modalities, position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one

However, it's not in Qwen2.5-VL's paper so I forgot it. I only read Qwen2-VL's paper once very quickly to "get a basic idea", and Qwen2.5-VL's paper several times during implementation. Thanks for the reminder!

Edit: What we want is for the causal mask to behave "as if" there was a linearly increasing pos, but without setting these pos values in the kv cache, to match with the transformers implementation. Right now, I have trouble understanding how the mask is computed in llama_kv_cache::set_input_kq_mask(), as the pos of an element of the batch p1 is compared to what seems to be a value of the cache p0 (but not to other pos of the current batch)

Maybe we just need to change mtmd_image_tokens_get_n_pos to return max(h,w,t) and change the "sanity check" (maybe) later on? But unfortunately I'm quite busy these days. Will try to do this within a few days.

@FMayran

FMayran commented Sep 11, 2025

@rujialiu
I was thinking the same. The sanity check is tricky because the link between position in batch and position in kv cache is lost at this point, but we may be able to cheese it if we can recover where the batch is located within the kv cache.
This could be made possible by preserving the slot_info structure as a new property of the ubatch in the call to llama_kv_cache::apply_ubatch for instance.

@FMayran

FMayran commented Sep 11, 2025

OK, I have a demonstration of a working fix which preserves token pos numbering, by adding a map to the ubatch to link the ubatch token position and the position in the kv cache. It is ugly but it works. Can someone (@ngxson @rujialiu) check if it is fine for you?
I can confirm that it works even with multiple images submitted to the network and requests like: find the location in the second image of the object in the first image. I am not creating a new pull request, as this code is intended for demonstration purposes and may break some other stuff.

my fork:
https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix

@rujialiu
Author

my fork: https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix

I don't have time to test it thoroughly but after reading the code I'd say I like your fork more than my own PR 😄 @FMayran
It's a general solution that works with any "positional trick" like m-rope and doesn't require any special treatment for a particular one.

@rujialiu
Author

@Googulator Could you test @FMayran 's fork above? It looks quite promising and I wish it would be used instead of mine.

@Googulator

Tested @FMayran 's vs @rujialiu 's vs the mainline version:

$ cd ~/llama.cpp-fmayran  # commit 9a8e8813311cd8b6aeb11d7e527c794b50876071
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 1 | tee ../fmayran-b1.txt
[...]
´´´json
[
        {"bbox_2d": [173, 687, 484, 859], "label": "rectangle"},
        {"bbox_2d": [317, 518, 621, 794], "label": "rectangle"},
        {"bbox_2d": [723, 707, 810, 776], "label": "rectangle"}
]
´´´
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048 | tee ../fmayran-b2048.txt
[...]
´´´json
[
        {"bbox_2d": [173, 687, 484, 859], "label": "rectangle"},
        {"bbox_2d": [317, 518, 621, 794], "label": "rectangle"},
        {"bbox_2d": [723, 705, 811, 778], "label": "rectangle"}
]
´´´
**Answer:** There are 3 rectangles in the image.
$ cd ~/llama.cpp-rujialiu  # commit 7db161a8b6673b2269f815f06b176c4c67b0ed54
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 1 | tee ../rujialiu-b1.txt
[...]
´´´json
[
        {"bbox_2d": [176, 548, 484, 859], "label": "rectangle"},
        {"bbox_2d": [317, 518, 594, 762], "label": "rectangle"},
        {"bbox_2d": [740, 707, 811, 773], "label": "rectangle"}
]
´´´
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048 | tee ../rujialiu-b2048.txt
[...]
´´´json
[
        {"bbox_2d": [185, 686, 488, 863], "label": "rectangle"},
        {"bbox_2d": [316, 516, 624, 797], "label": "rectangle"},
        {"bbox_2d": [724, 707, 813, 776], "label": "rectangle"}
]
´´´
$ cd ~/llama.cpp-mainline # commit 304ac5693d1e2124f83e0584bc5eea6311d3d3b4
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 1 | tee ../mainline-b1.txt
[...]
´´´json
[
        {"bbox_2d": [173, 658, 484, 833], "label": "rectangle"},
        {"bbox_2d": [317, 520, 603, 793], "label": "rectangle"},
        {"bbox_2d": [743, 705, 810, 773], "label": "rectangle"}
]
´´´
<|im_start|>
$ ./bin/llama-mtmd-cli -m ../Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf --mmproj ../mmproj-BF16.gguf --image rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048 | tee ../mainline-b2048.txt
[...]
´´´json
[
        {"bbox_2d": [185, 761, 488, 1004], "label": "rectangle"},
        {"bbox_2d": [319, 744, 624, 984], "label": "rectangle"},
        {"bbox_2d": [723, 817, 813, 888], "label": "rectangle"}
]
´´´

(note: backticks in the output were changed to forward ticks to avoid breaking GitHub's formatting)

@Googulator

Observations:

  • As expected, coordinates with mainline are all over the place. Even with -b 1, it isn't perfect, and it completely falls apart with -b 2048. Interestingly, with this smaller 3B model, the X-dimension seemingly isn't compressed, but rather stretched.
  • @rujialiu 's branch works well with -b 2048, but breaks with -b 1.
  • @FMayran 's branch gives consistent, though not perfectly identical output, with different batch sizes. Consistency between @rujialiu 's and @FMayran 's branch is also good, except in the case where @rujialiu 's branch breaks.

@rujialiu
Author

Thanks @Googulator
It's a bit unexpected that my PR breaks with -b 1 in your test. I only tested the 7B model because the 3B model is not good enough for me even with vllm. Could you check whether -b 1 only breaks for 3B, or also breaks with the 7B model on your input? Anyway, it's probably an indication that I overlooked something.

Also, could you check FMayran's with --temp 0? I'm curious why "not perfectly identical".

BTW: Are you testing my branch as-is (which is based on an earlier version of llama.cpp), or did you apply my changes to the current master branch?

@broadbit-hu

broadbit-hu commented Sep 15, 2025

@FMayran @rujialiu Thanks for your work again!

I've tested Qwen2.5-VL 7B Q8_0 with the CPU version, just like @Googulator, using temp 0.0; see the results:

Reference vLLM results:

[
        {"bbox_2d": [195, 678, 460, 837], "color": "red"},
        {"bbox_2d": [312, 547, 508, 792], "color": "green"},
        {"bbox_2d": [629, 708, 700, 775], "color": "black"}
]

rujialiu version (#15474):

  • Batch size = 1:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 1
[
	{"bbox_2d": [168, 679, 434, 837], "label": "red rectangle"},
	{"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
  • Batch size = 2048:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048
[
	{"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
	{"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [602, 708, 671, 775], "label": "black rectangle"}
]

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix):

Some compiler warnings here:

git clone https://github.com/FMayran/llama.cpp --branch QwenVL-causal-fix
cd llama.cpp

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 16
[ 24%] Building CXX object src/CMakeFiles/llama.dir/llama-mmap.cpp.o
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp: In member function ‘bool llama_batch_allocr::init(const llama_batch&, const llama_vocab&, const llama_memory_i*, uint32_t, uint32_t, bool)’:
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp:227:9: warning: missing initializer for member ‘llama_ubatch::kv_position_of_token’ [-Wmissing-field-initializers]
  227 |         };
      |         ^
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp: In member function ‘llama_ubatch llama_batch_allocr::ubatch_reserve(uint32_t, uint32_t’:
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp:402:5: warning: missing initializer for member ‘llama_ubatch::kv_position_of_token’ [-Wmissing-field-initializers]
  402 |     };
      |     ^
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp: In member function ‘llama_ubatch llama_batch_allocr::ubatch_add(const std::vector<int>&, uint32_t, bool)’:
/home/collaigue/test/mtmd/llama.cpp-fmayran/src/llama-batch.cpp:723:5: warning: missing initializer for member ‘llama_ubatch::kv_position_of_token’ [-Wmissing-field-initializers]
  723 |     };
      |     ^
  • Batch size = 1:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 1
[
	{"bbox_2d": [168, 679, 462, 837], "color": "red"},
	{"bbox_2d": [284, 575, 480, 792], "color": "green"},
	{"bbox_2d": [601, 708, 672, 775], "color": "black"}]
  • Batch size = 2048:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048
[
	{"bbox_2d": [168, 679, 462, 837], "color": "red"},
	{"bbox_2d": [284, 575, 480, 792], "color": "green"},
	{"bbox_2d": [601, 708, 672, 775], "color": "black"}
]

There are no significant differences in the results, but 'label' vs 'color' is a little bit unexpected.


Another test using F16 GGUF:

rujialiu version (#15474):

./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-F16.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048
[
	{"bbox_2d": [168, 679, 434, 837], "label": "red rectangle"},
	{"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [602, 708, 672, 775], "label": "black rectangle"}
]

Another test using AMD 7900XTX GPU (ROCm 6.3):

  • Q8_0:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048 -ngl 99
[
	{"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
	{"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
  • F16:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-F16.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048 -ngl 99
[
	{"bbox_2d": [168, 679, 434, 865], "label": "red rectangle"},
	{"bbox_2d": [284, 575, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [599, 708, 672, 803], "label": "black rectangle"}
]

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix):

./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-F16.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.0 -c 2048 -b 2048
[
	{"bbox_2d": [168, 679, 462, 837], "color": "red"},
	{"bbox_2d": [284, 575, 480, 792], "color": "green"},
	{"bbox_2d": [601, 708, 672, 775], "color": "black"}
]

| Version | Batch Size | bbox_2d-1 | bbox_2d-2 | bbox_2d-3 |
|---|---|---|---|---|
| vLLM Reference (BF16) | 1 | [195, 678, 460, 837] | [312, 547, 508, 792] | [629, 708, 700, 775] |
| rujialiu v15474 (Q8_0) | 1 | [168, 679, 434, 837] | [284, 575, 480, 792] | [602, 708, 672, 775] |
| rujialiu v15474 (F16) | 1 | [168, 679, 434, 837] | [284, 575, 480, 792] | [602, 708, 672, 775] |
| rujialiu v15474 (Q8_0) | 2048 | [168, 679, 433, 836] | [284, 575, 480, 792] | [602, 708, 671, 775] |
| rujialiu v15474 (F16) | 2048 | [168, 679, 434, 865] | [284, 575, 480, 792] | [599, 708, 672, 803] |
| rujialiu v15474 (Q8_0), AMD 7900XTX GPU (ROCm 6.3) | 2048 | [168, 679, 433, 836] | [284, 575, 480, 792] | [599, 708, 672, 775] |
| rujialiu v15474 (F16), AMD 7900XTX GPU (ROCm 6.3) | 2048 | [168, 679, 434, 865] | [284, 575, 480, 792] | [599, 708, 672, 803] |
| FMayran QwenVL-causal-fix (Q8_0) | 1 | [168, 679, 462, 837] | [284, 575, 480, 792] | [601, 708, 672, 775] |
| FMayran QwenVL-causal-fix (Q8_0) | 2048 | [168, 679, 462, 837] | [284, 575, 480, 792] | [601, 708, 672, 775] |
| FMayran QwenVL-causal-fix (F16) | 2048 | [168, 679, 462, 837] | [284, 575, 480, 792] | [601, 708, 672, 775] |

Notes:

  • The "label" vs "color" discrepancy between "rujialiu" and "FMayran" versions appears to be a minor difference in how the results are labeled.

@rujialiu
Author

@broadbit-hu Thanks for the tests again! Did you change llama-mtmd-cli's image/text relative order to match the request sent to vllm, as I said in #13694? In my testing that has something to do with "color"/"label". But still, it's strange that batch=1 is different. In my tests it was always identical to batch=2048 :(

@broadbit-hu

broadbit-hu commented Sep 15, 2025

llama.cpp-rujialiu$ git show -s
commit 54aa805ba329523410ca9e178b889c14cf97ae87 (HEAD -> mrope-fix)
Author: Rujia Liu <rujia.liu@qq.com>
Date:   Thu Aug 21 21:11:36 2025 +0800

    Append mrope positions to the traditional llm pos, otherwise causal atten masking will break. See #13694. This also allows us to remove some previous workarounds

@broadbit-hu Thanks for the tests again!
Did you change llama-mtmd-cli's image/text relative order to match the request sent to vllm, as I said in #13694 ?

I changed the order of the image and prompt arguments in the command line, but I think it doesn’t matter (same results).

In my testing that has something to do with "color"/"label".

Does your test differ from mine?

3 weeks ago:

[
	{"bbox_2d": [168, 679, 433, 836], "label": "red rectangle"},
	{"bbox_2d": [284, 547, 480, 792], "label": "green rectangle"},
	{"bbox_2d": [599, 708, 672, 775], "label": "black rectangle"}
]
[
	{"bbox_2d": [166, 678, 462, 838], "color": "red"},
	{"bbox_2d": [310, 546, 509, 794], "color": "green"},
	{"bbox_2d": [629, 706, 700, 802], "color": "black"}
]

But still, it's strange that batch=1 is different. In my tests it was always identical to batch=2048 :(

There are no significant differences in the results between batch size 1 and 2048 in my test; there's no problem with temperature 0.

@broadbit-hu

broadbit-hu commented Sep 15, 2025

The non‑zero (e.g., 0.6) temperature produces very strange coordinates with @rujialiu's version:

rujialiu version (#15474):

  • Batch size = 2048
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048
[
	{"bbox_2d": [628, 703, 700, 777], "label": "rectangle"},
	{"bbox_2d": [167, 677, 436, 838], "label": "rectangle"},
	{"bbox_2d": [283, 545, 481, 793], "label": "rectangle"}
]
  • Batch size = 1
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 1
[
	{"bbox_2d": [169, 680, 436, 837], "label": "red rectangle"},
	{"bbox_2d": [284, 576, 481, 796], "label": "green rectangle"},
	{"bbox_2d": [598, 708, 674, 777], "label": "black rectangle"}
]

FMayran version (https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix):

  • CPU:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048
[
	{"bbox_2d": [169, 674, 464, 840], "color": "red"},
	{"bbox_2d": [311, 573, 508, 795], "color": "green"},
	{"bbox_2d": [604, 706, 671, 775], "color": "black"}
]
  • GPU:
./build/bin/llama-mtmd-cli -m ../models/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --image ../images/rectangles1024_flip.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 99 --temp 0.6 -c 2048 -b 2048 -ngl 99
[
	{"bbox_2d": [169, 674, 461, 838], "color": "red"},
	{"bbox_2d": [285, 575, 481, 795], "color": "green"},
	{"bbox_2d": [599, 708, 671, 776], "color": "black"}
]

@rujialiu
Author

I changed the order of the image and prompt arguments in the command line, but I think it doesn’t matter (same results).

Yeah, changing the order of the CLI parameters doesn't help. The actual order is hard-coded in the code.

You need to follow my instructions in #13694 (comment)

Does your test differ from mine?

Yes, because I changed the code 😄

@rujialiu
Author

rujialiu commented Sep 16, 2025

The non‑zero (e.g., 0.6) temperature produces very strange coordinates with @rujialiu's version:

Oh! That's not good. I definitely will look into this. I hope just adding the max(w,h,t) logic will resolve this.

BTW: Colors are lost, but coordinates are reasonable (it's a permutation 😄 )

@rujialiu
Author

Ok, I've thought about it carefully. If we aim to 100% reproduce the behavior of transformers/vllm/the official paper, then @FMayran 's approach seems mandatory, because the "position" values sent to the LLM are not increasing.

Here is an example. Suppose we have a 2x2 (merged token count) image, and there are ~100 tokens before the image, then the position values are:

98, 99, 100, 101, 102, 103, 104, 102, 103, 104, ...
  • 98, 99, 100 is from the previous text
  • 101, 102, 103, 104 is the image
  • 102, 103, 104... is the next text. The starting position is 100+max(2,2)=102

Now it's clear. We cannot use these non-increasing position values to check causality. My PR reduced the incorrect causal attention masking, but did not fully eliminate it.

Conclusion: I propose to use @FMayran 's solution in favour of mine. However, we'd better figure out a way to avoid changing ubatch, or at least do it in a gentler way. Now I'm feeling more knowledgeable. Could you @ngxson give some advice when you're less busy?

@ngxson
Collaborator

ngxson commented Sep 16, 2025

In other words, first text pos = max(h,w,t) + previous_token_pos. This is consistent with their documentation of the function " Here we calculate the text start position_ids as the max vision position_ids plus 1." but not what @ngxson said

So, even though it looks wrong, transformers' implementation is +max(h,w,t)

My knowledge about this part was from the illustration in Qwen's paper, so I could be wrong. max(h,w) should make more sense, as the h and w components are rotated by the h and w amounts respectively, so the next embedding (aka next token) should be rotated outside of this range, hence previous_token_pos + max(h,w).

But in any case, the next token should not be previous_token_pos + h*w; I already tested this and it gives incoherent results.

Comment on lines -1027 to -1029
if (image_tokens->use_mrope_pos) {
return 1; // for M-RoPE, the whole image is 1 in temporal dimension
}
Collaborator

So based on what has been discussed so far, here we should return max(h, w).

However, I don't know how it will impact the rest of the code. Make sure to test it.

@rujialiu
Author

rujialiu commented Sep 16, 2025

hmmm... I think I have come up with an ugly way to avoid changing ubatch. In my PR, I already changed pos from 4D (just m-rope) to 5D (regular pos + m-rope). I just need to change it to 6D (regular pos + chunk_id + m-rope), and then I'll probably be able to use it in the causal attention mask calculation. Is it worth trying? @FMayran @ngxson

Edit: Failed. Still need to change ubatch.

@FMayran

FMayran commented Sep 16, 2025

hmmm... I think I have come up with an ugly way to avoid changing ubatch. In my PR, I already changed pos from 4D (just m-rope) to 5D (regular pos + m-rope). I just need to change it to 6D (regular pos + chunk_id + m-rope), and then I'll probably be able to use it in the causal attention mask calculation. Is it worth trying? @FMayran @ngxson

Edit: Failed. Still need to change ubatch.

That was actually my first solution, until I realised that pos in the cache has two meanings:

  1. a strictly increasing index for causal matching, which IMO should be viewed as the "age" of the token, or its order of insertion into the cache. This should be strictly increasing.
  2. the base position for positional encoding. This should be increasing between modalities, but not strictly (and maybe it does not need to be increasing at all).

In the code, "n_past" clearly assumes that the number of tokens past (the "age") is also the base position for positional encoding, which is wrong, at least for Qwen2.5VL. The mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) function is just a hotfix of that meaning, but can't solve the underlying issue, as it does not, and cannot, fix both meanings. We either break causal masking, or break encoding of the next chunk.

Also, the cache's pos is a single value per token, as opposed to multiple, so no solution based on having multiple pos per token in the ubatch will fix that (we lost access to multiple pos in the ubatch at this point because of the lack of linkage between ubatch and cache which I was forced to add in my fix). If we add the linkage like I did, then pos is not necessary anymore :-).

More generally, I think it is important to keep the link between tokens in the ubatch and these tokens inserted in the cache until all these tokens have been decoded, not only for Qwen but for all models with positional encoding. Indeed, causal masking should only depend on the position in the ubatch, not on positional encoding, and not on anything other than the ubatch in the cache (the meaning is to mask future tokens, these necessarily come from the ubatch we are processing unless we want to rewrite the past computations).
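
A sketch of the separation argued for here, with hypothetical names (not actual llama.cpp structures): one strictly increasing insertion index used for causal masking, and a separate model-defined value used for positional encoding.

#include <cstdint>

using llama_pos = int32_t;

// Hypothetical per-token record separating the two meanings of "position" described above.
struct token_pos_info {
    llama_pos kv_index;  // meaning 1: order of insertion into the KV cache,
                         //            strictly increasing, used for causal masking
    llama_pos rope_pos;  // meaning 2: base position for positional encoding,
                         //            may repeat (image tokens) or jump (max(h,w) rule)
};

// Causal masking compares insertion order only, never rope_pos.
static bool mask_future(const token_pos_info & cached, const token_pos_info & current) {
    return cached.kv_index > current.kv_index;
}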

@FMayran

FMayran commented Sep 16, 2025

@broadbit-hu warnings due to struct initialization should be gone in my new commit. The kv_position_of_token should probably be moved to the "data" class for storage, with preallocation instead of my lazy-allocation scheme, but this is not a serious matter.
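
(For context, a warning like the one quoted earlier is typically silenced by listing the new member in every aggregate initializer, or by giving it a default member initializer; a minimal illustration with hypothetical types, not the actual llama.cpp structs:)

#include <cstdint>
#include <vector>

// Illustration only: a struct that gained a new member, as with
// llama_ubatch::kv_position_of_token in the warnings quoted above.
struct example_ubatch {
    int32_t              n_tokens;
    std::vector<int32_t> kv_position_of_token; // newly added member
};

static example_ubatch make_ubatch(int32_t n_tokens) {
    example_ubatch ub = {
        /*n_tokens             =*/ n_tokens,
        /*kv_position_of_token =*/ {}, // listing it explicitly avoids -Wmissing-field-initializers
    };
    return ub;
}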

@rujialiu
Author

rujialiu commented Sep 16, 2025

@ggerganov After several weeks of discussions and joint effort from many people, here is the summary of the current situation to save your time. No need to check previous messages in this PR (except @FMayran 's last long message because it didn't exist when I was writing this):

The root cause is that the meaning of the word "position" is "polymorphic" and multi-functional:

  1. it's the value to calculate positional embedding
  2. it's the number of tokens before the current one (call it token_idx), so it's strictly increasing and the values are always consecutive
  3. it's a causal value, so it can be used to check causality.

However, in MLLMs points 2 and 3 above are no longer true: the position can repeat and "jump".

In this PR, I tried to maintain a linearly-increasing pos sequence for M-RoPE (so it becomes 5D under the hood) and use it for the causal attention mask calculation. The result seems to be much better (see all the tests above), but it is still incorrect according to the paper and the transformers implementation: after an image chunk, the next text token's position should be previous_token_pos + max(h,w). However, if I incorporate this logic, the position values are inevitably "broken" and cannot be used for causality checks anymore.

@FMayran's code produces similar and sometimes better results than mine, and it looks like a proper solution, except that it adds a temporary vector to ubatch, which should be avoided.

Also, I wonder whether some kind of internal assumption (increasing? unique?) of the kv cache is already violated. Should the kv cache use a strictly increasing position instead?

Anyway, here is the only way I can think of (if we can't change ubatch); it is quite non-trivial and error-prone:

  • Just like n_past, we need to know the number of tokens so far during the whole inference in order to calculate token_idx, so we need to add a parameter to the related functions
  • Add another dimension in batch's pos and store token_idx in it (m-rope becomes 5D as it already is in my PR, but text tokens in MLLM are also 2D now.)
  • ubatch and the kv cache both use token_idx, so the causal attention mask automatically uses token_idx.

Can I take this route? Or do you prefer @FMayran's approach (which is much cleaner and simpler), but trying to make the change more "future-proof" (e.g. adding a general-purpose auxiliary structure to ubatch that can be modified in some future difficult situation)?

Now I think we know what logic to implement; the real challenge is how to do it in the llama.cpp codebase in the best way.

@rujialiu
Author

In the code, "n_past" clearly assumes that the number of tokens past (the "age") is also the base position for positional encoding, which is wrong, at least for Qwen2.5VL. The mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) function is just a hotfix of that meaning, but can't solve the underlying issue, as it does not, and cannot, fix both meanings. We either break causal masking, or break encoding of the next chunk.

Yes! That is very annoying.

Also, the cache's pos is a single value per token, as opposed to multiple, so no solution based on having multiple pos per token in the ubatch will fix that (we lost access to multiple pos in the ubatch at this point because of the lack of linkage between ubatch and cache which I was forced to add in my fix). If we add the linkage like I did, then pos is not necessary anymore :-).

Exactly!

More generally, I think it is important to keep the link between tokens in the ubatch and these tokens inserted in the cache until all these tokens have been decoded, not only for Qwen but for all models with positional encoding. Indeed, causal masking should only depend on the position in the ubatch, not on positional encoding, and not on anything other than the ubatch in the cache (the meaning is to mask future tokens, these necessarily come from the ubatch we are processing unless we want to rewrite the past computations).

Definitely!

BTW: I was writing a long reply (see above) before seeing your message. So it may seem to be repeating your idea. But it's actually because we're thinking similarly 😄

@ggerganov
Member

Could you write down a specific example of positions and indices that we need to track, so I can understand better what is the current limitation of the llama_batch?

For example: 3 text tokens, followed by 5 image embeddings, followed by 4 text tokens.

@rujialiu
Author

Thanks! @ggerganov
Here is an example with 5 text tokens, 6=2*3 image tokens, then 3 text tokens. The image's width and height are 2 and 3, so the second text chunk's starting pos is 4+max(w,h)=4+3=7

T:0 T:1 T:2 T:3 T:4 I:5 I:5 I:5 I:5 I:5 I:5 T:7 T:8 T:9

@ggerganov
Member

Thanks. I still don't understand why this is incompatible with the existing logic. We can process this using 3 separate batches:

batch 1: pos [0 ... 4]
batch 2: pos [5,5,5,5,5,5]
batch 3: pos [7 ... 9]

@rujialiu
Author

Thanks. I still don't understand why this is incompatible with the existing logic. We can process this using 3 separate batches:

batch 1: pos [0 ... 4]
batch 2: pos [5,5,5,5,5,5]
batch 3: pos [7 ... 9]

Yes, but within the image chunk, when calculating the attention mask in llama_kv_cache::set_input_kq_mask:

                const llama_pos p1 = ubatch->pos[i];

                const uint64_t idst = n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii);

                for (uint32_t j = 0; j < n_kv; ++j) {
                    if (cells.is_empty(j)) {
                        continue;
                    }

                    // mask the token if not the same sequence
                    if (!cells.seq_has(j, seq_id)) {
                        continue;
                    }

                    const llama_pos p0 = cells.pos_get(j);
                    // mask future tokens
                    if (causal_attn && p0 > p1) {
                        continue;

Suppose p1 represents the first image token and p0 represents the second image token (a future token): since p0 == p1 == 5, it will not hit the continue, so the future token is not masked.

(I'm talking about the problem in master, not in my PR. My PR avoids this but violates the max(w,h,t) rule, so the result is suboptimal.)
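
A sketch of the comparison that would behave as intended here, assuming the kind of ubatch-to-cache linkage discussed elsewhere in this thread (the function and parameter names below are hypothetical, not existing llama.cpp code):

#include <cstdint>

// kv_idx_cached is the insertion index of the cached cell, kv_idx_current the
// insertion index of the current batch token. With repeated rope positions
// (p0 == p1 == 5), a "p0 > p1" check cannot mask a later image token for an
// earlier one; comparing insertion order does.
static bool should_mask(bool causal_attn, int64_t kv_idx_cached, int64_t kv_idx_current) {
    return causal_attn && kv_idx_cached > kv_idx_current;
}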

@ggerganov
Member

ggerganov commented Sep 17, 2025

Got it. I think the approach in this PR would work, until llama_batch is improved in the future. But you have to make sure that none of the other vision models are broken and update if necessary. Also, I think that llama-server logic around server_tokens specifically would be possible to simplify using this new convention for the llama_batch.pos - but haven't thought too deeply about this yet.

Edit: to clarify, what I imagine is that with the new logic, we will process the tokens above like this:

# batch 1 (5 text tokens)
pos: 0 1 2 3 4
     0 1 2 3 4

# batch 2 (6 image tokens)
pos: 5 6 7 8 9 10
     5 5 5 5 5 5
   + [any additional vision positions if needed]

# batch 3 (3 text tokens)
pos: 11 12 13
     7  8  9

I.e the first n_tokens positions are always increasing and represent the position of the token in the sequence - used for causal masking. The second n_tokens positions are the positions of the tokens/embeddings in the embedding space - used in the RoPE.
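
For illustration, a sketch that builds the two rows from the example above (hypothetical structure; the real llama_batch API may end up looking different):

#include <cstdint>
#include <vector>

using llama_pos = int32_t;

// Two position rows per the convention illustrated above:
//   seq_idx:  0 1 2 3 4 | 5 6 7 8 9 10 | 11 12 13   (causal masking)
//   rope_pos: 0 1 2 3 4 | 5 5 5 5 5 5  |  7  8  9   (RoPE / M-RoPE base)
struct dual_pos {
    std::vector<llama_pos> seq_idx;
    std::vector<llama_pos> rope_pos;
};

static dual_pos build_example() {
    dual_pos p;
    llama_pos idx = 0;
    for (llama_pos r = 0; r < 5; ++r)  { p.seq_idx.push_back(idx++); p.rope_pos.push_back(r); } // batch 1: 5 text tokens
    for (int       i = 0; i < 6; ++i)  { p.seq_idx.push_back(idx++); p.rope_pos.push_back(5); } // batch 2: 6 image tokens (2x3)
    for (llama_pos r = 7; r <= 9; ++r) { p.seq_idx.push_back(idx++); p.rope_pos.push_back(r); } // batch 3: 3 text tokens
    return p;
}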

@rujialiu
Author

Got it. I think the approach in this PR would work, until llama_batch is improved in the future. But you have to make sure that none of the other vision models are broken and update if necessary. Also, I think that llama-server logic around server_tokens specifically would be possible to simplify using this new convention for the llama_batch.pos - but haven't thought too deeply about this yet.

Edit: to clarify, what I imagine is that with the new logic, we will process the tokens above like this:

# batch 1 (5 text tokens)
pos: 0 1 2 3 4
     0 1 2 3 4

# batch 2 (6 image tokens)
pos: 5 6 7 8 9 10
     5 5 5 5 5 5
   + [any additional vision positions if needed]

# batch 3 (3 text tokens)
pos: 11 12 13
     7  8  9

I.e the first n_tokens positions are always increasing and represent the position of the token in the sequence - used for causal masking. The second n_tokens positions are the positions of the tokens/embeddings in the embedding space - used in the RoPE.

That's exactly what we need! @ggerganov

Actually, in @FMayran 's branch we have the second line (but it modifies ubatch); in my PR we have the first line (and thus lose the max(w,h,t) logic).

Maybe the best thing to do is wait until the new logic is ready outside llama-server and then implement a really proper fix? Either @FMayran or I should be able to do this.

@rujialiu
Author

Now Qwen3-VL has been released. 2D grounding now uses relative coordinates, and M-RoPE becomes MRoPE-Interleave.
Since we have already spent quite a lot of time with Qwen2.5-VL, it might be a good idea to implement (or assist other people in implementing) Qwen3-VL and make sure grounding works perfectly. @FMayran

Transformers PR: huggingface/transformers#40795

@AbdullahMPrograms

AbdullahMPrograms commented Sep 29, 2025

I was facing a similar issue where Qwen2.5-VL 3B was giving terrible OCR performance compared to the transformers implementation; see #16334.

This PR fixed the issue; I would love to see it merged to mainline!

@rujialiu
Author

rujialiu commented Oct 18, 2025

Got it. I think the approach in this PR would work, until llama_batch is improved in the future. But you have to make sure that none of the other vision models are broken and update if necessary. Also, I think that llama-server logic around server_tokens specifically would be possible to simplify using this new convention for the llama_batch.pos - but haven't thought too deeply about this yet.

May I ask how we can help launch/accelerate the implementation of this specific improvement to llama_batch? @ggerganov

By searching for the text st_idx = llm_pos_ids_list[-1].max() in the transformers library, it looks like GLM-4V, Qwen2.5-VL, Qwen2.5-Omni, Qwen3-VL and Qwen3-Omni all use the max(t,h,w) logic. There is already ongoing work to support GLM-4V (#16600) and Qwen3-VL (#16207), and it will eventually face the same issue we have here.

@ngxson
Collaborator

ngxson commented Oct 18, 2025

I'll have a look into this next week. I'm already aware of the fact that the next position after the image must be increased by max(h,w), as explained in #15474 (comment)

@ddh0
Contributor

ddh0 commented Oct 18, 2025

Hi, if my understanding is correct this PR will also be helpful for #16600. +1 for merging when ready

@rujialiu
Author

I'll have a look into this next week. I'm already aware of the fact that the next position after the image must be increased by max(h,w), as explained in #15474 (comment)

Yes! But first we need to add an internal token index (for the causality check only, i.e. the first line in @ggerganov's illustration) that's independent of the current position id (fed into the LLM, increased by max(h,w), etc.). Then we can do whatever we need to faithfully implement any model-specific logic.

Also, do we have a way to seriously validate the output (e.g. against transformers)? One of the things I learnt along the way is that even if the implementation is flawed, the model can still give good responses, which may even be good enough for many people. We cannot be sure the implementation is correct only by inspecting model responses to some random/popular questions. In the long run, a good validation framework would be beneficial for implementing future multi-modal models. It can also help us make sure no other models are broken when implementing new ones.

Successfully merging this pull request may close these issues.

Eval bug: Qwen2.5-VL-7B-Instruct returns extremely inaccurate bbox coordinates
