-
Notifications
You must be signed in to change notification settings - Fork 13.6k
mtmd: refactor preprocessing + support max/min pixels #16878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
|
All tests passed |
What command and image to use for this test? |
|
@ggerganov you can use the command here: #16825 the script for overlaying bbox (for visual inspection) is also included: https://gist.github.com/ngxson/039024fb2bdaf2e3c15db702f9fddaff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
|
In my Qwen3-VL-8b testing, it seems this branch decreases the accuracy: #16842 (comment) |
tools/mtmd/clip.cpp
Outdated
| const clip_image_size scaled_size = img_tool::calc_size_preserved_ratio( | ||
| original_size, | ||
| 1, // no need to align here since we will composite onto canvas | ||
| std::min(canvas.nx, canvas.ny)); // fit into the canvas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be:
| std::min(canvas.nx, canvas.ny)); // fit into the canvas | |
| std::max(canvas.nx, canvas.ny)); // fit into the canvas |
|
Is this missing a factor of diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index 355b42435..62e64040e 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -4268,7 +4268,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
const auto & params = ctx->model.hparams;
const int n_total = clip_n_output_tokens(ctx, img);
if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
- return img->nx / (params.patch_size * 2) + (int)(img->nx % params.patch_size > 0);
+ return img->nx / (params.patch_size * 2) + (int)(img->nx % (params.patch_size * 2) > 0);
}
return n_total;
}
@@ -4276,7 +4276,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
const auto & params = ctx->model.hparams;
if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
- return img->ny / (params.patch_size * 2) + (int)(img->ny % params.patch_size > 0);
+ return img->ny / (params.patch_size * 2) + (int)(img->ny % (params.patch_size * 2) > 0);
}
return 1;
} |
This comment was marked as outdated.
This comment was marked as outdated.
|
The code above doesn't have The token count should also be correct: def align_next_multiple(value, multiple):
return ((value + multiple - 1) // multiple) * multiple
w = 320
h = 240
patch_size = 16
n_tok_x = align_next_multiple(w, patch_size*2) // (patch_size*2)
n_tok_y = align_next_multiple(h, patch_size*2) // (patch_size*2)
print(f"n_tok_x: {n_tok_x}, n_tok_y: {n_tok_y}, n_tok: {n_tok_x * n_tok_y}")
# print: n_tok_x: 10, n_tok_y: 8, n_tok: 80 |
|
@tarruda the last commit should gives a pretty much accurate result even with small input image, please give it a try
|
|
Looking better now @ngxson :
Comparing with my previous run, it is a bit worse in this invoice example because the "Quantity" column boxes seems a bit off to the right, but it is better on another example I have that is more complex than this invoice (can't share because it is a customer's screenshot, but basically an AS400 terminal with a lot more text). Here's a computer desktop example:
And a nature picture:
Overall the bounding boxes on this branch look solid, but I can't say for sure that this is better or worse than what's on master because it had similar performance on the few examples I tested. |
|
I ran the test again and seems like all models are passed. This PR should now be ready. Could you do a review @ggerganov ? For the custom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been testing quite a bit with the Qwen3 VL models and I think they work OK. I am not sure how to test the Qwen2.5 VL models since your script does not support absolute coordinates.
Based on the the discussions and tests in #16842, here are few recommendations about mtmd that I think would be nice to do in the future:
- The default number of minimum tokens for Qwen3 VL should be larger in order to work correctly with grounding tasks. Having 80 tokens vs 1024 image tokens is almost negligible for modern hardware (especially with #16837), so I don't think it's worth sparing tokens. There could be an option to change that (for example for edge/mobile use cases), but this should not be the default IMO
libmtmdcould use some more detailed logging and debugging options in order to be able to trace what happens with the images (preprocessing, patch sizes, number of tokens, etc.)- The OCR tests seem really useful - we should have a version controlled script in the
examplesthat can work both with absolute and relative coordinates
Agree, I think some kind of "cookbook" like what Qwen has could serves as both a test script + a demo |
|
Reported extreme slowdown after this PR was merged into main, #17012. I understand this PR was to give users the freedom to choose the max size/tokens for processed images. That said, something else was done that impacts performance significantly. When I match the old max size, I get multiple times slower process. See my comment here. Wondering if this is Vulkan backend specific. Note that all builds prior to b6915 don't have this behavior. |




Fix #16842 Fix #13077
Implement "smart resize", aka
max_pixels/min_pixelsfor preprocessing. This will now allow us to limit the input by number of tokens, no more hard limit on image height/width. This is particularly useful if you have an image which have large height but small width.The preprocessing of QwenVL model is also improved. We now only pad the bottom-right corner of the image, so the x/y coordinate of elements stay unchanged. This should improve the accuracy of bbox application.