
Conversation

@SmartestWashingMachine (Contributor)

I noticed that a Q4_1 Qwen3-VL 2B mmproj reserved a surprisingly large amount of memory for one of its compute buffers during the image warmup step, larger than the model itself!

Looking at the code, it seems that image warmup sizes are hard-coded per model. For Qwen3-VL it's 2116 tokens, corresponding to a 1472 x 1472 image. If I understand correctly, llama.cpp initially reserves compute-buffer memory proportional to the size of that warmup image.
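
For a rough sense of where 2116 comes from, here is a back-of-the-envelope sketch, assuming a 16-pixel vision patch and a 2x2 spatial merge (these values are inferred from the numbers quoted above, not read from the code):

```cpp
#include <cstdio>

// Rough token count for a square image, assuming a 16-pixel patch
// and a 2x2 spatial merge (assumed values, not read from llama.cpp).
static int image_tokens(int side_px, int patch = 16, int merge = 2) {
    const int patches_per_side = side_px / patch;          // 1472 / 16 = 92
    const int tokens_per_side  = patches_per_side / merge; // 92 / 2 = 46
    return tokens_per_side * tokens_per_side;              // 46^2 = 2116
}

int main() {
    printf("1472 x 1472 -> %d tokens\n", image_tokens(1472)); // 2116
    printf(" 512 x  512 -> %d tokens\n", image_tokens(512));  //  256
    return 0;
}
```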

But some users may know for certain that their images will never exceed certain dimensions (e.g. OCR'ing single lines of text, or a preprocessing pipeline that caps images at 512 x 512, which works out to roughly 256 tokens under the assumptions in the sketch above), so they may want a smaller maximum warmup image size to reduce memory consumption.

Cutting a few hundred MB doesn't sound like much on its own, but it can help when developing on edge devices.

Before (Initial behavior)

[screenshot "highmem": large compute-buffer reservation]

With --image-warmup-tokens 256

[screenshot "lowmem": reduced compute-buffer reservation]

@ngxson (Collaborator) commented Dec 1, 2025

I'd prefer having a warmup option like in libllama. image-warmup-tokens is quite low-level I think; it's probably not very future-proof, as we may have other warmup strategies in the future.

The warmup option should match common_params::warmup.
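
A minimal sketch of what that could look like, assuming a hypothetical boolean field on the mtmd context params that mirrors common_params::warmup (the struct and function names below are illustrative, not the actual mtmd API):

```cpp
// Hypothetical sketch only: the real mtmd API may differ.
struct mtmd_context_params_sketch {
    bool warmup = true; // mirrors common_params::warmup in common/common.h
};

static void mtmd_warmup_sketch(const mtmd_context_params_sketch & params) {
    if (!params.warmup) {
        return; // no dummy encode; compute buffers grow on the first real image
    }
    // encode a dummy image at the model's hard-coded maximum size
    // (e.g. 1472 x 1472 for Qwen3-VL) so buffers are reserved up front
}
```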

@ngxson (Collaborator) commented Dec 1, 2025

Superseded by #17652

It will be more suitable for your use case, as you know the image size in advance, not the number of tokens.

@ngxson closed this Dec 1, 2025
@SmartestWashingMachine (Contributor, Author)

Oh, that's even better. Thanks for taking the time to look into this!
