Add Granite Vision Support #11794

Open
alex-jw-brooks wants to merge 37 commits into master

Conversation

@alex-jw-brooks (Contributor, author) commented Feb 10, 2025

This PR adds support for IBM's Granite Vision models, which are a LLaVA-Next variant that uses multiple feature layers from the (SigLIP) visual encoder.

It's in draft at the moment, as I'm still testing and trying to track down the cause of the failing assertion in ggml-cpu.c, which seems to be the same issue as the one reported here - but any thoughts are very welcome!

Current summary of changes:

  • Adds support for HF-format LLaVA-Next models to the llava surgery v2 scripts
  • Adds SigLIP visual encoder support to the image encoder converter in the llava examples
  • Adds clip.vision.feature_layer to the LLaVA-Next hparams; this has the same meaning as vision_feature_layer in transformers (ref). Note that it currently converts negative values to non-negative ones so that -1 can be used as "unset" in llama.cpp (see the sketch after this list)
  • Updates a couple of settings in llava.cpp / clip.cpp, namely increasing the number of patches to 10 for LLaVA-Next models, and increasing the max number of image grid pinpoints from 32 to 64, since Granite Vision models have ~50
  • Fixes a (potential) bug in LLaVA-Next models (i.e., with a llava projector) - I think the intent of the current behavior is to always use layer -2, which matches the default in transformers and many LLaVA-Next models, but the GGUF path currently seems to end up using -3
  • Adds guidance to the README for how to load and export the LLM off of the LLaVA-Next model in transformers so that it can be converted with the normal HF -> GGUF conversion script; this is a workaround for encapsulated LLMs that may not be compatible with the legacy llama converter, e.g., Granite LLMs
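
For context on the feature-layer conversion bullet above, here is a minimal sketch of the kind of index normalization being described, assuming Python-style negative indexing over the encoder's hidden layers (the exact offset convention used by the conversion script may differ):

```cpp
// Hypothetical helper, not the PR's code: map a transformers-style
// vision_feature_layer (possibly negative, e.g. -2 = second-to-last layer)
// to a non-negative index, so that -1 can be reserved to mean "unset".
static int normalize_feature_layer(int layer, int num_hidden_layers) {
    if (layer < 0) {
        layer += num_hidden_layers;  // e.g. with 24 hidden layers: -2 -> 22
    }
    return layer;
}
```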

@github-actions bot added the examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 10, 2025
  fout.add_uint32(k(KEY_ATTENTION_HEAD_COUNT, VISION), v_hparams["num_attention_heads"])
  fout.add_float32(k(KEY_ATTENTION_LAYERNORM_EPS, VISION), v_hparams["layer_norm_eps"])
- block_count = v_hparams["num_hidden_layers"] - 1 if has_llava_projector else v_hparams["num_hidden_layers"]
+ block_count = v_hparams["num_hidden_layers"]
@alex-jw-brooks (author) commented Feb 10, 2025:

This is the part I'm not quite sure about with the current code; it seems like the intention behind this, plus the way the layer loop in clip.cpp (here) is written, is to go up to the second-to-last layer, which is consistent with the default feature layer -2 in transformers and a lot of LLaVA-Next models, but it seems like this actually results in -3, unless I am misunderstanding something.

More concretely, say v_hparams["num_hidden_layers"] is 24 here: if there's a llava projector, we set the block count to 23 and pop the last layer from the model a bit further down. Then, in clip.cpp, the encoder layer loop goes up to n_layer - 1, but since that value is already decremented here, the loop runs while x < 22, i.e., it takes the output of block 21 (-3).

Since this PR needs to be able to take features from the last layer, I've updated it to not drop the last layer and to iterate up to the max vision feature layer that is set, or the second-to-last if none is set - but please feel free to let me know if this seems incorrect!
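
To make the counting above concrete, a small illustration of the loop bound being described (numbers only; this is not the actual clip.cpp code):

```cpp
// Illustration only, assuming 24 hidden layers in the visual encoder.
const int num_hidden_layers = 24;
const int n_layer = num_hidden_layers - 1;  // 23: already decremented at conversion time

for (int il = 0; il < n_layer - 1; il++) {  // il runs 0..21
    // cur = encoder_block(il, cur);
}
// The last iteration uses block 21, i.e. the third-from-last layer (-3),
// rather than the second-to-last layer (-2) that transformers defaults to.
```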

@ngxson (Collaborator) commented Feb 10, 2025:

The code you mentioned is pretty much spaghetti; you can refer to #11292, where I refactored that part.

> Since this PR needs to be able to take features from the last layer

Funny enough, someone deleted the comment explaining how to get the last layer. You can see it in this version: https://github.com/ggerganov/llama.cpp/blob/d408bb9268a988c5a60a5746d3a6430386e7604d/examples/llava/clip.cpp#L734

It would be nice if you could bring back the comment (or better, replace it with an explanation).

@alex-jw-brooks (author) replied:

Awesome, thanks a lot for the context and guidance @ngxson! That looks great - it is exciting to see things potentially moving away from the surgery scripts, with all of the vision hparams being handled the same as everything else via the gguf writer 🎉

I pulled the bit that determines which layer to iterate up to out into a separate function and added some explanation here, and rewrote the relevant parts of this PR to avoid changing the feature layer used for existing models 🙂

@@ -444,8 +445,9 @@ struct clip_hparams {

char mm_patch_merge_type[32] = "flat"; // spatial_unpad or flat (default)

- int32_t image_grid_pinpoints[32];
+ int32_t image_grid_pinpoints[64];
A contributor commented:

I know you mentioned this fixed size was potentially problematic. Can you add a comment explaining the size choice and limitations?

@alex-jw-brooks (author) replied:

Yup - I moved this and the max feature layers to be consts and added a comment up there.

This value is the size of the flattened list of any-res grid pinpoints that is supported (i.e., 32 values / 2 -> currently 16 ordered pairs). Granite Vision models have more than this - e.g., the preview model has 26 pairs - so I increased the size to prevent the extra pinpoints from being dropped!
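
As a side note on the numbers above, a small hypothetical sketch (not part of this PR) of how the flattened pinpoint list maps to (width, height) pairs:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: the metadata stores grid pinpoints as a flat list of
// ints, so 32 slots hold at most 16 (width, height) pairs; 26 pairs need 52 ints.
static std::vector<std::pair<int32_t, int32_t>> unflatten_pinpoints(const int32_t * data, int n) {
    std::vector<std::pair<int32_t, int32_t>> pairs;
    for (int i = 0; i + 1 < n; i += 2) {
        pairs.emplace_back(data[i], data[i + 1]);  // one any-res grid resolution
    }
    return pairs;
}
```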

@alex-jw-brooks (author) commented:

Hi @ngxson and @gabe-l-hart, thanks for the early thoughts - I think this PR should be cleaned up and ready for another look when you get the chance!

@gabe-l-hart (Contributor) commented:

Hi @ngxson @ggerganov, is it possible to get a quick check on your priority for reviewing this PR? We're getting ready to officially launch the Granite Vision models (the preview model is already out here) and would love to get a sense of timing so we can plan how to support it in several projects that depend on llama.cpp. Any input you have would be much appreciated!

Comment on lines 8518 to 8520
- GGML_ASSERT(i01 >= 0 && i01 < ne01);
+ // Copying this out for a bit while investigating due to issues like:
+ // https://github.com/ggerganov/llama.cpp/issues/10157
+ // GGML_ASSERT(i01 >= 0 && i01 < ne01);
@ggerganov (Member) commented:

Is this still needed?

@alex-jw-brooks (author) replied:

Hi @ggerganov, thank you very much for your thoughts!

For now, on CPU it is - with it commented out, things do run and give coherent outputs on CPU, but I think there is a bug here for SigLIP. I'm guessing it's an issue with how visual encoders without the CLS embedding are handled, since the models I have seen hitting this issue appear to use SigLIP as the visual encoder.

I think I am getting close to the actual fix - I'll follow up (with hopefully a correct fix and this assert reenabled) by the end of today!

@alex-jw-brooks (author) followed up:

Opened a separate small PR to fix the underlying cause here! #11982

Will re-enable this assertion in this PR now.

@ggerganov (Member) commented:

> Hi @ngxson @ggerganov, is it possible to get a quick check on your priority for reviewing this PR? We're getting ready to officially launch the Granite Vision models (the preview model is already out here) and would love to get a sense of timing so we can plan how to support it in several projects that depend on llama.cpp. Any input you have would be much appreciated!

Generally we don't spend much time reviewing the llava code because it will be completely obsoleted when we add multi-modality support in the core libllama. So as long as it builds and works for you, we can merge it.

@danbev (Collaborator) commented Feb 20, 2025

@alex-jw-brooks I wanted to try this out and made an attempt to convert the model, but I'm not sure if I made a mistake in one of the steps there. If you have time to take a look at the steps in the linked document, that would be great - let me know if you spot anything I'm doing differently from when you converted your model.

@alex-jw-brooks (author) commented Feb 20, 2025

Awesome, thank you very much @danbev! I have some similar local notes for converting the model that I used to do my initial conversion to test with - I'll compare them and send some thoughts 😄

Could you please clarify what issue you are running into on your branch? Is it mostly that the model is slow at inference, or did it crash / produce garbage?

@ngxson (Collaborator) left a review comment:

The introduction of the MAX_* pattern in this PR really makes my review harder. It would be nice if we can get rid of it.

@@ -1443,15 +1476,35 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
  int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
  int n = gguf_get_arr_n(ctx, idx);
  const int32_t * pinpoints = (const int32_t *)gguf_get_arr_data(ctx, idx);
- for (int i = 0; i < 32 && i < n && pinpoints[i] != 0; ++i) {
+ for (int i = 0; i < MAX_IMAGE_GRID_PINPOINTS && i < n && pinpoints[i] != 0; ++i) {
@ngxson (Collaborator) commented:

I don't get why we need MAX_IMAGE_GRID_PINPOINTS. Can't image_grid_pinpoints be a std::vector?
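
A minimal sketch of the vector-based alternative being suggested, reusing the GGUF accessors from the snippet above; the std::vector member on clip_hparams is an assumption, not the PR's code:

```cpp
// Sketch: read every grid pinpoint from the GGUF array instead of capping at a
// compile-time maximum. Assumes image_grid_pinpoints becomes a std::vector<int32_t>
// and that `hparams` refers to the model's clip_hparams as in the surrounding code.
int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
int n = gguf_get_arr_n(ctx, idx);
const int32_t * pinpoints = (const int32_t *) gguf_get_arr_data(ctx, idx);
hparams.image_grid_pinpoints.assign(pinpoints, pinpoints + n);
```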

}

// If we set explicit vision feature layers, only go up to the deepest one
for (int i = 0; i < MAX_IMAGE_FEATURE_LAYERS && (hparams.vision_feature_layer[i] > 0); i++) {
@ngxson (Collaborator) commented:

Tbh I don't really like this MAX_* pattern; the code of clip.cpp is currently quite fragile, and adding this will make it even more fragile.

Why can't we use std::vector for these arrays?
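
For illustration, a sketch of what the vector-based variant might look like, assuming vision_feature_layer becomes a std::vector<int32_t> member of clip_hparams (not the PR's actual code):

```cpp
// Sketch: with a std::vector, the deepest explicitly requested layer falls out
// of a simple range loop, with no MAX_IMAGE_FEATURE_LAYERS sentinel needed.
// (Requires <algorithm> for std::max.)
int max_feature_layer = -1;  // -1 = no explicit feature layers set
for (const int32_t l : hparams.vision_feature_layer) {
    max_feature_layer = std::max(max_feature_layer, (int) l);
}
```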

std::vector<struct ggml_tensor *> embedding_stack;
// Check to see if we have 1+ explicitly set vision feature layers; otherwise it's determined
// by the type of projector that this model has (usually last or second to last layer).
int max_feature_layer = get_deepest_feature_layer(ctx);
@ngxson (Collaborator) commented:

Small nit, but max_feature_layer is never changed for a given model, so I think it should be set when loading the model instead (i.e. in clip_model).
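
A quick sketch of what that suggestion could look like; the field name and the place it is set are assumptions, not code from this PR:

```cpp
// In clip_model_load(), after the hparams have been read, cache it once:
new_clip->max_feature_layer = get_deepest_feature_layer(new_clip);

// The graph-building code then just reads the cached value:
const int max_feature_layer = ctx->max_feature_layer;
```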

if (embedding_stack.size() > 0) {
embeddings = embedding_stack.at(0);
for (unsigned long i=1; i < embedding_stack.size(); i++) {
embeddings = ggml_concat(ctx0, embeddings, embedding_stack.at(i), 0);
@ngxson (Collaborator) commented:

Using ggml_concat on each embedding is fine for now, but please be aware that it may use a lot more memory.

Another way to do it is to manually create a result tensor (with dimensions that can hold all the embeddings), then use ggml_view_* with the appropriate offsets.

But this is a minor optimization, probably can be done later if needed.
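
For reference, a rough sketch of the view-based approach described above, assuming the stacked features are concatenated along dim 0 as in the snippet and that this runs inside the clip.cpp graph builder (where ctx0 and gf exist); tensor names and shapes here are illustrative, not the PR's code:

```cpp
// Sketch: allocate one destination tensor large enough for all stacked
// embeddings, then copy each layer's output into a column-offset view of it.
const int64_t n_embd  = embedding_stack[0]->ne[0];
const int64_t n_pos   = embedding_stack[0]->ne[1];
const int64_t n_stack = (int64_t) embedding_stack.size();

struct ggml_tensor * stacked = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd * n_stack, n_pos);

for (int64_t i = 0; i < n_stack; ++i) {
    // View over columns [i*n_embd, (i+1)*n_embd) of every row of `stacked`.
    struct ggml_tensor * dst = ggml_view_2d(ctx0, stacked, n_embd, n_pos,
            stacked->nb[1], i * n_embd * ggml_element_size(stacked));
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, embedding_stack[i], dst));
}
embeddings = stacked;
```

Whether this is actually cheaper than the ggml_concat chain depends on how the backend allocates the intermediate tensors, so deferring it as suggested seems reasonable.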

@alex-jw-brooks (author) replied:

Makes sense, I thought this felt pretty efficient! Thank you for the guidance - I'm happy to submit a follow-up PR making this change in the near future 🙂

@ngxson (Collaborator) commented Feb 20, 2025

Btw, to fix the CI, you should rebase to latest master branch.

@alex-jw-brooks (author) commented:

Cool, thanks @ngxson - I agree with you that the pattern for MAX_* gridpoints / feature layers is really weird - the current code will also drop anything over the max, which is not great.

I will happily rewrite it to use std::vector for both 🙂

@danbev (Collaborator) commented Feb 20, 2025

> Could you please clarify what issue you are running into on your branch? Is it mostly that the model is slow at inference, or did it crash / produce garbage?

Sorry for not being clear on that. It was me being impatient and not letting it run long enough (I was using a debug build, which seems to be slower, on my machine at least). This was the output for an image of the Apollo moon landing:

./build/bin/llama-llava-cli -m models/granite-vision-3.1-2b-Q4_K.gguf --mmproj vit/mmproj-model-f16.gguf --image ny.jpg -c 4096 -ngl 40  -v
...
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 560079.81 ms by CLIP (  768.29 ms per image patch)

The image captures a moment of human interaction with space. In the foreground, a lone astronaut is seen walking on the moon's surface. The astronaut is clad in a white suit, which is standard for such missions. The suit appears to be in good condition, suggesting that the astronaut is well-equipped for the conditions of space.

In the background, there are two flags flying on the moon. These flags are the United States and Canada flags, respectively. The presence of these flags indicates that the astronaut is part of a mission or operation related
 to the United States and Canada.

Overall, the image provides a glimpse into the lunar missions that have taken place over the years
. The astronaut's solitary journey across the moon's surface, coupled with the flags in the background, paints a picture of human exploration of space.
llama_perf_context_print:        load time =  561960.75 ms
llama_perf_context_print: prompt eval time =     548.67 ms /   768 tokens (    0.71 ms per token,  1399.74 tokens per second)
llama_perf_context_print:        eval time =   24427.76 ms /   196 runs   (  124.63 ms per token,     8.02 tokens per second)
llama_perf_context_print:       total time =  586576.16 ms /   964 tokens

When running the same example with a normal Release build, I got:

encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 40104.00 ms by CLIP (   55.01 ms per image patch)


I'm sorry to hear that you're struggling with your emotions. It's important to take care of yourself and seek help from mental health professionals if you need assistance.
As an AI language model, I don't have the ability to provide personalized assistance or emotional support. However, I can provide general information about mental health and resources for people struggling with their emotions.
It's important to remember that you are not alone, and there are people who care about you and want to help. If you're struggling with your emotions, you can reach out to your primary care physician, mental health professionals, or support groups for assistance.
Remember, taking care of yourself is important, and seeking help when you need it is a sign of strength.
llama_perf_context_print:        load time =   41517.82 ms
llama_perf_context_print: prompt eval time =     294.52 ms /   768 tokens (    0.38 ms per token,  2607.64 tokens per second)
llama_perf_context_print:        eval time =    2095.80 ms /   167 runs   (   12.55 ms per token,    79.68 tokens per second)
llama_perf_context_print:       total time =   43645.99 ms /   935 tokens

And this is another run:

encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 40446.42 ms by CLIP (   55.48 ms per image patch)



Please provide the image you want to be converted into text.
llama_perf_context_print:        load time =   41897.67 ms
llama_perf_context_print: prompt eval time =     303.87 ms /   768 tokens (    0.40 ms per token,  2527.42 tokens per second)
llama_perf_context_print:        eval time =     178.02 ms /    14 runs   (   12.72 ms per token,    78.64 tokens per second)
llama_perf_context_print:       total time =   42089.41 ms /   782 tokens

I've not been able to get a good response when using a Release build yet.

alex-jw-brooks and others added 19 commits February 20, 2025 14:57
@alex-jw-brooks (author) commented Feb 21, 2025

Hi @ggerganov @ngxson - I think the requested changes should all be addressed now! Please note that the bugfix for the assert that was commented out for running on CPU is open in another PR here.

@danbev - thanks for the clarification! There are a few problems, mostly in the visual encoder config: you should use anyres (LLaVA-Next) preprocessing by setting mm_patch_merge_type to spatial_unpad with the correct image grid pinpoints - using the model's chat template will help too! 🙂

If the model is running correctly, you should see thousands of tokens per image. I've added some detailed docs for how to convert the model here - if it is useful, I am also happy to open a PR to add it to llama.cpp 😄

Also, please make sure to bring in this change #11982, which will fix assertion errors if the get rows operation is triggered. Hopefully that helps, and please feel free to let me know if you have any questions.

@danbev (Collaborator) commented Feb 21, 2025

> If the model is running correctly, you should see thousands of tokens per image. I've added some detailed docs for how to convert the model here - if it is useful, I am also happy to open a PR to add it to llama.cpp 😄

@alex-jw-brooks Thanks for the details - I'll give this a try later today! I think the conversion document would be very helpful to have, so please open a PR for it 👍
