Add Granite Vision Support #11794

Open
alex-jw-brooks wants to merge 37 commits into master

Conversation

@alex-jw-brooks (Contributor, author) commented Feb 10, 2025

This PR adds support for IBM's Granite Vision models, which are a LLaVA-Next variant that uses multiple feature layers from the (SigLIP) visual encoder.

It's in draft at the moment, as I'm still testing and trying to track down the cause of the failing assertion in ggml-cpu.c, which seems to be the same issue as the one reported here - but any thoughts are very welcome!

Current summary of changes:

  • Adds support for HF-format LLaVA-Next models to the llava surgery v2 scripts
  • Adds SigLIP visual encoder support to the image encoder converter in the llava examples
  • Adds clip.vision.feature_layer to the LLaVA-Next hparams; this has the same meaning as vision_feature_layer in transformers (ref). Note that it currently converts negative values to non-negative ones so that -1 can be used as "unset" in llama.cpp (see the sketch after this list)
  • Updates a couple of settings in llava.cpp / clip.cpp, namely increasing the number of patches to 10 for LLaVA-Next models, and increasing the max number of image grid pinpoints from 32 to 64, since Granite Vision models have ~50
  • Fixes a (potential) bug in LLaVA-Next models (i.e., with a llava projector) - I think the intent of the current behavior is to always use layer -2, which matches the default in transformers and many LLaVA-Next models, but the GGUF path currently seems to end up using -3
  • Adds guidance to the README for how to load and export the LLM off of the LLaVA-Next model in transformers so that it can be converted with the normal HF -> GGUF conversion script; this is a workaround for encapsulated LLMs that may not be compatible with the legacy llama converter, e.g., Granite LLMs
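
For context on the feature-layer conversion bullet above, here is a minimal sketch of the kind of index normalization being described, assuming Python-style negative indexing over the encoder's hidden layers (the exact offset convention used by the conversion script may differ):

```cpp
// Hypothetical helper, not the PR's code: map a transformers-style
// vision_feature_layer (possibly negative, e.g. -2 = second-to-last layer)
// to a non-negative index, so that -1 can be reserved to mean "unset".
static int normalize_feature_layer(int layer, int num_hidden_layers) {
    if (layer < 0) {
        layer += num_hidden_layers;  // e.g. with 24 hidden layers: -2 -> 22
    }
    return layer;
}
```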

@github-actions bot added the examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 10, 2025
  fout.add_uint32(k(KEY_ATTENTION_HEAD_COUNT, VISION), v_hparams["num_attention_heads"])
  fout.add_float32(k(KEY_ATTENTION_LAYERNORM_EPS, VISION), v_hparams["layer_norm_eps"])
- block_count = v_hparams["num_hidden_layers"] - 1 if has_llava_projector else v_hparams["num_hidden_layers"]
+ block_count = v_hparams["num_hidden_layers"]
@alex-jw-brooks (author) commented Feb 10, 2025:

This is the part I'm not quite sure about with the current code; it seems like the intention behind this, plus the way the layer loop in clip.cpp (here) is written, is to go up to the second-to-last layer, which is consistent with the default feature layer -2 in transformers and a lot of LLaVA-Next models, but it seems like this actually results in -3, unless I am misunderstanding something.

More concretely, say v_hparams["num_hidden_layers"] is 24 here: if there's a llava projector, we set the block count to 23 and pop the last layer from the model a bit further down. Then, in clip.cpp, the encoder layer loop goes up to n_layer - 1, but since that value is already decremented here, the loop runs while x < 22, i.e., it takes the output of block 21 (-3).

Since this PR needs to be able to take features from the last layer, I've updated it to not drop the last layer and to iterate up to the max vision feature layer that is set, or the second-to-last if none is set - but please feel free to let me know if this seems incorrect!
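
To make the counting above concrete, a small illustration of the loop bound being described (numbers only; this is not the actual clip.cpp code):

```cpp
// Illustration only, assuming 24 hidden layers in the visual encoder.
const int num_hidden_layers = 24;
const int n_layer = num_hidden_layers - 1;  // 23: already decremented at conversion time

for (int il = 0; il < n_layer - 1; il++) {  // il runs 0..21
    // cur = encoder_block(il, cur);
}
// The last iteration uses block 21, i.e. the third-from-last layer (-3),
// rather than the second-to-last layer (-2) that transformers defaults to.
```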

@ngxson (Collaborator) commented Feb 10, 2025:

The code you mentioned is pretty much spaghetti; you can refer to #11292, where I refactored that part.

> Since this PR needs to be able to take features from the last layer

Funny enough, someone deleted the comment explaining how to get the last layer. You can see it in this version: https://github.com/ggerganov/llama.cpp/blob/d408bb9268a988c5a60a5746d3a6430386e7604d/examples/llava/clip.cpp#L734

It would be nice if you could bring back the comment (or better, replace it with an explanation).

@alex-jw-brooks (author) replied:

Awesome, thanks a lot for the context and guidance @ngxson! That looks great - it is exciting to see things potentially moving away from the surgery scripts, with all of the vision hparams being handled the same as everything else via the gguf writer 🎉

I pulled the bit that determines which layer to iterate up to out into a separate function and added some explanation here, and rewrote the relevant parts of this PR to avoid changing the feature layer used for existing models 🙂

@@ -444,8 +445,9 @@ struct clip_hparams {

char mm_patch_merge_type[32] = "flat"; // spatial_unpad or flat (default)

- int32_t image_grid_pinpoints[32];
+ int32_t image_grid_pinpoints[64];
A contributor commented:

I know you mentioned this fixed size was potentially problematic. Can you add a comment explaining the size choice and limitations?

@alex-jw-brooks (author) replied:

Yup - I moved this and the max feature layers to be consts and added a comment up there.

This value is the size of the flattened list of any-res grid pinpoints that is supported (i.e., 32 values / 2 -> currently 16 ordered pairs). Granite Vision models have more than this - e.g., the preview model has 26 pairs - so I increased the size to prevent the extra pinpoints from being dropped!
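
As a side note on the numbers above, a small hypothetical sketch (not part of this PR) of how the flattened pinpoint list maps to (width, height) pairs:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: the metadata stores grid pinpoints as a flat list of
// ints, so 32 slots hold at most 16 (width, height) pairs; 26 pairs need 52 ints.
static std::vector<std::pair<int32_t, int32_t>> unflatten_pinpoints(const int32_t * data, int n) {
    std::vector<std::pair<int32_t, int32_t>> pairs;
    for (int i = 0; i + 1 < n; i += 2) {
        pairs.emplace_back(data[i], data[i + 1]);  // one any-res grid resolution
    }
    return pairs;
}
```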

@alex-jw-brooks (author) commented:

Hi @ngxson and @gabe-l-hart, thanks for the early thoughts - I think this PR should be cleaned up and ready for another look when you get the chance!

@gabe-l-hart (Contributor) commented:

Hi @ngxson @ggerganov, is it possible to get a quick check on your priority for reviewing this PR? We're getting ready to officially launch the Granite Vision models (the preview model is already out here) and would love to get a sense of timing so we can plan how to support it in several projects that depend on llama.cpp. Any input you have would be much appreciated!

Comment on lines 8518 to 8520
- GGML_ASSERT(i01 >= 0 && i01 < ne01);
+ // Copying this out for a bit while investigating due to issues like:
+ // https://github.com/ggerganov/llama.cpp/issues/10157
+ // GGML_ASSERT(i01 >= 0 && i01 < ne01);
@ggerganov (Member) commented:

Is this still needed?

@alex-jw-brooks (author) replied:

Hi @ggerganov, thank you very much for your thoughts!

For now, on CPU it is - with it commented out, things do run and give coherent outputs on CPU, but I think there is a bug here for SigLIP. I'm guessing it's an issue with how visual encoders without the CLS embedding are handled, since the models I have seen hitting this issue appear to use SigLIP as the visual encoder.

I think I am getting close to the actual fix - I'll follow up (with hopefully a correct fix and this assert reenabled) by the end of today!

@alex-jw-brooks (author) followed up:

Opened a separate small PR to fix the underlying cause here! #11982

Will re-enable this assertion in this PR now.

@ggerganov (Member) commented:

> Hi @ngxson @ggerganov, is it possible to get a quick check on your priority for reviewing this PR? We're getting ready to officially launch the Granite Vision models (the preview model is already out here) and would love to get a sense of timing so we can plan how to support it in several projects that depend on llama.cpp. Any input you have would be much appreciated!

Generally we don't spend much time reviewing the llava code because it will be completely obsoleted when we add multi-modality support in the core libllama. So as long as it builds and works for you, we can merge it.

@danbev (Collaborator) commented Feb 20, 2025

@alex-jw-brooks I wanted to try this out and made an attempt to convert the model, but I'm not sure if I made a mistake in one of the steps there. If you have time to take a look at the steps in the linked document, that would be great - let me know if you spot anything I'm doing differently from when you converted your model.

@alex-jw-brooks (author) commented Feb 20, 2025

Awesome, thank you very much @danbev! I have some similar local notes for converting the model that I used to do my initial conversion to test with - I'll compare them and send some thoughts 😄

Could you please clarify what issue you are running into on your branch? Is it mostly that the model is slow at inference, or did it crash / produce garbage?

@ngxson (Collaborator) left a review comment:

The introduction of the MAX_* pattern in this PR really makes my review harder. It would be nice if we can get rid of it.

@@ -1443,15 +1476,35 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
  int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
  int n = gguf_get_arr_n(ctx, idx);
  const int32_t * pinpoints = (const int32_t *)gguf_get_arr_data(ctx, idx);
- for (int i = 0; i < 32 && i < n && pinpoints[i] != 0; ++i) {
+ for (int i = 0; i < MAX_IMAGE_GRID_PINPOINTS && i < n && pinpoints[i] != 0; ++i) {
@ngxson (Collaborator) commented:

I don't get why we need MAX_IMAGE_GRID_PINPOINTS. Can't image_grid_pinpoints be a std::vector?
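
A minimal sketch of the vector-based alternative being suggested, reusing the GGUF accessors from the snippet above; the std::vector member on clip_hparams is an assumption, not the PR's code:

```cpp
// Sketch: read every grid pinpoint from the GGUF array instead of capping at a
// compile-time maximum. Assumes image_grid_pinpoints becomes a std::vector<int32_t>
// and that `hparams` refers to the model's clip_hparams as in the surrounding code.
int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
int n = gguf_get_arr_n(ctx, idx);
const int32_t * pinpoints = (const int32_t *) gguf_get_arr_data(ctx, idx);
hparams.image_grid_pinpoints.assign(pinpoints, pinpoints + n);
```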

}

// If we set explicit vision feature layers, only go up to the deepest one
for (int i = 0; i < MAX_IMAGE_FEATURE_LAYERS && (hparams.vision_feature_layer[i] > 0); i++) {
@ngxson (Collaborator) commented:

Tbh I don't really like this MAX_* pattern; the code of clip.cpp is currently quite fragile, and adding this will make it even more fragile.

Why can't we use std::vector for these arrays?
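
For illustration, a sketch of what the vector-based variant might look like, assuming vision_feature_layer becomes a std::vector<int32_t> member of clip_hparams (not the PR's actual code):

```cpp
// Sketch: with a std::vector, the deepest explicitly requested layer falls out
// of a simple range loop, with no MAX_IMAGE_FEATURE_LAYERS sentinel needed.
// (Requires <algorithm> for std::max.)
int max_feature_layer = -1;  // -1 = no explicit feature layers set
for (const int32_t l : hparams.vision_feature_layer) {
    max_feature_layer = std::max(max_feature_layer, (int) l);
}
```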

std::vector<struct ggml_tensor *> embedding_stack;
// Check to see if we have 1+ explicitly set vision feature layers; otherwise it's determined
// by the type of projector that this model has (usually last or second to last layer).
int max_feature_layer = get_deepest_feature_layer(ctx);
@ngxson (Collaborator) commented:

Small nit, but max_feature_layer is never changed for a given model, so I think it should be set when loading the model instead (i.e. in clip_model).
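
A quick sketch of what that suggestion could look like; the field name and the place it is set are assumptions, not code from this PR:

```cpp
// In clip_model_load(), after the hparams have been read, cache it once:
new_clip->max_feature_layer = get_deepest_feature_layer(new_clip);

// The graph-building code then just reads the cached value:
const int max_feature_layer = ctx->max_feature_layer;
```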

if (embedding_stack.size() > 0) {
embeddings = embedding_stack.at(0);
for (unsigned long i=1; i < embedding_stack.size(); i++) {
embeddings = ggml_concat(ctx0, embeddings, embedding_stack.at(i), 0);
@ngxson (Collaborator) commented:

Using ggml_concat on each embedding is fine for now, but please be aware that it may use a lot more memory.

Another way to do it is to manually create a result tensor (with dimensions that can hold all the embeddings), then use ggml_view_* with the appropriate offsets.

But this is a minor optimization, probably can be done later if needed.
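
For reference, a rough sketch of the view-based approach described above, assuming the stacked features are concatenated along dim 0 as in the snippet and that this runs inside the clip.cpp graph builder (where ctx0 and gf exist); tensor names and shapes here are illustrative, not the PR's code:

```cpp
// Sketch: allocate one destination tensor large enough for all stacked
// embeddings, then copy each layer's output into a column-offset view of it.
const int64_t n_embd  = embedding_stack[0]->ne[0];
const int64_t n_pos   = embedding_stack[0]->ne[1];
const int64_t n_stack = (int64_t) embedding_stack.size();

struct ggml_tensor * stacked = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd * n_stack, n_pos);

for (int64_t i = 0; i < n_stack; ++i) {
    // View over columns [i*n_embd, (i+1)*n_embd) of every row of `stacked`.
    struct ggml_tensor * dst = ggml_view_2d(ctx0, stacked, n_embd, n_pos,
            stacked->nb[1], i * n_embd * ggml_element_size(stacked));
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, embedding_stack[i], dst));
}
embeddings = stacked;
```

Whether this is actually cheaper than the ggml_concat chain depends on how the backend allocates the intermediate tensors, so deferring it as suggested seems reasonable.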

@alex-jw-brooks (author) replied:

Makes sense, I thought this felt pretty efficient! Thank you for the guidance - I'm happy to submit a follow-up PR making this change in the near future 🙂

@ngxson (Collaborator) commented Feb 20, 2025

Btw, to fix the CI, you should rebase to latest master branch.

@alex-jw-brooks (author) commented:

Cool, thanks @ngxson - I agree with you that the pattern for MAX_* gridpoints / feature layers is really weird - the current code will also drop anything over the max, which is not great.

I will happily rewrite it to use std::vector for both 🙂

@danbev (Collaborator) commented Feb 20, 2025

> Could you please clarify what issue you are running into on your branch? Is it mostly that the model is slow at inference, or did it crash / produce garbage?

Sorry for not being clear on that. It was me being impatient and not letting it run long enough (I was using a debug build, which seems to be slower, on my machine at least). This was the output for an image of the Apollo moon landing:

./build/bin/llama-llava-cli -m models/granite-vision-3.1-2b-Q4_K.gguf --mmproj vit/mmproj-model-f16.gguf --image ny.jpg -c 4096 -ngl 40  -v
...
encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 560079.81 ms by CLIP (  768.29 ms per image patch)

The image captures a moment of human interaction with space. In the foreground, a lone astronaut is seen walking on the moon's surface. The astronaut is clad in a white suit, which is standard for such missions. The suit appears to be in good condition, suggesting that the astronaut is well-equipped for the conditions of space.

In the background, there are two flags flying on the moon. These flags are the United States and Canada flags, respectively. The presence of these flags indicates that the astronaut is part of a mission or operation related
 to the United States and Canada.

Overall, the image provides a glimpse into the lunar missions that have taken place over the years
. The astronaut's solitary journey across the moon's surface, coupled with the flags in the background, paints a picture of human exploration of space.
llama_perf_context_print:        load time =  561960.75 ms
llama_perf_context_print: prompt eval time =     548.67 ms /   768 tokens (    0.71 ms per token,  1399.74 tokens per second)
llama_perf_context_print:        eval time =   24427.76 ms /   196 runs   (  124.63 ms per token,     8.02 tokens per second)
llama_perf_context_print:       total time =  586576.16 ms /   964 tokens

When running the same example with a normal Release build, I got:

encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 40104.00 ms by CLIP (   55.01 ms per image patch)


I'm sorry to hear that you're struggling with your emotions. It's important to take care of yourself and seek help from mental health professionals if you need assistance.
As an AI language model, I don't have the ability to provide personalized assistance or emotional support. However, I can provide general information about mental health and resources for people struggling with their emotions.
It's important to remember that you are not alone, and there are people who care about you and want to help. If you're struggling with your emotions, you can reach out to your primary care physician, mental health professionals, or support groups for assistance.
Remember, taking care of yourself is important, and seeking help when you need it is a sign of strength.
llama_perf_context_print:        load time =   41517.82 ms
llama_perf_context_print: prompt eval time =     294.52 ms /   768 tokens (    0.38 ms per token,  2607.64 tokens per second)
llama_perf_context_print:        eval time =    2095.80 ms /   167 runs   (   12.55 ms per token,    79.68 tokens per second)
llama_perf_context_print:       total time =   43645.99 ms /   935 tokens

And this is another run:

encode_image_with_clip: image embedding created: 729 tokens

encode_image_with_clip: image encoded in 40446.42 ms by CLIP (   55.48 ms per image patch)



Please provide the image you want to be converted into text.
llama_perf_context_print:        load time =   41897.67 ms
llama_perf_context_print: prompt eval time =     303.87 ms /   768 tokens (    0.40 ms per token,  2527.42 tokens per second)
llama_perf_context_print:        eval time =     178.02 ms /    14 runs   (   12.72 ms per token,    78.64 tokens per second)
llama_perf_context_print:       total time =   42089.41 ms /   782 tokens

I've not been able to get a good response when using a Release build yet.

alex-jw-brooks and others added 19 commits February 20, 2025 14:57
@alex-jw-brooks (author) commented Feb 21, 2025

Hi @ggerganov @ngxson - I think the requested changes should all be addressed now! Please note that the bugfix for the assert that was commented out for running on CPU is open in another PR here.

@danbev - thanks for the clarification! There are a few problems, mostly in the visual encoder config: you should use anyres (LLaVA-Next) preprocessing by setting mm_patch_merge_type to spatial_unpad with the correct image grid pinpoints - using the model's chat template will help too! 🙂

If the model is running correctly, you should see thousands of tokens per image. I've added some detailed docs for how to convert the model here - if it is useful, I am also happy to open a PR to add it to llama.cpp 😄

Also, please make sure to bring in this change #11982, which will fix assertion errors if the get rows operation is triggered. Hopefully that helps, and please feel free to let me know if you have any questions.

@danbev (Collaborator) commented Feb 21, 2025

> If the model is running correctly, you should see thousands of tokens per image. I've added some detailed docs for how to convert the model here - if it is useful, I am also happy to open a PR to add it to llama.cpp 😄

@alex-jw-brooks Thanks for the details - I'll give this a try later today! I think the conversion document would be very helpful to have, so please open a PR for it 👍
