
Conversation

gabe-l-hart (Collaborator)

NOTES

This PR replaces #16112 since that one was based on a branch in the core repo and I no longer have push access to the branch after #16113.

closes #16110

Description

This PR adds model support for https://huggingface.co/ibm-granite/granite-docling-258M. There are two key parts to the support:

  1. Conversion support: Changes to convert_hf_to_gguf.py and the adjacent scripts/libraries to support the model
  2. Overhaul of preprocessing support for idefics3

Details

The current code simply resizes the input to a square with padding, which is very different from the preprocessing implementation in transformers. From what I can tell, support for the more nuanced tile-based preprocessing was lost during the transition from llava to mtmd.

The logic in transformers is as follows (a rough sketch of the tiling math appears right after the list):

  1. Resize the image so that the longest side is no larger than preprocessor_config.size.longest_edge (if do_resize here)
  2. Resize and reshape the image so that both sides are an even multiple of image_size (if do_image_splitting here)
  3. Create an N x M grid of images sized image_size x image_size where N = width / image_size and M = height / image_size
  4. The input image is resized to image_size x image_size and added to the list of patches as the <global-img>
  5. Each image in the grid is concatenated with <fake_token_around_image><row_R_col_C>{image tokens}
    1. If the image is the last in the row, a \n is added to the end
  6. The global image is added with \n<fake_token_around_image><global-img>{image tokens}<fake_token_around_image>
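
To make the tiling arithmetic concrete, here is a minimal C++ sketch of steps 1-3. It assumes longest_edge and image_size come straight from preprocessor_config; the struct and function names are illustrative, not the actual clip.cpp implementation.

#include <algorithm>
#include <cmath>

// Hypothetical helper: computes the scaled size and grid shape for steps 1-3.
struct grid_plan {
    int scaled_w, scaled_h; // resized dims, exact multiples of image_size
    int n_cols, n_rows;     // grid shape
};

static grid_plan plan_idefics3_grid(int w, int h, int longest_edge, int image_size) {
    // Step 1: shrink so the longest side is no larger than longest_edge
    if (std::max(w, h) > longest_edge) {
        const float scale = (float) longest_edge / (float) std::max(w, h);
        w = (int) std::round(w * scale);
        h = (int) std::round(h * scale);
    }
    // Step 2: round both sides up to the nearest multiple of image_size
    grid_plan p;
    p.scaled_w = ((w + image_size - 1) / image_size) * image_size;
    p.scaled_h = ((h + image_size - 1) / image_size) * image_size;
    // Step 3: one image_size x image_size tile per grid cell
    p.n_cols = p.scaled_w / image_size;
    p.n_rows = p.scaled_h / image_size;
    return p;
}

An image that ends up spanning two tiles horizontally and one vertically, for example, would produce the <row_1_col_1> and <row_1_col_2> slices followed by the <global-img> overview.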

Open Questions

In transformers, the do_resize and do_image_splitting conditionals take their default values from preprocessor_config and can be overridden with kwargs at invocation time. For both Granite Docling and SmolVLM, both default to true, and the models give demonstrably bad results if the above tiling scheme is not followed. As such, I've opted not to make these configurable either in the hparams or as input to mtmd_encode. An alternative approach would be to decouple the preprocessing schemes from the models, giving each model a default scheme while allowing alternates to be chosen at runtime. Given the poor results with the wrong preprocessing, though, I don't think this is worth the effort.

Testing

# Convert language model
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/

# Convert MMProj
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/ --mmproj

# Run a sample
./bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf --image ~/Pictures/sample-doc.png --mmproj ~/models/ibm-granite/mmproj-granite-docling-258M -p "<__media__>Convert this page to markdown." --verbose -ngl 99

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…icing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
github-actions bot added the examples and python (python script changes) labels on Sep 23, 2025
LLAMA_VOCAB_PRE_TYPE_KIMI_K2 = 37,
LLAMA_VOCAB_PRE_TYPE_HUNYUAN_DENSE = 38,
LLAMA_VOCAB_PRE_TYPE_GROK_2 = 39,
LLAMA_VOCAB_PRE_TYPE_GRANITE_DOCLING = 40,
gabe-l-hart (Collaborator Author)

Us and our long names 😞. I assume vertical alignment is worth preserving, but I'm happy not to touch the other lines if preferred.


// vision-specific
#define KEY_IMAGE_SIZE "clip.vision.image_size"
#define KEY_PREPROC_IMAGE_SIZE "clip.vision.preproc_image_size"
gabe-l-hart (Collaborator Author)

I wasn't totally sure of the right name for this one since it comes from preprocessor_config.size.longest_edge. This seemed appropriate since it's the size of the image for the first step of preprocessing, but I'm open to other names too (preproc_longest_edge?).

// multiples of image_size (always rounding up)
//
// CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/image_processing_idefics3.py#L737
const float scale = std::min(
gabe-l-hart (Collaborator Author)

This logic is similar to llava_uhd::get_slice_instructions, but it's different enough that I opted not to put it there, for simplicity. I could certainly move it if that would be better, though.

if (clip_is_llava(ctx_clip)
|| clip_is_minicpmv(ctx_clip)
|| clip_is_glm(ctx_clip)
|| clip_is_idefics3(ctx_clip)) {
gabe-l-hart (Collaborator Author)

I wasn't fully clear on whether it was correct to limit this to non-batch encoding, but since the other tile-based models fall into this category, I figured it was. If we don't need this, we could also remove clip_is_idefics3.

gabe-l-hart (Collaborator Author)

I tried removing this and things appear to work, so I've removed this part. I haven't tested robustly for concurrent inference, though.

llama_token tok_sli_img_end = LLAMA_TOKEN_NULL; // single slice end
llama_token tok_sli_img_mid = LLAMA_TOKEN_NULL; // between 2 slices
llama_token tok_row_end = LLAMA_TOKEN_NULL; // end of row
std::vector<llama_token> tok_ov_img_start; // overview image
gabe-l-hart (Collaborator Author)

The overview image prefix is multiple tokens long for idefics3, and it seemed likely that other models might need the same for other delimiter tokens. Since they all get converted to std::vector<llama_token> to pass to add_text anyway, I figured making all of them std::vector<llama_token> was more extensible. It also avoids the need to check the sentinel value LLAMA_TOKEN_NULL.
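
As a toy illustration of that shape change (llama_token and add_text here are stand-ins, not the real mtmd API), an empty vector naturally means "no delimiter", so no sentinel check is needed:

#include <cstdint>
#include <vector>

using llama_token = std::int32_t; // stand-in for the real typedef

// stub in the same spirit as the tokenizer's add_text: append tokens to the prompt
static void add_text(std::vector<llama_token> & prompt, const std::vector<llama_token> & toks) {
    prompt.insert(prompt.end(), toks.begin(), toks.end());
}

int main() {
    std::vector<llama_token> prompt;

    std::vector<llama_token> tok_row_end      = { 101 };           // single-token case: one-element vector
    std::vector<llama_token> tok_ov_img_start = { 102, 103, 104 }; // multi-token prefix for idefics3
    std::vector<llama_token> tok_sli_img_mid;                      // unused delimiter: empty vector adds nothing

    add_text(prompt, tok_ov_img_start);
    add_text(prompt, tok_sli_img_mid); // no LLAMA_TOKEN_NULL sentinel check needed
    add_text(prompt, tok_row_end);
    return 0;
}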

bool use_mrope = false; // for Qwen2VL, we need to use M-RoPE

// string template for slice image delimiters with row/col (idefics3)
std::string sli_img_start_tmpl;
gabe-l-hart (Collaborator Author)

As written, there was no way to create the <row_R_col_C> delimiters, since the row and col values need to be passed in at runtime. I opted for a basic printf-style template here. An alternative approach would be a nullable function handle implemented with a lambda per model. That might be slightly more efficient by avoiding the double snprintf below, but I'm not sure how that would trade off against the overhead of managing the function pointer and using std::string concatenation.
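
A minimal sketch of that double snprintf, assuming a template string shaped like the idefics3 delimiters (the exact template text and the helper name are illustrative):

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: expand a printf-style slice-start template with 1-based row/col.
static std::string format_slice_start(const std::string & tmpl, int row, int col) {
    const int n = std::snprintf(nullptr, 0, tmpl.c_str(), row, col); // first call: measure
    std::vector<char> buf(n + 1);
    std::snprintf(buf.data(), buf.size(), tmpl.c_str(), row, col);   // second call: write
    return std::string(buf.data(), n);
}

int main() {
    // assumed template shape, not necessarily the exact string used in mtmd
    const std::string sli_img_start_tmpl = "<fake_token_around_image><row_%d_col_%d>";
    std::printf("%s\n", format_slice_start(sli_img_start_tmpl, 1, 2).c_str());
    return 0;
}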

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master: (39 commits)
ci : disable AMD workflows + update NVIDIA workflows (ggml-org#16200)
ci : enable Vulkan workflow on Mac (ggml-org#16194)
ggml-cpu: Respect cpumask settings (ggml-org#16164)
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (ggml-org#15928)
zdnn: refactor codebase + add docs (ggml-org#16178)
codeowners : add @danbev to model-conversion example [no ci] (ggml-org#16190)
devops: add s390x containers (ggml-org#15915)
ggml-cpu : fix typo in gemm comments [no ci] (ggml-org#16189)
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (ggml-org#16177)
clang-tidy : disable warning about performance enum size (ggml-org#16127)
ggml : implement set_rows with i32 index (ggml-org#16159)
codeowners : update + cleanup (ggml-org#16174)
common : enable `--offline` mode without curl support (ggml-org#16137)
webui : fix handling incomplete chunks (ggml-org#16107)
embedding : fix typos in README (ggml-org#16171)
common : remove unused local variables (ggml-org#16140)
ggml : extend ggml_can_fuse to work with non-sequential nodes (ggml-org#16123)
ggml : add ggml_op_is_empty (ggml-org#16122)
codeowners : update ownership for @ngxson and @allozuar (ggml-org#16128)
Vulkan: add conv_transpose_2d operation (ggml-org#16022)
...
CISC (Collaborator) commented Sep 23, 2025

Generally LGTM, but I'll have to defer to @ngxson for mtmd.

gabe-l-hart (Collaborator Author)

@ngxson A gentle nudge on reviewing this (without disrupting your vacation, of course!). The model has been sitting in the top-ten trending on HF (hitting #1 a few days ago), so there's a lot of community interest in getting this fully supported everywhere, especially for embedded inference with llama.cpp-based platforms.

* origin/master: (124 commits)
metal : fix loop bound in ggml_mem_ranges (ggml-org#16412)
llama : fix shapes for bert/mpt q/k norm (ggml-org#16409)
ggml : fix graph reallocation with multiple chunks (ggml-org#16396)
Fix missing messages on sibling navigation (ggml-org#16408)
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (ggml-org#16354)
vulkan: Fix FA coopmat1 invalid array indexing (ggml-org#16365)
ci : change macos-13 to macos-15-intel (ggml-org#16401)
Capture model name only after first token (streaming) or completed request (ggml-org#16405)
vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (ggml-org#16316)
webui : Fix messages payload sent to chat completions (ggml-org#16402)
fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (ggml-org#16356)
test-barrier : do not use more threads than physically available (ggml-org#16389)
ggml webgpu: add support for soft_max, optimize rms_norm (ggml-org#16357)
model : Apertus model implementation (ggml-org#15852)
musa: update compile flags (ggml-org#16265)
ci : fix ubuntu-latest-cmake-rpc (disable ccache) (ggml-org#16388)
ci: update vulkan ci (ggml-org#16294)
ci : fix clean-up of old logs (ggml-org#16381)
SYCL: Update to oneAPI 2025.2 (ggml-org#16371)
HIP: add IMbackK to codeowner (ggml-org#16375)
...
Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart (Collaborator Author)

@ngxson Thanks for the review! I've addressed the comments.

Strangely, I'm now consistently seeing generation fail to terminate after an image. Applying these same fixes to SmolVLM produces much better results than without them and does not fail to terminate, so there seems to be something about granite-docling specifically causing this.

ngxson (Collaborator) commented Oct 3, 2025

Considering the changes don't touch token placement, it's quite surprising that we get such an error. Did you also try with top_k=1?

I haven't examined your code closely, but maybe it's worth comparing its results against calc_size_preserved_ratio; perhaps I missed something.

gabe-l-hart (Collaborator Author)

Ah, my comment was confusing. I tried at several points in history (before these review refactors, and before the most recent merge of master) and they all show the failure to stop, so I think it's something else. I was just surprised since I swear I wasn't seeing this when I was working on it a few weeks ago.

ngxson (Collaborator) left a comment

After merging with latest master, the test model still runs well, so this should be good to merge

Any other problems can be addressed in follow-up PRs. Thanks @gabe-l-hart and the IBM team for the awesome model!


Full test result:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S

CISC merged commit ca71fb9 into ggml-org:master on Oct 5, 2025
72 of 75 checks passed
gabe-l-hart (Collaborator Author)

Awesome, thanks for the review! I'll look into the stopping issue soon.

gabe-l-hart deleted the GraniteDocling branch on October 6, 2025 at 05:07
gabe-l-hart (Collaborator Author)

Follow up created to address stopping issue: #16438

CISC added the model (Model specific) and hot (Something that is hot) labels on Oct 7, 2025