
Conversation

gabe-l-hart (Collaborator)

NOTES

This PR replaces #16112 since that one was based on a branch in the core repo and I no longer have push access to the branch after #16113.

closes #16110

Description

This PR adds model support for https://huggingface.co/ibm-granite/granite-docling-258M. There are two key parts to the support:

  1. Conversion support: Changes to convert_hf_to_gguf.py and the adjacent scripts/libraries to support the model
  2. Overhaul of preprocessing support for idefics3

Details

The current code simply resizes the input to a square with padding, which is very different from the preprocessing implementation in transformers. From what I can tell, support for the more nuanced tile-based preprocessing was lost during the transition from llava to mtmd.

The logic in transformers is as follows (a rough sketch of the tiling math appears right after the list):

  1. Resize the image so that the longest side is no larger than preprocessor_config.size.longest_edge (if do_resize here)
  2. Resize and reshape the image so that both sides are an even multiple of image_size (if do_image_splitting here)
  3. Create an N x M grid of images sized image_size x image_size where N = width / image_size and M = height / image_size
  4. The input image is resized to image_size x image_size and added to the list of patches as the <global-img>
  5. Each image in the grid is concatenated with <fake_token_around_image><row_R_col_C>{image tokens}
    1. If the image is the last in the row, a \n is added to the end
  6. The global image is added with \n<fake_token_around_image><global-img>{image tokens}<fake_token_around_image>
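
To make the tiling arithmetic concrete, here is a minimal C++ sketch of steps 1-3. It assumes longest_edge and image_size come straight from preprocessor_config; the struct and function names are illustrative, not the actual clip.cpp implementation.

#include <algorithm>
#include <cmath>

// Hypothetical helper: computes the scaled size and grid shape for steps 1-3.
struct grid_plan {
    int scaled_w, scaled_h; // resized dims, exact multiples of image_size
    int n_cols, n_rows;     // grid shape
};

static grid_plan plan_idefics3_grid(int w, int h, int longest_edge, int image_size) {
    // Step 1: shrink so the longest side is no larger than longest_edge
    if (std::max(w, h) > longest_edge) {
        const float scale = (float) longest_edge / (float) std::max(w, h);
        w = (int) std::round(w * scale);
        h = (int) std::round(h * scale);
    }
    // Step 2: round both sides up to the nearest multiple of image_size
    grid_plan p;
    p.scaled_w = ((w + image_size - 1) / image_size) * image_size;
    p.scaled_h = ((h + image_size - 1) / image_size) * image_size;
    // Step 3: one image_size x image_size tile per grid cell
    p.n_cols = p.scaled_w / image_size;
    p.n_rows = p.scaled_h / image_size;
    return p;
}

An image that ends up spanning two tiles horizontally and one vertically, for example, would produce the <row_1_col_1> and <row_1_col_2> slices followed by the <global-img> overview.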

Open Questions

In transformers, the do_resize and do_image_splitting conditionals take their default values from preprocessor_config and can be overridden with kwargs at invocation time. For both Granite Docling and SmolVLM, both default to true, and the models give demonstrably bad results if the above tiling scheme is not followed. As such, I've opted not to make these configurable either in the hparams or as input to mtmd_encode. An alternative approach would be to decouple the preprocessing schemes from the models, giving each model a default scheme while allowing alternates to be chosen at runtime. Given the poor results with the wrong preprocessing, though, I don't think this is worth the effort.

Testing

# Convert language model
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/

# Convert MMProj
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/ --mmproj

# Run a sample
./bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf --image ~/Pictures/sample-doc.png --mmproj ~/models/ibm-granite/mmproj-granite-docling-258M -p "<__media__>Convert this page to markdown." --verbose -ngl 99

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…icing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
github-actions bot added the examples and python (python script changes) labels on Sep 23, 2025
LLAMA_VOCAB_PRE_TYPE_KIMI_K2 = 37,
LLAMA_VOCAB_PRE_TYPE_HUNYUAN_DENSE = 38,
LLAMA_VOCAB_PRE_TYPE_GROK_2 = 39,
LLAMA_VOCAB_PRE_TYPE_GRANITE_DOCLING = 40,
gabe-l-hart (Collaborator Author)

Us and our long names 😞. I assume vertical alignment is worth preserving, but I'm happy not to touch the other lines if preferred.


// vision-specific
#define KEY_IMAGE_SIZE "clip.vision.image_size"
#define KEY_PREPROC_IMAGE_SIZE "clip.vision.preproc_image_size"
gabe-l-hart (Collaborator Author)

I wasn't totally sure of the right name for this one since it comes from preprocessor_config.size.longest_edge. This seemed appropriate since it's the size of the image for the first step of preprocessing, but I'm open to other names too (preproc_longest_edge?).

// multiples of image_size (always rounding up)
//
// CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/image_processing_idefics3.py#L737
const float scale = std::min(
gabe-l-hart (Collaborator Author)

This logic is similar to llava_uhd::get_slice_instructions, but it's different enough that I opted not to put it there, for simplicity. I could certainly move it if that would be better, though.

if (clip_is_llava(ctx_clip)
|| clip_is_minicpmv(ctx_clip)
|| clip_is_glm(ctx_clip)
|| clip_is_idefics3(ctx_clip)) {
gabe-l-hart (Collaborator Author)

I wasn't fully clear on whether it was correct to limit this to non-batch encoding, but since the other tile-based models fall into this category, I figured it was. If we don't need this, we could also remove clip_is_idefics3.

gabe-l-hart (Collaborator Author)

I tried removing this and things appear to work, so I've removed this part. I haven't tested robustly for concurrent inference, though.

llama_token tok_sli_img_end = LLAMA_TOKEN_NULL; // single slice end
llama_token tok_sli_img_mid = LLAMA_TOKEN_NULL; // between 2 slices
llama_token tok_row_end = LLAMA_TOKEN_NULL; // end of row
std::vector<llama_token> tok_ov_img_start; // overview image
gabe-l-hart (Collaborator Author)

The overview image prefix is multiple tokens long for idefics3, and it seemed likely that other models might need the same for other delimiter tokens. Since they all get converted to std::vector<llama_token> to pass to add_text anyway, I figured making all of them std::vector<llama_token> was more extensible. It also avoids the need to check the sentinel value LLAMA_TOKEN_NULL.
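
As a toy illustration of that shape change (llama_token and add_text here are stand-ins, not the real mtmd API), an empty vector naturally means "no delimiter", so no sentinel check is needed:

#include <cstdint>
#include <vector>

using llama_token = std::int32_t; // stand-in for the real typedef

// stub in the same spirit as the tokenizer's add_text: append tokens to the prompt
static void add_text(std::vector<llama_token> & prompt, const std::vector<llama_token> & toks) {
    prompt.insert(prompt.end(), toks.begin(), toks.end());
}

int main() {
    std::vector<llama_token> prompt;

    std::vector<llama_token> tok_row_end      = { 101 };           // single-token case: one-element vector
    std::vector<llama_token> tok_ov_img_start = { 102, 103, 104 }; // multi-token prefix for idefics3
    std::vector<llama_token> tok_sli_img_mid;                      // unused delimiter: empty vector adds nothing

    add_text(prompt, tok_ov_img_start);
    add_text(prompt, tok_sli_img_mid); // no LLAMA_TOKEN_NULL sentinel check needed
    add_text(prompt, tok_row_end);
    return 0;
}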

bool use_mrope = false; // for Qwen2VL, we need to use M-RoPE

// string template for slice image delimiters with row/col (idefics3)
std::string sli_img_start_tmpl;
gabe-l-hart (Collaborator Author)

As written, there was no way to create the <row_R_col_C> delimiters, since the row and col values need to be passed in at runtime. I opted for a basic printf-style template here. An alternative approach would be a nullable function handle implemented with a lambda per model. That might be slightly more efficient by avoiding the double snprintf below, but I'm not sure how that would trade off against the overhead of managing the function pointer and using std::string concatenation.
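
A minimal sketch of that double snprintf, assuming a template string shaped like the idefics3 delimiters (the exact template text and the helper name are illustrative):

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: expand a printf-style slice-start template with 1-based row/col.
static std::string format_slice_start(const std::string & tmpl, int row, int col) {
    const int n = std::snprintf(nullptr, 0, tmpl.c_str(), row, col); // first call: measure
    std::vector<char> buf(n + 1);
    std::snprintf(buf.data(), buf.size(), tmpl.c_str(), row, col);   // second call: write
    return std::string(buf.data(), n);
}

int main() {
    // assumed template shape, not necessarily the exact string used in mtmd
    const std::string sli_img_start_tmpl = "<fake_token_around_image><row_%d_col_%d>";
    std::printf("%s\n", format_slice_start(sli_img_start_tmpl, 1, 2).c_str());
    return 0;
}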

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master: (39 commits)
ci : disable AMD workflows + update NVIDIA workflows (ggml-org#16200)
ci : enable Vulkan workflow on Mac (ggml-org#16194)
ggml-cpu: Respect cpumask settings (ggml-org#16164)
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (ggml-org#15928)
zdnn: refactor codebase + add docs (ggml-org#16178)
codeowners : add @danbev to model-conversion example [no ci] (ggml-org#16190)
devops: add s390x containers (ggml-org#15915)
ggml-cpu : fix typo in gemm comments [no ci] (ggml-org#16189)
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (ggml-org#16177)
clang-tidy : disable warning about performance enum size (ggml-org#16127)
ggml : implement set_rows with i32 index (ggml-org#16159)
codeowners : update + cleanup (ggml-org#16174)
common : enable `--offline` mode without curl support (ggml-org#16137)
webui : fix handling incomplete chunks (ggml-org#16107)
embedding : fix typos in README (ggml-org#16171)
common : remove unused local variables (ggml-org#16140)
ggml : extend ggml_can_fuse to work with non-sequential nodes (ggml-org#16123)
ggml : add ggml_op_is_empty (ggml-org#16122)
codeowners : update ownership for @ngxson and @allozuar (ggml-org#16128)
Vulkan: add conv_transpose_2d operation (ggml-org#16022)
...
CISC (Collaborator) commented Sep 23, 2025

Generally LGTM, but I'll have to defer to @ngxson for mtmd.

gabe-l-hart (Collaborator Author)

@ngxson A gentle nudge on reviewing this (without disrupting your vacation, of course!). The model has been sitting in the top-ten trending on HF (hitting #1 a few days ago), so there's a lot of community interest in getting this fully supported everywhere, especially for embedded inference with llama.cpp-based platforms.

* origin/master: (124 commits)
metal : fix loop bound in ggml_mem_ranges (ggml-org#16412)
llama : fix shapes for bert/mpt q/k norm (ggml-org#16409)
ggml : fix graph reallocation with multiple chunks (ggml-org#16396)
Fix missing messages on sibling navigation (ggml-org#16408)
vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (ggml-org#16354)
vulkan: Fix FA coopmat1 invalid array indexing (ggml-org#16365)
ci : change macos-13 to macos-15-intel (ggml-org#16401)
Capture model name only after first token (streaming) or completed request (ggml-org#16405)
vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (ggml-org#16316)
webui : Fix messages payload sent to chat completions (ggml-org#16402)
fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (ggml-org#16356)
test-barrier : do not use more threads than physically available (ggml-org#16389)
ggml webgpu: add support for soft_max, optimize rms_norm (ggml-org#16357)
model : Apertus model implementation (ggml-org#15852)
musa: update compile flags (ggml-org#16265)
ci : fix ubuntu-latest-cmake-rpc (disable ccache) (ggml-org#16388)
ci: update vulkan ci (ggml-org#16294)
ci : fix clean-up of old logs (ggml-org#16381)
SYCL: Update to oneAPI 2025.2 (ggml-org#16371)
HIP: add IMbackK to codeowner (ggml-org#16375)
...
Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
gabe-l-hart (Collaborator Author)

@ngxson Thanks for the review! I've addressed the comments.

Strangely, I'm now consistently seeing generation fail to terminate after an image. Applying these same fixes to SmolVLM produces much better results than without them and does not fail to terminate, so there seems to be something about granite-docling specifically causing this.

ngxson (Collaborator) commented Oct 3, 2025

Considering the changes don't touch token placement, it's quite surprising that we get such an error. Did you also try with top_k=1?

I haven't examined your code closely, but maybe it's worth comparing its results against calc_size_preserved_ratio; perhaps I missed something.

gabe-l-hart (Collaborator Author)

Ah, my comment was confusing. I tried at several points in history (before these review refactors, and before the most recent merge of master) and they all show the failure to stop, so I think it's something else. I was just surprised since I swear I wasn't seeing this when I was working on it a few weeks ago.

ngxson (Collaborator) left a comment

After merging with latest master, the test model still runs well, so this should be good to merge

Any other problems can be addressed in follow-up PRs. Thanks @gabe-l-hart and the IBM team for the awesome model!


Full test result:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S

CISC merged commit ca71fb9 into ggml-org:master on Oct 5, 2025
72 of 75 checks passed
gabe-l-hart (Collaborator Author)

Awesome, thanks for the review! I'll look into the stopping issue soon.

gabe-l-hart deleted the GraniteDocling branch on October 6, 2025 at 05:07
gabe-l-hart (Collaborator Author)

Follow up created to address stopping issue: #16438

CISC added the model (Model specific) and hot (Something that is hot) labels on Oct 7, 2025