Model: Granite docling + Idefics3 preprocessing (SmolVLM) #16206
Conversation
This should likely be moved into llava_uhd::get_slice_instructions, but for now this avoids disrupting the logic there. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are still errors encoding some of the image chunks, but the token sequence now matches transformers _almost_ perfectly, except for the double newline before the global image which shows up as two consecutive newline tokens instead of a single double-newline token. I think this is happening because the blocks are tokenized separately then concatenated. Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…icing Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…mmproj hparams Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
For Granite Docling, these come out to the same value, but that was just a coincidence. Branch: GraniteDocling Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
```cpp
LLAMA_VOCAB_PRE_TYPE_KIMI_K2         = 37,
LLAMA_VOCAB_PRE_TYPE_HUNYUAN_DENSE   = 38,
LLAMA_VOCAB_PRE_TYPE_GROK_2          = 39,
LLAMA_VOCAB_PRE_TYPE_GRANITE_DOCLING = 40,
```
Us and our long names 😞. I assume vertical alignment is worth preserving, but happy to not touch the other lines if preferred.
```cpp
// vision-specific
#define KEY_IMAGE_SIZE         "clip.vision.image_size"
#define KEY_PREPROC_IMAGE_SIZE "clip.vision.preproc_image_size"
```
I wasn't totally sure of the right name for this one since it comes from `preprocessor_config.size.longest_edge`. This seemed appropriate since it's the size of the image for the first step of preprocessing, but I'm open to other names too (`preproc_longest_edge`?).
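For context, a minimal sketch of how this key could be read with the public `gguf` API; the helper name, fallback handling, and surrounding context are assumptions for illustration, not the actual clip.cpp code:

```cpp
#include <cstdint>

#include "gguf.h" // gguf_find_key / gguf_get_val_u32

// Hedged sketch: read clip.vision.preproc_image_size from an mmproj GGUF,
// falling back to a caller-supplied default when the key is absent.
static uint32_t read_preproc_image_size(const struct gguf_context * ctx, uint32_t fallback) {
    const int64_t kid = gguf_find_key(ctx, "clip.vision.preproc_image_size");
    if (kid < 0) {
        return fallback; // older mmproj files won't carry the key
    }
    return gguf_get_val_u32(ctx, kid);
}
```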
tools/mtmd/clip.cpp
```cpp
// multiples of image_size (always rounding up)
//
// CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/image_processing_idefics3.py#L737
const float scale = std::min(
```
This logic is similar to `llava_uhd::get_slice_instructions`, but it's different enough that I opted not to put this there for simplicity. I could certainly move it if that would be better, though.
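For readers following along, here is a rough, self-contained sketch of the scaling scheme under discussion (cap the longest edge, then round both sides up to multiples of `image_size`). The struct and function names are illustrative assumptions, not the actual clip.cpp implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct img_dims { uint32_t width, height; };

// Rough sketch only: mirrors the idefics3-style resizing described above.
static img_dims idefics3_style_target_dims(img_dims in, uint32_t longest_edge, uint32_t image_size) {
    // 1) scale so the longest edge fits within `longest_edge`, preserving aspect ratio
    const float scale = std::min(
        (float) longest_edge / (float) in.width,
        (float) longest_edge / (float) in.height);
    uint32_t w = (uint32_t) std::roundf((float) in.width  * scale);
    uint32_t h = (uint32_t) std::roundf((float) in.height * scale);

    // 2) round each side up to a multiple of image_size so it splits into whole tiles
    w = ((w + image_size - 1) / image_size) * image_size;
    h = ((h + image_size - 1) / image_size) * image_size;
    return { w, h };
}
```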
tools/mtmd/mtmd.cpp
```cpp
if (clip_is_llava(ctx_clip)
        || clip_is_minicpmv(ctx_clip)
        || clip_is_glm(ctx_clip)
        || clip_is_idefics3(ctx_clip)) {
```
I wasn't fully clear on whether it was correct to limit this to non-batch encoding, but since other tile-based models fall into this category, I figured it was correct. If we don't need this, we could also remove `clip_is_idefics3`.
I tried removing this and things appear to work, so I've removed this part. I haven't tested robustly for concurrent inference, though.
```cpp
llama_token tok_sli_img_end = LLAMA_TOKEN_NULL; // single slice end
llama_token tok_sli_img_mid = LLAMA_TOKEN_NULL; // between 2 slices
llama_token tok_row_end     = LLAMA_TOKEN_NULL; // end of row
std::vector<llama_token> tok_ov_img_start;      // overview image
```
The overview image prefix is multiple tokens long for `idefics3`. It seemed likely that other models might have this for other delimiter tokens as well, so since they're all getting converted to `std::vector<llama_token>` to pass to `add_text` anyway, I figured it was more extensible to make all of these `std::vector<llama_token>`. It also avoids the need to check the sentinel value `LLAMA_TOKEN_NULL`.
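To illustrate the motivation, a toy sketch (not the real mtmd code) of how vector-valued delimiters compose when building the token stream; `add_text` here is a stand-in for the real helper and the token ids are made up:

```cpp
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Toy stand-in for the real add_text() helper: append a (possibly multi-token) delimiter.
static void add_text(std::vector<llama_token> & out, const std::vector<llama_token> & toks) {
    // an empty vector is a natural no-op, so no LLAMA_TOKEN_NULL sentinel check is needed
    out.insert(out.end(), toks.begin(), toks.end());
}

int main() {
    std::vector<llama_token> prompt;
    const std::vector<llama_token> tok_ov_img_start = { 101, 102 }; // multi-token prefix (made-up ids)
    const std::vector<llama_token> tok_ov_img_end   = { 103 };      // single-token case still works

    add_text(prompt, tok_ov_img_start);
    // ... the overview image embedding chunk would be inserted here ...
    add_text(prompt, tok_ov_img_end);
    return 0;
}
```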
```cpp
bool use_mrope = false; // for Qwen2VL, we need to use M-RoPE

// string template for slice image delimiters with row/col (idefics3)
std::string sli_img_start_tmpl;
```
As written, there was no way to create the `<row_R_col_C>` delimiters since the row and col values need to be passed as inputs. I opted to use a basic `printf`-style template here. One alternate approach would be to make this a nullable function handle that could be implemented with a lambda for individual models. That might be slightly more efficient by avoiding the double `snprintf` below, but I'm not sure how that would trade off against the overhead of managing the function pointer and using `std::string` concatenation methods.
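For concreteness, a minimal sketch of the printf-style expansion described above, including the double `snprintf` (size first, then format). The template string in the usage comment is an assumed example, not necessarily the exact one stored for any model:

```cpp
#include <cstdio>
#include <string>

// Expand a printf-style slice-start template with row/col indices.
// The first snprintf sizes the output, the second formats into it ("double snprintf").
static std::string expand_sli_img_start(const std::string & tmpl, int row, int col) {
    const int n = std::snprintf(nullptr, 0, tmpl.c_str(), row, col);
    if (n <= 0) {
        return "";
    }
    std::string out(n, '\0');
    std::snprintf(&out[0], (size_t) n + 1, tmpl.c_str(), row, col);
    return out;
}

// e.g. expand_sli_img_start("<fake_token_around_image><row_%d_col_%d>", 2, 3)
//   -> "<fake_token_around_image><row_2_col_3>"
```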
* origin/master: (39 commits)
  ci : disable AMD workflows + update NVIDIA workflows (ggml-org#16200)
  ci : enable Vulkan workflow on Mac (ggml-org#16194)
  ggml-cpu: Respect cpumask settings (ggml-org#16164)
  ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (ggml-org#15928)
  zdnn: refactor codebase + add docs (ggml-org#16178)
  codeowners : add @danbev to model-conversion example [no ci] (ggml-org#16190)
  devops: add s390x containers (ggml-org#15915)
  ggml-cpu : fix typo in gemm comments [no ci] (ggml-org#16189)
  feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (ggml-org#16177)
  clang-tidy : disable warning about performance enum size (ggml-org#16127)
  ggml : implement set_rows with i32 index (ggml-org#16159)
  codeowners : update + cleanup (ggml-org#16174)
  common : enable `--offline` mode without curl support (ggml-org#16137)
  webui : fix handling incomplete chunks (ggml-org#16107)
  embedding : fix typos in README (ggml-org#16171)
  common : remove unused local variables (ggml-org#16140)
  ggml : extend ggml_can_fuse to work with non-sequential nodes (ggml-org#16123)
  ggml : add ggml_op_is_empty (ggml-org#16122)
  codeowners : update ownership for @ngxson and @allozuar (ggml-org#16128)
  Vulkan: add conv_transpose_2d operation (ggml-org#16022)
  ...
Generally LGTM, but I'll have to defer to @ngxson for …
@ngxson A gentle nudge on reviewing this (without disrupting your vacation, of course!). The model has been sitting in the top-ten trending on HF (hitting #1 a few days ago), so there's a lot of community interest in getting this fully supported everywhere, especially for embedded inference with …
* origin/master: (124 commits)
  metal : fix loop bound in ggml_mem_ranges (ggml-org#16412)
  llama : fix shapes for bert/mpt q/k norm (ggml-org#16409)
  ggml : fix graph reallocation with multiple chunks (ggml-org#16396)
  Fix missing messages on sibling navigation (ggml-org#16408)
  vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (ggml-org#16354)
  vulkan: Fix FA coopmat1 invalid array indexing (ggml-org#16365)
  ci : change macos-13 to macos-15-intel (ggml-org#16401)
  Capture model name only after first token (streaming) or completed request (ggml-org#16405)
  vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (ggml-org#16316)
  webui : Fix messages payload sent to chat completions (ggml-org#16402)
  fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (ggml-org#16356)
  test-barrier : do not use more threads than physically available (ggml-org#16389)
  ggml webgpu: add support for soft_max, optimize rms_norm (ggml-org#16357)
  model : Apertus model implementation (ggml-org#15852)
  musa: update compile flags (ggml-org#16265)
  ci : fix ubuntu-latest-cmake-rpc (disable ccache) (ggml-org#16388)
  ci: update vulkan ci (ggml-org#16294)
  ci : fix clean-up of old logs (ggml-org#16381)
  SYCL: Update to oneAPI 2025.2 (ggml-org#16371)
  HIP: add IMbackK to codeowner (ggml-org#16375)
  ...
@ngxson Thanks for the review! I've addressed the comments. Strangely, now I'm consistently seeing the image fail to terminate generation. Applying these same fixes to SmolVLM produces much better results than without and does not fail to terminate, so there seems to be something about …
Considering the changes don't touch token placements, it's quite surprising that we get such an error. Did you also try with top_k=1? I haven't examined your code closely, but maybe it's worth comparing its results against calc_size_preserved_ratio; perhaps I missed something.
Ah, my comment was confusing. I tried at several points in history (before these review refactors, and before the most recent merge of …)
After merging with latest master, the test model still runs well, so this should be good to merge.
Any other problems can be addressed in follow-up PRs. Thanks @gabe-l-hart and the IBM team for the awesome model!
Full test results:
OK: llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK: llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK: llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK: llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK: llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK: llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
OK: llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK: llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK: llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK: llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK: llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK: llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
OK: llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK: llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
OK: llama-mtmd-cli ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S
Awesome, thanks for the review! I'll look into the stopping issue soon.
Follow-up created to address the stopping issue: #16438
NOTES
This PR replaces #16112 since that one was based on a branch in the core repo and I no longer have push access to the branch after #16113.
closes #16110
Description
This PR adds model support for https://huggingface.co/ibm-granite/granite-docling-258M. There are two key parts to the support:
1. `convert_hf_to_gguf.py` and the adjacent scripts/libraries to support the model
2. `idefics3`-style image preprocessing support in `mtmd`
Details
The current code simply resizes the input to a square with padding. This is very different from the preprocessing implementation in transformers. From what I can tell, support for the more nuanced tile-based preprocessing was lost during the transition from `llava` to `mtmd`.

The logic in `transformers` is (a rough sketch of the resulting layout follows this list):

1. Resize the image, preserving aspect ratio, so that its longest edge is `preprocessor_config.size.longest_edge` (if `do_resize`)
2. Resize again so that each side is a multiple of `image_size`, always rounding up (if `do_image_splitting`)
3. Split the image into an N x M grid of tiles of `image_size x image_size`, where `N = width / image_size` and `M = height / image_size`
4. The original image is also resized to `image_size x image_size` and added to the list of patches as the `<global-img>`
5. In the prompt, each tile is represented as `<fake_token_around_image><row_R_col_C>{image tokens}`
6. A `\n` is added to the end of each row of tiles
7. After the final row, `\n<fake_token_around_image><global-img>{image tokens}<fake_token_around_image>` is added for the global image
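As referenced above, a compact sketch of the resulting marker layout; the function is illustrative only, and in the real pipeline the `{image tokens}` are image embeddings handled by `mtmd` rather than literal text:

```cpp
#include <string>

// Illustrative only: assembles the marker layout described in the list above.
// `img` stands in for one patch worth of image placeholder content.
static std::string layout_image_markers(int n_rows, int n_cols, const std::string & img) {
    std::string out;
    for (int r = 1; r <= n_rows; ++r) {
        for (int c = 1; c <= n_cols; ++c) {
            out += "<fake_token_around_image><row_" + std::to_string(r) +
                   "_col_" + std::to_string(c) + ">" + img;
        }
        out += "\n"; // end of each row of tiles
    }
    // global (overview) image, preceded by the extra newline
    out += "\n<fake_token_around_image><global-img>" + img + "<fake_token_around_image>";
    return out;
}
```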
Open Questions
In `transformers`, there are conditionals that have default values from `preprocessor_config` for `do_resize` and `do_image_splitting`. These can be overwritten with `kwargs` at invocation time. For both Granite Docling and SmolVLM, the defaults are `true` for both values, and the models give demonstrably bad results if the above tiling scheme is not followed. As such, I've opted to not make these configurable either in the `hparams` or as input to `mtmd_encode`. One alternate approach would be to decouple the preprocessing schemes from the models and give models default preprocessing schemes while allowing alternate schemes to be chosen at runtime. Given the poor results with the wrong preprocessing, though, I don't think this is worth the effort.

Testing