
Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend #10843

Closed
stduhpf opened this issue Dec 15, 2024 · 23 comments

@stduhpf
Contributor

stduhpf commented Dec 15, 2024

Name and Version

.\build\bin\Release\llama-cli.exe --version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4329 (89d604f)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X +RX 5700 XT

Models

Qwen2-VL-7B-Instruct-IQ4_NL + mmproj-Qwen2-VL-7B-Instruct-f32

Problem description & steps to reproduce

When I run it with the Vulkan build, the description given by the model has nothing to do with the image passed as an argument (no matter the -ngl value; even -ngl 0 is broken). The exact same setup works perfectly fine with the CPU backend.

I know the Vulkan backend doesn't support Qwen2-VL yet, but according to #10361 (comment), this should only cause slowdowns, not invalid outputs.

Relevant log output

Image input:

[image attachment: Untitled.png]

-ngl 0

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in   843.10 ms
encode_image_with_clip: all 1 segments encoded in   843.17 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in   845.06 ms by CLIP (    2.34 ms per image patch)

The image shows a person wearing a black and white striped shirt, a black jacket, and black pants, standing in front of a black background. The person is also holding a black and white striped umbrella. The context of this image could be a fashion or clothing advertisement, showcasing the person's outfit and accessories. The black and white striped shirt, jacket, and umbrella create a monochromatic look, which is often used in fashion photography to emphasize the clothing and accessories. The black background helps to highlight the person and their outfit, making them the focal point of the image.
llama_perf_context_print:        load time =    6644.91 ms
llama_perf_context_print: prompt eval time =    2276.84 ms /   391 tokens (    5.82 ms per token,   171.73 tokens per second)
llama_perf_context_print:        eval time =   11500.85 ms /   115 runs   (  100.01 ms per token,    10.00 tokens per second)
llama_perf_context_print:       total time =   18275.28 ms /   506 tokens

-ngl 99

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99
[...]
encode_image_with_clip: step 1 of 1 encoded in  3248.68 ms
encode_image_with_clip: all 1 segments encoded in  3248.76 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  3249.79 ms by CLIP (    9.00 ms per image patch)

The image appears to be a logo or a symbol, but it is not clear what it represents. It could be a brand logo, a company logo, or a symbol for a specific organization or group. Without additional context or information, it is difficult to determine the exact meaning or purpose of the image.
llama_perf_context_print:        load time =    9346.17 ms
llama_perf_context_print: prompt eval time =    1009.47 ms /   391 tokens (    2.58 ms per token,   387.33 tokens per second)
llama_perf_context_print:        eval time =    1500.12 ms /    61 runs   (   24.59 ms per token,    40.66 tokens per second)
llama_perf_context_print:       total time =   10889.94 ms /   452 tokens

CPU backend for comparison

> .\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in  8483.38 ms
encode_image_with_clip: all 1 segments encoded in  8483.47 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  8484.85 ms by CLIP (   23.50 ms per image patch)

The image appears to be a simple text-based graphic with the words "READABLE TEXT" written in a bold, black font. The context of this image could be related to demonstrating or emphasizing the importance of clear and legible text, possibly in the context of design, typography, or user interface (UI) design. It might be used to highlight the importance of making text easy to read and understand for users.
llama_perf_context_print:        load time =   21741.16 ms
llama_perf_context_print: prompt eval time =   10924.92 ms /   391 tokens (   27.94 ms per token,    35.79 tokens per second)
llama_perf_context_print:        eval time =    8322.39 ms /    83 runs   (  100.27 ms per token,     9.97 tokens per second)
llama_perf_context_print:       total time =   30185.33 ms /   474 tokens
@ggerganov
Member

Could you do a quick test and see if it works with an F16 vision projector:

.\build\bin\Release\llama-quantize.exe .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf f16

.\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

It's not working :(

.\build\bin\Release\llama-quantize.exe  .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf  .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf f16
main: build = 4333 (a0974156)
main: built with MSVC 19.41.34120.0 for x64
main: quantizing '.\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf' to '.\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf' as F16
llama_model_loader: loaded meta data with 20 key-value pairs and 521 tensors from .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = clip
llama_model_loader: - kv   1:                        general.description str              = image encoder for Qwen2VL
llama_model_loader: - kv   2:                          general.file_type u32              = 0
llama_model_loader: - kv   3:                      clip.has_text_encoder bool             = false
llama_model_loader: - kv   4:                    clip.has_vision_encoder bool             = true
llama_model_loader: - kv   5:                    clip.has_qwen2vl_merger bool             = true
llama_model_loader: - kv   6:                        clip.projector_type str              = qwen2vl_merger
llama_model_loader: - kv   7:                              clip.use_silu bool             = false
llama_model_loader: - kv   8:                              clip.use_gelu bool             = false
llama_model_loader: - kv   9:                     clip.vision.patch_size u32              = 14
llama_model_loader: - kv  10:                     clip.vision.image_size u32              = 560
llama_model_loader: - kv  11:               clip.vision.embedding_length u32              = 1280
llama_model_loader: - kv  12:                 clip.vision.projection_dim u32              = 3584
llama_model_loader: - kv  13:           clip.vision.attention.head_count u32              = 16
llama_model_loader: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                    clip.vision.block_count u32              = 32
llama_model_loader: - kv  16:            clip.vision.feed_forward_length u32              = 0
llama_model_loader: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
llama_model_loader: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
llama_model_loader: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
llama_model_loader: - type  f32:  521 tensors
llama_model_quantize: failed to quantize: unknown model architecture: 'clip'
main: failed to quantize model from '.\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf'

stable-diffusion.cpp's CLI does allow me to convert it to f16, but I think it strips off important metadata:

.\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
build: 4333 (a0974156) with MSVC 19.41.34120.0 for x64
llama_model_loader: loaded meta data with 37 key-value pairs and 339 tensors from .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 VL 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv  14:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  15:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv  16:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  18:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 25
llama_model_loader: - kv  22:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-GGUF...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:   28 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:  169 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2vl
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 8
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = IQ4_NL - 4.5 bpw
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.13 GiB (4.66 BPW)
llm_load_print_meta: general.name     = Qwen2 VL 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  4226.55 MiB
.....................................................................................
key general.file_type not found in file

@ggerganov
Member

Ah, I think you have to use the surgery script:

python ./examples/llava/qwen2_vl_surgery.py Qwen/Qwen2-VL-2B-Instruct --data_type fp16
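
(Presumably the 7B projector is produced the same way by swapping in the 7B repo id — untested assumption, just substituting the model name:)

python ./examples/llava/qwen2_vl_surgery.py Qwen/Qwen2-VL-7B-Instruct --data_type fp16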

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

Is it the same mmproj for the 2B and the 7B model?

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

Is it the same mmproj for the 2B and the 7B model?

It seems not

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

Could you do a quick test and see if it works with an F16 vision projector:

>.\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-2B-Instruct-Q8_0.gguf --mmproj .\qwen-qwen2-vl-2b-instruct-vision.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
clip_model_load: model name:   Qwen/Qwen2-VL-2B-Instruct
clip_model_load: description:  image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    521
clip_model_load: n_kv:         20
clip_model_load: ftype:        f16
[...]

CPU:

The image shows the text "READABLE TEXT." This text is likely used to indicate that the content or information presented is easy to read and understand. It could be used in various contexts such as a website, a document, or a presentation where the goal is to make the information accessible to a wide audience.

Vulkan (ngl 99):

The image appears to be a stylized representation of a person wearing a hat and a coat. The hat and coat are the main focus, and the background is a simple, minimalistic design. The context of this image could be related to a fashion advertisement, a promotional poster, or a branding image. The hat and coat might be part of a collection or a series of items, such as a hat and coat set, a fashion line, or a brand identity.

Still not working

@jeffbolznv
Collaborator

Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).

ggml-vulkan.obj : error LNK2019: unresolved external symbol ggml_graph_compute_with_ctx referenced in function "void __cdecl ggml_vk_check_results_0(struct ggml_tensor *)" (?ggml_vk_check_results_0@@YAXPEAUggml_tensor@@@Z) [C:\llama.cpp\buildv\ggml\src\ggml-vulkan\ggml-vulkan.vcxproj] C:\llama.cpp\buildv\bin\Release\ggml-vulkan.dll : fatal error LNK1120: 1 unresolved externals [C:\llama.cpp\buildv\ggml\src\ggml-vulkan\ggml-vulkan.vcxproj]

@jeffbolznv
Collaborator

To fix those linker issues you need to add the ggml-cpu sources to ggml-vulkan.

@slaren
Member

slaren commented Dec 15, 2024

Building with -DBUILD_SHARED_LIBS=OFF should also work.
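
(For reference, a configure/build invocation along these lines should do it — a minimal sketch assuming a standard CMake setup; the flags match the ones used elsewhere in this thread:)

cmake -B build -DGGML_VULKAN=ON -DGGML_VULKAN_CHECK_RESULTS=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release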

@stduhpf
Contributor Author

stduhpf commented Dec 15, 2024

Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).

1 node_0 op=IM2COL avg_err=0
2 node_3 op=MUL_MAT avg_err=0.00111936
3  (reshaped) (permuted) (cont) op=CONT avg_err=0
4 node_7 op=IM2COL avg_err=0
5 node_10 op=MUL_MAT avg_err=0.00109479
6  (reshaped) (permuted) (cont) op=CONT avg_err=0
7 node_14 op=ADD avg_err=0
8  (permuted) (cont) op=CONT avg_err=0
9  (permuted) (cont) (reshaped) (reshaped) (permuted) (cont) op=CONT avg_err=0
10 node_22 op=NORM avg_err=3.37601e-09
11 node_23 op=MUL avg_err=0
12 node_24 op=ADD avg_err=0
13 node_25 op=MUL_MAT avg_err=0.000117832
14 node_26 op=ADD avg_err=0
15  (reshaped) (permuted) (cont) op=CONT avg_err=0
16 node_31 op=MUL_MAT avg_err=0.0010295
17 node_32 op=ADD avg_err=0
C:\llama.cpp\ggml\src\ggml.c:3513: GGML_ASSERT(a->ne[2] == b->ne[0]) failed

@LostRuins
Collaborator

LostRuins commented Dec 16, 2024

I can confirm this issue happens even with no layers offloaded. On CPU backend it works fine.

Model is BF16, projector F16. Same assert as above.

@cyzero-kim
Contributor

cyzero-kim commented Dec 22, 2024

It’s a slightly different model, but it works well with MobileVLM, which uses CLIP. It doesn’t seem to be an issue with CLIP itself.

C:\work\llm\cyzero\llama.cpp\build\bin\Release> .\llama-llava-cli.exe -m C:\work\llm\MobileVLM_V2-1.7B-GGUF\ggml-model-q4_k.gguf --mmproj C:\work\llm\MobileVLM_V2-1.7B-GGUF\mmproj-model-f16.gguf --image C:\work\llm\4.png -ngl 20 -p "what brand name, you can see?"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: 4330 (b5ae1ddf) with MSVC 19.35.32217.1 for x64
llama_load_model_from_file: using device Vulkan0 (Intel(R) Iris(R) Xe Graphics) - 16163 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 219 tensors from C:\work\llm\MobileVLM_V2-1.7B-GGUF\ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Work
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                          llama.block_count u32              = 24
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 14
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   49 tensors
llama_model_loader: - type q4_K:  162 tensors
llama_model_loader: - type q5_K:    7 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 1.36 B
llm_load_print_meta: model size       = 754.43 MiB (4.64 BPW)
llm_load_print_meta: general.name     = Work
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Compiling shaders..........................Done!
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/25 layers to GPU
llm_load_tensors:      Vulkan0 model buffer size =   551.56 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   754.43 MiB
...........................................................................................
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from C:\work\llm\MobileVLM_V2-1.7B-GGUF\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = ldpv2
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  143 tensors
clip_model_load: CLIP using Vulkan backend
key clip.use_silu not found in file
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     567.51 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  567.51 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init:    Vulkan0 KV buffer size =   640.00 MiB
llama_kv_cache_init:        CPU KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =   186.25 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 48 (with bs=512), 3 (with bs=1)
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   890.28 ms by CLIP (    6.18 ms per image patch)

 The image features a black background with white text. The text reads "readable text", indicating a focus on the readability of the text. The text is written in an all-caps format, suggesting that it may be written in a non-traditional or serif font, which is sometimes seen in more modern digital writing. The text is centered, making it the main point of interest in the image. The image does not contain any other objects or elements, and the text is the only source of information. The overall impression is one of simplicity and focus on the text itself.
llama_perf_context_print:        load time =    9943.68 ms
llama_perf_context_print: prompt eval time =    2789.78 ms /   191 tokens (   14.61 ms per token,    68.46 tokens per second)
llama_perf_context_print:        eval time =    6256.90 ms /   119 runs   (   52.58 ms per token,    19.02 tokens per second)
llama_perf_context_print:       total time =   16289.22 ms /   310 tokens

"The image features a black background with white text. The text reads "readable text", indicating a focus on the readability of the text. The text is written in an all-caps format, suggesting that it may be written in a non-traditional or serif font, which is sometimes seen in more modern digital writing. The text is centered, making it the main point of interest in the image. The image does not contain any other objects or elements, and the text is the only source of information. The overall impression is one of simplicity and focus on the text itself."

@LostRuins
Collaborator

LostRuins commented Dec 22, 2024

Running CLIP on the CPU solves this issue; the main model can still be kept on the GPU. Possibly related to #10896 (that is a workaround, not a fix).

@jeffbolznv
Collaborator

FWIW this works for me at top of tree, on RTX 4070/Windows.

@stduhpf
Contributor Author

stduhpf commented Jan 5, 2025

Looks like it's working now.

@stduhpf stduhpf closed this as completed Jan 5, 2025
@0cc4m
Collaborator

0cc4m commented Jan 5, 2025

Looks like it's working now.

If you've got the time, it might be interesting to bisect the repo to figure out which commit fixed it, because I don't think it was intentional.

@LostRuins
Collaborator

Doesn't seem to be working for me.

@stduhpf are you sure you are running CLIP on GPU? It was disabled in #10896 and it has not been re-enabled since then. Please ensure it's re-enabled, and then test again.

@stduhpf
Contributor Author

stduhpf commented Jan 6, 2025

Doesn't seem to be working for me.

@stduhpf are you sure you are running CLIP on GPU? It was disabled in #10896 and it has not been re-enabled since then. Please ensure it's re-enabled, and then test again.

clip_model_load: CLIP using CPU backend

How do I re-enable it? I'm on a fresh build (-DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=OFF -DGGML_RPC=ON -DGGML_VULKAN_CHECK_RESULTS=OFF) at commit e6e7c75

Edit: nevermind I'm stupid
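
(For anyone else wondering what "re-enable" means: it amounts to restoring the Vulkan backend init that #10896 disabled in examples/llava/clip.cpp — roughly the block below, reconstructed from memory, so treat it as a sketch rather than the exact code:)

#ifdef GGML_USE_VULKAN
    // restore this block so clip_model_load picks the Vulkan backend again
    new_clip->backend = ggml_backend_vk_init(0);
    LOG_INF("%s: CLIP using Vulkan backend\n", __func__);
#endif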

@stduhpf
Contributor Author

stduhpf commented Jan 6, 2025

Ok so it's indeed still broken with clip on Vulkan. Should I reopen?

@0cc4m
Collaborator

0cc4m commented Jan 6, 2025

Yeah, keep it open while it's not fixed.

@jeffbolznv
Collaborator

I verified I can repro with that other PR reverted. Looks like the clip code always executes the graph on the Vulkan backend even for ops that aren't supported. I guess that's why the GPU backends were disabled?
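
(Illustrative only, not the actual fix that later landed in #11902: the missing piece is a per-op capability check before committing the whole graph to one backend. A minimal sketch using ggml-backend's supports_op query — the helper name and its use are hypothetical:)

// Hypothetical helper: true only if every node in the graph is supported by
// the given backend; otherwise the caller should fall back to CPU (or split
// the graph) instead of running everything on Vulkan.
static bool clip_graph_fully_supported(ggml_backend_t backend, struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        if (!ggml_backend_supports_op(backend, ggml_graph_node(gf, i))) {
            return false;
        }
    }
    return true;
}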

@jeffbolznv
Collaborator

This was fixed by #11902.
