Compile bug: [QNN] Not able to run tiny llama model with QNN NPU #14

akshatshah17 opened this issue Dec 13, 2024 · 19 comments

@akshatshah17

Git commit

e36ad89

Operating systems

Linux

GGML backends

CPU

Problem description & steps to reproduce

I followed this procedure to build llama.cpp and convert the model into quantized GGUF format, but when running the model on the device it fails to load.

git clone https://github.com/chraac/llama.cpp.git --recursive
cd llama.cpp
git checkout dev-refactoring
export ANDROID_NDK=/home/code/Android/Ndk/android-ndk-r26d/
export QNN_SDK_PATH=/home/code/Android/qnn-sdk/qairt/2.27.5.241009/

Build for CPU
cmake -B build
cmake --build build --config Release -j16

Build for Android
cmake \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.7a" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a" \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DGGML_QNN=ON \
  -DGGML_QNN_DEFAULT_LIB_SEARCH_PATH=/data/local/tmp \
  -B build-android
cmake --build build-android --config Release -j4
cmake --install build-android --prefix install-android --config Release

Model conversion
python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_fp32.gguf --outtype f32
./build/bin/llama-quantize output_file_tiny_llama_fp32.gguf output_file_tiny_llama_Q4_K_M.gguf Q4_K_M

On S24 QC
adb push install-android/ /data/local/tmp/
adb push output_file_tiny_llama_Q4_K_M.gguf /data/local/tmp/

export LD_LIBRARY_PATH=/data/local/tmp/install-android/lib/
./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -c 512 -p "prompt"

First Bad Commit

No response

Relevant log output

build: 4396 (e36ad895) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device qnn-gpu (Qualcomm Adreno GPU) - 7630 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 273 tensors from output_file_SR_3B_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SR_3B
llama_model_loader: - kv   3:                         general.size_label str              = 3.6B
llama_model_loader: - kv   4:                          llama.block_count u32              = 30
llama_model_loader: - kv   5:                       llama.context_length u32              = 1280
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  13:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  14:                          general.file_type u32              = 15
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 105900
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,105900]  = ["<|end_of_text|>", "<|begin_of_text|...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,105900]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,105604]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type q4_K:  183 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 53
llm_load_vocab: token to piece cache size = 0.6436 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 105900
llm_load_print_meta: n_merges         = 105604
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1280
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1280
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.58 B
llm_load_print_meta: model size       = 2.04 GiB (4.90 BPW)
llm_load_print_meta: general.name     = SR_3B
llm_load_print_meta: BOS token        = 1 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
llm_load_print_meta: LF token         = 179 'Ä'
llm_load_print_meta: FIM PRE token    = 2 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 4 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 3 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 5 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 7 '<|repo_name|>'
llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
llm_load_print_meta: EOG token        = 5 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 7 '<|repo_name|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/31 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  2091.15 MiB
..................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (1280) -- the full capacity of the model will not be utilized
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 379]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 297]: failed to create QNN device
[qnn_init, 346]: why failed to initialize qnn context
[ggml_backend_qnn_init_with_device_context, 369]: init qnn subsystem failed with qnn backend qnn-npu, pls check why
llama_new_context_with_model: failed to initialize qnn-npu backend
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu
common_init_from_params: failed to create context with model 'output_file_SR_3B_Q4_K_M.gguf'
main: error: unable to load model
@akshatshah17 (Author)

@chraac can you please reply on this?

@chraac (Owner)

chraac commented Jan 5, 2025

Hi @akshatshah17,
looking through your error log, I found that:

  1. The NPU initialization failed. Please make sure to put the libQnnHtp*.so libraries in the same directory alongside llama-cli; for more detail please have a look at docker-compose-compile.yml#L35. (Example push commands are sketched after this list.)
  2. It looks like you were trying to load a Q4 model, and quantized-model support in the QNN backend is still under construction, so you should try an F16/F32 model instead (also sketched below).
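
For example, something along these lines (a sketch only; the wildcard, the V75 HTP variant, and the SDK sub-paths are assumptions based on a typical QAIRT SDK layout and the S24's SoC, so adjust them to your setup):

adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnHtp*.so /data/local/tmp/install-android/bin/
adb push $QNN_SDK_PATH/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/install-android/bin/

and, for an F16 model instead of Q4_K_M:

python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_f16.gguf --outtype f16
adb push output_file_tiny_llama_f16.gguf /data/local/tmp/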

@akshatshah17 (Author)

Thanks @chraac, it's working. But from the logs below I can see that it first offloads the layers to the GPU, and after that the log line "qnn device name qnn-gpu" appears, which is fine; later in the logs, though, I also see some NPU-related lines, so I am not sure whether the model is running on the QNN GPU or the NPU. I have highlighted the relevant parts.

llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 23/23 layers to GPU

llm_load_tensors: CPU_Mapped model buffer size = 636.18 MiB
......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 299]: create QNN device successfully
[alloc_rpcmem, 594]: failed to allocate rpc memory, size: 2048 MB
[qnn_init, 372]: capacity of QNN rpc ion memory is about 2000 MB
[init_htp_perfinfra, 485]: HTP backend perf_infrastructure creation ok
[init_htp_perfinfra, 497]: HTP infra type = 0, which is perf infra type
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-npu

llama_kv_cache_init: qnn-gpu KV buffer size = 88.00 MiB
llama_new_context_with_model: KV self size = 88.00 MiB, K (f16): 44.00 MiB, V (f16): 44.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: model was trained on only 2048 context tokens (4096 specified)

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | AARCH64_REPACK = 1 |

sampler seed: 3467048278
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

[{<(Task)>}]
You are a summarization expert. Please read the provided carefully and summarize it in 3 sentences in English. The summary should comprehensively cover the entire content of the original text and be written with the same meaning as the source material.

[{<(Input)>}] Bread, milk, eggs, chicken, rice, pasta, tomatoes, spinach, bananas, apples, yogurt, cheese, toothpaste, soap, tissues, laundry detergent, coffee, Two proteins: a rotisserie chicken and two 4 oz. fillets of fresh salmon Two veggies: asparagus and carrots a handful of bananas and two avocados cereal

[{<(ParagraphSummary)>}]
Ingredients:

  • 1 rotisserie chicken (or two 4 oz. Fillets), cooked
  • 2 asparagus, chopped
  • 2 carrots, chopped
  • 2 bananas, sliced
  • 2 avocados, mashed
  • 1/4 cup cereal

Instructions:

  1. Preheat oven to 375°F. Line a baking dish with parchment paper.
  2. Place chicken in a large bowl, add asparagus and carrots, and toss with olive oil, salt, and pepper. Spread in a single layer in the prepared baking dish.
  3. In a small bowl, combine mashed banana, cereal, and chicken stock. Pour over chicken mixture.
  4. Bake for 30-35 minutes, or until cooked through.
  5. Serve with mashed avocado and top with toppings (such as sliced jalapeños or chopped cilantro). Enjoy! [end of text]

llama_perf_sampler_print: sampling time = 9.81 ms / 441 runs ( 0.02 ms per token, 44958.71 tokens per second)
llama_perf_context_print: load time = 637.56 ms
llama_perf_context_print: prompt eval time = 1246.73 ms / 190 tokens ( 6.56 ms per token, 152.40 tokens per second)
llama_perf_context_print: eval time = 3927.85 ms / 250 runs ( 15.71 ms per token, 63.65 tokens per second)
llama_perf_context_print: total time = 5202.37 ms / 440 tokens
[ggml_backend_qnn_free, 208]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

@chraac (Owner)

chraac commented Jan 11, 2025

From your log, it looks like it's running on the qnn-gpu device. By the way, the llama.cpp framework decides which device runs each layer based on the device's supports_op interface.
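
If you want to double-check which backend each graph node actually lands on, one option (assuming this branch keeps upstream ggml's scheduler debug switch) is to set the GGML_SCHED_DEBUG environment variable, which makes the scheduler print its split/backend assignments:

GGML_SCHED_DEBUG=1 ./install-android/bin/llama-cli -m <your-model>.gguf -c 512 -p "prompt"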

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hi, I got an error. I have followed your advice but I still get an error. Can you give me some suggestions on how to debug and fix the issue? Thanks a lot.

D:\repo\platform-tools>adb shell "cd /data/local/tmp/qnn && LD_LIBRARY_PATH=/data/local/tmp/qnn/lib:/data/local/tmp/qnn/lib/aarch64-android:/data/local/tmp/qnn/install-android/lib /data/local/tmp/qnn/install-android/bin/llama-cli -m /data/local/tmp/fp16.gguf -ngl 8 -c 2048 -p 'hi'"
build: 4396 (e36ad89) with for x86_64-w64-windows-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device qnn-gpu (Qualcomm Adreno GPU) - 8435 MiB free
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from /data/local/tmp/fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = TinyLlama
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = TinyLlama
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '
'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/23 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: failed to initialize qnn-npu backend
common_init_from_params: failed to create context with model '/data/local/tmp/fp16.gguf'
main: error: unable to load model
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp/qnn/lib/ as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 317]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 379]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp/qnn/lib/ as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:42(SM8475), htp_arch:69(QCOM_HTP_V69), vtcm_size:8 MB
[qnn_init, 295]: qnn_device_create failed with detailed status: 0x3f0
[qnn_init, 307]: Unknown error code: 1008
[qnn_init, 315]: failed to create QNN device
[qnn_init, 364]: why failed to initialize qnn context
[ggml_backend_qnn_init_with_device_context, 369]: init qnn subsystem failed with qnn backend qnn-npu, pls check why
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

The qnn-gpu backend is working after I switched to a Qwen model, so maybe it's a model issue.

@chraac (Owner)

chraac commented Jan 21, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

The qnn-gpu backend is working after I switched to a Qwen model, so maybe it's a model issue.

Congrats!
From your previous log, it looks like you ran into the same issue we mentioned above (failing to load libQnnHtp*.so), so maybe you can try copying those libraries from the host to your phone, as in the example above, if you want to use the NPU backend.

Do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

Please have a look at this repo: llama-cpp-qnn-builder; its Docker image contains all the necessary SDKs needed to build the QNN backend. A rough invocation is sketched below.
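
Roughly something like the following (a guess at the invocation, assuming the repo lives under chraac's account and that the docker-compose-compile.yml referenced earlier drives the build; check that repo's README for the actual workflow):

git clone https://github.com/chraac/llama-cpp-qnn-builder.git
cd llama-cpp-qnn-builder
docker compose -f docker-compose-compile.yml up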

@Davidqian123

Davidqian123 commented Feb 5, 2025

@chraac Is it possible to build the llama.cpp QNN backend for a laptop? I have a Snapdragon X Elite laptop chip which has an NPU. I checked the CMakeLists.txt in the ggml-qnn folder and found that the QNN backend build only supports Android devices.

@chraac (Owner)

chraac commented Feb 6, 2025

@chraac Is it possible to build the llama.cpp QNN backend for a laptop? I have a Snapdragon X Elite laptop chip which has an NPU. I checked the CMakeLists.txt in the ggml-qnn folder and found that the QNN backend build only supports Android devices.

Hi David, currently our QNN backend only supports Android devices. I understand there are Qualcomm devices that run Windows, and after reviewing the source code I've identified some modifications needed for Windows support:

  1. Windows uses different APIs from dlopen/dlclose for dynamic library loading, so we'll need to create an abstraction layer and implement a Windows-specific version.
  2. I can work on the implementation and verify that it compiles on my machine, but I don't have a Snapdragon laptop for testing, so I would need your help to verify the functionality. We could handle this work in a separate branch or issue.
  3. I've created a backlog item on the GitHub project to track the work; have a look: https://github.com/users/chraac/projects/2/views/1

@myan-o

myan-o commented Feb 7, 2025

The inference speed on the CPU is optimized and very fast, so there is no noticeable difference even when using the GPU.

@chraac (Owner)

chraac commented Feb 7, 2025

The inference speed on the CPU is optimized and very fast, so there is no noticeable difference even when using the GPU.

Hmm, it depends:

  1. Usually these devices have a fixed initialization overhead which doesn't scale down proportionally for smaller models, so you'll still face that overhead even with minimal processing.
  2. For small models the tensor dimensions are usually smaller too, which means the mat-mul operations might not be large enough to take full advantage of the GPU/NPU's processing capabilities.

May I ask which specific model you're working with?

@myan-o

myan-o commented Feb 7, 2025

What do you mean by modules?

@chraac (Owner)

chraac commented Feb 7, 2025

What do you mean by modules?

sry, typo, models

@myan-o

myan-o commented Feb 7, 2025

ReasoningCore-3B-T1_1.f16.gguf

@chraac (Owner)

chraac commented Feb 7, 2025

ReasoningCore-3B-T1_1.f16.gguf

Not tested on this model yet, but from my experience with llama3-3b it looks like there aren't many mat-mul ops that can be offloaded for an F16 model, because the convert op is not yet supported by the GPU backend.

And from the benchmark here, convert on the NPU is terribly slow:

Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely slow, much slower than the main matrix multiplication in fact. You can see the results in the npu_quant_profile.csv file in this repository, with conversions taking over 75% of the time.

So... hope Qualcomm can improve its perf someday.

@Davidqian123

For sure, I'm willing to help verify the Windows functionality! I'm also deep-diving into llama.cpp QNN backend support, and I'm willing to help support more ops.

chraac mentioned this issue Feb 8, 2025
@chraac (Owner)

chraac commented Feb 8, 2025

For sure, I'm willing to help verify the Windows functionality! I'm also deep-diving into llama.cpp QNN backend support, and I'm willing to help support more ops.

Nice! Created a new issue for it: #22

chraac self-assigned this Feb 21, 2025
@chraac (Owner)

chraac commented Feb 21, 2025

Hi @akshatshah17, did you manage to run your model successfully now? We've made many changes recently, please have another try!
