Compile bug: [QNN] Not able to run tiny llama model with QNN NPU #14

akshatshah17 opened this issue Dec 13, 2024 · 19 comments

@akshatshah17

Git commit

e36ad89

Operating systems

Linux

GGML backends

CPU

Problem description & steps to reproduce

I followed this procedure to build llama.cpp and convert the model into quantized GGUF format, but when running the model on the device it fails to load.

git clone https://github.com/chraac/llama.cpp.git --recursive
cd llama.cpp
git checkout dev-refactoring
export ANDROID_NDK=/home/code/Android/Ndk/android-ndk-r26d/
export QNN_SDK_PATH=/home/code/Android/qnn-sdk/qairt/2.27.5.241009/

Build for CPU
cmake -B build
cmake --build build --config Release -j16

Build for Android
cmake \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.7a" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a" \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DGGML_QNN=ON \
  -DGGML_QNN_DEFAULT_LIB_SEARCH_PATH=/data/local/tmp \
  -B build-android
cmake --build build-android --config Release -j4
cmake --install build-android --prefix install-android --config Release

Model conversion
python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_fp32.gguf --outtype f32
./build/bin/llama-quantize output_file_tiny_llama_fp32.gguf output_file_tiny_llama_Q4_K_M.gguf Q4_K_M

On S24 QC
adb push install-android/ /data/local/tmp/
adb push output_file_tiny_llama_Q4_K_M.gguf /data/local/tmp/

export LD_LIBRARY_PATH=/data/local/tmp/install-android/lib/
./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -c 512 -p "prompt"

First Bad Commit

No response

Relevant log output

build: 4396 (e36ad895) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device qnn-gpu (Qualcomm Adreno GPU) - 7630 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 273 tensors from output_file_SR_3B_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SR_3B
llama_model_loader: - kv   3:                         general.size_label str              = 3.6B
llama_model_loader: - kv   4:                          llama.block_count u32              = 30
llama_model_loader: - kv   5:                       llama.context_length u32              = 1280
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  13:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  14:                          general.file_type u32              = 15
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 105900
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,105900]  = ["<|end_of_text|>", "<|begin_of_text|...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,105900]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,105604]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type q4_K:  183 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 53
llm_load_vocab: token to piece cache size = 0.6436 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 105900
llm_load_print_meta: n_merges         = 105604
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1280
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1280
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.58 B
llm_load_print_meta: model size       = 2.04 GiB (4.90 BPW)
llm_load_print_meta: general.name     = SR_3B
llm_load_print_meta: BOS token        = 1 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
llm_load_print_meta: LF token         = 179 'Ä'
llm_load_print_meta: FIM PRE token    = 2 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 4 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 3 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 5 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 7 '<|repo_name|>'
llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
llm_load_print_meta: EOG token        = 5 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 7 '<|repo_name|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/31 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  2091.15 MiB
..................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (1280) -- the full capacity of the model will not be utilized
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 379]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 297]: failed to create QNN device
[qnn_init, 346]: why failed to initialize qnn context
[ggml_backend_qnn_init_with_device_context, 369]: init qnn subsystem failed with qnn backend qnn-npu, pls check why
llama_new_context_with_model: failed to initialize qnn-npu backend
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu
common_init_from_params: failed to create context with model 'output_file_SR_3B_Q4_K_M.gguf'
main: error: unable to load model
@akshatshah17 (Author)

@chraac can you please reply on this?

@chraac (Owner)

chraac commented Jan 5, 2025

Hi @akshatshah17,
looking through your error log, I found that:

  1. The NPU initialization failed. Please make sure to put the libQnnHtp*.so libraries in the same directory alongside llama-cli; for more detail please have a look at docker-compose-compile.yml#L35. (Example push commands are sketched after this list.)
  2. It looks like you were trying to load a Q4 model, and quantized-model support in the QNN backend is still under construction, so you should try an F16/F32 model instead (also sketched below).
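
For example, something along these lines (a sketch only; the wildcard, the V75 HTP variant, and the SDK sub-paths are assumptions based on a typical QAIRT SDK layout and the S24's SoC, so adjust them to your setup):

adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnHtp*.so /data/local/tmp/install-android/bin/
adb push $QNN_SDK_PATH/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/install-android/bin/

and, for an F16 model instead of Q4_K_M:

python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_f16.gguf --outtype f16
adb push output_file_tiny_llama_f16.gguf /data/local/tmp/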

@akshatshah17 (Author)

Thanks @chraac, it's working. But from the logs below I can see that it first offloads the layers to the GPU, and after that the log line "qnn device name qnn-gpu" appears, which is fine; later in the logs, though, I also see some NPU-related lines, so I am not sure whether the model is running on the QNN GPU or the NPU. I have highlighted the relevant parts.

llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 23/23 layers to GPU

llm_load_tensors: CPU_Mapped model buffer size = 636.18 MiB
......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 299]: create QNN device successfully
[alloc_rpcmem, 594]: failed to allocate rpc memory, size: 2048 MB
[qnn_init, 372]: capacity of QNN rpc ion memory is about 2000 MB
[init_htp_perfinfra, 485]: HTP backend perf_infrastructure creation ok
[init_htp_perfinfra, 497]: HTP infra type = 0, which is perf infra type
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-npu

llama_kv_cache_init: qnn-gpu KV buffer size = 88.00 MiB
llama_new_context_with_model: KV self size = 88.00 MiB, K (f16): 44.00 MiB, V (f16): 44.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: model was trained on only 2048 context tokens (4096 specified)

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | AARCH64_REPACK = 1 |

sampler seed: 3467048278
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

[{<(Task)>}]
You are a summarization expert. Please read the provided carefully and summarize it in 3 sentences in English. The summary should comprehensively cover the entire content of the original text and be written with the same meaning as the source material.

[{<(Input)>}] Bread, milk, eggs, chicken, rice, pasta, tomatoes, spinach, bananas, apples, yogurt, cheese, toothpaste, soap, tissues, laundry detergent, coffee, Two proteins: a rotisserie chicken and two 4 oz. fillets of fresh salmon Two veggies: asparagus and carrots a handful of bananas and two avocados cereal

[{<(ParagraphSummary)>}]
Ingredients:

  • 1 rotisserie chicken (or two 4 oz. Fillets), cooked
  • 2 asparagus, chopped
  • 2 carrots, chopped
  • 2 bananas, sliced
  • 2 avocados, mashed
  • 1/4 cup cereal

Instructions:

  1. Preheat oven to 375°F. Line a baking dish with parchment paper.
  2. Place chicken in a large bowl, add asparagus and carrots, and toss with olive oil, salt, and pepper. Spread in a single layer in the prepared baking dish.
  3. In a small bowl, combine mashed banana, cereal, and chicken stock. Pour over chicken mixture.
  4. Bake for 30-35 minutes, or until cooked through.
  5. Serve with mashed avocado and top with toppings (such as sliced jalapeños or chopped cilantro). Enjoy! [end of text]

llama_perf_sampler_print: sampling time = 9.81 ms / 441 runs ( 0.02 ms per token, 44958.71 tokens per second)
llama_perf_context_print: load time = 637.56 ms
llama_perf_context_print: prompt eval time = 1246.73 ms / 190 tokens ( 6.56 ms per token, 152.40 tokens per second)
llama_perf_context_print: eval time = 3927.85 ms / 250 runs ( 15.71 ms per token, 63.65 tokens per second)
llama_perf_context_print: total time = 5202.37 ms / 440 tokens
[ggml_backend_qnn_free, 208]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

@chraac (Owner)

chraac commented Jan 11, 2025

From your log, it looks like it's running on the qnn-gpu device. By the way, the llama.cpp framework decides which device runs each layer based on the device's supports_op interface.
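
If you want to double-check which backend each graph node actually lands on, one option (assuming this branch keeps upstream ggml's scheduler debug switch) is to set the GGML_SCHED_DEBUG environment variable, which makes the scheduler print its split/backend assignments:

GGML_SCHED_DEBUG=1 ./install-android/bin/llama-cli -m <your-model>.gguf -c 512 -p "prompt"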

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hi, I got an error. I have followed your advice but I still get an error. Can you give me some suggestions on how to debug and fix the issue? Thanks a lot.

D:\repo\platform-tools>adb shell "cd /data/local/tmp/qnn && LD_LIBRARY_PATH=/data/local/tmp/qnn/lib:/data/local/tmp/qnn/lib/aarch64-android:/data/local/tmp/qnn/install-android/lib /data/local/tmp/qnn/install-android/bin/llama-cli -m /data/local/tmp/fp16.gguf -ngl 8 -c 2048 -p 'hi'"
build: 4396 (e36ad89) with for x86_64-w64-windows-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device qnn-gpu (Qualcomm Adreno GPU) - 8435 MiB free
llama_model_loader: loaded meta data with 22 key-value pairs and 201 tensors from /data/local/tmp/fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = TinyLlama
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = TinyLlama
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '
'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 ''
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/23 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: failed to initialize qnn-npu backend
common_init_from_params: failed to create context with model '/data/local/tmp/fp16.gguf'
main: error: unable to load model
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp/qnn/lib/ as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 317]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 379]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp/qnn/lib/ as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:42(SM8475), htp_arch:69(QCOM_HTP_V69), vtcm_size:8 MB
[qnn_init, 295]: qnn_device_create failed with detailed status: 0x3f0
[qnn_init, 307]: Unknown error code: 1008
[qnn_init, 315]: failed to create QNN device
[qnn_init, 364]: why failed to initialize qnn context
[ggml_backend_qnn_init_with_device_context, 369]: init qnn subsystem failed with qnn backend qnn-npu, pls check why
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

@TerryT9

TerryT9 commented Jan 20, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

The qnn-gpu backend is working after I switched to a Qwen model, so maybe it's a model issue.

@chraac (Owner)

chraac commented Jan 21, 2025

@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

The qnn-gpu backend is working after I switched to a Qwen model, so maybe it's a model issue.

Congrats!
From your previous log, it looks like you ran into the same issue we mentioned above (failing to load libQnnHtp*.so), so maybe you can try copying those libraries from the host to your phone, as in the example above, if you want to use the NPU backend.

Do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.

Please have a look at this repo: llama-cpp-qnn-builder; its Docker image contains all the necessary SDKs needed to build the QNN backend. A rough invocation is sketched below.
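
Roughly something like the following (a guess at the invocation, assuming the repo lives under chraac's account and that the docker-compose-compile.yml referenced earlier drives the build; check that repo's README for the actual workflow):

git clone https://github.com/chraac/llama-cpp-qnn-builder.git
cd llama-cpp-qnn-builder
docker compose -f docker-compose-compile.yml up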

@Davidqian123

Davidqian123 commented Feb 5, 2025

@chraac Is it possible to build the llama.cpp QNN backend for a laptop? I have a Snapdragon X Elite laptop chip which has an NPU. I checked the CMakeLists.txt in the ggml-qnn folder and found that the QNN backend build only supports Android devices.

@chraac (Owner)

chraac commented Feb 6, 2025

@chraac Is it possible to build the llama.cpp QNN backend for a laptop? I have a Snapdragon X Elite laptop chip which has an NPU. I checked the CMakeLists.txt in the ggml-qnn folder and found that the QNN backend build only supports Android devices.

Hi David, currently our QNN backend only supports Android devices. I understand there are Qualcomm devices that run Windows, and after reviewing the source code I've identified some modifications needed for Windows support:

  1. Windows uses different APIs from dlopen/dlclose for dynamic library loading, so we'll need to create an abstraction layer and implement a Windows-specific version.
  2. I can work on the implementation and verify that it compiles on my machine, but I don't have a Snapdragon laptop for testing, so I would need your help to verify the functionality. We could handle this work in a separate branch or issue.
  3. I've created a backlog item on the GitHub project to track the work; have a look: https://github.com/users/chraac/projects/2/views/1

@myan-o

myan-o commented Feb 7, 2025

The inference speed on the CPU is optimized and very fast, so there is no noticeable difference even when using the GPU.

@chraac (Owner)

chraac commented Feb 7, 2025

The inference speed on the CPU is optimized and very fast, so there is no noticeable difference even when using the GPU.

Hmm, it depends:

  1. Usually these devices have a fixed initialization overhead which doesn't scale down proportionally for smaller models, so you'll still face that overhead even with minimal processing.
  2. For small models the tensor dimensions are usually smaller too, which means the mat-mul operations might not be large enough to take full advantage of the GPU/NPU's processing capabilities.

May I ask which specific model you're working with?

@myan-o

myan-o commented Feb 7, 2025

What do you mean by modules?

@chraac (Owner)

chraac commented Feb 7, 2025

What do you mean by modules?

sry, typo, models

@myan-o

myan-o commented Feb 7, 2025

ReasoningCore-3B-T1_1.f16.gguf

@chraac (Owner)

chraac commented Feb 7, 2025

ReasoningCore-3B-T1_1.f16.gguf

Not tested on this model yet, but from my experience with llama3-3b it looks like there aren't many mat-mul ops that can be offloaded for an F16 model, because the convert op is not yet supported by the GPU backend.

And from the benchmark here, convert on the NPU is terribly slow:

Unfortunately we discovered that the conversion operations as implemented on the NPU were extremely slow, much slower than the main matrix multiplication in fact. You can see the results in the npu_quant_profile.csv file in this repository, with conversions taking over 75% of the time.

So... hope Qualcomm can improve its perf someday.

@Davidqian123

For sure, I'm willing to help verify the Windows functionality! I'm also deep-diving into llama.cpp QNN backend support, and I'm willing to help support more ops.

chraac mentioned this issue Feb 8, 2025
@chraac (Owner)

chraac commented Feb 8, 2025

For sure, I'm willing to help verify the Windows functionality! I'm also deep-diving into llama.cpp QNN backend support, and I'm willing to help support more ops.

Nice! Created a new issue for it: #22

chraac self-assigned this Feb 21, 2025
@chraac (Owner)

chraac commented Feb 21, 2025

Hi @akshatshah17, did you manage to run your model successfully now? We've made many changes recently, please have another try!
