Compile bug: [QNN] Not able to run tiny llama model with QNN NPU #14
Comments
@chraac can you please reply on this?
Hi @akshatshah17,
thanks @chraac it's working, but from the logs below I can see that it first offloads the layers to the GPU, and after that this log appears: qnn device name qnn-gpu. That is fine, but later in the logs I also see some NPU-related lines, so I am not sure whether the model is running on the QNN GPU or the NPU. I have highlighted the parts:
llm_load_print_meta: max token length = 48
[qnn_init, 248]: device property is not supported
[qnn_init, 258]: device counts 1
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | AARCH64_REPACK = 1 |
sampler seed: 3467048278
[{<(Task)>}] [{<(ParagraphSummary)>}]
Instructions:
llama_perf_sampler_print: sampling time = 9.81 ms / 441 runs ( 0.02 ms per token, 44958.71 tokens per second)
From your log, it looks like it's running on qnn-gpu.
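As a sketch (reusing the paths, model file, and flags from the commands elsewhere in this thread), one way to double-check which QNN device the layers actually landed on is to capture the full run log and search it for the device/offload lines; the grep patterns below are assumptions based on the log snippets above and may differ between llama.cpp revisions:

# Hedged sketch: capture the on-device run log, then look for backend hints.
# -ngl offloads layers to the QNN device, as in the earlier commands in this thread.
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp/install-android/lib \
  ./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -ngl 99 -c 512 -p 'hi'" 2>&1 | tee run.log
# "qnn device name" and "offload" are guesses at the relevant log strings.
grep -iE 'qnn device name|offload' run.log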
@chraac Hi, I got some errors. I have followed your advice, but I still get errors. Can you give me some suggestions on how to debug and fix the issue? Thanks a lot.
D:\repo\platform-tools>adb shell "cd /data/local/tmp/qnn && LD_LIBRARY_PATH=/data/local/tmp/qnn/lib:/data/local/tmp/qnn/lib/aarch64-android:/data/local/tmp/qnn/install-android/lib /data/local/tmp/qnn/install-android/bin/llama-cli -m /data/local/tmp/fp16.gguf -ngl 8 -c 2048 -p 'hi'"
[qnn_init, 248]: device property is not supported
[qnn_init, 258]: device counts 1
@chraac Hello, do we have an easy way to build llama.cpp with the qnn-npu backend? Thanks a lot.
The qnn-gpu is working after I switched to a Qwen model; maybe it's a model issue.
Congrats,
Please have a look at this repo: llama-cpp-qnn-builder; its Docker image contains all the SDKs needed to build the QNN backend.
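For illustration, a hypothetical way to use such a builder image; the image name below is a placeholder and the real workflow should be taken from the llama-cpp-qnn-builder README:

# Hypothetical sketch: run the Android/QNN configure+build inside a container that
# already ships the Android NDK and QNN SDK. "llama-cpp-qnn-builder:local" is a
# placeholder image name, not confirmed by this thread.
docker run --rm -v "$(pwd)":/workspace -w /workspace \
  llama-cpp-qnn-builder:local \
  bash -c "cmake -B build-android -DGGML_QNN=ON && cmake --build build-android -j"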
@chraac Is it possible to build the llama.cpp QNN backend for a laptop? I have a Snapdragon X Elite laptop whose chip has an NPU. I checked the CMakeLists.txt in the ggml-qnn folder and found that the QNN backend build only supports Android devices.
Hi David, currently our QNN backend only supports Android devices. I understand there are Qualcomm devices that run Windows, and after reviewing the source code, I've identified some modifications needed for Windows support:
The inference speed on the CPU is optimized and very fast, so there is no noticeable difference even when using the GPU.
hmm, depends,
may I ask which specific model you're working with?
What do you mean by modules?
sry, typo, models
ReasoningCore-3B-T1_1.f16.gguf
Not tested on this model yet, but from my experience with llama3-3b, it looks like there aren't many mulmat ops that can be offloaded for an F16 model, because the convert op is not yet supported in the GPU backend. And from the benchmark here, convert on the NPU is terribly slow:
so.... hope Qualcomm can improve its perf someday
For sure, willing to help verify the functionality! I'm also deep-diving into llama.cpp QNN backend support, and I'm willing to help support more ops.
nice! created a new issue for it: #22
hi @akshatshah17, did you successfully run your model now? We've made many changes recently, please have another try!
Git commit
e36ad89
Operating systems
Linux
GGML backends
CPU
Problem description & steps to reproduce
I followed this procedure to build and convert the model into the quantized GGUF format, but while running the model on the device it is unable to load the model.
git clone https://github.com/chraac/llama.cpp.git --recursive
cd llama.cpp
git checkout dev-refactoring
export ANDROID_NDK=/home/code/Android/Ndk/android-ndk-r26d/
export QNN_SDK_PATH=/home/code/Android/qnn-sdk/qairt/2.27.5.241009/
Build for CPU
cmake -B build
cmake --build build --config Release -j16
Build for Android
cmake \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.7a" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a" \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DGGML_QNN=ON \
  -DGGML_QNN_DEFAULT_LIB_SEARCH_PATH=/data/local/tmp \
  -B build-android
cmake --build build-android --config Release -j4
cmake --install build-android --prefix install-android --config Release
Model conversion
python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_fp32.gguf --outtype f32
./build/bin/llama-quantize output_file_tiny_llama_fp32.gguf output_file_tiny_llama_Q4_K_M.gguf Q4_K_M
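A quick sanity check (not part of the original report) is to run the quantized file with the host CPU build first, to rule out a conversion problem before debugging the on-device QNN path:

# Hedged sketch: confirm the quantized GGUF loads and generates on the host CPU build.
# -n 16 keeps the generation short; the file name matches the quantize step above.
./build/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -p "hello" -n 16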
On S24 QC
adb push install-android/ /data/local/tmp/
adb push output_file_tiny_llama_Q4_K_M.gguf /data/local/tmp/
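One step the report does not show explicitly (so treat this as an assumption): since the build sets -DGGML_QNN_DEFAULT_LIB_SEARCH_PATH=/data/local/tmp, the QNN runtime libraries from the SDK also need to be on the device. A sketch of the pushes, with the Hexagon v75 skel/stub choice being a guess for the S24's Snapdragon 8 Gen 3:

# Hedged sketch: copy the QNN runtime libraries to the configured search path.
# Library names follow the Qualcomm AI Engine Direct SDK layout; v75 is an assumption,
# adjust to the SoC's Hexagon version if different.
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnSystem.so /data/local/tmp/
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnCpu.so /data/local/tmp/
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnGpu.so /data/local/tmp/
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnHtp.so /data/local/tmp/
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnHtpV75Stub.so /data/local/tmp/
adb push $QNN_SDK_PATH/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/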
export LD_LIBRARY_PATH=/data/local/tmp/install-android/lib/
./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -c 512 -p "prompt"
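For reference, the export/run pair above assumes an interactive adb shell session in /data/local/tmp; a single-command equivalent, modeled on the invocation used earlier in this thread, would look roughly like this:

# Hedged sketch: run from the host in one adb shell call instead of an interactive
# session, mirroring the LD_LIBRARY_PATH pattern shown earlier in the thread.
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp/install-android/lib \
  ./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -c 512 -p 'prompt'"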
First Bad Commit
No response
Relevant log output