
PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326

Open · wants to merge 67 commits into master

Conversation

@zhouwg (Contributor) commented Mar 11, 2025

  • I have read the contributing guidelines
  • Self-reported review complexity:
    * [ ] Low
    * [x] Medium
    * [ ] High
  • Testing Done
    * [x] test-backend-ops and llama-cli on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone

PR Description

This PR is a continuation of my original PR #6869 from 04/2024, focused on the final mission: how to utilize the Hexagon NPU maximally with the well-designed and highly compact ggml machine learning framework.

This is a concise ggml-hexagon implementation (the previous name was ggml-qnn, but that wasn't accurate):

  • it follows the principle of "simple is beautiful" from the Unix tradition: the code is simple and easy to understand quickly, without complex encapsulation. It is intended as a good reference implementation of ggml-hexagon that can be extended as needed.
  • it follows the principle of "make it run, then make it right, then make it fast" ("run" and "right" are already achieved at the moment).

Thanks to the major changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops):

  • the data path of the ggml-hexagon backend works as expected.
  • the official command line tools "test-backend-ops" and "llama-cli" have been verified on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone.
  • it works well for ASR inference via whisper.cpp and LLM inference via llama.cpp with a standard (self-made) Android app on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone.

This implementation puts the main logic in a single source file (ggml-qnn.cpp) because that makes it easier for other experienced programmers and experts to get involved in further development. Another reason for this coding style is that I think it simplifies the developers' workflow:

  • it is a self-contained single source file (I can split it into well-organized smaller source files in less than a day if there is a strong need, but I don't think that is the point at the moment: this self-contained single source file is similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, or what Intel did at the very beginning of ggml-sycl.cpp)
  • the idea is to first overcome all relevant technical issues/difficulties with a single specified op such as GGML_OP_ADD or GGML_OP_MUL_MAT
  • and then expand to other ggml ops accordingly, with teamwork from AI experts and programmers in this great pure-tech community (a minimal sketch of this op-gating workflow follows this list).
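
To make that "start from one op, then expand" workflow concrete, here is a minimal sketch (not the actual code of this PR; the function name and the type checks are illustrative assumptions) of the kind of op gate a ggml backend exposes to the backend scheduler:

    // sketch only: a ggml backend reports which ops it can offload.
    // starting with GGML_OP_ADD and GGML_OP_MUL_MAT keeps the initial bring-up small;
    // more cases are added once the corresponding kernels are verified with test-backend-ops.
    static bool ggmlqnn_supports_op(const struct ggml_tensor * op) {
        switch (op->op) {
            case GGML_OP_ADD:
            case GGML_OP_MUL_MAT:
                // only FP32 here; quantized types are enabled later
                return op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32;
            default:
                return false; // everything else falls back to the default CPU backend
        }
    }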

Features

  • the data path between the QNN SDK and ggml/llama.cpp works well; it was established through reverse engineering from executorch (the QNN implementation in executorch comes from Qualcomm) in my first PR on 04/2024

  • a simple and effective graph cache mechanism, already implemented on 04/2024

  • simple STL containers are used to manage QNN resources in this PR rather than complex C++ encapsulation, because the well-designed QNN SDK already manages its internal hardware and software resources very carefully

  • a simple skeleton in function ggmlqnn_compute_elementwise: offloads GGML_OP_ADD, GGML_OP_MUL, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_LOG and GGML_OP_SQRT to the QNN backend. This function is a very concise implementation rather than a complex C++ encapsulation that hides many technical details (a sketch of the op-name mapping it relies on appears after this feature list).

  • a complex skeleton in function ggml_qnn_mulmat: offloads GGML_OP_MUL_MAT (2D & 3D mulmat) to the QNN backend. This skeleton can be used to illustrate the second technical approach to "how to utilize the Hexagon NPU maximally". It is a concise implementation rather than a complex C++ encapsulation that hides many technical details.

  • a more complex skeleton in function ggml_qnn_mulmat_4d: offloads 4D mulmat to the QNN backend and illustrates the same approach. It is a concise implementation rather than a complex C++ encapsulation that hides technical details (UT passed, but there are some unknown bugs with test-backend-ops).

  • the QNN NPU RPC feature, already implemented on 04/2024 (UT passed, but some unknown bugs remain and can be seen in all hard-forked ggml-qnn projects)

  • dynamic runtime parameter adjustment through ggml-qnn.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).
    Screenshot from 2025-03-22 12-39-10

  • offloading of quantized data types with 2D & 3D mulmat to the QNN backend.

  • a big picture of the ggml-hexagon backend is provided in this PR for further or related dev activity in this great pure-tech community.

  • [03/19/2025] the technical approach of "mapping the entire ggml computational graph to a QNN graph", already explored on 04/2024, will appear in another standalone PR: a concise implementation of "mapping the entire ggml cgraph to a single QNN graph" (without complex/complicated encapsulation and hidden tech details), although this approach might not be the right solution in llama.cpp.

  • [03/22/2025] a very fast approach, closely similar to Intel's ggml-sycl, Qualcomm's ggml-opencl, or Huawei's ggml-cann: offload ggml ops to the Hexagon cDSP directly.

  • the code is simple and everyone can understand it easily and quickly, without complex encapsulation or hidden technical details, because layered abstraction and loose coupling make code tracking and troubleshooting harder.

Special clarification for this section:

  • all the original tech comes from Qualcomm; Qualcomm provides the fundamental mechanism and we programmers use it, regardless of coding style or tech approach.
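
As a rough illustration of the elementwise skeleton mentioned above (ggmlqnn_compute_elementwise), the mapping from a ggml op to the corresponding QNN op type can be as small as a lookup. This is a simplified sketch, not the exact code of the PR; the helper name is an assumption, and the exact QNN_OP_ELEMENT_WISE_* macro names should be checked against QnnOpDef.h in the installed QNN SDK:

    // sketch: translate a ggml elementwise op into the QNN op-type string that is
    // passed in Qnn_OpConfig_t when adding a node to a QNN graph.
    static const char * ggmlqnn_get_qnn_op_name(enum ggml_op op) {
        switch (op) {
            case GGML_OP_ADD:  return QNN_OP_ELEMENT_WISE_ADD;
            case GGML_OP_MUL:  return QNN_OP_ELEMENT_WISE_MULTIPLY;
            case GGML_OP_SUB:  return QNN_OP_ELEMENT_WISE_SUBTRACT;
            case GGML_OP_DIV:  return QNN_OP_ELEMENT_WISE_DIVIDE;
            case GGML_OP_LOG:  return QNN_OP_ELEMENT_WISE_LOG;
            case GGML_OP_SQRT: return QNN_OP_ELEMENT_WISE_SQUARE_ROOT;
            default:           return nullptr; // not handled by this skeleton
        }
    }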

Performance of ggml-hexagon backend

The test phone is a Snapdragon 8 Gen 3 Android phone and the test model is qwen1_5-1_8b-chat-q4_0.gguf. All fp32 and quantized mulmat ops are already offloaded to the QNN-NPU backend in this PR:

  1. general approach through QNN
    before tuning:
    default backend: 25.41 tokens / second
    Hexagon NPU backend: prompt eval 3.10 tokens / second, eval performance 11.47 tokens / second
ggml-qnn-perforamnce-default-and-npu.mp4

after tuning in the local dev env:
default backend: 43.47 tokens / second
Hexagon NPU backend: prompt eval 3.47 tokens / second, eval performance 22.03 tokens / second

ggml-qnn-performance-afterfinetune-default-and-npu.mp4
  2. special approach through QNN (mapping the entire cgraph to a single QNN graph; skipped because it is exactly equivalent to the above at the moment)
  3. general approach through the Hexagon cDSP, closely similar to Qualcomm's ggml-opencl or Intel's ggml-sycl
    there is an unknown issue with mulmat on the cDSP, so only GGML_OP_ADD is offloaded to the cDSP at the moment. this PR would complete its final mission if it can still achieve the same effect after offloading mulmat to the cDSP; I'm working on the mulmat kernel on the cDSP (a minimal sketch of such an add kernel follows this list).
ggmlqnn-hexagon-cdsp-only-add.mp4
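
For reference, the DSP-side kernel for the GGML_OP_ADD offload mentioned above can be as simple as a flat fp32 loop. The sketch below is illustrative only (the real entry point, its signature and any HVX vectorization live in libggmlop-skel.so and are not reproduced here):

    // sketch: a scalar fp32 element-wise add as it might appear on the Hexagon cDSP side.
    // a production kernel would use HVX intrinsics and honor ggml's row strides instead
    // of assuming contiguous data.
    int ggmlop_dsp_add_f32(const float * src0, const float * src1, float * dst, int n) {
        for (int i = 0; i < n; i++) {
            dst[i] = src0[i] + src1[i];
        }
        return 0;
    }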

Hexagon NPU performance with qwen1_5-1_8b-chat-q4_0.gguf on Snapdragon 8 Gen3

03/11/2025: prompt eval 3.10 tokens / second, eval performance 11.47 tokens / second
03/12/2025: prompt eval 3.47 tokens / second, eval performance 22.03 tokens / second
03/13/2025: prompt eval 4.34 tokens / second, eval performance 23.72 tokens / second

How to build the ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone

Ubuntu 20.04 / 22.04 is validated and recommended as the host machine (other Linux distributions might also work). The dev activity in this PR can be done purely on the command line without any IDE:

  • use build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically (see the section below)

  • you will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:

    SM8450 (Snapdragon 8 Gen 1+)
    SM8550 (Snapdragon 8 Gen 2)
    SM8650 (Snapdragon 8 Gen 3)
    SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)

  git clone https://github.com/kantv-ai/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream

 ./scripts/build-run-android.sh 
Usage:
  ./scripts/build-run-android.sh help
  ./scripts/build-run-android.sh print_oplist
  ./scripts/build-run-android.sh build
  ./scripts/build-run-android.sh updateqnnlib
  ./scripts/build-run-android.sh run_testops
  ./scripts/build-run-android.sh run_llamacli
  ./scripts/build-run-android.sh run_llamabench 

We can verify that this backend works as expected from the log output of "adb logcat | grep ggml-qnn". For programmers, "adb logcat | grep ggml-hexagon" can be used to help with troubleshooting.

How to build the ggml-hexagon source code for a Snapdragon-based WoA (Windows on ARM) device

The good news for the WoA port is:

  • a Snapdragon 8 Gen 2, 8 Gen 3 or 8 Elite (aka 8 Gen 4) equipped Android phone can be found or bought everywhere.
  • the WoA port can be done in another standalone PR by a skilled Windows programmer, because the well-designed Qualcomm QNN SDK and the source code of ggml/llama.cpp are both highly portable.

Big picture of ggml-hexagon backend

There are three technical approaches to Hexagon NPU inference on Qualcomm's latest mobile and desktop SoCs:

  • the general approach through QNN, similar to Intel's sycl, Qualcomm's opencl or Huawei's cann; this approach can be seen in this PR, and the Hexagon NPU performance through it is really bad.
  • the general approach through the Hexagon cDSP, closely similar to Intel's ggml-sycl, Qualcomm's ggml-opencl or Huawei's ggml-cann; this approach can also be seen in this PR.
  • the special approach through QNN: mapping the entire ggml computational graph to a single QNN graph. this approach will appear in another standalone PR. the Hexagon NPU performance with this approach is also not good, and I don't know why at the moment, because "mapping the entire computational graph to a single QNN graph" might be (or should be, or just is) Qualcomm's key point of "utilize the Hexagon NPU maximally through the QNN-NPU backend" in all of Qualcomm's AI software stacks:

qualcomm-qnn-sdk

  • [updated on 03/20/2025]: after a senior staff technical expert from Qualcomm told me on 03/18/2025 that "QNN is not the right solution here" (very valuable advice), I thought about it for many hours and now believe I see a third tech approach to "utilize the Hexagon NPU maximally". this PR can still be approved regardless of that third approach, because I'll try to implement it on top of this PR: most of the code in this PR will be reused, and the efforts on the first and second approaches are also meaningful, since they are necessary exploratory steps before completing the final mission. if my guess can be confirmed by the senior staff technical expert at Qualcomm, then I think I know how to do that so-called third approach, and I think I completely understand why there is currently such a large performance difference between ggml-hexagon and Intel's ggml-sycl or Huawei's ggml-cann.
  • [updated on 03/22/2025]: the general approach through the Hexagon cDSP, closely similar to Qualcomm's ggml-opencl or Intel's ggml-sycl, can now be seen in this PR.

Key points about ggml-hexagon's performance in the general approach:

  • the load/performance loss of data transfer between the AP (Arm CPU) and the NPU (DSP), i.e. the cost of moving data between the main CPU and the NPU. this part requires redesigning the data structures in the ggml-hexagon implementation (in other words, a shared buffer or memory pool should be used), placing all tensor data entirely in the DSP's device memory to minimize data copying or, ideally, achieve zero-copy (a hedged sketch of this idea follows this list).
  • various tricks with Qualcomm's QNN SDK in the approach through QNN; this is a well-designed SDK, but at the same time I personally think its usage is really not easy.
  • various tricks with Qualcomm's Hexagon cDSP in the approach through the Hexagon cDSP; this is a straightforward way and closely similar to Intel's ggml-sycl.
  • in the general approach through the Hexagon cDSP we need to write some "Hexagon kernels", similar to Qualcomm's ggml-opencl (OpenCL kernels) or other backends (CUDA kernels), and closely analogous to a TEE TA & CA pair: the Hexagon kernels (libggmlop-skel.so) play the role of a TEE TA on the TEE OS, and ggml-hexagon plays the role of a TEE CA on the AP (Arm CPU); the difference is that the TEE GP API is an international software standard whereas this is Qualcomm's dedicated software stack.
  • only some ops are generally critical to inference performance; in the general approach (through QNN or through the Hexagon cDSP) we only need to implement those performance-sensitive ggml ops, and AI experts must be involved in the remaining parts of ggml-hexagon.
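
To illustrate the shared-buffer/zero-copy point above: the Hexagon SDK's rpcmem allocator is the usual way to obtain memory that both the AP and the cDSP can see without an extra copy. The following is a generic sketch of that idea, not the code of this PR (the wrapper names are assumptions; rpcmem_alloc/rpcmem_free and the RPCMEM_* constants come from rpcmem.h in the Hexagon SDK):

    #include "rpcmem.h"   // Hexagon SDK

    // sketch: allocate tensor storage from ION/DMA-BUF backed memory so the cDSP can
    // access it directly, minimizing AP<->NPU data copies (ideally zero-copy).
    static void * ggmlhexagon_alloc_shared(size_t size) {
        return rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) size);
    }

    static void ggmlhexagon_free_shared(void * ptr) {
        rpcmem_free(ptr);
    }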

Key points about ggml-hexagon's performance in the special approach:

  • Qualcomm provides dedicated binary tools for LLM model conversion, which is exactly the hard part of this approach: composing an ideal QNN graph from the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph. This is the most important key point in this approach.
  • we must implement ALL ggml ops in this approach, and the general approach through QNN is the essential foundation of this approach.

How to compose an ideal single QNN graph from a ggml cgraph

As is well known, llama.cpp composes a complete ggml cgraph from any supported LLM model through the excellent ggml machine learning library.

As explained in the section "Big picture of ggml-hexagon backend", the key point of the so-called second technical approach in the ggml-hexagon backend is composing an ideal single QNN graph from an entire/complete ggml cgraph.

In this section, I'll illustrate in detail how to compose an ideal single QNN graph from a ggml cgraph that contains multiple nodes and was generated by my self-made tool. This idea can be extended to composing an ideal single QNN graph from the complete ggml cgraph of a specified LLM model. I personally think computer scientists or AI experts must be involved in the rest of this work, because it probably involves complex graph theory and algorithms; moreover, my self-made graph algorithm seems too naive and low-performance, although the compute result of the single QNN graph is correct for 15+ cases, so the original author of llama.cpp or AI experts could help optimize the graph algorithm here. By the way, the approach in this section is just my way of thinking, already explained in the section "PR Description": try to overcome all relevant technical issues/difficulties with a specified op such as GGML_OP_ADD or GGML_OP_MUL_MAT.

  • case 1: compose an ideal single QNN graph from a ggml cgraph that contains multiple GGML_OP_ADD nodes
    • construct a dedicated ggml cgraph manually
     dst0 = ggml_add(ctx, src0, src1);   // first add node
     dst1 = ggml_add(ctx, src3, src4);   // second add node
     dst2 = ggml_add(ctx, dst0, dst1);   // final add node consuming the two intermediate results
    
    • convert to a single QNN graph manually
         p_tensor0 = ggmlqnn_create_compute_tensor(instance, graph_handle, src0,
                                                   QNN_TENSOR_TYPE_APP_WRITE);
         p_tensor1 = ggmlqnn_create_compute_tensor(instance, graph_handle, src1,
                                                   QNN_TENSOR_TYPE_APP_WRITE);
         p_tensor2 = ggmlqnn_create_compute_tensor(instance, graph_handle, dst0,
                                                   QNN_TENSOR_TYPE_APP_READ);
         Qnn_Tensor_t tensor_inputs0[] = {
                 *p_tensor0,
                 *p_tensor1
         };
         Qnn_Tensor_t tensor_outputs0[] = {
                 *p_tensor2
         };
         Qnn_OpConfig_t op_config0 = {
                 QNN_OPCONFIG_VERSION_1, {
                         "add1",
                         QNN_OP_PACKAGE_NAME_QTI_AISW,
                         QNN_OP_ELEMENT_WISE_ADD,
                         0,
                         nullptr,
                         2,
                         tensor_inputs0,
                         1,
                         tensor_outputs0
                 }
         };
         CHECK_QNN_API(qnn_error, qnn_raw_interface.graphAddNode(graph_handle, op_config0));
         QNN_VER_PTR(*p_tensor0)->clientBuf = {src0->data, ggmlqnn_get_tensor_data_size(src0)};
         QNN_VER_PTR(*p_tensor1)->clientBuf = {src1->data, ggmlqnn_get_tensor_data_size(src1)};
         QNN_VER_PTR(*p_tensor2)->clientBuf = {dst0->data, ggmlqnn_get_tensor_data_size(dst0)};
    
         p_tensor3 = ggmlqnn_create_compute_tensor(instance, graph_handle, src3,
                                                   QNN_TENSOR_TYPE_APP_WRITE);
         p_tensor4 = ggmlqnn_create_compute_tensor(instance, graph_handle, src4,
                                                   QNN_TENSOR_TYPE_APP_WRITE);
         p_tensor5 = ggmlqnn_create_compute_tensor(instance, graph_handle, dst1,
                                                   QNN_TENSOR_TYPE_APP_READ);
         Qnn_Tensor_t tensor_inputs1[] = {
                 *p_tensor3,
                 *p_tensor4
         };
         Qnn_Tensor_t tensor_outputs1[] = {
                 *p_tensor5
         };
         Qnn_OpConfig_t op_config1 = {
                 QNN_OPCONFIG_VERSION_1, {
                         "add2",
                         QNN_OP_PACKAGE_NAME_QTI_AISW,
                         QNN_OP_ELEMENT_WISE_ADD,
                         0,
                         nullptr,
                         2,
                         tensor_inputs1,
                         1,
                         tensor_outputs1
                 }
         };
         CHECK_QNN_API(qnn_error, qnn_raw_interface.graphAddNode(graph_handle, op_config1));
         QNN_VER_PTR(*p_tensor3)->clientBuf = {src3->data, ggmlqnn_get_tensor_data_size(src3)};
         QNN_VER_PTR(*p_tensor4)->clientBuf = {src4->data, ggmlqnn_get_tensor_data_size(src4)};
         QNN_VER_PTR(*p_tensor5)->clientBuf = {dst1->data, ggmlqnn_get_tensor_data_size(dst1)};
    
    
         p_tensor6 = ggmlqnn_create_compute_tensor(instance, graph_handle, dst2,
                                                   QNN_TENSOR_TYPE_APP_READ);
         Qnn_Tensor_t tensor_inputs2[] = {
                 *p_tensor2,
                 *p_tensor5
         };
         Qnn_Tensor_t tensor_outputs2[] = {
                 *p_tensor6
         };
         Qnn_OpConfig_t op_config2 = {
                 QNN_OPCONFIG_VERSION_1, {
                         "add3",
                         QNN_OP_PACKAGE_NAME_QTI_AISW,
                         QNN_OP_ELEMENT_WISE_ADD,
                         0,
                         nullptr,
                         2,
                         tensor_inputs2,
                         1,
                         tensor_outputs2
                 }
         };
         CHECK_QNN_API(qnn_error, qnn_raw_interface.graphAddNode(graph_handle, op_config2));
         QNN_VER_PTR(*p_tensor6)->clientBuf = {dst2->data, ggmlqnn_get_tensor_data_size(dst2)};
         CHECK_QNN_API(qnn_error, qnn_raw_interface.graphFinalize(graph_handle, nullptr, nullptr));
    
         Qnn_Tensor_t real_tensor_inputs[] = {
                 *p_tensor0,
                 *p_tensor1,
                 *p_tensor3,
                 *p_tensor4,
         };
         Qnn_Tensor_t real_tensor_outputs[] = {
                 *p_tensor2,
                 *p_tensor5,
                 *p_tensor6
         };
         CHECK_QNN_API(qnn_error, qnn_raw_interface.graphExecute(graph_handle,
                                                                 real_tensor_inputs, 4,
                                                                 real_tensor_outputs, 3,
                                                                 nullptr, nullptr));
    

The key point in this simple case is converting a multi-node ggml cgraph to a QNN graph correctly, as in the quoted code (graph -> graph). The principle of case 1 should be a general approach for composing a single QNN graph from a ggml cgraph that contains multiple nodes (I'm not a tech expert at Qualcomm, so I say "should be" here; by the way, I think this principle should also be applicable to Intel's ggml-sycl or Huawei's ggml-cann).
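
Generalizing case 1, the composition step is essentially one pass over the cgraph nodes, since a ggml cgraph is already topologically sorted (producers come before consumers). The following is only a sketch of how I think about that pass; ggmlqnn_get_or_create_tensor, ggmlqnn_add_node and ggmlqnn_op_is_supported are hypothetical helpers, not functions of this PR:

    // sketch: compose one QNN graph from a multi-node ggml cgraph
    static bool ggmlqnn_compose_graph(Qnn_GraphHandle_t graph_handle, const struct ggml_cgraph * cgraph) {
        for (int i = 0; i < cgraph->n_nodes; i++) {
            struct ggml_tensor * node = cgraph->nodes[i];
            if (!ggmlqnn_op_is_supported(node->op)) {
                return false; // every node must be expressible as a QNN op
            }
            // reuse the QNN tensor of a producer node if it was already created;
            // otherwise create one (APP_WRITE for graph inputs, NATIVE for intermediates,
            // APP_READ for graph outputs)
            Qnn_Tensor_t * inputs[GGML_MAX_SRC] = {};
            int n_inputs = 0;
            for (int j = 0; j < GGML_MAX_SRC && node->src[j]; j++) {
                inputs[n_inputs++] = ggmlqnn_get_or_create_tensor(graph_handle, node->src[j]);
            }
            Qnn_Tensor_t * output = ggmlqnn_get_or_create_tensor(graph_handle, node);
            ggmlqnn_add_node(graph_handle, node->op, inputs, n_inputs, output);
        }
        // finalize once, then execute with only the true graph inputs/outputs bound to host buffers
        return true;
    }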

  • case 2: compose an ideal single QNN graph from a ggml cgraph that contains multiple GGML_OP_ADD, GGML_OP_MUL and GGML_OP_MUL_MAT nodes

    • construct a dedicated ggml cgraph manually (similar to what the AI experts did in llama.cpp)
  dst0 = ggml_mul_mat(ctx, src0, src1);
  dst1 = ggml_add(ctx, src2, src3);
  dst2 = ggml_mul(ctx, src4, src5);
  dst  = ggml_mul_mat(ctx, dst0, dst1);
  dst  = ggml_add(ctx, dst, dst2);

  • run this ggml cgraph with the default ggml backend for the purpose of cross-validation
    Screenshot from 2025-03-17 18-33-53

    • run this ggml cgraph with the QNN-CPU backend (equivalent to the QNN-NPU backend)
      Screenshot from 2025-03-17 18-31-01
    • observe the logs from adb logcat
      Screenshot from 2025-03-17 18-32-02

The above test case is still not a very complex case (compared to a real LLM model), so the current graph algorithm can cover it, and therefore the calculation results of the QNN-CPU/QNN-NPU backends are correct.

All in all, we now know how to map a multi-node ggml cgraph to a single QNN graph (I personally think the original authors of llama.cpp or other AI experts will understand the core principle of this section very quickly). There are various scenarios/combinations (graph DAGs) in the general case (for example, mapping an entire ggml cgraph to a single QNN graph), and the compute result depends on the graph algorithm: the QNN backend's calculation result will be wrong if the graph algorithm cannot cover the specified graph DAG. This is why I personally think computer scientists or AI experts must be involved in this step to design a high-performance graph algorithm.
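
One concrete sub-problem of that graph algorithm is classifying tensors: a tensor produced inside the cgraph and consumed only by later nodes of the same graph can stay QNN_TENSOR_TYPE_NATIVE (no host round-trip), while a tensor that nothing else consumes must be exposed as QNN_TENSOR_TYPE_APP_READ. A naive classification, sketched here only to show the idea (not code from this PR), could look like this:

    // sketch: decide whether a node's output is internal to the QNN graph or is a
    // graph output that must be readable from the host side after graphExecute.
    static bool ggmlqnn_is_internal_output(const struct ggml_cgraph * cgraph, int node_idx) {
        const struct ggml_tensor * t = cgraph->nodes[node_idx];
        for (int i = node_idx + 1; i < cgraph->n_nodes; i++) {
            for (int j = 0; j < GGML_MAX_SRC; j++) {
                if (cgraph->nodes[i]->src[j] == t) {
                    return true;  // consumed later inside the same graph -> keep it NATIVE
                }
            }
        }
        return false;             // no consumer -> expose it as APP_READ
    }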

Acknowledgement

  1. the implementation through QNN is mainly ported/reverse-engineered from executorch (the QNN implementation in executorch comes from Qualcomm). all the original techs on this topic come from Qualcomm.
  2. I got breakthrough help from chiwwang@Qualcomm Technologies Inc on 04/2024.
  3. I also got meaningful help from XiaoMi-StableDiffusionOnDevice on 05/2024.
  4. thanks to the CN programmer chraac's team, from whose implementation I borrowed 7-8 functions; one of these functions is very helpful, although it could also have been used directly in the original PR (04/2024 or 05/2024, before I disappeared from this great tech community last year) rather than through complex encapsulation.
  5. thanks to this post: Bug: MinGW build fails to load models with "error loading model: PrefetchVirtualMemory unavailable" #9311; the original tech of the customized/dedicated toolchain llvm-mingw-20250305-ggml-ucrt-x86_64.zip comes from https://github.com/mstorsjo/llvm-mingw/releases; git.exe, cmake.exe and ninja.exe come from MS's VS2022. the purpose of this customized toolchain is to make the workflow easy for Linux programmers.
  6. thanks for the kind help during my difficult effort on the WoA (Windows on ARM) build from @ejrydhfs and the senior staff tech leader & engineer @AndreasKunar; I also got meaningful help from https://github.com/Windows-on-ARM-Experiments/mingw-woarm64-build, which comes from some excellent MS compiler & toolchain engineers.
  7. thanks to the outstanding maintainers & original authors of llama.cpp; I learnt so much from their code.
  8. thanks to @zhuipiaochen, whose casual kind help allowed me to complete the final puzzle, so that I now have a clearer understanding of the implementation through QNN for ggml-hexagon.
  9. thanks so much to the senior staff technical expert @max-krasnyansky from Qualcomm HQ, who gave important/valuable/breakthrough guidance on direction.
  10. recently I tried AI-assisted programming for ggml-hexagon with help from the powerful Grok 3; it really helped me a lot in this PR.

Conclusion

After spending a great deal of effort on the ggml-hexagon backend, I personally think:

  • AI experts must be involved in the remaining parts of ggml-hexagon, regardless of coding style or tech approach.
  • a fully open-source implementation of the ggml-hexagon backend will probably require teamwork between experienced programmers and AI experts, and even professional technical help from Qualcomm, regardless of coding style or tech approach.
  • the technical difficulties and performance issues would be exactly the same as in this PR even with complicated and cool C++ encapsulation, because all tech approaches can be implemented in this PR or on top of this PR.
  • it's a real functional PR: it passes test-backend-ops and it can do LLM inference with the Hexagon NPU on a Snapdragon 8 Gen 3 phone; the Hexagon NPU performance is good (DSP optimization is not easy/quick work), but it lacks other ops (the other necessary ggml ops can be added by AI experts); it can be a good starting point for ggml-hexagon in this community.
  • I hope Qualcomm's experts can support my third formal PR if they don't have plans to release an official ggml-hexagon PR in the future, so that AI experts and other developers around the world have a chance to engage with it and improve its performance and other aspects:
    • we can see that ggml-sycl, ggml-cann and ggml-opencl are still under refactoring or improvement, and they all come from top world-class IT companies with many domain tech experts. in other words, there may be no perfect PR.
    • this PR's style is the same as the original ggml/llama.cpp: simple is beautiful and without complex/complicated encapsulation, even though the core maintainers are genius programmers and C++ masters.
    • I think some design tricks from FFmpeg or GStreamer might be (or already are) used in GGML's backend subsystem: there is more than one backend implementation for the same hardware accelerator.

Screenshot from 2025-03-19 08-39-58

  • the general approach through the Hexagon cDSP (similar to Intel's ggml-sycl, Qualcomm's ggml-opencl or Huawei's ggml-cann) SHOULD BE the P0 teamwork task (the other performance-sensitive ggml op functions must be implemented as teamwork between AI experts and experienced programmers).
  • Intel's ggml-sycl might be fully utilized for Qualcomm's Hexagon NPU if Qualcomm's engineering team does some adaptation work (this is really not easy, so the effort in this PR is meaningful). what Intel's sycl aims to provide is a uniform software stack/framework for heterogeneous multi-core computing in embedded and desktop systems. this is another key point I arrived at after finishing the general approach to "how to utilize the Hexagon NPU maximally" through the Hexagon cDSP.

@github-actions github-actions bot added build Compilation issues script Script related ggml changes relating to the ggml tensor library for machine learning labels Mar 11, 2025
@github-actions github-actions bot added the testing Everything test related label Mar 11, 2025
@zhouwg zhouwg force-pushed the pr_to_upstream branch 5 times, most recently from 3ef106e to db890cc on March 11, 2025 09:09
@zhouwg (Contributor, Author) commented Mar 11, 2025

why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend

  1. general approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's QNN Sample
    https://github.com/kantv-ai/kantv/blob/01a49bb2d3dc963d920ce7b6827524c8e6c0a70a/core/ggml/qnnsample/QnnSampleMain.cpp#L434-L484
    we can clearly see that there is a prebuilt binary module file (xxxxx.so) generated by Qualcomm's dedicated tool (they call it "qnn-pytorch-converter", "qnn-tensorflow-converter" or "qnn-tflite-converter"); this binary module file is built from a complicated C++ source file which is also generated by Qualcomm's dedicated tool. an example of this very complicated C++ source file is the following:
    https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp

The key function in this complicated C++ source file is:
https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp#L20634

We can clearly see that an ideal/expected QNN graph, i.e. a single QNN graph with very many graph nodes, would be generated/composed in this function. Then we can understand that the code in QnnSampleMain.cpp is just routine work or skeleton code. In this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.

  2. approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's Genie (Generative AI Inference Extensions) software stack from Qualcomm's latest QNN SDK (2.32.0.250228, as of 03/11/2025)

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html

Screenshot from 2025-02-28 19-05-50

Screenshot from 2025-03-12 08-10-17

Screenshot from 2025-03-12 08-21-17

We can clearly see, after tracking all the relevant code in the QNN SDK, that the core process of offloading inference to the NPU (HTP) backend is 90-99% the same as the general approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in Qualcomm's QNN Sample. In this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.

  3. approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in XiaoMi StableDiffusionOnDevice

We can clearly see that a customized model, trained and provided by XiaoMi's AI team, is used as a binary model in this open-source project: they claimed they got a 10x performance gain with NPU inference. At the same time, after tracking the code carefully, we can see that the main logic of this open-source project is 90% the same as Qualcomm's QNN Sample, but we still don't know how that single QNN graph was generated. What should we think at the moment?

  4. approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in PowerInfer

This open-source project comes from a famous top Chinese university and can be considered a derived or highly customized project of llama.cpp. One of the highlights of this derived project is that the R&D developers implemented a closed-source QNN backend. Recently I found a highly related project on GitHub with help from a programmer unknown to me, @zhuipiaochen. After tracking the code carefully, we can clearly see that the approach of "utilize the Hexagon NPU maximally in QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie or in Qualcomm's QNN Sample:

The last 3 steps are exactly similar to offloading 2D/3D matrix multiplication to the QNN backend in this PR. The difference between these two scenarios is that there are only 2 QNN graph nodes in the QNN graph of 2D/3D mulmat on the QNN backend. In this case, we still don't know how the single QNN graph was generated. What should we think at the moment?

  5. inference procedure in the existing implementation of llama.cpp
    Screenshot from 2025-03-11 18-30-59
    we can clearly see that the inference procedure in ggml-sycl is a typical skeleton of all existing ggml backends. accordingly, there is a similar code snippet in this PR (ggml-qnn backend):
    Screenshot from 2025-03-11 18-34-46
    Screenshot from 2025-03-11 19-44-17

OK, let me do an interesting experiment with the ggml-qnn backend in this PR:

  • uncomment line 3665 and line 3666 in function ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)
  • modify the configuration file <llama.cpp path>/scripts/ggml-qnn.cfg from
    Screenshot from 2025-03-11 18-41-35
    to
    Screenshot from 2025-03-11 18-42-46
  • run the script ./scripts/build-run-android.sh run_llamacli 2 accordingly (this command launches LLM inference on the QNN NPU backend)

What can we see from the logs of adb logcat?

We can clearly see that there is no entire/complete GGML graph in this function:
Screenshot from 2025-03-11 19-44-17

Accordingly, the logic or inference procedure in this function is exactly the same as the original/general approach in all ggml backends. This is a limitation of the existing implementation of the inference procedure/architecture in llama.cpp (a small probe sketch follows).
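
What this experiment checks can also be reproduced with a one-line probe in the backend's graph_compute callback: log how many nodes arrive per call. With the backend scheduler active, those counts correspond to the splits assigned to this backend, not to the complete cgraph of the model. A hedged sketch (the log macro name is illustrative, not necessarily the one used in this PR):

    // sketch: probe how much of the model's cgraph actually reaches this backend per call
    static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
        // with ggml_backend_sched, this is a split assigned to the QNN backend,
        // not the entire cgraph of the model
        GGMLQNN_LOG_DEBUG("graph_compute: %d nodes in this split", cgraph->n_nodes);
        // ... existing per-node offload logic ...
        return GGML_STATUS_SUCCESS;
    }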

Conclusion:

  • there is NO second technical approach in the ggml-qnn backend because of the limitation of the existing implementation of llama.cpp.
  • the tech approach in this PR is the general approach and step used in all ggml backends, regardless of coding style.

[updated at 21:56, 03/12/2025] The conclusion here is incorrect because the analysis in case 5 is WRONG. The first tech approach in this PR is still meaningful (because all op functions can be used in the second tech approach after some minor adjustment), and the second tech approach should be finished in this PR or another similar PR. However, the analysis in cases 1/2/3/4 is completely correct and the logic in this tech doc holds: Qualcomm provides dedicated binary tools for LLM model conversion, which is exactly the hard part (composing an ideal QNN graph from the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph) in the second tech approach of the ggml-qnn backend. The second tech approach could also be implemented in this PR, but I think I can't completely finish it because of my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the remaining parts of ggml-qnn. So, good luck to other similar PRs.

I made a wrong analysis in step 5 and a misunderstanding in #12342, which was already explained by slaren; the root cause of these two stupid mistakes is that I have very limited knowledge about real hard-core AI tech.

@Dampfinchen

Nice job. NPU support is huge for this project. Do you think its also possible to make it work on Exynos 2200 and 2400 NPUs?

@zhouwg (Contributor, Author) commented Mar 12, 2025

Nice job. NPU support is huge for this project. Do you think its also possible to make it work on Exynos 2200 and 2400 NPUs?

Thanks for your kind comment.

  1. Qualcomm's Hexagon NPU support is really a huge amount of work for this project, even though we now clearly understand the principle, because Qualcomm provides dedicated binary tools for LLM model conversion in their AI software stacks, and some other closed-source implementations use exactly this approach. So programmers must compose an ideal QNN graph from the complete ggml cgraph manually in the ggml-qnn backend if they choose the second tech approach ("mapping the complete ggml cgraph to a single QNN graph"). There are 800+ cgraph nodes and 50+ ops in qwen1_5-1_8b-chat-q4_0.gguf; accordingly, "(Hexagon) NPU support is huge for this project", and real AI experts must be involved in the remaining parts of ggml-qnn.
  2. I think I can make it (ggml-exynos or ggml-samsung) work on an Exynos 2200 if I can get a suitable phone (I can try to buy one) and the SDK & tech docs (this might not be easy because of strict IPR policies in some big IT companies, as far as I understand at the moment), following the principle "make it run, then make it right, and finally make it fast"; this is one of my areas of expertise.

zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 0065122 to 1f702df on March 16, 2025 08:12
zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 1e98561 to e4b0d8c on March 16, 2025 09:51
@zhouwg zhouwg force-pushed the pr_to_upstream branch 3 times, most recently from 967be44 to a26806a on March 18, 2025 03:34
zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 7 times, most recently from e662f59 to f8d56f4 on March 22, 2025 16:35