PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326
base: master
Conversation
why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend
the key function in this complicated C++ source file shows clearly that an ideal or expected QNN graph, a single QNN graph with many graph nodes, is generated/composed in this function. we can then understand that the code in QnnSampleMain.cpp is just routine or skeleton code. in this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html after tracking all the relevant code in the QNN SDK, we can clearly see that the core process of offloading inference to the NPU (HTP) backend is 90%-99% the same as the general approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN Sample. in this case, we clearly know the single QNN graph was generated by Qualcomm's dedicated tool.
we can clearly see a customized model which was trained and provided by XiaoMi's AI team, and this customized binary model is used in that open-source project: they claimed they got a 10x performance gain with NPU inference. at the same time, after tracking the code carefully, we can clearly see that the main logic of that open-source project is 90% the same as Qualcomm's QNN Sample, but we still don't know how that single QNN graph was generated. what should we think at the moment?
this open-source project comes from a famous top university in China and can be considered a derived or highly customized project of llama.cpp. one of the highlights of this derived project is that the R&D developers implemented a closed-source QNN backend. recently I found a highly related project on GitHub with help from a programmer unknown to me, @zhuipiaochen. after tracking the code carefully, we can clearly see the approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie, or 90% the same as the approach in Qualcomm's QNN Sample:
the last 3 steps are exactly similar to offloading 2D/3D matrix multiplication to the QNN backend in this PR. the difference between these two scenarios is that there are only 2 QNN graph nodes in the QNN graph of 2D/3D mulmat on the QNN backend. in this case, we still don't know how the single QNN graph was generated. what should we think at the moment?
ok, let me do an interesting experiment with the ggml-qnn backend in this PR:
what can we see from the logs of adb logcat? we can clearly see that there is no entire or complete GGML graph in this function: accordingly, the logic or inference procedure in this function is exactly the same as the original/general approach in all ggml backends (see the sketch after the conclusion below). this is a limitation of the existing inference procedure/architecture in llama.cpp. conclusion:
[updated on 21:56, 03/12/2025] the conclusion here is incorrect because the analysis in case 5 is WRONG. the first tech approach in this PR is still meaningful (because all op functions can be reused in the second tech approach after some minor adjustment), and the second tech approach should be finished in this PR or another similar PR; but the analysis in cases 1/2/3/4 is completely correct and the logic in this tech doc holds: Qualcomm provides dedicated binary tools to do LLM model conversion, which is exactly the hard work (composing an ideal QNN graph from the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph) in the second tech approach of the ggml-qnn backend. the second tech approach could also be implemented in this PR, but I don't think I can completely finish it because of my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the rest of ggml-qnn. so, good luck to other similar PRs. I made a wrong analysis in step 5, and a misunderstanding in #12342 which slaren already explained; the root cause of these two stupid mistakes is that I have very limited knowledge of real hard-core AI tech.
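to make the "no complete GGML graph" point from the experiment concrete, here is a minimal sketch, assuming the usual ggml-backend interface (not the exact code in this PR): the backend scheduler hands each backend only a split subgraph, never the complete cgraph of the model, so the backend's compute function can only dispatch one op at a time.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// a hedged sketch of a backend's graph_compute callback: cgraph here is a
// *split* produced by the backend scheduler, not the model's entire cgraph
static enum ggml_status ggml_backend_qnn_graph_compute(ggml_backend_t backend,
                                                       struct ggml_cgraph * cgraph) {
    GGML_UNUSED(backend);
    // only the ops assigned to this backend for this split, already in
    // topological order
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        switch (node->op) {
            case GGML_OP_ADD:     /* offload this single ADD to QNN */      break;
            case GGML_OP_MUL_MAT: /* offload this single MUL_MAT to QNN */  break;
            default:              /* unsupported ops were filtered out by
                                     the backend's supports_op earlier */    break;
        }
    }
    return GGML_STATUS_SUCCESS;
}
```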
Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?
thanks for your kind comment.
* [ ] Low
* [x] Medium
* [ ] High
* [x] test-backend-ops and llama-cli on a Qualcomm Snapdragon 8 Gen3 equipped Android phone

PR Description
this PR is a continuation of my original PR #6869 from 04/2024, focused on the final mission: how to utilize the Hexagon NPU maximally with the highly well-designed and highly compact ggml machine learning framework.
this is a concise ggml-hexagon (the previous name was ggml-qnn, but that wasn't accurate) implementation:
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature and of test-backend-ops), this implementation puts the main logic in one single source file (ggml-qnn.cpp), because that helps other experienced programmers and experts get involved in further dev activity. another reason for this coding style is that I think it makes the developers' workflow easier:
Features
the data path between the QNN SDK and ggml/llama.cpp works well; it was reverse-engineered from executorch (the QNN implementation in executorch comes from Qualcomm) in my first PR on 04/2024
a simple and effective graph cache mechanism, already implemented on 04/2024 (see the sketch below)
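a minimal sketch of the graph cache idea (not the exact code in this PR): key the QNN graph by op name and tensor shape so the expensive compose/finalize step runs only once; `void *` stands in for the real QNN graph handle type.

```cpp
#include "ggml.h"

#include <cstdio>
#include <map>
#include <string>

// one cache entry per composed-and-finalized QNN graph
static std::map<std::string, void *> g_qnn_graph_cache;

// op name + tensor shape uniquely identify a composed QNN graph, so a graph
// is composed/finalized on the first call and replayed on subsequent calls
static std::string ggmlqnn_graph_key(const struct ggml_tensor * op) {
    char key[256];
    snprintf(key, sizeof(key), "%s_%d_%d_%d_%d", ggml_op_name(op->op),
             (int) op->ne[0], (int) op->ne[1], (int) op->ne[2], (int) op->ne[3]);
    return key;
}
```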
use simple STL containers to manage QNN resources in this PR rather than complex C++ encapsulation, because the highly well-designed QNN SDK already manages its internal hardware and software resources very carefully
a simple skeleton in function ggmlqnn_compute_elementwise: offload GGML_OP_ADD, GGML_OP_MUL, GGML_OP_SUB, GGML_OP_DIV, GGML_OP_LOG and GGML_OP_SQRT to the QNN backend. this function is a very concise implementation rather than a complex C++ encapsulation that hides many tech details (see the sketch below)
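a hedged sketch of the flow in ggmlqnn_compute_elementwise; the qnn_* helpers below are hypothetical stand-ins for the QNN SDK plumbing in this PR, shown only to illustrate the shape of the skeleton: one ggml op maps to one tiny QNN graph with a single op node.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// hypothetical helpers standing in for the QNN SDK plumbing in this PR
void * qnn_graph_cache_lookup(const struct ggml_tensor * dst);
void   qnn_graph_cache_insert(const struct ggml_tensor * dst, void * graph);
void * qnn_graph_create_single_op(ggml_backend_t backend, const struct ggml_tensor * dst);
void   qnn_graph_bind_and_execute(void * graph, const struct ggml_tensor * src0,
                                  const struct ggml_tensor * src1, struct ggml_tensor * dst);

// the skeleton: look up / compose a single-node QNN graph, bind the ggml
// tensor buffers, execute on the selected QNN backend (CPU/GPU/NPU)
static void ggmlqnn_compute_elementwise(ggml_backend_t backend, struct ggml_tensor * dst) {
    const struct ggml_tensor * src0 = dst->src[0];
    const struct ggml_tensor * src1 = dst->src[1]; // NULL for unary ops such as GGML_OP_LOG

    void * graph = qnn_graph_cache_lookup(dst);
    if (graph == NULL) {
        // compose + finalize once, reuse afterwards (the cache mechanism above)
        graph = qnn_graph_create_single_op(backend, dst);
        qnn_graph_cache_insert(dst, graph);
    }
    qnn_graph_bind_and_execute(graph, src0, src1, dst);
}
```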
a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT (2d & 3d mulmat) to the QNN backend. this skeleton can be used to illustrate the second technical approach of "how to utilize the Hexagon NPU maximally". again, it is a concise implementation rather than a complex C++ encapsulation that hides many tech details (see the shape-convention check below)
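a small, self-contained check of the ggml mulmat shape convention that any QNN graph composition must respect (dst = src1 · src0ᵀ, which is why the composed QNN mulmat graph needs a transpose step in addition to the matmul itself); this uses only plain ggml API, nothing QNN-specific.

```cpp
#include "ggml.h"

#include <cassert>

static void check_mulmat_shapes(void) {
    struct ggml_init_params params = { /*.mem_size =*/ 16*1024*1024,
                                       /*.mem_buffer =*/ NULL,
                                       /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);

    const int K = 64, M = 16, N = 8;
    struct ggml_tensor * src0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, M); // K columns, M rows
    struct ggml_tensor * src1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, K, N); // K columns, N rows
    struct ggml_tensor * dst  = ggml_mul_mat(ctx, src0, src1);

    // ggml convention: dst->ne[0] == src0->ne[1], dst->ne[1] == src1->ne[1],
    // i.e. dst = src1 · src0ᵀ with shape [M, N]
    assert(dst->ne[0] == M && dst->ne[1] == N);
    ggml_free(ctx);
}
```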
a more complex skeleton in function ggml_qnn_mulmat_4d: offload 4d mulmat to the QNN backend, illustrating the same second technical approach of "how to utilize the Hexagon NPU maximally". it is likewise a concise implementation rather than a complex C++ encapsulation that hides many tech details (UT passed, but there are some unknown bugs with test-backend-ops)
the QNN NPU RPC feature, already implemented on 04/2024 (UT passed, but some unknown bugs remain to be fixed; the same bugs can be seen in all hard-forked ggml-qnn projects)
dynamic runtime parameter adjustment through ggml-qnn.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file). a minimal parsing sketch follows below
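a minimal sketch of the runtime-parameter idea, assuming a simple "key = value" line format in ggml-qnn.cfg; the exact keys in this PR may differ, and "log_level" below is only a made-up example key.

```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// parse ggml-qnn.cfg into a key/value map; '#' starts a comment line
static std::map<std::string, std::string> load_qnn_cfg(const std::string & path) {
    std::map<std::string, std::string> cfg;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream ss(line);
        std::string key, eq, value;
        if (ss >> key >> eq >> value && eq == "=") {
            cfg[key] = value; // e.g. cfg["log_level"] == "1" (hypothetical key)
        }
    }
    return cfg;
}
```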

offload quantized data types with 2d & 3d mulmat to the QNN backend.
provide the big picture of the ggml-hexagon backend in this PR for further or related dev activity in this great pure-tech community.
[03/19/2025] the technical approach of "mapping the entire ggml computational graph to a QNN graph" (already discovered on 04/2024) will appear in another standalone PR: a concise implementation of mapping the complete ggml cgraph to a single QNN graph (without complex encapsulation or hidden tech details), although this approach might not be the right solution for llama.cpp.
[03/22/2025] provide a very fast approach, exactly similar to Intel's ggml-sycl, Qualcomm's ggml-opencl, or Huawei's ggml-cann: offload ggml ops to the Hexagon cDSP directly.
the code is simple and everyone can understand it easily and quickly, without complex encapsulation or hidden tech details, because layered abstraction and loose coupling make code tracking and troubleshooting difficult.
special clarification in this section:
Performance of ggml-hexagon backend
the test phone is a Snapdragon 8 Gen3 Android phone and the test model is qwen1_5-1_8b-chat-q4_0.gguf. all fp32 and quantized-type mulmat ops are already offloaded to the QNN-NPU backend in this PR:
before fine-tuning:
default backend: 25.41 tokens / second
Hexagon NPU backend: prompt eval 3.10 tokens / second, eval performance 11.47 tokens / second
ggml-qnn-perforamnce-default-and-npu.mp4
after fine-tuning in my local dev env:
default backend: 43.47 tokens / second
Hexagon NPU backend: prompt eval 3.47 tokens / second, eval performance 22.03 tokens / second
ggml-qnn-performance-afterfinetune-default-and-npu.mp4
there is an unknown issue with mulmat on cDSP, so only GGML_OP_ADD is offloaded to cDSP at the moment. this PR will complete its final mission if it can still achieve the same effect after offloading mulmat to cDSP. I'm working on the mulmat kernel on cDSP (a reference sketch follows below).
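for reference, a hedged sketch of the kind of reference mulmat kernel that would run on the Hexagon cDSP: plain scalar C++, no HVX intrinsics, matching the ggml convention dst = src1 · src0ᵀ. the real in-progress kernel will differ.

```cpp
// naive fp32 mulmat reference: src0 is [M][K], src1 is [N][K], dst is [N][M],
// all row-major; dst[n][m] = dot(src0 row m, src1 row n)
static void naive_mulmat_f32(const float * src0, const float * src1, float * dst,
                             int M, int N, int K) {
    for (int n = 0; n < N; n++) {
        for (int m = 0; m < M; m++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                sum += src0[m * K + k] * src1[n * K + k]; // dot of two K-element rows
            }
            dst[n * M + m] = sum;
        }
    }
}
```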
ggmlqnn-hexagon-cdsp-only-add.mp4
Hexagon NPU performance with qwen1_5-1_8b-chat-q4_0.gguf on Snapdragon 8 Gen3
03/11/2025: prompt eval 3.10 tokens / second, eval performance 11.47 tokens / second
03/12/2025: prompt eval 3.47 tokens / second, eval performance 22.03 tokens / second
03/13/2025: prompt eval 4.34 tokens / second, eval performance 23.72 tokens / second
How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon based phone
Ubuntu 20.04 / 22.04 is validated and recommended as the host machine (other Linux distributions might also work). the dev activity in this PR can be done purely on the command line, without any IDE:
use build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically (please see the section below)
you will need an adb-connected Android smartphone running on one of the following Qualcomm SoCs:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can see that this backend works as expected from the log output of "adb logcat | grep ggml-qnn". for programmers, "adb logcat | grep ggml-hexagon" can help with troubleshooting.
How to build ggml-hexagon source code for a Snapdragon based WoA (Windows on ARM) device
the good news for the WoA port is:
Big picture of ggml-hexagon backend
there are three technical approaches to Hexagon NPU inference on Qualcomm's latest mobile and desktop SoCs:
key points about ggml-hexagon's performance in the general approach:
key points about ggml-hexagon's performance in the special approach:
How to compose an ideal single QNN graph from a ggml cgraph
as is well known, the real standout of the excellent llama.cpp is that it can compose an ideal complete/entire ggml cgraph from any supported LLM model through the exceptional ggml machine learning library.
as I explained in the section "Big picture of ggml-hexagon backend", there is a key point in the so-called second technical approach in the ggml-hexagon backend: compose an ideal single QNN graph from an entire/complete ggml cgraph.
in this section, I'll illustrate in detail how to compose an ideal single QNN graph from a ggml cgraph which contains multiple nodes and was generated by my self-made tool. this idea/mechanism can be extended to compose an ideal single QNN graph from the complete ggml cgraph of a specific LLM model. I personally think computer scientists or AI experts must be involved in the rest of this work, because there are probably many complex graph theories or algorithms in it. one more thing: my self-made graph algorithm seems too naive and low-performance, although the compute result of the single QNN graph is correct for 15+ cases; I think the original author of llama.cpp or other AI experts can help optimize the complex graph algorithm here. btw, the approach in this section is just my way of thinking, which I already explained in the section "PR Description": try to overcome all relevant technical issues/difficulties with a specific op, GGML_OP_ADD or GGML_OP_MUL_MAT.
the key point in this simple case is that we must convert a multi-node ggml cgraph to a QNN graph correctly, as in the quoted code (graph -> graph); a sketch of this conversion follows below. the principle of case 1 should be a general approach for composing a single QNN graph from a ggml cgraph which contains multiple nodes (I'm not a tech expert at Qualcomm, so I say "should be"; btw, I think this principle should also be suitable for Intel's ggml-sycl or Huawei's ggml-cann).
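a hedged sketch of the "graph -> graph" conversion in case 1. the ggml side is real API; qnn_wrap_tensor and qnn_add_node are hypothetical stand-ins for the QNN SDK calls in this PR. the key trick is the tensor map: an intermediate ggml tensor must be bound to the same QNN tensor wherever it appears, so the op nodes connect into one DAG instead of N isolated graphs.

```cpp
#include "ggml.h"

#include <unordered_map>

// hypothetical stand-ins for the QNN SDK calls used in this PR
void * qnn_wrap_tensor(void * qnn_graph, const struct ggml_tensor * t);
void   qnn_add_node(void * qnn_graph, enum ggml_op op, void * in0, void * in1, void * out);

static void compose_single_qnn_graph(void * qnn_graph, const struct ggml_cgraph * cgraph) {
    // map each ggml tensor to exactly one QNN tensor: reusing the same QNN
    // tensor for an intermediate result is what links the nodes together
    std::unordered_map<const struct ggml_tensor *, void *> tensor_map;

    auto wrap = [&](const struct ggml_tensor * t) -> void * {
        auto it = tensor_map.find(t);
        if (it != tensor_map.end()) {
            return it->second;
        }
        void * qt = qnn_wrap_tensor(qnn_graph, t);
        tensor_map[t] = qt;
        return qt;
    };

    // cgraph->nodes[] is already topologically sorted by ggml
    for (int i = 0; i < cgraph->n_nodes; i++) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        void * in0 = node->src[0] ? wrap(node->src[0]) : NULL;
        void * in1 = node->src[1] ? wrap(node->src[1]) : NULL;
        qnn_add_node(qnn_graph, node->op, in0, in1, wrap(node));
    }
    // after the loop: finalize the single QNN graph once, then execute it
}
```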
case 2: compose an ideal single QNN graph from a ggml cgraph which contains multiple GGML_OP_ADD, GGML_OP_MUL and GGML_OP_MUL_MAT nodes
run this ggml cgraph with the default ggml backend for the purpose of cross-validation (a minimal sketch follows below)
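a minimal sketch of this cross-validation step, using only real ggml API: build the same kind of multi-node cgraph (ADD -> MUL -> MUL_MAT), compute it with the default CPU path, then compare the output buffer against the QNN backend's result. the tensor sizes and tolerance here are made-up example values.

```cpp
#include "ggml.h"

#include <cmath>

// qnn_result: output buffer previously produced by the QNN backend for the
// same cgraph and the same input data
static bool cross_validate(const float * qnn_result, int n_elements) {
    struct ggml_init_params params = { /*.mem_size =*/ 64*1024*1024,
                                       /*.mem_buffer =*/ NULL,
                                       /*.no_alloc =*/ false };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    // (fill a/b with the same test data used on the QNN side before computing)

    // same node combination as the test case: ADD -> MUL -> MUL_MAT
    struct ggml_tensor * t = ggml_mul_mat(ctx, ggml_mul(ctx, ggml_add(ctx, a, b), a), b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, t);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4); // default CPU path

    bool ok = true;
    const float * cpu_result = (const float *) t->data;
    for (int i = 0; i < n_elements && ok; i++) {
        ok = std::fabs(cpu_result[i] - qnn_result[i]) < 1e-4f; // fp32 tolerance
    }
    ggml_free(ctx);
    return ok;
}
```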

the above test case is still not a very complex case (compared to a real LLM model), so the current graph algorithm can cover it, and therefore the calculation results of the QNN-CPU/QNN-NPU backends are correct.
all in all, now we know how to map a multi-node ggml cgraph to a single QNN graph (I personally think the original authors of llama.cpp or other AI experts will understand the core principle of this section very quickly). there are various scenarios/combinations (graph DAGs) in the general case (for example, mapping an entire ggml cgraph to a single QNN graph), and the compute result depends on the graph algorithm: the QNN backend's calculation result will be wrong if the graph algorithm cannot cover the specified graph DAG. this is why I personally think computer scientists or AI experts must be involved in this step to produce a high-performance graph algorithm.
Acknowledgement
Conclusion
after spending so much effort on the ggml-hexagon backend, I personally think: